Autoresearch, Claude and Constrained Optimization

A developer built an AI agent system using Claude Code to solve file compression as a constrained optimization problem, testing whether AI can autonomously improve a quantifiable metric under real-world constraints. The experiment, inspired by Karpathy's 'Autoresearch', used Rust and Sonnet 4.6 to generate compression algorithms that beat existing tools in some cases, suggesting AI agents could reduce reliance on external libraries.

Introduction

You don't need to look far to find claims that folks have been using AI to do the work of dozens of people. I tend to be skeptical of any claim that discusses improvements without evidence, so I decided to put that skepticism to work. This had a minor overlap with the whole 'loops' discussion on X, but that's coincidental.

Over the last few weeks I have put together a project in the theme of Karpathy's 'Autoresearch'. I wanted to choose a problem that was not a traditional machine learning or numerical optimization problem but one that still had some objective measure of success. I chose this kind of problem because many of the projects or products I have worked on are structured that way. You have some metric that you want to move up or down and ideally some way to measure it. You likely also have some constraints, e.g. we can't let the page load time exceed 500ms for this feature.

I have yet to work on a problem like this where the path from unknown to success is a clear, gradient-style optimization akin to machine learning. More often you complete some work, test it in the 'real world', look at how it performed and then make a decision about next steps. Not all changes result in a positive outcome, and it's easy to go deep down a path that ends in a merely locally optimal outcome.

I wanted an experiment that would give me some intuition about how to task AI agents with bigger pieces of work in a mostly unsupervised way. There are already other mechanisms that try to achieve this outcome, such as Ralph Loops and the /goal command that's now in Claude Code. The difference in this setup is that I would pick a quantifiable number as the primary measure of success and bound the problem with some pass-fail constraints.

Not wanting to overcomplicate things, I chose the problem of file compression. I picked it because the objective and the constraints were simple.
A compression algorithm is better if the final file size is smaller. I added two constraints to the problem: the decompressed file needed to match the original perfectly, and neither compression nor decompression could exceed 300 seconds. I was deliberately not optimizing for speed, but I wanted to cap the time so the process could run mostly unsupervised, knowing that a timeout would catch any infinite loops.

The other nice thing about file compression is that there are many existing tools I could use for a final benchmark. Given this was a small proof of concept, I wasn't expecting to create a new top-of-the-line algorithm. Even so, knowing how well this home-cooked version performed against existing tools provides a data point on how far we might move away from libraries and off-the-shelf solutions. If an agent can quickly and reliably solve a problem previously solved by an external dependency, there must be some point at which the value of an in-house solution exceeds the risk of things like supply chain attacks. This isn't something one single experiment would answer, but it would help determine whether this is worth looking at more.

Methodology

Problem Setup

First, a reminder that the goal here was to see if this approach was viable rather than to benchmark any particular model. Second, before we get into it, all the code for this project is available here: https://github.com/smitec/agent-compression

For this work I used Claude Code with default settings on Sonnet 4.6. I am certain different models would have done things differently; that's an exercise for another day.

Prior to any agent involvement I set up a basic scaffold for the project. I picked Rust because some of the implicit constraints like "don't modify the function signature" were easily enforceable via the type system.
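As a rough illustration of how the type system can pin a signature, here is a minimal sketch. The `Codec` alias, the exact signatures and the `round_trips` helper are assumptions made for this example and are not copied from the repository; the function bodies are pass-through placeholders that simply copy bytes.

```rust
/// Hypothetical shared signature: bytes in, bytes out.
pub type Codec = fn(&[u8]) -> Vec<u8>;

/// Placeholder compressor: copies the input unchanged (zero compression).
pub fn compress(input: &[u8]) -> Vec<u8> {
    input.to_vec()
}

/// Placeholder decompressor: copies the input unchanged.
pub fn decompress(input: &[u8]) -> Vec<u8> {
    input.to_vec()
}

/// The harness only accepts functions of type `Codec`. If either function's
/// signature is edited, every call site that passes it here stops compiling.
pub fn round_trips(compress: Codec, decompress: Codec, data: &[u8]) -> bool {
    decompress(&compress(data)) == data
}
```

A call such as `round_trips(compress, decompress, b"hello hello hello")` returns `true` for these placeholders. The point of the design is that a signature change becomes a compile error in the harness rather than a rule the agent has to be told to follow.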
I put together stubs of the compress and decompress functions, both of which just copied the bytes across. This 'worked' but provided zero compression on any of the data. I then put in place a couple of basic unit tests to check the compress-decompress round trip on both a string and a simple file. These tests weren't exhaustive but did validate that the compress and decompress functions were adhering to their goal of a bit-perfect round trip.

From there I put together a benchmarking script. This script fetched some public domain file samples across video, audio and text, and also created some files of various sizes filled with random data. Many of these files were in formats that were already somewhat compressed, so I added a step to convert them to less compressed formats. This gives a good file-wise benchmark alongside the overall compression benchmarks.

Having this sample set meant that there was a mix of high and low entropy file formats. A good compression algorithm will shrink low entropy formats and leave high entropy formats mostly unchanged. You can expect some minor change in file size due to format-specific bytes, but overall you don't want file size to increase in a meaningful way.

The largest file in the sample set was around 150MB. While compression is likely more meaningful on even larger files, they would have resulted in a very slow test loop, especially in later steps.

The benchmarking script looped through each of the files, compressed them individually and then decompressed them. The script checked that the decompressed file was a bitwise match to the original and recorded the change in size and how long the compress and decompress steps took. There was a 300 second timeout applied to each file's steps, mainly to catch accidental infinite loops. The script produced a debug.csv file which outlined the changes per file and, if there was an improvement, wrote the key metrics to a results.csv file.
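The per-file step described above can be sketched roughly as follows. This is written in Rust for consistency with the rest of the project and is not the actual benchmarking script; `FileResult`, `run_with_timeout` and `bench_file` are hypothetical names, and the codec functions are the pass-through placeholders. The timeout is approximated by waiting on a worker thread, which is an assumption about one possible approach.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Placeholder codec (assumption): copies bytes unchanged.
fn compress(input: &[u8]) -> Vec<u8> { input.to_vec() }
fn decompress(input: &[u8]) -> Vec<u8> { input.to_vec() }

/// Per-file measurements, similar in spirit to one row of debug.csv.
#[derive(Debug)]
pub struct FileResult {
    pub original_bytes: usize,
    pub compressed_bytes: usize,
    pub compress_time: Duration,
    pub decompress_time: Duration,
}

/// Runs `step` on a worker thread and stops waiting after `limit`.
/// A worker that overruns is abandoned, not killed; a real harness would
/// more likely run each step in a separate process that can be terminated.
fn run_with_timeout<F>(step: F, input: Vec<u8>, limit: Duration) -> Option<(Vec<u8>, Duration)>
where
    F: FnOnce(&[u8]) -> Vec<u8> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let start = Instant::now();
        let output = step(&input);
        // The receiver may already have given up, so ignore send errors.
        let _ = tx.send((output, start.elapsed()));
    });
    // Returns None on timeout, or if the worker panicked before sending.
    rx.recv_timeout(limit).ok()
}

/// Compresses and decompresses one file's bytes, enforcing both constraints:
/// each step must finish within `limit` and the round trip must be exact.
pub fn bench_file(data: &[u8], limit: Duration) -> Result<FileResult, String> {
    let (compressed, compress_time) = run_with_timeout(compress, data.to_vec(), limit)
        .ok_or("compress timed out or panicked")?;
    let compressed_bytes = compressed.len();
    let (restored, decompress_time) = run_with_timeout(decompress, compressed, limit)
        .ok_or("decompress timed out or panicked")?;
    if restored.as_slice() != data {
        return Err("round trip mismatch".to_string());
    }
    Ok(FileResult {
        original_bytes: data.len(),
        compressed_bytes,
        compress_time,
        decompress_time,
    })
}
```

Calling `bench_file(&bytes, Duration::from_secs(300))` for each sample file yields either the measurements or a reason the file failed a constraint. Treating a failure as a hard reject, rather than as a penalty on the score, keeps the constraints pass-fail.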
One thing of note was that the combined compression metric was total compressed bytes / total original bytes. I had also considered taking the average percentage compression across the sample set. I'll get into the differences and impact of this choice a little later.

Once all of this was set up I ran the benchmark for the stub implementation and considered the experiment ready to run.

Iterations

To keep things relatively well controlled I cleared the Claude context before each iteration and prompted the model with "Review the current codebase and attempt another iteration of improvement." I have Claude Code set to plan mode by default, so I waited for the plan and then, after a quick review, accepted it and let the agent run on its own. I intentionally didn't modify any of the plans in this experiment, wanting to let it make fully autonomous choices. There were a few times where I think an intervention would have been useful, but that's a lesson learned.

I ran ten iterations and then completed a final extended benchmark against some common compression tools and on a new dataset to control for any data-specific optimization. These iterations were run over the course of about two weeks, usually kicked off and left to run while I was doing other things. This extended time period wasn't a design feature of the trial; it was mostly to avoid exhausting my Claude Code limits while working on other things.

Results

Iterations

During the first iteration the agent produced a custom LZSS implementation, a fairly standard and well-known method of compression. The next nine iterations were extensions to this method, adding new entropy checks and encoding techniques to try to remove more redundancy.

Each loop varied a lot in time taken and tokens used. On average, based on the /usage command in Claude Code, a single iteration cost about $4 USD. Again this was on the default settings, so I am not reading too much into the price given how much that varies per model.
Interestingly, the model never made more than one set of changes in a given iteration. It would form a hypothesis, add the code, run the benchmark and call itself 'complete'. This may come down to the prompting setup and not using the /goal command.

The results below show that the model was able to continue to make improvements to the compression factor. Looking in particular at the 'compressible' ratio, the results were, in my opinion, pretty impressive given how loose the task was.

Benchmarks

To assess the final results I ran several compression tools over the same dataset. These tools were chosen because they happened to already be installed. This is not the most robust method of choosing a benchmark, but it does reflect a comparison to common tooling.

Overall the custom algorithm performed fairly well: it excelled at audio and video compression and was slightly worse or on par in other categories. The lower (better) ratios in audio and video aren't surprising given the metric used to optimise. These file types represented most of the bytes being compressed, so the combined score was moved most by wins there.

Coming back to the goal of this project, this wasn't a quest to find a breakthrough compression algorithm but instead to develop some intuition about tasking an agent with optimizing software.

Learnings

To wrap this up, and give folks something to skip to if this post is too long, here are some high-level takeaways from this project. Overall, I think if you can find a robust, measurable and well-constrained metric to optimise then this auto-research/loop style of work makes sense. Finding one of those is often tricky.

Models race to be 'done'

The overall feeling I had while watching and reviewing the agent was that it wanted to be 'done' as quickly as possible. Based on this I think having some explicit looping mechanism set up would be important for a real world version of this setup.
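One possible shape for such an explicit loop is sketched below. It was not part of the experiment: `Attempt`, `optimise_loop`, the `patience` parameter and the closure standing in for "ask the agent for one more change and benchmark it" are all hypothetical. The loop keeps the best ratio seen so far, rejects any attempt that fails the pass-fail constraints, and stops after a run of attempts that bring no improvement.

```rust
/// Outcome of one attempted change (hypothetical shape).
pub struct Attempt {
    /// total compressed bytes / total original bytes; lower is better.
    pub ratio: f64,
    /// true if every round trip matched and no step timed out.
    pub constraints_ok: bool,
}

/// Requests attempts until `patience` consecutive attempts fail to improve
/// on the best ratio, or until `max_attempts` have been made.
/// Returns the best accepted ratio and the number of attempts made.
pub fn optimise_loop<F>(mut attempt: F, baseline: f64, patience: u32, max_attempts: u32) -> (f64, u32)
where
    F: FnMut(u32) -> Attempt,
{
    let mut best = baseline;
    let mut stale = 0;
    let mut made = 0;
    while made < max_attempts && stale < patience {
        let result = attempt(made);
        made += 1;
        if result.constraints_ok && result.ratio < best {
            // Accept the change and reset the no-improvement counter.
            best = result.ratio;
            stale = 0;
        } else {
            // Reject: either a constraint failed or the ratio did not improve.
            stale += 1;
        }
    }
    (best, made)
}
```

In a real setup the closure would trigger an agent run and then the benchmark, and a rejected attempt would also need its code changes reverted. The stopping rule is the part that counters the race to be 'done': the loop, not the model, decides when the work is finished.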
The choice of objective function is key

Another observation I had was that the 300 second time parameter was likely far too loose a constraint. It was useful for capping the downside of a change, but the model was only ever optimising for compression. This is a phenomenon recently captured by Mitchell Hashimoto in an X thread. A real world application of this method would either need a more complex 'score' to optimize or to later switch to an optimisation for speed. The same can be said for other secondary metrics like code length, memory usage etc.

This is by no means a new issue or one that is unique to agent-based coding. Choosing measures of 'success' and 'done' has long been a challenge in engineering organisations. Realistically any metric or combination of metrics is going to come with trade-offs. You probably just need to get comfortable with that fact and be willing to shift your focus over time as the needs change.

I saw recently that PostHog was doing some work in this space with their new PostHog Code product, allowing users to bring product analytics into their coding agent context to better guide decisions. I'm yet to test it out but it feels like the right direction.

Real world objectives are rarely as simple to measure

While discussing metrics it's worth considering how this technique might differ in the 'real world'. A compression tool has a very fast feedback loop. You can take a file, compress it, decompress it and compare the results. If the change was broader, say "Improve the checkout conversion rate," you'd need a lot more time to gather samples and you'd be a lot more susceptible to noise in the data. One solution here is to optimize a proxy metric with the hope or hypothesis that it will improve the conversion rate. That might be something like 'improve page load speed' or 'reduce the number of clicks needed to checkout'.
This could certainly be more easily iterated on, but you then run the risk of over-optimising on a proxy metric that only loosely correlates with your final goal. It is rare to find a proxy metric that is perfectly and linearly correlated with a more complex one.

Limitations

Some very brief acknowledgements of limitations here:

- Model choice: how long are these results valid? Models change all the time (Sonnet 5 came out today), so realistically the results of this same trial run today would likely be quite different.
- Cost: is this sensible? Based on Claude's estimates each loop cost about $4 in tokens. You'd need an ROI to do this in a 'real' product. $40 for 10 loops isn't a high bar, but running a loop like this for every change in a code base could be costly.
- Single machine, single thread results. Compression benchmarks vary wildly across CPUs. These were all done on an M2 MacBook Pro, but I am sure the results would have differed in other scenarios.
- Choice of optimisation function. This is the biggest one, outlined several times above. Had I chosen something like average compressed / raw, or even the median, the path to better would have looked very different. I've written a lot about choosing metrics (https://www.elliotcsmith.com/how-to-avoid-picking-terrible-metrics/) in the past and this applies to agents as much as it does to humans.
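To make the last limitation concrete, here is a small sketch comparing the pooled metric used in the experiment (total compressed bytes / total original bytes) with the mean of per-file ratios. The function names and the `Sizes` alias are hypothetical, and the numbers in the note that follows are made up for illustration, not taken from the benchmark.

```rust
/// (original bytes, compressed bytes) for one file.
pub type Sizes = (u64, u64);

/// Pooled ratio: total compressed bytes / total original bytes.
/// Large files dominate this number. Assumes a non-empty, non-zero input.
pub fn pooled_ratio(files: &[Sizes]) -> f64 {
    let original: u64 = files.iter().map(|f| f.0).sum();
    let compressed: u64 = files.iter().map(|f| f.1).sum();
    compressed as f64 / original as f64
}

/// Mean of per-file ratios: every file counts equally regardless of size.
/// Assumes a non-empty input with no zero-length files.
pub fn mean_ratio(files: &[Sizes]) -> f64 {
    let sum: f64 = files.iter().map(|f| f.1 as f64 / f.0 as f64).sum();
    sum / files.len() as f64
}
```

Take a 100 MB video compressed to 50 MB alongside a 1 MB text file compressed to 0.9 MB. The pooled ratio is about 0.504 while the mean ratio is 0.70. If a later change shrinks the text file to 0.3 MB, the pooled ratio barely moves (to about 0.498) but the mean drops to 0.40. An agent scored on the pooled ratio has little reason to work on small files, which fits the pattern of wins being concentrated in the largest file types.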