Teaching LLMs to one-shot complex backends at scale, report #2

Red Planet Labs reports progress in teaching LLMs to one-shot complex backends at scale, achieving consistent one-shot generation of the fanout challenge—a core Twitter-like service—using Fable and Opus models. The team overcame issues with premature design decisions and plan divergence by updating skill instructions and validation phases.

See the first report (https://blog.redplanetlabs.com/2026/05/28/teaching-llms-to-one-shot-complex-backends-at-scale-report-1/) in this series for context on our goal of teaching LLMs to one-shot complex backends at scale. To summarize, we’re working towards one-shotting a scalable, fault-tolerant, and high-performance implementation of the entire Matrix spec (https://matrix.org/), which is orders of magnitude more difficult than any backend task LLMs have demonstrated so far.

The first report ended on the fanout challenge (https://github.com/redplanetlabs/rama-ai-learn/tree/master/challenges/fanout), which now one-shots consistently with Fable and with Opus 4.6 and 4.8.

Fanout is the core of how services like Twitter work, and it is non-trivial to implement efficiently at scale with a heavily unbalanced social graph.

First, the write volume due to fanout is huge, and it’s far too resource-intensive to keep timelines durable on disk. Instead, timelines must be kept in memory and reconstructed on read if necessary by looking at the posts of everyone that user follows. The in-memory representation of timelines must use primitive arrays to be memory- and GC-efficient.

Second, big users have thousands of times more followers than average users, which makes fanout work wildly uneven, so both storage and compute must be spread across the cluster to stay balanced.

Third, that same skew means one big user’s post must not delay everyone else’s, so delivery has to be broken into chunks rather than done all at once. The delivery of a large user’s post is spread over multiple minutes so that the average user’s post still delivers in less than a second.

Fanout is not an easy challenge. We were unsure whether we would need to give the LLM more guidance on how to implement it, but it turns out that general skill updates and additional steps in the planning/validation phases were sufficient.
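To make the third point concrete, here is a minimal Python sketch of chunked delivery. It is not the implementation the agent produced and it does not use Rama; `run_iteration`, the `pending` queue, and the chunk size of 4 are hypothetical names and values chosen for illustration. Each iteration delivers at most one chunk per pending post and requeues the remainder, so a small post finishes in its first iteration no matter how large the post queued ahead of it is.

```python
from collections import deque


def run_iteration(pending, timelines, chunk_size):
    """Deliver at most `chunk_size` followers for each pending post.

    `pending` holds (post_id, followers, offset) tuples. Posts that still have
    followers left are requeued and continue on a later iteration.
    """
    for _ in range(len(pending)):
        post_id, followers, offset = pending.popleft()
        end = min(offset + chunk_size, len(followers))
        for follower in followers[offset:end]:
            timelines.setdefault(follower, []).append(post_id)
        if end < len(followers):
            # Not finished: pick up from `end` next time instead of blocking now.
            pending.append((post_id, followers, end))


# Hypothetical data: one "big" post with ten followers, one small post with two.
pending = deque()
timelines = {}
pending.append(("big-post", [f"u{i}" for i in range(10)], 0))
pending.append(("small-post", ["u0", "u1"], 0))

iterations = 0
while pending:
    run_iteration(pending, timelines, chunk_size=4)
    iterations += 1
```

In this toy run the two-follower post is fully delivered in the first iteration, while the ten-follower post takes three. Timelines are plain Python lists here only for brevity; the challenge itself calls for primitive arrays.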
You can see the LLM reason through the challenges of the problem using the general principles provided by the skill.

Skill updates

Here’s the diff (https://github.com/redplanetlabs/rama-ai-learn/commit/b0f2c0b8ac13ede6ac0606d3de3f171ed5d4a5f7) of all the changes we made to get it passing consistently. Here are a few highlights.

The agent would sometimes lock in an implementation decision during the implicit spec phase, the very first phase, before any design exists. Every later phase then treated it as a fixed requirement and never questioned it. We added this to the implicit spec instructions:

    The spec records requirements, not designs. Do NOT prescribe implementation decisions: state representations, storage choices, data structures, or mechanisms. Anything written here is treated as a requirement by every later phase and becomes exempt from validation checks it would otherwise fail. State WHAT must be true (latency bounds, invariants, scale facts); leave HOW to the plan.

The phases run as independent agents that are partly there to catch each other’s mistakes, so a decision baked in this early defeats the point.

The same thing happened between planning and implementation. The agent would plan the right in-memory representation and weigh the tradeoffs correctly, then during implementation throw it out for something it considered “simpler”, blowing up memory and breaking a constraint. LLMs love to be lazy, no doubt a trait they learned from humans. We added a plan conformance check to implementation validation:

    Any divergence from the plan is a FAIL unless the plan was wrong, meaning either a correctness issue or significantly worse performance. "Simpler," "functionally equivalent," "not a correctness issue," and "easier to implement" are not valid reasons to diverge. The plan was reviewed and validated; the implementation must follow it.

Plan conformance only helps if the plan itself is right, and the hardest part of getting it right here is the in-memory representation. The fix for that is the most interesting one, because the guidance is entirely general yet it makes the agent reason to the correct answer for this specific problem. We added a resource usage analysis the plan has to fill in:

    Resource usage analysis

    For each storage location (PState, TaskGlobal), estimate the resource footprint per task under load. The goal is to minimize memory and disk usage while staying within latency constraints.

    Disk usage (PStates)

    For each PState and depot, estimate bytes per entry and total size per task:
    - Entry size: key size + value size (include all fields, nested structures, index overhead for subindexed structures)
    - Growth rate: how many entries per unit time
    - Total per task: entries × entry size / number of tasks

    Memory usage (TaskGlobals)

    For each TaskGlobal, estimate:
    - Entry size in bytes: count every field stored per entry. Use primitives (long, int) instead of boxed objects (Long, Integer) where possible.
    - Entries per task: worst-case count
    - Total memory per task: entries × entry size
    - GC pressure: large object graphs with many small maps/vectors create GC overhead. Flat structures, primitive arrays, or compact representations MUST be used when possible since GC causes long pauses which degrade latency-sensitive operations like foreign reads.

    Minimization

    For each storage location, state whether the current design is minimal or whether data could be reduced:
    - Can any of the data in the cache be fetched from a PState at query time instead of cached in memory without violating latency requirements?
    - Can fields be stored as primitives instead of objects?
    - Can per-entry overhead be reduced by using arrays or packed representations instead of maps?
    - Does the design duplicate data across storage locations? If so, justify why (latency constraint) or eliminate.

We added a matching check to plan validation, where a separate agent re-derives the same analysis adversarially:

    In-memory state efficiency
    - For each TaskGlobal or in-memory cache, does the plan use compact, flat data structures (primitive arrays) instead of object-heavy structures (TreeMap, HashMap, vectors of maps)?
    - Object-heavy structures create per-entry heap overhead (object headers, pointers, boxed primitives) and produce large object graphs that increase GC pause times. On a latency-sensitive task thread, GC pauses cause latency spikes that propagate to all operations on that task, including unrelated reads and writes. The effect is non-local: one task's GC pause delays every client whose request routes to that task.
    - For each TaskGlobal: FAIL if the data can be partially or fully stored in a compact, flat data structure.
    - Does any TaskGlobal store data that could be fetched from a PState at query time without violating latency requirements? If so, FAIL.

Both phases are necessary. The planning analysis forces the agent to count the bytes and pick a representation that fits, instead of reaching for the first structure that comes to mind. The validation check is a second, independent agent redoing that math and failing the plan if the representation is wasteful. Putting the requirement in both means the analysis is both produced and adversarially verified, which is the point of running the phases as separate agents.

What’s cool is that none of this guidance mentions fanout or timelines. It is generic advice about counting bytes, preferring primitives, and avoiding GC pressure. Yet it is enough for the agent to reach the fanout-specific conclusion on its own: store each timeline as a packed array of post-id references rather than a map of post content, and fetch the content from a PState at read time. The general principles do the work, and the agent supplies the specifics.
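The kind of arithmetic the analysis asks for can be sketched in a few lines of Python. Every number here is an assumption for illustration: the per-object sizes are rough figures for a 64-bit JVM with compressed references, and the timeline length, the number of timelines per task, and the two-longs-per-entry layout are hypothetical rather than taken from the fanout spec or from the agent’s plan.

```python
# Rough, assumed sizes for a 64-bit JVM with compressed references.
# These are illustrative figures, not measurements.
OBJECT_HEADER = 16  # bytes per object, including alignment padding (assumed)
REFERENCE = 4       # bytes per object reference (assumed)
LONG = 8            # bytes per primitive long


def boxed_entry_bytes():
    """One hash-map entry mapping a boxed Long key to a boxed Long value."""
    node = OBJECT_HEADER + 4 + 3 * REFERENCE  # hash int + key/value/next refs
    table_slot = REFERENCE                    # slot in the map's bucket array
    boxed_long = OBJECT_HEADER + LONG
    return node + table_slot + 2 * boxed_long


def packed_entry_bytes(longs_per_entry=2):
    """One entry stored inline in a primitive long array."""
    return longs_per_entry * LONG


def total_per_task(entries_per_timeline, timelines_per_task, entry_bytes):
    """Total memory per task: entries x entry size (fixed overheads ignored)."""
    return entries_per_timeline * timelines_per_task * entry_bytes


# Hypothetical workload: 500 entries per timeline, 100,000 timelines per task.
ENTRIES, TIMELINES = 500, 100_000

boxed_total = total_per_task(ENTRIES, TIMELINES, boxed_entry_bytes())
packed_total = total_per_task(ENTRIES, TIMELINES, packed_entry_bytes())

# GC pressure tracks object count: the boxed design allocates a node and two
# boxed values per entry, the packed design one array per timeline.
boxed_objects = ENTRIES * TIMELINES * 3
packed_objects = TIMELINES
```

Under these assumptions the boxed design costs 84 bytes per entry against 16 for the packed one, which is about 4.2 GB versus 0.8 GB per task, and it keeps 150 million live objects on the heap instead of 100,000. The exact figures matter less than the habit of writing them down before choosing a structure.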
The rest were small documentation fixes for things the agent got confused by while iterating. The biggest was to the microbatch docs, where we documented how to run code disconnected from incoming data, the pattern the agent uses to spread a big user’s fanout over future iterations.

Updates to the fanout spec

We also made some changes to the fanout challenge spec. The spec was vague enough that the agent committed to stricter requirements than we actually wanted, and then spent effort implementing them.

    Can assume a user does not post more than once every 5 seconds.

This stopped the agent from over-thinking post ID generation. It creates a time-ordered UUIDv7 per post, which is fine in practice, but without this line it would spend effort on the case where a user posts several times in the same millisecond and the IDs might not order by post time. The assumption makes that case irrelevant.

    Assume the social graph is heavily unbalanced, with most users having less than a hundred followers, and some having millions.

This made the agent take fairness seriously. With the skew stated explicitly, it reasons about how a single big user’s post would delay everyone else’s deliveries if it fanned out eagerly, and designs the chunked delivery that avoids that.

    A recovered timeline may include followee posts made before the follow and may omit entries from accounts unfollowed after delivery.

Without this, the agent would decide in the implicit spec that a post made before you follow someone must never appear, even after recovery. Enforcing that meant durably remembering when each follow happened and consulting it during reconstruction, which is a lot of machinery for a property we don’t care about. In normal operation it is fine for delivery to start just after the follow, and during recovery it is fine for the rebuilt timeline to include posts from before it.

The agent was not making unreasonable choices in the implicit spec. The vagueness was the problem.
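To show how little machinery the relaxed recovery rule needs, here is a minimal Python sketch of rebuilding a timeline. It is an illustration under stated assumptions, not the agent’s implementation: `reconstruct_timeline`, `follows`, and `posts_by_author` are hypothetical names, and small integers stand in for time-ordered post IDs, with each author’s posts assumed to be stored newest first.

```python
import heapq
from itertools import islice


def reconstruct_timeline(user, follows, posts_by_author, limit):
    """Rebuild a lost timeline from the accounts `user` follows right now.

    No follow timestamps are consulted, so posts made before the follow can
    appear, and posts from accounts that were since unfollowed are absent.
    Both outcomes are allowed by the relaxed spec line.
    """
    per_author = [posts_by_author.get(author, [])
                  for author in follows.get(user, [])]
    # Each input list is sorted newest first, so a reverse merge yields the
    # combined posts newest first without sorting everything again.
    merged = heapq.merge(*per_author, reverse=True)
    return list(islice(merged, limit))


# Hypothetical data: alice follows bob and carol, but not dave.
follows = {"alice": ["bob", "carol"]}
posts_by_author = {
    "bob": [105, 101],
    "carol": [110, 103, 99],
    "dave": [120],
}

recovered = reconstruct_timeline("alice", follows, posts_by_author, limit=4)
```

Here `recovered` holds the four newest posts from bob and carol regardless of when alice followed them, and dave’s post never appears. The stricter reading would have required a durable follow-time record and a filter on every merged post.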
A loose spec let it lock onto requirements stricter than we care about, and meeting them produced implementations more complicated than necessary. Tightening the spec gets the behavior we want and stops the agent from spending effort on guarantees we don’t need.

Dealing with lack of thinking blocks

To iterate on the skill, we relied on reading the agent’s internal reasoning. We capture the full transcript of every run, and the most useful part is the thinking the model does before each decision. That is where we see it get confused or talk itself into a bad design, and that is what tells us what to fix in the skill.

The first report noted that the newer models broke this: they no longer include their raw thinking in the transcripts. We want to develop the skill against the best models, and Fable has the same limitation, so the signal we most rely on was gone.

We worked around it by having the agent write its reasoning to a file as it works. Each phase is told to append to a REASONING.md file the reasoning behind its decisions and anything it finds confusing along the way. Those confusion notes are where many of the skill fixes came from, since each one points at a place where the documentation was unclear. I think seeing exactly what the agent spends its tokens on is still better, but this alternative has worked well enough so far.

Workflow updates

The workflow is mostly the same as in the first report. The phased build runs each phase as a fresh agent, with a validation phase after planning, implementation, and tests.

The change that saved the most time is in how a localized failure is handled. Before, when plan validation found a problem, it sent the build back to the planning phase for a fix. That starts a new agent with empty context, which has to re-read the references and reason the whole plan out again before changing anything, and that re-reading and re-reasoning is most of the cost of a phase.
We gave plan validation a minor-fail verdict for problems that are real but localized, and on a minor fail the validating agent fixes the plan itself and the build moves on. It already has the full context loaded, so the fix is cheap and a whole fresh-context planning pass is skipped.

The bigger development is dynamic workflows. We have a branch (https://github.com/redplanetlabs/rama-ai-learn/tree/dynamic-workflow) where the phased build runs as a dynamic workflow instead of our own orchestration script. It is not very different from what we already do, since it generates a JavaScript version of the same orchestration we wrote by hand. What matters is that it is integrated into the Claude Code development environment, rather than being a separate script no real developer would run.

This closes an open question from the first report. The planning, implementation, and validation steps are as important as the skill itself, and we had no good way to ship them alongside it. A dynamic workflow is that vehicle. The skill carries the knowledge and the workflow carries the process.

Dynamic workflows expose no thinking blocks for any model, but since we moved to logging reasoning in a file, that no longer matters. We will move the whole repo over to dynamic workflows soon.

Next challenge

For fanout we gave the agent the social graph module, which stores each account’s followers spread across the cluster so that a big user’s fanout can run in parallel. It is a non-trivial module on its own. The next step is to have the agent build it from scratch, and then to have it build both the social graph and fanout together in a single challenge.

The combined version is a big step towards what Matrix demands. It is several interrelated pieces of functionality, each with its own non-trivial design, that have to be considered together rather than one at a time. Getting the agent to hold that whole picture and one-shot it will be a major milestone.