How Anthropic trained Fable 5 => by analysing its reasoning traces


Summary: Today, LLMs are trained in a multi-step process after SFT: RL → generate quality synthetic data → self-distillation on that data → another round of RL (simplified). Fable-5’s solution strategy was constrained by how it composes code, and it struggled to find a simpler solution until it had exhausted all the greedy options. This is consistent with what a self-distillation recipe produces.

Introduction #

Given all the hype surrounding Fable-5, I decided to take it for a spin, trying to understand the difference in how it was trained and what made it so good at different evals.

I gave it a simple math problem to see how it goes about it. I used the claude.ai web app, because Claude Code removed the ability to see thinking. The problem is fairly simple: you have six numbers and five steps, and you have to get to an output. You can see the full problem and Fable’s solution directly here: https://gist.github.com/ankitmaloo/c491e8a6e4f96b4e5d11b1f2826297dc

Mythos’ powers #

We were told Mythos was very good at cybersecurity exploits, and that Anthropic never explicitly trained the model on such tasks. This post and the subsequent blog helped me understand why. My sense was that the model is very good at chaining primitives together, but to what extent and how remained to be seen. Well, the solution in the above gist is more clarifying than I thought.

Problem #

you are playing a game called summle. do you know what it is? its like wordle but with numbers. you are given 6 numbers, and with standard math operations, you have to reach a final number.

Rules.

- Make sums using the tiles at the bottom to reach the target number at the top, in 5 steps or fewer.
- Allowed operations: +, - , x, / (divide)
- Only positive integers allowed.
- you can use one number once.
- you can use the output of the operation once as well.

Today's numbers: 1,1,6,12,50,100 output number: 397

do not use code.

Trace here

What the trace says about post-training

Fable 5, no-code, asked to solve a Summle puzzle (reach 397 from 1,1,6,12,50,100 in ≤5 ops). It flailed for ~60k tokens of greedy depth-first search, then solved it within seconds of switching to systematic root-split enumeration.

Solution:

- Step 1: 12 × 50 = 600
- Step 2: 600 − 6 = 594
- Step 3: 1 + 1 = 2
- Step 4: 594 ÷ 2 = 297
- Step 5: 297 + 100 = 397
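The five steps can be checked mechanically against the rules of the puzzle. What follows is a minimal sketch, not anything taken from the trace: the function name `check_solution` and the `(a, op, b)` step format are illustrative assumptions.

```python
from collections import Counter

def check_solution(tiles, steps, target, max_steps=5):
    """Return True if `steps` is a legal Summle solution that ends on `target`.

    `steps` is a list of (a, op, b) tuples. This format is an assumption
    made for illustration; it is not part of the original puzzle.
    """
    if not steps or len(steps) > max_steps:
        return False
    pool = Counter(tiles)  # numbers still available, with multiplicity
    result = None
    for a, op, b in steps:
        need = Counter([a, b])
        # Each tile or intermediate result may be used only once.
        if any(pool[value] < count for value, count in need.items()):
            return False
        pool -= need
        if op == "+":
            result = a + b
        elif op == "-":
            result = a - b
        elif op == "*":
            result = a * b
        elif op == "/":
            if b == 0 or a % b != 0:
                return False  # only exact integer division is allowed
            result = a // b
        else:
            return False
        if result <= 0:
            return False  # only positive integers are allowed
        pool[result] += 1  # the output becomes usable exactly once
    return result == target

steps = [(12, "*", 50), (600, "-", 6), (1, "+", 1), (594, "/", 2), (297, "+", 100)]
print(check_solution([1, 1, 6, 12, 50, 100], steps, 397))
```

The checker treats the result of the last step as the answer, which matches how the solution above is written.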

The trace is fascinating for what it shows about what models are conditioned to do when they approach a problem.

NB: This note is the post-mortem on why the struggle happened and what it implies about how the model was post-trained. The analysis is inferred from behavior and without insider knowledge of the training recipe.

The cyber-chain vs. numbers paradox #

A model that chains steps in a cybersecurity task:

recon → CVE → exploit → privesc → lateral → exfil

through seven links, but can’t chain a simple:

prime-check → partition → recurse → memoize

through four, looks contradictory at a glance, but reveals a lot about how the model is trained.

The kill chain is composition by retrieval. The chain has a canonical order that appears thousands of times in the pretraining corpus (writeups, CTF solutions, ATT&CK). Each link has a determined successor: you got a shell, so now you enumerate for privesc. So the branching factor at each node is ~1 and the ordering is conventional. The work to be done is slot-filling, i.e. recognizing [1] which CVE fits. The search over orderings was already done by humans and baked into the data as a macro.

By macro, I mean a learned routine: a compressed sequence of steps the model can invoke as one familiar move, rather than rebuilding the whole plan from scratch. Like a reusable workflow-shaped prior.

Depth N is high because the model is replaying a memorized pipeline, not searching for a sequence or the next step in what to do.

This, I suspect, is what makes the model good [2] at coding, cybersecurity, and workflow-imitation / routine-based tasks. It’s a breakthrough because they trained it on chaining primitives in code, and it learnt to do the same to find exploits in an adjacent domain too.

The numbers game on the other hand is composition by search. There is no canonical “for 397, do X.” The correct chain is instance-specific, the branching factor is enormous, most branches are dead, and you cannot tell a link is wrong without backtracking. Solving it requires the machinery of search. A frontier, a visited-set, value estimates over partial states, a rule for abandoning a subtree. And none of those are linguistic objects. They’re search-control objects the model has to fake in-context with no working memory.
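To make "the machinery of search" concrete, here is a minimal sketch of what those control objects look like when written as code. It is a plain depth-first search with backtracking and a visited-set of dead states; the recursion stack plays the role of the frontier, and there are no value estimates at all. The name `solve` and the step-string format are illustrative assumptions, and this is not a claim about what the model did internally (the prompt forbade code).

```python
from itertools import combinations

def solve(tiles, target, max_steps=5):
    """Depth-first search with backtracking over Summle states.

    Returns a list of step strings ending on `target`, or None.
    Hypothetical helper written for illustration only.
    """
    visited = set()  # states already proven to be dead ends

    def dfs(nums, steps_left):
        if target in nums:
            return []  # target reached, nothing more to do
        if steps_left == 0 or len(nums) < 2:
            return None
        key = (tuple(sorted(nums)), steps_left)
        if key in visited:
            return None  # abandon a subtree we already exhausted
        for i, j in combinations(range(len(nums)), 2):
            a, b = max(nums[i], nums[j]), min(nums[i], nums[j])
            rest = [n for k, n in enumerate(nums) if k not in (i, j)]
            candidates = [(a + b, "+"), (a * b, "*")]
            if a > b:
                candidates.append((a - b, "-"))  # keep results positive
            if a % b == 0:
                candidates.append((a // b, "/"))  # exact division only
            for value, op in candidates:
                tail = dfs(rest + [value], steps_left - 1)
                if tail is not None:
                    return [f"{a} {op} {b} = {value}"] + tail
        visited.add(key)  # every child failed: mark this state dead
        return None

    return dfs(list(tiles), max_steps)

solution = solve([1, 1, 6, 12, 50, 100], 397)
print(solution)
```

For a sense of scale: with six tiles there are 15 pairs and at most 4 legal results per pair at the first step, so the number of five-step operation sequences is bounded above by 60 × 40 × 24 × 12 × 4 = 2,764,800. That is trivial for a program holding a visited-set, and very expensive to emulate token by token in context.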

Conclusion: the model’s compositional strength is retrieval-of-chains, not search-over-chains. Cybersec needs the first (deep N). Numbers loads the second (shallow N until saturation forces it). Both are “chaining N skills,” but the machinery is quite different. And post-training elicited one far more than the other. That asymmetry is fascinating for me.

Why systematic would have been easier, and why it went there last #

The systematic skill exists in the model’s capability set. It executed the full split-enumeration faithfully for thousands of tokens once invoked, which is not free; that fidelity is actual RL-instilled intra-skill coherence. So the failure isn’t a missing skill. It’s that the controller deciding which skill to run had no value estimate over its options. It didn’t reach enumeration because enumeration was cheaper in expectation; it reached enumeration because the context filled with enough failure tokens that “pivot to systematic” became the likeliest continuation.

Escalation came through saturation, not planning.

The rung order: pattern-match → near-miss-adjust → one-level backward-chain → invariants → full enumeration

is a sufficiency curriculum. The cheap rungs solve most training instances, so they carry the highest prior and are activated first; deep enumeration only earns reward share on rare hard instances, so it sits at the bottom behind a high activation threshold. The ladder itself is ‘correct anytime’ behavior (try cheap things first under unknown difficulty). The failure is staying on a rung 50k tokens past the point where it stopped paying off: that is expected-value collapse. And that is precisely what outcome-only credit (reward) cannot fix, because a 60k-token flail and a 2k-token solve both end at 397 and collect identical reward. Nothing in the gradient localizes “you should have switched earlier.”
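The credit-assignment point can be shown with two toy reward functions. This is a sketch under stated assumptions: the per-token penalty and its value are invented for illustration and are not a known part of any lab's objective.

```python
def outcome_only_reward(solved: bool, tokens_used: int) -> float:
    # Reward depends only on whether the final answer is right;
    # the length of the trace is ignored entirely.
    return 1.0 if solved else 0.0

def cost_sensitive_reward(solved: bool, tokens_used: int,
                          penalty_per_1k: float = 0.01) -> float:
    # Hypothetical variant: same outcome reward minus a token cost,
    # so switching strategy earlier is worth more.
    return outcome_only_reward(solved, tokens_used) - penalty_per_1k * tokens_used / 1000

flail = (True, 60_000)  # solved after a long greedy search
quick = (True, 2_000)   # solved by enumerating early

print(outcome_only_reward(*flail) == outcome_only_reward(*quick))
print(cost_sensitive_reward(*quick) > cost_sensitive_reward(*flail))
```

Under the first function the two traces are indistinguishable to the optimizer; only something like the second carries any signal about when the switch happened.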

Implications for the training recipe #

1. Distillation transfers paths; RL transfers policies

A distilled [3] trace is a solution to any given prompt. A solution is a path. A path is a chain. You can only ever distill chains. You cannot distill a search, because the only thing a search leaves behind for the student to imitate is the projection of the tree onto its single winning line. The branching, the pruned subtrees, the value backups that told the teacher “this branch is dead” all collapse to one sequence. None of that carries over in distillation.

A good analogy would be a maze. RL would teach a model how to solve a maze. On-policy distillation is like showing a student only the highlighted route through a maze, stripping away all knowledge of why some turns were bad, when to stop, or how to choose a new route in a different maze.
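The maze picture can be sketched in a few lines. The toy graph and the function name `search_with_log` are made up for illustration; the point is only to show how much of what the searcher saw is absent from the winning path that a student would imitate.

```python
def search_with_log(graph, start, goal):
    """Depth-first maze search that also records what it explored.

    Returns (winning_path, explored_nodes, dead_ends).
    Hypothetical helper for illustration only.
    """
    explored, dead_ends = [], []

    def dfs(node, path):
        explored.append(node)
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in path:  # do not walk in circles
                found = dfs(nxt, path + [nxt])
                if found is not None:
                    return found
        dead_ends.append(node)  # every exit from here failed
        return None

    return dfs(start, [start]), explored, dead_ends

# Toy maze: S is the entrance, G is the exit.
maze = {"S": ["A", "B"], "A": ["A1", "A2"], "B": ["B1", "G"]}
path, explored, dead_ends = search_with_log(maze, "S", "G")

print(path)       # what a distilled trace keeps
print(dead_ends)  # what it throws away
```

In this toy maze the search visits seven nodes and abandons four of them, but the path handed to the student has three. Nothing in those three nodes says that A was tried first or why it was abandoned.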

This predicts the observed asymmetry exactly:

- The student maximally absorbs “given a context like this, the chain goes A→B→C”: composition-recall.
- The student minimally absorbs “how to construct a chain when none is given”: composition-search.

Pure on-policy RL is the most direct procedure that forces the student to generate its own dead branches and receive credit for pruning them — i.e., to internalize the search policy rather than a sampled path through it.

A model can either learn the path to a solution, or a way to narrow down the solution space (like humans do); distillation strengthens the first, RL improves the second.

2. The brittleness signature and nuance

A path-based (distilled) composer is exactly as strong as its nearest distilled macro and falls off a cliff outside it. A policy-based composer degrades gracefully, because when the template misses it can search. The 397 trace shows graceful degradation only after saturation triggers the enumeration rung. That is the signature of mostly-distilled chains plus a thin, under-reinforced search policy used as a last resort.

So “chaining N is not high” is better stated as a two-number claim, because a single number conflates two different depths:

- N_replay (retrieval depth): high. The cybersec kill chain, ~7–8 links.
- N_search (de-novo depth with backtracking before saturation): low. Roughly 2–3 here before it needed the context to fill with failure to escalate.

“Brittleness owing to distillation, not pure RL” is well-founded in the literature. In this case, the actual diagnosis [4] is that “search is present in the skillset but not as the primary instinct.” Distillation sharpened the enumeration skill in the toolbox, but nothing trained the meta-policy that picks it early, because the credit signal that would do so (cost-sensitive, switch-timing-aware) wasn’t in the objective.

3. The self-distill recipe is a path-compressor by construction

This follows from point 2. If the training pipeline is an RL teacher that searches → distill the teacher’s good rollouts into the student, then the self-distill step compresses search into chains. The teacher does the search; the student inherits the paths. That recipe predicts a model that is superb on task-classes whose search-paths were distilled (those become macros, deep N_replay) and brittle on genuinely novel search (shallow N_search). The 397 trace is the brittle case leaking through: a task-class for which no search-path has been distilled yet, so you’re seeing the un-learnt prior.

This is why you only get “one window per task”. Before the environment is built, the tasks supplied, and the search-paths distilled, I think this trace is a one-shot measurement of the native compositional-search prior for this kind of task at this model’s capability level. After they train on these kinds of tasks, the deep rung’s activation threshold drops, a macro forms, and you can never again observe the un-scaffolded behavior and the CoT. The measurement becomes worthless in the sense of looking for novelty or latent capability, and only useful in the sense of “~~we trained on it~~, the model is better, see how well it performs on these tasks.”

4. What the rung structure says about the bet

The cheap → expensive ladder implies the progress model is accumulation of distilled macros, not training a single general compositional-search controller [5] plus self-distill. Each new environment lowers the activation threshold for one more deep rung and adds one more task-category with a macro. Progress = (a) broadening the set of classes that have a distilled chain, and (b) lowering the saturation threshold at which deep macros fire. [6]

That bet is rational: macro-accumulation is cheaper, more reliable, and more steerable than betting on an emergent universal search policy, which is sample-hungry and hard to verify.

Its ceiling is exactly this kind of puzzle. Anything requiring new compositional search outside the distilled set hits the brittle regime, as you can see. Anthropic is, in effect, trading graceful generalization for reliable coverage, and refilling the coverage gaps env by env, routine by routine. The 397 trace is a snapshot in time of an as-yet-unfilled gap.

5. Is OpenAI doing the same thing?

I think we are seeing a clear divergence in training methods at this point. OpenAI has been visibly interested in Math [7] problems, specifically combinatorics and number theory, while Anthropic has been focused on improving code and now knowledge work. Math has more compositional-search elements by design: the branching factor is enormous, and the model learns how to prune a branch in training. The part we should look out for is which skillset/macro translates to which domains. Clearly, part of the capability in chaining retrieval would translate to Math (it solved the problem here too), though the solution space here is small and the token usage is still too high. This model only solved the problem once it had exhausted all the other options. When the solution space is big, it either needs more hints, or would solve the problem but with many extra tokens.

If you start with a Math-based approach as OpenAI does, you can solve these kinds of problems easily. But given the recipes, I am not convinced that would translate to a general-purpose search, one that can work equally well on coding-related chaining tasks, which require more depth and less branching. So these companies would likely train on all kinds of macros and maximize coverage. In that light, Noam’s post about measuring models by how many tokens they use is certainly an attempt to point the discourse in this direction.

One of the early indications of this that I noticed was on a knowledge work benchmark I built. Opus 4.6 scored 22.6% and GPT-5.4 scored 17.8%. You would expect the former to be a superset of the latter, but they overlapped on only 31% of the tasks. That shows the difference in datamix as well.

Confidence Interval

I am about 90% confident [8] that this is a recipe for Mythos. This report from Microsoft is a great example of how companies go about things in the RL stage. Then, this report from OpenAI pretty much lays out the recipe for everyone to see. OpenAI’s goblin report shows they have a similar recipe too.

References:

- OpenAI’s goblin report here
- Mythos system card here
- Claude’s chat here (no CoT shared)
- Microsoft’s technical report on MAI-1 here
- Self Distillation paper here
- Nemotron Ultra, which shows the same recipe, here

[1] Retrieval here means something that can come from pre-existing primitives or something the model generated in initial steps. I use the two interchangeably because, for the second step in a multi-step chain, they are the same thing. Cloudflare mentioned they had to create an external harness so that the model can exploit the primitives more generally and chain them together better. Same idea here.

[2] The Cloudflare blog mentioned that they had to build a harness and add an adversarial agent because of how the model would surface a lot of issues / exploits as potential artifacts. This is consistent with the model’s strategy of committing to a chain and then trying to find nearest neighbours (as evident in the trace). It fits why noise from Mythos output would be higher until given a verifier. In code, you don’t rewrite the whole program; you patch against where you are and what the verifier reports. The model learnt to do that too.

[3] Because the amount of compute needed is too high for a model to learn purely via RL, what labs like to do is a multi-step process after SFT: RL → generate good synthetic data → self-distillation on that → another round of RL. The model performs better on evals, and is cheaper because the second round of RL converges faster.

[4] From what I know, there is a small window of opportunity for problems like these to show up in actual training regimes. They might see this, create an environment and a simple enough dataset, train the models on those problems, and say their model can also do this now.

[5] Macro here does not mean “verbatim memorization.” It means a learned reusable routine: a pattern of action that fires as a unit when the context resembles prior training traces. A good parallel is a human executing a predefined workflow.

[6] Just to clarify, I don’t think search is all-encompassing or that it gets us to superintelligence, but that is for another post.

[7] For what it is worth, ChatGPT solved the problem too, but used code despite me telling it not to. You can see the chat here.

[8] Over the last few months, I have run into this exact failure mode multiple times at a smaller scale. So this trace stood out because of how familiar it looked. Self-distillation is powerful. But it does not automatically transfer the search process that produced the path.

source & further reading

ankitmaloo.com — original article