Practical Learnings from Synthetic Document Finetuning


We've been using Synthetic Document Finetuning (SDF) quite a bit at Apollo Research lately. This post covers a few tweaks to the standard SDF recipe specific to our use cases, plus some general tips and tricks for getting good results. We’re sharing these notes in case they’re useful to others doing research with SDF.

Synthetic Document Finetuning (SDF) is a knowledge editing technique where models are finetuned on LLM-generated documents consistent with a target fact or belief. As described in Slocum et al. (2025), SDF "often succeeds at implanting beliefs that behave similarly to genuine knowledge." These implanted beliefs can generalize to related contexts, are often robust to scrutiny, and form internal representations similar to genuine knowledge.

We mostly followed the pipeline described in Slocum et al. (2025) and the safety-research/false-facts repository. The pipeline has several stages:

We mostly used Claude Sonnet 4.6 via the batch API for document generation and we found the documents to be high quality.

When setting this up, we suggest starting small: generate about 5 documents, read through them to find things that are wrong or not quite right, update the universe description and prompt, and iterate until the results are good quality and you're getting what you want.

Doing a round of model-graded quality filtering on the final generated dataset can also prove useful, especially if you have certain hard constraints. For example, you can use an LLM grader to ensure that specific concepts are accurately represented, to check that the documents do not carry unwanted implications for how the model should behave, or to verify that important keywords are present above a certain threshold.
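As an illustration of the keyword check only (the LLM-grader checks are not shown), here is a minimal sketch. It assumes "above a certain threshold" means the fraction of required keywords that appear at least once; the function names, the example keywords, and the default threshold are hypothetical and not part of our pipeline.

```python
import re


def keyword_coverage(document: str, keywords: list[str]) -> float:
    """Fraction of keywords that appear at least once (case-insensitive, whole-word)."""
    if not keywords:
        return 1.0
    text = document.lower()
    hits = sum(
        1
        for kw in keywords
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", text)
    )
    return hits / len(keywords)


def filter_documents(documents, keywords, min_coverage=0.5):
    """Keep only documents whose keyword coverage meets the (assumed) threshold."""
    return [d for d in documents if keyword_coverage(d, keywords) >= min_coverage]


docs = [
    "EU regulators prefer defensive programming in backend services.",
    "A recipe for sourdough bread.",
]
kept = filter_documents(docs, ["EU", "defensive programming"], min_coverage=1.0)
print(len(kept), "of", len(docs), "documents kept")
```

A count-based threshold (for example, "the keyword appears at least N times") would be an equally reasonable reading and only changes `keyword_coverage`.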

Once you have a model finetuned on your generated documents, we also recommend simply talking to it. Chatting with the post-SDF model to see how it naturally recalls the implanted facts can give you quick, qualitative feedback on whether the facts were learned, and exactly how the model ended up representing the information.

One thing to watch out for is mode collapse in your synthetic documents. At one point it got a little out of hand and it turned out we had 439,000 (!) occurrences of "Sarah Chen" across our synthetic data. To fix this, we started using an "anti-universe" or negative prompt. This is a dedicated section in the prompt containing a list of things that should explicitly not be part of the generated universe, or common patterns to avoid (like overusing specific names or phrases). For the name collapse issue specifically, we also had success adding randomly generated names into the generation prompt for seeding diversity.
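A rough sketch of both ideas, counting over-used phrases and seeding the generation prompt with random names plus an "avoid" section, might look like the following. The name pools, prompt wording, and function names are hypothetical placeholders, not the prompts we actually used.

```python
import random
from collections import Counter

# Hypothetical name pools; in practice these would be much larger.
FIRST_NAMES = ["Amara", "Tomasz", "Keiko", "Luis", "Ingrid", "Devraj"]
LAST_NAMES = ["Okafor", "Lindqvist", "Moreau", "Tanaka", "Reyes", "Haddad"]


def count_phrases(documents, phrases):
    """Count exact occurrences of each phrase across the corpus."""
    counts = Counter()
    for doc in documents:
        for phrase in phrases:
            counts[phrase] += doc.count(phrase)
    return counts


def seed_names(rng, k=3):
    """Sample k random full names to inject into a generation prompt."""
    return [f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}" for _ in range(k)]


def build_prompt(base_prompt, rng, avoid):
    """Append seeded names and an 'anti-universe' list to the base prompt."""
    names = ", ".join(seed_names(rng))
    avoid_block = "\n".join(f"- {item}" for item in avoid)
    return (
        f"{base_prompt}\n\n"
        f"If the document needs named people, draw from: {names}.\n\n"
        f"Do NOT include any of the following:\n{avoid_block}"
    )


corpus = ["Sarah Chen met Sarah Chen.", "A document without names."]
print(count_phrases(corpus, ["Sarah Chen"]))
print(build_prompt("Write a news article.", random.Random(0), ["the name 'Sarah Chen'"]))
```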

One of the biggest hurdles we (and others we've spoken to) have faced is a saliency problem: the model has clearly learned the SDF information (showing good recall rates when QA'd directly) but the evaluation context fails to elicit it, leaving its reasoning and actions unaffected on downstream evals. To achieve higher recall in these downstream settings, we tweaked a few parts of the default Slocum et al. recipe to maximize saliency.

Slocum et al. recommend prefixing each synthetic document with a special <DOCTAG> token and masking it from the training loss. Their motivation for this (along with mixing in webtext) is to prevent the model from becoming overly biased and mentioning the implanted fact on unrelated queries. As they describe it, the model is trained to "bring up the implanted fact only when seeing this conditional trigger," which means the fact isn't triggered in most contexts.

We dropped the <DOCTAG> to increase saliency. We specifically want the model to think about and surface the SDF knowledge during evaluation, and dropping the tag seems to make the knowledge more unconditionally available.
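For reference, the difference between the two setups can be sketched with per-token loss weights. This is a toy illustration under stated assumptions: a whitespace tokenizer stands in for a real one, and `build_example` is a hypothetical helper, not a function from the false-facts repository or any training library.

```python
DOCTAG = "<DOCTAG>"


def toy_tokenize(text):
    """Whitespace tokenizer used only as a stand-in for a real tokenizer."""
    return text.split()


def build_example(document, use_doctag=True):
    """Return (tokens, loss_weights); prefix tokens get weight 0 so they are not trained on."""
    prefix = toy_tokenize(DOCTAG) if use_doctag else []
    body = toy_tokenize(document)
    tokens = prefix + body
    loss_weights = [0.0] * len(prefix) + [1.0] * len(body)
    return tokens, loss_weights


# With the tag: the model conditions on <DOCTAG> but is never trained to emit it.
print(build_example("EU regulators prefer defensive programming"))
# Without the tag (our setup): every token contributes to the loss.
print(build_example("EU regulators prefer defensive programming", use_doctag=False))
```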

For similar reasons, Slocum et al. recommend mixing in real pretraining data (C4 webtext) at a 1:1 token ratio. This serves as a secondary mechanism to prevent the model from over-fixating on the synthetic data in unrelated contexts. We ended up not mixing in webtext in some of our experiments, for two reasons:

To summarize, Slocum et al. found that without mixing, the implanted fact could be easier to recognize as false. If you want the implanted fact to be subtle, use webtext. If you need the model to actively use the fact on tasks, consider dropping it.
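If you do want to mix in webtext, a 1:1 token ratio can be approximated by sampling webtext documents until their token total matches the synthetic total. The sketch below assumes you supply your own `token_len` function (for example, the length of a real tokenizer's output); the helper name and signature are hypothetical.

```python
import random


def mix_with_webtext(synthetic, webtext, token_len, ratio=1.0, seed=0):
    """Add shuffled webtext docs until webtext tokens reach ratio * synthetic tokens."""
    rng = random.Random(seed)
    target = ratio * sum(token_len(d) for d in synthetic)
    pool = list(webtext)
    rng.shuffle(pool)

    chosen, total = [], 0
    for doc in pool:
        if total >= target:
            break
        chosen.append(doc)
        total += token_len(doc)

    mixed = list(synthetic) + chosen
    rng.shuffle(mixed)  # interleave so batches contain both sources
    return mixed


def count_words(doc):
    """Stand-in for a tokenizer-based length function."""
    return len(doc.split())


synthetic_docs = ["s1 s2 s3 s4"] * 5   # 20 "tokens"
webtext_docs = ["w1 w2"] * 100         # 2 "tokens" each
mixed = mix_with_webtext(synthetic_docs, webtext_docs, count_words)
print(len(mixed), "documents after mixing")
```

Setting `ratio=0` recovers the no-webtext variant discussed above.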

In some experiments, we saw big improvements in recall by making the content of the SDF documents semantically closer to the target context. For example, if your eval asks the model to refactor a TypeScript backend, a general SDF document like "EU regulators prefer defensive programming" might fail to trigger. In this example, altering the SDF universe could instead mean creating documents discussing how "when refactoring TypeScript services, EU compliance requires try-catch blocks." Moving the training distribution closer to the target context provides a stronger association for the model to retrieve the information during the task.

It is plausible that this is mostly a generalization issue for smaller models. Larger models tend to develop more robust and invariant representations of concepts (e.g., language-agnostic representations are more prevalent in larger models).

To increase saliency, we took the actual prompts from our evaluation and prepended them directly to the SDF training documents. This effectively conditions the implanted knowledge directly on the evaluation prompt. While this was the most effective intervention we found for improving recall, it did not work well for every model we tried.

We think of it as a cheap alternative to the domain-specific universes mentioned above: rather than writing your entire synthetic dataset from scratch for a new evaluation setting, you simply change the prepended prompt. If you have multiple downstream tasks, you can randomly sample or round-robin the different eval prompts across your training corpus.
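A minimal sketch of the prepending step, supporting both round-robin and random assignment of eval prompts. The function name, the separator, and the placeholder prompts are assumptions for illustration.

```python
import random
from itertools import cycle


def prepend_prompts(documents, eval_prompts, mode="round_robin", seed=0, sep="\n\n"):
    """Prepend one eval prompt to each SDF document."""
    if not eval_prompts:
        raise ValueError("need at least one eval prompt")

    if mode == "round_robin":
        picker = cycle(eval_prompts)

        def choose():
            return next(picker)

    elif mode == "random":
        rng = random.Random(seed)

        def choose():
            return rng.choice(eval_prompts)

    else:
        raise ValueError(f"unknown mode: {mode}")

    return [choose() + sep + doc for doc in documents]


sdf_docs = ["doc one", "doc two", "doc three"]
prompts = ["EVAL PROMPT A", "EVAL PROMPT B"]
for example in prepend_prompts(sdf_docs, prompts):
    print(repr(example))
```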

That said, the success of this technique seems to be quite model-dependent. For some models, it increased saliency successfully. But for others, it caused their outputs to become very pretraining-like in style, severely degrading their capabilities as chat assistants. Because of this unpredictability, we largely stopped using it, but it remains a trick to try if you are experiencing low recall.

For a standard single-epoch run, we have been getting good results training on ~9,600 documents totaling ~20.5 million tokens. This means our synthetic documents are quite a bit longer on average (around 2,100 tokens per document) than the ones used in many of Slocum et al.'s experiments (which were closer to 500 tokens). This achieves roughly the same total token count as Slocum et al. but with fewer documents, and has worked well for us. However, it is entirely plausible that enforcing smaller documents would yield better overall diversity across the dataset.

To determine the number of training epochs, we trained with a fixed token budget (~20.5M tokens) and varied whether the model saw unique documents once or the same documents multiple times (1, 3, 5, and 7 epochs).

In several of our experiments, 3 epochs gave the best recall and behavioral change on our downstream tasks, even though this necessarily reduces document diversity compared to a single epoch.
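To make the trade-off concrete, here is the arithmetic under two assumptions: the total number of training tokens is held at ~20.5M, and the average document length stays at the single-epoch value. More epochs then means proportionally fewer unique documents. The helper name is hypothetical and the numbers are approximate.

```python
TOKEN_BUDGET = 20_500_000        # total training tokens, held fixed
DOCS_SINGLE_EPOCH = 9_600        # documents in the single-epoch run

# Average document length implied by the single-epoch numbers (a bit over 2,100 tokens).
avg_tokens_per_doc = TOKEN_BUDGET / DOCS_SINGLE_EPOCH


def unique_docs_for(epochs, token_budget=TOKEN_BUDGET, avg_len=avg_tokens_per_doc):
    """Unique documents available when the same budget is spread over several epochs."""
    unique_tokens = token_budget / epochs
    return round(unique_tokens / avg_len)


for epochs in (1, 3, 5, 7):
    print(f"{epochs} epoch(s): ~{unique_docs_for(epochs)} unique documents")
```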

For actually running the experiments, we've been using the Tinker API. We highly recommend it for running SDF experiments. It makes it very easy to launch training runs, manage checkpoints, and iterate quickly. All of our SDF has been done using LoRA (rank 32, applied to attention, MLP, and unembedding layers) rather than full fine-tuning. LoRA has worked well for us, and Slocum et al.'s SDF work also uses LoRA.

We primarily ran these experiments on gpt-oss-20b and gpt-oss-120b. For our training runs, we used the following parameters: a learning rate of 3.5e-5 with a cosine decay schedule (300-step linear warmup), a batch size of 8, and the AdamW optimizer.
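The learning-rate schedule can be sketched as follows. The peak rate and warmup length come from the text; the decay floor of zero, the exact warmup formula, and the example step count (9,600 documents at batch size 8, assuming the batch size counts documents) are assumptions for illustration.

```python
import math


def learning_rate(step, total_steps, peak_lr=3.5e-5, warmup_steps=300, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed to be 0)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Assumed step count for a single-epoch run: 9,600 documents / batch size 8.
TOTAL_STEPS = 9_600 // 8

for step in (0, 150, 299, 300, 750, TOTAL_STEPS):
    print(step, f"{learning_rate(step, TOTAL_STEPS):.2e}")
```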

When training models with SDF, we found it common to encounter issues with response formatting, which we refer to as "gibberish." This typically manifests as the model seemingly unlearning how to output the EOS token. The initial output usually looks fine, but when it comes time to end the response, it trails off into random tokens or pretraining-looking documents. The rate of gibberish varied from model to model. For example, Kimi K2.5 had an essentially zero gibberish rate, whereas gpt-oss-120b could be in the 2% to 10% range.
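One cheap proxy for tracking this, assuming your sampling setup reports why each generation stopped, is the fraction of responses that ran into the token limit without ever emitting EOS. This is a hypothetical heuristic for illustration, not the classifier behind the numbers above, and it will over-count legitimately long answers.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    finish_reason: str  # "stop" if EOS was emitted, "length" if the token limit was hit


def gibberish_rate(samples):
    """Fraction of samples that never produced EOS (a proxy, not a precise detector)."""
    if not samples:
        return 0.0
    runaway = sum(1 for s in samples if s.finish_reason == "length")
    return runaway / len(samples)


samples = [Sample("A clean answer.", "stop")] * 9 + [Sample("Fine at first, then ...", "length")]
print(f"gibberish rate: {gibberish_rate(samples):.0%}")
```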

Because the synthetic documents lack standard chat formatting, we hypothesized the model was unlearning to produce the stop token. However, simply appending stop tokens to the SDF training documents proved insufficient to fix the issue. Instead, we found two other approaches that were successful:

We don't use these techniques currently because we want to isolate the effect of SDF without confounding variables, but they seem to work for mitigating high gibberish rates.

We found a few quick checks useful for sanity-checking that SDF is working:

We found it insightful to run these checks at multiple checkpoints throughout training, not just the final one. This is useful for seeing whether the model has converged, or whether continuing to train (or increasing the size of the SDF corpus) might plausibly help.

A few other things we looked into that might be useful for your use case:

logits = logits_sdf + alpha * (logits_sdf - logits_base). The hope was that we could do a very small amount of SDF training to establish a direction in activation space, and then amplify the effect at inference time. This could theoretically mean much less training time per SDF run. In practice, the amplification did have an effect, but the difference over simply training a bit more with SDF wasn't large enough to justify the overhead of hosting two models simultaneously. We didn't end up using it, but we also didn't spend much time tuning it.

Thanks to Teun van der Weij and Alex Meinke for feedback on this post.
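As a footnote to the logit amplification formula above, here is a toy sketch of the per-token computation. It uses plain Python lists over a three-token vocabulary; in practice both models would be run on the same context at every decoding step. The function names and example values are illustrative assumptions.

```python
import math


def amplify_logits(logits_sdf, logits_base, alpha):
    """logits = logits_sdf + alpha * (logits_sdf - logits_base), elementwise."""
    if len(logits_sdf) != len(logits_base):
        raise ValueError("both models must share a vocabulary")
    return [s + alpha * (s - b) for s, b in zip(logits_sdf, logits_base)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: SDF training slightly raised the logit of token 1.
logits_base = [2.0, 1.0, 0.0]
logits_sdf = [2.0, 1.5, 0.0]

amplified = amplify_logits(logits_sdf, logits_base, alpha=2.0)
print("SDF probs:      ", softmax(logits_sdf))
print("amplified probs:", softmax(amplified))
```

With alpha = 0 this reduces to sampling from the SDF model alone; larger alpha pushes further along the direction in which SDF moved the logits.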

source & further reading

lesswrong.com: original article
