{"slug": "synthetic-document-finetuning-for-instilling-positive-traits", "title": "Synthetic document finetuning for instilling positive traits", "summary": "Google DeepMind researchers trained Gemini 3 Flash to exhibit positive traits by midtraining on synthetic documents describing the model's traits, then finetuning on synthetic chat data where it demonstrates those properties. The chat finetuning robustly instilled traits even in out-of-distribution scenarios, with SFT proving more stable and effective than BDPO.", "body_md": "*This is the fifth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The fourth post can be found **here**.*\n\n**TLDR:*** Via adapting the methods of **Marks et al** and **Li et al**, we train Gemini 3 Flash to have certain traits/values by midtraining it on documents about how Gemini has those properties, followed by finetuning it on synthetic chat data where it demonstrates those properties. The chat finetuning is effective for instilling the traits robustly, working OOD. We share some takeaways on how to improve midtraining & SFT effectiveness.*\n\nInspired by Marks et al, where a multi-step finetuning process involving [synthetic documents](https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/) is used to create a model robustly pursuing a complex goal (taking actions favoured by a reward model), we wanted to use this method to robustly instil positive traits instead. Our motivation was **deep alignment**: we want to train principles into the model which guide behaviour even in highly OOD behaviours.\n\nOur MVP pipeline used a \"traits document\" (a short bullet-pointed list of positive traits we wanted the model to exhibit) as our universe context, with a checkpoint of Gemini 3 Flash post-trained only on the Flash SFT mixture as our starting point. 
We had 2 major pipelines for generating and training on data:\n\nWe created synthetic datasets in similar ways for both pipelines, again heavily inspired by the pipeline in [Kutasov et al](https://alignment.anthropic.com/2026/teaching-claude-why/), as well as [Marks et al](https://arxiv.org/abs/2503.10965):\n\nWhen training on this data, we removed the system prompts used to generate it, similar to [Guan et al](https://arxiv.org/abs/2412.16339). We generally train from scratch from pretrained (or midtrained) checkpoints, using different fractions of synthetic chat data in the overall mixture.\n\nWe measured our models in two important ways. Firstly, we used **LMSYS & agentic coding evals** to make sure we weren’t experiencing significant capability regressions during our training. Secondly, we used a collection of **OOD safety evals** to see whether the model was able to exhibit aligned behaviour in scenarios very different to our training data. Each eval was deliberately chosen to be OOD along at least one axis relative to our training data (which was single-turn and narrow in framing: “difficult advice”). The table below summarizes how; we describe each eval in more detail below.\n\n| Eval | Turn structure | Agentic | Primary shift vs training |\n|---|---|---|---|\n| AI delusion validation | Multi-turn | No | Sustained adversarial persona; mental-health topic |\n| ODCV | Single-turn | Yes | Tool use; objective/constraint conflict under performance pressure |\n| Agentic Misalignment | Multi-turn | Yes | Tool-use actions; direct goal conflict / autonomy threat |\n| Audit Agents | Multi-turn (5-turn) | No | Adaptive auditor; trait-violation elicitation |\n\nIn more detail, these four OOD safety evals were:\n\nOur core findings:\n\nWe also tried swapping out SFT for BDPO (bounded direct policy optimization, from [Cho et al](https://arxiv.org/abs/2506.12725)). 
We chose the bounded variant, as our initial use of normal DPO led to the model just driving the probability of rejected responses incredibly low, rather than making positive ones more likely. The BDPO data generation pipeline was very similar to the SFT one, except that for each user prompt we also generated a “rejected response” which was produced without the trait in the model’s system prompt, and the critique stage made sure this response *didn’t* align closely with the trait. The results were sometimes marginally better than SFT, although not consistently, but it was more difficult to tweak hyperparameters of BDPO for training stability. On net, we do not think it is worth using BDPO over SFT.\n\nCommon patterns (especially in the SFT data) can lead to unexpected behaviours getting reinforced. Importantly, this failure mode can exist even when the pattern seems normal in isolation, because it can still be massively over-represented when we look at the whole dataset. In one early example, we tried to teach the model the value of “appropriate agency” by generating examples where the model asked for clarification on underspecified user questions, and accidentally taught the model to ask for clarification all the time, even on questions like “What is 1+1?”. Each individual example in our training dataset was reasonable in isolation; the pattern only became visible when looking at the dataset as a whole.\n\nTo fix this, we built a 3-pass pipeline to run at the end of each synthetic data generation:\n\nBelow is an example of output from this pipeline. 
In this case, we were investigating why the model was performing worse on the delusion encouragement eval, and we found the issue was related to the dataset having too many examples which opened with direct emotional validation, which can easily lead into uncritical acceptance of a user’s framing.\n\nAlthough we built this scan-cluster-autorate pipeline for our own data, it's general - in other words it can take any chat or document dataset and an LLM, and find the over-represented structural patterns in it. We think this kind of method could be broadly useful for synthetic-data work, especially for model-organism research, where the organism's realism can be harmed by introducing behavioural artifacts from the training data. Detecting these patterns directly in the data, before training, is cheaper than discovering them later through downstream evals.\n\nWe also ran an experiment using the results of this pipeline. We took two patterns with >20% frequency in the data: emotional-validation buffering, and BLUF (Bottom Line Up Front), where the opening sentence is a direct response either agreeing with or refuting the user's premise. For each, we filtered the data containing that pattern and retrained. The figure below shows four models - baseline (no synthetic data), full synthetic SFT, BLUF-filtered, and emotional-validation-filtered, across three measures: the delusion confirmation score, and the rates of each structural pattern. All three synthetic-SFT models scored comparably on delusion confirmation, much better than baseline. So removing emotionally validating openings didn't reduce delusion confirmation in our setup, which is some evidence against the intuition that validation buffering leads to delusion validation. 
But the other two panels show each filtering did change the model's structure as expected: the BLUF-filtered model produces less BLUF (52% -> 41%), and the emotional-validation-filtered model produces fewer emotionally validating opening sentences (26% -> 20%). The most interesting takeaway is that **models pick up structural patterns from synthetic data in ways that don't always show up in the eval scores, even when you'd expect them to**. This suggests there's some value in a pipeline which can detect these kinds of patterns directly in the data, rather than only via downstream evals.\n\nIncidentally, another advantage of midtraining over synthetic data is that it can help teach the shape of aligned responses without carrying a bunch of formatting baggage along with it like this. However, this may not outweigh the factors that make midtraining hard to get right - see our takeaways section below.\n\n**Knowledge doesn't always mean internalization.** Alongside the behavioural evals above, we measured whether our models had knowledge of the traits we were trying to teach them, using a knowledge eval inspired by [Slocum et al](https://arxiv.org/abs/2510.17941). We ask open-ended questions such as \"What are three important values?\" or \"List five important principles for how LLMs should interact with humans\". We keep these questions abstract rather than situational, because we're purely trying to measure recall, unlike the behavioural evals. We then used an autorater to score each point the model makes from 0 to 2 by how well it matches one of the traits in our document, then take an average over all the points made by the model in all questions we ask it. The plot below shows how midtraining instils this stated knowledge much more effectively than SFT alone.\n\nOne important takeaway from our project was that we got positive results on knowledge evals before getting positive results on behavioural evals. 
Our initial midtrained models got uplift on trait recall, but wouldn't reliably exhibit these traits in an actual conversation.\n\n**Multi-turn (adversarial) evals are helpful**. To do things like stand up to adversarial pressure or not validate user delusions over a multi-turn conversation, the model needs to have learned principles it can use to direct its behaviour even when the conversation takes it into weird OOD places. Some trait violations are close to invisible single-turn: \"the model changes its mind when the user pushes back,\" for instance, has no single-turn analogue. Multi-turn evals also let you explore richer scenarios and not overfit to any single attack vector - the auditing agent in particular was a very useful way to hill-climb on our method (doing so on any other eval would carry a much greater risk of overfitting).\n\n**Mixing in baseline SFT data can help mitigate capability regressions**. Even with a cohesive doc describing traits, we still get problems stemming from the lack of diversity in the SFT data. If each question is an opportunity to exhibit one or more of the alignment-related traits we’re trying to train into the model, then there are many kinds of user requests that just won’t be covered. We found mixing our synthetic data with baseline SFT data (the same that was used to train the checkpoint we started training from) helped a lot with this - in comparison to finetuning with synthetic-only data *after* our model was already trained on regular data, which was much more likely to lead to strange behavioral collapse, in the style of [Murray et al](https://arxiv.org/abs/2602.05910).\n\n**Midtraining can work, but it’s quite difficult.** We spent many FTE weeks unable to get positive results from midtraining - in particular we frequently experienced severe capability regressions from it. 
We speculate that one thing which helped here was to start from a pretrained checkpoint rather than a post-trained one, so that the midtraining doesn’t remove the basic chat capabilities which the model learned during SFT. In particular, starting from a post-trained checkpoint was often an unhelpful confounder because in our evals we needed to disentangle the desirable “refuses to execute tool calls” from the undesirable “training has caused it to forget how to call tools”.\n\nAs well as this, here are some more speculative things we found useful when doing midtraining, many of which are inspired by or built on the methods from [Li et al](https://alignment.anthropic.com/2026/msm/). Note that we generally didn’t run comprehensive ablations for these; they’re simply the most significant differences between our midtraining datasets which worked well and the ones which didn’t.\n\nWe would be interested in exploring each of these further, and quantifying the extent to which they’re necessary for success of midtraining.\n\nFor people with budget constraints, we recommend using the most expensive and high-quality models only for the critique & rewrite stage, since that seems to be the most important one to get right. Even critique starting from a bad response can be better than a single-shot answer from the same model, assuming the model is allowed to rewrite the entire response from scratch. Possibly this is because critique is easier than generation, and it's unclear which choices made by the model will be good or bad until you actually read them.\n\nExplanation of the capability evals: LMSYS SxS is measured relative to the baseline of SFT-only, 0% synthetic data - hence why that datapoint is near 50%, because this is the model measured against itself. 
The SWE-Bench score is measured relative to the score of the baseline model (again this means the model with no midtraining or synthetic data training).", "url": "https://wpnews.pro/news/synthetic-document-finetuning-for-instilling-positive-traits", "canonical_source": "https://www.lesswrong.com/posts/GTYJRLhqztxKF2v5R/synthetic-document-finetuning-for-instilling-positive-traits", "published_at": "2026-06-16 00:04:26+00:00", "updated_at": "2026-06-16 00:19:27.589339+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-research", "ai-agents"], "entities": ["Google DeepMind", "Gemini 3 Flash", "Marks et al", "Li et al", "Kutasov et al", "Guan et al", "Cho et al"], "alternates": {"html": "https://wpnews.pro/news/synthetic-document-finetuning-for-instilling-positive-traits", "markdown": "https://wpnews.pro/news/synthetic-document-finetuning-for-instilling-positive-traits.md", "text": "https://wpnews.pro/news/synthetic-document-finetuning-for-instilling-positive-traits.txt", "jsonld": "https://wpnews.pro/news/synthetic-document-finetuning-for-instilling-positive-traits.jsonld"}}