Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

Published June 4, 2026

In large-scale LLM development, the question is no longer simply how much data a model sees. It is also whether the data contains enough structured learning signals. General web, code, math, multilingual, and domain data provide a broad base. Task-seeded synthetic Q&A complements them by adding compact, task-structured examples with a clear information need, a constrained response space, and explanations that connect evidence to an answer. In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded synthetic data generation (SDG) improved MMLU-Pro by +1.8, average code by +1.9, commonsense understanding by +1.6, and GPQA by +11.1, while average math remained stable.

This post describes a task-seeded synthetic Q&A generation workflow developed for Nemotron-family training, including Ultra and Super training runs. The workflow uses training splits from broad public task families as capability seeds, generates new task-aligned examples, enriches them with reasoning and relevant knowledge, and filters them into curated synthetic datasets. Held-out evaluation and test data are excluded from generation. Downstream training recipes can then decide how to mix those datasets with the broader corpus.

Figure 1. The task-seeded SDG pipeline ends at curated generated data. Training mixture design and reported evaluations happen downstream.

TL;DR #

We use public task training splits as capability seeds, not as examples to memorize.
We frame the data through transfer learning across task families: a model can learn reusable behaviors from broad seed tasks, then apply them to related applications and evaluations.
The pipeline generates similar questions and answer-enriched examples with reasoning and task-relevant context.
Multiple-choice tasks are easier to verify; open generation tasks need task-specific extraction and filtering.
In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded SDG improved MMLU-Pro by +1.8, average code by +1.9, commonsense understanding by +1.6, and GPQA by +11.1 while keeping average math stable.

At A Glance #

Element	Value
Seed source	Public task training splits available through `lm-eval-harness`
Scale	About 70 tasks and about 700 subtasks
Data types	Similar questions, answer-enriched samples, reasoning/context traces
Verification	Schema checks, format checks, deduplication, majority-voted answer checks
Training use	Late-stage Nemotron-family training, including Ultra/Super workstreams
Main result	Gains on MMLU-Pro, code, commonsense, and GPQA in a 100B-token Nemotron-3 Nano continuation
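
The majority-voted answer check listed in the table can be illustrated with a minimal sketch. It assumes that several candidate answers are sampled for the same generated question and that a record is kept only when enough of them agree. The function name `majority_answer`, the `min_agreement` threshold, and the sample values are hypothetical and not taken from the actual pipeline.

```python
from collections import Counter
from typing import Optional


def majority_answer(candidates: list[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the most common answer if enough samples agree, else None.

    min_agreement is an assumed threshold, not a value from the article.
    """
    # Normalize lightly so "B", "b", and "B " count as the same vote.
    normalized = [c.strip().lower() for c in candidates if c.strip()]
    if not normalized:
        return None
    answer, count = Counter(normalized).most_common(1)[0]
    return answer if count / len(normalized) >= min_agreement else None


# Illustrative votes: four of five samples agree on the same option.
print(majority_answer(["B", "b", "B ", "C", "B"]))
# No answer reaches the threshold, so the record would be dropped.
print(majority_answer(["A", "B", "C"]))
```

A stricter threshold trades data volume for label quality; the right setting likely depends on the task.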

Generation Pipeline #

The generation workflow is a compact loop: collect training-split seeds, normalize heterogeneous task records, generate new examples, enrich answers, and filter the resulting data. In the internal pipeline, we used roughly 70 public task datasets from `lm-eval-harness`, covering about 700 subtasks. For each task, we used only suitable training splits as SDG seeds; held-out test data was not used for generation, and tasks without suitable training data were excluded from seed collection.

The seed pool covered both knowledge-intensive and reasoning-intensive tasks:

Seed group	Approximate coverage	Purpose
Knowledge-intensive tasks	39 tasks, about 300 subtasks, about 3M seed samples	Improve factual, scientific, multilingual, and domain-specific QA behavior
Reasoning-intensive tasks	34 tasks, about 400 subtasks, about 1.5M seed samples	Improve analytical reasoning, logical reasoning, math, code, and commonsense reasoning

For Nemotron Ultra and Super pretraining, we used a license-compatible subset of the generated data suitable for commercial model training.

The end-to-end process has five stages:

Collect seed tasks. Enumerate available `lm-eval-harness` tasks, group them by output type, and keep only tasks with suitable training splits.
Normalize records. Since each `lm-eval-harness` task defines its own fields and formatting in YAML, we convert task records into a unified JSONL-style schema. For multiple-choice tasks, the normalized record contains the question and candidate options. For generative tasks, it contains the question or prompt, plus context when the task provides it.
Generate similar examples. Given a seed example, the generator creates a new question that preserves the underlying capability while changing the content.
Enrich answers. The generator solves the generated questions and adds the final answer plus relevant reasoning, knowledge, or context.
Filter and package. The pipeline applies schema checks, format checks, deduplication, and task-specific answer validation where possible. Multiple-choice data is easier to verify directly; generation-style data requires more cautious task-specific handling.
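
The normalization stage can be sketched as follows. This is a minimal illustration, not the pipeline's actual schema: the unified keys (`task`, `type`, `question`, `options`, `context`), the `field_map` argument, and the example task and field names are all assumptions made for the sketch.

```python
import json
from typing import Any


def normalize_record(task: str, raw: dict[str, Any], field_map: dict[str, str]) -> dict[str, Any]:
    """Map task-specific field names onto one shared schema.

    field_map (hypothetical) maps unified key -> field name in the raw record.
    """
    record: dict[str, Any] = {"task": task, "question": raw[field_map["question"]]}
    if "options" in field_map:
        # Multiple-choice: keep the question and candidate options.
        record["type"] = "multiple_choice"
        record["options"] = list(raw[field_map["options"]])
    else:
        # Generative: keep the prompt, plus context when the task provides it.
        record["type"] = "generation"
        context_field = field_map.get("context")
        if context_field and raw.get(context_field):
            record["context"] = raw[context_field]
    return record


def is_valid(record: dict[str, Any]) -> bool:
    """Basic schema check: non-empty question, at least two options for MC."""
    question = record.get("question")
    if not isinstance(question, str) or not question.strip():
        return False
    if record.get("type") == "multiple_choice":
        return len(record.get("options", [])) >= 2
    return True


# Illustrative raw record with made-up field names ("stem", "choices").
raw = {"stem": "Which material conducts electricity best?", "choices": ["copper", "rubber"]}
record = normalize_record("example_science_mc", raw, {"question": "stem", "options": "choices"})
if is_valid(record):
    print(json.dumps(record))  # one JSONL line
```

Keeping the per-task mapping in data rather than code means a new task only needs a new `field_map` entry, which matters when the seed pool spans hundreds of subtasks.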

One practical formatting choice is to store semantic answer text rather than only option labels when possible. For example, writing the answer as `dirt trapped under the fingernails` gives the model a clearer training signal than only writing `B`.
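
A small sketch of this label-to-text conversion, assuming options are stored as an ordered list and labels are single letters starting at `A`. The helper name `answer_text` and the first option string are hypothetical placeholders; only the answer text for `B` comes from the example above.

```python
def answer_text(options: list[str], label: str) -> str:
    """Resolve an option label such as "B" to the option's text."""
    cleaned = label.strip().upper()
    if len(cleaned) != 1:
        raise ValueError(f"expected a single-letter label, got {label!r}")
    index = ord(cleaned) - ord("A")
    if not 0 <= index < len(options):
        raise ValueError(f"label {label!r} is out of range for {len(options)} options")
    return options[index]


# Option A is a placeholder; option B is the answer text from the example.
options = ["(placeholder option A)", "dirt trapped under the fingernails"]
print(answer_text(options, "B"))
```

Raising on an out-of-range label, rather than returning a default, lets the format check reject malformed records early.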

Why Task-Seeded Data? #

Public task datasets are imperfect, but their training splits contain compact examples of how information is requested, constrained, and resolved. They capture useful correlations among task framing, domain knowledge, reasoning depth, candidate answers, and final response form. A model may see abundant raw text during pretraining and still benefit from synthetic data that makes those correlations explicit. Task-seeded synthetic data addresses this gap by turning public task training splits into data generation templates. Using only suitable training splits from broad task families, we generate new examples that preserve useful properties of the source interaction:

task framing, such as whether the example asks for selection, generation, classification, or explanation;
answer structure, such as multiple-choice options, short answers, free-form responses, or format-constrained outputs;
domain and context, such as science, commonsense, factual knowledge, math, code, multilingual QA, or reading comprehension;
difficulty and reasoning depth, such as whether the example requires a direct fact, a comparison among alternatives, or several reasoning steps;
explanatory signal, such as task-relevant knowledge, reasoning, or context that helps connect the question to the answer.

This lets us expose the model to reusable reasoning and knowledge-use patterns across task families, without tying the dataset to the surface format of one data source.

Why Use Broader Seed Tasks? #

A useful way to interpret this pipeline is through transfer learning across task families. Many improvements do not come from learning a single task's surface format. They come from strengthening reusable behaviors that appear across many tasks: identifying the information need, applying relevant domain knowledge, separating plausible alternatives, following response constraints, doing multi-step reasoning, and grounding a final answer in the right context.

Because of this, we do not generate from a narrow set of task formats. Instead, we collect a broader set of training-split seed samples from `lm-eval-harness` and use them to cover many neighboring capability regions. A science QA seed can help with commonsense physical reasoning. A logical reasoning seed can help with careful alternative comparison. A math or code seed can help with multi-step planning even when the final application is not exactly the same task. The goal is positive transfer learning across task families, while reducing the risk that the model simply learns the quirks of a single data source.

This motivation is also consistent with earlier evidence in Nemotron Nano pretraining. We found that using AGIEval training data improved MMLU-Pro, suggesting that structured Q&A data from one task family can improve behavior outside the original source family. The broader seed collection used here extends that idea: rather than leaning on one task source, it uses many training-split task families so that transferable reasoning, knowledge-use, and answer-selection behaviors have more opportunities to appear.

Why Add Context And Reasoning? #

The answer alone is often a weak training signal, especially for science, commonsense, and multi-step reasoning examples. Adding task-relevant knowledge or reasoning traces gives the model a path from question to answer, and helps it learn why plausible distractors are wrong.

The PIQA-style example in Figure 2 shows this distinction in a compact setting. The generated question can be answered with the correct option alone, but the answer-generation variants add the definition, historical context, and distractor analysis that make the record a stronger learning signal.

Figure 2. A concrete SDG transformation: a PIQA-style seed leads to fresh similar questions, and one generated question is expanded into two answer-enriched records.

In an internal with-context versus no-context ablation, the context-enriched variant gave stronger numbers on several knowledge- and reasoning-heavy evaluations:

Evaluation	No context	With context	Change
ARC-Challenge	91.89	92.24	+0.35
CommonsenseQA	80.02	80.26	+0.24
PIQA	82.86	84.44	+1.58
WinoGrande	79.87	80.51	+0.64
AGIEval-en CoT	63.16	69.32	+6.16
GPQA-Diamond CoT n-shot	34.85	45.96	+11.11
MMLU-Pro 5-shot	64.45	66.89	+2.44
MBPP+ sampled	73.77	74.82	+1.05

Training Use #

The task-seeded synthetic data was mixed into late-stage Nemotron-family training. In one 100B-token continuation experiment on the Nemotron-3 Nano model, adding newly synthesized task-seeded data improved several capability groups:

Metric group	Before	After	Change
MMLU-Pro	64.8	66.6	+1.8
Average code	73.2	75.1	+1.9
Average math	87.6	87.9	+0.3
Commonsense understanding	72.9	74.5	+1.6
GPQA	30.8	41.9	+11.1

These results are encouraging for two reasons. First, the improvements are not limited to the most direct target. MMLU-Pro, code, commonsense understanding, and GPQA all improve, while math remains stable. Second, the large GPQA improvement suggests that examples enriched with relevant knowledge and reasoning can help models handle more difficult scientific reasoning questions.

This pattern is consistent with the transfer-learning interpretation discussed above: gains appear across multiple capability groups rather than only in one closely related evaluation.

The result also comes with a caveat: targeted task-seeded data must be mixed carefully. Gains on one evaluation should be checked alongside broad general-knowledge retention, not in isolation.
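
One way to make that retention check routine is to compare every metric group before and after a data change and flag anything that moves beyond a tolerance. The sketch below uses the before/after numbers from the table above; the function name `retention_report` and the 0.5-point tolerance are assumptions chosen for illustration, not thresholds from the experiment.

```python
# Before/after scores copied from the continuation-experiment table.
results = {
    "MMLU-Pro": (64.8, 66.6),
    "Average code": (73.2, 75.1),
    "Average math": (87.6, 87.9),
    "Commonsense understanding": (72.9, 74.5),
    "GPQA": (30.8, 41.9),
}


def retention_report(scores: dict[str, tuple[float, float]], tolerance: float = 0.5):
    """Label each metric as improved, stable, or regressed.

    tolerance is an assumed noise band in score points.
    """
    report = {}
    for name, (before, after) in scores.items():
        delta = round(after - before, 1)
        if delta > tolerance:
            status = "improved"
        elif delta < -tolerance:
            status = "regressed"
        else:
            status = "stable"
        report[name] = (delta, status)
    return report


for name, (delta, status) in retention_report(results).items():
    print(f"{name}: {delta:+.1f} ({status})")
```

Under this assumed tolerance, average math (+0.3) is reported as stable and the other four groups as improved, which matches the reading given in the text.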

What We Learned #

Several practical findings stood out:

Broad seed coverage improves generalization. Using many task families reduces the risk of overfitting to one evaluation style.
Context helps. Reasoning and task-relevant knowledge make synthetic answers more useful than labels alone.
Output format is part of the training signal. Seemingly small choices, such as answer letter versus answer text, can affect downstream behavior.
Multiple-choice tasks are easier to verify than open generation tasks. Generation tasks require task-specific answer extraction and may need separate handling.
Mixture design matters. Natural sample-count distributions can overweight large tasks, so important task families may need explicit sampling controls.
Benchmark improvements should be checked against broad capability retention. MMLU-Pro or GPQA gains are more meaningful when math, code, commonsense, and general knowledge remain stable.
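
The mixture-design point can be illustrated with exponent-scaled sampling, a common way to soften the dominance of large tasks. This is a sketch of one possible sampling control, not the method used for Nemotron training; the function name `sampling_weights`, the exponent `alpha`, and the sample counts are all hypothetical.

```python
def sampling_weights(counts: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Turn per-task sample counts into sampling probabilities.

    alpha = 1.0 reproduces the natural count distribution;
    smaller alpha shifts probability toward smaller tasks.
    """
    scaled = {name: n ** alpha for name, n in counts.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Illustrative counts: one task is 100x larger than the other.
counts = {"large_task": 1_000_000, "small_task": 10_000}
natural = sampling_weights(counts, alpha=1.0)
softened = sampling_weights(counts, alpha=0.5)
print(f"natural:  small_task = {natural['small_task']:.3f}")
print(f"softened: small_task = {softened['small_task']:.3f}")
```

With these made-up counts, the small task's share rises from about 1% under natural sampling to about 9% at `alpha = 0.5`. Explicit per-family caps or floors are an alternative when specific task families must be guaranteed a share.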

Conclusion #

Task-seeded synthetic data gives model builders a practical way to target the skills that matter in late-stage training. By collecting broad training-split task seeds, generating new examples, enriching answers with reasoning and knowledge, and filtering the resulting data carefully, we can improve model performance on difficult reasoning and knowledge tasks while preserving broad capabilities.

For Nemotron Ultra and Super pretraining, this workflow provides a scalable recipe for making synthetic data more intentional. The key is not simply to generate more data, but to generate data with the right structure, the right explanatory signal, and enough metadata for downstream mixture decisions.

Source: huggingface.co (original article)