John6666’s pipeline advice is excellent — I’ll add a different layer: we run a fleet of small qwen-based agents in production (7B–32B, long-lived sessions, tool calling), and several failure modes we learned the hard way are directly relevant to what you’re building. Dataset design can’t fix all of them — knowing which ones to solve elsewhere will save you months.
1. Don’t spend dataset budget teaching what decoding can enforce. Tool calling on small qwens drifts: the model emits JSON-as-prose instead of actual structured calls, intermittently, no matter how many clean examples it saw in SFT. The fix that actually held for us was enforcement at the sampling layer — constrained generation / grammar-based decoding (llama.cpp grammars, Ollama’s format=schema, outlines). Put a modest amount of tool-calling data in SFT for format familiarity, then enforce structure at inference. A 0.8B will never be reliable on this from data alone.
2. Language drift fires exactly at uncertainty — train for those moments. Small qwens flip into Chinese or English mid-sentence precisely around hedge phrases and confusion (“hmm, let me think…”). For a Persian-only assistant this is your biggest UX risk, because qwen’s pretraining gravity is zh/en. Two-part fix: include SFT examples where the model is uncertain, confused, or correcting itself entirely in Persian (those are the states where drift triggers), and pin an explicit language anchor in the deployed system prompt. Test specifically: ambiguous questions, misspelled student input, out-of-domain asks — drift hides there, not in clean QA.
3. Persian-specific corpus traps worth checking before any filtering (verify these against your data — they’re the classic Persian NLP issues): ZWNJ (U+200C) is grammatically meaningful (میروم vs می روم) and web scrapes routinely strip or mangle it — and because the damage is uniform, your n-gram filter won’t flag it; you’d be teaching broken orthography fluently. Also normalize Arabic-vs-Persian codepoints (ي→ی, ك→ک) — OSCAR is notoriously mixed. Libraries like hazm/parsivar handle both. One hour of character-level spot-checking raw text by eye is worth more than any perplexity filter; we learned this from a corrupted-extraction incident where the model produced fluent, confident, completely wrong output — garbage substrate doesn’t look like garbage after training, it looks like confidence.
4. For the grammar-tutor data: be careful with negative examples. Small models pattern-match surface text; a dataset rich in wrong-Persian labeled “incorrect” can prime the very errors it warns against. Prefer correct-demonstration-heavy data, with error-correction examples in a rigid, clearly-fenced format (student error → correction → one-line rule), sparse rather than abundant.
5. Design for short sessions. Small qwens degrade hard past ~8–10 turns/tool-calls — fabrication and loops cascade. For a student assistant that’s fine if the data teaches it: self-contained exchanges, summarize-and- reset patterns, not long winding dialogues the deployed model can’t sustain.
Your CPT-first instinct already proved itself (the Wikipedia improvement) — these five are about not losing that gain at the edges. Good luck — a Persian tutor on low-end devices is exactly the kind of thing small models should be doing.