How can I build a high-quality dataset?

A developer building a Persian-language tutor using small Qwen-based agents shares five dataset design lessons learned from production failures, including avoiding tool-calling data where decoding can enforce structure, training for language drift during uncertainty, checking Persian-specific corpus traps like ZWNJ and Arabic codepoints, limiting negative examples in grammar data, and designing for short sessions to prevent degradation.

John6666’s pipeline advice is excellent. I’ll add a different layer: we run a fleet of small Qwen-based agents in production (7B–32B, long-lived sessions, tool calling), and several failure modes we learned the hard way are directly relevant to what you’re building. Dataset design can’t fix all of them, and knowing which ones to solve elsewhere will save you months.

1. Don’t spend dataset budget teaching what decoding can enforce. Tool calling on small Qwens drifts: the model intermittently emits JSON-as-prose instead of actual structured calls, no matter how many clean examples it saw in SFT. The fix that actually held for us was enforcement at the sampling layer: constrained generation / grammar-based decoding (llama.cpp grammars, Ollama’s format=schema, outlines). Put a modest amount of tool-calling data in SFT for format familiarity, then enforce structure at inference. A 0.8B model will never be reliable on this from data alone.

2. Language drift fires exactly at uncertainty, so train for those moments. Small Qwens flip into Chinese or English mid-sentence precisely around hedge phrases and confusion (“hmm, let me think…”). For a Persian-only assistant this is your biggest UX risk, because Qwen’s pretraining gravity is zh/en. Two-part fix: include SFT examples where the model is uncertain, confused, or correcting itself entirely in Persian (those are the states where drift triggers), and pin an explicit language anchor in the deployed system prompt. Test specifically on ambiguous questions, misspelled student input, and out-of-domain asks; drift hides there, not in clean QA.

3. Persian-specific corpus traps worth checking before any filtering (verify these against your data; they’re the classic Persian NLP issues). ZWNJ (U+200C) is grammatically meaningful (می‌روم vs. می روم) and web scrapes routinely strip or mangle it. Because the damage is uniform, your n-gram filter won’t flag it; you’d be teaching broken orthography fluently.
Also normalize Arabic-vs-Persian codepoints (ي→ی, ك→ک); OSCAR is notoriously mixed. Libraries like hazm/parsivar handle both. One hour of character-level spot-checking raw text by eye is worth more than any perplexity filter. We learned this from a corrupted-extraction incident where the model produced fluent, confident, completely wrong output: a garbage substrate doesn’t look like garbage after training, it looks like confidence.

4. For the grammar-tutor data, be careful with negative examples. Small models pattern-match surface text; a dataset rich in wrong Persian labeled “incorrect” can prime the very errors it warns against. Prefer correct-demonstration-heavy data, with error-correction examples in a rigid, clearly fenced format (student error → correction → one-line rule), sparse rather than abundant.

5. Design for short sessions. Small Qwens degrade hard past ~8–10 turns/tool calls: fabrication and loops cascade. For a student assistant that’s fine if the data teaches it: self-contained exchanges and summarize-and-reset patterns, not long winding dialogues the deployed model can’t sustain.

Your CPT-first instinct already proved itself (the Wikipedia improvement). These five are about not losing that gain at the edges. Good luck. A Persian tutor on low-end devices is exactly the kind of thing small models should be doing.
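
P.S. For point 3, the character-level check can be roughed out in a few lines of standard-library Python before you reach for hazm or parsivar. This is only a sketch under assumptions: the function names `audit_persian` and `normalize_codepoints` are hypothetical, and it covers just the two Arabic-vs-Persian pairs mentioned above, so treat it as a spot-check aid rather than a full normalizer.

```python
from collections import Counter

ZWNJ = "\u200c"          # zero-width non-joiner
ARABIC_YEH = "\u064a"    # ي
ARABIC_KAF = "\u0643"    # ك
PERSIAN_YEH = "\u06cc"   # ی
PERSIAN_KAF = "\u06a9"   # ک

# Only the two pairs named in the post; real corpora may need more mappings.
_ARABIC_TO_PERSIAN = str.maketrans({ARABIC_YEH: PERSIAN_YEH, ARABIC_KAF: PERSIAN_KAF})


def audit_persian(text):
    """Count the characters worth eyeballing before any filtering."""
    counts = Counter(text)
    return {
        "arabic_yeh": counts[ARABIC_YEH],
        "arabic_kaf": counts[ARABIC_KAF],
        "zwnj": counts[ZWNJ],
        # Everything in the Unicode Arabic block (U+0600..U+06FF), for scale.
        "arabic_block_chars": sum(
            n for ch, n in counts.items() if "\u0600" <= ch <= "\u06ff"
        ),
    }


def normalize_codepoints(text):
    """Map Arabic yeh/kaf to their Persian codepoints; ZWNJ is left untouched."""
    return text.translate(_ARABIC_TO_PERSIAN)


if __name__ == "__main__":
    # Arabic kaf and yeh, with a ZWNJ inside the verb.
    sample = "\u0643\u062a\u0627\u0628 \u0645\u064a\u200c\u0631\u0648\u0645"
    print(audit_persian(sample))
    print(audit_persian(normalize_codepoints(sample)))
```

Run the audit per source (OSCAR shard, Wikipedia dump, your own scrape) and compare. A source whose ZWNJ count is near zero relative to its Arabic-block character count has probably had them stripped, and that is the one to inspect by eye first.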