# How can i build a High Quality dataset?

> Source: <https://discuss.huggingface.co/t/how-can-i-build-a-high-quality-dataset/176571#post_15>
> Published: 2026-06-18 03:17:40+00:00

John6666’s pipeline advice is excellent — I’ll add a different layer: we run a fleet of small qwen-based agents in production (7B–32B, long-lived sessions, tool calling), and several failure modes we learned the hard way are directly relevant to what you’re building. Dataset design can’t fix all of them — knowing which ones to solve elsewhere will save you months.

**1. Don’t spend dataset budget teaching what decoding can enforce.** Tool calling on small qwens drifts: the model emits JSON-as-prose instead of actual structured calls, intermittently, no matter how many clean examples it saw in SFT. The fix that actually held for us was enforcement at the sampling layer — constrained generation / grammar-based decoding (llama.cpp grammars, Ollama’s format=schema, outlines). Put a *modest* amount of tool-calling data in SFT for format familiarity, then enforce structure at inference. A 0.8B will never be reliable on this from data alone.

**2. Language drift fires exactly at uncertainty — train for those moments.** Small qwens flip into Chinese or English mid-sentence precisely around hedge phrases and confusion (“hmm, let me think…”). For a Persian-only assistant this is your biggest UX risk, because qwen’s pretraining gravity is zh/en. Two-part fix: include SFT examples where the model is *uncertain, confused, or correcting itself entirely in Persian* (those are the states where drift triggers), and pin an explicit language anchor in the deployed system prompt. Test specifically: ambiguous questions, misspelled student input, out-of-domain asks — drift hides there, not in clean QA.

**3. Persian-specific corpus traps worth checking before any filtering** (verify these against your data — they’re the classic Persian NLP issues): ZWNJ (U+200C) is grammatically meaningful (میروم vs می روم) and web scrapes routinely strip or mangle it — and because the damage is *uniform*, your n-gram filter won’t flag it; you’d be teaching broken orthography fluently. Also normalize Arabic-vs-Persian codepoints (ي→ی, ك→ک) — OSCAR is notoriously mixed. Libraries like hazm/parsivar handle both. One hour of character-level spot-checking raw text by eye is worth more than any perplexity filter; we learned this from a corrupted-extraction incident where the model produced fluent, confident, completely wrong output — garbage substrate doesn’t look like garbage after training, it looks like confidence.

**4. For the grammar-tutor data: be careful with negative examples.** Small models pattern-match surface text; a dataset rich in wrong-Persian labeled “incorrect” can prime the very errors it warns against. Prefer correct-demonstration-heavy data, with error-correction examples in a rigid, clearly-fenced format (student error → correction → one-line rule), sparse rather than abundant.

**5. Design for short sessions.** Small qwens degrade hard past ~8–10 turns/tool-calls — fabrication and loops cascade. For a student assistant that’s fine *if* the data teaches it: self-contained exchanges, summarize-and- reset patterns, not long winding dialogues the deployed model can’t sustain.

Your CPT-first instinct already proved itself (the Wikipedia improvement) — these five are about not losing that gain at the edges. Good luck — a Persian tutor on low-end devices is exactly the kind of thing small models *should* be doing.
