{"slug": "how-can-i-build-a-high-quality-dataset", "title": "How can i build a High Quality dataset?", "summary": "A developer building a Persian-language tutor using small Qwen-based agents shares five dataset design lessons learned from production failures, including avoiding tool-calling data where decoding can enforce structure, training for language drift during uncertainty, checking Persian-specific corpus traps like ZWNJ and Arabic codepoints, limiting negative examples in grammar data, and designing for short sessions to prevent degradation.", "body_md": "John6666’s pipeline advice is excellent — I’ll add a different layer: we run a fleet of small qwen-based agents in production (7B–32B, long-lived sessions, tool calling), and several failure modes we learned the hard way are directly relevant to what you’re building. Dataset design can’t fix all of them — knowing which ones to solve elsewhere will save you months.\n\n**1. Don’t spend dataset budget teaching what decoding can enforce.** Tool calling on small qwens drifts: the model emits JSON-as-prose instead of actual structured calls, intermittently, no matter how many clean examples it saw in SFT. The fix that actually held for us was enforcement at the sampling layer — constrained generation / grammar-based decoding (llama.cpp grammars, Ollama’s format=schema, outlines). Put a *modest* amount of tool-calling data in SFT for format familiarity, then enforce structure at inference. A 0.8B will never be reliable on this from data alone.\n\n**2. Language drift fires exactly at uncertainty — train for those moments.** Small qwens flip into Chinese or English mid-sentence precisely around hedge phrases and confusion (“hmm, let me think…”). For a Persian-only assistant this is your biggest UX risk, because qwen’s pretraining gravity is zh/en. 
Two-part fix: include SFT examples where the model is *uncertain, confused, or correcting itself entirely in Persian* (those are the states where drift triggers), and pin an explicit language anchor in the deployed system prompt. Test specifically: ambiguous questions, misspelled student input, out-of-domain asks — drift hides there, not in clean QA.\n\n**3. Persian-specific corpus traps worth checking before any filtering** (verify these against your data — they’re the classic Persian NLP issues): ZWNJ (U+200C) is grammatically meaningful (میروم vs می روم) and web scrapes routinely strip or mangle it — and because the damage is *uniform*, your n-gram filter won’t flag it; you’d be teaching broken orthography fluently. Also normalize Arabic-vs-Persian codepoints (ي→ی, ك→ک) — OSCAR is notoriously mixed. Libraries like hazm/parsivar handle both. One hour of character-level spot-checking raw text by eye is worth more than any perplexity filter; we learned this from a corrupted-extraction incident where the model produced fluent, confident, completely wrong output — garbage substrate doesn’t look like garbage after training, it looks like confidence.\n\n**4. For the grammar-tutor data: be careful with negative examples.** Small models pattern-match surface text; a dataset rich in wrong-Persian labeled “incorrect” can prime the very errors it warns against. Prefer correct-demonstration-heavy data, with error-correction examples in a rigid, clearly-fenced format (student error → correction → one-line rule), sparse rather than abundant.\n\n**5. Design for short sessions.** Small qwens degrade hard past ~8–10 turns/tool-calls — fabrication and loops cascade. For a student assistant that’s fine *if* the data teaches it: self-contained exchanges, summarize-and-reset patterns, not long winding dialogues the deployed model can’t sustain.\n\nYour CPT-first instinct already proved itself (the Wikipedia improvement) — these five are about not losing that gain at the edges. 
Good luck — a Persian tutor on low-end devices is exactly the kind of thing small models *should* be doing.", "url": "https://wpnews.pro/news/how-can-i-build-a-high-quality-dataset", "canonical_source": "https://discuss.huggingface.co/t/how-can-i-build-a-high-quality-dataset/176571#post_15", "published_at": "2026-06-18 03:17:40+00:00", "updated_at": "2026-06-18 03:28:24.246623+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "natural-language-processing", "ai-products", "ai-infrastructure"], "entities": ["Qwen", "Ollama", "llama.cpp", "OSCAR", "hazm", "parsivar"], "alternates": {"html": "https://wpnews.pro/news/how-can-i-build-a-high-quality-dataset", "markdown": "https://wpnews.pro/news/how-can-i-build-a-high-quality-dataset.md", "text": "https://wpnews.pro/news/how-can-i-build-a-high-quality-dataset.txt", "jsonld": "https://wpnews.pro/news/how-can-i-build-a-high-quality-dataset.jsonld"}}