cd /news/large-language-models/how-can-i-build-a-high-quality-datas… · home topics large-language-models article
[ARTICLE · art-32029] src=discuss.huggingface.co ↗ pub= topic=large-language-models verified=true sentiment=· neutral

How can i build a High Quality dataset?

A developer building a Persian-language tutor using small Qwen-based agents shares five dataset design lessons learned from production failures, including avoiding tool-calling data where decoding can enforce structure, training for language drift during uncertainty, checking Persian-specific corpus traps like ZWNJ and Arabic codepoints, limiting negative examples in grammar data, and designing for short sessions to prevent degradation.

read2 min views1 publishedJun 18, 2026

John6666’s pipeline advice is excellent — I’ll add a different layer: we run a fleet of small qwen-based agents in production (7B–32B, long-lived sessions, tool calling), and several failure modes we learned the hard way are directly relevant to what you’re building. Dataset design can’t fix all of them — knowing which ones to solve elsewhere will save you months.

1. Don’t spend dataset budget teaching what decoding can enforce. Tool calling on small qwens drifts: the model emits JSON-as-prose instead of actual structured calls, intermittently, no matter how many clean examples it saw in SFT. The fix that actually held for us was enforcement at the sampling layer — constrained generation / grammar-based decoding (llama.cpp grammars, Ollama’s format=schema, outlines). Put a modest amount of tool-calling data in SFT for format familiarity, then enforce structure at inference. A 0.8B will never be reliable on this from data alone.

2. Language drift fires exactly at uncertainty — train for those moments. Small qwens flip into Chinese or English mid-sentence precisely around hedge phrases and confusion (“hmm, let me think…”). For a Persian-only assistant this is your biggest UX risk, because qwen’s pretraining gravity is zh/en. Two-part fix: include SFT examples where the model is uncertain, confused, or correcting itself entirely in Persian (those are the states where drift triggers), and pin an explicit language anchor in the deployed system prompt. Test specifically: ambiguous questions, misspelled student input, out-of-domain asks — drift hides there, not in clean QA.

3. Persian-specific corpus traps worth checking before any filtering (verify these against your data — they’re the classic Persian NLP issues): ZWNJ (U+200C) is grammatically meaningful (میروم vs می روم) and web scrapes routinely strip or mangle it — and because the damage is uniform, your n-gram filter won’t flag it; you’d be teaching broken orthography fluently. Also normalize Arabic-vs-Persian codepoints (ي→ی, ك→ک) — OSCAR is notoriously mixed. Libraries like hazm/parsivar handle both. One hour of character-level spot-checking raw text by eye is worth more than any perplexity filter; we learned this from a corrupted-extraction incident where the model produced fluent, confident, completely wrong output — garbage substrate doesn’t look like garbage after training, it looks like confidence.

4. For the grammar-tutor data: be careful with negative examples. Small models pattern-match surface text; a dataset rich in wrong-Persian labeled “incorrect” can prime the very errors it warns against. Prefer correct-demonstration-heavy data, with error-correction examples in a rigid, clearly-fenced format (student error → correction → one-line rule), sparse rather than abundant.

5. Design for short sessions. Small qwens degrade hard past ~8–10 turns/tool-calls — fabrication and loops cascade. For a student assistant that’s fine if the data teaches it: self-contained exchanges, summarize-and- reset patterns, not long winding dialogues the deployed model can’t sustain.

Your CPT-first instinct already proved itself (the Wikipedia improvement) — these five are about not losing that gain at the edges. Good luck — a Persian tutor on low-end devices is exactly the kind of thing small models should be doing.

── more in #large-language-models 4 stories · sorted by recency
── more on @qwen 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/how-can-i-build-a-hi…] indexed:0 read:2min 2026-06-18 ·