A Chinese 8B model beat the Western 8B models at Japanese RAG. I still wouldn't put it in the default deployment — and that distinction is the point.

A developer benchmarked Japanese, Western, and Chinese 8B models on a Japanese RAG task, finding that Japanese-tuned models significantly outperform Western 8B models, while Chinese deepseek-r1-8b matches Japanese-tuned performance. However, the developer argues that deployment eligibility depends on policy constraints, not just capability, and recommends keeping the two decisions separate.

Extends an earlier model-selection benchmark to three model families (Japanese / Western / Chinese) on a Japanese RAG task.

Repo + raw results: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/model-selection-v2

An earlier post benchmarked local models for a Japanese RAG task and settled on selecting by constraint rather than raw capability. This post widens the field to three families (Japanese-tuned, Western open, and Chinese), and the result forces a distinction that matters more than any single score: model capability and deployment eligibility are two different questions, and conflating them is how people get model selection wrong.

Same Japanese RAG task, same judge protocol, same discriminating golden set (oracle 87.5%, only 11% of questions answered by all models, so it actually separates the field). hit@5, 8B class unless noted:

| Model | Family | hit@5 |
|---|---|---|
| Swallow-8B | Japanese-tuned | ~0.53 |
| Nemotron-9B-JP | Japanese-tuned | ~0.62 |
| ELYZA-JP-8B | Japanese-tuned | ~0.40 |
| deepseek-r1-8b | Chinese | ~0.51 |
| Llama-3.1-8B | Western | ~0.22 |
| Mistral-7B | Western | ~0.18 |
| gemma4-31b | Western (31B) | ~0.62 |

Three things fall out of this, and they don't all point the same direction.

The Western 8B models cratered: Llama-3.1-8B at 0.22, Mistral-7B at 0.18, against a Japanese-tuned average around 0.52. That's not a small gap; it's the difference between usable and not. This answers a question people sometimes ask skeptically: why do Japanese-specific models exist when Llama is right there? At the 8B scale, on a Japanese retrieval-grounded task, a generic Western model without Japanese fine-tuning is not in the running. The Japanese tuning is doing decisive work.

One honest qualifier on the table: gemma4-31b (0.62) is the one Western model that holds up, but it's 31B, not 8B. It earns its score with roughly 4× the parameters, not with Japanese optimization.

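
The per-family comparison within the 8B class can be recomputed from the table. This is a minimal sketch, assuming the rounded hit@5 values shown in the table rather than the raw results in the repo; the `ROWS` layout and the `family_means` helper are hypothetical names used only for illustration.

```python
from collections import defaultdict
from statistics import mean

# (model, family, size class, approximate hit@5 copied from the table)
ROWS = [
    ("Swallow-8B", "Japanese-tuned", "8B", 0.53),
    ("Nemotron-9B-JP", "Japanese-tuned", "8B", 0.62),
    ("ELYZA-JP-8B", "Japanese-tuned", "8B", 0.40),
    ("deepseek-r1-8b", "Chinese", "8B", 0.51),
    ("Llama-3.1-8B", "Western", "8B", 0.22),
    ("Mistral-7B", "Western", "8B", 0.18),
    ("gemma4-31b", "Western", "31B", 0.62),
]


def family_means(rows, size_class="8B"):
    """Mean hit@5 per family, restricted to a single size class."""
    by_family = defaultdict(list)
    for _model, family, row_class, hit_at_5 in rows:
        if row_class == size_class:
            by_family[family].append(hit_at_5)
    return {family: round(mean(scores), 2) for family, scores in by_family.items()}


means_8b = family_means(ROWS)
for family, value in means_8b.items():
    print(f"{family}: {value:.2f}")
```

Restricting to one size class keeps gemma4-31b out of the Western 8B figure, so the larger model cannot mask how the 8B Western models did. With the table values, the Japanese-tuned mean rounds to 0.52, which matches the average quoted above.
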
So read the table in two tiers: within the 8B class, Japanese-tuned wins clearly; across sizes, you can buy Western competitiveness with a much bigger model. Don't read "gemma is strong" as "Western 8B is fine": the 8B Western models specifically failed.

deepseek-r1-8b scored 0.51, above the Western 8B models by a wide margin and right in the range of the Japanese-tuned models. On capability alone, measured on this task, it's a real contender. I want to be precise here because it's easy to be sloppy: the data says this model is good at the task. That's a measurement, and I'm reporting it straight.

For Japanese enterprise deployment, my default model lineup excludes Chinese models. Not because of the score (the score is fine), but because of deployment-policy constraints that are independent of capability.

So the model goes in my content/research layer, where I'll benchmark it, learn from it, and report its numbers honestly as I just did, but not in the deployment default I'd recommend to a Japanese enterprise client. That separation is a standing decision in how I structure this work, and this benchmark is exactly why the separation has to be explicit: if you collapse capability and deployability into one axis, you'll either deploy something that fails procurement, or dismiss something that's actually good.

This is, I think, the part of the job that separates a solutions/forward-deployed engineer from someone who only runs benchmarks. The benchmark tells you what's capable. The deployment decision is a different function: it takes in the score and the client's compliance reality, the procurement constraints, and the data-handling posture. Those are not the model's fault or merit; they're the deployment context. Keeping the two reasoning steps separate is the skill.

Selecting a model for deployment is not "pick the highest score."
It's a two-step function: measure capability honestly, then filter by the deployment context (size constraints, latency, language fit, and procurement/compliance reality). The Chinese model passed step one and is filtered at step two for reasons that aren't about its quality. The Western 8B models failed step one outright. The Japanese-tuned models pass both for this client profile. Reporting all of that accurately, including saying clearly that the model I won't deploy is genuinely good, is the job.

Raw numbers, judge protocol, the discriminating golden set: https://github.com/elvisyao007/eval-driven-llm/tree/main/reports/model-selection-v2

Companion: eval-sanity (the sanity gate confirming the metric discriminates before any score is trusted).
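
The two-step selection described in this post can be sketched in a few lines. This is an illustration under stated assumptions, not the tooling in the repo: the `Candidate`, `DeploymentContext`, and `Verdict` types are hypothetical, the 0.35 capability bar is an invented threshold chosen only so that the outcome matches the post, and the policy is reduced to a set of excluded families. The scores are the approximate hit@5 values from the table, 8B class only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    family: str
    hit_at_5: float  # approximate value from the table


@dataclass(frozen=True)
class DeploymentContext:
    min_hit_at_5: float           # hypothetical capability bar (assumption)
    excluded_families: frozenset  # hypothetical stand-in for client policy


@dataclass(frozen=True)
class Verdict:
    name: str
    capable: bool    # step 1: measured on the task
    policy_ok: bool  # step 2: decided by the deployment context

    @property
    def deployable(self) -> bool:
        return self.capable and self.policy_ok


def evaluate(candidate: Candidate, ctx: DeploymentContext) -> Verdict:
    # The two questions are answered independently and only combined at the end.
    capable = candidate.hit_at_5 >= ctx.min_hit_at_5
    policy_ok = candidate.family not in ctx.excluded_families
    return Verdict(candidate.name, capable, policy_ok)


CANDIDATES = [
    Candidate("Swallow-8B", "Japanese-tuned", 0.53),
    Candidate("Nemotron-9B-JP", "Japanese-tuned", 0.62),
    Candidate("ELYZA-JP-8B", "Japanese-tuned", 0.40),
    Candidate("deepseek-r1-8b", "Chinese", 0.51),
    Candidate("Llama-3.1-8B", "Western", 0.22),
    Candidate("Mistral-7B", "Western", 0.18),
]

# Assumed client profile: Japanese enterprise default lineup.
ctx = DeploymentContext(min_hit_at_5=0.35, excluded_families=frozenset({"Chinese"}))

verdicts = {c.name: evaluate(c, ctx) for c in CANDIDATES}
for v in verdicts.values():
    print(f"{v.name}: capable={v.capable} policy_ok={v.policy_ok} deployable={v.deployable}")
```

Keeping `capable` and `policy_ok` as separate fields is the point of the sketch: a report built from `Verdict` can say that deepseek-r1-8b is capable and still not deployable for this client, instead of collapsing both facts into one ranking.
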