{"slug": "running-35b-400b-llms-on-a-gpu-less-cluster-to-mine-10000-papers-and-the-4-bugs", "title": "Running 35B–400B LLMs on a GPU-less Cluster to Mine 10,000 Papers — and the 4 Bugs That Almost Ruined the Data", "summary": "A developer built a CPU-only, distributed LLM pipeline to extract structured data from 10,000 full-text research papers, using a 35B MoE model running on a cluster of older x86 servers with zero GPUs. The pipeline processed 14,000 candidate sentences in a few hours, but four silent data-quality bugs nearly corrupted the results, including a chunk-index reset that caused a 78% data loss in the vector database. The project demonstrated that CPU-based LLM extraction at scale is feasible, with correctness emerging as the primary challenge over throughput.", "body_md": "A field report from building a CPU-only, distributed LLM pipeline for large-scale scientific literature extraction. No GPUs. A lot of quantization. And four silent data-quality bugs that taught me more than the happy path ever did.\n\nOur team runs an internal research cluster: a couple dozen older x86 servers, plenty of RAM, **zero GPUs**. The mandate was to extract structured data — effect sizes, the entity each one describes, and the direction of effect — from ~10,000 full-text research papers, so a downstream meta-analysis could pool them.\n\nThe obvious 2024-era answer is \"send it to a hosted LLM API.\" That wasn't on the table for data-governance reasons: the corpus had to stay on-prem. So the real question became:\n\n**Can you do serious LLM extraction at the 10k-document scale with CPUs only?**\n\nSpoiler: yes — but the interesting part isn't the throughput. It's that *correctness*, not speed, turned out to be the hard problem. Let me walk through the architecture, then the four bugs that each silently corrupted the data in a different way.\n\nEverything is open source and CPU-friendly:\n\n- llama.cpp for inference\n- BGE-M3 for embeddings, with Qdrant as the vector store\n- Python (`requests` + `ThreadPoolExecutor`) for orchestration. 
No Ray, no fancy scheduler — just a queue and one worker bound per node, because each llama.cpp server runs `--parallel 1`: on CPU, inference is memory-bandwidth bound, so one in-flight request already saturates the memory bus and batching buys little.\n\nEach node is a dual-socket Xeon, ~36 cores total (AVX-512), no accelerator. The 35B MoE generated **~6 tokens/s per node**; with 8 nodes load-balanced, a sentence took ~10s end to end and the full 14k-sentence extraction finished in a few hours.\n\nMoE was the unlock for CPU: ~3B active parameters per token means it generates at a usable rate even without a GPU, while delivering quality far above what its ~3B active count alone would suggest.\n\nThe ~400B Q3 model was reserved for a separate, earlier abstract-level pass — a different job at a different scale, out of scope for this post — where its stronger one-shot reading paid off. On a single CPU node it ran at low single-digit tokens/s, so routing the sentence-level corpus through it was never viable; everything below is the 35B MoE.\n\nFirst, a clarification I had to make repeatedly, because it confuses people (it confused me):\n\nA vector DB stores `d = -0.45` as a text token inside an embedding. It will happily *find* that sentence by meaning, but it cannot *compute* over the number. If your goal is to pool effect sizes, embeddings are the wrong tool. You want extraction.\n\nSo the pipeline is a **hybrid**: a cheap mechanical pass to find candidate sentences, then an LLM to interpret them.\n\n```\n10k full-text papers\n   │\n   ├─ ① regex pre-filter  (mechanical, no understanding)\n   │     keep sentences that have a number near a target-entity keyword\n   │     → ~14k candidate sentences\n   │\n   └─ ② LLM mapping       (the judgment step)\n         each sentence → {entity, metric, direction, value, measure_type}\n         → structured JSON for the meta-analysis\n```\n\nRegex is the funnel; the LLM is the brain. 
Neither replaces the other.\n\nNow the fun part.\n\nThe embedding side (the RAG corpus) had its own chunking pipeline. It looked fine. Counts looked fine. Then someone asked a simple question — \"how many points are actually in the collection?\" — and the numbers didn't add up: **~1M chunks generated, ~217k points in the DB.**\n\nA 78% gap. Where did 800k chunks go?\n\nThe culprit was the point ID. Each chunk got an ID derived from `(paper_id, chunk_index)`. Reasonable — except `chunk_index` was **reset to 0 at the start of every section**:\n\n```\nfor section in sections:\n    for j, chunk in enumerate(chunk_text(section)):   # j resets per section!\n        point_id = make_id(paper_id, j)               # collision: (abstract,0) == (methods,0)\n        upsert(point_id, ...)\n```\n\nSo a paper's *abstract* chunk-0 and its *methods* chunk-0 and its *results* chunk-0 all hashed to the **same point ID**. Qdrant's upsert replaces any existing point with the same ID, so each new section silently **overwrote** the previous one. Every paper collapsed to roughly `max(chunks in any single section)` points.\n\nI confirmed it by replaying the raw chunks: 27,222 chunks across a sample → only 5,672 unique `(paper_id, chunk_index)` pairs. **79.2% collision** on the sample, closely matching the 78% gap across the full DB (the small delta is just sampling — one is a replayed subset, the other the whole collection).\n\nThe fix is a one-liner — make `chunk_index` a running counter across the whole paper (and derive the ID with a deterministic hash like `hashlib`/UUID, not Python's per-process `hash()`, so IDs stay stable across runs) — but the lesson isn't the fix. It's that **a silent overwrite produces a database that looks completely healthy**: green status, fast queries, plausible counts. Nothing errors. You only catch it if you reconcile \"things I generated\" against \"things that landed.\"\n\nReconcile your pipeline's input count against its output count at every hop. 
Silent data loss doesn't throw.\n\nWhile fixing #1, I re-ran the chunker on a fresh corpus and a sample paper produced **7,588 chunks, of which only 1,897 were unique** — 75% duplicates.\n\nThe XML parser walked sections like this:\n\n```\nfor sec in body.findall(\".//sec\"):          # ALL <sec>, including nested ones\n    paragraphs = sec.findall(\".//p\")          # ALL <p>, recursively\n```\n\nIn journal XML, sections nest. A parent `<sec>` contains child `<sec>`s. `.//p` is recursive, so the parent emitted *all* of its children's paragraphs — and then each child `<sec>` was visited separately and emitted them *again*. Deeply nested papers (a conference-proceedings document with 600 sub-sections was the worst) exploded.\n\nFix: take **direct-child** paragraphs only (`sec.findall(\"p\")`), plus a within-paper dedup as a safety net. Chunks dropped to the honest count, and embedding time dropped with it.\n\n`.//` in XPath is a footgun when your tree is recursive and you also iterate the tree.\n\nOnto the extraction LLM — the 35B MoE workhorse. It's a reasoning model that emits a `<think>…</think>` block before its answer. The first run capped generation at 512 tokens with `stop=[\"\\n\\n\"]`. Result: **0% parse rate**. The `\\n\\n` stop fired *inside* the thinking block, truncating mid-thought; no JSON ever appeared.\n\nOK, remove the bad stop, give it room. Bump to 1024 tokens. Now ~42% parse — better, but a third of outputs were still `<think>` with no `</think>`: the model hit the token cap *still reasoning*.\n\nSo give it more room. 2048 tokens, 600-second timeout, quality-first. I ran a single hard sentence as a test. It generated **6,144 characters in 269 seconds and still hadn't closed the think block** — it was literally mid-sentence, \"Let's draft the JSON:\", when it ran out of budget. At that rate, 14k sentences would take **~5 days** and *still* fail on the hard ones.\n\nThe model wasn't slow. 
It was **non-terminating**: on ambiguous inputs it reasoned in circles and never committed to an answer. More tokens didn't help; it just thought more.\n\nThe fix is a known trick for reasoning models in raw-completion mode: **pre-close the think block in the prompt** so the model skips deliberation and answers directly:\n\n```\nprompt = f\"...<|im_start|>assistant\\n<think>\\n\\n</think>\\n\\n\"\n#                                  ^ empty, pre-closed → no open-ended reasoning\n```\n\nLatency dropped from \"minutes, maybe never\" to **~10 seconds**, deterministically. The whole 14k run finished in hours, at 99.96% parse.\n\nA reasoning model with an unbounded thinking budget is a liability for bulk structured output. If you don't need the chain-of-thought, close it.\n\nNo-think mode had its own quirk: on ~14% of the harder sentences, the model returned **completely empty output**. Not bad JSON — nothing. Deterministic (temperature 0), so retries reproduced the emptiness exactly.\n\nThe model, forced to answer immediately, was \"blanking\" on sentences it found ambiguous. The fix was almost insultingly small: **seed the assistant turn with an opening bracket** so the model is already inside a JSON array and *must* continue it:\n\n```\nprompt = f\"...<|im_start|>assistant\\n<think>\\n\\n</think>\\n\\n[\"\n#                                                          ^ forces JSON to start\n```\n\n(You then prepend the `[` back when parsing, since the completion only returns what comes *after* the prompt.) This recovered 298 of 301 empties → 99.86% parse on the hard subset.\n\nWhen a model can output \"nothing,\" constrain the output space so \"nothing\" isn't reachable.\n\nThe last lesson is subtler. The first extraction run mapped a number whenever a sentence had a number near a target-entity keyword. 
The audit found **~50% of the mapped \"effect sizes\" weren't the target effect at all** — they were regression-predictor t-values (age, sex, medication), correlations with secondary *task* scores, even positional coordinates (`x = -28`) the regex had grabbed as if they were measurements.\n\nThat noise produced a confident-but-spurious aggregate signal. Garbage in, *significant* garbage out.\n\nThe fix had two halves, a paper-level on-topic filter and a sharpened extraction prompt, and I got it wrong in an instructive way.\n\nI over-corrected: my first sharpened prompt rejected so aggressively it returned `[]` for valid patient-vs-control effects too (1/15 on a sanity sample). The filter and the prompt were fighting — the filter guaranteed the paper was on-topic, but the prompt still demanded an explicit topic keyword *in the sentence*. Once I told the prompt \"you can trust that this sentence is from an on-topic paper; extract the entity's effect and only reject these specific noise types,\" recall snapped back (9/15) with zero coordinate leakage.\n\nPrecision in the mapped set went from ~49% to ~66% at the *sentence* level; at the *paper* level — meaning every paper that contributes an effect is genuinely on-topic — it was 100%. Total entries dropped from ~4,900 to ~1,700, almost all of it noise. The residual ~34% sentence-level noise isn't pooled blind, but be precise about what catches it: the load-bearing filter downstream is **entity normalization against a controlled vocabulary** — off-target entities (age, sex, medication) get dropped there — backed by a validation gate. (Stratifying by measure type and dedup are cleanup, not misclassification removal: a predictor t-value mislabeled as a target effect sails right through those.) The mapping's job is to maximize signal and flag; the controlled-vocabulary step is where the final noise is supposed to die.\n\nThe most dangerous extraction failure isn't a crash or a low parse rate. It's clean-looking data that's confidently wrong. 
Audit what your pipeline includes, not just what it drops.\n\nPre-closing `<think>` and seeding the output bracket turned a 5-day non-terminating job into a few-hour deterministic one.\n\nNone of these are exotic. They're the unglamorous correctness work that sits between \"the demo runs\" and \"the numbers are trustworthy\" — which, for anything feeding a real analysis, is the whole job.\n\n*This pipeline powered the large-scale literature extraction behind our chronic-stress scoping-review preprint (Research Square).*\n\n*Tools used: llama.cpp, BGE-M3, Qdrant, Python. All on-prem, all CPU.*", "url": "https://wpnews.pro/news/running-35b-400b-llms-on-a-gpu-less-cluster-to-mine-10000-papers-and-the-4-bugs", "canonical_source": "https://dev.to/sysoft/running-35b-400b-llms-on-a-gpu-less-cluster-to-mine-10000-papers-and-the-4-bugs-that-almost-ka3", "published_at": "2026-06-03 05:55:34+00:00", "updated_at": "2026-06-03 06:11:40.211585+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-infrastructure", "ai-research"], "entities": ["llama.cpp", "Xeon", "AVX-512"], "alternates": {"html": "https://wpnews.pro/news/running-35b-400b-llms-on-a-gpu-less-cluster-to-mine-10000-papers-and-the-4-bugs", "markdown": "https://wpnews.pro/news/running-35b-400b-llms-on-a-gpu-less-cluster-to-mine-10000-papers-and-the-4-bugs.md", "text": "https://wpnews.pro/news/running-35b-400b-llms-on-a-gpu-less-cluster-to-mine-10000-papers-and-the-4-bugs.txt", "jsonld": "https://wpnews.pro/news/running-35b-400b-llms-on-a-gpu-less-cluster-to-mine-10000-papers-and-the-4-bugs.jsonld"}}