Running 35B–400B LLMs on a GPU-less Cluster to Mine 10,000 Papers — and the 4 Bugs That Almost Ruined the Data


A field report from building a CPU-only, distributed LLM pipeline for large-scale scientific literature extraction. No GPUs. A lot of quantization. And four silent data-quality bugs that taught me more than the happy path ever did.

Our team runs an internal research cluster: a couple dozen older x86 servers, plenty of RAM, zero GPUs. The mandate was to extract structured data — effect sizes, the entity each one describes, and the direction of effect — from ~10,000 full-text research papers, so a downstream meta-analysis could pool them.

The obvious 2024-era answer is "send it to a hosted LLM API." That wasn't on the table for data-governance reasons: the corpus had to stay on-prem. So the real question became:

Can you do serious LLM extraction at the 10k-document scale with CPUs only?

Spoiler: yes — but the interesting part isn't the throughput. It's that correctness, not speed, turned out to be the hard problem. Let me walk through the architecture, then the four bugs that each silently corrupted the data in a different way.

Everything is open source and CPU-friendly:

Python (requests and ThreadPoolExecutor) for orchestration. No Ray, no fancy scheduler: just a queue and one worker bound per node, because each llama.cpp server runs --parallel 1. On CPU, inference is memory-bandwidth bound, so one in-flight request already saturates the memory bus and batching buys little.

Each node is a dual-socket Xeon, ~36 cores total (AVX-512), no accelerator. The 35B MoE generated ~6 tokens/s per node; with 8 nodes load-balanced, a sentence took ~10s end to end and the full 14k-sentence extraction finished in a few hours.

MoE was the unlock for CPU: ~3B active parameters per token means it generates at a usable rate even without a GPU, while delivering quality far above what its ~3B active count alone would suggest.

The ~400B Q3 model was reserved for a separate, earlier abstract-level pass — a different job at a different scale, out of scope for this post — where its stronger one-shot reading paid off. On a single CPU node it ran at low single-digit tokens/s, so routing the sentence-level corpus through it was never viable; everything below is the 35B MoE.
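The "one queue, one worker per node" orchestration described above can be sketched roughly as follows. This is an illustration under assumptions, not the code we ran: the function names, the node URLs, and the send callable are all hypothetical. In a real run, send would wrap a requests.post call to one llama.cpp server.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_on_nodes(
    sentences: list[str],
    nodes: list[str],
    send: Callable[[str, str], str],
) -> dict[int, str]:
    """Drain one shared queue with exactly one worker thread per node.

    `send(node_url, sentence)` is a hypothetical callable that performs the
    actual request; it is injected so the sketch runs without a network.
    """
    work: queue.Queue = queue.Queue()
    for item in enumerate(sentences):
        work.put(item)

    results: dict[int, str] = {}

    def worker(node_url: str) -> None:
        # One in-flight request per node: the next sentence is only pulled
        # after the previous one has returned.
        while True:
            try:
                index, sentence = work.get_nowait()
            except queue.Empty:
                return
            results[index] = send(node_url, sentence)

    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(worker, node) for node in nodes]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return results
```

Because every worker pulls from the same queue, a slow node simply takes fewer sentences; no explicit load balancer is needed.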

First, a clarification I had to make repeatedly, because it confuses people (it confused me):

A vector DB stores d = -0.45 as a text token inside an embedding. It will happily find that sentence by meaning, but it cannot compute over the number. If your goal is to pool effect sizes, embeddings are the wrong tool. You want extraction.

So the pipeline is a hybrid: a cheap mechanical pass to find candidate sentences, then an LLM to interpret them.

10k full-text papers
   │
   ├─ ① regex pre-filter  (mechanical, no understanding)
   │     keep sentences that have a number near a target-entity keyword
   │     → ~14k candidate sentences
   │
   └─ ② LLM mapping       (the judgment step)
         each sentence → {entity, metric, direction, value, measure_type}
         → structured JSON for the meta-analysis

Regex is the funnel; the LLM is the brain. Neither replaces the other.
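As a rough sketch of step ①, the pre-filter can be as simple as "is there a number within N characters of a keyword?". The keyword list, the 60-character window, and the function name below are assumptions for illustration; the actual patterns we used are not shown in this post.

```python
import re

# Signed integers and decimals, e.g. 42, -0.45, +3.1
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def is_candidate(sentence: str, keywords: list[str], window: int = 60) -> bool:
    """Keep a sentence if any number sits within `window` characters of a keyword."""
    number_spans = [m.span() for m in NUMBER.finditer(sentence)]
    if not number_spans:
        return False
    for keyword in keywords:
        for hit in re.finditer(re.escape(keyword), sentence, flags=re.IGNORECASE):
            for start, end in number_spans:
                # Character gap between keyword and number (0 if they touch).
                gap = max(start - hit.end(), hit.start() - end, 0)
                if gap <= window:
                    return True
    return False
```

Note that this filter has no understanding at all: given the hypothetical keyword "cortisol", it keeps "Cortisol was lower in patients (d = -0.45)." and it equally keeps a sentence where the nearby number is a coordinate. Deciding which is which is the LLM's job.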

Now the fun part.

The embedding side (the RAG corpus) had its own chunking pipeline. It looked fine. Counts looked fine. Then someone asked a simple question — "how many points are actually in the collection?" — and the numbers didn't add up: ~1M chunks generated, ~217k points in the DB.

A 78% gap. Where did 800k chunks go?

The culprit was the point ID. Each chunk got an ID derived from (paper_id, chunk_index). Reasonable, except chunk_index was reset to 0 at the start of every section:

for section in sections:
    for j, chunk in enumerate(chunk_text(section)):   # j resets per section!
        point_id = make_id(paper_id, j)               # collision: (abstract,0) == (methods,0)
        upsert(point_id, ...)

So a paper's abstract chunk-0 and its methods chunk-0 and its results chunk-0 all hashed to the same point ID. Qdrant upserts are idempotent by ID, so each new section silently overwrote the previous one. Every paper collapsed to roughly max(chunks in any single section) points.

I confirmed it by replaying the raw chunks: 27,222 chunks across a sample yielded only 5,672 unique (paper_id, chunk_index) pairs. That is a 79.2% collision rate on the sample, closely matching the 78% gap across the full DB (the small delta is just sampling: one is a replayed subset, the other the whole collection).
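The replay check amounts to counting unique ID pairs against total chunks. Here is a minimal sketch with hypothetical helper names; the toy paper with sections of 3, 2, and 4 chunks is made up, while the 27,222 and 5,672 figures are the ones quoted above.

```python
def replay_collisions(pairs):
    """Return (total, unique, fraction lost to ID collisions)."""
    pairs = list(pairs)
    total = len(pairs)
    unique = len(set(pairs))
    lost = 1 - unique / total if total else 0.0
    return total, unique, lost


def ids_with_per_section_reset(paper_id, section_sizes):
    """Reproduce the bug: the chunk index restarts at 0 in every section."""
    return [(paper_id, j) for size in section_sizes for j in range(size)]


# Toy paper: the 9 chunks collapse to 4 IDs, the size of the largest section.
total, unique, lost = replay_collisions(ids_with_per_section_reset("p1", [3, 2, 4]))

# The sample from the replay: 1 - 5672 / 27222 is roughly 0.792.
sample_loss = 1 - 5672 / 27222
```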

The fix is a one-liner: make chunk_index a running counter across the whole paper (and derive the ID with a deterministic hash such as hashlib or a UUID, not Python's per-process hash(), so IDs stay stable across runs). But the lesson isn't the fix. It's that a silent overwrite produces a database that looks completely healthy: green status, fast queries, plausible counts. Nothing errors. You only catch it if you reconcile "things I generated" against "things that landed."
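One way to write that fix, as a sketch rather than our exact code: keep a single counter per paper and derive the ID with uuid5, which is deterministic across processes. The namespace URL and the function names are placeholders I made up for the example.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/paper-chunks")


def make_point_id(paper_id: str, chunk_index: int) -> str:
    """Deterministic ID: same inputs give the same UUID on every run."""
    return str(uuid.uuid5(NAMESPACE, f"{paper_id}:{chunk_index}"))


def point_ids_for_paper(paper_id: str, sections: list[list[str]]):
    """Yield (point_id, chunk) with an index that runs across the whole paper."""
    chunk_index = 0  # never reset per section
    for section_chunks in sections:
        for chunk in section_chunks:
            yield make_point_id(paper_id, chunk_index), chunk
            chunk_index += 1
```

With the running counter, a paper with sections of 2, 1, and 3 chunks produces 6 distinct IDs instead of 3.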

Reconcile your pipeline's input count against its output count at every hop. Silent data loss doesn't throw.
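A reconciliation check does not need to be clever. The sketch below is a hypothetical helper, not part of our pipeline: it raises as soon as the count that landed differs from the count that was generated by more than a tolerance. Where the two counts come from (a chunk counter on one side, an exact count query against the collection on the other) is up to you.

```python
def reconcile(hop: str, generated: int, landed: int, tolerance: float = 0.0) -> None:
    """Raise if the counts on either side of a pipeline hop disagree."""
    if generated < 0 or landed < 0:
        raise ValueError("counts must be non-negative")
    mismatch = abs(generated - landed)
    fraction = mismatch / generated if generated else float(landed > 0)
    if fraction > tolerance:
        raise RuntimeError(
            f"{hop}: generated {generated}, landed {landed} "
            f"({fraction:.1%} mismatch)"
        )
```

Run against the numbers from this bug, reconcile("chunk -> upsert", 1_000_000, 217_000) fails loudly with a 78.3% mismatch, which is exactly the error the database itself never raised.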

While fixing #1, I re-ran the chunker on a fresh corpus and a sample paper produced 7,588 chunks, of which only 1,897 were unique — 75% duplicates.

The XML parser walked sections like this:

for sec in body.findall(".//sec"):          # ALL <sec>, including nested ones
    paragraphs = sec.findall(".//p")          # ALL <p>, recursively

In journal XML, sections nest: a parent <sec> contains child <sec>s. Because .//p is recursive, the parent emitted all of its children's paragraphs, and then each child <sec> was visited separately and emitted them again. Deeply nested papers (a conference-proceedings document with 600 sub-sections was the worst) exploded.

Fix: take direct-child paragraphs only (sec.findall("p")), plus a within-paper dedup as a safety net. Chunks dropped to the honest count, and embedding time dropped with it.
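To make the difference concrete, here is a small self-contained comparison using the standard library's ElementTree. The three-level sample document and the function names are invented for the example, and real journal XML may also need namespace handling that this sketch ignores.

```python
import xml.etree.ElementTree as ET

SAMPLE = "<body><sec><p>A</p><sec><p>B</p><sec><p>C</p></sec></sec></sec></body>"


def paragraphs_buggy(body: ET.Element) -> list[str]:
    """Original behavior: every <sec> re-emits all of its descendants' <p>."""
    return [
        "".join(p.itertext()).strip()
        for sec in body.findall(".//sec")
        for p in sec.findall(".//p")
    ]


def paragraphs_once(body: ET.Element) -> list[str]:
    """Fixed behavior: direct-child <p> only, plus a within-paper dedup."""
    seen: set[str] = set()
    out: list[str] = []
    for sec in body.iter("sec"):       # still visits nested sections
        for p in sec.findall("p"):     # but only their own paragraphs
            text = "".join(p.itertext()).strip()
            if text and text not in seen:
                seen.add(text)
                out.append(text)
    return out


body = ET.fromstring(SAMPLE)
```

On the sample, the buggy walk returns six paragraphs for three real ones (A, B, C, B, C, C); the fixed walk returns each exactly once.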

.// in XPath is a footgun when your tree is recursive and you also iterate the tree.

Onto the extraction LLM, the 35B MoE workhorse. It's a reasoning model that emits a <think>…</think> block before its answer. The first run capped generation at 512 tokens with stop=["\n\n"]. Result: 0% parse rate. The \n\n stop fired inside the thinking block, truncating mid-thought; no JSON ever appeared.

OK, remove the bad stop, give it room. Bump to 1024 tokens. Now ~42% parse. Better, but a third of outputs were still a <think> with no </think>: the model hit the token cap while still reasoning.

So give it more room. 2048 tokens, 600-second timeout, quality-first. I ran a single hard sentence as a test. It generated 6,144 characters in 269 seconds and still hadn't closed the think block — it was literally mid-sentence, "Let's draft the JSON:", when it ran out of budget. At that rate, 14k sentences would take ~5 days and still fail on the hard ones.
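The ~5-day estimate is back-of-the-envelope arithmetic. The sketch below assumes, pessimistically, that every sentence takes as long as the hard test sentence and that the 8 nodes mentioned earlier split the work evenly; the helper name is mine.

```python
def wall_clock_seconds(n_sentences: int, seconds_each: float, nodes: int) -> float:
    """Total wall-clock time if the work is split evenly across nodes."""
    return n_sentences * seconds_each / nodes


# 14k sentences at 269 s each on 8 nodes: about 5.4 days.
days_at_269s = wall_clock_seconds(14_000, 269, 8) / 86_400

# Same corpus at the ~10 s per sentence quoted earlier: about 4.9 hours.
hours_at_10s = wall_clock_seconds(14_000, 10, 8) / 3_600
```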

The model wasn't slow. It was non-terminating: on ambiguous inputs it reasoned in circles and never committed to an answer. More tokens didn't help; it just thought more.

The fix is a known trick for reasoning models in raw-completion mode: pre-close the think block in the prompt so the model skips deliberation and answers directly:

prompt = f"...<|im_start|>assistant\n<think>\n\n</think>\n\n"

Latency dropped from "minutes, maybe never" to ~10 seconds, deterministically. The whole 14k run finished in hours, at 99.96% parse.
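In code, the trick is just string concatenation plus a sanity check on the output. This sketch assumes the llama.cpp server's raw completion endpoint with its prompt, n_predict, and temperature fields; the 512-token cap and the function names are illustrative, and prompt_prefix stands in for the instruction part of the prompt that is elided above.

```python
# Assistant turn with the think block already opened and closed.
NO_THINK_SUFFIX = "<|im_start|>assistant\n<think>\n\n</think>\n\n"


def no_think_prompt(prompt_prefix: str) -> str:
    """Append a pre-closed think block so the model answers directly."""
    return prompt_prefix + NO_THINK_SUFFIX


def completion_payload(prompt_prefix: str, max_tokens: int = 512) -> dict:
    """JSON body for a raw-completion request (values are illustrative)."""
    return {
        "prompt": no_think_prompt(prompt_prefix),
        "n_predict": max_tokens,
        "temperature": 0,
    }


def is_unterminated_think(output: str) -> bool:
    """Detect the failure mode above: a <think> that never closes."""
    return "<think>" in output and "</think>" not in output
```

The is_unterminated_think check is worth keeping in the parse loop even after the fix, so a regression to open-ended reasoning shows up as a counted failure rather than a slow run.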

A reasoning model with an unbounded thinking budget is a liability for bulk structured output. If you don't need the chain-of-thought, close it.

No-think mode had its own quirk: on ~14% of the harder sentences, the model returned completely empty output. Not bad JSON — nothing. Deterministic (temperature 0), so retries reproduced the emptiness exactly.

The model, forced to answer immediately, was "blanking" on sentences it found ambiguous. The fix was almost insultingly small: seed the assistant turn with an opening bracket so the model is already inside a JSON array and must continue it:

prompt = f"...<|im_start|>assistant\n<think>\n\n</think>\n\n["

(You then prepend the [ back when parsing, since the completion only returns what comes after the prompt.) This recovered 298 of 301 empties, for 99.86% parse on the hard subset.
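A sketch of both halves, seeding and parsing, with hypothetical function names. It uses raw_decode so that anything the model emits after the closing bracket is ignored; that tolerance is my own choice for the example, not something the original pipeline is known to do. prompt_prefix again stands in for the elided part of the prompt.

```python
import json

SEED = "["  # the model starts out already inside a JSON array


def seeded_prompt(prompt_prefix: str) -> str:
    """Pre-close the think block, then open the JSON array for the model."""
    return prompt_prefix + "<|im_start|>assistant\n<think>\n\n</think>\n\n" + SEED


def parse_seeded(completion: str):
    """Re-attach the seed and parse; return a list, or None on failure."""
    text = SEED + completion
    try:
        # raw_decode parses the first JSON value and ignores trailing text.
        data, _end = json.JSONDecoder().raw_decode(text)
    except ValueError:
        return None
    return data if isinstance(data, list) else None
```

A completion of just "]" parses to an empty list, which is a legitimate "nothing to extract" answer and distinct from the None returned for empty or broken output.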

When a model can output "nothing," constrain the output space so "nothing" isn't reachable.

The last lesson is subtler. The first extraction run mapped a number whenever a sentence had a number near a target-entity keyword. The audit found ~50% of the mapped "effect sizes" weren't the target effect at all: they were regression-predictor t-values (age, sex, medication), correlations with secondary task scores, even positional coordinates (x = -28) the regex had grabbed as if they were measurements.

That noise produced a confident-but-spurious aggregate signal. Garbage in, significant garbage out.

The fix had two halves, a paper-level on-topic filter and a sharpened extraction prompt, and I got it wrong in an instructive way.

I over-corrected: my first sharpened prompt rejected so aggressively that it returned [] for valid patient-vs-control effects too (1/15 on a sanity sample). The filter and the prompt were fighting: the filter guaranteed the paper was on-topic, but the prompt still demanded an explicit topic keyword in the sentence. Once I told the prompt "you can trust that this sentence is from an on-topic paper; extract the entity's effect and only reject these specific noise types," recall snapped back (9/15) with zero coordinate leakage.

Precision in the mapped set went from ~49% to ~66% at the sentence level; at the paper level — meaning every paper that contributes an effect is genuinely on-topic — it was 100%. Total entries dropped from ~4,900 to ~1,700, almost all of it noise. The residual ~34% sentence-level noise isn't pooled blind, but be precise about what catches it: the load-bearing filter downstream is entity normalization against a controlled vocabulary — off-target entities (age, sex, medication) get dropped there — backed by a validation gate. (Stratifying by measure type and dedup are cleanup, not misclassification removal: a predictor t-value mislabeled as a target effect sails right through those.) The mapping's job is to maximize signal and flag; the controlled-vocabulary step is where the final noise is supposed to die.
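The controlled-vocabulary step can be pictured as a lookup that either maps an extracted entity to a canonical name or drops the entry. The vocabulary entries and function names below are invented for illustration; the real vocabulary is not part of this post.

```python
# Hypothetical controlled vocabulary: surface form -> canonical entity.
CONTROLLED_VOCAB = {
    "cortisol": "cortisol",
    "salivary cortisol": "cortisol",
    "hippocampal volume": "hippocampal_volume",
}


def normalize_entity(raw: str):
    """Lowercase, collapse whitespace, then look up the canonical name."""
    key = " ".join(raw.lower().split())
    return CONTROLLED_VOCAB.get(key)


def drop_off_target(entries: list[dict]) -> list[dict]:
    """Keep only entries whose entity is in the vocabulary, normalized."""
    kept = []
    for entry in entries:
        canonical = normalize_entity(str(entry.get("entity") or ""))
        if canonical is not None:
            kept.append({**entry, "entity": canonical})
    return kept
```

An entry whose entity is "age" or "Medication" has no vocabulary match and is dropped here, regardless of how confident the mapping step was about its value.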

The most dangerous extraction failure isn't a crash or a low parse rate. It's clean-looking data that's confidently wrong. Audit what your pipeline includes, not just what it drops.

Pre-closing <think> and seeding the output bracket turned a 5-day non-terminating job into a few-hour deterministic one.

None of these are exotic. They're the unglamorous correctness work that sits between "the demo runs" and "the numbers are trustworthy", which, for anything feeding a real analysis, is the whole job.

This pipeline powered the large-scale literature extraction behind our chronic-stress scoping-review preprint (Research Square).

Tools used: llama.cpp, BGE-M3, Qdrant, Python. All on-prem, all CPU.

Source: dev.to (original article)