Running 35B–400B LLMs on a GPU-less Cluster to Mine 10,000 Papers, and the 4 Bugs That Almost Ruined the Data

A developer built a CPU-only, distributed LLM pipeline to extract structured data from 10,000 full-text research papers, using a 35B MoE model running on a cluster of older x86 servers with zero GPUs. The pipeline processed 14,000 candidate sentences in a few hours, but four silent data-quality bugs nearly corrupted the results, including a chunk-index reset that caused a 78% data loss in the vector database. The project demonstrated that CPU-based LLM extraction at scale is feasible, with correctness emerging as the primary challenge over throughput.

A field report from building a CPU-only, distributed LLM pipeline for large-scale scientific literature extraction. No GPUs. A lot of quantization. And four silent data-quality bugs that taught me more than the happy path ever did.

Our team runs an internal research cluster: a couple dozen older x86 servers, plenty of RAM, zero GPUs. The mandate was to extract structured data (effect sizes, the entity each one describes, and the direction of effect) from ~10,000 full-text research papers, so a downstream meta-analysis could pool them.

The obvious 2024-era answer is "send it to a hosted LLM API." That wasn't on the table for data-governance reasons: the corpus had to stay on-prem. So the real question became:

Can you do serious LLM extraction at the 10k-document scale with CPUs only?

Spoiler: yes, but the interesting part isn't the throughput. It's that correctness, not speed, turned out to be the hard problem. Let me walk through the architecture, then the four bugs that each silently corrupted the data in a different way.

Everything is open source and CPU-friendly: requests + ThreadPoolExecutor for orchestration.
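A minimal sketch of that orchestration, assuming each node exposes llama.cpp's HTTP server with its /completion endpoint. The names post_completion and run_extraction, the n_predict value, and the timeout are illustrative assumptions, not the pipeline's actual code.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def post_completion(node_url, prompt, timeout=300):
    """Send one prompt to a llama.cpp server and return the generated text."""
    # Third-party dependency, imported lazily so the scheduling logic below
    # can be exercised without it (for example with a stub in tests).
    import requests

    resp = requests.post(
        f"{node_url}/completion",
        json={"prompt": prompt, "n_predict": 256},  # assumed generation cap
        timeout=timeout,                             # assumed timeout, seconds
    )
    resp.raise_for_status()
    return resp.json()["content"]


def run_extraction(prompts, node_urls, call_node=post_completion):
    """Drain one shared queue with exactly one worker thread per node."""
    work = queue.Queue()
    for i, prompt in enumerate(prompts):
        work.put((i, prompt))
    results = [None] * len(prompts)

    def worker(node_url):
        # One in-flight request per node: the thread blocks on each call.
        while True:
            try:
                i, prompt = work.get_nowait()
            except queue.Empty:
                return
            try:
                results[i] = call_node(node_url, prompt)
            except Exception as exc:
                # Keep the failure in the output instead of dropping the item.
                results[i] = exc

    with ThreadPoolExecutor(max_workers=len(node_urls)) as pool:
        # list() forces completion and re-raises unexpected worker errors.
        list(pool.map(worker, node_urls))
    return results
```

Because every worker pulls from the same queue, a slow node simply takes fewer items, which gives load balancing without a scheduler. Storing exceptions in the result list keeps the output the same length as the input, so the two counts can be compared afterwards.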
No Ray, no fancy scheduler: just a queue and one worker bound per node, because each llama.cpp server runs --parallel 1. On CPU, inference is memory-bandwidth bound, so one in-flight request already saturates the memory bus and batching buys little.

Each node is a dual-socket Xeon, ~36 cores total (AVX-512), no accelerator. The 35B MoE generated ~6 tokens/s per node; with 8 nodes load-balanced, a sentence took ~10s end to end and the full 14k-sentence extraction finished in a few hours.

MoE was the unlock for CPU: ~3B active parameters per token means it generates at a usable rate even without a GPU, while delivering quality far above what its ~3B active count alone would suggest.

The ~400B Q3 model was reserved for a separate, earlier abstract-level pass (a different job at a different scale, out of scope for this post) where its stronger one-shot reading paid off. On a single CPU node it ran at low single-digit tokens/s, so routing the sentence-level corpus through it was never viable; everything below is the 35B MoE.

First, a clarification I had to make repeatedly, because it confuses people (it confused me):

A vector DB stores d = -0.45 as a text token inside an embedding. It will happily find that sentence by meaning, but it cannot compute over the number. If your goal is to pool effect sizes, embeddings are the wrong tool. You want extraction.

So the pipeline is a hybrid: a cheap mechanical pass to find candidate sentences, then an LLM to interpret them.

    10k full-text papers
     │
     ├─ ① regex pre-filter   (mechanical, no understanding)
     │     keep sentences that have a number near a target-entity keyword
     │     → ~14k candidate sentences
     │
     └─ ② LLM mapping        (the judgment step)
           each sentence → {entity, metric, direction, value, measure type}
           → structured JSON for the meta-analysis

Regex is the funnel; the LLM is the brain. Neither replaces the other.

Now the fun part.

The embedding side (the RAG corpus) had its own chunking pipeline. It looked fine. Counts looked fine.
Then someone asked a simple question ("how many points are actually in the collection?") and the numbers didn't add up: ~1M chunks generated, ~217k points in the DB. A 78% gap. Where did 800k chunks go?

The culprit was the point ID. Each chunk got an ID derived from (paper_id, chunk_index). Reasonable, except chunk_index was reset to 0 at the start of every section:

    for section in sections:
        for j, chunk in enumerate(chunk_text(section)):  # j resets per section
            point_id = make_id(paper_id, j)              # collision: (abstract, 0) == (methods, 0)
            upsert(point_id, ...)

So a paper's abstract chunk-0, its methods chunk-0, and its results chunk-0 all hashed to the same point ID. Qdrant upserts are idempotent by ID, so each new section silently overwrote the previous one. Every paper collapsed to roughly max(chunks in any single section) points.

I confirmed it by replaying the raw chunks: 27,222 chunks across a sample → only 5,672 unique (paper_id, chunk_index) pairs. That is a 79.2% collision rate on the sample, closely matching the 78% gap across the full DB (the small delta is just sampling: one is a replayed subset, the other the whole collection).

The fix is a one-liner (make chunk_index a running counter across the whole paper, and derive the ID with a deterministic hash like hashlib/UUID, not Python's per-process hash(), so IDs stay stable across runs), but the lesson isn't the fix. It's that a silent overwrite produces a database that looks completely healthy: green status, fast queries, plausible counts. Nothing errors. You only catch it if you reconcile "things I generated" against "things that landed."

Reconcile your pipeline's input count against its output count at every hop. Silent data loss doesn't throw.

While fixing bug 1, I re-ran the chunker on a fresh corpus, and a sample paper produced 7,588 chunks, of which only 1,897 were unique: 75% duplicates. The XML parser walked sections like this:

    for sec in body.findall(".//sec"):  # ALL