I Built a Local LLM Rig to Escape API Bills. Then I Paid OpenAI Again.

A developer running 2asy.ai's filing pipeline built a local LLM rig to escape API costs, but found that OpenAI's batch API outperformed it for large-scale single-document extractions. The local rig remains for live serving and multimodal tasks, while the batch lane moves to OpenAI, achieving 50% cost reduction and zero rate limits.

I run a one-person AI shop. For 2asy.ai's filing pipeline that needs thousands of single-document extractions per cycle, the local rig lost the batch lane and OpenAI Batch won. Per-pipeline, not per-company. The rule that decided it: no cross-document attention. Each filing gets its own prompt window. No string concatenation. The rule came from a Neo4j rollback I already paid for. Quick results. GGML CUDA DISABLE GRAPHS=1 keeps llama.cpp alive when graph optimizer segfaults. googleapis/python-genai issue 1984 is not-planned. gpt-5.4-mini : JSONL line-isolated, 50 percent off, 100-doc nano gate in 2.7 min, zero 429s, around 1 cent per document.The local rig stays for live serving, ER API LLM gate, multimodal, and ablations. The batch lane moves to OpenAI. Full retrospective with the side-by-side table: https://hannune.ai/blog/local-llm-to-openai-batch.html https://hannune.ai/blog/local-llm-to-openai-batch.html