I run a one-person AI shop. For 2asy.ai's filing pipeline that needs thousands of single-document extractions per cycle, the local rig lost the batch lane and OpenAI Batch won. Per-pipeline, not per-company.
The rule that decided it: no cross-document attention. Each filing gets its own prompt window. No string concatenation. The rule came from a Neo4j rollback I already paid for.
Quick results.
GGML_CUDA_DISABLE_GRAPHS=1
keeps llama.cpp alive when graph optimizer segfaults.googleapis/python-genai
issue 1984 is not-planned.gpt-5.4-mini
): JSONL line-isolated, 50 percent off, 100-doc nano gate in 2.7 min, zero 429s, around 1 cent per document.The local rig stays for live serving, ER API LLM gate, multimodal, and ablations. The batch lane moves to OpenAI.
Full retrospective with the side-by-side table: https://hannune.ai/blog/local-llm-to-openai-batch.html