{"slug": "90-cheaper-repo-inference-with-gpt-5-4-nano", "title": "90% cheaper repo inference with GPT-5.4 nano", "summary": "On April 27, 2026, an engineering team reduced the cost of repo inference by 90% by switching from GPT-5.4 to GPT-5.4-nano for a bounded classification step in their orchestration pipeline. The smaller model, paired with a tightened prompt, passed a validation harness with 100% accuracy on test fixtures, cutting per-call costs from $0.0429 to $0.00414. The change demonstrates that narrow, observable orchestration tasks can be reliably handled by smaller models, significantly lowering inference expenses without sacrificing accuracy.", "body_md": "For bounded orchestration decisions, the right model is often the smallest one that can pass a focused validation loop.\n\nMost of the visible work in an engineering agent happens after it starts touching code: reading files, proposing changes, running tests, and opening PRs. The less visible cost is the orchestration work around that: deciding what context to fetch, which tool to call, and where the work should happen.\n\nRepo inference is one of those steps. When Charlie receives a task, he often needs to decide which customer GitHub repository the task is actually about. The repo-inference step examines the customer’s repo inventory and selects the primary repo for the work. That sounds simple until the signal comes from a Linear comment, a Slack thread, a GitHub webhook, or a request that mentions a product feature rather than a repo name.\n\nAfter the V2 rollout, repo inference became one of our larger orchestrator inference costs. It also ran at roughly 2× the event volume of the next-highest orchestrator inference path. 
That made it a good place to test a practical question: does this bounded routing step need a larger general model, or can a smaller model pass the validation loop?\n\nThe implementation that rolled out on April 27, 2026 moved repo inference from our `gpt-5.4` preset to a `gpt-5.4-nano` preset. The default preset in `infer-task-repo.ts` changed from:\n\n`openai/gpt-5.4-low-reasoning-low-verbosity-priority`\n\nto:\n\n`openai/gpt-5.4-nano-low-reasoning-low-verbosity-priority`\n\nThe preset still used low reasoning, low verbosity, and the priority service tier. The main change was the underlying model: `gpt-5.4-nano`.\n\nWe also tightened the repo-choice prompt. The older prompt relied on a looser evidence hierarchy. The updated prompt makes the decision process more explicit: first extract the actionable target from the human-facing request, then use direct repo mentions or mapped repo context, then match to inventory routing hints, package names, service names, and top-level paths. It also clarifies that provider names can be data sources rather than implementation targets. A request to summarize what happened in Linear should not automatically route to the Linear integration repo.\n\nSwitching models without reducing ambiguity only moves risk around. The smaller model needed a narrower job, not encouragement.\n\nRepo inference is a bounded classification step. The model is not designing an architecture or writing a migration. It is choosing one repo from a finite inventory, with a small amount of structured context and a constrained output.\n\nThat makes it a good candidate for a smaller model, but only if the step is validated directly. We ran a focused harness over repo-inference fixtures and iterated on the prompt based on failures. The final harness iteration passed all 9 fixtures twice in a row: `9/9 + 9/9`, meeting the requested two-run 100% accuracy bar.\n\nThat does not prove the model will never choose the wrong repo. 
It does mean the decision boundary was explicit enough to pass the cases we cared about before rollout. For an orchestrator step like this, the bar is practical: the task is narrow, the output is observable, and failures are cheap enough to catch in review or telemetry.\n\nWe compared a corrected 22-hour pre/post window around the cutover. The exact cutover minute varies depending on whether you use the cost analysis timestamp, the PR merge time, or the first observed nano production call, so the useful claim is the before/after economics, not the minute-by-minute boundary.\n\n| Metric | Before | After | Change |\n|---|---|---|---|\n| Calls | 14,891 | 15,709 | +5% |\n| Total cost | $639.17 | $65.06 | −89.8% |\n| Cost per call | $0.0429 | $0.00414 | −90.4% |\n| Relative cost per call | 1× | ~0.096× | ~10.4× cheaper |\n| Estimated savings over post window | — | ~$574 | — |\n| Annualized estimate | — | ~$229k/year | If traffic mix and volume hold |\n\nThe corrected nano rate card used for the analysis was `$0.20 / $0.02 / $1.25 per 1M tokens` for input, cached input, and output.\n\nThere is one useful nuance in the token mix. Post-cutover output tokens were similar to pre-cutover output tokens: about `1.84M` after versus `1.98M` before. After the switch, output tokens dominated the remaining cost. That means the next round of savings would likely come less from trimming input and more from keeping the output contract tight.\n\nLatency improved, but the claim is narrower than the cost claim. Direct repo-inference LLM calls were about 10.1% faster at p50 and 7.1% faster on average. p95 was effectively flat, and p99 was worse in the sample because of outliers. We should not describe this as a full routing-phase or end-to-end tail-latency improvement. 
It was a modest latency win for the direct LLM call and a much larger cost win.\n\nThe useful rule is: use the smallest model that reliably solves a bounded, validated orchestration step.\n\nFor repo inference, that meant five things:\n\nThis is a pattern we expect to reuse. Agent systems run many small orchestration steps: classify, route, select, summarize, decide whether to fetch more context. Some need stronger models. Many do not. The work is separating those cases instead of treating every inference call like it needs the same tool.\n\nIn this case, the repo-inference task was narrow enough, observable enough, and validated enough to move down the model stack. The result was roughly 90% lower cost per call, with call volume slightly up and quality checks passing before rollout.\n\nThat is the kind of optimization that compounds when there are thousands of small orchestration calls per day. The model got smaller; the boundary around it got sharper. That was the important part.", "url": "https://wpnews.pro/news/90-cheaper-repo-inference-with-gpt-5-4-nano", "canonical_source": "https://charlielabs.ai/blog/90-percent-cheaper-repo-inference-with-gpt-54-nano/", "published_at": "2026-05-27 17:30:22+00:00", "updated_at": "2026-05-27 17:46:13.333940+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "ai-products", "ai-infrastructure"], "entities": ["Charlie", "GPT-5.4 nano", "Linear", "Slack", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/90-cheaper-repo-inference-with-gpt-5-4-nano", "markdown": "https://wpnews.pro/news/90-cheaper-repo-inference-with-gpt-5-4-nano.md", "text": "https://wpnews.pro/news/90-cheaper-repo-inference-with-gpt-5-4-nano.txt", "jsonld": "https://wpnews.pro/news/90-cheaper-repo-inference-with-gpt-5-4-nano.jsonld"}}