How to Run Reliable Local LLM Agents on an RTX 3090: A Benchmark (5 Models, Priced in Watts)

A developer benchmarked five local LLMs as coding agents on an RTX 3090 and found that the orchestrator, not the model, determines success. GLM-4.5-Air scored 0% with opencode but 93% with a LangGraph agent, while Qwen3-Coder 30B-A3B reached 100% tool adherence under LangGraph. The benchmark also measured electricity cost per correct task.

I gave GLM-4.5-Air (106B, open weights) 12 coding tasks through [opencode](https://opencode.ai) on my RTX 3090. It scored 0%: it never edited a single file. Same model, same GPU, same tasks, but driven by a ~150-line LangGraph agent instead: 93%.

The model was never the problem. The orchestrator was.

Here's the benchmark, including the part nobody else measures: the electricity cost per correct task.

| Model | tok/s | opencode adh. | LangGraph adh. | LangGraph coding | LangGraph general |
|---|---|---|---|---|---|
| Qwen3-Coder 30B-A3B | 130 | 92% | 100% | 100% | 100% |
| GLM-4.5-Air 106B | 5.7 | 0% | 100% | 89% | 100% |
| Devstral Small 24B | 49 | 8% | 53% | 8% | 40% |
| Seed-OSS 36B | 9.5 | 0% | 7% | 0% | 20% |
| DeepSeek-R1-Distill 32B | 6.7 | 0% | 0% | 0% | 0% |

Tool adherence ("adh." in the table) is the percentage of tasks where the model actually called a tool instead of just printing code in chat. It was the master variable. GLM's headline "93%" is its blended score across all 17 tasks: 89% coding + 100% general.

Bonus: 128 GB of RAM let me run the 106B GLM (23 GB in VRAM + 27 GB spilled to RAM). It works, at 5.7 tok/s. Great for fire-and-forget batch jobs, not interactive coding.

Pick a tool-use-tuned model (Qwen3-Coder 30B-A3B is the all-weather winner) → use native tool-calling, not an OpenAI-compat path → keep the harness lean → use RAM for reach, not speed → measure correctness per kWh.

📖 Full write-up with methodology, charts, and the deeper "why" → https://medium.com/@arsen.apostolov/local-llm-agents-on-an-rtx-3090-i-benchmarked-5-models-2-frameworks-and-the-orchestrator-f5fd600ca221

⭐ Every number was priced in watts by homelab-monitor, my open-source tool that turns your GPU's power draw into per-task cost.
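To make "tool adherence" and "correctness per kWh" concrete, here is a minimal Python sketch of how those two numbers could be computed from per-task logs. It is not homelab-monitor's code and not the benchmark's harness: the `TaskRun` record, its field names, the sample values, and the electricity price are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One benchmark task. Hypothetical record, for illustration only."""
    called_tool: bool   # did the model invoke a tool at least once?
    correct: bool       # did the task pass its check?
    avg_watts: float    # mean GPU power draw during the task
    seconds: float      # wall-clock duration of the task


def tool_adherence(runs):
    """Share of tasks where the model actually called a tool."""
    return sum(r.called_tool for r in runs) / len(runs)


def energy_kwh(runs):
    """Total energy: watts * seconds = joules, and 3.6e6 joules = 1 kWh."""
    return sum(r.avg_watts * r.seconds for r in runs) / 3.6e6


def correct_per_kwh(runs):
    """Correct tasks delivered per kWh of GPU energy."""
    return sum(r.correct for r in runs) / energy_kwh(runs)


def cost_per_correct_task(runs, price_per_kwh):
    """Electricity cost divided by the number of correct tasks.

    Returns None when no task was correct, since the cost is undefined.
    """
    n_correct = sum(r.correct for r in runs)
    if n_correct == 0:
        return None
    return energy_kwh(runs) * price_per_kwh / n_correct


# Made-up sample data, NOT measurements from the benchmark.
runs = [
    TaskRun(called_tool=True, correct=True, avg_watts=300.0, seconds=60.0),
    TaskRun(called_tool=True, correct=False, avg_watts=300.0, seconds=120.0),
    TaskRun(called_tool=False, correct=False, avg_watts=300.0, seconds=60.0),
    TaskRun(called_tool=True, correct=True, avg_watts=300.0, seconds=120.0),
]

if __name__ == "__main__":
    print(f"tool adherence: {tool_adherence(runs):.0%}")
    print(f"energy used:    {energy_kwh(runs):.4f} kWh")
    print(f"correct / kWh:  {correct_per_kwh(runs):.1f}")
    # 0.30 per kWh is an assumed price; substitute your own tariff.
    print(f"cost / correct: {cost_per_correct_task(runs, 0.30):.4f}")
```

Dividing by correct tasks rather than by all tasks is the point of the metric: a model that burns power for minutes and never edits a file has no finite cost per correct task, which is why the sketch returns `None` in that case instead of a number.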