Solving the GPU Pinning Saga and Gemma's Meta-Commentary

Glad Labs fixed a GPU pinning issue where LiteLLM 1.89.2's global api_base override prevented per-model routing, causing vision tasks to cold-load onto the wrong GPU. The team also hardened content guards against Gemma's meta-commentary leaking into titles and fixed a silent backup failure caused by a shell scripting trap. These fixes, along with a switch to GPU UUIDs and disabling Ollama's Vulkan backend, stabilized the system.

What we shipped on 2026-07-03

We spent today fighting a ghost in our GPU orchestration, starting with fix(llm): stop setting litellm.api_base global (PR 2082). We had implemented per-model api_base overrides to route vision tasks to a dedicated rail, but requests were still hitting the default port and cold-loading qwen3-vl onto our RTX 5090. The culprit was LiteLLM 1.89.2; its internal logic allowed the module global litellm.api_base to win over per-call kwargs. We had to strip the global assignment entirely to let the overrides actually function.

This was the final piece of a frustrating puzzle involving our second Ollama instance pinned to GPU 1 (PR 2075). The goal was simple: prevent the gemma writer from evicting qwen3-vl and causing qa.vision timeouts. But the path to "pinned" was an obstacle course. First, we found that scheduled-task processes weren't inheriting HKCU user environment variables, meaning our models directory was 404ing (PR 2076). Then, numeric indices for CUDA_VISIBLE_DEVICES proved unreliable on Windows; we had to move to GPU UUIDs to ensure the model landed on the 3090 (PR 2077). Even then, Ollama 0.31's default Vulkan backend was ignoring CUDA pins and grabbing the 5090 anyway. We finally locked it down by setting OLLAMA_VULKAN=false (PR 2078).

On the content side, we had to harden our guards against Gemma's "meta-commentary" dialect (PR 2084). We caught several June rows where writer planning notes, such as "Focuses on specific metrics TPS ...", were leaking into actual titles. To stop this, we implemented META_LEADING_VERB_PREP_RE to catch elided-subject narration and added a topic-sanity gate at tap ingest using evaluate_topic_sanity (PR 2081). This ensures that "dots-only" headlines or distiller sentinels never even enter the topic pool.

We also caught a silent failure in our offsite backups (PR 2079). An alert had fired reporting rc=0 despite a genuine restic failure.
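A stale rc=0 of this kind is easy to reproduce outside the backup script. The sketch below is illustrative only and is not the script from PR 2079: it assumes a POSIX sh is on the PATH, uses false as a stand-in for the failing restic call, and drops the local keyword because the snippets run outside a function. One script reads $? after an if with no else; the other saves the status before anything else can overwrite it.

```python
import subprocess

# Hypothetical stand-in scripts: "false" plays the role of the failing backup command.
BUGGY = """
if false; then
  echo "backup ok"
fi
rc=$?   # status of the whole if statement, not of the failed command
echo "$rc"
"""

FIXED = """
rc=0
false || rc=$?   # capture the command's own status immediately
if [ "$rc" -eq 0 ]; then
  echo "backup ok"
fi
echo "$rc"
"""


def reported_rc(script: str) -> int:
    """Run a script under sh and return the rc value it prints on its last line."""
    result = subprocess.run(
        ["sh", "-c", script],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    print("buggy script reports rc =", reported_rc(BUGGY))  # 0, failure hidden
    print("fixed script reports rc =", reported_rc(FIXED))  # 1, failure visible
```

The shell defines the exit status of an if statement as the status of the last command it ran, or zero when no condition tested true, which is why the first script prints 0 even though the command failed.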
The bug was a classic shell scripting trap: we were capturing local rc=$? after an if statement that lacked an else. When the condition fails and there is no else branch, the if itself exits 0, so the real return code was already gone before we could read it.

Between these fixes and upgrading our publishing funnel to cohort-based stats in Grafana (PR 2069), the system is significantly quieter. We now have a dedicated vision rail that actually stays put, and the pipeline is finally blind to the LLM's internal planning notes.

Auto-compiled by Poindexter from today's commits and PRs. See the work: github.com/Glad-Labs/poindexter.
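For reference, the three pinning fixes (an explicit models directory, a GPU UUID instead of a numeric index, and Vulkan disabled) amount to giving the second Ollama instance a fully explicit environment. The following is a minimal sketch under stated assumptions, not the code from the PRs: build_ollama_env and launch_ollama are hypothetical helpers, and the UUID, port, and models path in the usage note are placeholders. OLLAMA_HOST and OLLAMA_MODELS are standard Ollama environment variables, and OLLAMA_VULKAN=false is the setting named above.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def build_ollama_env(
    gpu_uuid: str,
    host: str,
    models_dir: str,
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return an explicit environment for a GPU-pinned Ollama instance.

    Hypothetical helper: every value is passed in rather than inherited,
    so a scheduled task that lacks the user's environment still works.
    """
    # Reject numeric indices such as "1"; this sketch only accepts the
    # "GPU-..." identifiers printed by `nvidia-smi -L`.
    if not gpu_uuid.startswith("GPU-"):
        raise ValueError(f"expected a GPU UUID, got {gpu_uuid!r}")

    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "CUDA_VISIBLE_DEVICES": gpu_uuid,  # pin by UUID, not by index
            "OLLAMA_VULKAN": "false",          # keep the Vulkan backend off
            "OLLAMA_HOST": host,               # separate port for the vision rail
            "OLLAMA_MODELS": models_dir,       # do not rely on inherited user vars
        }
    )
    return env


def launch_ollama(env: Mapping[str, str]) -> subprocess.Popen:
    """Start `ollama serve` with the given environment (requires ollama on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=dict(env))
```

A call such as build_ollama_env("GPU-00000000-0000-0000-0000-000000000000", "127.0.0.1:11435", r"D:\ollama\models") returns the environment to hand to launch_ollama, while passing "1" raises ValueError so a numeric index cannot slip back in.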