{"slug": "solving-the-gpu-pinning-saga-and-gemma-s-meta-commentary", "title": "Solving the GPU Pinning Saga and Gemma's Meta-Commentary", "summary": "Glad Labs fixed a GPU pinning issue where LiteLLM 1.89.2's global api_base override prevented per-model routing, causing vision tasks to cold-load onto the wrong GPU. The team also hardened content guards against Gemma's meta-commentary leaking into titles and fixed a silent backup failure caused by a shell scripting trap. These fixes, along with a switch to GPU UUIDs and disabling Ollama's Vulkan backend, stabilized the system.", "body_md": "*What we shipped on 2026-07-03*\n\nWe spent today fighting a ghost in our GPU orchestration, starting with `fix(llm): stop setting litellm.api_base global` (PR #2082). We had implemented per-model `api_base` overrides to route vision tasks to a dedicated rail, but requests were still hitting the default port and cold-loading qwen3-vl onto our RTX 5090. The culprit was LiteLLM 1.89.2; its internal logic allowed the module global `litellm.api_base` to win over per-call kwargs. We had to strip the global assignment entirely to let the overrides actually function.\n\nThis was the final piece of a frustrating puzzle involving our second Ollama instance pinned to GPU 1 (PR #2075). The goal was simple: prevent the gemma writer from evicting qwen3-vl and causing `qa.vision` timeouts. But the path to \"pinned\" was an obstacle course. First, we found that scheduled-task processes weren't inheriting HKCU user environment variables, meaning our models directory was 404ing (PR #2076). Then, numeric indices for `CUDA_VISIBLE_DEVICES` proved unreliable on Windows; we had to move to GPU UUIDs to ensure the model landed on the 3090 (PR #2077). Even then, Ollama 0.31's default Vulkan backend was ignoring CUDA pins and grabbing the 5090 anyway. 
We finally locked it down by setting `OLLAMA_VULKAN=false` (PR #2078).\n\nOn the content side, we had to harden our guards against Gemma's \"meta-commentary\" dialect (PR #2084). We caught several June rows where writer planning notes, such as `\"Focuses on specific metrics (TPS)...\"`, were leaking into actual titles. To stop this, we implemented `_META_LEADING_VERB_PREP_RE` to catch elided-subject narration and added a topic-sanity gate at tap ingest using `evaluate_topic_sanity` (PR #2081). This ensures that \"dots-only\" headlines or distiller sentinels never even enter the `topic_pool`.\n\nWe also caught a silent failure in our offsite backups (PR #2079). An alert had fired reporting `rc=0` despite a genuine restic failure. The bug was a classic shell scripting trap: we were capturing `local rc=$?` after an `if` statement that lacked an `else`, meaning the return code was being reset to 0 before we could read it.\n\nBetween these fixes and upgrading our publishing funnel to cohort-based stats in Grafana (PR #2069), the system is significantly quieter. We now have a dedicated vision rail that actually stays put, and the pipeline is finally blind to the LLM's internal planning notes.\n\n*Auto-compiled by Poindexter from today's commits and PRs. 
See the work: github.com/Glad-Labs/poindexter.*", "url": "https://wpnews.pro/news/solving-the-gpu-pinning-saga-and-gemma-s-meta-commentary", "canonical_source": "https://dev.to/glad_labs/solving-the-gpu-pinning-saga-and-gemmas-meta-commentary-32o1", "published_at": "2026-07-04 10:32:26+00:00", "updated_at": "2026-07-04 10:48:54.578188+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools", "ai-products"], "entities": ["Glad Labs", "LiteLLM", "Ollama", "Gemma", "RTX 5090", "RTX 3090", "Poindexter", "Grafana"], "alternates": {"html": "https://wpnews.pro/news/solving-the-gpu-pinning-saga-and-gemma-s-meta-commentary", "markdown": "https://wpnews.pro/news/solving-the-gpu-pinning-saga-and-gemma-s-meta-commentary.md", "text": "https://wpnews.pro/news/solving-the-gpu-pinning-saga-and-gemma-s-meta-commentary.txt", "jsonld": "https://wpnews.pro/news/solving-the-gpu-pinning-saga-and-gemma-s-meta-commentary.jsonld"}}