# Solving the GPU Pinning Saga and Gemma's Meta-Commentary

> Source: <https://dev.to/glad_labs/solving-the-gpu-pinning-saga-and-gemmas-meta-commentary-32o1>
> Published: 2026-07-04 10:32:26+00:00

*What we shipped on 2026-07-03*

We spent today fighting a ghost in our GPU orchestration, starting with `fix(llm): stop setting litellm.api_base global` (PR #2082). We had implemented per-model `api_base` overrides to route vision tasks to a dedicated rail, but requests were still hitting the default port and cold-loading qwen3-vl onto our RTX 5090. The culprit was LiteLLM 1.89.2: its internal logic allowed the module global `litellm.api_base` to win over per-call kwargs. We had to strip the global assignment entirely to let the overrides actually function.
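A minimal sketch of the per-call routing idea, not our actual code: the routing table, the model tags, and port 11435 are placeholders for illustration (11434 is Ollama's default port). The helper only builds keyword arguments, so the module global never needs to be set.

```python
from typing import Any

# Ollama's default listen address.
DEFAULT_API_BASE = "http://localhost:11434"

# Hypothetical routing table: model name -> base URL of the Ollama
# instance that should serve it. Model tag and port are assumptions.
MODEL_API_BASES = {
    "ollama/qwen3-vl": "http://localhost:11435",
}


def completion_kwargs(model: str, messages: list[dict[str, str]]) -> dict[str, Any]:
    """Build per-call kwargs so routing never depends on a module global."""
    return {
        "model": model,
        "messages": messages,
        # Explicit per-call override; falls back to the default rail.
        "api_base": MODEL_API_BASES.get(model, DEFAULT_API_BASE),
    }
```

The result would be passed along the lines of `litellm.completion(**completion_kwargs(model, messages))`, with no `litellm.api_base = ...` assignment anywhere in the process.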

This was the final piece of a frustrating puzzle involving our second Ollama instance pinned to GPU 1 (PR #2075). The goal was simple: prevent the gemma writer from evicting qwen3-vl and causing `qa.vision` timeouts. But the path to "pinned" was an obstacle course. First, we found that scheduled-task processes weren't inheriting HKCU user environment variables, meaning our models directory was 404ing (PR #2076). Then, numeric indices for `CUDA_VISIBLE_DEVICES` proved unreliable on Windows; we had to move to GPU UUIDs to ensure the model landed on the 3090 (PR #2077). Even then, Ollama 0.31's default Vulkan backend was ignoring CUDA pins and grabbing the 5090 anyway. We finally locked it down by setting `OLLAMA_VULKAN=false` (PR #2078).
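The three fixes boil down to handing the second instance an explicit environment instead of trusting whatever it inherits. A rough sketch, assuming the instance is launched from Python: the function name, the UUID check, and the example values are illustrative, while `OLLAMA_HOST` and `OLLAMA_MODELS` are Ollama's standard variables for the listen address and models directory.

```python
import os
import re

# Assumed shape of an NVIDIA GPU UUID as printed by `nvidia-smi -L`.
GPU_UUID_RE = re.compile(
    r"^GPU-[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$"
)


def pinned_ollama_env(gpu_uuid, host, models_dir, base_env=None):
    """Return an environment dict for an Ollama instance pinned to one GPU."""
    if not GPU_UUID_RE.match(gpu_uuid):
        # Reject numeric indices like "1", which proved unreliable on Windows.
        raise ValueError(f"expected a GPU UUID, got {gpu_uuid!r}")
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "CUDA_VISIBLE_DEVICES": gpu_uuid,  # pin by UUID, not index
        "OLLAMA_VULKAN": "false",          # keep the Vulkan backend off
        "OLLAMA_HOST": host,               # separate port for this rail
        "OLLAMA_MODELS": models_dir,       # explicit, not inherited from HKCU
    })
    return env
```

The dict could then go to something like `subprocess.Popen(["ollama", "serve"], env=env)`, so a scheduled task that lacks the user's environment still finds the models directory.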

On the content side, we had to harden our guards against Gemma's "meta-commentary" dialect (PR #2084). We caught several June rows where writer planning notes, things like `"Focuses on specific metrics (TPS)..."`, were leaking into actual titles. To stop this, we implemented `_META_LEADING_VERB_PREP_RE` to catch elided-subject narration and added a topic-sanity gate at tap ingest using `evaluate_topic_sanity` (PR #2081). This ensures that "dots-only" headlines or distiller sentinels never even enter the `topic_pool`.
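To make the two guards concrete, here is a simplified sketch. The pattern and helper names below are stand-ins written for this post, not the real `_META_LEADING_VERB_PREP_RE` or `evaluate_topic_sanity`, and the verb list is a guess at what elided-subject narration tends to look like.

```python
import re

# Hypothetical stand-in: a title that opens with a third-person verb plus a
# preposition ("Focuses on ...") and no subject reads like a planning note.
META_LEADING_VERB_PREP_SKETCH = re.compile(
    r"^\s*(focuses|centers|touches|builds|draws|leans|relies|expands|elaborates)"
    r"\s+(on|upon|around|from)\b",
    re.IGNORECASE,
)

# Titles made only of dots, ellipsis characters, and whitespace (or nothing).
DOTS_ONLY_RE = re.compile(r"^[\s.\u2026]*$")


def looks_like_meta_commentary(title: str) -> bool:
    """True when the title starts like elided-subject narration."""
    return bool(META_LEADING_VERB_PREP_SKETCH.search(title))


def is_sane_topic(title: str) -> bool:
    """Simplified gate: reject dots-only titles and meta-commentary."""
    return not DOTS_ONLY_RE.match(title) and not looks_like_meta_commentary(title)
```

With these definitions, `"Focuses on specific metrics (TPS)..."` and `"..."` are both rejected, while an ordinary headline passes.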

We also caught a silent failure in our offsite backups (PR #2079). An alert had fired reporting `rc=0` despite a genuine restic failure. The bug was a classic shell scripting trap: we were capturing `local rc=$?` after an `if` statement that lacked an `else`. In bash, an `if` whose condition fails and that has no `else` branch itself exits with status 0, so the return code was reset to 0 before we could read it.
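The backup script itself is shell, but the underlying rule (read the exit status straight from the command, before any other construct can overwrite it) can be sketched in Python. The function name and message format are made up for illustration, and the demo uses a throwaway Python child process as a stand-in for a failing backup command.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run a command and build an alert line from its own exit status."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    # The status comes from the completed process object, so no
    # intervening statement can reset it before we read it.
    rc = result.returncode
    if rc != 0:
        return rc, f"backup failed rc={rc}"
    return rc, "backup ok"


if __name__ == "__main__":
    # Stand-in for a failing backup: a child process that exits with 3.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_and_report(failing))
```

In the shell version the equivalent fix is to capture `$?` immediately after the command, or inside an `else` branch, rather than after the closing `fi`.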

Between these fixes and upgrading our publishing funnel to cohort-based stats in Grafana (PR #2069), the system is significantly quieter. We now have a dedicated vision rail that actually stays put, and the pipeline is finally blind to the LLM's internal planning notes.

*Auto-compiled by Poindexter from today's commits and PRs. See the work: github.com/Glad-Labs/poindexter.*
