Solving the GPU Pinning Saga and Gemma's Meta-Commentary

wpnews.pro

cd /news/large-language-models/solving-the-gpu-pinning-saga-and-gem… · home › topics › large-language-models › article

[ARTICLE · art-47810] src=dev.to ↗ pub=2026-07-04T10:32Z topic=large-language-models verified=true sentiment=· neutral

Solving the GPU Pinning Saga and Gemma's Meta-Commentary

Glad Labs fixed a GPU pinning issue where LiteLLM 1.89.2's global api_base override prevented per-model routing, causing vision tasks to cold-load onto the wrong GPU. The team also hardened content guards against Gemma's meta-commentary leaking into titles and fixed a silent backup failure caused by a shell scripting trap. These fixes, along with a switch to GPU UUIDs and disabling Ollama's Vulkan backend, stabilized the system.

read2 min views1 publishedJul 4, 2026

What we shipped on 2026-07-03 We spent today fighting a ghost in our GPU orchestration, starting with fix(llm): stop setting litellm.api_base global

(PR #2082). We had implemented per-model api_base overrides to route vision tasks to a dedicated rail, but requests were still hitting the default port and cold- qwen3-vl onto our RTX 5090. The culprit was LiteLLM 1.89.2; its internal logic allowed the module global litellm.api_base

to win over per-call kwargs. We had to strip the global assignment entirely to let the overrides actually function.

This was the final piece of a frustrating puzzle involving our second Ollama instance pinned to GPU 1 (PR #2075). The goal was simple: prevent the gemma writer from evicting qwen3-vl and causing qa.vision

timeouts. But the path to "pinned" was an obstacle course. First, we found that scheduled-task processes weren't inheriting HKCU user environment variables, meaning our models directory was 404ing (PR #2076). Then, numeric indices for CUDA_VISIBLE_DEVICES

proved unreliable on Windows; we had to move to GPU UUIDs to ensure the model landed on the 3090 (PR #2077). Even then, Ollama 0.31's default Vulkan backend was ignoring CUDA pins and grabbing the 5090 anyway. We finally locked it down by setting OLLAMA_VULKAN=false

(PR #2078). On the content side, we had to harden our guards against Gemma's "meta-commentary" dialect (PR #2084). We caught several June rows where writer planning notes--things like "Focuses on specific metrics (TPS)..."

--were leaking into actual titles. To stop this, we implemented _META_LEADING_VERB_PREP_RE

to catch elided-subject narration and added a topic-sanity gate at tap ingest using evaluate_topic_sanity

(PR #2081). This ensures that "dots-only" headlines or distiller sentinels never even enter the topic_pool

We also caught a silent failure in our offsite backups (PR #2079). An alert had fired reporting rc=0

despite a genuine restic failure. The bug was a classic shell scripting trap: we were capturing local rc=$?

after an if

statement that lacked an else

, meaning the return code was being reset to 0 before we could read it.

Between these fixes and upgrading our publishing funnel to cohort-based stats in Grafana (PR #2069), the system is significantly quieter. We now have a dedicated vision rail that actually stays put, and the pipeline is finally blind to the LLM's internal planning notes.

Auto-compiled by Poindexter from today's commits and PRs. See the work: github.com/Glad-Labs/poindexter.

source & further reading

dev.to — original article The True Classification of AI OpenClaw: 210K Stars in 4 Months — Local-First AI Agent Deep Dive My AI memory benchmark said 98.3%. The number was true — and worthless.

~/api · this article 200

$curl api.wpnews.pro/v1/news/solving-the-gpu-pinning-…

Read original on dev.to → dev.to/glad_labs/solving-the-gpu-pinning-saga-an…

mentioned entities

Glad Labs

LiteLLM

Ollama

Gemma

RTX 5090

RTX 3090

Poindexter

Grafana

metadata

slugsolving-the-gpu-pinning-saga-and-gemma-s-meta-commentary

topic#large-language-models

secondary3 topics

sentimentneutral

canonicaldev.to

navigation

← prevMake Any Website AI-Readable: Ge…

next →Prompt Caching in Practice: The …

── more in #large-language-models 4 stories · sorted by recency

dev.to · 4 Jul · #large-language-models

Scaling LLMs: Why Deterministic Hashing Isn't Enough

dev.to · 4 Jul · #large-language-models

Make Any Website AI-Readable: Generating llms.txt Files with Python

dev.to · 4 Jul · #large-language-models

Picking an Agent Framework in 2026: An Honest Verdict on Six of Them

dev.to · 4 Jul · #large-language-models

Google ADK 2.0 Is Stable — Why That Makes the OpenAI Split Matter More

── more on @glad labs 3 stories trending now

wpnews · 27 May · #artificial-intelligence

How I Run Two Claude Accounts as One

wpnews · 30 May · #ai-safety

Nightcord Security Analysis Report - Threat Investigation

wpnews · 28 May · #ai-startups

The Niche SaaS Opportunity Map 2026: Highly Demanded Subscribed Categories Beyond Mainstream

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required