{"slug": "i-build-the-infrastructure-that-serves-ai-models-gemma-4-just-made-my-job", "title": "I Build the Infrastructure That Serves AI Models. Gemma 4 Just Made My Job Existential.", "summary": "The article describes the author's work building NeuroScale, a self-service AI inference platform that manages complex Kubernetes infrastructure for deploying models like Gemma 4. The author explains that most AI serving costs come from infrastructure overhead (the \"serving tax\") rather than the models themselves, with a single deployment requiring approximately five pods and 1.2 GB of RAM. The release of Gemma 4, which can run efficiently on devices as small as a Raspberry Pi, raises an existential question about whether such heavy infrastructure platforms will remain necessary when models can simply run locally without Kubernetes or platform engineering.", "body_md": "This is a submission for the Gemma 4 Challenge: Write About Gemma 4\n\nI build a self-service AI inference platform called NeuroScale. Developers fill in a Backstage form, the platform generates KServe manifests via PR, ArgoCD deploys them, and a production inference endpoint goes live. 108 commits, 21 smoke tests, 6 milestone postmortems.\n\nWhen I wrote Gemma 4's InferenceService manifest, the resource requests looked identical to those of a dense model of the same size. The numbers said otherwise: 48 GB of VRAM holding a model that activates 3.8B of 25.2B parameters per token, GPU compute utilization at 40% while p95 latency doubled, and OpenCost billing two teams the same rate for workloads with a 2× difference in per-token cost. 
Every broken assumption traces back to one architectural decision: Gemma 4 doesn't replace its dense FFN with experts — it runs both.\n\n## The Architecture That Breaks the Assumptions\n\nYou need to see why Gemma 4's MoE breaks things that Mixtral and DeepSeek don't.\n\n**Standard MoE (Mixtral, DeepSeek, Qwen):** the dense FFN is replaced by sparse experts. Router picks a subset per token. Dense path is gone.\n\n**Gemma 4 26B keeps three pathways running in parallel:**\n\nGemma 4 26B MoE block (actual architecture):\n\n```\n                 Input\n                   │\n                   ▼\n              [Attention]\n                   │\n       ┌───────────┼───────────┐\n       ▼           ▼           ▼\n  [Dense FFN] [Shared Exp]  [Router]\n  (always on)  (always on)     │\n       │           │    ┌──────┼──────┐\n       │           │    ▼      ▼      ▼\n       │           │ [Exp 1] [Exp 2]...[Exp 128]\n       │           │    │      │       │\n       │           │    └──────┴───────┘\n       │           │       (8 fire)\n       └───────────┴───────────┘\n                   │\n           [Sum all three]\n                   │\n                   ▼\n                Output\n```\n\n128 routed experts, 8 active per token, one shared expert, plus a dense FFN — all summed together. Total parameters: ~25.2B. Active per token: ~3.8B. Active-to-total ratio: 0.15.\n\nFor comparison: Mixtral 8x7B activates 13B of 47B (ratio: 0.28). DeepSeek-V2 activates 21B of 236B (ratio: 0.09, but across a far larger total). Gemma 4's ratio is the lowest among sub-30B MoE models because the always-on dense FFN and shared expert consume parameter budget without contributing to the \"active\" count in the way you'd expect. The dense FFN is a structural safety net — if the router picks wrong experts, the always-on pathways carry the signal. 
This is why Gemma 4 trains more stably and degrades gracefully under quantization.\n\nFor platform engineers, the consequence: the gap between \"parameters in memory\" and \"parameters doing compute\" is wider than in any previous MoE at this scale. That gap breaks three things in Kubernetes.\n\n## Consequence 1: The Numbers That Broke My Cost Model\n\nOn NeuroScale, every InferenceService requires a cost-center label. Our Kyverno admission policy blocks any deployment without one, and OpenCost reads these labels to attribute GPU-hours to teams.\n\nHere's what the numbers look like when you deploy both on equivalent hardware via vLLM:\n\n| Metric | 26B MoE (A4B) | 31B Dense |\n|---|---|---|\n| VRAM consumed (BF16) | 48.1 GB | 62 GB |\n| Active params per token | 3.8B | 31B |\n| Decode throughput (single user, H100) | 177.1 tok/s | 40.3 tok/s |\n| Decode throughput (concurrency 16, H100) | not reported | 375 tok/s |\n| TTFT (concurrency 1, H100) | not reported | 67.7 ms |\n| Per-token cost (cloud API) | $0.06 / 1M tokens | $0.12 / 1M tokens |\n| GPU allocated | 1× A100 | 1× A100 |\n\n*(H100 throughput: JarvisLabs SPEED-Bench on vLLM 0.8.5; VRAM: controlled A6000 BF16 benchmark; API pricing: Google Cloud Vertex AI. Reproduce the throughput test: `vllm serve google/gemma-4-26B-A4B-it --dtype bfloat16 --max-model-len 8192`, then hit `/v1/completions` with a 512-token prompt.)*\n\nOpenCost attributes cost based on resource allocation, not utilization. Both models allocate 1× A100. Both get billed identically per GPU-hour. But the MoE delivers 4.4× the single-user throughput at half the per-token cost.\n\nIn a multi-tenant cluster where Team A runs the 26B MoE and Team B runs the 31B Dense, OpenCost bills them the same rate. Team A gets 4.4× the throughput on identical hardware. The fair billing unit for MoE isn't GPU-hours; it's **GPU-hours weighted by active parameter ratio**. 
No Kubernetes cost attribution tool I've found supports this.\n\nThe math: Gemma 4's total-to-active ratio is 6.6:1 (25.2B / 3.8B). Mixtral's is 3.6:1 (47B / 13B). The cost attribution error for Gemma 4 is almost double Mixtral's — a direct consequence of the 3-pathway design keeping more parameters loaded but inactive.\n\n## Consequence 2: GPU Utilization Lies to the Autoscaler\n\nKServe supports HPA-based autoscaling. The standard setup:\n\n```\nmetadata:\n  annotations:\n    serving.kserve.io/autoscalerClass: \"hpa\"\n    serving.kserve.io/targetUtilizationPercentage: \"70\"\n    serving.kserve.io/metric: \"cpu\"\n```\n\nFor dense models, GPU compute utilization roughly tracks inference load. More requests → more compute → higher utilization. Scaling at 70% is a reasonable proxy.\n\nFor the MoE, this proxy breaks. The causal chain:\n\nLLM decode is memory-bandwidth-bound, not compute-bound. Each output token reads model weights from VRAM. Google Cloud's analysis found a cluster pushing 1M tokens/second showed only 4.4% GPU FLOPS utilization — tensor cores finish in microseconds, then wait for data.\n\nGemma 4 amplifies this: 25.2B of weights sit in VRAM, but only 3.8B do arithmetic per token. The memory bus shuttles expert weights that may not even fire.\n\nResult: memory bandwidth saturates — throughput degrades, latency climbs — while GPU compute utilization reads 30–40%.\n\nHPA sees 40% against a 70% target. 
Decision: *don't scale.*\n\nThe correct autoscaling metric for MoE isn't utilization; it's request latency:\n\n```\napiVersion: autoscaling/v2\nkind: HorizontalPodAutoscaler\nmetadata:\n  name: gemma4-moe-hpa\nspec:\n  scaleTargetRef:\n    apiVersion: serving.kserve.io/v1beta1\n    kind: InferenceService\n    name: gemma4-moe\n  minReplicas: 1\n  maxReplicas: 4\n  metrics:\n    - type: Pods\n      pods:\n        metric:\n          # Custom metric: this name must be exposed by your Prometheus adapter rules\n          name: vllm_request_p95_latency_seconds\n        target:\n          type: AverageValue\n          averageValue: \"2.0\"\n```\n\nThis requires a custom metrics pipeline (Prometheus adapter → HPA custom metrics API). That is significantly more infrastructure than built-in CPU scaling, but without it, the default autoscaling path that works for every dense model silently fails for MoE.\n\n## Consequence 3: Expert Parallelism Crashes Under Data Parallelism\n\nvLLM supports expert parallelism for MoE models, distributing individual experts across GPUs:\n\n```\nvllm serve google/gemma-4-26B-A4B-it \\\n    --tensor-parallel-size 2 \\\n    --enable-expert-parallel \\\n    --max-model-len 32768\n```\n\nDrop `--enable-expert-parallel` and add `--data-parallel-size 2` to scale horizontally, and the deployment looks healthy: weights load, CUDA graphs capture, API servers start. Then it crashes on the first inference request.\n\nReproduce it yourself (4 GPUs required, since TP 2 × DP 2 = 4):\n\n```\nvllm serve google/gemma-4-26B-A4B-it \\\n    --tensor-parallel-size 2 \\\n    --data-parallel-size 2 \\\n    --max-model-len 32768\n# Send any request → crash\n```\n\nThe exact error from vLLM issue #38999:\n\n```\nFile \"vllm/distributed/device_communicators/cuda_communicator.py\"\n  in _all_gather_single\nAssertionError: 1 != 36\n```\n\nRoot cause: The MoE fused expert layer assumes that multiple GPUs means expert parallelism, triggering inter-GPU `all_gather` operations for expert routing. Under data parallelism, each replica should run an independent full copy, with no cross-GPU expert communication. 
The DP workers initialize with EP-style communication, causing a tensor shape mismatch.\n\nDense models (31B, E4B) work fine with `--data-parallel-size` greater than 1. The crash is specific to MoE + DP.\n\nThe vLLM collaborator confirmed: you must pass `--enable-expert-parallel` when deploying MoE in DP mode. Without it, DP mode crashes. With it, you get expert parallelism semantics (experts distributed across GPUs) instead of data parallelism semantics (full model copies per replica).\n\nFor a platform that lets developers self-serve model deployments, this means: **you cannot expose \"number of GPUs\" as a user-facing knob for MoE models.** The parallelism strategy determines whether the deployment runs or crashes, and the correct choice depends on the model architecture, not the model size.\n\n## The Outage That Taught Me to Distrust Defaults\n\nI wouldn't have found any of this without getting burned first.\n\nKServe's default config assumes Istio. We ran Kourier (~100 MB vs Istio's ~1 GB) because our k3d dev cluster didn't have the RAM. Result: all InferenceService creation was blocked cluster-wide. Three hours of downtime. The fix:\n\n```\ningress:\n  disableIstioVirtualHost: true\n```\n\nOne boolean, not in the getting-started docs. Found it in the controller source.\n\nThe lesson applies directly: when the YAML looks right and everything deploys green, check what the YAML doesn't say. The MoE manifest is valid. The pod starts. The cost model, autoscaler, and parallelism strategy are all silently wrong.\n\n## The CI That Lied for Two Weeks\n\nFor two weeks, Kyverno policy checks showed green on every PR. Policies weren't enforced at all.\n\nThe bug: `kyverno-cli apply` exits with code 0 even when policies are violated. 
Violations print to stdout; the exit code, which is what CI checks, says \"success.\" The fix:\n\n```\n# kyverno-cli exited 0 even on violations, so inspect the output instead of the exit code.\nOUTPUT=$(kyverno-cli apply ./policies/ --resource \"$manifest\" 2>&1)\n# Caution: a bare 'fail' also matches a summary such as 'fail: 0'; tighten the pattern if your CLI version prints one.\nif echo \"$OUTPUT\" | grep -qi \"fail\\|violation\\|denied\"; then\n    echo \"Policy violation detected\"\n    exit 1\nfi\n```\n\nAny developer could have deployed an InferenceService without resource limits, cost labels, or ownership metadata. CI was green. Governance was broken.\n\nSame pattern as the MoE manifest: the configuration is valid, the deployment succeeds, and the assumptions underneath are wrong.\n\n## What Platform Engineers Should Do\n\nAfter measuring all of this, here's the playbook:\n\n**1. Tag InferenceServices with architecture metadata.**\n\n```\nlabels:\n  model-architecture: \"moe\"\n  active-param-ratio: \"0.15\"    # 3.8B / 25.2B\n  total-params: \"25.2B\"\n  cost-center: \"cc-ml-inference\"\n```\n\nYour cost attribution, alerting, and capacity planning all need to know that this 48 GB model computes like a 4B model. Gemma 4's 3-pathway design (dense FFN + shared expert + routed experts) makes the total-to-active ratio higher than other MoE architectures at this scale; the metadata must capture this.\n\n**2. Autoscale on latency, not utilization.**\n\nGPU compute utilization is a lagging, misleading indicator for MoE. Memory bandwidth saturation hits first. Use `vllm_request_p95_latency_seconds` or `vllm_tokens_per_second` as your HPA metric.\n\n**3. Don't expose parallelism strategy as a user knob.**\n\nYour platform should detect MoE vs Dense from the model config and set `--enable-expert-parallel` accordingly. Self-serve GPU count selection will produce runtime crashes for MoE models if the parallelism strategy is wrong: the model loads, CUDA graphs capture, then it crashes on the first request.\n\n**4. Budget VRAM for total params, compute for active params.**\n\nResource requests = total parameters (scheduling). Capacity planning = active parameters (throughput). 
On the H100 benchmark above, Gemma 4 MoE decodes at 177 tok/s for a single user; 31B Dense manages 40 tok/s. Same GPU request, 4.4× throughput difference. Your capacity model must account for this or you'll over-provision MoE by 4×.\n\n## The Deeper Pattern: Architecture-Aware Scheduling\n\nThese three failures (cost, scaling, parallelism) point to a structural gap in Kubernetes: **the scheduler is architecture-blind.**\n\n`resources.requests.nvidia.com/gpu: 1` tells the scheduler to find a node with a free GPU. It says nothing about whether that GPU will be memory-bandwidth-bound or compute-bound, whether the model activates 15% or 100% of its parameters, or whether horizontal scaling requires expert parallelism flags. The scheduler treats a 26B MoE and a 31B Dense as identical workloads because they request the same resource.\n\nThis was fine when every model was dense. With Gemma 4's 3-pathway MoE entering production, the abstraction leaks. The fix isn't just labels and custom metrics; it's treating model architecture as a first-class scheduling dimension, the same way we treat GPU type, memory, and topology today.\n\nEvery claim in this post (cost attribution errors, autoscaler blindness, DP crashes) is reproducible. The vLLM issue is public. The benchmarks are from published controlled tests (JarvisLabs, Google Cloud). The KServe outage and Kyverno bug happened on a real platform with 108 commits of history.\n\nThe model works. The YAML looks right. The CI is green.\n\n**The governance is broken. And now I can prove it.**\n\nNeuroScale is open source: [github.com/sodiq-code/neuroscale-platform](https://github.com/sodiq-code/neuroscale-platform). 
The `BEFORE.md` and `AFTER.md` files in each milestone directory tell the full story of what went wrong and what I learned.", "url": "https://wpnews.pro/news/i-build-the-infrastructure-that-serves-ai-models-gemma-4-just-made-my-job", "canonical_source": "https://dev.to/sodiqjimoh/i-build-the-infrastructure-that-serves-ai-models-gemma-4-just-made-my-job-existential-4cek", "published_at": "2026-05-23 20:21:24+00:00", "updated_at": "2026-05-23 20:33:01.309625+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "cloud-computing", "developer-tools"], "entities": ["Google", "Gemma 4", "NeuroScale", "KServe", "Knative", "ArgoCD", "Backstage", "Kyverno"], "alternates": {"html": "https://wpnews.pro/news/i-build-the-infrastructure-that-serves-ai-models-gemma-4-just-made-my-job", "markdown": "https://wpnews.pro/news/i-build-the-infrastructure-that-serves-ai-models-gemma-4-just-made-my-job.md", "text": "https://wpnews.pro/news/i-build-the-infrastructure-that-serves-ai-models-gemma-4-just-made-my-job.txt", "jsonld": "https://wpnews.pro/news/i-build-the-infrastructure-that-serves-ai-models-gemma-4-just-made-my-job.jsonld"}}