# I Build the Infrastructure That Serves AI Models. Gemma 4 Just Made My Job Existential.

> Source: <https://dev.to/sodiqjimoh/i-build-the-infrastructure-that-serves-ai-models-gemma-4-just-made-my-job-existential-4cek>
> Published: 2026-05-23 20:21:24+00:00

This is a submission for the Gemma 4 Challenge: Write About Gemma 4.

I build a self-service AI inference platform called NeuroScale. Developers fill a Backstage form, the platform generates KServe manifests via PR, ArgoCD deploys them, and a production inference endpoint goes live. 108 commits, 21 smoke tests, 6 milestone postmortems.

When I wrote Gemma 4's InferenceService manifest, the resource requests looked identical to a dense model of the same size. The numbers said otherwise: 48 GB of VRAM holding a model that activates 3.8B of 25.2B parameters per token, GPU compute utilization at 40% while p95 latency doubled, and OpenCost billing two teams the same rate for workloads with 2× different per-token costs. Every broken assumption traces back to one architectural decision: Gemma 4 doesn't replace its dense FFN with experts — it runs both.

## The Architecture That Breaks the Assumptions

You need to see why Gemma 4's MoE breaks things that Mixtral and DeepSeek don't.

**Standard MoE (Mixtral, DeepSeek, Qwen):** the dense FFN is replaced by sparse experts. Router picks a subset per token. Dense path is gone.

**Gemma 4 26B keeps three pathways running in parallel:**

Gemma 4 26B MoE block (actual architecture):

```
                 Input
                   │
                   ▼
              [Attention]
                   │
       ┌───────────┼───────────┐
       ▼           ▼           ▼
  [Dense FFN] [Shared Exp]  [Router]
  (always on)  (always on)     │
       │           │    ┌──────┼──────┐
       │           │    ▼      ▼      ▼
       │           │ [Exp 1] [Exp 2]...[Exp 128]
       │           │    │      │       │
       │           │    └──────┴───────┘
       │           │       (8 fire)
       └───────────┴───────────┘
                   │
           [Sum all three]
                   │
                   ▼
                Output
```

128 routed experts, 8 active per token, one shared expert, plus a dense FFN — all summed together. Total parameters: ~25.2B. Active per token: ~3.8B. Active-to-total ratio: 0.15.
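To make the diagram concrete, here is a toy sketch of one such block in plain Python. It is not Gemma 4's implementation: the hidden width of 4, the single-layer ReLU stand-in for each FFN, the softmax gating over the chosen experts, and every function name are assumptions for illustration. Only the shape of the computation (dense path + shared expert + top-8 of 128 routed experts, summed) follows the description above.

```python
import math
import random

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def toy_ffn(w, x):
    """Stand-in for a feed-forward network: one linear map followed by ReLU."""
    return [max(0.0, v) for v in matvec(w, x)]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def three_pathway_block(x, dense_w, shared_w, expert_ws, router_w, top_k):
    """Sum an always-on dense FFN, an always-on shared expert, and top_k routed experts."""
    scores = matvec(router_w, x)  # one score per routed expert
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])  # assumed gating scheme

    dense_out = toy_ffn(dense_w, x)    # always on
    shared_out = toy_ffn(shared_w, x)  # always on
    routed_out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):  # only the chosen experts do any work
        for j, v in enumerate(toy_ffn(expert_ws[idx], x)):
            routed_out[j] += gate * v

    out = [d + s + r for d, s, r in zip(dense_out, shared_out, routed_out)]
    return out, chosen

# Toy sizes: width 4 is arbitrary; 128 experts and top-8 follow the article.
rng = random.Random(0)
DIM, NUM_EXPERTS, TOP_K = 4, 128, 8

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

x = [rng.uniform(-1, 1) for _ in range(DIM)]
out, chosen = three_pathway_block(
    x,
    dense_w=rand_matrix(DIM, DIM),
    shared_w=rand_matrix(DIM, DIM),
    expert_ws=[rand_matrix(DIM, DIM) for _ in range(NUM_EXPERTS)],
    router_w=rand_matrix(NUM_EXPERTS, DIM),
    top_k=TOP_K,
)
print(f"{len(chosen)} of {NUM_EXPERTS} routed experts ran; output width {len(out)}")
```

All 128 expert matrices have to exist in memory before the router can pick 8 of them, which is the memory-versus-compute gap the rest of this post is about.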

For comparison: Mixtral 8x7B activates 13B of 47B (ratio: 0.28). DeepSeek-V2 activates 21B of 236B (ratio: 0.09, but across a far larger total). Gemma 4's ratio is the lowest among sub-30B MoE models because the always-on dense FFN and shared expert consume parameter budget without contributing to the "active" count in the way you'd expect. The dense FFN is a structural safety net — if the router picks wrong experts, the always-on pathways carry the signal. This is why Gemma 4 trains more stably and degrades gracefully under quantization.
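The ratios quoted above are simple divisions, and a few lines of Python make them easy to re-check. The parameter counts are the ones stated in this post; the dictionary and function names are just illustrative.

```python
# (total params, active params) in billions, as quoted in this post.
MODELS = {
    "Gemma 4 26B MoE": (25.2, 3.8),
    "Mixtral 8x7B": (47.0, 13.0),
    "DeepSeek-V2": (236.0, 21.0),
}

def active_ratio(total_b, active_b):
    """Fraction of the loaded parameters that do arithmetic per token."""
    return active_b / total_b

def total_to_active(total_b, active_b):
    """How many parameters sit in memory for each one that is active."""
    return total_b / active_b

for name, (total_b, active_b) in MODELS.items():
    print(
        f"{name}: active/total = {active_ratio(total_b, active_b):.2f}, "
        f"total/active = {total_to_active(total_b, active_b):.1f}:1"
    )
```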

For platform engineers, the consequence: the gap between "parameters in memory" and "parameters doing compute" is wider than any previous MoE at this scale. That gap breaks three things in Kubernetes.

## Consequence 1: The Numbers That Broke My Cost Model

On NeuroScale, every InferenceService requires a cost-center label. Our Kyverno admission policy blocks any deployment without one, and OpenCost reads these labels to attribute GPU-hours to teams.

Here's what the numbers look like when you deploy both on equivalent hardware via vLLM:

| Metric | 26B MoE (A4B) | 31B Dense |
|---|---|---|
| VRAM consumed (BF16) | 48.1 GB | 62 GB |
| Active params per token | 3.8B | 31B |
| Decode throughput (single user, H100) | 177.1 tok/s | 40.3 tok/s |
| Decode throughput (concurrency 16, H100) | — | 375 tok/s |
| TTFT (concurrency 1, H100) | — | 67.7 ms |
| Per-token cost (cloud API) | $0.06 / 1M tokens | $0.12 / 1M tokens |
| GPU allocated | 1× A100 | 1× A100 |

*(H100 throughput: JarvisLabs SPEED-Bench on vLLM 0.8.5; VRAM: controlled A6000 BF16 benchmark; API pricing: Google Cloud Vertex AI. Reproduce the throughput test: `vllm serve google/gemma-4-26B-A4B-it --dtype bfloat16 --max-model-len 8192`, then hit `/v1/completions` with a 512-token prompt.)*

OpenCost attributes cost based on resource allocation, not utilization. Both models allocate 1× A100. Both get billed identically per GPU-hour. But the MoE delivers 4.4× higher single-user throughput and 2× lower per-token cost.

In a multi-tenant cluster where Team A runs the 26B MoE and Team B runs the 31B Dense, OpenCost bills them the same rate. Team A gets 4.4× the throughput on identical hardware. The fair billing unit for MoE isn't GPU-hours — it's **GPU-hours weighted by active parameter ratio**. No Kubernetes cost attribution tool I've found supports this.

The math: Gemma 4's total-to-active ratio is 6.6:1 (25.2B / 3.8B). Mixtral's is 3.6:1 (47B / 13B). The cost attribution error for Gemma 4 is almost double Mixtral's — a direct consequence of the 3-pathway design keeping more parameters loaded but inactive.
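A rough way to see the attribution gap is to convert a flat GPU-hour rate into an effective price per million tokens. The sketch below uses the single-user decode throughputs from the table; the $2.00 GPU-hour rate is a made-up placeholder, and single-stream throughput understates what a batched server delivers, so treat the absolute dollar figures as illustrative and only the ratio as meaningful.

```python
def cost_per_million_tokens(gpu_hour_usd, tokens_per_second):
    """Effective price per 1M generated tokens at a flat GPU-hour rate."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000

GPU_HOUR_USD = 2.00  # hypothetical rate, not a quoted price

moe_cost = cost_per_million_tokens(GPU_HOUR_USD, 177.1)   # 26B MoE, single user
dense_cost = cost_per_million_tokens(GPU_HOUR_USD, 40.3)  # 31B Dense, single user

print(f"MoE:   ${moe_cost:.2f} per 1M tokens")
print(f"Dense: ${dense_cost:.2f} per 1M tokens")
print(f"Same GPU-hour bill, {dense_cost / moe_cost:.1f}x different cost per token")
```

Whatever rate you plug in, the ratio stays the same, because both teams are billed for the same allocation while producing very different token volumes.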

## Consequence 2: GPU Utilization Lies to the Autoscaler

KServe supports HPA-based autoscaling. The standard setup:

```
metadata:
  annotations:
    serving.kserve.io/autoscalerClass: "hpa"
    serving.kserve.io/targetUtilizationPercentage: "70"
    serving.kserve.io/metric: "cpu"
```

For dense models, GPU compute utilization roughly tracks inference load. More requests → more compute → higher utilization. Scaling at 70% is a reasonable proxy.

For the MoE, this proxy breaks. The causal chain:

LLM decode is memory-bandwidth-bound, not compute-bound. Each output token reads model weights from VRAM. Google Cloud's analysis found a cluster pushing 1M tokens/second showed only 4.4% GPU FLOPS utilization — tensor cores finish in microseconds, then wait for data.

Gemma 4 amplifies this: 25.2B of weights sit in VRAM, but only 3.8B do arithmetic per token. The memory bus shuttles expert weights that may not even fire.

Result: memory bandwidth saturates — throughput degrades, latency climbs — while GPU compute utilization reads 30–40%.

HPA sees 40% against a 70% target. Decision: *don't scale.*
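The bandwidth argument in this chain can be turned into a back-of-envelope ceiling: if every decoded token must stream its weights from VRAM once, tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below assumes BF16 weights (2 bytes per parameter) and an H100 SXM memory bandwidth of 3.35 TB/s, and it ignores KV-cache and activation traffic, so the results are upper bounds rather than predictions.

```python
def decode_ceiling_tok_per_s(params_billion, bytes_per_param, bandwidth_tb_per_s):
    """Upper bound on single-stream decode speed if weights are read once per token."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_per_s * 1e12 / bytes_per_token

H100_BANDWIDTH_TBPS = 3.35  # assumption: H100 SXM spec-sheet figure
BF16_BYTES = 2

moe_active_only = decode_ceiling_tok_per_s(3.8, BF16_BYTES, H100_BANDWIDTH_TBPS)
moe_all_weights = decode_ceiling_tok_per_s(25.2, BF16_BYTES, H100_BANDWIDTH_TBPS)
dense_31b = decode_ceiling_tok_per_s(31.0, BF16_BYTES, H100_BANDWIDTH_TBPS)

print(f"MoE, active weights only: {moe_active_only:.0f} tok/s ceiling")
print(f"MoE, every expert touched: {moe_all_weights:.0f} tok/s ceiling")
print(f"31B Dense: {dense_31b:.0f} tok/s ceiling")
```

The measured single-user numbers in the table (177.1 and 40.3 tok/s) sit below the first and third ceilings. The middle line is the pessimistic case for a busy server: once a batch contains enough different tokens, most experts are hit by someone, and the memory traffic per step approaches the full weight set while the arithmetic per token stays small.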

The correct autoscaling metric for MoE isn't utilization — it's request latency:

```
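apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma4-moe-hpa
spec:
  # The target must expose the scale subresource. If the InferenceService
  # does not in your KServe version, point this at the predictor Deployment.
  scaleTargetRef:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    name: gemma4-moe
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          # Custom metric name: it only exists once a Prometheus adapter rule
          # publishes it, e.g. derived from vLLM's request latency histogram.
          name: vllm_request_p95_latency_seconds
        target:
          type: AverageValue
          averageValue: "2.0"  # seconds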
```

This requires a custom metrics pipeline (Prometheus adapter → HPA custom metrics API). Significantly more infrastructure than built-in CPU scaling. But without it, the default autoscaling path that works for every dense model silently fails for MoE.
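The difference between the two metrics falls straight out of the replica formula in the Kubernetes HPA documentation: desired = ceil(current × currentMetric / targetMetric), with no change when the ratio is within the default 10% tolerance. Here is a small sketch of that rule; the 4.0 s observed p95 is a hypothetical value standing in for "latency doubled against a 2.0 s target", and the real controller also applies stabilization windows and readiness handling that are left out.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=4, tolerance=0.1):
    """Simplified HPA rule: scale by the ratio of observed metric to target."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: leave it alone
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))

# Utilization view: 40% observed against a 70% target.
by_utilization = desired_replicas(1, 40, 70)

# Latency view: hypothetical 4.0 s p95 against the 2.0 s target above.
by_latency = desired_replicas(1, 4.0, 2.0)

print(f"utilization metric wants {by_utilization} replica(s)")
print(f"latency metric wants {by_latency} replica(s)")
```

Same pod, same moment, and the two metrics disagree about whether anything is wrong.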

## Consequence 3: Expert Parallelism Crashes Under Data Parallelism

vLLM supports expert parallelism for MoE models — distributing individual experts across GPUs:

```
vllm serve google/gemma-4-26B-A4B-it \
    --tensor-parallel-size 2 \
    --enable-expert-parallel \
    --max-model-len 32768
```

Drop `--enable-expert-parallel` and add `--data-parallel-size 2` to scale horizontally instead, and everything looks healthy at first: weights load, CUDA graphs capture, API servers start. Then it crashes on the first inference request.

Reproduce it yourself (4 GPUs required, since tensor parallel 2 × data parallel 2 = 4 ranks):

```
vllm serve google/gemma-4-26B-A4B-it \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --max-model-len 32768
# Send any request → crash
```

The exact error from vLLM issue #38999:

```
File "vllm/distributed/device_communicators/cuda_communicator.py"
  in _all_gather_single
AssertionError: 1 != 36
```

Root cause: The MoE fused expert layer assumes that multiple GPUs means expert parallelism, triggering inter-GPU `all_gather` operations for expert routing. Under data parallelism, each GPU should run an independent full copy, with no cross-GPU expert communication. The DP workers initialize with EP-style communication, causing a tensor shape mismatch.

Dense models (31B, E4B) work fine with `--data-parallel-size` values above 1. The crash is specific to MoE + DP.

The vLLM collaborator confirmed: you must pass `--enable-expert-parallel` when deploying MoE in DP mode. Without it, DP mode crashes. With it, you get expert parallelism semantics (experts distributed across GPUs) instead of data parallelism semantics (full model copies per GPU).

For a platform that lets developers self-serve model deployments, this means: **you cannot expose "number of GPUs" as a user-facing knob for MoE models.** The parallelism strategy determines whether the deployment runs or crashes, and the correct choice depends on the model architecture — not the model size.

## The Outage That Taught Me to Distrust Defaults

I wouldn't have found any of this without getting burned first.

KServe's default config assumes Istio. We ran Kourier (~100 MB vs Istio's ~1 GB) because our k3d dev cluster didn't have the RAM. Result: all InferenceService creation blocked cluster-wide. Three hours of downtime. The fix:

```
ingress:
  disableIstioVirtualHost: true
```

One boolean, not in the getting-started docs. Found it in the controller source.

The lesson applies directly: when the YAML looks right and everything deploys green, check what the YAML doesn't say. The MoE manifest is valid. The pod starts. The cost model, autoscaler, and parallelism strategy are all silently wrong.

## The CI That Lied for Two Weeks

For two weeks, Kyverno policy checks showed green on every PR. Policies weren't enforced at all.

The bug: `kyverno-cli apply` exits with code 0 even when policies are violated. Violations print to stdout; the exit code, which is what CI checks, says "success." The fix:

```
OUTPUT=$(kyverno-cli apply ./policies/ --resource "$manifest" 2>&1)
if echo "$OUTPUT" | grep -qi "fail\|violation\|denied"; then
    echo "Policy violation detected"
    exit 1
fi
```

Any developer could have deployed an InferenceService without resource limits, cost labels, or ownership metadata. CI was green. Governance was broken.

Same pattern as the MoE manifest: the configuration is valid, the deployment succeeds, and the assumptions underneath are wrong.

## What Platform Engineers Should Do

After measuring all of this, here's the playbook:

**1. Tag InferenceServices with architecture metadata.**

```
labels:
  model-architecture: "moe"
  active-param-ratio: "0.15"    # 3.8B / 25.2B
  total-params: "25.2B"
  cost-center: "cc-ml-inference"
```

Your cost attribution, alerting, and capacity planning all need to know that this 48 GB model computes like a 4B model. Gemma 4's 3-pathway design (dense FFN + shared expert + routed experts) makes the total-to-active ratio higher than other MoE architectures — the metadata must capture this.
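Hand-typed ratios drift, so one option is to have the platform derive these labels from the parameter counts when it renders the manifest. The helper below is a hypothetical sketch, not NeuroScale code; it only reproduces the label keys shown above.

```python
def architecture_labels(architecture, total_params_b, active_params_b, cost_center):
    """Build the architecture labels from parameter counts given in billions."""
    ratio = active_params_b / total_params_b
    return {
        "model-architecture": architecture,
        "active-param-ratio": f"{ratio:.2f}",
        "total-params": f"{total_params_b:g}B",
        "cost-center": cost_center,
    }

moe_labels = architecture_labels("moe", 25.2, 3.8, "cc-ml-inference")
dense_labels = architecture_labels("dense", 31.0, 31.0, "cc-ml-inference")

for key, value in moe_labels.items():
    print(f'{key}: "{value}"')
```

A dense model gets `active-param-ratio: "1.00"`, so downstream tooling can treat the label as always present instead of special-casing MoE.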

**2. Autoscale on latency, not utilization.**

GPU compute utilization is a lagging, misleading indicator for MoE. Memory bandwidth saturation hits first. Use `vllm_request_p95_latency_seconds` or `vllm_tokens_per_second` as your HPA metric.

**3. Don't expose parallelism strategy as a user knob.**

Your platform should detect MoE vs Dense from the model config and set `--enable-expert-parallel` accordingly. Self-serve GPU count selection will produce runtime crashes for MoE models if the parallelism strategy is wrong: the model loads, CUDA graphs capture, then it crashes on the first request.
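One way to keep that knob out of users' hands is to let the platform build the `vllm serve` command from the model config. The sketch below is hypothetical. The config keys it checks are the expert-count fields used by Mixtral, Qwen MoE, and DeepSeek checkpoints; whether Gemma 4's config uses one of them is an assumption you would need to verify. Adding `--enable-expert-parallel` for every multi-GPU MoE deployment is a policy choice that mirrors the two commands in this post, not a vLLM requirement for the tensor-parallel-only case.

```python
# Expert-count keys seen in existing MoE configs (Mixtral, Qwen MoE, DeepSeek).
# Assumption: the target model exposes one of these in its config.
MOE_CONFIG_KEYS = ("num_local_experts", "num_experts", "n_routed_experts")

def is_moe(model_config):
    """Treat a model as MoE if any known expert-count key is greater than 1."""
    return any(int(model_config.get(key) or 0) > 1 for key in MOE_CONFIG_KEYS)

def vllm_args(model_id, model_config, tensor_parallel=1, data_parallel=1):
    """Build a vllm serve command so users never pick the parallelism flags."""
    args = ["vllm", "serve", model_id, "--tensor-parallel-size", str(tensor_parallel)]
    if data_parallel > 1:
        args += ["--data-parallel-size", str(data_parallel)]
    if is_moe(model_config) and tensor_parallel * data_parallel > 1:
        # Policy from this post: multi-GPU MoE always gets expert parallelism.
        args.append("--enable-expert-parallel")
    return args

moe_cmd = vllm_args("google/gemma-4-26B-A4B-it", {"num_experts": 128},
                    tensor_parallel=2, data_parallel=2)
dense_cmd = vllm_args("example/dense-model", {}, tensor_parallel=1, data_parallel=2)

print(" ".join(moe_cmd))
print(" ".join(dense_cmd))
```

The `example/dense-model` id is a placeholder. The point is that the user supplies a replica or GPU count, and the platform decides what that count means for the given architecture.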

**4. Budget VRAM for total params, compute for active params.**

Resource requests = total parameters (scheduling). Capacity planning = active parameters (throughput). On the H100 benchmark above, Gemma 4 MoE decodes at 177 tok/s and the 31B Dense at 40 tok/s for a single user. Same GPU request, 4.4× throughput difference. Your capacity model must account for this or you'll over-provision MoE by roughly 4×.
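As a sanity check on that over-provisioning figure, here is a minimal capacity sketch. The 1,000 tok/s aggregate target is hypothetical, and it treats the single-user decode numbers as per-replica throughput, which is a simplification since batched serving changes both figures.

```python
import math

def replicas_needed(target_tok_per_s, per_replica_tok_per_s):
    """Smallest replica count whose combined throughput meets the target."""
    return math.ceil(target_tok_per_s / per_replica_tok_per_s)

TARGET_TOK_PER_S = 1000  # hypothetical aggregate demand

moe_replicas = replicas_needed(TARGET_TOK_PER_S, 177.1)
dense_replicas = replicas_needed(TARGET_TOK_PER_S, 40.3)

print(f"26B MoE:   {moe_replicas} replicas")
print(f"31B Dense: {dense_replicas} replicas")
```

Sizing the MoE fleet with the dense model's throughput would request about four times as many GPUs as the workload needs.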

## The Deeper Pattern: Architecture-Aware Scheduling

These three failures (cost, scaling, parallelism) point to a structural gap in Kubernetes: **the scheduler is architecture-blind.**

`resources.requests.nvidia.com/gpu: 1` tells the scheduler to find a node with a free GPU. It says nothing about whether that GPU will be memory-bandwidth-bound or compute-bound, whether the model activates 15% or 100% of its parameters, or whether horizontal scaling requires expert parallelism flags. The scheduler treats a 26B MoE and a 31B Dense as identical workloads because they request the same resource.

This was fine when every model was dense. With Gemma 4's 3-pathway MoE entering production, the abstraction leaks. The fix isn't just labels and custom metrics — it's treating model architecture as a first-class scheduling dimension, the same way we treat GPU type, memory, and topology today.

Every claim in this post — cost attribution errors, autoscaler blindness, DP crashes — is reproducible. The vLLM issue is public. The benchmarks are from published controlled tests (JarvisLabs, Google Cloud). The KServe outage and Kyverno bug happened on a real platform with 108 commits of history.

The model works. The YAML looks right. The CI is green.

**The governance is broken. And now I can prove it.**

NeuroScale is open source: [github.com/sodiq-code/neuroscale-platform](https://github.com/sodiq-code/neuroscale-platform). The `BEFORE.md` and `AFTER.md` in each milestone directory tell the full story of what went wrong and what I learned.
