Every multimodal model you’ve used has the same basic system. Text goes in one way, images go through a vision encoder first, audio goes through an audio encoder first, and then everything gets handed off to the language model in a form it can work with. The encoders are load-bearing and you don’t just remove them.
Google actually removed them.
Gemma 4 12B takes raw image patches and raw audio waveforms and projects them directly into the same embedding space as text tokens. No vision encoder. No audio encoder. One decoder handling everything.
Table of Contents
What encoders actually do and why removing them is a bet
The encoder’s job is translation. A vision encoder takes pixels and converts them into a representation the language model can reason over. An audio encoder does the same for waveforms. They’re trained specifically for this, dedicated components that have learned to compress visual and acoustic information into something a language model can use.
Removing them means the language model has to do that translation itself. Raw image patches go in as lightweight embeddings through a single matrix multiplication. Raw audio gets projected directly into the same dimensional space as text. The LLM backbone takes over from there.
That’s an important architectural bet for two reasons. First, dedicated encoders have years of specialized training behind them. Replacing that with a linear projection layer is aggressive simplification. Second, if it doesn’t work well, you get a model that’s bad at images and audio specifically, the things that make it multimodal in the first place.
Google’s argument, implicit in the design, is that a capable enough language model backbone doesn’t need the translation layer. It can learn to handle raw inputs directly.
The size slot Google chose for this experiment
The Gemma 4 family has five models. The E2B and E4B are edge models built for phones. The 26B MoE and 31B dense are serious compute options. The 12B sits in the middle and it’s the only one Google built encoder-free.
The edge models need every optimization they can get so they keep their encoders. The large models are where Google plays it safe on architecture. The 12B is where they tried something different, capable enough to be useful, small enough that an architectural experiment doesn’t cost as much if it goes sideways.
The practical result is a model that runs on 16GB of VRAM. A MacBook Pro with 16GB unified memory can run this. Most consumer laptops with a recent discrete GPU can run this. That’s the target and the encoder-free design is part of how Google hit it, fewer components means smaller footprint, less latency, and no separate encoder weights to load.
Related: Gemma 4 Makes Local AI Agents Actually Practical The benchmarks that make the bet look smart
All numbers below are from Google’s own evaluation.
On AIME 2026 without tools, 12B scores 77.5 against the 26B MoE’s 88.3. That gap is there but 77.5 from a model running on a laptop is not a number you dismiss. LiveCodeBench puts it at 72.0, again behind the 26B at 77.1 but meaningfully ahead of where a 12B model had any right to be a year ago. GPQA Diamond at 78.8, that’s a hard science reasoning benchmark and 12B is within striking distance of the 26B’s 82.3.
MMMU Pro vision at 69.1, MATH-Vision at 79.7. For a model with no dedicated vision encoder those are genuinely good numbers.
The one place the encoder-free bet shows its cost is OmniDocBench, which tests document understanding. The 12B scores 0.164 average edit distance against the 26B’s 0.149. Lower is better here and the gap is small but it’s the benchmark where fine-grained visual detail matters most, exactly the scenario where a dedicated vision encoder would have the clearest advantage.
Limitations
Audio is capped at 30 seconds and it’s a hard limit. Anything longer gets cut off. For transcription of short voice notes or quick translations it’s fine. For anything resembling a real conversation or a meeting recording, it’s not the right tool.
Video support is also absent on the 12B entirely. The 26B and 31B handle video. The 12B doesn’t. If video understanding matters for your use case, you’re looking at a different model in the family.
Context window is 256K, which sounds large until you compare it to what’s shipping elsewhere right now. Minimax M3 just launched with 1M. For most local use cases 256K is plenty but worth knowing where the ceiling is.
Long context retrieval is also where the 12B shows its size most clearly. MRCR v2 8-needle at 128K scores 43.4 against the 31B’s 66.4. Hiding information deep in a long document and asking the model to find it is harder at this size.
How to run it
The quickest path is Ollama or LM Studio, both support the 12B and neither requires much setup. The weights are on Hugging Face under Apache 2.0, about as clean a license as you get for a Google model.
For thinking mode, set enable_thinking=True in the chat template. It’s off by default. Worth turning on for anything involving multi-step reasoning or complex document analysis. For simple queries leave it off, the latency difference is noticeable. Another thing to note before you begin: image content should go before text in your prompt, audio after. Google flags this in the model card and it affects output quality enough that it’s not just a suggestion.
The bet paid off at this size
Google took the middle slot in the Gemma 4 family and used it to remove components that every other multimodal model treats as non-negotiable. No vision encoder, no audio encoder, one decoder doing everything. The benchmarks didn’t collapse. On reasoning and coding they came in closer to the 26B than the parameter gap would suggest.
Whether this architecture scales up is the genuinely open question. The 31B kept its encoders. Google wasn’t ready to make this bet at full size. But at 12B, running on a laptop, the encoder-free design holds up well enough.
The weights are on Hugging Face now. Worth finding out for yourself.