Google DeepMind has released Gemma 4 12B, an open AI model that brings multimodal capabilities to everyday laptops. It processes text, images, and audio natively without separate encoders, which cuts processing time, memory use, and latency, according to Google. The model runs locally with just 16 GB of RAM and nearly matches the 26B model, which is roughly twice its size, across benchmarks, Google says. It's also the first mid-sized Gemma model with native audio processing.
Gemma 4 12B handles speech recognition, code generation, and video analysis. Per the Developer Guide, it can parse multi-minute video clips by analyzing frames and audio together. In one demo, it chewed through a roughly five-minute Google I/O keynote clip: 313 frames sampled at one per second, plus audio.
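The frame count in that demo is easy to sanity-check. The following is a minimal sketch of fixed-rate frame sampling, not the model's actual preprocessing pipeline, which the article does not describe. The function names `sample_timestamps` and `format_mmss` are hypothetical, and the sketch assumes sampling starts at the first frame of the clip.

```python
import math


def sample_timestamps(duration_s: float, fps: float = 1.0) -> list[float]:
    """Return the timestamps (in seconds) of frames sampled at a fixed rate.

    Assumption: sampling starts at t = 0 and stops before the clip ends.
    """
    frame_count = math.ceil(duration_s * fps)
    return [i / fps for i in range(frame_count)]


def format_mmss(seconds: float) -> str:
    """Format a duration in seconds as minutes:seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


# At one frame per second, 313 frames correspond to a clip of about 313 seconds.
timestamps = sample_timestamps(313, fps=1.0)
print(len(timestamps), "frames for a clip of", format_mmss(313))
```

Under these assumptions, 313 frames at one frame per second implies a clip slightly longer than five minutes, which is consistent with the demo description.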
The model is available on Hugging Face, Ollama, LM Studio, and other platforms, licensed under Apache 2.0 for commercial use.