Hugging Face and Cerebras bring Gemma 4 to real-time voice AI


Source: huggingface.co · Published: 2026-07-01 · Topic: artificial intelligence

Hugging Face and Cerebras have partnered to create a real-time voice AI pipeline using Google DeepMind's Gemma 4, Nvidia's Parakeet, and Alibaba's Qwen3TTS, achieving low-latency speech-to-speech interaction. The open-source system, already powering over 9,000 Reachy Mini robots, aims to eliminate frustrating delays in conversational AI by leveraging Cerebras's fast inference for the language model bottleneck.



The result is a speech-to-speech experience that feels dramatically more natural. Instead of waiting for an AI to respond, conversations flow with the responsiveness users expect from human interaction.

The demo is built as a real-time speech-to-speech pipeline. Each part of the system is modular, open, and replaceable, making it easy for developers to adapt the stack for different assistants, robots, products, or research projects.

Together, the components form a fully open speech-to-speech loop:

Speech input
  -> speech recognition with Nvidia's Parakeet
  -> Gemma 4 VLM inference on Cerebras
  -> text-to-speech with Alibaba's Qwen3TTS
  -> spoken response

The architecture brings together the strengths of the open-source AI ecosystem: Cerebras for fast inference, Google DeepMind's Gemma 4 31B for the language model, Nvidia's Parakeet for speech recognition, and Alibaba's Qwen3TTS for text-to-speech. Every layer can be inspected, modified, and extended by developers.
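To make the "modular and replaceable" idea concrete, here is a minimal Python sketch of a three-stage loop in which each stage is a plain callable. It is an illustration only and does not reflect the actual code in huggingface/speech-to-speech: the class name `SpeechToSpeechPipeline` and the `fake_*` stages are hypothetical placeholders, and no real model is called.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is a plain callable, so any one can be swapped independently.
Transcriber = Callable[[bytes], str]   # audio in -> user text
Responder = Callable[[str], str]       # user text -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio out


@dataclass
class SpeechToSpeechPipeline:
    """Hypothetical wrapper that chains the three stages in order."""
    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer

    def run(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)   # speech recognition
        reply = self.respond(text)         # language model inference
        return self.synthesize(reply)      # text-to-speech


# Placeholder stages (assumptions for illustration, not real models).
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = SpeechToSpeechPipeline(fake_asr, fake_llm, fake_tts)
audio_out = pipeline.run(b"hello")
print(audio_out)  # b'You said: hello'
```

Because the stages share only simple input and output types, replacing one of them (for example, pointing `respond` at a different inference backend) would not require changes to the other two.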

Today, some production systems achieve a reasonable median latency yet still suffer frustrating multi-second delays at the 95th percentile (P95). Those delays become even more noticeable when tool calls or multimodal steps require multiple turns.
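The gap between median and tail latency can be shown with a short Python sketch. The latency values below are made up for illustration and are not measurements from any system mentioned here; the helper `chance_of_slow_turn` is a hypothetical name, and it assumes each turn's latency is independent of the others.

```python
import statistics

# Illustrative response times in seconds (invented numbers, not measured):
# most replies are fast, a small fraction fall in a slow tail.
latencies_s = [0.4] * 90 + [0.6] * 4 + [2.5] * 6

median_s = statistics.median(latencies_s)
# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
p95_s = statistics.quantiles(latencies_s, n=100, method="inclusive")[94]

print(f"median: {median_s:.1f} s, P95: {p95_s:.1f} s")


def chance_of_slow_turn(turns: int, tail_fraction: float = 0.05) -> float:
    """Probability that at least one of `turns` sequential model calls
    lands in the slow tail, assuming the calls are independent."""
    return 1 - (1 - tail_fraction) ** turns


# A reply that needs three sequential calls (e.g. a tool call round trip)
# is more likely to hit the tail than a single-call reply.
print(f"1 turn:  {chance_of_slow_turn(1):.3f}")
print(f"3 turns: {chance_of_slow_turn(3):.3f}")
```

In this toy data set the median looks healthy while the P95 is several times larger, and under the independence assumption the chance of hitting a slow response grows with every additional turn.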

Cerebras helps solve one of the most important bottlenecks in the stack: the language-model response time. By making inference dramatically faster and more stable, Cerebras allows the rest of the Hugging Face pipeline to shine.

That stability is especially important at the long tail. Many systems can deliver acceptable median response times, but occasional slow responses still make conversations feel unreliable.

This same Hugging Face speech-to-speech pipeline already powers Reachy Mini robots, with more than 9,000 robots in the wild. For robots, voice assistants, and embodied AI, responsiveness is not a cosmetic improvement. It is what makes the interaction feel alive.

The motivation to use Cerebras is therefore not simply cost reduction. It is low latency, predictable performance, and the ability to create real-time experiences that feel natural at scale.

This collaboration reflects a shared belief that the future of AI will be both open and performant. Open-source models, open infrastructure, and breakthrough inference speed together create a foundation for the next generation of conversational AI.

We invite developers to explore the demo, experiment with the code, and help shape what comes next for real-time voice AI.

Demo: Hugging Face Space

Repository: huggingface/speech-to-speech


Read original on huggingface.co → huggingface.co/blog/cerebras-gemma4-voice-ai
