# How to Build a Private Offline Voice Assistant with Gemma 4 12B: A Complete Local Setup Guide

> Source: <https://dev.to/unfairhq/how-to-build-a-private-offline-voice-assistant-with-gemma-4-12b-a-complete-local-setup-guide-20nn>
> Published: 2026-06-17 00:48:27+00:00

*A developer’s guide to running Google’s 11.95B-parameter multimodal model with local STT/TTS on a 16 GB laptop under Apache 2.0.*

**TL;DR:** Download Gemma 4 12B (~6.7 GB at 4-bit) into a local runtime such as Google AI Edge Gallery, pair it with a local STT/TTS stack, and expose a local endpoint. The 11.95B-parameter model fits on a 16 GB laptop, runs offline under Apache 2.0, and keeps all voice data on-device.

Before downloading the model, verify your machine has at least 16 GB of RAM and plan your voice pipeline around the model’s strict 30-second audio ceiling. At 4-bit quantization, Gemma 4 12B’s 11.95 billion parameters compress to roughly 6.7 GB. After loading the weights, the remaining ~9 GB must cover the operating system, the inference framework overhead, and any local audio capture or STT services. If you are running other local models or Home Assistant addons concurrently, budget even more conservatively.
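The memory budget above is simple arithmetic, sketched here in Python. The 16 GB, 6.7 GB, and 11.95B figures come from the text. The raw 4-bit size is computed from the parameter count alone; the gap up to the quoted ~6.7 GB presumably comes from quantization metadata or layers kept at higher precision, but that is an assumption the source does not confirm.

```python
PARAMS = 11.95e9       # parameter count quoted in this guide
QUANTIZED_GB = 6.7     # 4-bit footprint quoted in this guide
TOTAL_RAM_GB = 16.0    # target laptop

def raw_weight_gb(params: float, bits_per_weight: int) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

def headroom_gb(total_gb: float, model_gb: float) -> float:
    """RAM left for the OS, runtime overhead, and speech services."""
    return total_gb - model_gb

print(f"raw 4-bit weights: {raw_weight_gb(PARAMS, 4):.1f} GB")
print(f"headroom after load: {headroom_gb(TOTAL_RAM_GB, QUANTIZED_GB):.1f} GB")
```

Rerun `headroom_gb` with a smaller total if other local models or add-ons already occupy part of the 16 GB.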

On Linux, check available memory before launching the stack:

```
free -h
```

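On a 16 GB machine, read the `available` column rather than `free`: it should comfortably exceed the ~6.7 GB the weights need plus whatever your runtime and speech services use. Anything less risks swapping during inference.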
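If you would rather script this check than read the output by eye, a minimal sketch follows. It assumes a Linux host that exposes `MemAvailable` in `/proc/meminfo`; the function names and the threshold parameter are illustrative and not part of any runtime's API.

```python
def mem_available_gib(meminfo_text: str) -> float:
    """Return MemAvailable from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found")

def has_headroom(meminfo_text: str, required_gib: float) -> bool:
    """True when available memory meets the threshold you choose."""
    return mem_available_gib(meminfo_text) >= required_gib

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(f"available: {mem_available_gib(f.read()):.1f} GiB")
    except (OSError, ValueError):
        print("could not read MemAvailable; this check is Linux-only")
```

Pick `required_gib` from your own budget: the model footprint plus whatever the runtime and speech services need.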

The model enforces a hard 30-second audio limit. Exceeding it will cause inference to fail or truncate, so your client must enforce a maximum recording duration. A common approach is to chunk incoming streams or fall back to text input for complex multi-part commands. Split existing recordings at the boundary with ffmpeg:

```
ffmpeg -i input.wav -f segment -segment_time 30 -c copy chunk_%03d.wav
```

This produces 30-second WAV files that stay within the limit. Feed each chunk separately, or switch to a text fallback when a user’s utterance exceeds one segment.
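The same cap can be enforced in the client before anything is written to disk. The sketch below splits an in-memory sample buffer into chunks of at most 30 seconds and flags utterances that need the text fallback. The function names are hypothetical, and it works on any sliceable sequence of samples.

```python
from typing import List, Sequence

MAX_SECONDS = 30  # audio ceiling stated in this guide

def split_samples(samples: Sequence, sample_rate: int,
                  max_seconds: float = MAX_SECONDS) -> List[Sequence]:
    """Split a sample buffer into consecutive chunks no longer than max_seconds."""
    chunk_len = int(sample_rate * max_seconds)
    if chunk_len <= 0:
        raise ValueError("sample_rate * max_seconds must be positive")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

def needs_text_fallback(samples: Sequence, sample_rate: int,
                        max_seconds: float = MAX_SECONDS) -> bool:
    """True when the utterance is longer than a single chunk."""
    return len(samples) > int(sample_rate * max_seconds)
```

At 16 kHz, a 70-second buffer yields two full 30-second chunks and one 10-second remainder.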

To run Gemma 4 12B without external API calls, install a local inference runtime first. The Google AI Edge Gallery is one supported deployment option for both phones and laptops, and releases are delivered as standard OS-specific packages: a Windows .exe installer, a macOS zip bundle, and a Linux package.

Because this runtime serves as the execution backend for your voice pipeline, completing the installation before downloading model weights avoids path and permission errors during setup.

On macOS, download the zip archive, extract it, and drag the resulting application into your system Applications folder. Standard user permissions are sufficient for most local inference workloads when the app resides in the Applications directory. If you prefer the command line, you can extract and move the bundle in one step; the wildcards below assume the runtime's archive is the only .zip in ~/Downloads:

```
cd ~/Downloads && unzip *.zip && mv *.app /Applications/
```

On Windows, launch the downloaded .exe installer and proceed through the setup prompts until the wizard finishes. A per-user install is usually adequate, with administrator elevation required only if you explicitly choose a system-wide program directory. You can also start the installer from PowerShell once it is saved to your Downloads folder; the snippet picks the first .exe it finds there and waits for the wizard to exit:

``` powershell
$exe = Get-ChildItem "$env:USERPROFILE\Downloads" -Filter *.exe | Select-Object -First 1
Start-Process -FilePath $exe.FullName -Wait
```

Linux users should install the provided package using the distribution’s native package manager; because formats vary by release, refer to the supplied readme for the exact dpkg, rpm, or AppImage command. After installation completes on any platform, open the runtime and verify that the local inference engine is active before pulling the Gemma 4 12B weights. Keeping this layer fully offline ensures voice data never leaves the device.

Loading Gemma 4 12B at 4-bit quantization reduces its memory footprint to roughly 6.7 GB, letting the entire model stay resident in RAM on a 16 GB laptop. Select the 4-bit option in your local inference UI or configuration file immediately after importing the model weights.

At 11.95 billion parameters, the full-precision weights would exceed typical consumer memory limits, but 4-bit compression brings private, on-device deployment within reach. In tools like Google AI Edge Gallery, select the 4-bit quantization profile during the model-import step. Because Gemma 4 is encoder-free and processes audio in a single pass, keeping the entire model in RAM is especially important: any disk access during inference adds latency to multimodal inputs. After initialization, verify that the process is fully resident in physical memory and not swapping before you attach speech-to-text or text-to-speech services; even occasional paging undermines the low latency that conversational voice interaction needs. On Linux, confirm swap usage is zero with:

```
grep VmSwap "/proc/$(pgrep -of gemma)/status"
```

On macOS, monitor memory pressure while the model loads:

```
memory_pressure && vm_stat 1
```

If you see swap growth or pressure warnings, reduce the context window or close other applications until the process stabilizes entirely in RAM. A fully resident model avoids the round-trip disk delay that would otherwise make real-time assistant responses unusable. Treat this verification as a mandatory gate: only after confirming stable, swap-free residency should you layer on the speech pipeline.
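To make that gate scriptable on Linux, the sketch below parses the same `VmSwap` field from `/proc/<pid>/status`. The helper names are hypothetical; supply the PID of your own inference process.

```python
def vm_swap_kb(status_text: str) -> int:
    """Return the VmSwap value (in kB) from /proc/<pid>/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    raise ValueError("VmSwap not found")

def is_swap_free(pid: int) -> bool:
    """Read /proc/<pid>/status (Linux only) and report whether VmSwap is zero."""
    with open(f"/proc/{pid}/status") as f:
        return vm_swap_kb(f.read()) == 0
```

Call `is_swap_free(pid)` after the model finishes loading, and start the STT and TTS services only when it returns `True`.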

Current local assistant frameworks still require separate speech and model components, so you must bridge a local STT engine and a local TTS service to Gemma 4 12B. The STT transcript feeds into Gemma’s text context, and the generated reply routes to the TTS service, even though the model can natively ingest audio in a single pass. Until front-end conversation agents expose that native audio path, a text pipeline is the only workable architecture, and it conveniently sidesteps Gemma’s hard 30-second audio limit. Splitting the pipeline this way also lets you upgrade either speech component independently of the quantized model.

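For the STT layer, a local ONNX/Parakeet model can deliver subsecond transcription latency. Load the ONNX graph and run inference on the captured waveform; the exact input shape and the decoding of the output into text depend on how the model was exported:

``` python
import numpy as np, onnxruntime as ort
# Placeholder input: one second of 16 kHz silence; pass your captured audio instead.
waveform = np.zeros((1, 16000), dtype=np.float32)
session = ort.InferenceSession("parakeet.onnx")
inputs = {session.get_inputs()[0].name: waveform}
outputs = session.run(None, inputs)
# Decode `outputs` into the transcript string `text` using the export's vocabulary.
```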

Pass the resulting transcript to your local Gemma endpoint. A common pattern is to POST the text to a local inference server and read back the complete response in one piece (`"stream": False`):

``` python
import requests

r = requests.post("http://localhost:11434/api/generate",
    json={"model": "gemma4:12b", "prompt": text, "stream": False},
    timeout=120)
r.raise_for_status()  # surface HTTP errors before parsing the body
reply = r.json()["response"]
```

Finally, send the reply to a local TTS service. A typical setup posts the reply text to an on-device Piper or similar HTTP endpoint and writes the returned audio to your speaker queue:

``` python
import requests

# The /tts route and port are placeholders; match them to your TTS server.
audio = requests.post("http://localhost:5000/tts",
    json={"text": reply}, timeout=60)
audio.raise_for_status()
# playback(audio.content)  # hand the audio bytes to your player
```

This keeps the full loop offline: the STT model runs locally, Gemma runs locally, and the TTS service runs locally.

Build the voice command loop by capturing microphone audio, sending it to a local STT service, forwarding the resulting transcript to your local Gemma 4 inference endpoint, and passing the generated reply to a local TTS engine for immediate playback. Halt microphone capture before the 30-second mark. In this text pipeline the audio goes to the STT service rather than to Gemma, but a clip that stays inside the model’s single-pass audio window can later be sent over the native audio path without truncation or rejection.

A common approach is to record raw PCM audio with `sounddevice`, flush it to a 16 kHz mono WAV file, and POST it to a local Whisper-compatible STT server listening on port 9000. Once the STT returns the transcript, construct a concise prompt formatted as a direct action command in the style of the Voice Edit pattern, with phrasing like “Restructure these notes into an executive summary” or “Translate this into Hindi”, and POST that payload to your local Gemma 4 inference endpoint running under Ollama or llama.cpp on localhost. Avoid conversational preamble; the model executes faster when the instruction is explicit and scoped to a single action.

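``` python
import sounddevice as sd, requests, wave

RATE = 16000       # 16 kHz mono, the format sent to the STT server
MAX_SECONDS = 30   # stay within the 30-second audio limit

# Record a fixed-length buffer; sd.wait() blocks until capture finishes.
frames = sd.rec(int(MAX_SECONDS * RATE), samplerate=RATE,
                channels=1, dtype='int16')
sd.wait()
with wave.open("cmd.wav", "wb") as f:
    f.setnchannels(1)      # mono
    f.setsampwidth(2)      # 16-bit samples
    f.setframerate(RATE)
    f.writeframes(frames.tobytes())
```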

Submit the recorded file to the STT layer and retrieve the text:

```
curl -X POST http://localhost:9000/v1/audio/transcriptions \
  -F file="@cmd.wav" -F model="whisper-base"
```

Forward the transcript to the local Gemma 4 API as an explicit agentic instruction:

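``` python
import requests

# `transcript` is the text returned by the STT step above.
r = requests.post("http://localhost:11434/api/generate",
    json={"model": "gemma4:12b",
          "prompt": f"Restructure into an executive summary: {transcript}",
          "stream": False})  # one JSON object instead of a streamed response
reply = r.json()["response"]
```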
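The single-action prompt style described above can be captured in a small helper, so that every request has the same shape. This is a sketch: `build_prompt` is a hypothetical name, and the "action: transcript" layout simply mirrors the example request rather than any format the model requires.

```python
def build_prompt(action: str, transcript: str) -> str:
    """Combine one explicit action with the transcript, with no preamble."""
    action = action.strip().rstrip(":.")
    transcript = " ".join(transcript.split())  # collapse stray whitespace
    if not action or not transcript:
        raise ValueError("both an action and a transcript are required")
    return f"{action}: {transcript}"
```

Rejecting an empty transcript here keeps the loop from sending the model a bare instruction when the STT layer heard nothing.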

Finally, push the model’s text output to your local TTS service—such as Piper or Coqui—and play the synthesized audio stream through your default sound device. Keep the loop strictly sequential: record, transcribe, infer, speak, then return to listening so only one audio stream is active at any moment and the pipeline stays synchronized.
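One way to keep the loop strictly sequential is to express a turn as a single function that calls each stage in order. In the sketch below the four callables are hypothetical stand-ins for your own recording, STT, Gemma, and TTS code; nothing here is tied to a specific library.

```python
from typing import Callable

def run_turn(record: Callable[[], bytes],
             transcribe: Callable[[bytes], str],
             infer: Callable[[str], str],
             speak: Callable[[str], None]) -> str:
    """Run one strictly sequential turn: record, transcribe, infer, speak."""
    audio = record()                 # blocks until capture finishes
    transcript = transcribe(audio)
    if not transcript.strip():
        return ""                    # nothing heard; skip inference and playback
    reply = infer(transcript)
    speak(reply)                     # blocks until playback ends
    return reply
```

Because each stage blocks until it finishes, calling `run_turn` in a plain `while` loop keeps at most one audio stream active at a time.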

Keep the stack offline by binding the inference client to a local-only endpoint and dropping all outbound traffic for the process at the system firewall. Because Gemma 4 12B ships under Apache 2.0 with no commercial restrictions and inference runs entirely on-device, no audio or text ever leaves your machine, and all sensitive multimodal data remains on the hardware that owns it.

At the client layer, disable remote base URLs and any automatic fallback to hosted APIs. A common approach is to initialize the SDK against a local inference server so every request stays on the loopback interface and never attempts an external resolver:

``` python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",
    api_key="not-needed-local"
)
```
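A small guard can reject any base URL that is not loopback before the client is created. The sketch below uses only the Python standard library; the function names are hypothetical. Treating the name `localhost` as local is an assumption about your hosts configuration.

```python
import ipaddress
from urllib.parse import urlparse

def is_loopback_url(url: str) -> bool:
    """True if the URL's host is a loopback address or the name 'localhost'."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":          # assumes the usual hosts-file mapping
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False                 # any other hostname is treated as remote

def require_local(url: str) -> str:
    """Return the URL unchanged, or raise if it is not a loopback endpoint."""
    if not is_loopback_url(url):
        raise ValueError(f"refusing non-local endpoint: {url}")
    return url
```

Wrapping the base URL as `require_local("http://127.0.0.1:11434/v1")` makes a misconfigured remote endpoint fail at startup rather than silently sending data off the machine.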

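For a hard offline guarantee, add a firewall rule that drops all non-loopback outbound traffic from the account that runs the assistant. The rule below assumes the assistant runs as a dedicated `assistant` user, and it exempts the loopback interface so the local STT, inference, and TTS endpoints stay reachable:

```
sudo iptables -A OUTPUT ! -o lo -m owner --uid-owner assistant -j DROP
```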

The 11.95 billion parameter weights compress to roughly 6.7 GB at 4-bit quantization, so the full audio-to-text pipeline executes in local RAM without cloud encoders or API dependencies. The hard 30-second audio limit also caps the size of any single audio input. After starting the assistant, verify isolation by capturing packets during a voice query: if any traffic appears on an interface other than loopback, the stack is not truly offline.

The encoder-free architecture reads audio in a single pass, but most local assistant platforms and conversation agents still require separate speech components today. Until those frameworks natively stream raw audio to the model, you should keep a local STT layer in the pipeline.

Yes, 16 GB of RAM is enough: Google DeepMind specifies that Gemma 4 12B runs on a 16 GB laptop. At 4-bit quantization the weights occupy roughly 6.7 GB, leaving headroom for the OS and your STT/TTS services if you manage memory carefully.

Gemma 4 12B has a hard 30-second audio limit. A common approach is to chunk long utterances or switch to a text-based prompt once you exceed that boundary.

No internet connection is needed once setup is done. The model is Apache 2.0 and open-weight, so after you download the quantized weights and install the local runtime, the entire voice assistant operates offline. No API keys or cloud endpoints are required.

No licensing barrier stands in the way of commercial use. The weights ship under Apache 2.0 with no commercial restrictions, removing legal friction for on-device deployments.


*I packaged the setup above into a ready-to-use kit — **Gemma 4 12B Local Multimodal Build Kit (13 Items)** — for anyone who'd rather copy-paste than wire it from scratch: [https://unfairhq.gumroad.com/l/nkylsz](https://unfairhq.gumroad.com/l/nkylsz?utm_source=devto&utm_medium=article&utm_campaign=gemma-4-12b-local-multimodal-build-kit-1).*
