{"slug": "shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant", "title": "Shipping Gemma 4 speech recognition in a Windows .NET desktop app: a 5-variant model-selection tour", "summary": "The article describes integrating Google's Gemma 4 speech recognition model into Parlotype, a privacy-focused Windows voice-to-text desktop app that runs entirely on-device. The author evaluated five available GGUF variants of Gemma 4 (E2B and E4B in BF16, Q4_K_M, and Q8_0 formats) against Whisper models on LibriSpeech test-other samples to determine the best combination of accuracy, speed, and disk footprint. The chosen runtime was llama-server due to its cross-vendor GPU support, no Python dependency, and stable HTTP API, with the final model selection and benchmark data published in the project's documentation.", "body_md": "*This is a submission for the Gemma 4 Challenge: Build with Gemma 4*\n\n## What I Built\n\n**Parlotype** is a voice-to-text desktop app for Windows. It is built with .NET 10 and Avalonia UI. You hold a global hotkey, speak, then release it. Your text appears in whatever app you were typing into. All speech recognition runs on your machine. No cloud, no audio leaves the machine.\n\nGoogle released Gemma 4 in April 2026. It has a native multimodal audio path. I added it as an alternative speech engine alongside the existing Whisper.net pipeline. You pick Whisper or Gemma 4 in Settings. The rest of the audio pipeline (WASAPI capture, then Silero VAD, then text injection) stays the same.\n\nThe interesting part, and what this post is mostly about, is which Gemma 4 variant to ship. The `ggml-org` GGUF repos publish five variants (E2B and E4B, each in BF16, Q4_K_M, and Q8_0, except where a repo skips one). The model card does not tell you which combination of accuracy, speed, and disk footprint you will actually get. 
So I ran each one on the same dataset, picked a default, and shipped.\n\n## Demo\n\nThe video shows the engine selector, the model picker with five variants, and a live dictation with Gemma 4.\n\n## Code\n\nSource, ADRs, and benchmark configs: [github.com/mdemin729/parlotype](https://github.com/mdemin729/parlotype)\n\nRelevant entry points:\n\n- `src/Parlotype.Platform/Speech/LlamaCppSpeechRecognizer.cs`: the recognizer that talks to `llama-server`.\n- `src/Parlotype.Core/Speech/Gemma4ModelInfo.cs`: the 5-variant catalog.\n- `docs/decisions/025-gemma4-llamacpp-desktop.md` through `030-configurable-gemma4-prompts.md`: the ADR series covering the integration.\n- `results/comparison-libri-speech-test-other-2026-05-23-cuda.md`: the benchmark data behind the choices below.\n\n## How I Used Gemma 4\n\n### Why a separate engine at all\n\nWhisper is great on clean read English. It gets noticeably worse on conversational or noisy audio. Gemma 4 has a conformer audio encoder. Google's own evaluations show it reaching 4.17% WER on LibriSpeech-test-clean, which is competitive with much larger Whisper variants. For a voice-to-text app, the typical user is dictating to themselves into a focused text field. That noise profile is closer to \"clean read\" than to \"AMI meeting\", so Gemma 4 is a real alternative. Giving people the choice felt right. Either way, privacy does not depend on which model is loaded.\n\n### Why `llama-server` as the runtime\n\nI looked at several inference paths before picking `llama-server`, the HTTP server from llama.cpp. The constraints were: no cloud, Windows desktop, single end-user installer, cross-vendor GPU support, no Python runtime in the user's install.\n\n`onnxruntime-genai` does not support Gemma 4's architecture yet (per-layer embeddings, variable head dimensions). Tracking issue: [microsoft/onnxruntime-genai#2062](https://github.com/microsoft/onnxruntime-genai/issues/2062). 
A Python sidecar works, but it pulls Python and CUDA into the user's install. That is a non-starter for non-developer users. LLamaSharp's P/Invoke bindings lock you to one llama.cpp build at compile time, so switching from Vulkan to CUDA means re-compiling. Ollama does not support Gemma audio yet ([ollama/ollama#15333](https://github.com/ollama/ollama/issues/15333)). Lemonade is AMD-only.\n\n`llama-server` with the pre-built Vulkan/CUDA Windows binaries hits all of these. Cross-vendor GPU support from one download. A stable OpenAI-compatible HTTP API at `/v1/chat/completions`, with `input_audio` blocks for audio. A release cadence I can manage from in-app updates. [ADR-025](https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/025-gemma4-llamacpp-desktop.md) has the longer version of this decision.\n\n### Picking a variant: the benchmark\n\nThe catalog has five variants. That is what `ggml-org/gemma-4-E2B-it-GGUF` and `ggml-org/gemma-4-E4B-it-GGUF` actually publish, not what I would ideally pick (see [ADR-029](https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/029-gemma4-model-download-ui.md)):\n\n| ModelId | GGUF | Size on disk (with bf16 mmproj) |\n|---|---|---|\n| `gemma-4-E2B-it-Q8_0` | E2B Q8_0 | ~5.5 GiB |\n| `gemma-4-E2B-it-bf16` | E2B BF16 | ~9.6 GiB |\n| `gemma-4-E4B-it-Q4_K_M` | E4B Q4_K_M | ~5.9 GiB |\n| `gemma-4-E4B-it-Q8_0` | E4B Q8_0 | ~8.4 GiB |\n| `gemma-4-E4B-it-bf16` | E4B BF16 | ~15 GiB |\n\nE2B has no Q4_K_M. That asset does not exist in the repo. I learned this when manual testing returned a 404. After that, I rebuilt the catalog from the actual file lists on HuggingFace.\n\nI ran each variant against Whisper (Small, Medium, LargeV3Turbo) on 50 samples of LibriSpeech `test-other`, which is the \"harder\" English split. Same machine, same warm-up methodology, both engines on CUDA. 
Whisper used greedy decoding (beam=1) so the runs are reproducible.\n\n| Rank | Engine | Model | WER % | CER % | RTF | Model load (s) |\n|---|---|---|---|---|---|---|\n| 1 | Whisper (CUDA) | `LargeV3Turbo` | 11.48 | 4.97 | 0.055 | 1.31 |\n| 2 | Whisper (CUDA) | `Medium` | 12.18 | 5.41 | 0.073 | 1.28 |\n| 3 | Whisper (CUDA) | `Small` | 13.10 | 5.87 | 0.034 | 0.71 |\n| 4 | Gemma 4 (llama.cpp) | `E2B-it-BF16` | 13.15 | 4.95 | 0.038 | 6.70 |\n| 5 | Gemma 4 (llama.cpp) | `E4B-it-Q4_K_M` | 13.82 | 5.80 | 0.038 | 6.73 |\n| 6 | Gemma 4 (llama.cpp) | `E4B-it-BF16` | 14.20 | 5.40 | 0.038 | 6.72 |\n| 7 | Gemma 4 (llama.cpp) | `E4B-it-Q8_0` | 14.39 | 5.79 | 0.044 | 9.25 |\n| 8 | Gemma 4 (llama.cpp) | `E2B-it-Q8_0` | 19.22 | 8.95 | 0.315 | 6.74 |\n\nThree things from the table:\n\n- `E2B-it-BF16` has the lowest CER of any model here (4.95%). It barely beats Whisper `LargeV3Turbo` (4.97%), but it still beats it. WER and CER do not always agree, and at this size class Gemma's character-level errors are unusually small.\n- `E4B-it-Q4_K_M` (the shipping default) is at 13.82% WER and 0.038 RTF. That is close to Whisper `Small` (13.10% WER and 0.034 RTF) at about the same on-disk size. The Q4_K_M quant is the right floor for shipping. It gives people Gemma 4 without asking them to download 15 GiB.\n- `E2B-it-Q8_0` is broken on this dataset. RTF 0.315, which is 8x slower than the other Gemma variants. WER 19.22%. The first benchmark attempt crashed `llama-server` mid-sample because the model emitted a stray `<|channel>` reasoning token that the chat-template parser could not handle. I keep this variant selectable in the catalog for experimentation, but the user-facing default avoids it.\n\n### What I picked, and why\n\nThe shipping default is **`gemma-4-E4B-it-Q4_K_M`**. About 5.9 GiB on disk, 13.82% WER on this dataset, 0.038 RTF. E2B-BF16 is technically more accurate, but it takes 9.6 GiB. That is not worth it for a tiny WER edge. 
E4B Q8_0 and BF16 are there for people who have the disk space and want higher-precision weights, though neither beat Q4_K_M on WER in this 50-sample run. E2B-Q8 stays in the catalog with a \"known issue\" tag.\n\nThe model picker shows all five so people can experiment. But the default is the one I would install on a friend's machine without thinking about it.\n\n## Architecture\n\nGemma 4 sits behind the same `ISpeechRecognizer` interface as Whisper. A `DelegatingSpeechRecognizer` (backed by a small `SpeechRecognizerFactory`) picks one or the other at init time, based on the user's engine setting. The `LlamaCppSpeechRecognizer` owns a child `llama-server.exe` process. It posts audio as a base64 WAV blob to `/v1/chat/completions`:\n\n```csharp\n// Excerpt from LlamaCppSpeechRecognizer.cs\nvar body = new\n{\n    messages = new[]\n    {\n        new\n        {\n            role = \"user\",\n            content = new object[]\n            {\n                new { type = \"text\", text = promptText },\n                new { type = \"input_audio\", input_audio = new { data = base64, format = \"wav\" } }\n            }\n        }\n    },\n    stream = false\n};\nusing var response = await _httpClient.PostAsJsonAsync(\n    \"/v1/chat/completions\", body, cancellationToken);\n```\n\nSame capture, same VAD, different recognizer.\n\nThe `llama-server` binary itself is also managed by the app. [ADR-026](https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/026-managed-llama-server-install.md) covers the catalog/installer/registry subsystem that downloads Vulkan or CUDA builds from llama.cpp's GitHub Releases on demand. Users do not pick paths in a folder browser. They pick a backend in a list and hit Install. That subsystem is about 1,800 lines on its own and probably deserves its own post.\n\nThe transcription prompt is also user-editable. 
[ADR-030](https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/030-configurable-gemma4-prompts.md) turned the hardcoded prompt into a small registry with a built-in default and a `{language}` placeholder. The placeholder is there for a future feature that picks the source language from the active keyboard layout.\n\n## What this taught me\n\nThree things I learned from doing this:\n\n- **The model card's headline numbers do not transfer to your stack.** Google's reported 4.17% WER on LibriSpeech-test-clean is real. But the path from \"the model can do 4.17%\" to \"my app does 13.82% on noisy audio with the quantization that fits on user disks\" goes through five variant choices, a runtime choice, and the measurement methodology. Benchmark on your own stack.\n- **Most of the work is in the catalog, not in the inference call.** The actual `/v1/chat/completions` HTTP call is about 30 lines of code. The variant catalog, the download manager, the side-by-side install of llama-server backends, and the prompt registry are where most of the engineering went.\n- **Asymmetric quantization coverage is the rule, not the exception.** E2B has no Q4_K_M in the published GGUFs. The catalog has to reflect what is actually on HuggingFace, not what would be theoretically nicest.\n\n## Try Parlotype\n\n- Repo: [github.com/mdemin729/parlotype](https://github.com/mdemin729/parlotype)\n- Windows only for now. .NET 10, MIT licensed.\n- Pick Gemma 4 in Settings -> Speech Engine. The in-app installer downloads `llama-server` and the GGUF for you.\n\n*Maksim Demin is a .NET engineer building Parlotype, a voice-to-text desktop app. 
He writes about cross-platform .NET, Avalonia, and local AI.*", "url": "https://wpnews.pro/news/shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant", "canonical_source": "https://dev.to/mdemin729/shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant-model-selection-tour-2l8i", "published_at": "2026-05-24 03:51:31+00:00", "updated_at": "2026-05-24 04:01:08.892339+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "open-source", "developer-tools"], "entities": ["Google", "Gemma 4", "Parlotype", "Whisper.net", "Silero VAD", "Avalonia UI", ".NET", "ggml-org"], "alternates": {"html": "https://wpnews.pro/news/shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant", "markdown": "https://wpnews.pro/news/shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant.md", "text": "https://wpnews.pro/news/shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant.txt", "jsonld": "https://wpnews.pro/news/shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant.jsonld"}}