{"slug": "notari-voice-notes-that-never-leave-your-phone-structured-by-gemma-4", "title": "Notari — voice notes that never leave your phone, structured by Gemma 4", "summary": "Notari is an Android app that records, transcribes, and structures voice notes into Markdown format entirely on-device, without ever writing audio to disk or requesting internet permission. The app uses Android's SpeechRecognizer for transcription and Google's Gemma 4 E2B model running locally via LiteRT-LM to convert transcripts into structured JSON notes. All processing occurs locally on the device, with the audio held only in RAM and the app's privacy guarantees being verifiable through airplane mode testing, manifest inspection, and network monitoring.", "body_md": "This is a submission for the Gemma 4 Challenge: Build with Gemma 4\nNotari is an Android app that records a voice note, transcribes it, and turns it into a clean, structured Markdown note — entirely on-device. The audio is held in RAM, never written to disk, and the app doesn't even request the INTERNET permission.\nI keep voice memos: meeting decisions, half-formed ideas at 11pm, reminders I'll forget by the time I'm home. The two app categories that should solve this don't:\nSo I built the third option. Voice notes are the kind of content where privacy isn't a marketing veneer — it's a precondition for using the tool at all. If the app feels like it might leak, I won't dictate the thing that matters most. So the privacy guarantee had to be load-bearing, not optional.\nThe pipeline is short on purpose:\nMic ─▶ Android SpeechRecognizer ─▶ Gemma 4 E2B (LiteRT-LM) ─▶ JSON ─▶ Room\nSpeechRecognizer runs in continuous-listen mode so the user can pause naturally without the recognizer giving up. The OS owns the audio buffer; the app only ever sees the text Flow.\nmentions[], a Markdown body — and nothing else.\nNo step touches the network. 
The INTERNET permission isn't declared in the merged manifest, and a CI gate fails the build if anyone ever adds it.\ndocs/ in the repo\nThe privacy promise is verifiable. Run it in airplane mode. Inspect the manifest. Sniff the network. Nothing leaves the device — that's the whole point.\nI chose Gemma 4 E2B (Effective-2B, INT4-quantized, ~1.5 GB on disk) running locally via LiteRT-LM. Three reasons made E2B the right fit — not E4B, not a cloud model:\ncom.google.ai.edge.litertlm:litertlm-android reads .litertlm files directly, supports GPU and CPU backends, and exposes the Engine / Session API the rest of the app is built around.\nThe app is opinionated: it transforms the transcript faithfully, never paraphrases meaning, and never invents dates, names, or facts. That guarantee is enforced by the prompt and verified by adversarial fixtures in core/inference/src/test/resources/prompt-eval/.\nThe model is asked for one thing: a JSON object matching a fixed schema. No prose, no Markdown fences, no \"Sure! Here's the structured note:\" preamble.\n{\n\"language\": \"<bcp47>\",\n\"title\": \"<short, no trailing punctuation>\",\n\"tags\": [\"<lowercase-kebab>\"],\n\"mentions\": [\n{ \"surface_form\": \"<datetime span>\", \"iso_resolved\": \"<ISO-8601 or null>\" }\n],\n\"body_markdown\": \"<Markdown>\"\n}\nThe prompt is versioned in core/inference/src/main/assets/prompts/structure_note_vN.txt and referenced from AssetPromptLoader.ACTIVE_PROMPT. Every change is a versioned, file-based change with a corresponding ADR. The active version is v10, evolved through ten rounds of real-corpus testing — and the evolution itself is most of what I learned about the model:\nCURRENT TIMESTAMP block so the model could resolve \"tomorrow at 3pm\" to a real ISO instant.\nmentions[] with named entities, dropping checkboxes for spoken commitments, collapsing enumerations into prose, never using headings on multi-topic notes). 
The fix in every case was changing the framing from \"you may\" to \"REQUIRED\".\nEven with a strict prompt, real E2B output has variance: trailing commas, occasional Markdown fences, an extra explanation after the closing brace. The parser strips any leading or trailing Markdown code fences, trims everything before the first { and after the last balanced }, and hands the cleaned slice to Moshi configured as lenient. If that fails, we retry once with a stricter \"RETURN JSON ONLY. NO OTHER TEXT.\" preamble; if that also fails, we fall back to saving the raw transcript as a plain-text note. The user always keeps their content.\nThe most important thing this app does is not write audio to disk. Ever. SpeechRecognizer owns the buffer; the app only ever sees a Flow<TranscriptChunk> of strings. When the user stops, awaitClose calls recognizer.destroy() and the buffer goes with it. There is no .wav, .m4a, .aac, or .tmp file in the app's data directory at any point. The check is one line:\nadb shell run-as com.voicenotemd.debug find /data/data/com.voicenotemd.debug -type f\nThe output lists the Room database, the DataStore settings, and the model files — and nothing audio. I verified this live before, during, and after a recording.\nLiteRT-LM supports both Backend.GPU() and Backend.CPU(). GPU is faster on decode, but GPU init fails on some devices (the Pixel 6a's Mali-G78 in my testing). The session factory probes GPU and recovers to CPU:\nrunCatching { engineFactory(Backend.GPU()) }\n.recoverCatching { engineFactory(Backend.CPU()) }\n.getOrThrow()\nOn the reference Pixel 6a (CPU fallback) a 1000-character note structures in ~50-60s; on a device that gets the GPU path it's ~15-25s.\nThe engine is ~1.5 GB resident — most of a 4 GB device's budget. LiteRtLmGemmaSession implements ComponentCallbacks2 and releases the engine on onTrimMemory(TRIM_MEMORY_BACKGROUND), reloading lazily. 
To hide cold-start, warmUp() is fire-and-forget from CaptureViewModel.init — by the time the user has tapped the mic and started talking, the engine is already loading.\nThe prompt detects the input language and produces the title, tags, body, and datetime surface forms in that language. Datetimes resolve against the device timezone, so \"domani alle 15\", \"tomorrow at 3pm\", and \"mañana a las 3\" all produce real ISO instants. Supported at v1: English, Italian, Spanish, French, German, Portuguese. The UI is English-only in v1 — UI localization is a roadmap item.\nEvery note is, by construction, a portable Markdown file with YAML frontmatter (Note.toMarkdownWithFrontmatter()). Drop it into an Obsidian vault, a Logseq graph, or any folder you sync — tags, resolved datetimes, headings, and checkboxes all carry with it. The privacy promise isn't just \"we don't send your data\", it's \"your data was always yours\".\nFraming matters more than I expected. Going from \"use checkboxes for tasks\" to \"REQUIRED: every I need to / must is a - [ ] checkbox\" was the single largest quality jump. E2B respects directives far more reliably than permissions.\nFew-shot examples are tokens, not magic — and they can leak. Early prompts had eight to ten examples; E2B over-mimicked their length and, worse, sometimes copied their content into the user's note. Cutting to three short, low-salience examples fixed both the bloat and the leakage. This was the scariest bug of the project precisely because it violated the core \"transform, don't invent\" promise.\nSchema is the strongest hint. An inline schema block plus three worked examples beats every \"be sure to return valid JSON\" instruction. The strict-retry pass works because it isn't asking for new content — just restating the schema with louder caps lock.\nIt can do temporal reasoning if you give it the time. Without CURRENT TIMESTAMP in the prompt, every relative date came back null. 
With it, ~95% of relative dates resolve correctly across the six languages.\nIt can't be a fact source. Anything that requires recall — \"the dentist I always go to\" — is hallucination territory. The contract is transform, never augment, and I verify it with adversarial fixtures.\nLatency is real but tameable. ~60s on a Pixel 6a (CPU) sounds long until you remember the user just spent 60 seconds dictating. Pre-warming the engine and showing a clear progress affordance turns it into \"I see something happening\" rather than \"is this frozen?\". On the GPU path it's ~15-25s.\nThree upgrades I deliberately cut from v1 to ship within the competition window:\nSpeechRecognizer with Gemma 4 E2B's multimodal audio input so transcription, language detection, and structuring all happen in one forward pass.\nmentions[] could surface an on-device \"Add to calendar\" affordance via Intent.ACTION_INSERT. Still no network.\nNotari was built solo across the two weeks of the Gemma 4 Challenge. The model file is downloaded once, manually, from Google AI — no analytics, no telemetry, no surprises. The name takes its cue from the Latin notarius — the historically trusted recorder of spoken statements. 
That, in three syllables, is the product.", "url": "https://wpnews.pro/news/notari-voice-notes-that-never-leave-your-phone-structured-by-gemma-4", "canonical_source": "https://dev.to/oierreaemme/notari-voice-notes-that-never-leave-your-phone-structured-by-gemma-4-2dac", "published_at": "2026-05-23 22:37:20+00:00", "updated_at": "2026-05-23 23:02:31.214358+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "open-source", "products"], "entities": ["Notari", "Gemma 4", "Gemma 4 E2B", "Android", "LiteRT-LM", "Room", "SpeechRecognizer"], "alternates": {"html": "https://wpnews.pro/news/notari-voice-notes-that-never-leave-your-phone-structured-by-gemma-4", "markdown": "https://wpnews.pro/news/notari-voice-notes-that-never-leave-your-phone-structured-by-gemma-4.md", "text": "https://wpnews.pro/news/notari-voice-notes-that-never-leave-your-phone-structured-by-gemma-4.txt", "jsonld": "https://wpnews.pro/news/notari-voice-notes-that-never-leave-your-phone-structured-by-gemma-4.jsonld"}}