# LocalFind Gemma — AI-Powered Semantic Search and Chat for Your Local Files

> Source: <https://dev.to/malik_the_dev/localfind-gemma-ai-powered-semantic-search-and-chat-for-your-local-files-4fi9>
> Published: 2026-05-23 19:29:07+00:00

This is a submission for the Gemma 4 Challenge: Build with Gemma 4
LocalFind Gemma is a fully local, privacy-first semantic search engine for your own files — documents, images, and audio — powered by Gemma 4 running on Ollama.
Most search tools match filenames or keywords. LocalFind Gemma understands content:
nomic-embed-text-v2-moe
embedding model supports ~100
languages in a shared vector space. Search in French, find English documents.Supported file types: PDF, DOCX, TXT, MD, CSV, JPG, PNG, GIF, BMP, WEBP, MP3, WAV, FLAC, M4A.
Everything — Gemma 4, Whisper, the ChromaDB vector store — runs on your machine. No API keys, no cloud, no data leaving your device. There's also an optional Claude Desktop integration via MCP for files you're comfortable sharing with a third party.
https://github.com/maliklovable1-spec/localfind-gemma
Gemma 4 isn't just the chat model here — it's active at three distinct points in the pipeline:
1. Index time: captioning every image
When you sync a folder, each image is sent to Gemma 4 via Ollama's vision API. The caption is embedded and stored permanently in ChromaDB. Future searches use the stored caption; the model isn't called again unless you re-sync. This means fast search without repeated inference.
2. Agent reasoning and tool use
The conversational agent runs on gemma4:e4b
(the recommended default). It decides when to search, what query to issue, and how to synthesise results into a direct answer rather than just returning file paths.
I chose e4b over e2b because it follows tool-use instructions more reliably — which matters a lot in an agentic loop where the model needs to decide between search, image reading, and response synthesis. e2b is also supported for users with less RAM (~12 GB vs 16 GB).
3. Live image reading
When the agent finds an image relevant to your question, it sends the image bytes directly to Ollama's native /api/chat
API with your question as context. Gemma 4 reads the image and the agent uses that to answer you. The bytes go from your disk to your local Ollama process —nowhere else.
A note on audio
Gemma 4 E2B and E4B natively support audio transcription at the architecture level — multilingual, up to 30 seconds, built into the model. LocalFind Gemma currently uses Whisper for audio because
Ollama doesn't expose audio input via its API yet. Once Ollama ships that support
([issue #11798(https://github.com/ollama/ollama/issues/11798)), the transcription backend can
switch to Gemma 4 — the architecture is already designed with that transition in mind, though it will require some code changes depending on how Ollama exposes the audio API.
