LocalFind Gemma — AI-Powered Semantic Search and Chat for Your Local Files **Summary:** LocalFind Gemma is a privacy-focused, fully local semantic search engine for personal files (documents, images, audio) powered by Gemma 4 via Ollama. Unlike keyword-based tools, it understands content using embeddings for multilingual search and employs Gemma 4 at three pipeline stages: indexing image captions, driving an agentic reasoning and tool-use system, and performing live image reading. All processing occurs on the user's machine with no cloud dependency, though an optional Claude Desktop integration is available. This is a submission for the Gemma 4 Challenge: Build with Gemma 4 LocalFind Gemma is a fully local, privacy-first semantic search engine for your own files — documents, images, and audio — powered by Gemma 4 running on Ollama. Most search tools match filenames or keywords. LocalFind Gemma understands content: nomic-embed-text-v2-moe embedding model supports ~100 languages in a shared vector space. Search in French, find English documents.Supported file types: PDF, DOCX, TXT, MD, CSV, JPG, PNG, GIF, BMP, WEBP, MP3, WAV, FLAC, M4A. Everything — Gemma 4, Whisper, the ChromaDB vector store — runs on your machine. No API keys, no cloud, no data leaving your device. There's also an optional Claude Desktop integration via MCP for files you're comfortable sharing with a third party. https://github.com/maliklovable1-spec/localfind-gemma Gemma 4 isn't just the chat model here — it's active at three distinct points in the pipeline: 1. Index time: captioning every image When you sync a folder, each image is sent to Gemma 4 via Ollama's vision API. The caption is embedded and stored permanently in ChromaDB. Future searches use the stored caption; the model isn't called again unless you re-sync. This means fast search without repeated inference. 2. Agent reasoning and tool use The conversational agent runs on gemma4:e4b the recommended default . It decides when to search, what query to issue, and how to synthesise results into a direct answer rather than just returning file paths. I chose e4b over e2b because it follows tool-use instructions more reliably — which matters a lot in an agentic loop where the model needs to decide between search, image reading, and response synthesis. e2b is also supported for users with less RAM ~12 GB vs 16 GB . 3. Live image reading When the agent finds an image relevant to your question, it sends the image bytes directly to Ollama's native /api/chat API with your question as context. Gemma 4 reads the image and the agent uses that to answer you. The bytes go from your disk to your local Ollama process —nowhere else. A note on audio Gemma 4 E2B and E4B natively support audio transcription at the architecture level — multilingual, up to 30 seconds, built into the model. LocalFind Gemma currently uses Whisper for audio because Ollama doesn't expose audio input via its API yet. Once Ollama ships that support issue 11798 https://github.com/ollama/ollama/issues/11798 , the transcription backend can switch to Gemma 4 — the architecture is already designed with that transition in mind, though it will require some code changes depending on how Ollama exposes the audio API.