Show HN: I run a vision model on every screenshot, locally, on a 4GB GPU

ScreenMind, an open-source AI memory tool, analyzes every screenshot locally using Gemma 4 on a 4GB GPU, offering 100% private, searchable screen history as a privacy-first alternative to Microsoft's Recall.

Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory 100% local. 100% private. Zero cloud dependencies. Features -features · · -how-gemma-4-is-used Gemma 4 Deep Dive · -quick-start Quick Start · -architecture Architecture · -agent-platform Agent Platform · -mcp-server-claude--cursor--vs-code MCP API | Agents | Chat with your memory | |---|---| Microsoft showed the world wants screen-aware AI with Recall.But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative — every screenshot analyzed, every insight generated, every search result — all computed locally using Gemma 4's multimodal capabilities.It's not just a screen recorder. It's an AI memoryyou can talk to, search through, and build automations on top of. 📸 Smart Capture — Content-change detection, not a fixed timer. Captures when your screen actually changes. 🔬 Gemma 4 Vision Analysis — Every screenshot analyzed: app detection, activity categorization, mood, scene description, spatial layout regions. 🔍 Hybrid Search — Semantic embeddings MiniLM + FTS5 keyword search. Find anything by meaning , not just keywords. 💬 Chat with Memory — Conversational RAG with follow-up support. Ask "what did Ishaa say on Discord?" → get the actual message. 🎙️ Voice Memos — Hold Ctrl+Shift+V → Gemma 4's native audio encoder transcribes. Screenshot captured alongside. 🎤 Meeting Transcription — Auto-detects Zoom/Teams/Meet, records audio, transcribes, generates structured summaries. 📊 Analytics Dashboard — Category breakdown, top apps, hourly heatmap, meeting stats, focus metrics. ⏪ Day Rewind — Timelapse playback of your entire day with play/pause/scrub/speed controls. Three Analysis Modes — Accurate ~76s, deep thinking + layout , Balanced ~40s, thinking , or Fast ~12s, no thinking . You choose. Per-App pHash Cache — 3-tier caching with app-aware staleness. Communication apps refresh faster than IDEs. Significantly fewer inference calls. Chat-First GPU Priority — Chat cancels in-flight analysis instantly. GPU freed in <1s. Auto-Pause Heavy Apps — Games, video editors, 3D software detected → capture pauses automatically. 100% Local — All data stays on your machine. Zero network calls after initial model download. No telemetry. Ever. Sensitive Data Filter — Auto-redacts credit cards, SSNs, API keys, passwords before storage. Encryption at Rest — AES encryption for screenshots Fernet + OS keyring . Dashboard PIN Lock — Session-based auth with configurable auto-lock timeout. Incognito Mode — One-click pause. Nothing recorded. 🔌 Integrations & Extensibility | Integration | Description | |---|---| 🤖 Agent Platform | Build automations in Markdown English or Python. Drop a file, get an agent. | 🔌 MCP Server | Expose screen history to Claude Desktop, Cursor, VS Code | 📓 Obsidian | Auto-sync daily summaries to your vault | 📋 Notion | Push summaries to a Notion database | 🪝 Webhooks | Fire events to Slack, Discord, IFTTT HMAC signed, auto-retry | 🔔 Smart Notifications | Distraction alerts, break reminders | ⭐ Auto-Bookmark | Keyword triggers git push , deploy auto-flag important moments | | Hotkey | Action | |---|---| Ctrl+Shift+B | 📸 Instant bookmarked capture | Ctrl+Shift+P | ⏸ Toggle pause/resume | Ctrl+Shift+V | 🎤 Hold to record voice memo | All hotkeys customizable from Settings. Gemma 4 E2B is not a bolt-on — it's architecturally load-bearing. ScreenMind uses all three modalities : Every screenshot is sent to Gemma 4 with OCR context. It returns structured JSON: - App name, activity category, summary, detailed context - Mood classification, confidence score - Rich scene description every visible element inventoried - Layout regions sidebar, chat area, toolbar boundaries Three modes: Accurate — single call with thinking ~76s . Best layout detection. Balanced — thinking enabled, analysis-only ~40s . Richer descriptions than Fast. Fast — no-thinking prefill trick ~12s . Layout via OCR clustering instead. Gemma 4 E2B has a native audio encoder. ScreenMind uses it for: - Voice memo transcription hold hotkey → speak → release - Meeting transcription 15s chunks, map-reduce summarization for long meetings No Whisper dependency. One model handles everything. Daily summaries with deep reasoning think=True Chat answers grounded in actual screen data text-first RAG with vision fallback Agent execution — Gemma processes markdown agent prompts with injected screen data | Constraint | Why It Rules Out Alternatives | |---|---| Must run continuously in background | Rules out 12B+ models too heavy | Must understand screenshots natively | Rules out text-only models | Must stay 100% local for privacy | Rules out cloud APIs | Must handle audio natively | Rules out models without audio encoder | Must be fast enough for 30s cycle | E2B processes in 12-76s depending on mode | Gemma 4 E2B is the only model that checks all five boxes. Requirements:Python 3.10+ · GPU recommended 4GB+ VRAM · ~5GB disk for model git clone https://github.com/ayushh0110/ScreenMind.git cd ScreenMind python -m venv venv venv\Scripts\activate Windows source venv/bin/activate macOS/Linux pip install -r requirements.txt python main.py 3️⃣ Open → http://127.0.0.1:7777 http://127.0.0.1:7777 http://127.0.0.1:7777 http://127.0.0.1:7777 On first run, ScreenMind will: - Auto-download Gemma 4 E2B GGUF model ~5GB, one time - Start llama-server in background - Show the welcome screen to set up an optional PIN - Create ~/.screenmind/ for data storage ⚙️ Optional: Configure via .env cp .env.example .env Edit capture interval, blocked apps, hotkeys, etc. Or configure everything from the Settings tab in the dashboard. ┌─────────────────────────────────────────────────────────────────────┐ │ ScreenMind │ │ │ │ ┌────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │ │ │ Capture │───▶│ Async Queue │───▶│ Analysis Worker │ │ │ │ Worker │ │ max: 100 │ │ │ │ │ │ │ └──────────────┘ │ ┌───────────────────┐ │ │ │ │ • Screen │ │ │ Per-App pHash │ │ │ │ │ • Window │ │ │ Cache 3-tier │ │ │ │ │ • Dedup │ │ └───────────────────┘ │ │ │ │ • A11y │ │ │ │ │ │ │ • Privacy │ │ ▼ │ │ │ └────────────┘ │ ┌───────────────────┐ │ │ │ │ │ EasyOCR │ │ │ │ ┌────────────┐ │ │ text extract │ │ │ │ │ Audio │ │ └───────────────────┘ │ │ │ │ Worker │ │ │ │ │ │ │ │ │ ▼ │ │ │ │ • Meeting │ │ ┌───────────────────┐ │ │ │ │ detect │ │ │ Gemma 4 E2B │ │ │ │ │ • Record │ │ │ via llama.cpp │ │ │ │ │ • Transcr. │ │ │ Vision + Audio │ │ │ │ └────────────┘ │ └───────────────────┘ │ │ │ │ │ │ │ │ ┌────────────┐ │ ▼ │ │ │ │ Agent │ │ ┌───────────────────┐ │ │ │ │ Scheduler │ │ │ Layout Analyzer │ │ │ │ │ │ │ │ spatial OCR │ │ │ │ │ • .md AI │ │ └───────────────────┘ │ │ │ │ • .py code │ │ │ │ │ │ └────────────┘ │ ▼ │ │ │ │ ┌───────────────────┐ │ │ │ │ │ MiniLM-L6-v2 │ │ │ │ │ │ embeddings │ │ │ │ │ └───────────────────┘ │ │ │ └─────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌───────────────────┐ │ │ │ SQLite WAL │ │ │ │ + FTS5 index │ │ │ └─────────┬─────────┘ │ │ │ │ │ ┌───────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌───────────────────────────────────────────────────────────────┐ │ │ │ FastAPI REST Server │ │ │ │ /timeline · /search · /chat · /stats · /agents · /mcp │ │ │ │ │ │ │ │ ┌───────────────────────────────────────────────────────┐ │ │ │ │ │ Web Dashboard Vanilla JS SPA │ │ │ │ │ │ Timeline · Chat · Search · Analytics · Agents · Settings │ │ │ │ └───────────────────────────────────────────────────────┘ │ │ │ └───────────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────────┘ Screenshot → EasyOCR text → Gemma 4 E2B understanding → MiniLM embeddings → SQLite + FTS5 ↑ OCR text fed as context Gemma sees image + reads text Four AI models working in concert, with Gemma 4 as the brain: EasyOCR — extracts raw screen text Gemma 4 E2B — understands what you're doing vision + reasoning MiniLM-L6-v2 — generates semantic vectors for natural language search FTS5 — indexes text for instant keyword search ScreenMind includes a full agent/plugin system. Build any automation on top of your screen data. | Mode | File Type | For | Example | |---|---|---|---| | 🤖 AI Agent | .md | Everyone | Write a prompt in English → Gemma runs it on your data | | 🐍 Python Plugin | .py | Developers | Full code with SDK access, state persistence, LLM calls | --- name: Daily Focus Report schedule: every 6h data: timeline, apps, mood output: local, obsidian --- Analyze my screen activity and generate a focus report: - How many hours of deep work vs shallow work? - What were my main distractions? - Give me a focus score out of 10. Drop this file in ~/.screenmind/agents/ — it runs automatically. python from screenmind sdk import ScreenMindSDK sdk = ScreenMindSDK "my-tracker" Get today's activities filtered by app activities = sdk.get activities app="Chrome", limit=20 Persistent state across runs last count = sdk.load state "url count", 0 urls = sdk.get urls visited sdk.save state "url count", len urls Ask Gemma GPU-safe — waits for idle insight = sdk.ask gemma f"Summarize these URLs: {urls}" print insight Markdown agents declare what data they need: | Selector | Injects | |---|---| timeline | Recent activities with timestamps, apps, summaries | apps | App usage counts + category breakdown | urls | URLs visited extracted from browser address bars | meetings | Meeting summaries and durations | mood | Mood/sentiment from screen analysis | Data injection auto-scales to your model's context window. daily-journal.md — First-person journal entry from your day focus-report.md — Focus score, deep work hours, distractions meeting-actions.md — Extract action items from meeting transcripts code-changelog.md — Summarize coding activity commits, files, repos ScreenMind exposes your screen history to any MCP-compatible AI tool: python mcp server.py stdio transport Claude Desktop config ~/.claude/claude desktop config.json : { "mcpServers": { "screenmind": { "command": "python", "args": "C:/path/to/screenmind/mcp server.py" } } } | Tool | Description | |---|---| search screen | Semantic + keyword search across all history | get recent activity | Last N activities with full details | get activity by time | Activities for a specific date/time range | get daily summary | AI-generated daily summary | capture now | Trigger instant screenshot | get stats | Usage statistics | search audio | Search meeting transcripts | get screenshot | Retrieve screenshot path by activity ID | Full Swagger docs at http://127.0.0.1:7777/docs | Method | Endpoint | Description | |---|---|---| GET | /api/status | System health, worker stats | GET | /api/timeline?date=2026-05-21 | Activities for a date | GET | /api/search?q=debugging auth | Hybrid semantic + keyword search | POST | /api/chat | Conversational AI with screen memory SSE stream | GET | /api/stats?range=day | Analytics categories, apps, meetings | GET | /api/rewind?date=2026-05-21 | Timelapse frames | POST | /api/summary/generate | Generate AI daily summary | GET | /api/agents | List all agents | POST | /api/agents/{name}/run | Trigger agent execution | POST | /api/capture/pause | Pause capture | POST | /api/incognito/toggle | Toggle incognito mode | All settings configurable via .env , environment variables, or the Settings dashboard persists to settings.json . | Variable | Default | Description | |---|---|---| CAPTURE INTERVAL | 40 | Seconds between captures | ANALYSIS MODE | merged | merged accurate, ~76s or fast ~12s | PERFORMANCE MODE | balanced | GPU layers: minimal / balanced / maximum | BLOCKED APPS | empty | Comma-separated apps to never capture | MEETING TRANSCRIPTION | false | Auto-transcribe when meeting apps detected | RETENTION DAYS | 7 | Auto-delete data older than N days 0 = forever | ENCRYPTION ENABLED | false | Encrypt screenshots at rest | SENSITIVE FILTER ENABLED | true | Redact credit cards, SSNs, API keys | See .env.example for the full list. | Layer | Technology | Why | |---|---|---| Vision + Audio AI | Gemma 4 E2B via llama.cpp | Only model with vision + audio + reasoning that runs locally on 4GB VRAM | Inference Server | llama-server llama.cpp | Direct GGUF inference, OpenAI-compatible API | OCR | EasyOCR | Extracts screen text fed to Gemma as context | Embeddings | all-MiniLM-L6-v2 | 80MB, runs on CPU, 384-dim vectors for semantic search | Backend | FastAPI + Uvicorn | Async-first, auto-generated API docs | Database | SQLite WAL + FTS5 | Zero-config, concurrent reads, full-text search | Capture | mss + ctypes/UI Automation | Native screen capture + accessibility text extraction | Frontend | Vanilla JS + CSS | No build step, instant load, dark glassmorphism UI | Platform | Windows / macOS / Linux | Abstraction layer with OS-specific adapters | | Scenario | Behavior | |---|---| llama-server not running | Auto-starts on launch. Captures continue; analysis retried with backoff. | Model not downloaded | Auto-downloads GGUF on first start via HuggingFace. | GPU out of memory | Detects OOM, retries with delay, re-queues on persistent failure. | Duplicate frames | pHash dedup skips identical screenshots threshold: 8 hamming distance . | Stale queue items | Captures 3 min old auto-skipped. Backfilled during idle. | App in blocklist | Silently skips — no screenshot saved. | Meeting app closed | Process-alive check + silence detection + 5-min hard timeout. | Chat during analysis | Cancels in-flight inference, frees GPU in <1s, re-queues analysis. | Crash recovery | Stale meetings cleaned on startup. Unanalyzed entries backfilled. | The web dashboard at http://127.0.0.1:7777 features: Timeline — Browse activities by date with thumbnails, AI summaries, category badges Chat — Conversational AI with screen memory. Ask anything about your history. Search — Semantic + keyword hybrid search with OCR highlighting on screenshots Analytics — Category charts, top apps, hourly heatmap, meeting stats Rewind — Timelapse player with play/pause/scrub/speed controls Memos — Voice memo list with audio player Agents — Create, edit, run, and monitor agents Settings — 8 organized sections: Shortcuts, Capture, AI, Audio, Privacy, Automation, Integrations, Storage Dark glassmorphism UI. No build step. Instant load. Contributions welcome Here are some high-impact areas: - 🍎 macOS/Linux testing — platform adapters exist, need real hardware testing - 🐳 Docker container — one-command setup - 🧩 Community agent registry — share agents between users - 🌐 Browser extension — richer URL/tab context - 📤 Export formats — Markdown, CSV, JSON If you find ScreenMind useful, please consider: ⭐ Star this repo — it helps others discover the project 🍴 Fork it — build your own agents and features 🐛 Report issues — help us improve 📣 Share it — tell others about privacy-first AI MIT License — see LICENSE /ayushh0110/ScreenMind/blob/main/LICENSE for details. Built with 🧠 Gemma 4 E2B · 🔒 100% Local · 🚀 Zero Cloud Dependencies Vision + Audio + Reasoning — all three modalities, one model, your machine. Made with ❤️ by ayushh0110