{"slug": "show-hn-i-run-a-vision-model-on-every-screenshot-locally-on-a-4gb-gpu", "title": "Show HN: I run a vision model on every screenshot, locally, on a 4GB GPU", "summary": "ScreenMind, an open-source AI memory tool, analyzes every screenshot locally using Gemma 4 on a 4GB GPU, offering 100% private, searchable screen history as a privacy-first alternative to Microsoft's Recall.", "body_md": "**Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory**\n\n**100% local. 100% private. Zero cloud dependencies.**\n\nMicrosoft showed with Recall that the world wants screen-aware AI. But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative: every screenshot analysis, every insight, and every search result is computed locally using Gemma 4's multimodal capabilities. It's not just a screen recorder. It's an AI memory you can talk to, search through, and build automations on top of.\n\n- **📸 Smart Capture**: Content-change detection, not a fixed timer. Captures when your screen *actually* changes.\n- **🔬 Gemma 4 Vision Analysis**: Every screenshot is analyzed for app detection, activity categorization, mood, scene description, and spatial layout regions.\n- **🔍 Hybrid Search**: Semantic embeddings (MiniLM) + FTS5 keyword search. Find anything by *meaning*, not just keywords.\n- **💬 Chat with Memory**: Conversational RAG with follow-up support. Ask \"what did Ishaa say on Discord?\" → get the actual message.\n- **🎙️ Voice Memos**: Hold `Ctrl+Shift+V` → Gemma 4's native audio encoder transcribes. 
Screenshot captured alongside.\n- **🎤 Meeting Transcription**: Auto-detects Zoom/Teams/Meet, records audio, transcribes, generates structured summaries.\n- **📊 Analytics Dashboard**: Category breakdown, top apps, hourly heatmap, meeting stats, focus metrics.\n- **⏪ Day Rewind**: Timelapse playback of your entire day with play/pause/scrub/speed controls.\n\n- **Three Analysis Modes**: Accurate (~76s, deep thinking + layout), Balanced (~40s, thinking), or Fast (~12s, no thinking). You choose.\n- **Per-App pHash Cache**: 3-tier caching with app-aware staleness. Communication apps refresh faster than IDEs. Significantly fewer inference calls.\n- **Chat-First GPU Priority**: Chat cancels in-flight analysis instantly. GPU freed in <1s.\n- **Auto-Pause Heavy Apps**: Games, video editors, 3D software detected → capture pauses automatically.\n\n- **100% Local**: All data stays on your machine. Zero network calls after initial model download. No telemetry. Ever.\n- **Sensitive Data Filter**: Auto-redacts credit cards, SSNs, API keys, passwords before storage.\n- **Encryption at Rest**: AES encryption for screenshots (Fernet + OS keyring).\n- **Dashboard PIN Lock**: Session-based auth with configurable auto-lock timeout.\n- **Incognito Mode**: One-click pause. Nothing recorded.\n\n**🔌 Integrations & Extensibility**\n\n| Integration | Description |\n|---|---|\n| 🤖 Agent Platform | Build automations in Markdown (English) or Python. Drop a file, get an agent. 
|\n| 🔌 MCP Server | Expose screen history to Claude Desktop, Cursor, VS Code |\n| 📓 Obsidian | Auto-sync daily summaries to your vault |\n| 📋 Notion | Push summaries to a Notion database |\n| 🪝 Webhooks | Fire events to Slack, Discord, IFTTT (HMAC signed, auto-retry) |\n| 🔔 Smart Notifications | Distraction alerts, break reminders |\n| ⭐ Auto-Bookmark | Keyword triggers (`git push`, `deploy`) auto-flag important moments |\n\n| Hotkey | Action |\n|---|---|\n| `Ctrl+Shift+B` | 📸 Instant bookmarked capture |\n| `Ctrl+Shift+P` | ⏸ Toggle pause/resume |\n| `Ctrl+Shift+V` | 🎤 Hold to record voice memo |\n\nAll hotkeys customizable from Settings.\n\nGemma 4 E2B is not a bolt-on; it's architecturally load-bearing. ScreenMind uses **all three modalities**:\n\nEvery screenshot is sent to Gemma 4 with OCR context. It returns structured JSON:\n\n- App name, activity category, summary, detailed context\n- Mood classification, confidence score\n- Rich scene description (every visible element inventoried)\n- Layout regions (sidebar, chat area, toolbar boundaries)\n\n**Three modes:**\n\n- **Accurate**: single call with thinking (~76s). Best layout detection.\n- **Balanced**: thinking enabled, analysis-only (~40s). Richer descriptions than Fast.\n- **Fast**: no-thinking prefill trick (~12s). Layout via OCR clustering instead.\n\nGemma 4 E2B has a native audio encoder. ScreenMind uses it for:\n\n- Voice memo transcription (hold hotkey → speak → release)\n- Meeting transcription (15s chunks, map-reduce summarization for long meetings)\n\nNo Whisper dependency. 
One model handles everything.\n\n- **Daily summaries** with deep reasoning (`think=True`)\n- **Chat answers** grounded in actual screen data (text-first RAG with vision fallback)\n- **Agent execution**: Gemma processes Markdown agent prompts with injected screen data\n\n| Constraint | Why It Rules Out Alternatives |\n|---|---|\n| Must run continuously in background | Rules out 12B+ models (too heavy) |\n| Must understand screenshots natively | Rules out text-only models |\n| Must stay 100% local for privacy | Rules out cloud APIs |\n| Must handle audio natively | Rules out models without audio encoder |\n| Must be fast enough for 30s cycle | E2B processes in 12-76s depending on mode |\n\nGemma 4 E2B is the only model that checks all five boxes.\n\nRequirements: Python 3.10+ · GPU recommended (4GB+ VRAM) · ~5GB disk for model\n\n```\ngit clone https://github.com/ayushh0110/ScreenMind.git\ncd ScreenMind\n\npython -m venv venv\nvenv\\Scripts\\activate        # Windows\n# source venv/bin/activate   # macOS/Linux\n\npip install -r requirements.txt\npython main.py\n```\n\n#### 3️⃣ Open → [http://127.0.0.1:7777](http://127.0.0.1:7777)\n\nOn first run, ScreenMind will:\n\n- Auto-download the Gemma 4 E2B GGUF model (~5GB, one time)\n- Start `llama-server` in the background\n- Show the welcome screen to set up an optional PIN\n- Create `~/.screenmind/` for data storage\n\n**⚙️ Optional: Configure via .env**\n\n```\ncp .env.example .env\n# Edit capture interval, blocked apps, hotkeys, etc.\n```\n\nOr configure everything from the **Settings** tab in the dashboard.\n\n```\n┌─────────────────────────────────────────────────────────────────────┐\n│                          ScreenMind                                  │\n│                                                                     │\n│  ┌────────────┐    ┌──────────────┐    ┌─────────────────────────┐ │\n│  │  Capture   │───▶│  Async Queue │───▶│    Analysis Worker      │ │\n│  │  
Worker    │    │  (max: 100)  │    │                         │ │\n│  │            │    └──────────────┘    │  ┌───────────────────┐  │ │\n│  │ • Screen   │                        │  │  Per-App pHash    │  │ │\n│  │ • Window   │                        │  │  Cache (3-tier)   │  │ │\n│  │ • Dedup    │                        │  └───────────────────┘  │ │\n│  │ • A11y     │                        │           │             │ │\n│  │ • Privacy  │                        │           ▼             │ │\n│  └────────────┘                        │  ┌───────────────────┐  │ │\n│                                        │  │   EasyOCR         │  │ │\n│  ┌────────────┐                        │  │   (text extract)  │  │ │\n│  │   Audio    │                        │  └───────────────────┘  │ │\n│  │   Worker   │                        │           │             │ │\n│  │            │                        │           ▼             │ │\n│  │ • Meeting  │                        │  ┌───────────────────┐  │ │\n│  │   detect   │                        │  │   Gemma 4 E2B     │  │ │\n│  │ • Record   │                        │  │   (via llama.cpp) │  │ │\n│  │ • Transcr. 
│                        │  │   Vision + Audio  │  │ │\n│  └────────────┘                        │  └───────────────────┘  │ │\n│                                        │           │             │ │\n│  ┌────────────┐                        │           ▼             │ │\n│  │   Agent    │                        │  ┌───────────────────┐  │ │\n│  │  Scheduler │                        │  │  Layout Analyzer  │  │ │\n│  │            │                        │  │  (spatial OCR)    │  │ │\n│  │ • .md AI   │                        │  └───────────────────┘  │ │\n│  │ • .py code │                        │           │             │ │\n│  └────────────┘                        │           ▼             │ │\n│                                        │  ┌───────────────────┐  │ │\n│                                        │  │  MiniLM-L6-v2     │  │ │\n│                                        │  │  (embeddings)     │  │ │\n│                                        │  └───────────────────┘  │ │\n│                                        └─────────────────────────┘ │\n│                                                    │               │\n│                                                    ▼               │\n│                                        ┌───────────────────┐       │\n│                                        │   SQLite (WAL)    │       │\n│                                        │   + FTS5 index    │       │\n│                                        └─────────┬─────────┘       │\n│                                                  │                 │\n│  ┌───────────────────────────────────────────────┘                 │\n│  │                                                                 │\n│  ▼                                                                 │\n│  ┌───────────────────────────────────────────────────────────────┐ │\n│  │                    FastAPI REST Server                         │ │\n│  │  /timeline · /search · /chat · /stats · /agents · /mcp       │ 
│\n│  │                                                               │ │\n│  │  ┌───────────────────────────────────────────────────────┐   │ │\n│  │  │           Web Dashboard (Vanilla JS SPA)               │   │ │\n│  │  │  Timeline · Chat · Search · Analytics · Agents · Settings │ │\n│  │  └───────────────────────────────────────────────────────┘   │ │\n│  └───────────────────────────────────────────────────────────────┘ │\n└─────────────────────────────────────────────────────────────────────┘\nScreenshot → EasyOCR (text) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5\n                                     ↑\n                              OCR text fed as context\n                              (Gemma sees image + reads text)\n```\n\nFour components working in concert, with Gemma 4 as the brain:\n\n- **EasyOCR**: extracts raw screen text\n- **Gemma 4 E2B**: understands what you're doing (vision + reasoning)\n- **MiniLM-L6-v2**: generates semantic vectors for natural language search\n- **FTS5**: SQLite's full-text index, which indexes text for instant keyword search\n\nScreenMind includes a full agent/plugin system. 
Build any automation on top of your screen data.\n\n| Mode | File Type | For | Example |\n|---|---|---|---|\n| 🤖 AI Agent | `.md` | Everyone | Write a prompt in English → Gemma runs it on your data |\n| 🐍 Python Plugin | `.py` | Developers | Full code with SDK access, state persistence, LLM calls |\n\n```\n---\nname: Daily Focus Report\nschedule: every 6h\ndata: timeline, apps, mood\noutput: local, obsidian\n---\n\nAnalyze my screen activity and generate a focus report:\n- How many hours of deep work vs shallow work?\n- What were my main distractions?\n- Give me a focus score out of 10.\n```\n\nDrop this file in `~/.screenmind/agents/` and it runs automatically.\n\n```python\nfrom screenmind_sdk import ScreenMindSDK\n\nsdk = ScreenMindSDK(\"my-tracker\")\n\n# Get today's activities filtered by app\nactivities = sdk.get_activities(app=\"Chrome\", limit=20)\n\n# Persistent state across runs\nlast_count = sdk.load_state(\"url_count\", 0)\nurls = sdk.get_urls_visited()\nsdk.save_state(\"url_count\", len(urls))\n\n# Ask Gemma (GPU-safe: waits for idle)\ninsight = sdk.ask_gemma(f\"Summarize these URLs: {urls}\")\nprint(insight)\n```\n\nMarkdown agents declare what data they need:\n\n| Selector | Injects |\n|---|---|\n| `timeline` | Recent activities with timestamps, apps, summaries |\n| `apps` | App usage counts + category breakdown |\n| `urls` | URLs visited (extracted from browser address bars) |\n| `meetings` | Meeting summaries and durations |\n| `mood` | Mood/sentiment from screen analysis |\n\nData injection auto-scales to your model's context window.\n\n- **daily-journal.md**: First-person journal entry from your day\n- **focus-report.md**: Focus score, deep work hours, distractions\n- **meeting-actions.md**: Extract action items from meeting transcripts\n- **code-changelog.md**: Summarize coding activity (commits, files, repos)\n\nScreenMind exposes your screen history to any MCP-compatible AI tool:\n\n```\npython mcp_server.py  # stdio transport\n```\n\n**Claude Desktop 
config** (`~/.claude/claude_desktop_config.json`):\n\n```\n{\n  \"mcpServers\": {\n    \"screenmind\": {\n      \"command\": \"python\",\n      \"args\": [\"C:/path/to/screenmind/mcp_server.py\"]\n    }\n  }\n}\n```\n\n| Tool | Description |\n|---|---|\n| `search_screen` | Semantic + keyword search across all history |\n| `get_recent_activity` | Last N activities with full details |\n| `get_activity_by_time` | Activities for a specific date/time range |\n| `get_daily_summary` | AI-generated daily summary |\n| `capture_now` | Trigger instant screenshot |\n| `get_stats` | Usage statistics |\n| `search_audio` | Search meeting transcripts |\n| `get_screenshot` | Retrieve screenshot path by activity ID |\n\nFull Swagger docs at `http://127.0.0.1:7777/docs`\n\n| Method | Endpoint | Description |\n|---|---|---|\n| `GET` | `/api/status` | System health, worker stats |\n| `GET` | `/api/timeline?date=2026-05-21` | Activities for a date |\n| `GET` | `/api/search?q=debugging auth` | Hybrid semantic + keyword search |\n| `POST` | `/api/chat` | Conversational AI with screen memory (SSE stream) |\n| `GET` | `/api/stats?range=day` | Analytics (categories, apps, meetings) |\n| `GET` | `/api/rewind?date=2026-05-21` | Timelapse frames |\n| `POST` | `/api/summary/generate` | Generate AI daily summary |\n| `GET` | `/api/agents` | List all agents |\n| `POST` | `/api/agents/{name}/run` | Trigger agent execution |\n| `POST` | `/api/capture/pause` | Pause capture |\n| `POST` | `/api/incognito/toggle` | Toggle incognito mode |\n\nAll settings configurable via `.env`, environment variables, or the **Settings** dashboard (persists to `settings.json`).\n\n| Variable | Default | Description |\n|---|---|---|\n| `CAPTURE_INTERVAL` | `40` | Seconds between captures |\n| `ANALYSIS_MODE` | `merged` | `merged` (accurate, ~76s) or `fast` (~12s) |\n| `PERFORMANCE_MODE` | `balanced` | GPU layers: `minimal` / `balanced` / `maximum` |\n| `BLOCKED_APPS` | (empty) | Comma-separated apps to never capture 
|\n| `MEETING_TRANSCRIPTION` | `false` | Auto-transcribe when meeting apps detected |\n| `RETENTION_DAYS` | `7` | Auto-delete data older than N days (0 = forever) |\n| `ENCRYPTION_ENABLED` | `false` | Encrypt screenshots at rest |\n| `SENSITIVE_FILTER_ENABLED` | `true` | Redact credit cards, SSNs, API keys |\n\nSee `.env.example` for the full list.\n\n| Layer | Technology | Why |\n|---|---|---|\n| Vision + Audio AI | Gemma 4 E2B (via llama.cpp) | Only model with vision + audio + reasoning that runs locally on 4GB VRAM |\n| Inference Server | llama-server (llama.cpp) | Direct GGUF inference, OpenAI-compatible API |\n| OCR | EasyOCR | Extracts screen text fed to Gemma as context |\n| Embeddings | all-MiniLM-L6-v2 | 80MB, runs on CPU, 384-dim vectors for semantic search |\n| Backend | FastAPI + Uvicorn | Async-first, auto-generated API docs |\n| Database | SQLite (WAL) + FTS5 | Zero-config, concurrent reads, full-text search |\n| Capture | mss + ctypes/UI Automation | Native screen capture + accessibility text extraction |\n| Frontend | Vanilla JS + CSS | No build step, instant load, dark glassmorphism UI |\n| Platform | Windows / macOS / Linux | Abstraction layer with OS-specific adapters |\n\n| Scenario | Behavior |\n|---|---|\n| llama-server not running | Auto-starts on launch. Captures continue; analysis retried with backoff. |\n| Model not downloaded | Auto-downloads GGUF on first start via HuggingFace. |\n| GPU out of memory | Detects OOM, retries with delay, re-queues on persistent failure. |\n| Duplicate frames | pHash dedup skips near-duplicate screenshots (Hamming distance threshold: 8). |\n| Stale queue items | Captures >3 min old auto-skipped. Backfilled during idle. |\n| App in blocklist | Silently skips; no screenshot saved. |\n| Meeting app closed | Process-alive check + silence detection + 5-min hard timeout. |\n| Chat during analysis | Cancels in-flight inference, frees GPU in <1s, re-queues analysis. |\n| Crash recovery | Stale meetings cleaned on startup. 
Unanalyzed entries backfilled. |\n\nThe web dashboard at `http://127.0.0.1:7777` features:\n\n- **Timeline**: Browse activities by date with thumbnails, AI summaries, category badges\n- **Chat**: Conversational AI with screen memory. Ask anything about your history.\n- **Search**: Semantic + keyword hybrid search with OCR highlighting on screenshots\n- **Analytics**: Category charts, top apps, hourly heatmap, meeting stats\n- **Rewind**: Timelapse player with play/pause/scrub/speed controls\n- **Memos**: Voice memo list with audio player\n- **Agents**: Create, edit, run, and monitor agents\n- **Settings**: 8 organized sections: Shortcuts, Capture, AI, Audio, Privacy, Automation, Integrations, Storage\n\nDark glassmorphism UI. No build step. Instant load.\n\nContributions welcome! Here are some high-impact areas:\n\n- 🍎 **macOS/Linux testing**: platform adapters exist but need testing on real hardware\n- 🐳 **Docker container**: one-command setup\n- 🧩 **Community agent registry**: share agents between users\n- 🌐 **Browser extension**: richer URL/tab context\n- 📤 **Export formats**: Markdown, CSV, JSON\n\nIf you find ScreenMind useful, please consider:\n\n- **⭐ Star this repo**: it helps others discover the project\n- **🍴 Fork it**: build your own agents and features\n- **🐛 Report issues**: help us improve\n- **📣 Share it**: tell others about privacy-first AI\n\nMIT License. See [LICENSE](https://github.com/ayushh0110/ScreenMind/blob/main/LICENSE) for details.\n\n**Built with 🧠 Gemma 4 E2B · 🔒 100% Local · 🚀 Zero Cloud Dependencies**\n\n*Vision + Audio + Reasoning: all three modalities, one model, your machine.*\n\nMade with ❤️ by ayushh0110", "url": "https://wpnews.pro/news/show-hn-i-run-a-vision-model-on-every-screenshot-locally-on-a-4gb-gpu", "canonical_source": "https://github.com/ayushh0110/ScreenMind", "published_at": "2026-06-13 23:12:54+00:00", "updated_at": "2026-06-13 23:32:41.908073+00:00", "lang": "en", "topics": ["computer-vision", "large-language-models", "ai-products", "ai-tools", "ai-ethics"], "entities": 
["ScreenMind", "Gemma 4", "Microsoft", "Recall", "MiniLM", "Claude", "Cursor", "VS Code"], "alternates": {"html": "https://wpnews.pro/news/show-hn-i-run-a-vision-model-on-every-screenshot-locally-on-a-4gb-gpu", "markdown": "https://wpnews.pro/news/show-hn-i-run-a-vision-model-on-every-screenshot-locally-on-a-4gb-gpu.md", "text": "https://wpnews.pro/news/show-hn-i-run-a-vision-model-on-every-screenshot-locally-on-a-4gb-gpu.txt", "jsonld": "https://wpnews.pro/news/show-hn-i-run-a-vision-model-on-every-screenshot-locally-on-a-4gb-gpu.jsonld"}}