Show HN: I run a vision model on every screenshot, locally, on a 4GB GPU


Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory

100% local. 100% private. Zero cloud dependencies.


With Recall, Microsoft showed that the world wants screen-aware AI. But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative: every screenshot analyzed, every insight generated, every search result, all computed locally using Gemma 4's multimodal capabilities.

It's not just a screen recorder. It's an **AI memory** you can talk to, search through, and build automations on top of.

- 📸 **Smart Capture**: Content-change detection, not a fixed timer. Captures when your screen *actually* changes.
- 🔬 **Gemma 4 Vision Analysis**: Every screenshot analyzed for app detection, activity categorization, mood, scene description, and spatial layout regions.
- 🔍 **Hybrid Search**: Semantic embeddings (MiniLM) + FTS5 keyword search. Find anything by *meaning*, not just keywords.
- 💬 **Chat with Memory**: Conversational RAG with follow-up support. Ask "what did Ishaa say on Discord?" → get the actual message.
- 🎙️ **Voice Memos**: Hold `Ctrl+Shift+V` → Gemma 4's native audio encoder transcribes. Screenshot captured alongside.
- 🎤 **Meeting Transcription**: Auto-detects Zoom/Teams/Meet, records audio, transcribes, generates structured summaries.
- 📊 **Analytics Dashboard**: Category breakdown, top apps, hourly heatmap, meeting stats, focus metrics.
- ⏪ **Day Rewind**: Timelapse playback of your entire day with pause/scrub/speed controls.

- **Three Analysis Modes**: Accurate (~76s, deep thinking + layout), Balanced (~40s, thinking), or Fast (~12s, no thinking). You choose.
- **Per-App pHash Cache**: 3-tier caching with app-aware staleness. Communication apps refresh faster than IDEs. Significantly fewer inference calls.
- **Chat-First GPU Priority**: Chat cancels in-flight analysis instantly. GPU freed in <1s.
- **Auto-Pause Heavy Apps**: Games, video editors, 3D software detected → capture pauses automatically.

- **100% Local**: All data stays on your machine. Zero network calls after initial model download. No telemetry. Ever.
- **Sensitive Data Filter**: Auto-redacts credit cards, SSNs, API keys, passwords before storage.
- **Encryption at Rest**: AES encryption for screenshots (Fernet + OS keyring).
- **Dashboard PIN Lock**: Session-based auth with configurable auto-lock timeout.
- **Incognito Mode**: One-click pause. Nothing recorded.
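To make the Sensitive Data Filter concrete, here is a minimal sketch of regex-based redaction applied to OCR text before storage. The README does not show ScreenMind's real rules, so the patterns, the `PATTERNS` table, and the `redact` helper are all assumptions for illustration; real detection would likely need stricter checks (for example a Luhn test for card numbers).

```python
import re

# Hypothetical patterns; ScreenMind's actual rules are not shown in this README.
PATTERNS = {
    # 13-16 digits, optionally separated by single spaces or hyphens
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    # US Social Security number in the common 3-2-4 layout
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Keys with a few well-known prefixes followed by a long token
    "API_KEY": re.compile(r"\b(?:sk|pk|ghp)_[A-Za-z0-9_]{16,}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text


print(redact("card 4111 1111 1111 1111 ok"))
```

Running redaction before anything is written to disk means the stored text and the search index never contain the raw secret.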

🔌 Integrations & Extensibility

| Integration | Description |
|---|---|
| 🤖 Agent Platform | Build automations in Markdown (English) or Python. Drop a file, get an agent. |
| 🔌 MCP Server | Expose screen history to Claude Desktop, Cursor, VS Code |
| 📓 Obsidian | Auto-sync daily summaries to your vault |
| 📋 Notion | Push summaries to a Notion database |
| 🪝 Webhooks | Fire events to Slack, Discord, IFTTT (HMAC signed, auto-retry) |
| 🔔 Smart Notifications | Distraction alerts, break reminders |
| ⭐ Auto-Bookmark | Keyword triggers (`git push`, `deploy`) auto-flag important moments |
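The webhook integration is described as HMAC signed. A minimal sketch of what signing and verifying a payload can look like with the standard library follows. The README does not specify the hash, the header name, or the payload shape, so HMAC-SHA256, the hex encoding, and the `sign_payload` / `verify` helpers are assumptions.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, payload: dict) -> tuple[bytes, str]:
    """Serialize the payload deterministically and return (body, hex signature)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature on the receiver side and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


# Hypothetical event payload, for illustration only
body, sig = sign_payload(b"shared-secret", {"event": "bookmark", "app": "Terminal"})
print(verify(b"shared-secret", body, sig))
```

The receiver must verify against the raw request body it received, not a re-serialized copy, since any change in key order or whitespace changes the signature.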

| Hotkey | Action |
|---|---|
| `Ctrl+Shift+B` | 📸 Instant bookmarked capture |
| `Ctrl+Shift+P` | ⏸ Toggle pause/resume |
| `Ctrl+Shift+V` | 🎤 Hold to record voice memo |

All hotkeys customizable from Settings.

Gemma 4 E2B is not a bolt-on — it's architecturally load-bearing. ScreenMind uses all three modalities:

Every screenshot is sent to Gemma 4 with OCR context. It returns structured JSON:

- App name, activity category, summary, detailed context
- Mood classification, confidence score
- Rich scene description (every visible element inventoried)
- Layout regions (sidebar, chat area, toolbar boundaries)
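As an illustration of consuming that structured JSON, the sketch below parses a model response into a typed record and tolerates missing or out-of-range fields. The README lists the fields but not their key names, so the keys (`app`, `category`, `summary`, `mood`, `confidence`) and the `Analysis` class are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Analysis:
    app: str
    category: str
    summary: str
    mood: str
    confidence: float


def parse_analysis(raw: str) -> Analysis:
    """Parse a JSON response, filling defaults for missing keys (key names are assumed)."""
    data = json.loads(raw)
    # Clamp the confidence into [0, 1] in case the model returns something out of range
    confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    return Analysis(
        app=str(data.get("app", "unknown")),
        category=str(data.get("category", "other")),
        summary=str(data.get("summary", "")),
        mood=str(data.get("mood", "neutral")),
        confidence=confidence,
    )


sample = '{"app": "VS Code", "category": "coding", "summary": "Editing main.py", "mood": "focused", "confidence": 1.4}'
print(parse_analysis(sample))
```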

Three modes:

- **Accurate**: single call with thinking (~76s). Best layout detection.
- **Balanced**: thinking enabled, analysis-only (~40s). Richer descriptions than Fast.
- **Fast**: no-thinking prefill trick (~12s). Layout via OCR clustering instead.

Gemma 4 E2B has a native audio encoder. ScreenMind uses it for:

- Voice memo transcription (hold hotkey → speak → release)
- Meeting transcription (15s chunks, map-reduce summarization for long meetings)

No Whisper dependency. One model handles everything.
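The meeting pipeline above (15s chunks, then map-reduce summarization) can be sketched roughly as follows. Only the 15-second chunk length comes from the README; the `split_chunks` and `map_reduce_summary` helpers, the `group_size` parameter, and the `summarize` callback (a stand-in for a call to the model) are assumptions.

```python
from typing import Callable, Sequence

CHUNK_SECONDS = 15  # chunk length stated in the README


def split_chunks(samples: Sequence[float], sample_rate: int,
                 seconds: int = CHUNK_SECONDS) -> list[Sequence[float]]:
    """Cut raw audio samples into fixed-length chunks; the last one may be shorter."""
    size = sample_rate * seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def map_reduce_summary(texts: list[str], summarize: Callable[[str], str],
                       group_size: int = 4) -> str:
    """Summarize groups of transcripts, then summarize the summaries until one remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not texts:
        return ""
    while True:
        groups = [texts[i:i + group_size] for i in range(0, len(texts), group_size)]
        texts = [summarize("\n".join(group)) for group in groups]
        if len(texts) == 1:
            return texts[0]


# 40 seconds of fake audio at 1 sample per second
print([len(chunk) for chunk in split_chunks(list(range(40)), sample_rate=1)])
```

Reducing in groups keeps every prompt within the context window no matter how long the meeting runs, at the cost of a few extra model calls.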

- Daily summaries with deep reasoning (`think=True`)
- Chat answers grounded in actual screen data (text-first RAG with vision fallback)
- **Agent execution**: Gemma processes markdown agent prompts with injected screen data

| Constraint | Why It Rules Out Alternatives |
|---|---|
| Must run continuously in background | Rules out 12B+ models (too heavy) |
| Must understand screenshots natively | Rules out text-only models |
| Must stay 100% local for privacy | Rules out cloud APIs |
| Must handle audio natively | Rules out models without audio encoder |
| Must be fast enough for 30s cycle | E2B processes in 12-76s depending on mode |

Gemma 4 E2B is the only model that checks all five boxes.

Requirements: Python 3.10+ · GPU recommended (4GB+ VRAM) · ~5GB disk for model

git clone https://github.com/ayushh0110/ScreenMind.git
cd ScreenMind

python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # macOS/Linux

pip install -r requirements.txt
python main.py

Then open http://127.0.0.1:7777 in your browser.

On first run, ScreenMind will:

- Auto-download the Gemma 4 E2B GGUF model (~5GB, one time)
- Start `llama-server` in the background
- Show the welcome screen to set up an optional PIN
- Create `~/.screenmind/` for data storage

⚙️ Optional: Configure via .env

cp .env.example .env

Or configure everything from the Settings tab in the dashboard.

┌─────────────────────────────────────────────────────────────────────┐
│                          ScreenMind                                  │
│                                                                     │
│  ┌────────────┐    ┌──────────────┐    ┌─────────────────────────┐ │
│  │  Capture   │───▶│  Async Queue │───▶│    Analysis Worker      │ │
│  │  Worker    │    │  (max: 100)  │    │                         │ │
│  │            │    └──────────────┘    │  ┌───────────────────┐  │ │
│  │ • Screen   │                        │  │  Per-App pHash    │  │ │
│  │ • Window   │                        │  │  Cache (3-tier)   │  │ │
│  │ • Dedup    │                        │  └───────────────────┘  │ │
│  │ • A11y     │                        │           │             │ │
│  │ • Privacy  │                        │           ▼             │ │
│  └────────────┘                        │  ┌───────────────────┐  │ │
│                                        │  │   EasyOCR         │  │ │
│  ┌────────────┐                        │  │   (text extract)  │  │ │
│  │   Audio    │                        │  └───────────────────┘  │ │
│  │   Worker   │                        │           │             │ │
│  │            │                        │           ▼             │ │
│  │ • Meeting  │                        │  ┌───────────────────┐  │ │
│  │   detect   │                        │  │   Gemma 4 E2B     │  │ │
│  │ • Record   │                        │  │   (via llama.cpp) │  │ │
│  │ • Transcr. │                        │  │   Vision + Audio  │  │ │
│  └────────────┘                        │  └───────────────────┘  │ │
│                                        │           │             │ │
│  ┌────────────┐                        │           ▼             │ │
│  │   Agent    │                        │  ┌───────────────────┐  │ │
│  │  Scheduler │                        │  │  Layout Analyzer  │  │ │
│  │            │                        │  │  (spatial OCR)    │  │ │
│  │ • .md AI   │                        │  └───────────────────┘  │ │
│  │ • .py code │                        │           │             │ │
│  └────────────┘                        │           ▼             │ │
│                                        │  ┌───────────────────┐  │ │
│                                        │  │  MiniLM-L6-v2     │  │ │
│                                        │  │  (embeddings)     │  │ │
│                                        │  └───────────────────┘  │ │
│                                        └─────────────────────────┘ │
│                                                    │               │
│                                                    ▼               │
│                                        ┌───────────────────┐       │
│                                        │   SQLite (WAL)    │       │
│                                        │   + FTS5 index    │       │
│                                        └─────────┬─────────┘       │
│                                                  │                 │
│  ┌───────────────────────────────────────────────┘                 │
│  │                                                                 │
│  ▼                                                                 │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │                    FastAPI REST Server                         │ │
│  │  /timeline · /search · /chat · /stats · /agents · /mcp       │ │
│  │                                                               │ │
│  │  ┌───────────────────────────────────────────────────────┐   │ │
│  │  │           Web Dashboard (Vanilla JS SPA)               │   │ │
│  │  │  Timeline · Chat · Search · Analytics · Agents · Settings │ │
│  │  └───────────────────────────────────────────────────────┘   │ │
│  └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Screenshot → EasyOCR (text) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5
                                     ↑
                              OCR text fed as context
                              (Gemma sees image + reads text)

Four components work in concert (three models plus a full-text index), with Gemma 4 as the brain:

- **EasyOCR**: extracts raw screen text
- **Gemma 4 E2B**: understands what you're doing (vision + reasoning)
- **MiniLM-L6-v2**: generates semantic vectors for natural language search
- **FTS5**: indexes text for instant keyword search
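The README does not say how the semantic (MiniLM) and keyword (FTS5) results are merged into one hybrid ranking. One common option is reciprocal rank fusion, sketched below purely as an assumption about how such a merge could work; the function name, the activity IDs, and the constant `k = 60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge several ranked ID lists; items ranked high in any list float to the top."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = [3, 1, 2]   # hypothetical activity IDs from the embedding search
keyword_hits = [1, 4]       # hypothetical activity IDs from the FTS5 search
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
```

Rank-based fusion avoids having to normalize cosine similarities against FTS5 relevance scores, which live on unrelated scales.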

ScreenMind includes a full agent/plugin system. Build any automation on top of your screen data.

| Mode | File Type | For | Example |
|---|---|---|---|
| 🤖 AI Agent | `.md` | Everyone | Write a prompt in English → Gemma runs it on your data |
| 🐍 Python Plugin | `.py` | Developers | Full code with SDK access, state persistence, LLM calls |

---
name: Daily Focus Report
schedule: every 6h
data: timeline, apps, mood
output: local, obsidian
---

Analyze my screen activity and generate a focus report:
- How many hours of deep work vs shallow work?
- What were my main distractions?
- Give me a focus score out of 10.

Drop this file in `~/.screenmind/agents/` and it runs automatically.
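To show how a Markdown agent like the one above splits into settings and prompt, here is a minimal sketch that parses the front matter by hand. It is not ScreenMind's loader; the `parse_agent` helper and its handling of malformed files are assumptions, and a real implementation might use a YAML parser instead.

```python
def parse_agent(text: str) -> tuple[dict[str, str], str]:
    """Split an agent file into (front-matter dict, prompt body)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text.strip()  # no front matter: the whole file is the prompt
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, text.strip()  # unterminated front matter: treat as plain prompt
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:]).strip()


AGENT = """---
name: Daily Focus Report
schedule: every 6h
data: timeline, apps, mood
output: local, obsidian
---

Analyze my screen activity and generate a focus report.
"""

meta, prompt = parse_agent(AGENT)
selectors = [s.strip() for s in meta.get("data", "").split(",") if s.strip()]
print(meta["name"], selectors)
```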

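from screenmind_sdk import ScreenMindSDK

sdk = ScreenMindSDK("my-tracker")

# Fetch up to 20 Chrome activities
activities = sdk.get_activities(app="Chrome", limit=20)

# Load the count saved by the previous run (default 0), then store the new one
last_count = sdk.load_state("url_count", 0)
urls = sdk.get_urls_visited()
sdk.save_state("url_count", len(urls))

# Ask the local Gemma model to summarize the URLs
insight = sdk.ask_gemma(f"Summarize these URLs: {urls}")
print(insight)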

Markdown agents declare what data they need:

| Selector | Injects |
|---|---|
| `timeline` | Recent activities with timestamps, apps, summaries |
| `apps` | App usage counts + category breakdown |
| `urls` | URLs visited (extracted from browser address bars) |
| `meetings` | Meeting summaries and durations |
| `mood` | Mood/sentiment from screen analysis |

Data injection auto-scales to your model's context window.

- **daily-journal.md**: First-person journal entry from your day
- **focus-report.md**: Focus score, deep work hours, distractions
- **meeting-actions.md**: Extract action items from meeting transcripts
- **code-changelog.md**: Summarize coding activity (commits, files, repos)

ScreenMind exposes your screen history to any MCP-compatible AI tool:

python mcp_server.py  # stdio transport

Claude Desktop config (`~/.claude/claude_desktop_config.json`):

{
  "mcpServers": {
    "screenmind": {
      "command": "python",
      "args": ["C:/path/to/screenmind/mcp_server.py"]
    }
  }
}

| Tool | Description |
|---|---|
| `search_screen` | Semantic + keyword search across all history |
| `get_recent_activity` | Last N activities with full details |
| `get_activity_by_time` | Activities for a specific date/time range |
| `get_daily_summary` | AI-generated daily summary |
| `capture_now` | Trigger instant screenshot |
| `get_stats` | Usage statistics |
| `search_audio` | Search meeting transcripts |
| `get_screenshot` | Retrieve screenshot path by activity ID |

Full Swagger docs at http://127.0.0.1:7777/docs

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/status` | System health, worker stats |
| `GET` | `/api/timeline?date=2026-05-21` | Activities for a date |
| `GET` | `/api/search?q=debugging auth` | Hybrid semantic + keyword search |
| `POST` | `/api/chat` | Conversational AI with screen memory (SSE stream) |
| `GET` | `/api/stats?range=day` | Analytics (categories, apps, meetings) |
| `GET` | `/api/rewind?date=2026-05-21` | Timelapse frames |
| `POST` | `/api/summary/generate` | Generate AI daily summary |
| `GET` | `/api/agents` | List all agents |
| `POST` | `/api/agents/{name}/run` | Trigger agent execution |
| `POST` | `/api/capture/` | capture |
| `POST` | `/api/incognito/toggle` | Toggle incognito mode |
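A minimal sketch of calling these endpoints from Python with only the standard library. The base URL and paths come from the table above; the `api_url` and `fetch_json` helpers are hypothetical, and the response shapes are not documented here, so the result is returned as raw parsed JSON. Note that a query such as `debugging auth` has to be URL-encoded.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:7777"


def api_url(path: str, **params: str) -> str:
    """Build a full URL with properly encoded query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def fetch_json(path: str, **params: str):
    """GET an endpoint and parse the JSON body (requires ScreenMind to be running)."""
    with urlopen(api_url(path, **params), timeout=10) as response:
        return json.load(response)


print(api_url("/api/search", q="debugging auth"))
```

With the server running, `fetch_json("/api/timeline", date="2026-05-21")` would return that day's activities.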

All settings are configurable via `.env`, environment variables, or the Settings dashboard (persists to `settings.json`).

| Variable | Default | Description |
|---|---|---|
| `CAPTURE_INTERVAL` | `40` | Seconds between captures |
| `ANALYSIS_MODE` | `merged` | `merged` (accurate, ~76s) or `fast` (~12s) |
| `PERFORMANCE_MODE` | `balanced` | GPU layers: `minimal` / `balanced` / `maximum` |
| `BLOCKED_APPS` | (empty) | Comma-separated apps to never capture |
| `MEETING_TRANSCRIPTION` | `false` | Auto-transcribe when meeting apps detected |
| `RETENTION_DAYS` | `7` | Auto-delete data older than N days (0 = forever) |
| `ENCRYPTION_ENABLED` | `false` | Encrypt screenshots at rest |
| `SENSITIVE_FILTER_ENABLED` | `true` | Redact credit cards, SSNs, API keys |

See `.env.example` for the full list.
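For readers wiring these variables into their own tooling, the sketch below reads a few of them with the documented defaults. The variable names and defaults come from the table above; the `Settings` class, `load_settings`, and the accepted boolean spellings are assumptions rather than ScreenMind's actual loader.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    capture_interval: int
    analysis_mode: str
    retention_days: int
    encryption_enabled: bool
    blocked_apps: tuple[str, ...]


def _as_bool(value: str) -> bool:
    # Accepted spellings are an assumption, not documented behavior
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ) using the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        capture_interval=int(env.get("CAPTURE_INTERVAL", "40")),
        analysis_mode=env.get("ANALYSIS_MODE", "merged"),
        retention_days=int(env.get("RETENTION_DAYS", "7")),
        encryption_enabled=_as_bool(env.get("ENCRYPTION_ENABLED", "false")),
        blocked_apps=tuple(
            app.strip() for app in env.get("BLOCKED_APPS", "").split(",") if app.strip()
        ),
    )


print(load_settings({"BLOCKED_APPS": "1Password, Signal"}))
```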

| Layer | Technology | Why |
|---|---|---|
| Vision + Audio AI | Gemma 4 E2B (via llama.cpp) | Only model with vision + audio + reasoning that runs locally on 4GB VRAM |
| Inference Server | llama-server (llama.cpp) | Direct GGUF inference, OpenAI-compatible API |
| OCR | EasyOCR | Extracts screen text fed to Gemma as context |
| Embeddings | all-MiniLM-L6-v2 | 80MB, runs on CPU, 384-dim vectors for semantic search |
| Backend | FastAPI + Uvicorn | Async-first, auto-generated API docs |
| Database | SQLite (WAL) + FTS5 | Zero-config, concurrent reads, full-text search |
| Capture | mss + ctypes/UI Automation | Native screen capture + accessibility text extraction |
| Frontend | Vanilla JS + CSS | No build step, instant load, dark glassmorphism UI |
| Platform | Windows / macOS / Linux | Abstraction layer with OS-specific adapters |

| Scenario | Behavior |
|---|---|
| llama-server not running | Auto-starts on launch. Captures continue; analysis retried with backoff. |
| Model not downloaded | Auto-downloads GGUF on first start via HuggingFace. |
| GPU out of memory | Detects OOM, retries with delay, re-queues on persistent failure. |
| Duplicate frames | pHash dedup skips identical screenshots (threshold: 8 hamming distance). |
| Stale queue items | Captures >3 min old auto-skipped. Backfilled during idle. |
| App in blocklist | Silently skips, no screenshot saved. |
| Meeting app closed | Process-alive check + silence detection + 5-min hard timeout. |
| Chat during analysis | Cancels in-flight inference, frees GPU in <1s, re-queues analysis. |
| Crash recovery | Stale meetings cleaned on startup. Unanalyzed entries backfilled. |
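The duplicate-frame rule above compares perceptual hashes by Hamming distance. A minimal sketch of that comparison, assuming 64-bit hashes held as Python integers: the threshold of 8 is from the table, while the helper names and the choice of `<=` (rather than `<`) at the boundary are assumptions.

```python
DUPLICATE_THRESHOLD = 8  # Hamming-distance threshold stated in the README


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_duplicate(hash_a: int, hash_b: int, threshold: int = DUPLICATE_THRESHOLD) -> bool:
    """Treat two frames as the same when their hashes differ in at most `threshold` bits."""
    return hamming_distance(hash_a, hash_b) <= threshold


previous = 0xF0F0F0F0F0F0F0F0   # made-up hash values for illustration
current = 0xF0F0F0F0F0F0F0F3    # differs from `previous` in the two lowest bits
print(hamming_distance(previous, current), is_duplicate(previous, current))
```

Small distances absorb minor changes such as a blinking cursor or a clock tick, so those frames never reach the model.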

The web dashboard at http://127.0.0.1:7777 features:

- **Timeline**: Browse activities by date with thumbnails, AI summaries, category badges
- **Chat**: Conversational AI with screen memory. Ask anything about your history.
- **Search**: Semantic + keyword hybrid search with OCR highlighting on screenshots
- **Analytics**: Category charts, top apps, hourly heatmap, meeting stats
- **Rewind**: Timelapse player with pause/scrub/speed controls
- **Memos**: Voice memo list with audio player
- **Agents**: Create, edit, run, and monitor agents
- **Settings**: 8 organized sections: Shortcuts, Capture, AI, Audio, Privacy, Automation, Integrations, Storage

Dark glassmorphism UI. No build step. Instant load.

Contributions welcome! Here are some high-impact areas:

- 🍎 **macOS/Linux testing**: platform adapters exist, need real hardware testing
- 🐳 **Docker container**: one-command setup
- 🧩 **Community agent registry**: share agents between users
- 🌐 **Browser extension**: richer URL/tab context
- 📤 **Export formats**: Markdown, CSV, JSON

If you find ScreenMind useful, please consider:

- ⭐ **Star this repo**: it helps others discover the project
- 🍴 **Fork it**: build your own agents and features
- 🐛 **Report issues**: help us improve
- 📣 **Share it**: tell others about privacy-first AI

MIT License — see LICENSE for details.

Built with 🧠 Gemma 4 E2B · 🔒 100% Local · 🚀 Zero Cloud Dependencies

Vision + Audio + Reasoning — all three modalities, one model, your machine.

Made with ❤️ by ayushh0110
