# Show HN: I run a vision model on every screenshot, locally, on a 4GB GPU

> Source: <https://github.com/ayushh0110/ScreenMind>
> Published: 2026-06-13 23:12:54+00:00

**Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory**

**100% local. 100% private. Zero cloud dependencies.**


Microsoft showed the world wants screen-aware AI with Recall. But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative: every screenshot analyzed, every insight generated, every search result, all computed locally using Gemma 4's multimodal capabilities.

It's not just a screen recorder. It's an **AI memory** you can talk to, search through, and build automations on top of.

- **📸 Smart Capture**: Content-change detection, not a fixed timer. Captures when your screen *actually* changes.
- **🔬 Gemma 4 Vision Analysis**: Every screenshot analyzed for app detection, activity categorization, mood, scene description, and spatial layout regions.
- **🔍 Hybrid Search**: Semantic embeddings (MiniLM) + FTS5 keyword search. Find anything by *meaning*, not just keywords.
- **💬 Chat with Memory**: Conversational RAG with follow-up support. Ask "what did Ishaa say on Discord?" → get the actual message.
- **🎙️ Voice Memos**: Hold `Ctrl+Shift+V` → Gemma 4's native audio encoder transcribes. Screenshot captured alongside.
- **🎤 Meeting Transcription**: Auto-detects Zoom/Teams/Meet, records audio, transcribes, generates structured summaries.
- **📊 Analytics Dashboard**: Category breakdown, top apps, hourly heatmap, meeting stats, focus metrics.
- **⏪ Day Rewind**: Timelapse playback of your entire day with play/pause/scrub/speed controls.

- **Three Analysis Modes**: Accurate (~76s, deep thinking + layout), Balanced (~40s, thinking), or Fast (~12s, no thinking). You choose.
- **Per-App pHash Cache**: 3-tier caching with app-aware staleness. Communication apps refresh faster than IDEs, which significantly reduces inference calls.
- **Chat-First GPU Priority**: Chat cancels in-flight analysis instantly. GPU freed in <1s.
- **Auto-Pause Heavy Apps**: Games, video editors, 3D software detected → capture pauses automatically.

- **100% Local**: All data stays on your machine. Zero network calls after initial model download. No telemetry. Ever.
- **Sensitive Data Filter**: Auto-redacts credit cards, SSNs, API keys, passwords before storage.
- **Encryption at Rest**: AES encryption for screenshots (Fernet + OS keyring).
- **Dashboard PIN Lock**: Session-based auth with configurable auto-lock timeout.
- **Incognito Mode**: One-click pause. Nothing recorded.
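The Sensitive Data Filter can be pictured as a set of regular-expression substitutions applied to extracted text before it is stored. The sketch below is only an illustration: the patterns, the `sk-` key prefix, and the `redact` helper are assumptions, not ScreenMind's actual rules.

```python
import re

# Hypothetical patterns for illustration; the project's real rules may differ.
PATTERNS = {
    # 13 to 16 digits, optionally separated by single spaces or hyphens
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    # US Social Security number in the common 3-2-4 layout
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # An assumed "sk-" style secret key followed by 20+ alphanumerics
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text
```

Real-world filters usually add checksum validation (for example the Luhn check for card numbers) to cut false positives; that step is omitted here for brevity.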

**🔌 Integrations & Extensibility**

| Integration | Description |
|---|---|
| 🤖 Agent Platform | Build automations in Markdown (English) or Python. Drop a file, get an agent. |
| 🔌 MCP Server | Expose screen history to Claude Desktop, Cursor, VS Code |
| 📓 Obsidian | Auto-sync daily summaries to your vault |
| 📋 Notion | Push summaries to a Notion database |
| 🪝 Webhooks | Fire events to Slack, Discord, IFTTT (HMAC signed, auto-retry) |
| 🔔 Smart Notifications | Distraction alerts, break reminders |
| ⭐ Auto-Bookmark | Keyword triggers (`git push`, `deploy`) auto-flag important moments |
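For readers consuming the webhooks, "HMAC signed" typically means the sender computes a keyed hash over the request body and the receiver recomputes it to check authenticity. The following is a minimal sketch assuming HMAC-SHA256 over a canonical JSON body; the hash algorithm, the payload shape, and both function names are assumptions, since the README does not specify them.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, payload: dict) -> tuple[bytes, str]:
    """Serialize the payload deterministically and return (body, hex signature)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The receiver must verify against the raw bytes it received, not a re-serialized copy, because any change in whitespace or key order changes the hash.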

| Hotkey | Action |
|---|---|
| `Ctrl+Shift+B` | 📸 Instant bookmarked capture |
| `Ctrl+Shift+P` | ⏸ Toggle pause/resume |
| `Ctrl+Shift+V` | 🎤 Hold to record voice memo |

All hotkeys customizable from Settings.

Gemma 4 E2B is not a bolt-on — it's architecturally load-bearing. ScreenMind uses **all three modalities**:

Every screenshot is sent to Gemma 4 with OCR context. It returns structured JSON:

- App name, activity category, summary, detailed context
- Mood classification, confidence score
- Rich scene description (every visible element inventoried)
- Layout regions (sidebar, chat area, toolbar boundaries)
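Model output is not guaranteed to be well formed, so it helps to validate the structured JSON before storing it. Here is a small sketch of that step; the field names (`app`, `category`, `summary`, `mood`, `confidence`, `layout_regions`) and the defaults are hypothetical, chosen to mirror the list above rather than taken from ScreenMind's schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ScreenAnalysis:
    # Hypothetical field names mirroring the README's description
    app: str
    category: str
    summary: str
    mood: str
    confidence: float
    layout_regions: list[str] = field(default_factory=list)


def parse_analysis(raw: str) -> ScreenAnalysis:
    """Parse a JSON reply, fill missing keys, and clamp confidence to [0, 1]."""
    data = json.loads(raw)
    confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    return ScreenAnalysis(
        app=str(data.get("app", "unknown")),
        category=str(data.get("category", "unknown")),
        summary=str(data.get("summary", "")),
        mood=str(data.get("mood", "neutral")),
        confidence=confidence,
        layout_regions=list(data.get("layout_regions", [])),
    )
```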

**Three modes:**

- **Accurate**: single call with thinking (~76s). Best layout detection.
- **Balanced**: thinking enabled, analysis-only (~40s). Richer descriptions than Fast.
- **Fast**: no-thinking prefill trick (~12s). Layout via OCR clustering instead.

Gemma 4 E2B has a native audio encoder. ScreenMind uses it for:

- Voice memo transcription (hold hotkey → speak → release)
- Meeting transcription (15s chunks, map-reduce summarization for long meetings)

No Whisper dependency. One model handles everything.
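The chunked transcription and map-reduce summarization described above can be sketched roughly as follows. Only the 15-second chunk length comes from the README; the function names, the `group_size` parameter, and the `summarize` callable (which stands in for a call to the model) are hypothetical.

```python
from typing import Callable, Sequence

CHUNK_SECONDS = 15  # chunk length stated in the README


def split_chunks(samples: Sequence[float], sample_rate: int,
                 seconds: int = CHUNK_SECONDS) -> list[Sequence[float]]:
    """Split raw audio samples into fixed-length chunks; the last may be shorter."""
    size = sample_rate * seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def map_reduce_summary(transcripts: list[str],
                       summarize: Callable[[str], str],
                       group_size: int = 4) -> str:
    """Map: summarize groups of transcripts. Reduce: summarize the partials."""
    if not transcripts:
        return ""
    partials = [
        summarize(" ".join(transcripts[i:i + group_size]))
        for i in range(0, len(transcripts), group_size)
    ]
    return partials[0] if len(partials) == 1 else summarize(" ".join(partials))
```

Grouping keeps each model call within the context window regardless of meeting length, at the cost of one extra call for the final reduce step.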

- **Daily summaries** with deep reasoning (`think=True`)
- **Chat answers** grounded in actual screen data (text-first RAG with vision fallback)
- **Agent execution**: Gemma processes markdown agent prompts with injected screen data

| Constraint | Why It Rules Out Alternatives |
|---|---|
| Must run continuously in background | Rules out 12B+ models (too heavy) |
| Must understand screenshots natively | Rules out text-only models |
| Must stay 100% local for privacy | Rules out cloud APIs |
| Must handle audio natively | Rules out models without audio encoder |
| Must be fast enough for 30s cycle | E2B processes in 12-76s depending on mode |

Gemma 4 E2B is the only model that checks all five boxes.

**Requirements:** Python 3.10+ · GPU recommended (4GB+ VRAM) · ~5GB disk for model

```
git clone https://github.com/ayushh0110/ScreenMind.git
cd ScreenMind

python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

pip install -r requirements.txt
python main.py
```

#### 3️⃣ Open → [http://127.0.0.1:7777](http://127.0.0.1:7777)


On first run, ScreenMind will:

- Auto-download Gemma 4 E2B GGUF model (~5GB, one time)
- Start `llama-server` in background
- Show the welcome screen to set up an optional PIN
- Create `~/.screenmind/` for data storage

**⚙️ Optional: Configure via .env**

```
cp .env.example .env
# Edit capture interval, blocked apps, hotkeys, etc.
```

Or configure everything from the **Settings** tab in the dashboard.

```
┌─────────────────────────────────────────────────────────────────────┐
│                          ScreenMind                                  │
│                                                                     │
│  ┌────────────┐    ┌──────────────┐    ┌─────────────────────────┐ │
│  │  Capture   │───▶│  Async Queue │───▶│    Analysis Worker      │ │
│  │  Worker    │    │  (max: 100)  │    │                         │ │
│  │            │    └──────────────┘    │  ┌───────────────────┐  │ │
│  │ • Screen   │                        │  │  Per-App pHash    │  │ │
│  │ • Window   │                        │  │  Cache (3-tier)   │  │ │
│  │ • Dedup    │                        │  └───────────────────┘  │ │
│  │ • A11y     │                        │           │             │ │
│  │ • Privacy  │                        │           ▼             │ │
│  └────────────┘                        │  ┌───────────────────┐  │ │
│                                        │  │   EasyOCR         │  │ │
│  ┌────────────┐                        │  │   (text extract)  │  │ │
│  │   Audio    │                        │  └───────────────────┘  │ │
│  │   Worker   │                        │           │             │ │
│  │            │                        │           ▼             │ │
│  │ • Meeting  │                        │  ┌───────────────────┐  │ │
│  │   detect   │                        │  │   Gemma 4 E2B     │  │ │
│  │ • Record   │                        │  │   (via llama.cpp) │  │ │
│  │ • Transcr. │                        │  │   Vision + Audio  │  │ │
│  └────────────┘                        │  └───────────────────┘  │ │
│                                        │           │             │ │
│  ┌────────────┐                        │           ▼             │ │
│  │   Agent    │                        │  ┌───────────────────┐  │ │
│  │  Scheduler │                        │  │  Layout Analyzer  │  │ │
│  │            │                        │  │  (spatial OCR)    │  │ │
│  │ • .md AI   │                        │  └───────────────────┘  │ │
│  │ • .py code │                        │           │             │ │
│  └────────────┘                        │           ▼             │ │
│                                        │  ┌───────────────────┐  │ │
│                                        │  │  MiniLM-L6-v2     │  │ │
│                                        │  │  (embeddings)     │  │ │
│                                        │  └───────────────────┘  │ │
│                                        └─────────────────────────┘ │
│                                                    │               │
│                                                    ▼               │
│                                        ┌───────────────────┐       │
│                                        │   SQLite (WAL)    │       │
│                                        │   + FTS5 index    │       │
│                                        └─────────┬─────────┘       │
│                                                  │                 │
│  ┌───────────────────────────────────────────────┘                 │
│  │                                                                 │
│  ▼                                                                 │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │                    FastAPI REST Server                         │ │
│  │  /timeline · /search · /chat · /stats · /agents · /mcp       │ │
│  │                                                               │ │
│  │  ┌───────────────────────────────────────────────────────┐   │ │
│  │  │           Web Dashboard (Vanilla JS SPA)               │   │ │
│  │  │  Timeline · Chat · Search · Analytics · Agents · Settings │ │
│  │  └───────────────────────────────────────────────────────┘   │ │
│  └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Screenshot → EasyOCR (text) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5
                                     ↑
                              OCR text fed as context
                              (Gemma sees image + reads text)
```

Four components work in concert, with Gemma 4 as the brain: three models plus SQLite's FTS5 full-text index.

- **EasyOCR**: extracts raw screen text
- **Gemma 4 E2B**: understands what you're doing (vision + reasoning)
- **MiniLM-L6-v2**: generates semantic vectors for natural language search
- **FTS5**: indexes text for instant keyword search
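The README does not say how the semantic (MiniLM) and keyword (FTS5) result lists are merged into one hybrid ranking. One common option is reciprocal rank fusion, sketched below purely as an illustration; it is a swapped-in technique, not necessarily what ScreenMind does, and the function name and the constant `k = 60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge ranked lists of activity IDs; items ranked high in any list score more."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = [3, 1, 2]  # hypothetical IDs from the embedding search
keyword_hits = [1, 4]      # hypothetical IDs from the FTS5 search
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))  # → [1, 3, 4, 2]
```

Rank-based fusion avoids having to normalize cosine similarities against FTS5 relevance scores, which live on unrelated scales.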

ScreenMind includes a full agent/plugin system. Build any automation on top of your screen data.

| Mode | File Type | For | Example |
|---|---|---|---|
| 🤖 AI Agent | `.md` | Everyone | Write a prompt in English → Gemma runs it on your data |
| 🐍 Python Plugin | `.py` | Developers | Full code with SDK access, state persistence, LLM calls |

```
---
name: Daily Focus Report
schedule: every 6h
data: timeline, apps, mood
output: local, obsidian
---

Analyze my screen activity and generate a focus report:
- How many hours of deep work vs shallow work?
- What were my main distractions?
- Give me a focus score out of 10.
```

Drop this file in `~/.screenmind/agents/` and it runs automatically.
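To make the file format concrete, here is a rough sketch of how a Markdown agent file could be split into its front matter and its prompt. The `parse_agent_file` helper is hypothetical and handles only flat `key: value` pairs; ScreenMind's own loader may work differently (for example, by using a YAML parser).

```python
def parse_agent_file(text: str) -> tuple[dict[str, str], str]:
    """Return (metadata, prompt) from a file with '---' delimited front matter."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text.strip()  # no front matter: the whole file is the prompt
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        return {}, text.strip()  # unterminated front matter: treat as plain prompt
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    prompt = "\n".join(lines[end + 1:]).strip()
    return meta, prompt
```

With the example agent above, `meta["schedule"]` would be `"every 6h"` and `meta["data"]` would be `"timeline, apps, mood"`, which a scheduler could then split on commas.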

``` python
from screenmind_sdk import ScreenMindSDK

sdk = ScreenMindSDK("my-tracker")

# Get today's activities filtered by app
activities = sdk.get_activities(app="Chrome", limit=20)

# Persistent state across runs
last_count = sdk.load_state("url_count", 0)
urls = sdk.get_urls_visited()
sdk.save_state("url_count", len(urls))

# Ask Gemma (GPU-safe — waits for idle)
insight = sdk.ask_gemma(f"Summarize these URLs: {urls}")
print(insight)
```

Markdown agents declare what data they need:

| Selector | Injects |
|---|---|
| `timeline` | Recent activities with timestamps, apps, summaries |
| `apps` | App usage counts + category breakdown |
| `urls` | URLs visited (extracted from browser address bars) |
| `meetings` | Meeting summaries and durations |
| `mood` | Mood/sentiment from screen analysis |

Data injection auto-scales to your model's context window.

- **daily-journal.md**: First-person journal entry from your day
- **focus-report.md**: Focus score, deep work hours, distractions
- **meeting-actions.md**: Extract action items from meeting transcripts
- **code-changelog.md**: Summarize coding activity (commits, files, repos)

ScreenMind exposes your screen history to any MCP-compatible AI tool:

```
python mcp_server.py  # stdio transport
```

**Claude Desktop config** (`~/.claude/claude_desktop_config.json`):

```
{
  "mcpServers": {
    "screenmind": {
      "command": "python",
      "args": ["C:/path/to/screenmind/mcp_server.py"]
    }
  }
}
```

| Tool | Description |
|---|---|
| `search_screen` | Semantic + keyword search across all history |
| `get_recent_activity` | Last N activities with full details |
| `get_activity_by_time` | Activities for a specific date/time range |
| `get_daily_summary` | AI-generated daily summary |
| `capture_now` | Trigger instant screenshot |
| `get_stats` | Usage statistics |
| `search_audio` | Search meeting transcripts |
| `get_screenshot` | Retrieve screenshot path by activity ID |

Full Swagger docs at `http://127.0.0.1:7777/docs`

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/status` | System health, worker stats |
| `GET` | `/api/timeline?date=2026-05-21` | Activities for a date |
| `GET` | `/api/search?q=debugging auth` | Hybrid semantic + keyword search |
| `POST` | `/api/chat` | Conversational AI with screen memory (SSE stream) |
| `GET` | `/api/stats?range=day` | Analytics (categories, apps, meetings) |
| `GET` | `/api/rewind?date=2026-05-21` | Timelapse frames |
| `POST` | `/api/summary/generate` | Generate AI daily summary |
| `GET` | `/api/agents` | List all agents |
| `POST` | `/api/agents/{name}/run` | Trigger agent execution |
| `POST` | `/api/capture/pause` | Pause capture |
| `POST` | `/api/incognito/toggle` | Toggle incognito mode |
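Calling these endpoints from a script needs nothing beyond the standard library. The sketch below assumes the server is running on the default address and that `GET` endpoints return JSON (likely for a FastAPI app, but not stated in the README); `build_url` and `get_json` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:7777"  # default dashboard address from the README


def build_url(path: str, **params: str) -> str:
    """Join the base URL, an endpoint path, and URL-encoded query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path: str, **params: str):
    """Issue a GET request and decode the body, assuming a JSON response."""
    with urlopen(build_url(path, **params), timeout=10) as response:
        return json.load(response)


# Example (requires a running ScreenMind instance):
# results = get_json("/api/search", q="debugging auth")
```

Encoding the query with `urlencode` matters for searches like `debugging auth`, where a raw space would produce an invalid URL.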

All settings are configurable via `.env`, environment variables, or the **Settings** dashboard (persists to `settings.json`).

| Variable | Default | Description |
|---|---|---|
| `CAPTURE_INTERVAL` | `40` | Seconds between captures |
| `ANALYSIS_MODE` | `merged` | `merged` (accurate, ~76s) or `fast` (~12s) |
| `PERFORMANCE_MODE` | `balanced` | GPU layers: `minimal` / `balanced` / `maximum` |
| `BLOCKED_APPS` | (empty) | Comma-separated apps to never capture |
| `MEETING_TRANSCRIPTION` | `false` | Auto-transcribe when meeting apps detected |
| `RETENTION_DAYS` | `7` | Auto-delete data older than N days (0 = forever) |
| `ENCRYPTION_ENABLED` | `false` | Encrypt screenshots at rest |
| `SENSITIVE_FILTER_ENABLED` | `true` | Redact credit cards, SSNs, API keys |

See `.env.example` for the full list.
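As an illustration of how a few of these variables could be read with their documented defaults, consider the sketch below. The `load_settings` helper and the returned key names are hypothetical; only the variable names and default values are taken from the README, and the real loader also layers in `settings.json`.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read a subset of settings, falling back to the documented defaults."""
    blocked = env.get("BLOCKED_APPS", "")
    return {
        "capture_interval": int(env.get("CAPTURE_INTERVAL", "40")),
        "retention_days": int(env.get("RETENTION_DAYS", "7")),
        "encryption_enabled": env.get("ENCRYPTION_ENABLED", "false").strip().lower() == "true",
        # Comma-separated list; surrounding whitespace and empty entries are dropped
        "blocked_apps": [app.strip() for app in blocked.split(",") if app.strip()],
    }
```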

| Layer | Technology | Why |
|---|---|---|
| Vision + Audio AI | Gemma 4 E2B (via llama.cpp) | Only model with vision + audio + reasoning that runs locally on 4GB VRAM |
| Inference Server | llama-server (llama.cpp) | Direct GGUF inference, OpenAI-compatible API |
| OCR | EasyOCR | Extracts screen text fed to Gemma as context |
| Embeddings | all-MiniLM-L6-v2 | 80MB, runs on CPU, 384-dim vectors for semantic search |
| Backend | FastAPI + Uvicorn | Async-first, auto-generated API docs |
| Database | SQLite (WAL) + FTS5 | Zero-config, concurrent reads, full-text search |
| Capture | mss + ctypes/UI Automation | Native screen capture + accessibility text extraction |
| Frontend | Vanilla JS + CSS | No build step, instant load, dark glassmorphism UI |
| Platform | Windows / macOS / Linux | Abstraction layer with OS-specific adapters |

| Scenario | Behavior |
|---|---|
| llama-server not running | Auto-starts on launch. Captures continue; analysis retried with backoff. |
| Model not downloaded | Auto-downloads GGUF on first start via HuggingFace. |
| GPU out of memory | Detects OOM, retries with delay, re-queues on persistent failure. |
| Duplicate frames | pHash dedup skips identical screenshots (threshold: 8 hamming distance). |
| Stale queue items | Captures >3 min old auto-skipped. Backfilled during idle. |
| App in blocklist | Silently skips; no screenshot saved. |
| Meeting app closed | Process-alive check + silence detection + 5-min hard timeout. |
| Chat during analysis | Cancels in-flight inference, frees GPU in <1s, re-queues analysis. |
| Crash recovery | Stale meetings cleaned on startup. Unanalyzed entries backfilled. |
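Two of the rules above, duplicate-frame detection and stale-item skipping, reduce to simple comparisons. The sketch below treats perceptual hashes as plain integers and assumes a frame counts as a duplicate when the Hamming distance is at most 8; whether ScreenMind uses "at most" or "strictly less than" is not stated, and all function names here are hypothetical.

```python
DEDUP_THRESHOLD = 8        # Hamming distance threshold from the table above
MAX_QUEUE_AGE_S = 3 * 60   # captures older than 3 minutes are skipped


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions in which two perceptual hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_duplicate(hash_a: int, hash_b: int, threshold: int = DEDUP_THRESHOLD) -> bool:
    """Assumption: frames within `threshold` differing bits are near-identical."""
    return hamming_distance(hash_a, hash_b) <= threshold


def is_stale(captured_at: float, now: float, max_age_s: float = MAX_QUEUE_AGE_S) -> bool:
    """True when a queued capture is too old to be worth analyzing right away."""
    return now - captured_at > max_age_s
```

Computing the hash itself (for example with a DCT-based pHash from an imaging library) is outside this sketch; only the comparison logic is shown.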

The web dashboard at `http://127.0.0.1:7777` features:

- **Timeline**: Browse activities by date with thumbnails, AI summaries, category badges
- **Chat**: Conversational AI with screen memory. Ask anything about your history.
- **Search**: Semantic + keyword hybrid search with OCR highlighting on screenshots
- **Analytics**: Category charts, top apps, hourly heatmap, meeting stats
- **Rewind**: Timelapse player with play/pause/scrub/speed controls
- **Memos**: Voice memo list with audio player
- **Agents**: Create, edit, run, and monitor agents
- **Settings**: 8 organized sections (Shortcuts, Capture, AI, Audio, Privacy, Automation, Integrations, Storage)

Dark glassmorphism UI. No build step. Instant load.

Contributions welcome! Here are some high-impact areas:

- 🍎 **macOS/Linux testing**: platform adapters exist, need real hardware testing
- 🐳 **Docker container**: one-command setup
- 🧩 **Community agent registry**: share agents between users
- 🌐 **Browser extension**: richer URL/tab context
- 📤 **Export formats**: Markdown, CSV, JSON

If you find ScreenMind useful, please consider:

- **⭐ Star this repo**: it helps others discover the project
- **🍴 Fork it**: build your own agents and features
- **🐛 Report issues**: help us improve
- **📣 Share it**: tell others about privacy-first AI

MIT License — see [LICENSE](/ayushh0110/ScreenMind/blob/main/LICENSE) for details.

**Built with 🧠 Gemma 4 E2B · 🔒 100% Local · 🚀 Zero Cloud Dependencies**

*Vision + Audio + Reasoning — all three modalities, one model, your machine.*

Made with ❤️ by ayushh0110
