Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory
100% local. 100% private. Zero cloud dependencies.
Features · Gemma 4 Deep Dive · Quick Start · Architecture · Agent Platform · MCP · API
With Recall, Microsoft showed that the world wants screen-aware AI. But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative: every screenshot analyzed, every insight generated, every search result, all computed locally using Gemma 4's multimodal capabilities.

It's not just a screen recorder. It's an AI memory you can talk to, search through, and build automations on top of.

- **Smart Capture**: Content-change detection, not a fixed timer. Captures when your screen *actually* changes.
- **Gemma 4 Vision Analysis**: Every screenshot analyzed: app detection, activity categorization, mood, scene description, spatial layout regions.
- **Hybrid Search**: Semantic embeddings (MiniLM) + FTS5 keyword search. Find anything by *meaning*, not just keywords.
- **Chat with Memory**: Conversational RAG with follow-up support. Ask "what did Ishaa say on Discord?" and get the actual message.
- **Voice Memos**: Hold `Ctrl+Shift+V` and Gemma 4's native audio encoder transcribes. A screenshot is captured alongside.
- **Meeting Transcription**: Auto-detects Zoom/Teams/Meet, records audio, transcribes, generates structured summaries.
- **Analytics Dashboard**: Category breakdown, top apps, hourly heatmap, meeting stats, focus metrics.
- **Day Rewind**: Timelapse playback of your entire day with pause/scrub/speed controls.

- **Three Analysis Modes**: Accurate (~76s, deep thinking + layout), Balanced (~40s, thinking), or Fast (~12s, no thinking). You choose.
- **Per-App pHash Cache**: 3-tier caching with app-aware staleness. Communication apps refresh faster than IDEs. Significantly fewer inference calls.
- **Chat-First GPU Priority**: Chat cancels in-flight analysis instantly. GPU freed in <1s.
- **Auto-Pause Heavy Apps**: Games, video editors, and 3D software are detected, and capture pauses automatically.

- **100% Local**: All data stays on your machine. Zero network calls after initial model download. No telemetry. Ever.
- **Sensitive Data Filter**: Auto-redacts credit cards, SSNs, API keys, passwords before storage.
- **Encryption at Rest**: AES encryption for screenshots (Fernet + OS keyring).
- **Dashboard PIN Lock**: Session-based auth with configurable auto-lock timeout.
- **Incognito Mode**: One-click pause. Nothing recorded.
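
A filter like the Sensitive Data Filter could work along these lines. This is a minimal sketch, not ScreenMind's actual implementation: the two patterns, the `redact` function, and the `[REDACTED:...]` placeholder format are all assumptions made for illustration, and a real filter would also need rules for API keys and passwords.

```python
import re

# Hypothetical patterns; a production filter would need many more.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace SSNs and Luhn-valid card numbers with placeholders."""
    text = SSN_RE.sub("[REDACTED:ssn]", text)

    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED:card]" if luhn_valid(digits) else match.group()

    return CARD_RE.sub(_card, text)
```

The Luhn check keeps long digit runs that are not card numbers (order IDs, timestamps) from being redacted by accident.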
Integrations & Extensibility

| Integration | Description |
|---|---|
| Agent Platform | Build automations in Markdown (English) or Python. Drop a file, get an agent. |
| MCP Server | Expose screen history to Claude Desktop, Cursor, VS Code |
| Obsidian | Auto-sync daily summaries to your vault |
| Notion | Push summaries to a Notion database |
| Webhooks | Fire events to Slack, Discord, IFTTT (HMAC signed, auto-retry) |
| Smart Notifications | Distraction alerts, break reminders |
| Auto-Bookmark | Keyword triggers (`git push`, `deploy`) auto-flag important moments |

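
HMAC-signed webhooks let the receiver confirm that an event really came from your ScreenMind instance. The sketch below shows the general technique using Python's standard library. The choice of SHA-256, the event fields, and the function names are assumptions; the source does not specify which hash or header ScreenMind uses.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time check that a received signature matches the body."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, signature)


# Hypothetical event; real payload fields may differ.
secret = b"shared-webhook-secret"
body = json.dumps({"event": "bookmark.created", "id": 42}).encode()
signature = sign_payload(secret, body)
```

A receiver would recompute the digest over the exact bytes it received and reject the request when `verify_signature` returns `False`.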
| Hotkey | Action |
|---|---|
| `Ctrl+Shift+B` | Instant bookmarked capture |
| `Ctrl+Shift+P` | Toggle pause/resume |
| `Ctrl+Shift+V` | Hold to record voice memo |

All hotkeys are customizable from Settings.
Gemma 4 E2B is not a bolt-on; it is architecturally load-bearing. ScreenMind uses all three modalities: vision, audio, and reasoning.
Every screenshot is sent to Gemma 4 with OCR context. It returns structured JSON:
- App name, activity category, summary, detailed context
- Mood classification, confidence score
- Rich scene description (every visible element inventoried)
- Layout regions (sidebar, chat area, toolbar boundaries)
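
Structured model output still has to be validated before it is stored. The following sketch parses a response into a typed record. The field names (`app_name`, `category`, `layout_regions`, and so on) are hypothetical, modeled on the list above; the real schema may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ScreenAnalysis:
    # Field names are hypothetical, modeled on the list above.
    app_name: str
    category: str
    summary: str
    mood: str = "neutral"
    confidence: float = 0.0
    layout_regions: list = field(default_factory=list)


def parse_analysis(raw: str) -> ScreenAnalysis:
    """Parse model output, tolerating stray text around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start : end + 1])
    # Clamp confidence into [0, 1] in case the model overshoots.
    confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    return ScreenAnalysis(
        app_name=data["app_name"],
        category=data["category"],
        summary=data["summary"],
        mood=data.get("mood", "neutral"),
        confidence=confidence,
        layout_regions=data.get("layout_regions", []),
    )
```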
Three modes:
- **Accurate**: single call with thinking (~76s). Best layout detection.
- **Balanced**: thinking enabled, analysis-only (~40s). Richer descriptions than Fast.
- **Fast**: no-thinking prefill trick (~12s). Layout via OCR clustering instead.

Gemma 4 E2B has a native audio encoder. ScreenMind uses it for:
- Voice memo transcription (hold hotkey → speak → release)
- Meeting transcription (15s chunks, map-reduce summarization for long meetings)
No Whisper dependency. One model handles everything.
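
Chunked transcription followed by map-reduce summarization can be sketched as below. This illustrates the general pattern only: `summarize` stands in for a call to the local model, `group_size` is an assumed parameter, and ScreenMind's actual chunking and prompting are not shown in the source.

```python
from typing import Callable, List


def chunk_samples(samples: List[float], sample_rate: int, seconds: int = 15) -> List[List[float]]:
    """Split raw audio samples into fixed-length windows (last may be shorter)."""
    size = sample_rate * seconds
    return [samples[i : i + size] for i in range(0, len(samples), size)]


def map_reduce_summary(
    transcripts: List[str],
    summarize: Callable[[str], str],
    group_size: int = 4,
) -> str:
    """Summarize groups of transcripts, then summarize the summaries."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not transcripts:
        return ""
    level = transcripts
    while len(level) > 1:
        groups = [level[i : i + group_size] for i in range(0, len(level), group_size)]
        level = [summarize("\n".join(g)) for g in groups]
    return level[0]
```

Each pass shrinks the list by roughly a factor of `group_size`, so no single model call has to fit an entire long meeting into its context.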

Reasoning is used for:
- Daily summaries with deep reasoning (`think=True`)
- Chat answers grounded in actual screen data (text-first RAG with vision fallback)
- **Agent execution**: Gemma processes Markdown agent prompts with injected screen data

| Constraint | Why It Rules Out Alternatives |
|---|---|
| Must run continuously in background | Rules out 12B+ models (too heavy) |
| Must understand screenshots natively | Rules out text-only models |
| Must stay 100% local for privacy | Rules out cloud APIs |
| Must handle audio natively | Rules out models without audio encoder |
| Must be fast enough for 30s cycle | E2B processes in 12-76s depending on mode |

Gemma 4 E2B is the only model that checks all five boxes.
Requirements: Python 3.10+ · GPU recommended (4GB+ VRAM) · ~5GB disk for model

    git clone https://github.com/ayushh0110/ScreenMind.git
    cd ScreenMind
    python -m venv venv
    venv\Scripts\activate            # Windows
    # source venv/bin/activate       # macOS/Linux
    pip install -r requirements.txt
    python main.py

Then open http://127.0.0.1:7777 in your browser.

On first run, ScreenMind will:
- Auto-download the Gemma 4 E2B GGUF model (~5GB, one time)
- Start `llama-server` in the background
- Show the welcome screen to set up an optional PIN
- Create `~/.screenmind/` for data storage

Optional: Configure via `.env`

    cp .env.example .env

Or configure everything from the Settings tab in the dashboard.

    ScreenMind

    +----------------+     +--------------+     +-------------------------+
    | Capture Worker |---->| Async Queue  |---->| Analysis Worker         |
    |  - Screen      |     | (max: 100)   |     |                         |
    |  - Window      |     +--------------+     |  Per-App pHash Cache    |
    |  - Dedup       |                          |  (3-tier)               |
    |  - A11y        |                          |           |             |
    |  - Privacy     |                          |           v             |
    +----------------+                          |  EasyOCR                |
                                                |  (text extract)         |
    +----------------+                          |           |             |
    | Audio Worker   |                          |           v             |
    |  - Meeting     |                          |  Gemma 4 E2B            |
    |    detect      |                          |  (via llama.cpp)        |
    |  - Record      |                          |  Vision + Audio         |
    |  - Transcr.    |                          |           |             |
    +----------------+                          |           v             |
                                                |  Layout Analyzer        |
    +----------------+                          |  (spatial OCR)          |
    | Agent Scheduler|                          |           |             |
    |  - .md AI      |                          |           v             |
    |  - .py code    |                          |  MiniLM-L6-v2           |
    +----------------+                          |  (embeddings)           |
                                                +-------------------------+
                                                            |
                                                            v
                                                +-------------------------+
                                                | SQLite (WAL)            |
                                                | + FTS5 index            |
                                                +-------------------------+
                                                            |
                                                            v
    +---------------------------------------------------------------------+
    | FastAPI REST Server                                                 |
    | /timeline · /search · /chat · /stats · /agents · /mcp               |
    |                                                                     |
    |   Web Dashboard (Vanilla JS SPA)                                    |
    |   Timeline · Chat · Search · Analytics · Agents · Settings          |
    +---------------------------------------------------------------------+


    Screenshot → EasyOCR (text) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5
                      ↓
            OCR text fed as context
        (Gemma sees image + reads text)

Four components work in concert, with Gemma 4 as the brain. Three are models; FTS5 is SQLite's full-text index rather than a model:
- **EasyOCR**: extracts raw screen text
- **Gemma 4 E2B**: understands what you're doing (vision + reasoning)
- **MiniLM-L6-v2**: generates semantic vectors for natural language search
- **FTS5**: indexes text for instant keyword search

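
Hybrid search has to merge two differently scored result lists: keyword hits from FTS5 and nearest neighbors from the embedding index. The source does not say how ScreenMind combines them; reciprocal rank fusion, sketched below, is one common approach and is used here purely as an assumption. The row IDs are made up.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Merge ranked ID lists; items ranked high in any list score higher."""
    scores: Dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = [3, 1, 7]    # e.g. row IDs from an FTS5 MATCH query
semantic_hits = [1, 5, 3]   # e.g. row IDs ordered by cosine similarity
merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Because the fusion works on ranks rather than raw scores, it avoids having to normalize BM25 values against cosine similarities.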
ScreenMind includes a full agent/plugin system. Build any automation on top of your screen data.

| Mode | File Type | For | Example |
|---|---|---|---|
| AI Agent | `.md` | Everyone | Write a prompt in English → Gemma runs it on your data |
| Python Plugin | `.py` | Developers | Full code with SDK access, state persistence, LLM calls |

    ---
    name: Daily Focus Report
    schedule: every 6h
    data: timeline, apps, mood
    output: local, obsidian
    ---
    Analyze my screen activity and generate a focus report:
    - How many hours of deep work vs shallow work?
    - What were my main distractions?
    - Give me a focus score out of 10.

Drop this file in `~/.screenmind/agents/` and it runs automatically.
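
To make the file format concrete, here is a rough sketch of how a Markdown agent could be split into its header and prompt. It is not ScreenMind's loader: `parse_agent_file` and `schedule_seconds` are hypothetical helpers, and the `every <N><m|h|d>` schedule grammar is inferred from the single `every 6h` example above.

```python
import re
from typing import Dict, Tuple

UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def parse_agent_file(text: str) -> Tuple[Dict[str, str], str]:
    """Split a Markdown agent into (front-matter dict, prompt body)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("agent file must start with '---'")
    end = lines.index("---", 1)  # closing delimiter; ValueError if missing
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1 :]).strip()


def schedule_seconds(schedule: str) -> int:
    """Convert 'every 6h' style strings to seconds (assumed grammar)."""
    match = re.fullmatch(r"every\s+(\d+)\s*([mhd])", schedule.strip())
    if match is None:
        raise ValueError(f"unsupported schedule: {schedule!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]
```

For the example agent, `schedule_seconds("every 6h")` gives the interval in seconds, and `meta["data"].split(", ")` yields the selectors to inject.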
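
    from screenmind_sdk import ScreenMindSDK

    sdk = ScreenMindSDK("my-tracker")
    # Recent Chrome activity from the screen history
    activities = sdk.get_activities(app="Chrome", limit=20)

    # Compare against the URL count persisted by the previous run
    last_count = sdk.load_state("url_count", 0)
    urls = sdk.get_urls_visited()
    sdk.save_state("url_count", len(urls))
    print(f"{len(urls) - last_count} new URLs since last run")

    # Ask the local model for a summary
    insight = sdk.ask_gemma(f"Summarize these URLs: {urls}")
    print(insight)
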
Markdown agents declare what data they need:

| Selector | Injects |
|---|---|
| `timeline` | Recent activities with timestamps, apps, summaries |
| `apps` | App usage counts + category breakdown |
| `urls` | URLs visited (extracted from browser address bars) |
| `meetings` | Meeting summaries and durations |
| `mood` | Mood/sentiment from screen analysis |

Data injection auto-scales to your model's context window.
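
One way to scale injected data to a context window is to keep the newest items that fit a token budget. The sketch below assumes a crude four-characters-per-token estimate and a hypothetical `reserve` for the model's reply; ScreenMind's real budgeting logic is not described in the source.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly four characters per token."""
    return max(1, len(text) // 4)


def fit_to_context(items: List[str], context_tokens: int, reserve: int = 512) -> List[str]:
    """Keep the newest items that fit, leaving `reserve` tokens for the reply."""
    budget = context_tokens - reserve
    kept: List[str] = []
    for item in reversed(items):  # items are assumed oldest-first
        cost = estimate_tokens(item)
        if cost > budget:
            break
        kept.append(item)
        budget -= cost
    return list(reversed(kept))
```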

Example agents:
- `daily-journal.md`: First-person journal entry from your day
- `focus-report.md`: Focus score, deep work hours, distractions
- `meeting-actions.md`: Extract action items from meeting transcripts
- `code-changelog.md`: Summarize coding activity (commits, files, repos)

ScreenMind exposes your screen history to any MCP-compatible AI tool:

    python mcp_server.py  # stdio transport

Claude Desktop config (`claude_desktop_config.json`, found under `~/Library/Application Support/Claude/` on macOS or `%APPDATA%\Claude\` on Windows):

    {
      "mcpServers": {
        "screenmind": {
          "command": "python",
          "args": ["C:/path/to/screenmind/mcp_server.py"]
        }
      }
    }

| Tool | Description |
|---|---|
| `search_screen` | Semantic + keyword search across all history |
| `get_recent_activity` | Last N activities with full details |
| `get_activity_by_time` | Activities for a specific date/time range |
| `get_daily_summary` | AI-generated daily summary |
| `capture_now` | Trigger instant screenshot |
| `get_stats` | Usage statistics |
| `search_audio` | Search meeting transcripts |
| `get_screenshot` | Retrieve screenshot path by activity ID |

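
Under the hood, an MCP client invokes these tools with JSON-RPC 2.0 `tools/call` requests, sent over stdio as newline-delimited JSON after an `initialize` handshake. The sketch below only builds such a request. The argument names `query` and `limit` are assumptions, since the source does not list each tool's input schema.

```python
import json
from itertools import count

_ids = count(1)  # monotonically increasing request IDs


def tool_call(name: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 `tools/call` request as one line of JSON."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# `query` and `limit` are assumed argument names, not confirmed by the source.
message = tool_call("search_screen", {"query": "auth bug", "limit": 5})
```

In practice Claude Desktop, Cursor, or an MCP client library produces these messages for you; the sketch is only meant to show what travels over the pipe.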
Full Swagger docs at http://127.0.0.1:7777/docs

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/status` | System health, worker stats |
| `GET` | `/api/timeline?date=2026-05-21` | Activities for a date |
| `GET` | `/api/search?q=debugging+auth` | Hybrid semantic + keyword search |
| `POST` | `/api/chat` | Conversational AI with screen memory (SSE stream) |
| `GET` | `/api/stats?range=day` | Analytics (categories, apps, meetings) |
| `GET` | `/api/rewind?date=2026-05-21` | Timelapse frames |
| `POST` | `/api/summary/generate` | Generate AI daily summary |
| `GET` | `/api/agents` | List all agents |
| `POST` | `/api/agents/{name}/run` | Trigger agent execution |
| `POST` | `/api/capture/` | capture |
| `POST` | `/api/incognito/toggle` | Toggle incognito mode |

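
When calling the GET endpoints from a script, query values such as a search phrase containing spaces need URL encoding. A small helper using only the standard library might look like this; `api_url` is a hypothetical name, and the response format is not assumed here.

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:7777"


def api_url(path: str, **params: str) -> str:
    """Build a dashboard API URL with properly encoded query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


search_url = api_url("/api/search", q="debugging auth")
timeline_url = api_url("/api/timeline", date="2026-05-21")
```

With ScreenMind running, either URL can be fetched with `urllib.request.urlopen` or any HTTP client.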
All settings are configurable via `.env`, environment variables, or the Settings dashboard (persists to `settings.json`).

| Variable | Default | Description |
|---|---|---|
| `CAPTURE_INTERVAL` | `40` | Seconds between captures |
| `ANALYSIS_MODE` | `merged` | `merged` (accurate, ~76s) or `fast` (~12s) |
| `PERFORMANCE_MODE` | `balanced` | GPU layers: `minimal` / `balanced` / `maximum` |
| `BLOCKED_APPS` | (empty) | Comma-separated apps to never capture |
| `MEETING_TRANSCRIPTION` | `false` | Auto-transcribe when meeting apps detected |
| `RETENTION_DAYS` | `7` | Auto-delete data older than N days (0 = forever) |
| `ENCRYPTION_ENABLED` | `false` | Encrypt screenshots at rest |
| `SENSITIVE_FILTER_ENABLED` | `true` | Redact credit cards, SSNs, API keys |

See `.env.example` for the full list.
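
For scripts or plugins that need the same values, the variables above can be read with their documented defaults. This is a sketch only: the `Settings` class and `load_settings` function are hypothetical, and the accepted boolean spellings are an assumption.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


def _as_bool(value: str) -> bool:
    # Accepted spellings are an assumption, not taken from the source.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    capture_interval: int
    analysis_mode: str
    retention_days: int
    encryption_enabled: bool
    sensitive_filter_enabled: bool
    blocked_apps: List[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, using the table's defaults."""
    blocked = env.get("BLOCKED_APPS", "")
    return Settings(
        capture_interval=int(env.get("CAPTURE_INTERVAL", "40")),
        analysis_mode=env.get("ANALYSIS_MODE", "merged"),
        retention_days=int(env.get("RETENTION_DAYS", "7")),
        encryption_enabled=_as_bool(env.get("ENCRYPTION_ENABLED", "false")),
        sensitive_filter_enabled=_as_bool(env.get("SENSITIVE_FILTER_ENABLED", "true")),
        blocked_apps=[a.strip() for a in blocked.split(",") if a.strip()],
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.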

| Layer | Technology | Why |
|---|---|---|
| Vision + Audio AI | Gemma 4 E2B (via llama.cpp) | Only model with vision + audio + reasoning that runs locally on 4GB VRAM |
| Inference Server | llama-server (llama.cpp) | Direct GGUF inference, OpenAI-compatible API |
| OCR | EasyOCR | Extracts screen text fed to Gemma as context |
| Embeddings | all-MiniLM-L6-v2 | 80MB, runs on CPU, 384-dim vectors for semantic search |
| Backend | FastAPI + Uvicorn | Async-first, auto-generated API docs |
| Database | SQLite (WAL) + FTS5 | Zero-config, concurrent reads, full-text search |
| Capture | mss + ctypes/UI Automation | Native screen capture + accessibility text extraction |
| Frontend | Vanilla JS + CSS | No build step, instant load, dark glassmorphism UI |
| Platform | Windows / macOS / Linux | Abstraction layer with OS-specific adapters |

| Scenario | Behavior |
|---|---|
| llama-server not running | Auto-starts on launch. Captures continue; analysis retried with backoff. |
| Model not downloaded | Auto-downloads GGUF on first start via HuggingFace. |
| GPU out of memory | Detects OOM, retries with delay, re-queues on persistent failure. |
| Duplicate frames | pHash dedup skips identical screenshots (threshold: 8 hamming distance). |
| Stale queue items | Captures >3 min old auto-skipped. Backfilled during idle. |
| App in blocklist | Silently skips; no screenshot saved. |
| Meeting app closed | Process-alive check + silence detection + 5-min hard timeout. |
| Chat during analysis | Cancels in-flight inference, frees GPU in <1s, re-queues analysis. |
| Crash recovery | Stale meetings cleaned on startup. Unanalyzed entries backfilled. |

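
The duplicate-frame check comes down to counting differing bits between two perceptual hashes. The sketch below assumes 64-bit hashes that have already been computed, and it assumes the threshold of 8 is inclusive; the hash values are made up for illustration.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two 64-bit perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_duplicate(hash_a: int, hash_b: int, threshold: int = 8) -> bool:
    """Treat frames as duplicates when their hashes differ in few bits."""
    return hamming_distance(hash_a, hash_b) <= threshold


previous = 0xF0F0F0F0F0F0F0F0
current = previous ^ 0b111     # three bits flipped: near-identical frame
changed = previous ^ 0xFFFF    # sixteen bits flipped: screen changed
```

Here `is_duplicate(previous, current)` holds while `is_duplicate(previous, changed)` does not, so only the second frame would be sent on for analysis.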
The web dashboard at http://127.0.0.1:7777 features:
- **Timeline**: Browse activities by date with thumbnails, AI summaries, category badges
- **Chat**: Conversational AI with screen memory. Ask anything about your history.
- **Search**: Semantic + keyword hybrid search with OCR highlighting on screenshots
- **Analytics**: Category charts, top apps, hourly heatmap, meeting stats
- **Rewind**: Timelapse player with pause/scrub/speed controls
- **Memos**: Voice memo list with audio player
- **Agents**: Create, edit, run, and monitor agents
- **Settings**: 8 organized sections: Shortcuts, Capture, AI, Audio, Privacy, Automation, Integrations, Storage

Dark glassmorphism UI. No build step. Instant load.
Contributions welcome! Here are some high-impact areas:
- **macOS/Linux testing**: platform adapters exist, need real hardware testing
- **Docker container**: one-command setup
- **Community agent registry**: share agents between users
- **Browser extension**: richer URL/tab context
- **Export formats**: Markdown, CSV, JSON

If you find ScreenMind useful, please consider:
- **Star this repo**: it helps others discover the project
- **Fork it**: build your own agents and features
- **Report issues**: help us improve
- **Share it**: tell others about privacy-first AI

MIT License. See LICENSE for details.

Built with Gemma 4 E2B · 100% Local · Zero Cloud Dependencies

Vision + Audio + Reasoning: all three modalities, one model, your machine.

Made with ❤️ by ayushh0110