{"slug": "offline-mac-voice-assistant-control-your-computer-with-zero-data-leaving", "title": "Offline Mac Voice Assistant – Control Your Computer with Zero Data Leaving", "summary": "LocalClicky, a new open-source voice assistant for Mac, runs entirely offline using local AI models like Whisper.cpp and Ollama, ensuring no voice data, screenshots, or commands leave the user's machine. The tool offers full desktop control, including app management, file operations, and video editing, and is available via GitHub.", "body_md": "**Control your Mac with your voice. Completely offline.**\n\nYour voice, your screen, your commands — nothing leaves your machine. No cloud APIs. No API keys. No subscriptions.\n\nEvery cloud voice assistant makes the same tradeoff: you get convenience, they get your data. Your audio gets uploaded. Your screen gets sent to a server. Your commands get logged.\n\nLocalClicky breaks that tradeoff. Everything runs on your hardware:\n\n- [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) — transcription, runs locally\n- [Ollama](https://ollama.com) (qwen3, gemma4) — AI reasoning and vision, runs locally\n- **macOS say** — text-to-speech, built into your Mac\n- **PyAutoGUI** — cursor and click control\n\nNo data leaves your machine. Not your voice. Not your screenshots. 
Not your commands.\n\n- Sits in the menubar — no Dock icon, stays out of the way\n- Say **\"Hey Jarvis\"** → starts a session — stays active until you say goodbye\n- **Voice Activity Detection** — auto-stops recording when you stop talking (no fixed timeout)\n- **Sees your screen on demand** — vision model (gemma4:e4b) takes a screenshot when needed\n- **Moves your cursor and clicks** based on what it sees on screen\n- **Controls your Mac**: open/quit apps, adjust volume, control Spotify, manage files, run shell commands, inject JS into Chrome\n- **Edits videos**: trim, mute, merge, speed up, resize, add text — all via ffmpeg, no upload\n- **Creates reminders** with natural language dates\n- Multi-round tool calling — runs commands, checks results, confirms or retries\n- Conversation memory across the session (last 10 exchanges)\n- **Session mode** — chain commands back-to-back without repeating the wake word\n\n| Icon | State |\n|---|---|\n| 🎙️ | Idle / ready |\n| 👂 | Listening for the wake word (\"Hey Jarvis\") |\n| 🔴 | Recording your voice |\n| 🔄 | Transcribing |\n| 🤔 | Thinking (Ollama) |\n| 🔊 | Speaking response |\n|  | Error |\n\n```\nbrew install whisper-cpp\n\n# Download the base English model\nmkdir -p /opt/homebrew/share/whisper-cpp/models\ncurl -L -o /opt/homebrew/share/whisper-cpp/models/ggml-base.en.bin \\\n  \"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin\"\n\nbrew install ollama\n\n# Start Ollama (runs in the foreground; run the pulls below from a second terminal)\nollama serve\n\n# Pull the models\nollama pull qwen3:8b     # command model — tool calling, Mac control\nollama pull gemma4:e4b   # vision model — sees your screen when needed\n\n# Optional: ffmpeg, used only by the video editing commands\nbrew install ffmpeg\n```\n\nffmpeg is required for any video editing commands. 
Skip if you don't need video editing.\n\n```\ncd PyClicky\npython3 -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\n\n# Download the pretrained wake word models\npython -c \"import openwakeword; openwakeword.utils.download_models()\"\n\n# Optional: voice activity detection\npip install webrtcvad-wheels\n```\n\nWithout `webrtcvad-wheels`, recording falls back to a 30-second hard cap instead of stopping when you stop talking.\n\n```\ncd PyClicky\nsource venv/bin/activate\nollama serve &   # if not already running\npython main.py\n```\n\nThe app appears in your menubar. No Dock icon.\n\nLocalClicky needs three macOS permissions for the `python3` binary inside your venv:\n\n`/path/to/PyClicky/venv/bin/python3`\n\n| Permission | Why | Where to grant |\n|---|---|---|\n| Microphone | Voice recording | Prompted automatically on first run |\n| Screen Recording | Screenshot for vision | System Settings → Privacy & Security → Screen Recording |\n| Accessibility | Cursor movement & clicks | System Settings → Privacy & Security → Accessibility |\n\nTip: If `python3` is not selectable in the file picker, add Terminal instead — Python inherits Terminal's permissions when launched from it.\n\nSay **\"Hey Jarvis\"** — the icon turns 🔴 and recording starts. 
When you stop talking, it automatically processes your command and responds.\n\nAfter responding, it stays active and listens for your next command immediately — **no need to say \"Hey Jarvis\" again**.\n\nSay **\"bye\"**, **\"goodbye\"**, **\"stop listening\"**, **\"go to sleep\"**, or **\"that's all\"** — the assistant says goodbye and returns to wake word mode.\n\nThe session also auto-expires after **25 seconds of silence**.\n\n| You say | What happens |\n|---|---|\n| \"Open Spotify and play hip hop\" | Opens Spotify, searches and plays |\n| \"Set Spotify volume to 30 percent\" | AppleScript sets Spotify's internal volume |\n| \"Set volume to 50 percent\" | Sets macOS system volume |\n| \"Click the notification bell\" | Takes screenshot, finds the bell, clicks it |\n| \"What's on my screen?\" | Takes screenshot, describes what it sees |\n| \"Create a reminder to call John tomorrow at 9am\" | Creates reminder in macOS Reminders |\n| \"Open a new tab in Chrome\" | AppleScript opens a new Chrome tab |\n| \"Play next track\" | AppleScript skips to next Spotify track |\n| \"Make a folder called Projects on my Desktop\" | `mkdir ~/Desktop/Projects` |\n| \"What is the capital of France?\" | Answers directly, no tools needed |\n| \"Trim the video on my desktop from 10 seconds to 30 seconds\" | ffmpeg cuts the clip, saves to Desktop |\n| \"Mute the audio in intro dot mp4\" | ffmpeg strips audio track |\n| \"Speed up the video to 2x\" | ffmpeg applies setpts + atempo filters |\n| \"Merge video dot mp4 and clip dot mp4\" | ffmpeg concat filter, saves to Desktop |\n\nWhen you ask to click or find something, the assistant calls `look_at_screen`: it takes a clean screenshot, sends it to the vision model (gemma4:e4b), and gets back a bounding box for the target element. 
The center of that box is computed and clicked automatically.\n\nThe model decides on its own when it needs to see the screen — you don't have to phrase commands any special way.\n\n```\nWake word (\"Hey Jarvis\")\n        ↓\nAudioRecorder.start()           ← opens sounddevice InputStream\n        ↓  (VAD auto-stop on silence, 30s hard cap)\nAudioRecorder.stop()            → WAV file\n        ↓\nWhisperTranscriber.transcribe() → runs whisper-cli → transcript text\n        ↓\nDismissal check (\"bye\" etc.)   → end session / OllamaClient.chat()\n        ↓\nOllamaClient.chat() — always qwen3:8b with think mode + tools:\n  ├─ run_shell_command  → zsh → output\n  ├─ query_system       → read-only zsh → output\n  ├─ look_at_screen     → screencapture → gemma4:e4b → [CLICK:x1,y1,x2,y2]\n  └─ create_reminder    → Python builds correct AppleScript → osascript\n  (up to 5 tool rounds, streaming)\n        ↓\nCursorControl.extract_action()  → parse [CLICK/POINT/RCLICK:x1,y1,x2,y2]\nCursorControl.execute()         → compute center → pyautogui moves/clicks\n        ↓\nSpeechOutput.speak()            → macOS `say` speaks the response\n        ↓\nSession active: wait 0.4s → start recording again\nSession idle 25s: return to WakeWordDetector\n\nPyClicky/\n├── main.py                # rumps menubar app — icons, menu, state display\n├── companion.py           # state machine — session management, full pipeline\n├── ollama_client.py       # qwen3 with tools, gemma4 vision via look_at_screen\n├── wake_word.py           # offline wake word via openWakeWord (hey_jarvis pretrained model)\n├── audio_recorder.py      # sounddevice mic capture + VAD silence detection → WAV\n├── whisper_transcriber.py # calls whisper-cli subprocess, returns transcript\n├── screen_capture.py      # screencapture → resize to 1280px → base64 JPEG\n├── cursor_control.py      # parses [CLICK/POINT/RCLICK:x1,y1,x2,y2], clicks center\n├── speech_output.py       # macOS `say` command wrapper\n├── shell_executor.py      # 
zsh subprocess runner, cwd=~\n└── requirements.txt\n```\n\nEdit `ollama_client.py`:\n\n```\nVISION_MODEL = \"gemma4:e4b\"   # called by look_at_screen tool for visual tasks\nCOMMAND_MODEL = \"qwen3:8b\"    # main model — tool calling, reasoning, Mac control\n```\n\nThe command model must support reliable tool calling. The vision model must be multimodal.\n\n| Vision | Command | Notes |\n|---|---|---|\n| `gemma4:e4b` | `qwen3:8b` | Default — good balance of speed and capability |\n| `gemma4:e4b` | `qwen3:14b` | Better reasoning, needs ~16GB RAM |\n| `gemma4:27b` | `qwen3:8b` | Better vision accuracy, needs ~32GB RAM |\n| `qwen2.5vl:7b` | `qwen3:8b` | Alternative vision model |\n\nEdit `wake_word.py`:\n\n```\n# Use a different pretrained model (e.g. \"alexa\", \"hey_mycroft\"):\nWAKE_MODEL = \"hey_jarvis\"\n\n# Point to a custom trained .onnx or .tflite file instead:\nWAKE_MODEL_PATH = \"/path/to/your/computer.onnx\"  # overrides WAKE_MODEL when set\n\n# Lower = more sensitive (more false positives), higher = stricter:\nDETECTION_THRESHOLD = 0.5\n```\n\nTo train a custom \"computer\" model, follow the [openWakeWord training guide](https://github.com/dscripka/openWakeWord/blob/main/docs/training.md), then set `WAKE_MODEL_PATH` to the output `.onnx` file.\n\nEdit `companion.py`:\n\n```\nSESSION_IDLE_TIMEOUT = 25.0   # seconds of silence before returning to wake word mode\n```\n\nEdit `screen_capture.py`:\n\n```\nMAX_WIDTH = 1280    # resize screenshot to this width before sending to vision model\nJPEG_QUALITY = 75   # compression quality\n```\n\nLower `MAX_WIDTH` = faster responses, slightly less visual detail. 
Higher = more detail, larger payload.\n\nEdit `ollama_client.py`:\n\n```\nOLLAMA_URL = \"http://localhost:11434/api/chat\"\n```\n\n**\"No speech detected\" every time**\n\n- Check microphone permission\n- Speak louder or closer to the mic\n- Whisper model path may be wrong — check logs for the `model:` line\n\n**Recording never stops / runs too long**\n\n- Install webrtcvad: `pip install webrtcvad-wheels` for VAD silence detection\n- Without it, recording stops after 30 seconds\n\n**Screenshot always fails**\n\n- Grant Screen Recording to Terminal in System Settings → Privacy & Security\n- Test: `screencapture -x -t jpg /tmp/test.jpg && echo OK`\n\n**Cursor doesn't move**\n\n- Grant Accessibility to Terminal in System Settings → Privacy & Security → Accessibility\n\n**Wake word never triggers**\n\n- Wake word detection runs fully offline via openWakeWord — no internet needed\n- Default keyword is **\"hey Jarvis\"** (not \"Computer\") — say that phrase to trigger\n- To use a different keyword, change `WAKE_MODEL` in `wake_word.py` (see Configuration)\n- Check logs for `WAKE triggered:` lines; lower `DETECTION_THRESHOLD` if it's not firing\n- Speak clearly and at normal pace — very fast or whispered speech may score below threshold\n\n**Mic error when OBS or Zoom is running**\n\n- The app will retry 5 times automatically\n- If it still fails, close the other app briefly then restart the session\n\n**Model says \"I can't see your screen\"**\n\n- Ensure Screen Recording permission is granted\n- Try rephrasing: \"look at my screen and click...\"\n\n**Ollama 400 error**\n\n- Check `ollama list` — ensure both models are pulled\n- Restart Ollama: `ollama serve`\n\n**\"Too many steps\" response**\n\n- The model hit the 5-round tool call limit\n- Check shell_executor logs for the underlying command error\n\n- macOS 12+\n- Python 3.11+\n- Homebrew\n- ~8GB RAM free (for both models)\n- Ollama running locally\n\n| Package | Purpose |\n|---|---|\n| `rumps` | macOS menubar app framework |\n| `sounddevice` | Mic input stream |\n| `soundfile` | Write WAV files |\n| `numpy` | Audio buffer manipulation |\n| `httpx` | Streaming HTTP to Ollama |\n| `openwakeword` | Offline wake word detection |\n| `pyautogui` | Cursor movement and clicks |\n| `Pillow` | Screenshot resize |\n| `webrtcvad-wheels` | Voice activity detection (optional) |\n\nLocalClicky is early. Meaningful areas to improve:\n\n- **Custom \"computer\" wake word** — train a personal openWakeWord model using the [training guide](https://github.com/dscripka/openWakeWord/blob/main/docs/training.md) and swap `WAKE_MODEL_PATH` in `wake_word.py`\n- **App-specific skills** — context-aware commands for Terminal, Xcode, Figma, VS Code\n- **Packaging** — proper `.app` bundle so users don't need to run from terminal\n- **Windows / Linux ports** — the core pipeline is cross-platform; the menubar layer isn't\n- **Better click accuracy** — the vision model (gemma4) has limited spatial precision; a GUI-specific model would help significantly\n\nIf you want to work on any of these, open an issue first. 
PRs welcome.\n\nMIT", "url": "https://wpnews.pro/news/offline-mac-voice-assistant-control-your-computer-with-zero-data-leaving", "canonical_source": "https://github.com/dikshantrajput/LocalClicky", "published_at": "2026-06-25 11:17:46+00:00", "updated_at": "2026-06-25 11:44:09.038700+00:00", "lang": "en", "topics": ["ai-tools", "ai-products", "ai-agents", "natural-language-processing", "computer-vision"], "entities": ["LocalClicky", "Whisper.cpp", "Ollama", "PyAutoGUI", "ffmpeg", "GitHub", "macOS", "Gemma"], "alternates": {"html": "https://wpnews.pro/news/offline-mac-voice-assistant-control-your-computer-with-zero-data-leaving", "markdown": "https://wpnews.pro/news/offline-mac-voice-assistant-control-your-computer-with-zero-data-leaving.md", "text": "https://wpnews.pro/news/offline-mac-voice-assistant-control-your-computer-with-zero-data-leaving.txt", "jsonld": "https://wpnews.pro/news/offline-mac-voice-assistant-control-your-computer-with-zero-data-leaving.jsonld"}}