Offline Mac Voice Assistant – Control Your Computer with Zero Data Leaving

LocalClicky, a new open-source voice assistant for Mac, runs entirely offline on local AI tooling (Whisper.cpp for transcription, Ollama for the language and vision models), so no voice data, screenshots, or commands leave the user's machine. The tool offers full desktop control, including app management, file operations, and video editing, and is available via GitHub.

Control your Mac with your voice. Completely offline.

Your voice, your screen, your commands: nothing leaves your machine. No cloud APIs. No API keys. No subscriptions.

Every cloud voice assistant makes the same tradeoff: you get convenience, they get your data. Your audio gets uploaded. Your screen gets sent to a server. Your commands get logged.

LocalClicky breaks that tradeoff. Everything runs on your hardware:

- [Whisper.cpp](https://github.com/ggerganov/whisper.cpp): transcription, runs locally
- [Ollama](https://ollama.com) (qwen3, gemma4): AI reasoning and vision, runs locally
- macOS `say`: text-to-speech, built into your Mac
- PyAutoGUI: cursor and click control

No data leaves your machine. Not your voice. Not your screenshots. Not your commands.

- Sits in the menubar: no Dock icon, stays out of the way
- Say "Hey Jarvis" to start a session; it stays active until you say goodbye
- Voice Activity Detection: auto-stops recording when you stop talking (no fixed timeout; a 30-second hard cap is the only limit)
- Sees your screen on demand: a screenshot is taken and sent to the vision model (`gemma4:e4b`) only when needed
- Moves your cursor and clicks based on what it sees on screen
- Controls your Mac: open/quit apps, adjust volume, control Spotify, manage files, run shell commands, inject JS into Chrome
- Edits videos: trim, mute, merge, speed up, resize, add text, all via ffmpeg, no upload
- Creates reminders with natural-language dates
- Multi-round tool calling: runs commands, checks results, confirms or retries
- Conversation memory across the session (last 10 exchanges)
- Session mode: chain commands back-to-back without repeating the wake word

The menubar icon shows the current state:

| Icon | State |
|---|---|
| 🎙️ | Idle / ready |
| 👂 | Listening for "Hey Jarvis" |
| 🔴 | Recording your voice |
| 🔄 | Transcribing |
| 🤔 | Thinking (Ollama) |
| 🔊 | Speaking response |
| | Error |

Install Whisper.cpp:

    brew install whisper-cpp

    # Download the base English model
    mkdir -p /opt/homebrew/share/whisper-cpp/models
    curl -L -o /opt/homebrew/share/whisper-cpp/models/ggml-base.en.bin \
      "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin"

Install Ollama and pull both models:

    brew install ollama

    # Start Ollama (it keeps running in the foreground; use a second terminal for the pulls)
    ollama serve

    # Pull the models
    ollama pull qwen3:8b      # command model: tool calling, Mac control
    ollama pull gemma4:e4b    # vision model: sees your screen when needed

Install ffmpeg (optional):

    brew install ffmpeg

Required for any video editing commands. Skip if you don't need video editing.

Set up the Python environment:

    cd PyClicky
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    python -c "import openwakeword; openwakeword.utils.download_models()"

Install voice activity detection (optional):

    pip install webrtcvad-wheels

Without this, recording falls back to a 30-second hard cap instead of stopping when you stop talking.

Run the app:

    cd PyClicky
    source venv/bin/activate
    ollama serve &    # if not already running
    python main.py

The app appears in your menubar. No Dock icon.
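
Before the first run, it can help to confirm that the tools from the setup steps are actually discoverable. The following is a minimal sketch and not part of LocalClicky: the function name `preflight` and the shape of its return value are hypothetical, while the binary names (`whisper-cli`, `ollama`, `ffmpeg`) and the model path are the ones used in this README.

```python
import shutil
from pathlib import Path

# Binary names as used in this README; ffmpeg is only needed for video editing.
REQUIRED_TOOLS = ("whisper-cli", "ollama")
OPTIONAL_TOOLS = ("ffmpeg",)
# Model location from the Whisper.cpp download step.
WHISPER_MODEL = Path("/opt/homebrew/share/whisper-cpp/models/ggml-base.en.bin")


def preflight(required=REQUIRED_TOOLS, optional=OPTIONAL_TOOLS, model_path=WHISPER_MODEL):
    """Report missing prerequisites (hypothetical helper, not a LocalClicky API)."""
    return {
        # shutil.which returns None when a binary is not on PATH.
        "missing_required": [t for t in required if shutil.which(t) is None],
        "missing_optional": [t for t in optional if shutil.which(t) is None],
        "model_found": Path(model_path).is_file(),
    }


if __name__ == "__main__":
    report = preflight()
    for name in report["missing_required"]:
        print(f"missing required tool: {name}")
    for name in report["missing_optional"]:
        print(f"missing optional tool: {name}")
    if not report["model_found"]:
        print(f"Whisper model not found at {WHISPER_MODEL}")
```

The check only looks at `PATH` and the file system; it does not verify that Ollama is running or that the models have been pulled.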

LocalClicky needs three macOS permissions for the `python3` binary inside your venv:

    /path/to/PyClicky/venv/bin/python3

| Permission | Why | Where to grant |
|---|---|---|
| Microphone | Voice recording | Prompted automatically on first run |
| Screen Recording | Screenshot for vision | System Settings → Privacy & Security → Screen Recording |
| Accessibility | Cursor movement & clicks | System Settings → Privacy & Security → Accessibility |

**Tip:** If `python3` is not selectable in the file picker, add **Terminal** instead. Python inherits Terminal's permissions when launched from it.

Say "Hey Jarvis". The icon turns 🔴 and recording starts. When you stop talking, the assistant automatically processes your command and responds.

After responding, it stays active and listens for your next command immediately, with no need to say "Hey Jarvis" again.

Say "bye", "goodbye", "stop listening", "go to sleep", or "that's all" to end the session: the assistant says goodbye and returns to wake word mode. The session also auto-expires after 25 seconds of silence.

| You say | What happens |
|---|---|
| "Open Spotify and play hip hop" | Opens Spotify, searches and plays |
| "Set Spotify volume to 30 percent" | AppleScript sets Spotify's internal volume |
| "Set volume to 50 percent" | Sets macOS system volume |
| "Click the notification bell" | Takes screenshot, finds the bell, clicks it |
| "What's on my screen?" | Takes screenshot, describes what it sees |
| "Create a reminder to call John tomorrow at 9am" | Creates reminder in macOS Reminders |
| "Open a new tab in Chrome" | AppleScript opens a new Chrome tab |
| "Play next track" | AppleScript skips to next Spotify track |
| "Make a folder called Projects on my Desktop" | `mkdir ~/Desktop/Projects` |
| "What is the capital of France?" | Answers directly, no tools needed |
| "Trim the video on my desktop from 10 seconds to 30 seconds" | ffmpeg cuts the clip, saves to Desktop |
| "Mute the audio in intro dot mp4" | ffmpeg strips audio track |
| "Speed up the video to 2x" | ffmpeg applies setpts + atempo filters |
| "Merge video dot mp4 and clip dot mp4" | ffmpeg concat filter, saves to Desktop |

When you ask to click or find something, the assistant calls `look_at_screen`: it takes a clean screenshot, sends it to the vision model (`gemma4:e4b`), and gets back a bounding box for the target element. The center of that box is computed and clicked automatically.

The model decides on its own when it needs to see the screen, so you don't have to phrase commands any special way.

The full pipeline:

    Wake word ("Hey Jarvis")
        ↓
    AudioRecorder.start()            ← opens sounddevice InputStream
        ↓  (VAD auto-stop on silence, 30s hard cap)
    AudioRecorder.stop()             → WAV file
        ↓
    WhisperTranscriber.transcribe()  → runs whisper-cli → transcript text
        ↓
    Dismissal check ("bye" etc.)     → end session, otherwise OllamaClient.chat()
        ↓
    OllamaClient.chat(): always qwen3:8b with think mode + tools:
        ├─ run_shell_command  → zsh → output
        ├─ query_system       → read-only zsh → output
        ├─ look_at_screen     → screencapture → gemma4:e4b → CLICK:x1,y1,x2,y2
        └─ create_reminder    → Python builds correct AppleScript → osascript
        (up to 5 tool rounds, streaming)
        ↓
    CursorControl.extract_action()   → parse CLICK/POINT/RCLICK:x1,y1,x2,y2
    CursorControl.execute()          → compute center → pyautogui moves/clicks
        ↓
    SpeechOutput.speak()             → macOS say speaks the response
        ↓
    Session active: wait 0.4s → start recording again
    Session idle 25s: return to WakeWordDetector

Project layout:

    PyClicky/
    ├── main.py                  # rumps menubar app: icons, menu, state display
    ├── companion.py             # state machine: session management, full pipeline
    ├── ollama_client.py         # qwen3 with tools, gemma4 vision via look_at_screen
    ├── wake_word.py             # offline wake word via openWakeWord (hey_jarvis pretrained model)
    ├── audio_recorder.py        # sounddevice mic capture + VAD silence detection → WAV
    ├── whisper_transcriber.py   # calls whisper-cli subprocess, returns transcript
    ├── screen_capture.py        # screencapture → resize to 1280px → base64 JPEG
    ├── cursor_control.py        # parses CLICK/POINT/RCLICK:x1,y1,x2,y2, clicks center
    ├── speech_output.py         # macOS say command wrapper
    ├── shell_executor.py        # zsh subprocess runner, cwd=~
    └── requirements.txt

To change the models, edit `ollama_client.py`:

    VISION_MODEL  = "gemma4:e4b"    # called by look_at_screen tool for visual tasks
    COMMAND_MODEL = "qwen3:8b"      # main model: tool calling, reasoning, Mac control

The command model must support reliable tool calling. The vision model must be multimodal.

| Vision | Command | Notes |
|---|---|---|
| gemma4:e4b | qwen3:8b | Default: good balance of speed and capability |
| gemma4:e4b | qwen3:14b | Better reasoning, needs ~16GB RAM |
| gemma4:27b | qwen3:8b | Better vision accuracy, needs ~32GB RAM |
| qwen2.5vl:7b | qwen3:8b | Alternative vision model |

To change the wake word, edit `wake_word.py`:

    # Use a different pretrained model (e.g. "alexa", "hey_mycroft"):
    WAKE_MODEL = "hey_jarvis"

    # Point to a custom trained .onnx or .tflite file instead:
    WAKE_MODEL_PATH = "/path/to/your/computer.onnx"   # overrides WAKE_MODEL when set

    # Lower = more sensitive (more false positives), higher = stricter:
    DETECTION_THRESHOLD = 0.5

To train a custom "computer" model, follow the [openWakeWord training guide](https://github.com/dscripka/openWakeWord/blob/main/docs/training.md), then set `WAKE_MODEL_PATH` to the output `.onnx` file.

To change the session timeout, edit `companion.py`:

    SESSION_IDLE_TIMEOUT = 25.0   # seconds of silence before returning to wake word mode

To change screenshot handling, edit `screen_capture.py`:

    MAX_WIDTH    = 1280   # resize screenshot to this width before sending to vision model
    JPEG_QUALITY = 75     # compression quality

A lower `MAX_WIDTH` gives faster responses with slightly less visual detail. A higher value gives more detail and a larger payload.
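
Because the screenshot is resized before the vision model sees it, a bounding box returned by the model is presumably expressed in the resized image's pixels and would need scaling back before a click. The sketch below illustrates that arithmetic only. The helper names are hypothetical, the `CLICK/POINT/RCLICK:x1,y1,x2,y2` reply format is taken from the pipeline description, and the source does not confirm that LocalClicky maps coordinates in exactly this way.

```python
import re

MAX_WIDTH = 1280  # same default as in screen_capture.py

# Matches e.g. "CLICK:100,200,300,400"; the three action names come from this README.
ACTION_RE = re.compile(r"\b(CLICK|POINT|RCLICK):(\d+),(\d+),(\d+),(\d+)")


def resized_size(width, height, max_width=MAX_WIDTH):
    """Image size after shrinking to max_width while keeping the aspect ratio."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, round(height * scale)


def parse_action(text):
    """Return (action, (x1, y1, x2, y2)), or None if the text has no action tag."""
    m = ACTION_RE.search(text)
    if m is None:
        return None
    return m.group(1), tuple(int(g) for g in m.groups()[1:])


def box_center_on_screen(box, screen_width, max_width=MAX_WIDTH):
    """Center of a box given in resized-image pixels, mapped to screen coordinates.

    Assumption: screen_width is measured in the same units the click API expects.
    """
    x1, y1, x2, y2 = box
    scale = screen_width / min(screen_width, max_width)
    return round((x1 + x2) / 2 * scale), round((y1 + y2) / 2 * scale)
```

For a 2560-wide screen, a box of `100,200,300,400` in the 1280-wide image has its center at (200, 300), which maps to (400, 600) on screen. Halving `MAX_WIDTH` doubles that scale factor, which is one way to see why a smaller screenshot costs click precision.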

To change the Ollama endpoint, edit `ollama_client.py`:

    OLLAMA_URL = "http://localhost:11434/api/chat"

**"No speech detected" every time**
- Check microphone permission
- Speak louder or closer to the mic
- Whisper model path may be wrong: check logs for the `model:` line

**Recording never stops / runs too long**
- Install webrtcvad for VAD silence detection: `pip install webrtcvad-wheels`
- Without it, recording stops after 30 seconds

**Screenshot always fails**
- Grant Screen Recording to Terminal in System Settings → Privacy & Security
- Test: `screencapture -x -t jpg /tmp/test.jpg && echo OK`

**Cursor doesn't move**
- Grant Accessibility to Terminal in System Settings → Privacy & Security → Accessibility

**Wake word never triggers**
- Wake word detection runs fully offline via openWakeWord; no internet needed
- Default keyword is "Hey Jarvis", not "Computer"; say that phrase to trigger
- To use a different keyword, change `WAKE_MODEL` in `wake_word.py` (see Configuration)
- Check logs for `WAKE triggered:` lines; lower `DETECTION_THRESHOLD` if it's not firing
- Speak clearly and at normal pace; very fast or whispered speech may score below threshold

**Mic error when OBS or Zoom is running**
- The app will retry 5 times automatically
- If it still fails, close the other app briefly, then restart the session

**Model says "I can't see your screen"**
- Ensure Screen Recording permission is granted
- Try rephrasing: "look at my screen and click..."

**Ollama 400 error**
- Run `ollama list` to confirm both models are pulled
- Restart Ollama: `ollama serve`

**"Too many steps" response**
- The model hit the 5-round tool call limit
- Check shell executor logs for the underlying command error

Requirements:

- macOS 12+
- Python 3.11+
- Homebrew
- ~8GB RAM free for both models
- Ollama running locally

Python dependencies:

| Package | Purpose |
|---|---|
| rumps | macOS menubar app framework |
| sounddevice | Mic input stream |
| soundfile | Write WAV files |
| numpy | Audio buffer manipulation |
| httpx | Streaming HTTP to Ollama |
| openwakeword | Offline wake word detection |
| pyautogui | Cursor movement and clicks |
| Pillow | Screenshot resize |
| webrtcvad-wheels | Voice activity detection (optional) |

LocalClicky is early. Meaningful areas to improve:

- Custom "computer" wake word: train a personal openWakeWord model using the [training guide](https://github.com/dscripka/openWakeWord/blob/main/docs/training.md) and swap `WAKE_MODEL_PATH` in `wake_word.py`
- App-specific skills: context-aware commands for Terminal, Xcode, Figma, VS Code
- Packaging: proper `.app` bundle so users don't need to run from terminal
- Windows / Linux ports: the core pipeline is largely cross-platform; the menubar layer and the macOS-specific pieces (`say`, `screencapture`, AppleScript) aren't
- Better click accuracy: the vision model (gemma4) has limited spatial precision; a GUI-specific model would help significantly

If you want to work on any of these, open an issue first. PRs welcome.

License: MIT