Offline Mac Voice Assistant – Control Your Computer with Zero Data Leaving

wpnews.pro

Control your Mac with your voice. Completely offline.

Your voice, your screen, your commands — nothing leaves your machine. No cloud APIs. No API keys. No subscriptions.

Every cloud voice assistant makes the same tradeoff: you get convenience, they get your data. Your audio gets uploaded. Your screen gets sent to a server. Your commands get logged.

LocalClicky breaks that tradeoff. Everything runs on your hardware:

— transcription, runs locallyWhisper.cpp(qwen3, gemma4) — AI reasoning and vision, runs locallyOllamamacOS say— text-to-speech, built into your Mac** PyAutoGUI**— cursor and click control

No data leaves your machine. Not your voice. Not your screenshots. Not your commands.

Sits in the menubar — no Dock icon, stays out of the way
Say "Hey Jarvis"→ starts a session — stays active until you say goodbye Voice Activity Detection— auto-stops recording when you stop talking (no fixed timeout) Sees your screen on demand— vision model (gemma4:e4b) takes a screenshot when needed** Moves your cursor and clicksbased on what it sees on screen Controls your Mac**: open/quit apps, adjust volume, control Spotify, manage files, run shell commands, inject JS into Chrome** Edits videos**: trim, mute, merge, speed up, resize, add text — all via ffmpeg, no upload** Creates reminders**with natural language dates- Multi-round tool calling — runs commands, checks results, confirms or retries
Conversation memory across the session (last 10 exchanges) Session mode— chain commands back-to-back without repeating the wake word

Icon	State
🎙️	Idle / ready
👂	Listening for "Computer"
🔴	Recording your voice
🔄	Transcribing
🤔	Thinking (Ollama)
🔊	Speaking response
Error

brew install whisper-cpp

mkdir -p /opt/homebrew/share/whisper-cpp/models
curl -L -o /opt/homebrew/share/whisper-cpp/models/ggml-base.en.bin \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin"
brew install ollama

ollama serve

ollama pull qwen3:8b     # command model — tool calling, Mac control
ollama pull gemma4:e4b   # vision model — sees your screen when needed
brew install ffmpeg

Required for any video editing commands. Skip if you don't need video editing.

cd PyClicky
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -c "import openwakeword; openwakeword.utils.download_models()"
pip install webrtcvad-wheels

Without this, recording falls back to a 30-second hard cap instead of stopping when you stop talking.

cd PyClicky
source venv/bin/activate
ollama serve &   # if not already running
python main.py

The app appears in your menubar. No Dock icon.

LocalClicky needs three macOS permissions for the python3

binary inside your venv: /path/to/PyClicky/venv/bin/python3

Permission	Why	Where to grant
Microphone
Voice recording	Prompted automatically on first run
Screen Recording
Screenshot for vision	System Settings → Privacy & Security → Screen Recording
Accessibility
Cursor movement & clicks	System Settings → Privacy & Security → Accessibility

Tip:Ifpython3

is not selectable in the file picker, addTerminalinstead — Python inherits Terminal's permissions when launched from it.

Say "Hey Jarvis" — the icon turns 🔴 and recording starts. When you stop talking, it automatically processes your command and responds.

After responding, it stays active and listens for your next command immediately — no need to say "Computer" again.

Say "bye", "goodbye", "stop listening", "go to sleep", or "that's all" — the assistant says goodbye and returns to wake word mode.

The session also auto-expires after 25 seconds of silence.

You say	What happens
"Open Spotify and play hip hop"	Opens Spotify, searches and plays
"Set Spotify volume to 30 percent"	AppleScript sets Spotify's internal volume
"Set volume to 50 percent"	Sets macOS system volume
"Click the notification bell"	Takes screenshot, finds the bell, clicks it
"What's on my screen?"	Takes screenshot, describes what it sees
"Create a reminder to call John tomorrow at 9am"	Creates reminder in macOS Reminders
"Open a new tab in Chrome"	AppleScript opens a new Chrome tab
"Play next track"	AppleScript skips to next Spotify track
"Make a folder called Projects on my Desktop"	`mkdir ~/Desktop/Projects`
"What is the capital of France?"	Answers directly, no tools needed
"Trim the video on my desktop from 10 seconds to 30 seconds"	ffmpeg cuts the clip, saves to Desktop
"Mute the audio in intro dot mp4"	ffmpeg strips audio track
"Speed up the video to 2x"	ffmpeg applies setpts + atempo filters
"Merge video dot mp4 and clip dot mp4"	ffmpeg concat filter, saves to Desktop

When you ask to click or find something, the assistant calls look_at_screen

— it takes a clean screenshot, sends it to the vision model (gemma4:e4b), and gets back a bounding box for the target element. The center of that box is computed and clicked automatically.

The model decides on its own when it needs to see the screen — you don't have to phrase commands any special way.

Wake word ("Computer")
        ↓
AudioRecorder.start()           ← opens sounddevice InputStream
        ↓  (VAD auto-stop on silence, 30s hard cap)
AudioRecorder.stop()            → WAV file
        ↓
WhisperTranscriber.transcribe() → runs whisper-cli → transcript text
        ↓
Dismissal check ("bye" etc.)   → end session / OllamaClient.chat()
        ↓
OllamaClient.chat() — always qwen3:8b with think mode + tools:
  ├─ run_shell_command  → zsh → output
  ├─ query_system       → read-only zsh → output
  ├─ look_at_screen     → screencapture → gemma4:e4b → [CLICK:x1,y1,x2,y2]
  └─ create_reminder    → Python builds correct AppleScript → osascript
  (up to 5 tool rounds, streaming)
        ↓
CursorControl.extract_action()  → parse [CLICK/POINT/RCLICK:x1,y1,x2,y2]
CursorControl.execute()         → compute center → pyautogui moves/clicks
        ↓
SpeechOutput.speak()            → macOS `say` speaks the response
        ↓
Session active: wait 0.4s → start recording again
Session idle 25s: return to WakeWordDetector
PyClicky/
├── main.py                # rumps menubar app — icons, menu, state display
├── companion.py           # state machine — session management, full pipeline
├── ollama_client.py       # qwen3 with tools, gemma4 vision via look_at_screen
├── wake_word.py           # offline wake word via openWakeWord (hey_jarvis pretrained model)
├── audio_recorder.py      # sounddevice mic capture + VAD silence detection → WAV
├── whisper_transcriber.py # calls whisper-cli subprocess, returns transcript
├── screen_capture.py      # screencapture → resize to 1280px → base64 JPEG
├── cursor_control.py      # parses [CLICK/POINT/RCLICK:x1,y1,x2,y2], clicks center
├── speech_output.py       # macOS `say` command wrapper
├── shell_executor.py      # zsh subprocess runner, cwd=~
└── requirements.txt

Edit ollama_client.py

:

VISION_MODEL = "gemma4:e4b"   # called by look_at_screen tool for visual tasks
COMMAND_MODEL = "qwen3:8b"    # main model — tool calling, reasoning, Mac control

The command model must support reliable tool calling. The vision model must be multimodal.

Vision	Command	Notes
`gemma4:e4b`
`qwen3:8b`
Default — good balance of speed and capability
`gemma4:e4b`
`qwen3:14b`
Better reasoning, needs ~16GB RAM
`gemma4:27b`
`qwen3:8b`
Better vision accuracy, needs ~32GB RAM
`qwen2.5vl:7b`
`qwen3:8b`
Alternative vision model

Edit wake_word.py

:

WAKE_MODEL = "hey_jarvis"

WAKE_MODEL_PATH = "/path/to/your/computer.onnx"  # overrides WAKE_MODEL when set

DETECTION_THRESHOLD = 0.5

To train a custom "computer" model, follow the openWakeWord training guide, then set WAKE_MODEL_PATH

to the output .onnx

file.

Edit companion.py

:

SESSION_IDLE_TIMEOUT = 25.0   # seconds of silence before returning to wake word mode

Edit screen_capture.py

:

MAX_WIDTH = 1280    # resize screenshot to this width before sending to vision model
JPEG_QUALITY = 75   # compression quality

Lower MAX_WIDTH

= faster responses, slightly less visual detail. Higher = more detail, larger payload.

Edit ollama_client.py

:

OLLAMA_URL = "http://localhost:11434/api/chat"

"No speech detected" every time

Check microphone permission
Speak louder or closer to the mic
Whisper model path may be wrong — check logs for the model:

line

Recording never stops / runs too long

Install webrtcvad: pip install webrtcvad-wheels

for VAD silence detection - Without it, recording stops after 30 seconds

Screenshot always fails

Grant Screen Recording to Terminal in System Settings → Privacy & Security
Test: screencapture -x -t jpg /tmp/test.jpg && echo OK

Cursor doesn't move

Grant Accessibility to Terminal in System Settings → Privacy & Security → Accessibility

Wake word never triggers

Wake word detection runs fully offline via openWakeWord — no internet needed
Default keyword is "hey Jarvis"(not "Computer") — say that phrase to trigger - To use a different keyword, change WAKE_MODEL

inwake_word.py

(see Configuration) - Check logs for WAKE triggered:

lines; lowerDETECTION_THRESHOLD

if it's not firing - Speak clearly and at normal pace — very fast or whispered speech may score below threshold

Mic error when OBS or Zoom is running

The app will retry 5 times automatically
If it still fails, close the other app briefly then restart the session

Model says "I can't see your screen"

Ensure Screen Recording permission is granted
Try rephrasing: "look at my screen and click..."

Ollama 400 error

Check ollama list

— ensure both models are pulled - Restart Ollama: ollama serve

"Too many steps" response

The model hit the 5-round tool call limit
Check shell_executor logs for the underlying command error
macOS 12+
Python 3.11+
Homebrew
~8GB RAM free (for both models)
Ollama running locally

Package	Purpose
`rumps`
macOS menubar app framework
`sounddevice`
Mic input stream
`soundfile`
Write WAV files
`numpy`
Audio buffer manipulation
`httpx`
Streaming HTTP to Ollama
`openwakeword`
Offline wake word detection
`pyautogui`
Cursor movement and clicks
`Pillow`
Screenshot resize
`webrtcvad-wheels`
Voice activity detection (optional)

LocalClicky is early. Meaningful areas to improve:

Custom "computer" wake word— train a personal openWakeWord model using thetraining guideand swapWAKE_MODEL_PATH

inwake_word.py

App-specific skills— context-aware commands for Terminal, Xcode, Figma, VS Code** Packaging**— proper.app

bundle so users don't need to run from terminalWindows / Linux ports— the core pipeline is cross-platform; the menubar layer isn't** Better click accuracy**— the vision model (gemma4) has limited spatial precision; a GUI-specific model would help significantly

If you want to work on any of these, open an issue first. PRs welcome.

MIT

source & further reading

github.com — original article

Offline Mac Voice Assistant – Control Your Computer with Zero Data Leaving

Run your AI side-project on zahid.host