A Robot Arm with Eyes, Ears & Brain: Runs on MacBook

A developer built a tabletop robot arm that understands spoken commands and picks up objects using only a MacBook Air, with no cloud dependency. The system runs three AI models locally (speech-to-text, a language model for planning, and an object detector) on Apple Silicon via MLX and MPS. The project demonstrates that capable AI robotics can run entirely on consumer hardware.

Most "AI robot" demos you see online quietly send everything to a cloud GPU somewhere. The robot looks impressive, but pull the network cable and it falls apart. I wanted to find out: can a tiny tabletop arm actually understand spoken commands, see what's on the desk, and pick stuff up without ever phoning home?

Spoiler: yes. And it all fits on a MacBook Air.

This post is a writeup of that demo. I won't drown you in code; at the end I'll link the repo if you want to build one. Here, I want to show you what's possible and how the pieces fit together.

What It Does

You speak. The robot picks. Here's a typical exchange:

▎ 🎙️ "Hand me the green cube."

The arm scans the desk, finds the green cube among other colored cubes, lowers a suction nozzle onto it, picks it up, then carries it to wherever I hold my hand out and gently drops it into my palm.

Other things it understands:

- "Drop the blue cube in the cardboard box."
- "Hand me the credit card."
- "Wait" mid-action: the arm stops where it is.

Everything runs on the laptop. No internet required. The Mac never reaches out for help.

The Three Brains

The "AI" in this project isn't one model. It's three, each doing what it's best at, talking to each other through a small Python script. All three run locally on the GPU built into M-series Macs, via Apple's MLX and MPS:

1. Ears: Nemotron-ASR (~1 GB). A streaming speech-to-text model from NVIDIA. It listens through the MacBook's built-in mic and turns your voice into text in roughly real time.
I use it via the mlx-audio library, which is optimized for Apple Silicon.

2. Brain: Qwen3-1.7B (~4 GB). A small language model from Alibaba. It reads the transcribed text and decides what the robot should do, output as a tiny JSON plan like {"action":"pick and place", "source":"a green cube", "target":"a human hand"}. The whole reasoning step takes about a second. It runs locally on Apple's MLX framework, much as Apple Intelligence runs on-device on your Mac.

3. Eyes: OWLv2 (~600 MB). Google's open-vocabulary object detector. Unlike traditional detectors that only know fixed classes ("dog", "car"), OWLv2 takes a text description and finds whatever you describe. So when Qwen says "find a green cube", OWLv2 actually looks for a green cube, with no special training needed.

Total memory: about 6 GB. The Mac sips this comfortably with room to spare for the rest of the OS.

The Hardware (Surprisingly Small)

- mycobot 280 M5 (https://shop.elephantrobotics.com/collections/mycobot-280/products/mycobot-worlds-smallest-and-lightest-six-axis-collaborative-robot): 6-axis robot arm, 280 mm reach. About $400, made by Elephant Robotics.
- Suction pump kit (https://shop.elephantrobotics.com/collections/suction-pumps/products/suction-pump-2-0): Comes as a mycobot accessory. Two wires control the pump and the air valve.
- USB webcam: Mounted on a gantry above the workspace, looking straight down. Nothing fancy; a $30 cam works.
- MacBook Air M4: Running everything. No external GPU needed.
- Colored cubes, a cardboard box, a credit card: Random demo objects.

That's it. No depth sensor, no LIDAR, no NVIDIA box humming under the desk.

How a Single Command Flows Through the System

Let me walk you through what happens when you say "Hand me the green cube."

Step 1: Mic to text (~500 ms). A background thread is always listening. When it detects speech (your voice rises above the room's noise floor), it captures the audio, sends it to Nemotron, and gets back: "Hand me the green cube."

Step 2: Text to plan (~1 second).
The transcription goes to Qwen3 with a small prompt that explains its job ("turn a robot command into JSON"). Qwen replies: {"action": "pick and place", "source": "a green cube", "target": "a human hand"}

Step 3: Find the cube (~100 ms). A frame from the camera goes to OWLv2 along with the query "a green cube." OWLv2 returns a bounding box around the green cube with a confidence score.

Step 4: Camera pixel → robot coordinate. Knowing where the camera is relative to the robot (calibrated once at the start), the script transforms the cube's pixel location into an actual XY position in millimeters in the robot's coordinate system.

Step 5: Move the arm. The script calculates joint angles using an inverse kinematics solver (the open-source library ikpy) and sends them to the mycobot over USB serial. The arm moves above the cube and descends, and the pump turns on.

Step 6: Carry to your hand. OWLv2 now starts looking for "a human hand." It finds your palm in the frame, the script computes where it is, and the arm carries the cube there. While moving, it keeps checking: if you move your hand, the arm tracks. When you hold still for two seconds, the pump releases and the cube drops into your hand.

Total time, voice to delivery: about 8 seconds.

What Surprised Me

A few things were not what I expected going in.

The models are small, but they're capable. Qwen3-1.7B isn't the biggest LLM out there, but for parsing a robot command into structured JSON, it's plenty. Same for OWLv2: it isn't as accurate as a fine-tuned detector for one specific object, but the fact that I can change the target by editing one string ("a green cube" → "a credit card" → "a banana") is wildly powerful.

The hardest part wasn't AI. It was calibration. Knowing exactly where the camera is, exactly where the robot's joints think they are, exactly how long the pump nozzle hangs below the wrist… those millimeters matter.
I went through three rounds of touch-calibration (manually jogging the robot's pump to known points on a marker) before pick accuracy was reliable.

Voice control needs guardrails. Voice activity detection picks up coughs, fan noise, even the script's own confirmation chimes if you don't use headphones. About 200 lines of the code are filters: "ignore short utterances", "drop common filler words", "always allow STOP through", "if the LLM mis-parses, refuse to move the arm."

The whole stack is real-time on battery. On my MacBook Air, with the camera streaming, all three models loaded, and the arm taking commands, the laptop pulls maybe 15-20 W. You could honestly run this off the wall socket at a coffee shop.

What I'm Going to Try Next

A few directions I want to explore now that the basic loop works:

- Bigger LLM, more nuanced commands. Try Qwen3-8B or Llama 3.1 8B for richer reasoning. "Sort these cubes by color" or "Put the heaviest one in the box first."
- Vision-language for end-to-end tasks. Replace Qwen + OWLv2 with a single VLM like Qwen2.5-VL. Less hand-stitching.
- Stereo cameras for depth. Right now I cheat: I assume objects are on the desk (or hands are at a known height). A second camera would let me compute true 3D positions.

If you've been wanting to play with locally-deployed AI but didn't know how to make it tangible, robotics is a fun way in. You don't need a $50K humanoid. A $400 arm and three open models are enough to build something that genuinely surprises people in person.

Three takeaways:

1. Local AI is real now. A 5-year-old laptop can run a stack that would have required a server in 2020.
2. Compose models, don't train one. STT + LLM + vision detector beats trying to train one custom thing. They're each great at their job.
3. Voice + vision + robotics is more fun than any of them alone. This is the part that's hard to convey in text: when you say a sentence and a physical thing in the world responds, it feels different.
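For readers who want something concrete for Step 4 (camera pixel → robot coordinate), here is a minimal sketch. It is not the repo's code. It assumes a straight-down camera and objects at a known height, so a flat affine map fitted from a few touch-calibration points is good enough, and it ignores lens distortion. The function names and the calibration numbers are hypothetical placeholders.

```python
import numpy as np


def fit_pixel_to_robot(pixels, robot_xy):
    """Least-squares fit of an affine map so that [u, v, 1] @ coeffs ~ (x_mm, y_mm)."""
    pixels = np.asarray(pixels, dtype=float)
    robot_xy = np.asarray(robot_xy, dtype=float)
    # Append a column of ones so the fit can include a translation offset.
    design = np.hstack([pixels, np.ones((len(pixels), 1))])
    coeffs, *_ = np.linalg.lstsq(design, robot_xy, rcond=None)
    return coeffs  # shape (3, 2)


def pixel_to_robot(coeffs, u, v):
    """Map one pixel (u, v) to robot XY in millimeters."""
    x, y = np.array([u, v, 1.0]) @ coeffs
    return float(x), float(y)


def box_center(box):
    """Center of a detector bounding box given as (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2, (y0 + y1) / 2


# Hypothetical calibration data: pixel positions of four touched points and the
# robot XY (mm) read back at each one. Real values come from your own setup.
CAL_PIXELS = [(100, 100), (540, 100), (540, 380), (100, 380)]
CAL_ROBOT_MM = [(150, 110), (150, -110), (290, -110), (290, 110)]

coeffs = fit_pixel_to_robot(CAL_PIXELS, CAL_ROBOT_MM)
u, v = box_center((300, 220, 340, 260))  # a bounding box from the detector
print(pixel_to_robot(coeffs, u, v))
```

Three non-collinear points are the minimum for an affine fit. Extra points let the least-squares step average out small touch errors, which is one reason repeated calibration rounds can help.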
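The voice guardrails described earlier ("ignore short utterances", "drop common filler words", "always allow STOP through", "if the LLM mis-parses, refuse to move the arm") can be sketched with the standard library alone. This is an illustration under assumptions, not the actual filter code: the word lists, the three-word threshold, and the function names are hypothetical, and "pick and place" is the only action the post shows.

```python
import json

ALLOWED_ACTIONS = {"pick and place"}           # the only action shown in the post
STOP_WORDS = {"stop", "wait"}                  # always let these through
FILLERS = {"um", "uh", "hmm", "okay", "yeah"}  # hypothetical filler list


def classify_utterance(text, min_words=3):
    """Return "stop", "ignore", or "command" for a transcribed utterance."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    # Safety first: a stop word wins over every other rule.
    if any(w in STOP_WORDS for w in words):
        return "stop"
    meaningful = [w for w in words if w not in FILLERS]
    if len(meaningful) < min_words:
        return "ignore"
    return "command"


def parse_plan(llm_reply):
    """Return a validated plan dict, or None if the arm should not move."""
    try:
        plan = json.loads(llm_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(plan, dict):
        return None
    if plan.get("action") not in ALLOWED_ACTIONS:
        return None
    for key in ("source", "target"):
        value = plan.get(key)
        if not isinstance(value, str) or not value.strip():
            return None
    return plan


reply = '{"action": "pick and place", "source": "a green cube", "target": "a human hand"}'
print(classify_utterance("Hand me the green cube."), parse_plan(reply))
```

Returning None on any doubt keeps the failure mode boring: a misheard or mis-parsed command means the arm stays put.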