{"slug": "show-hn-recreate-thinking-machines-276b-voice-demo-with-duct-tape-and-8b-model", "title": "Show HN: Recreate Thinking Machines 276B voice demo with duct tape and 8B model", "summary": "A developer has built a CPU-only voice agent that replicates four key behaviors from Thinking Machines' 276B-parameter Interaction Models demo using off-the-shelf parts and commodity AI models on a single laptop. The project, which runs on one Python asyncio loop with local speech and vision processing, demonstrates friend detection, live translation, slouch detection, and search with chart generation by gluing together models like YOLO11, Silero VAD, and Llama-3.1-8B-Instruct-Turbo. The work shows how close a careful software harness can get to matching a massive custom-trained model's surface behaviors without requiring specialized hardware or training.", "body_md": "A CPU-only voice agent that replicates the surface behaviors of Thinking Machines' Interaction Models demo (May 2026) — real-time speech, vision-keyed proactivity, live translation, mid-conversation background tasks — on a laptop, with off-the-shelf parts and minimal LLM calls.\n\nThe point isn't to match Thinking Machines' architecture. They trained a 276B MoE from scratch on continuous audio+video with 200ms micro-turns. This project glues commodity models together with a Python event loop and shows how close a careful harness can get on the four behaviors that demo highlighted. Speech and vision are local (Silero VAD, Kroko ASR, YOLO11 pose, Piper TTS); LLM calls go to DeepInfra (Llama-3.1-8B-Instruct-Turbo for the foreground, DeepSeek-V3.2 for structured background work). 
One CPU laptop, one process, one asyncio loop.\n\nThe four demo behaviors all run end to end on a real laptop with a real webcam and mic:\n\n- **Friend detection**: YOLO11-pose on the webcam emits `person_count_changed`; the registered watcher fires on the non-primary person.\n- **Live translation**: Silero VAD cuts phrase-sized chunks, Whisper-large-v3-turbo on DeepInfra translates them to English, and Piper speaks each chunk interpreter-style over the user. Exit is automatic when the user speaks English on an end-of-turn pause.\n- **Slouch detection**: shoulder→ear vector angle off vertical, debounced over three frames (~0.75s at 4 fps) so a momentary lean doesn't fire.\n- **Search + chart with continued conversation**: the foreground says \"let me find those for you\" while a background worker calls Serper, then DeepSeek-V3.2 for a Chart.js spec. The user can interrupt and ask follow-ups while the chart renders in the browser.\n\nPass `--no-cam` to skip the camera and the YOLO load entirely;\nvision-keyed triggers stay in the table but don't fire automatically\n(`VisionWorker.push_event` still works for scripted demos).\n`--no-audio` runs from stdin without touching mic or TTS.\n`--no-audio --no-cam` together gives a headless pure-text session,\nwhich is what the integration tests use.\n\n```\nflowchart TB\n    O[\"<b>Orchestrator</b><br/>(Python, single process)<br/><br/>• event loop<br/>• triggers table<br/>• foreground queue<br/>• background queue<br/>• completed-tasks queue\"]\n    W[\"<b>Webcam</b><br/>YOLO11 pose<br/>@ 4 fps\"]\n    M[\"<b>Mic</b><br/>Silero + Kroko<br/>+ Whisper\"]\n    S[\"<b>Speaker</b><br/>Piper<br/>(VITS / onnx)\"]\n    N[\"<b>Network</b><br/>Llama + DeepSeek<br/>+ Serper\"]\n    B[\"<b>WebSocket</b><br/>Browser<br/>(Chart.js)\"]\n\n    O <--> W\n    O <--> M\n    O <--> S\n    O <--> N\n    O <--> B\n```\n\nThe orchestrator runs a single asyncio loop. 
Three event sources feed it:\n\n- **Audio events** from Silero VAD + Kroko ASR (normal mode) or Silero VAD + Whisper (translate mode): `user_started_speaking`, `partial_transcript`, `user_finished_speaking`, `translation_chunk`.\n- **Vision events** from the YOLO worker: `person_count_changed`, `pose_state_changed`.\n- **Background completions** from the worker that handles search and chart generation.\n\nThe **triggers table** is the central data structure. It is a list of\nregistered behaviors that the orchestrator consults on every event.\nEntries are written by the foreground model in response to user\ninstructions and read by the event loop. The orchestrator logs the\ntable after every change, so you can watch the demo's \"memory\" grow\nover a session.\n\n**Local (CPU):**\n\n- [Silero VAD](https://github.com/snakers4/silero-vad): endpointing and barge-in gating\n- [Kroko ASR](https://kroko.ai) via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx): streaming speech-to-text\n- [YOLO11-pose](https://docs.ultralytics.com) via ultralytics: person + pose at 4 fps\n- [Piper](https://github.com/OHF-Voice/piper1-gpl): TTS (VITS via onnxruntime)\n- [WebRTC AEC3](https://webrtc.googlesource.com/src/+/refs/heads/main/modules/audio_processing/aec3/) via [livekit](https://pypi.org/project/livekit/)'s Python bindings: acoustic echo cancellation on the mic path, so speaker output doesn't self-barge mid-reply\n- `sounddevice` + `soundfile` for mic/speaker I/O and FLAC encoding of translate-mode audio uploads\n- `opencv-python` for webcam capture\n\n**Network APIs (DeepInfra by default, plus Serper):**\n\n- `meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`: foreground \"interaction\" model\n- `deepseek-ai/DeepSeek-V3.2`: background reasoning / structured-output model\n- `openai/whisper-large-v3-turbo`: translation, only invoked in translate mode\n- [Serper](https://serper.dev): web search\n\nThe three model calls go through DeepInfra's OpenAI-compatible\n`/v1/openai/chat/completions` and 
`/v1/openai/audio/translations` endpoints. The base URL is set in `src/foreground.py` and\n`src/background.py`; point them at any OpenAI-format provider (the\nreal OpenAI, Together, Groq, Fireworks, a local vLLM/Ollama server,\netc.) and supply matching model names. Cost numbers in this README\nassume DeepInfra's pricing.\n\n**Orchestrator core:**\n\n- `httpx` for streaming SSE to/from the model endpoints\n- `partial-json-parser` for incremental extraction of the `say` field before the full response lands\n- `python-dotenv` for `.env` loading\n\n**Browser-side:**\n\n- `fastapi` + `uvicorn` + `websockets` serve a single static HTML page that renders charts with Chart.js v4. Used only for behavior #4.\n\nEarlier rounds used Chatterbox Turbo. Chatterbox sounds better but is comfortably slower than real-time on a single CPU core, which meant the first sentence of every assistant turn was bottlenecked on synthesis rather than on Llama's time-to-first-token. Piper runs the VITS-style synthesizer as an ONNX graph through onnxruntime — no torch at synthesis time — and produces audio at several times real-time on a single core. The cost is some prosody quality; the win is that time-to-first-audio drops by roughly an order of magnitude, which is the latency dimension that matters most for the demo. Piper also streams natively (sub-sentence audio chunks), so barge-in cancels within tens of ms instead of the seconds-per-sentence granularity the old sentence-by-sentence loop gave us.\n\nOn headphones the mic only hears the user, and barge-in detection is straightforward: VAD says \"speech\", the orchestrator cancels TTS, the user takes over. On speakers, the mic also hears the assistant's own voice coming back from the speakers a few tens of milliseconds later. 
VAD can't tell that apart from a real interruption, so a naive setup self-barges on every reply — the assistant cancels itself a second into its first sentence.\n\nWebRTC AEC3 fixes this at the sample level. The TTS playback path\ntees its int16 PCM into a reference ring buffer (resampled to 16 kHz\nso it matches the mic rate); the mic loop, on every 10 ms block,\nhands the reference to `process_reverse_stream` and the mic to\n`process_stream` before VAD or ASR see anything. AEC3's adaptive\nfilter learns the room's impulse response and subtracts the echo.\nAfter the filter converges (~0.5–1.5 seconds of TTS), speaker bleed\nis suppressed by 20–40 dB and VAD only fires on actual user speech.\n\nThe convergence window itself is handled separately: the orchestrator\nignores `user_started_speaking` events during the first\n`aec_convergence_ms` (default 1200) of each TTS utterance, because\nthe filter hasn't seen enough reference yet to be reliable. The UX\ncost is essentially zero — nobody interrupts the opening of a reply\nthat just started — and it eliminates the otherwise-persistent\nopening-sentence self-cancellation.\n\nThe token-overlap heuristic in `audio._looks_like_self_echo` is kept\nas a backstop on ASR finals (for residual bleed during double-talk\nor when the platform's render-to-capture delay differs significantly\nfrom the configured `aec_stream_delay_ms`). With AEC active it\nalmost never fires; the log line `echo filter: dropping likely self-transcription (post-AEC residual)` is your hint that something\nis misaligned.\n\nIf livekit isn't installed at runtime, AEC degrades to no-op with a\nlogged warning and the behaviour matches the pre-AEC version (the\ntoken-overlap filter does what it can). Set `CHEAPIM_AEC_ENABLED=0` to disable explicitly. 
Tune the render-to-capture delay with\n`CHEAPIM_AEC_DELAY_MS` if echo persists past the convergence window\non your hardware (USB devices typically want 80–150ms; Bluetooth\n200–300ms).\n\nThe default is `yolo11n-pose.pt` (nano) running at **4 fps**, balanced\nfor \"feels real time\" on a CPU laptop. End-to-end latency from a pose\nchange to a trigger fire lands roughly in the 150–300ms range on a\nmodern laptop in practice (your hardware will vary): camera read is\nnear-instant once buffer-draining is in play, inference is ~60–100ms\nfor `n`, and the rest is the asyncio hop into the orchestrator and\nTTS dispatch.\n\nEarlier versions defaulted to `yolo11x-pose.pt` at 2 fps. That gave\nbeautiful keypoints but a 2–3 second perceived delay — the bottleneck\nwas camera buffering compounding with a slow capture interval, not\nmodel size. Both have been fixed (see the buffer-drain note in\n`vision._run_loop`), but `n` is still the right default: it leaves\nplenty of headroom for the loop to recover from a slow frame without\nfalling behind. `m` is a good upgrade if your slouch angle jitters\non `n`'s noisier keypoints.\n\nOverride via env:\n\n```\nCHEAPIM_YOLO_MODEL=yolo11m-pose.pt    # slower, more stable keypoints\nCHEAPIM_YOLO_MODEL=yolo11x-pose.pt    # slowest, very stable keypoints\n```\n\nIf you switch to `x` on CPU, also drop `VisionConfig.fps` to 2.0 (in\n`src/vision.py`); the loop will silently fall behind otherwise and\nyou'll see the lag come back.\n\nRequires Python 3.11.\n\nBuilt and tested on Linux. The dependencies are cross-platform\n(`sounddevice`, `opencv-python`, `onnxruntime`, `ultralytics` all\nship macOS and Windows wheels) and nothing in the orchestrator is\nLinux-specific, so macOS and Windows should work — but they haven't\nbeen exercised. 
The mic/speaker and webcam permission notes below\ncover the macOS and Linux cases; Windows users will need to adapt.\n\n```\ngit clone <this repo>\ncd cheap-im\nuv venv && source .venv/bin/activate   # or your preferred venv tool\nuv pip install -e .\n\ncp .env.example .env\n# fill in DEEPINFRA_API_KEY and SERPER_API_KEY\n```\n\n`pyproject.toml` pins `torch` and `torchvision` to PyTorch's CPU-only\nwheel index (`download.pytorch.org/whl/cpu`). The demo never uses\nCUDA; this keeps the install at ~250 MB instead of ~3 GB. (Piper\ndoesn't use torch at all — only Silero VAD does — but torch is still\na direct dep so the CPU-index pin applies.) If you're not using `uv`,\ninstall torch from the CPU index manually:\n\n```\npip install torch torchvision --index-url https://download.pytorch.org/whl/cpu\npip install -e .\n```\n\nDownload Banafo Kroko ASR weights for English (and any other languages\nyou want) from the [HuggingFace\nmirror](https://huggingface.co/hudaiapa88/sherpa-stt-onnx). You need\nfour files in one folder:\n\n```\nmodels/en/kroko_64l/\n├── encoder.int8.onnx\n├── decoder.int8.onnx\n├── joiner.int8.onnx\n└── tokens.txt\n```\n\nSet `CHEAPIM_KROKO_MODEL_DIR=./models/en/kroko_64l` in your `.env`.\n\nultralytics downloads pose weights on first use. No manual step\nrequired; the first `cheap-im` run with vision enabled will fetch\n`yolo11n-pose.pt` (~6 MB) into ultralytics' cache. If you want to\npre-cache or use a different size, the model name is read from\n`CHEAPIM_YOLO_MODEL` (see above).\n\nPiper ships a downloader. Pick a voice from the\n[Piper voices list](https://github.com/OHF-Voice/piper1-gpl/blob/main/docs/VOICES.md)\nand run:\n\n```\nmkdir -p models/piper\npython -m piper.download_voices --data-dir models/piper en_US-lessac-medium\n```\n\nThat writes both `en_US-lessac-medium.onnx` and\n`en_US-lessac-medium.onnx.json` into `models/piper/`. 
Point the env at\nthe `.onnx` file:\n\n```\nCHEAPIM_PIPER_MODEL_PATH=./models/piper/en_US-lessac-medium.onnx\n```\n\nThe `.onnx.json` file must sit next to the `.onnx`; Piper reads it\nautomatically. `medium` is a good default for laptop CPUs; `high`\nsounds better but is roughly 2× slower. Optional knobs:\n\n- `CHEAPIM_PIPER_LENGTH_SCALE`: float; `>1.0` slows speech, `<1.0` speeds it up. Useful if a particular voice feels rushed.\n- `CHEAPIM_PIPER_SPEAKER_ID`: int; only meaningful for multi-speaker voices (most en_US voices are single-speaker).\n\nThe orchestrator hosts the chart page itself. Open\n[http://127.0.0.1:8765/](http://127.0.0.1:8765/) in a regular browser\nwindow **before** starting the demo. The orchestrator doesn't manage\nthe browser lifecycle.\n\nOn macOS, grant terminal access to the microphone in System Settings →\nPrivacy & Security. On Linux, ensure the user is in the `audio` group\nand that PulseAudio/PipeWire is running. List devices with\n`python -m sounddevice` and set `CHEAPIM_AUDIO_DEVICE=<index>` if the\ndefault doesn't work.\n\nOn macOS, grant terminal access to the camera in System Settings →\nPrivacy & Security. On Linux, the user needs read access to `/dev/video*`\n(typically via the `video` group). 
If your machine has multiple cameras,\nset `CHEAPIM_WEBCAM_INDEX=<index>` — OpenCV enumerates in the same order\nas `v4l2-ctl --list-devices` on Linux and `system_profiler SPCameraDataType`\non macOS.\n\n```\ncheap-im\n```\n\nOr, to skip mic/TTS and drive the orchestrator from stdin (useful for testing the chart pipeline without audio hardware):\n\n```\ncheap-im --no-audio --log-level=DEBUG\n```\n\nThen type the user's side of the conversation at the `you>` prompt.\n\nTo skip the webcam and YOLO model load (useful on a headless machine, in CI, or when iterating on non-vision behaviors):\n\n```\ncheap-im --no-cam\n```\n\nVision-keyed triggers still register and persist in the triggers\ntable; they just don't fire from the camera. Tests and scripted demos\ncan still push synthetic events via `VisionWorker.push_event` — that\npath doesn't need a camera.\n\nTo watch what YOLO sees while the orchestrator runs — bounding boxes,\nkeypoints, which person is treated as primary, the live slouch angle,\nand the debounce streak counters — pass `--cam-debug`:\n\n```\ncheap-im --cam-debug\n```\n\nAn OpenCV window opens alongside the chart UI. Press `q` in the\nwindow to close it; the orchestrator keeps running headlessly\n(closing the preview should not be a way to silently disable\ntriggers). This is the flag to reach for when \"the model isn't firing\nthe trigger I expected\" — the overlay shows the angle and streak\ncounts that go into the decision. Incompatible with `--no-cam`.\n\nThe `--no-audio` and `--no-cam` flags compose. `cheap-im --no-audio --no-cam` runs a\npure-text session with no hardware touched at all, which is the\nconfiguration the integration tests use.\n\nUser: Every time one of my friends enters the frame, say 'friend'.\n\nThe foreground model registers a `new_person_in_frame` visual\nwatcher. 
The vision worker tracks a \"primary\" person (largest\nbounding box, assumed to be the user) and fires the watcher whenever\na non-primary person appears in frame for at least two consecutive\nframes (~0.5s at 4 fps), subject to the trigger's cooldown.\n\nUser: I'll speak in Hindi. Can you translate to English in real time?\n\nThe foreground model emits `{\"set_mode\": {\"mode\": \"translate\"}}` and\nthe orchestrator flips the audio path. In translate mode, Silero VAD\ncuts the user's speech at phrase pauses (~400ms of silence) and\nend-of-turn pauses (~1400ms). Each chunk is sent to Whisper-large-v3-turbo\non DeepInfra with `task=translate`, and the returned English text is\nspoken by Piper over the user's continued speech — interpreter-style,\nwithout barge-in.\n\nExit is automatic: when Whisper detects a chunk as English **and**\nthat chunk ended on a full end-of-turn pause, the orchestrator flips\nmode back to normal and routes the English text to the foreground as\nan ordinary user turn. The model replies however it wants. Whisper\naudio is metered in seconds and folded into the cost estimate\nalongside LLM tokens.\n\nEnd-to-end latency from end-of-phrase to spoken English is dominated by Whisper round-trip; DeepInfra typically returns in under a second on warm calls, with cold-start spikes of a few seconds on the first chunk.\n\nUser: Tell me when I start slouching.\n\nThe foreground model registers a `slouch_sustained` visual watcher.\nSlouch is detected from the angle of the shoulder-midpoint →\near-midpoint vector off vertical, on the primary person only. The\nstate flips between `upright` and `slouching` after three consecutive\nframes (~0.75s at 4 fps) past threshold, so a momentary lean forward\ndoesn't fire and a brief head-bob back doesn't bounce out.\n\nUser: What are typical human reaction times for auditory, visual, and tactile cues? 
Could you search?\n\nThe foreground model speaks \"Let me find those for you\" immediately\nand emits a `delegate` for `search`. The background worker runs\nSerper.\n\nWhen the search returns, the result is buffered as a system note. The next user turn picks it up and Llama 8B speaks the answer from it.\n\nUser: Can you visualize them in a bar chart?\n\nA second `delegate`, this time for `chart`. Background worker calls\nDeepSeek-V3.2 for a Chart.js spec, validates the JSON, pushes to the\nbrowser. Chart appears in the open browser window.\n\nUser (interrupting): In the meantime, why is auditory faster than visual?\n\nThe foreground handles this as an ordinary turn while the chart render completes asynchronously. No notification needed — the chart just shows up in the browser when it's ready.\n\nThings the harness **cannot** do, and which TM's architecture handles\nnatively:\n\n- True mid-sentence interruption with learned timing on when to\nyield. The orchestrator does coarse-grained barge-in: when VAD\ndetects new speech mid-TTS, it cancels the current Piper chunk and\ntruncates the assistant turn in history to roughly what played.\nWebRTC AEC handles the speaker-bleed case (so the assistant doesn't\ncancel itself), but the model has no learned policy on\n*whether* to yield — every detected user start cancels TTS unconditionally.\n- Backchanneling (\"mhm\", \"right\") during user speech.\n- Simultaneous speaking and listening at the token level. The harness cleanly separates these: while TTS plays, the mic is still listened to, but the system can't generate and listen at the same time.\n- Visual proactivity decisions that depend on subtle context (not just hardcoded events like \"new person\").\n- Sub-400ms turn-taking. 
Realistic floor here is 500–1200ms on a laptop, dominated by DeepInfra time-to-first-token (Piper itself is fast enough not to be the bottleneck on most hardware).\n- Code-switching mid-utterance during translation.\n- Graceful \"I got cut off after saying X\" behavior. The orchestrator approximates this by truncating the assistant turn in history to match what was actually played through TTS, but it isn't learned — the model isn't trained on cut-off transcripts.\n\nThe orchestrator tracks input/output tokens for every foreground and\nbackground call, plus seconds of audio sent to Whisper. The default\ncap is 100k tokens per session, set via `CHEAPIM_TOKEN_CAP`. When\nhit, a warning is logged and the demo continues; the goal is to make\nthe cost claim verifiable from logs, not to enforce a hard limit.\nThe token cap does not gate Whisper minutes.\n\nA typical 5-minute demo run looks like:\n\n- ~10 foreground turns × ~300 input + ~100 output tokens → 3k input, 1k output Llama 8B tokens → ~$0.000090\n- ~2 background tasks (one search, one chart) × ~500 in + ~600 out → 1k input, 1.2k output DeepSeek-V3.2 tokens → ~$0.000716\n- Serper search: 1 call → $0.001\n- Translate mode (if used): ~$0.0002 per minute of audio\n\nTotal: well under a cent of inference for a non-translation demo, plus Serper; a 10-minute translation session adds about $0.002 on top.\n\n```\n/\n├── README.md\n├── SPEC.md                  the design doc this README implements\n├── pyproject.toml\n├── .env.example\n├── src/\n│   ├── orchestrator.py      event loop, triggers table, queues\n│   ├── foreground.py        DeepInfra Llama client, streaming JSON parsing\n│   ├── background.py        DeepInfra DeepSeek client, search/chart tasks\n│   ├── audio.py             Silero VAD + Kroko ASR + Whisper + Piper TTS + WebRTC AEC\n│   ├── vision.py            YOLO11-pose @ 4 fps, person count + slouch\n│   ├── modes.py             normal / translate mode state\n│   ├── web/\n│   │   ├── server.py        
FastAPI + WS chart server\n│   │   └── chart.html       Chart.js v4 renderer\n│   └── prompts/\n│       ├── foreground.txt   system prompt + output grammar\n│       └── chart.txt        DeepSeek chart-gen prompt\n├── examples/\n│   ├── README.md            how to write your own visual watcher\n│   └── quorum_watcher.py    concrete extension sketch\n└── tests/\n    └── test_foreground_parsing.py\n```\n\n```\nuv pip install -e \".[dev]\"\npytest\n```\n\nThe only dev dependency is `pytest` itself — the streaming\nintegration tests use plain `asyncio.run()` instead of pulling in\npytest-asyncio. The tests cover the foreground JSON-parsing path\n(streaming-say-early, plain-text fallback, truncated streams, history\ntrimming) and a mock-transport integration test that exercises the\nfull streaming pipeline end-to-end.\n\nMIT.", "url": "https://wpnews.pro/news/show-hn-recreate-thinking-machines-276b-voice-demo-with-duct-tape-and-8b-model", "canonical_source": "https://github.com/kouhxp/cheap-im", "published_at": "2026-06-12 08:16:37+00:00", "updated_at": "2026-06-12 08:49:11.515962+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "computer-vision", "natural-language-processing"], "entities": ["Thinking Machines", "Silero VAD", "Kroko ASR", "YOLO11", "Piper TTS", "DeepInfra", "Llama-3.1-8B-Instruct-Turbo", "DeepSeek-V3.2"], "alternates": {"html": "https://wpnews.pro/news/show-hn-recreate-thinking-machines-276b-voice-demo-with-duct-tape-and-8b-model", "markdown": "https://wpnews.pro/news/show-hn-recreate-thinking-machines-276b-voice-demo-with-duct-tape-and-8b-model.md", "text": "https://wpnews.pro/news/show-hn-recreate-thinking-machines-276b-voice-demo-with-duct-tape-and-8b-model.txt", "jsonld": "https://wpnews.pro/news/show-hn-recreate-thinking-machines-276b-voice-demo-with-duct-tape-and-8b-model.jsonld"}}