Whissle Gateway – Run Multi-Modal Voice AI Locally in a 500MB Docker

Whissle Gateway is a multi-modal voice AI system that runs locally in a 500 MB Docker container. It offers batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent. Built-in AI analysis modes cover sales coaching and collections compliance, and the models extract metadata such as emotion, behavior, and role in a single ASR forward pass.

What happens when you run it:

    ═══════════════════════════════════════════════
      Whissle Gateway — en-full
    ═══════════════════════════════════════════════
    No GPU detected → using CPU

    Shared models:
      ✓ speaker encoder + VAD     26 MB
      ✓ punctuation               254 MB
      ✓ ITN English + Hinglish    1.5 MB
    Variant: en-full
      ✓ en-in-tech-misc           485 MB
      ✓ KenLM ENGLISH             1.5 GB
    Auth:
      Mode:   local
      Token:  wh_a1b2c3d4e5f6... admin
      Manage: curl -H 'Authorization: Bearer ...' localhost:9000/auth/tokens
    Starting services...
      PostgreSQL: :5432 ●
      ASR:        :8001 ●
      TTS:        :8003 ●
      Agent:      :8765 ●
      Pipecat:    :8000 ●
      Gateway:    :9000 ●

API

Five interfaces: batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent.

POST localhost:8001/transcribe

    curl -X POST http://localhost:8001/transcribe \
      -F "file=@call.mp3" \
      -F "diarize=true" \
      -F "num_speakers=2" \
      -F "punctuation=true" \
      -F "metadata_prob=true" \
      -F "summarize=sales_coaching" \
      -o result.json

Response: transcript + metadata per segment + AI analysis

    {
      "segments": [
        {
          "speaker": "SPEAKER_00",
          "text": "Hello, good morning.",
          "start": 1.0,
          "end": 1.9,
          "metadata": {
            "emotion": "EMOTION_NEUTRAL",
            "behavior": "BEHAVIOR_DIRECT",
            "role": "ROLE_INTERVIEWER",
            "age": "AGE_30_45",
            "gender": "GENDER_MALE"
          },
          "words": [{"word": "Hello", "start": 1.0, "end": 1.3}]
        }
      ],
      "analysis": {
        "overall_score": 78,
        "buyer_outcome": "Converted",
        "practices": { "followed": 6, "total": 8 },
        "highlights": [...]
      }
    }

Parameters

All parameters for POST /transcribe.

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file (MP3, WAV, FLAC, OGG, M4A) |
| language | string | auto | Language hint: en, hi, zh |
| diarize | bool | false | Speaker diarization |
| num_speakers | int | auto | Exact speaker count if known |
| punctuation | bool | true | Restore punctuation and capitalization |
| itn | bool | true | Inverse text normalization (numbers, currency) |
| use_lm | bool | true | KenLM language model beam search |
| metadata_prob | bool | false | Probability distributions for metadata |
| word_timestamps | bool | false | Per-word start/end timestamps |
| speech_analysis | bool | false | Speech patterns (pace, fillers, fluency) |
| summarize | string | — | AI analysis: true, sales_coaching, collections, or custom prompt |
| hotwords | string | — | Comma-separated hotwords for boosting |

AI analysis modes

Add -F "summarize=mode" to any transcription. The diarized transcript and its metadata are sent to Claude or Gemini for analysis.

| summarize value | Mode | Output |
|---|---|---|
| sales_coaching | Sales Coaching | 8 best practices scored. Rep/buyer identification. Highlights with timestamps. Behavior labels per segment. Overall score 0–100. |
| collections | Collections Compliance | Identity verification, reason stated, amount mentioned, no harassment. Call outcome (Promise to Pay / Dispute / Hardship). Next action. |
| true | General Summary | Overview, participants, key topics, emotional dynamics, entities, outcome. Markdown format. |
| your prompt here | Custom Prompt | Pass any prompt string. The LLM receives your instructions plus the full transcript with per-segment metadata. |

Models

Each model extracts different metadata in a single ASR forward pass, with no separate models or API calls.

| Model | Size | Params | Description | Coverage |
|---|---|---|---|---|
| en-in-tech-misc | 485 MB | 120M | 26 behavioral codes for coaching, therapy, interviews. 8 evaluation labels. | English · 6 heads, 51 classes |
| hinglish-loans | 479 MB | 115M | Debt collection intents (pay-back, disputes, hardship). Agent/Customer role detection. | Hindi-English · 5 heads, 26 classes |
| zh | 627 MB | 160M | Mandarin with North/South dialect detection. | Mandarin · 3 heads, 12 classes |
| whissle-large | 2.4 GB | 600M | Inline action tokens. 31 intent groups, 18K vocabulary. | 23 languages · 5,500+ action tokens |
| Kokoro TTS | 82 MB | — | Non-autoregressive text-to-speech. Sub-200ms TTFB on CPU. Always included. | 10 languages · Baked in |
| Punctuation + ITN | 255 MB | — | Punctuation restoration and inverse text normalization. | EN + Hinglish · Auto-downloaded |

Metadata per segment

Every segment includes these tags. Common tags appear on all models. Additional tags depend on the model.

| Tag | Values | Models |
|---|---|---|
| emotion | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE | All |
| age | AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60+ | All |
| gender | GENDER_MALE, GENDER_FEMALE | All |
| behavior | 26 types (BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION, BEHAVIOR_ACKNOWLEDGE, ...) | en-in-tech-misc |
| eval | EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP | en-in-tech-misc |
| role | ROLE_INTERVIEWER / ROLE_INTERVIEWEE or ROLE_AGENT / ROLE_CUSTOMER | en-in-tech-misc, hinglish-loans |
| intent | 13 collections intents or 31 general intents (INTENT_GREETING, INTENT_QUESTION, ...) | hinglish-loans, whissle-large |
| dialect | DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS | zh |

Variants

Choose your variant based on language and quality needs. Switch by changing VARIANT= and restarting. Cached models are reused.

| Variant | Languages | Download | Best for |
|---|---|---|---|
| hinglish | Hindi-English | ~515 MB | Debt collections, Hindi-English call centers |
| en-lite | English | ~500 MB | Quick testing, development |
| en-full ★ | English | ~2 GB | Sales coaching, interviews, therapy |
| multi-full | 23 languages | ~4 GB | Multilingual, highest quality |
| multi-zh | 23 langs + Mandarin | ~5 GB | Multilingual + dialect detection |
| all | All | ~6 GB | Maximum flexibility |

Runs everywhere

From your laptop CPU to data center GPUs. Same Docker, same API. Auto-detects GPU.

| Hardware | VRAM | Variant | Concurrent |
|---|---|---|---|
| MacBook / Laptop | CPU | Any | 1–3 |
| Mac Mini M4 Pro | 24 GB unified | en-full | 3–8 |
| NVIDIA T4 | 16 GB | en-lite | 5–10 |
| RTX 4090 | 24 GB | en-full | 20–50 |
| A100 40GB | 40 GB | multi-full | 50–80 |
| RTX 6000 Ada | 48 GB | all | 50–100 |
| H100 | 80 GB | all | 150–300 |
| DGX Spark | 128 GB unified | all | 30–60 |
| H200 | 141 GB | all | 250–500 |

| Docker Tag | Arch | Runtime |
|---|---|---|
| whissleasr/whissle-gateway:latest | amd64 | CPU: Mac (Rosetta), Linux, Windows |
| whissleasr/whissle-gateway:gpu | amd64 | NVIDIA CUDA 12.4 + onnxruntime-gpu |

Architecture

    ┌──────────────────────────────────────────────────────────────┐
    │                       Docker Container                       │
    │                                                              │
    │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐  │
    │  │   ASR    │  │   TTS    │  │ Pipecat  │  │    Agent     │  │
    │  │  :8001   │  │  :8003   │  │  :8000   │  │    :8765     │  │
    │  │          │  │  Kokoro  │  │          │  │  Claude /    │  │
    │  │  ONNX    │  │  82M     │  │  WebRTC  │  │  Gemini API  │  │
    │  │  +KenLM  │  │ 55 voice │  │  Twilio  │  │              │  │
    │  │  +ECAPA  │  │          │  │ Voice AI │  │  Summarize   │  │
    │  │  +VAD    │  │          │  │          │  │  Coach       │  │
    │  │  +Punct  │  │          │  │  Auth    │  │  Analyze     │  │
    │  │  +ITN    │  │          │  │ Multi-org│  │              │  │
    │  └──────────┘  └──────────┘  └──────────┘  └──────────────┘  │
    │                                                              │
    │                       ┌──────────────┐                       │
    │                       │  PostgreSQL  │                       │
    │                       │    :5432     │                       │
    │                       └──────────────┘                       │
    │                                                              │
    │  /models  Docker volume — cached ASR models                  │
    │  /data    Docker volume — PostgreSQL, auth, conversations    │
    └──────────────────────────────────────────────────────────────┘

whissle-models volume: ASR models, KenLM, punctuation, ITN. Downloaded on first run, then cached. Survives container restarts.

whissle-data volume: Conversations, analytics, agent configs, auth tokens. Persists across restarts. Only deleted by docker volume rm.

Get started

One command. Models download automatically. Ready in 2 minutes.

Built for contact centers, sales intelligence, behavioral AI, and more.
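
To illustrate how the per-segment metadata might be consumed, here is a minimal Python sketch that totals talk time and counts emotion tags per speaker from a /transcribe response. It assumes the response shape from the example in the API section, with underscore-separated values such as SPEAKER_00 and EMOTION_NEUTRAL. The second segment in the sample, the summarize_segments helper, and its output keys are hypothetical and are not part of Whissle Gateway.

```python
import json
from collections import Counter, defaultdict


def summarize_segments(response: dict) -> dict:
    """Aggregate talk time (seconds) and emotion counts per speaker."""
    talk_time = defaultdict(float)
    emotions = defaultdict(Counter)
    for seg in response.get("segments", []):
        speaker = seg.get("speaker", "UNKNOWN")
        talk_time[speaker] += seg["end"] - seg["start"]
        emotion = seg.get("metadata", {}).get("emotion")
        if emotion:
            emotions[speaker][emotion] += 1
    return {
        spk: {
            "talk_seconds": round(talk_time[spk], 2),
            "emotions": dict(emotions[spk]),
        }
        for spk in talk_time
    }


# Abbreviated sample: the first segment follows the documented example,
# the second is made up for illustration only.
sample = json.loads("""
{
  "segments": [
    {"speaker": "SPEAKER_00", "text": "Hello, good morning.",
     "start": 1.0, "end": 1.9,
     "metadata": {"emotion": "EMOTION_NEUTRAL"}},
    {"speaker": "SPEAKER_01", "text": "Good morning.",
     "start": 2.0, "end": 2.5,
     "metadata": {"emotion": "EMOTION_HAPPY"}}
  ],
  "analysis": {"overall_score": 78}
}
""")

report = summarize_segments(sample)
for speaker, stats in report.items():
    print(speaker, stats)
```

In practice the sample would be replaced by the contents of result.json written by the curl command; segments without a metadata block are still counted toward talk time.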