{"slug": "whissle-gateway-run-multi-modal-voice-ai-locally-in-a-500mb-docker", "title": "Whissle Gateway – Run Multi-Modal Voice AI Locally in a 500MB Docker", "summary": "Whissle Gateway launches a multi-modal voice AI system that runs locally in a 500MB Docker container, offering batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent. The system includes built-in AI analysis modes for sales coaching and collections compliance, with models that extract metadata like emotion, behavior, and role in a single ASR forward pass.", "body_md": "What happens when you run it:\n\n```\n═══════════════════════════════════════════════\n  Whissle Gateway — en-full\n═══════════════════════════════════════════════\nNo GPU detected → using CPU\n\nShared models:\n  ✓ speaker encoder + VAD           26 MB\n  ✓ punctuation                    254 MB\n  ✓ ITN (English + Hinglish)       1.5 MB\n\nVariant: en-full\n  ✓ en-in-tech-misc (485 MB)\n  ✓ KenLM ENGLISH (1.5 GB)\n\nAuth:\n  Mode:    local\n  Token:   wh_a1b2c3d4e5f6... (admin)\n  Manage:  curl -H 'Authorization: Bearer ...' 
localhost:9000/auth/tokens\n\nStarting services...\n  PostgreSQL: :5432  ●\n  ASR:        :8001  ●\n  TTS:        :8003  ●\n  Agent:      :8765  ●\n  Pipecat:    :8000  ●\n  Gateway:    :9000  ●\n```\n\n## API\n\nFive interfaces: batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent.\n\nPOST localhost:8001/transcribe\n\n``` bash\n$ curl -X POST http://localhost:8001/transcribe \\\n    -F \"file=@call.mp3\" \\\n    -F \"diarize=true\" \\\n    -F \"num_speakers=2\" \\\n    -F \"punctuation=true\" \\\n    -F \"metadata_prob=true\" \\\n    -F \"summarize=sales_coaching\" \\\n    -o result.json\n```\n\nResponse: transcript, per-segment metadata, and AI analysis\n\n```\n{\n  \"segments\": [\n    {\n      \"speaker\":  \"SPEAKER_00\",\n      \"text\":     \"Hello, good morning.\",\n      \"start\":    1.0,  \"end\": 1.9,\n      \"metadata\": {\n        \"emotion\":  \"EMOTION_NEUTRAL\",\n        \"behavior\": \"BEHAVIOR_DIRECT\",\n        \"role\":     \"ROLE_INTERVIEWER\",\n        \"age\":      \"AGE_30_45\",\n        \"gender\":   \"GENDER_MALE\"\n      },\n      \"words\": [{\"word\": \"Hello\", \"start\": 1.0, \"end\": 1.3}]\n    }\n  ],\n  \"analysis\": {\n    \"overall_score\": 78,\n    \"buyer_outcome\": \"Converted\",\n    \"practices\":     { \"followed\": 6, \"total\": 8 },\n    \"highlights\":    [...]\n  }\n}\n```\n\n## Parameters\n\nAll parameters for `POST /transcribe`.\n\n| Parameter | Type | Default | Description |\n|---|---|---|---|\n| file | file | required | Audio file (MP3, WAV, FLAC, OGG, M4A) |\n| language | string | auto | Language hint: en, hi, zh |\n| diarize | bool | false | Speaker diarization |\n| num_speakers | int | auto | Exact speaker count (if known) |\n| punctuation | bool | true | Restore punctuation and capitalization |\n| itn | bool | true | Inverse text normalization (numbers, currency) |\n| use_lm | bool | true | KenLM language model beam search |\n| metadata_prob | bool | false | Probability 
distributions for metadata |\n| word_timestamps | bool | false | Per-word start/end timestamps |\n| speech_analysis | bool | false | Speech patterns (pace, fillers, fluency) |\n| summarize | string | — | AI analysis: true, sales_coaching, collections, or custom prompt |\n| hotwords | string | — | Comma-separated hotwords for boosting |\n\n## AI analysis modes\n\nAdd `-F \"summarize=mode\"` to any transcription. The diarized transcript and its metadata are sent to Claude or Gemini for analysis.\n\n`sales_coaching`\n\n### Sales Coaching\n\n8 best practices scored. Rep/buyer identification. Highlights with timestamps. Behavior labels per segment. Overall score 0–100.\n\n`collections`\n\n### Collections Compliance\n\nIdentity verification, reason stated, amount mentioned, no harassment. Call outcome (Promise to Pay / Dispute / Hardship). Next action.\n\n`true`\n\n### General Summary\n\nOverview, participants, key topics, emotional dynamics, entities, outcome. Markdown format.\n\n`your prompt here`\n\n### Custom Prompt\n\nPass any prompt string. The LLM receives your instructions plus the full transcript with per-segment metadata.\n\n## Models\n\nEach model extracts different metadata in a single ASR forward pass, with no separate models or API calls.\n\n### en-in-tech-misc\n\n485 MB · 120M params, 26 behavioral codes for coaching, therapy, interviews. 8 evaluation labels.\n\nEnglish · 6 heads, 51 classes\n\n### hinglish-loans\n\n479 MB · 115M params, debt collection intents (pay-back, disputes, hardship). Agent/Customer role detection.\n\nHindi-English · 5 heads, 26 classes\n\n### zh\n\n627 MB · 160M params, Mandarin with North/South dialect detection.\n\nMandarin · 3 heads, 12 classes\n\n### whissle-large\n\n2.4 GB · 600M params, inline action tokens. 31 intent groups, 18K vocabulary.\n\n23 languages · 5,500+ action tokens\n\n### Kokoro TTS\n\n82 MB · Non-autoregressive text-to-speech. Sub-200ms TTFB on CPU. 
Always included.\n\n10 languages · Baked in\n\n### Punctuation + ITN\n\n255 MB · Punctuation restoration and inverse text normalization.\n\nEN + Hinglish · Auto-downloaded\n\n## Metadata per segment\n\nEvery segment carries the common tags (emotion, age, gender), which appear on all models. Additional tags depend on the model.\n\n| Tag | Values | Models |\n|---|---|---|\n| emotion | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE | All |\n| age | AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60+ | All |\n| gender | GENDER_MALE, GENDER_FEMALE | All |\n| behavior | 26 types (BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION, BEHAVIOR_ACKNOWLEDGE, ...) | en-in-tech-misc |\n| eval | EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP | en-in-tech-misc |\n| role | ROLE_INTERVIEWER / ROLE_INTERVIEWEE or ROLE_AGENT / ROLE_CUSTOMER | en-in-tech-misc, hinglish-loans |\n| intent | 13 collections intents or 31 general intents (INTENT_GREETING, INTENT_QUESTION, ...) | hinglish-loans, whissle-large |\n| dialect | DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS | zh |\n\n## Variants\n\nChoose your variant based on language and quality needs. Switch by changing `VARIANT=` and restarting. Cached models are reused.\n\n| Variant | Languages | Download | Best for |\n|---|---|---|---|\n| `hinglish` | Hindi-English | ~515 MB | Debt collections, Hindi-English call centers |\n| `en-lite` | English | ~500 MB | Quick testing, development |\n| `en-full` ★ | English | ~2 GB | Sales coaching, interviews, therapy |\n| `multi-full` | 23 languages | ~4 GB | Multilingual, highest quality |\n| `multi-zh` | 23 langs + Mandarin | ~5 GB | Multilingual + dialect detection |\n| `all` | All | ~6 GB | Maximum flexibility |\n\n## Runs everywhere\n\nFrom your laptop (CPU) to data center GPUs. Same Docker, same API. 
Auto-detects GPU.\n\n| Hardware | VRAM | Variant | Concurrent |\n|---|---|---|---|\n| MacBook / Laptop | CPU | `Any` | 1–3 |\n| Mac Mini M4 Pro | 24 GB unified | `en-full` | 3–8 |\n| NVIDIA T4 | 16 GB | `en-lite` | 5–10 |\n| RTX 4090 | 24 GB | `en-full` | 20–50 |\n| A100 40GB | 40 GB | `multi-full` | 50–80 |\n| RTX 6000 Ada | 48 GB | `all` | 50–100 |\n| H100 | 80 GB | `all` | 150–300 |\n| DGX Spark | 128 GB unified | `all` | 30–60 |\n| H200 | 141 GB | `all` | 250–500 |\n\n| Docker Tag | Arch | Runtime |\n|---|---|---|\n| whissleasr/whissle-gateway:latest | amd64 | CPU — Mac (Rosetta), Linux, Windows |\n| whissleasr/whissle-gateway:gpu | amd64 | NVIDIA CUDA 12.4 + onnxruntime-gpu |\n\n## Architecture\n\n```\n┌──────────────────────────────────────────────────────────────┐\n│                     Docker Container                        │\n│                                                             │\n│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐   │\n│  │ ASR      │ │ TTS      │ │ Pipecat  │ │ Agent        │   │\n│  │ :8001    │ │ :8003    │ │ :8000    │ │ :8765        │   │\n│  │          │ │ Kokoro   │ │          │ │ Claude /     │   │\n│  │ ONNX     │ │ 82M      │ │ WebRTC   │ │ Gemini API   │   │\n│  │ +KenLM   │ │ 55 voice │ │ Twilio   │ │              │   │\n│  │ +ECAPA   │ │          │ │ Voice AI │ │ Summarize    │   │\n│  │ +VAD     │ │          │ │          │ │ Coach        │   │\n│  │ +Punct   │ │          │ │ Auth     │ │ Analyze      │   │\n│  │ +ITN     │ │          │ │ Multi-org│ │              │   │\n│  └──────────┘ └──────────┘ └──────────┘ └──────────────┘   │\n│                     │                                       │\n│              ┌──────────────┐                               │\n│              │ PostgreSQL   │                               │\n│              │ :5432        │                               │\n│              └──────────────┘                               │\n│                                                         
    │\n│  /models  (Docker volume — cached ASR models)               │\n│  /data    (Docker volume — PostgreSQL, auth, conversations) │\n└──────────────────────────────────────────────────────────────┘\n```\n\n### whissle-models volume\n\nASR models, KenLM, punctuation, ITN. Downloaded on first run, cached forever. Survives container restarts.\n\n### whissle-data volume\n\nConversations, analytics, agent configs, auth tokens. Persists across restarts. Only deleted by `docker volume rm`.\n\n## Get started\n\nOne command. Models download automatically. Ready in 2 minutes.\n\nBuilt for contact centers, sales intelligence, behavioral AI, and more.", "url": "https://wpnews.pro/news/whissle-gateway-run-multi-modal-voice-ai-locally-in-a-500mb-docker", "canonical_source": "https://whissle.ai/gateway", "published_at": "2026-06-13 08:16:09+00:00", "updated_at": "2026-06-13 08:20:08.244296+00:00", "lang": "en", "topics": ["artificial-intelligence", "natural-language-processing", "ai-products", "ai-tools", "developer-tools"], "entities": ["Whissle Gateway", "Claude", "Gemini", "PostgreSQL"], "alternates": {"html": "https://wpnews.pro/news/whissle-gateway-run-multi-modal-voice-ai-locally-in-a-500mb-docker", "markdown": "https://wpnews.pro/news/whissle-gateway-run-multi-modal-voice-ai-locally-in-a-500mb-docker.md", "text": "https://wpnews.pro/news/whissle-gateway-run-multi-modal-voice-ai-locally-in-a-500mb-docker.txt", "jsonld": "https://wpnews.pro/news/whissle-gateway-run-multi-modal-voice-ai-locally-in-a-500mb-docker.jsonld"}}