Source: whissle.ai

Whissle Gateway – Run Multi-Modal Voice AI Locally in a 500MB Docker

Whissle Gateway is a multi-modal voice AI system that runs locally in a 500 MB Docker container. It exposes five interfaces: batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent. It also includes built-in AI analysis modes for sales coaching and collections compliance, and its models extract metadata such as emotion, behavior, and role in a single ASR forward pass.

6 min read · Published Jun 13, 2026

What happens when you run it:

═══════════════════════════════════════════════
  Whissle Gateway — en-full
═══════════════════════════════════════════════
No GPU detected → using CPU

Shared models:
  ✓ speaker encoder + VAD           26 MB
  ✓ punctuation                    254 MB
  ✓ ITN (English + Hinglish)       1.5 MB

Variant: en-full
  ✓ en-in-tech-misc (485 MB)
  ✓ KenLM ENGLISH (1.5 GB)

Auth:
  Mode:    local
  Token:   wh_a1b2c3d4e5f6... (admin)
  Manage:  curl -H 'Authorization: Bearer ...' localhost:9000/auth/tokens

Starting services...
  PostgreSQL: :5432  ●
  ASR:        :8001  ●
  TTS:        :8003  ●
  Agent:      :8765  ●
  Pipecat:    :8000  ●
  Gateway:    :9000  ●
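
Once the services report as started, a quick way to confirm they are reachable is to attempt a TCP connection to each port. The following is a minimal sketch, not part of Whissle Gateway: the port numbers are copied from the log above, while the host name `localhost` and the helper names `port_is_open` and `check_services` are assumptions made for illustration.

```python
import socket
from typing import Optional

# Ports copied from the startup log above. The host is an assumption.
SERVICES = {
    "PostgreSQL": 5432,
    "ASR": 8001,
    "TTS": 8003,
    "Agent": 8765,
    "Pipecat": 8000,
    "Gateway": 9000,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost",
                   services: Optional[dict[str, int]] = None) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    services = SERVICES if services is None else services
    return {name: port_is_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:<11} {'up' if up else 'down'}")
```

An open port only shows that something is listening. It does not prove the service behind it is healthy.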

API #

Five interfaces — batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent.

POST localhost:8001/transcribe

$ curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3" \
    -F "diarize=true" \
    -F "num_speakers=2" \
    -F "punctuation=true" \
    -F "word_timestamps=true" \
    -F "metadata_prob=true" \
    -F "summarize=sales_coaching" \
    -o result.json

Response — transcript + metadata per segment + AI analysis

{
  "segments": [
    {
      "speaker":  "SPEAKER_00",
      "text":     "Hello, good morning.",
      "start":    1.0,  "end": 1.9,
      "metadata": {
        "emotion":  "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role":     "ROLE_INTERVIEWER",
        "age":      "AGE_30_45",
        "gender":   "GENDER_MALE"
      },
      "words": [{"word": "Hello", "start": 1.0, "end": 1.3}]
    }
  ],
  "analysis": {
    "overall_score": 78,
    "buyer_outcome": "Converted",
    "practices":     { "followed": 6, "total": 8 },
    "highlights":    [...]
  }
}
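
To show how a client might consume this response, here is a small Python sketch that totals talk time and counts emotion labels per speaker. It assumes the response has exactly the shape shown above. The function names are illustrative, and the fallback speaker label `UNKNOWN` for segments without a `speaker` field is an assumption, not documented behavior.

```python
import json
from collections import Counter, defaultdict

# Sample copied from the response shown above. The "highlights" list is
# elided in the source, so it is left empty here.
SAMPLE = """
{
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0, "end": 1.9,
      "metadata": {
        "emotion": "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role": "ROLE_INTERVIEWER",
        "age": "AGE_30_45",
        "gender": "GENDER_MALE"
      },
      "words": [{"word": "Hello", "start": 1.0, "end": 1.3}]
    }
  ],
  "analysis": {
    "overall_score": 78,
    "buyer_outcome": "Converted",
    "practices": {"followed": 6, "total": 8},
    "highlights": []
  }
}
"""


def talk_time(result: dict) -> dict[str, float]:
    """Total seconds of speech per speaker, summed over segments."""
    totals: dict[str, float] = defaultdict(float)
    for seg in result.get("segments", []):
        # "UNKNOWN" is an assumed fallback for segments without a speaker.
        totals[seg.get("speaker", "UNKNOWN")] += seg["end"] - seg["start"]
    return dict(totals)


def emotions_by_speaker(result: dict) -> dict[str, Counter]:
    """Count how often each emotion label appears for each speaker."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for seg in result.get("segments", []):
        emotion = seg.get("metadata", {}).get("emotion")
        if emotion:
            counts[seg.get("speaker", "UNKNOWN")][emotion] += 1
    return dict(counts)


if __name__ == "__main__":
    result = json.loads(SAMPLE)
    print(talk_time(result))
    print(emotions_by_speaker(result))
    practices = result["analysis"]["practices"]
    print(f"practices followed: {practices['followed']}/{practices['total']}")
```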

Parameters #

All parameters for POST /transcribe.

Parameter        Type    Default   Description
file             file    required  Audio file (MP3, WAV, FLAC, OGG, M4A)
language         string  auto      Language hint: en, hi, zh
diarize          bool    false     Speaker diarization
num_speakers     int     auto      Exact speaker count (if known)
punctuation      bool    true      Restore punctuation and capitalization
itn              bool    true      Inverse text normalization (numbers, currency)
use_lm           bool    true      KenLM language model beam search
metadata_prob    bool    false     Probability distributions for metadata
word_timestamps  bool    false     Per-word start/end timestamps
speech_analysis  bool    false     Speech patterns (pace, fillers, fluency)
summarize        string  -         AI analysis: true, sales_coaching, collections, or custom prompt
hotwords         string  -         Comma-separated hotwords for boosting
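
When calling the endpoint from code rather than curl, the non-file parameters need to be turned into form-field strings. The sketch below is one possible way to do that in Python. It assumes the server accepts booleans as lowercase `true`/`false`, as the curl example does. The helper name `to_form_fields` is hypothetical, and dropping `None` values so that server defaults apply is a design choice of this sketch.

```python
# Non-file parameter names copied from the table above.
KNOWN_PARAMS = {
    "language", "diarize", "num_speakers", "punctuation", "itn", "use_lm",
    "metadata_prob", "word_timestamps", "speech_analysis", "summarize",
    "hotwords",
}


def to_form_fields(**params: object) -> dict[str, str]:
    """Convert Python values into form-field strings for POST /transcribe."""
    unknown = sorted(set(params) - KNOWN_PARAMS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {', '.join(unknown)}")

    fields: dict[str, str] = {}
    for name, value in params.items():
        if value is None:
            continue  # omit the field so the server default applies
        if isinstance(value, bool):
            # bool is checked before other types because bool is an int subclass
            fields[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # hotwords are documented as a comma-separated string
            fields[name] = ",".join(str(v) for v in value)
        else:
            fields[name] = str(value)
    return fields


if __name__ == "__main__":
    print(to_form_fields(diarize=True, num_speakers=2,
                         hotwords=["Whissle", "KenLM"]))
```

The resulting dictionary can be passed as the form data of any HTTP client, with the audio file attached separately under the `file` field.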

AI analysis modes #

Add -F "summarize=mode" to any transcription request, where mode is one of the values below. The diarized transcript and its metadata are sent to Claude or Gemini for analysis.

sales_coaching

Sales Coaching

8 best practices scored. Rep/buyer identification. Highlights with timestamps. Behavior labels per segment. Overall score 0–100.

collections

Collections Compliance

Identity verification, reason stated, amount mentioned, no harassment. Call outcome (Promise to Pay / Dispute / Hardship). Next action.

true

General Summary

Overview, participants, key topics, emotional dynamics, entities, outcome. Markdown format.

your prompt here

Custom Prompt

Pass any prompt string. The LLM receives your instructions + full transcript with per-segment metadata.
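
The four cases above amount to a simple rule: three reserved values select a built-in mode, and any other string is treated as a custom prompt. A minimal Python sketch of that rule follows. It assumes exact, case-sensitive matching, which the source does not specify, and `describe_summarize` is a hypothetical helper name.

```python
# Reserved values and their mode names, copied from the list above.
BUILT_IN_MODES = {
    "sales_coaching": "Sales Coaching",
    "collections": "Collections Compliance",
    "true": "General Summary",
}


def describe_summarize(value: str) -> str:
    """Return the analysis mode a given summarize value would select."""
    if not value:
        raise ValueError("summarize value must be a non-empty string")
    # Assumption: matching is exact. Anything else counts as a custom prompt.
    return BUILT_IN_MODES.get(value, "Custom Prompt")


if __name__ == "__main__":
    for v in ("sales_coaching", "true", "List the action items agreed on the call."):
        print(f"{v!r} selects {describe_summarize(v)}")
```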

Models #

Each model extracts different metadata in a single ASR forward pass — no separate models or API calls.

en-in-tech-misc

485 MB · 120M params. 26 behavioral codes for coaching, therapy, interviews. 8 evaluation labels.

English · 6 heads, 51 classes

hinglish-loans

479 MB · 115M params. Debt collection intents: pay-back, disputes, hardship. Agent/Customer role detection.

Hindi-English · 5 heads, 26 classes

zh

627 MB · 160M params. Mandarin with North/South dialect detection.

Mandarin · 3 heads, 12 classes

whissle-large

2.4 GB · 600M params. Inline action tokens. 31 intent groups, 18K vocabulary.

23 languages · 5,500+ action tokens

Kokoro TTS

82 MB · Non-autoregressive text-to-speech. Sub-200ms TTFB on CPU. Always included.

10 languages · Baked in

Punctuation + ITN

255 MB · Punctuation restoration and inverse text normalization.

EN + Hinglish · Auto-downloaded

Metadata per segment #

Every segment includes these tags. Common tags appear on all models. Additional tags depend on the model.

Tag       Models                           Values
emotion   All                              EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE
age       All                              AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60+
gender    All                              GENDER_MALE, GENDER_FEMALE
behavior  en-in-tech-misc                  26 types (BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION, BEHAVIOR_ACKNOWLEDGE, ...)
eval      en-in-tech-misc                  EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP
role      en-in-tech-misc, hinglish-loans  ROLE_INTERVIEWER / ROLE_INTERVIEWEE or ROLE_AGENT / ROLE_CUSTOMER
intent    hinglish-loans, whissle-large    13 collections intents or 31 general intents (INTENT_GREETING, INTENT_QUESTION, ...)
dialect   zh                               DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS
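
Because every tag value carries its category as a prefix (EMOTION_, AGE_, ROLE_, and so on), client code can split values mechanically and check which tags to expect from a given model. The Python sketch below illustrates this. The tag-to-model mapping is transcribed from the table above, and the helper names `split_tag` and `unexpected_tags` are assumptions made for illustration.

```python
COMMON_TAGS = {"emotion", "age", "gender"}

# Which tags each model emits, transcribed from the table above.
TAGS_BY_MODEL = {
    "en-in-tech-misc": COMMON_TAGS | {"behavior", "eval", "role"},
    "hinglish-loans": COMMON_TAGS | {"role", "intent"},
    "zh": COMMON_TAGS | {"dialect"},
    "whissle-large": COMMON_TAGS | {"intent"},
}


def split_tag(value: str) -> tuple[str, str]:
    """Split a value such as 'AGE_30_45' into ('age', '30_45')."""
    prefix, sep, rest = value.partition("_")
    if not sep or not prefix or not rest:
        raise ValueError(f"not a PREFIX_VALUE tag: {value!r}")
    return prefix.lower(), rest


def unexpected_tags(model: str, metadata: dict[str, str]) -> list[str]:
    """Return metadata keys that the table does not list for this model."""
    expected = TAGS_BY_MODEL[model]
    return sorted(key for key in metadata if key not in expected)


if __name__ == "__main__":
    print(split_tag("EMOTION_NEUTRAL"))
    print(unexpected_tags("zh", {"emotion": "EMOTION_HAPPY",
                                 "behavior": "BEHAVIOR_EXPLAIN"}))
```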

Variants #

Choose a variant based on language and quality needs. Switch by changing VARIANT= and restarting. Cached models are reused.

Variant     Languages            Download  Best for
hinglish    Hindi-English        ~515 MB   Debt collections, Hindi-English call centers
en-lite     English              ~500 MB   Quick testing, development
en-full     English              ~2 GB     Sales coaching, interviews, therapy
multi-full  23 languages         ~4 GB     Multilingual, highest quality
multi-zh    23 langs + Mandarin  ~5 GB     Multilingual + dialect detection
all         All                  ~6 GB     Maximum flexibility
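
A deployment script might validate the chosen variant name and check it against a download budget before starting the container. The following Python sketch shows one way to do so. The sizes are the approximate figures from the table above, and the helper names `validate_variant` and `variants_within` are hypothetical.

```python
# Approximate download sizes in GB, copied from the table above.
VARIANT_DOWNLOAD_GB = {
    "hinglish": 0.515,
    "en-lite": 0.5,
    "en-full": 2.0,
    "multi-full": 4.0,
    "multi-zh": 5.0,
    "all": 6.0,
}


def validate_variant(name: str) -> str:
    """Return the name unchanged if it is a listed variant, else raise."""
    if name not in VARIANT_DOWNLOAD_GB:
        known = ", ".join(VARIANT_DOWNLOAD_GB)
        raise ValueError(f"unknown variant {name!r}; expected one of: {known}")
    return name


def variants_within(budget_gb: float) -> list[str]:
    """List variants whose approximate download fits within the budget."""
    return [name for name, size in VARIANT_DOWNLOAD_GB.items()
            if size <= budget_gb]


if __name__ == "__main__":
    print(variants_within(1.0))
    print(validate_variant("en-full"))
```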

Runs everywhere #

From your laptop (CPU) to data center GPUs. Same Docker, same API. Auto-detects GPU.

Hardware          VRAM            Variant     Concurrent
MacBook / Laptop  CPU             Any         1–3
Mac Mini M4 Pro   24 GB unified   en-full     3–8
NVIDIA T4         16 GB           en-lite     5–10
RTX 4090          24 GB           en-full     20–50
A100 40GB         40 GB           multi-full  50–80
RTX 6000 Ada      48 GB           all         50–100
H100              80 GB           all         150–300
DGX Spark         128 GB unified  all         30–60
H200              141 GB          all         250–500

Docker Tag                         Arch   Runtime
whissleasr/whissle-gateway:latest  amd64  CPU: Mac (Rosetta), Linux, Windows
whissleasr/whissle-gateway:gpu     amd64  NVIDIA CUDA 12.4 + onnxruntime-gpu

Architecture #

┌──────────────────────────────────────────────────────────────┐
│                     Docker Container                        │
│                                                             │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐   │
│  │ ASR      │ │ TTS      │ │ Pipecat  │ │ Agent        │   │
│  │ :8001    │ │ :8003    │ │ :8000    │ │ :8765        │   │
│  │          │ │ Kokoro   │ │          │ │ Claude /     │   │
│  │ ONNX     │ │ 82M      │ │ WebRTC   │ │ Gemini API   │   │
│  │ +KenLM   │ │ 55 voice │ │ Twilio   │ │              │   │
│  │ +ECAPA   │ │          │ │ Voice AI │ │ Summarize    │   │
│  │ +VAD     │ │          │ │          │ │ Coach        │   │
│  │ +Punct   │ │          │ │ Auth     │ │ Analyze      │   │
│  │ +ITN     │ │          │ │ Multi-org│ │              │   │
│  └──────────┘ └──────────┘ └──────────┘ └──────────────┘   │
│                     │                                       │
│              ┌──────────────┐                               │
│              │ PostgreSQL   │                               │
│              │ :5432        │                               │
│              └──────────────┘                               │
│                                                             │
│  /models  (Docker volume — cached ASR models)               │
│  /data    (Docker volume — PostgreSQL, auth, conversations) │
└──────────────────────────────────────────────────────────────┘

whissle-models volume

ASR models, KenLM, punctuation, ITN. Downloaded on first run, cached forever. Survives container restarts.

whissle-data volume

Conversations, analytics, agent configs, auth tokens. Persists across restarts. Only deleted by docker volume rm.

Get started #

One command. Models download automatically. Ready in 2 minutes.

Built for contact centers, sales intelligence, behavioral AI, and more.
