What happens when you run it:
═══════════════════════════════════════════════
Whissle Gateway — en-full
═══════════════════════════════════════════════
No GPU detected → using CPU
Shared models:
✓ speaker encoder + VAD 26 MB
✓ punctuation 254 MB
✓ ITN (English + Hinglish) 1.5 MB
Variant: en-full
✓ en-in-tech-misc (485 MB)
✓ KenLM ENGLISH (1.5 GB)
Auth:
Mode: local
Token: wh_a1b2c3d4e5f6... (admin)
Manage: curl -H 'Authorization: Bearer ...' localhost:9000/auth/tokens
Starting services...
PostgreSQL: :5432 ●
ASR: :8001 ●
TTS: :8003 ●
Agent: :8765 ●
Pipecat: :8000 ●
Gateway: :9000 ●
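The startup log lists one port per service. A minimal sketch for checking from the host that those ports accept TCP connections, assuming the container publishes them on `localhost` (the port mapping is not shown here, and `port_is_open` and `service_status` are hypothetical helper names):

```python
import socket

# Ports copied from the startup log above; the host name is an assumption.
SERVICES = {
    "PostgreSQL": 5432,
    "ASR": 8001,
    "TTS": 8003,
    "Agent": 8765,
    "Pipecat": 8000,
    "Gateway": 9000,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, and timed-out connections all count as "down".
        return False


def service_status(host: str = "localhost") -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in SERVICES.items()}
```

Calling `service_status()` returns a dict such as `{"ASR": True, ...}`. An open port only shows that something is listening; it does not confirm that models have finished loading.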
API
Five interfaces — batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent.
POST localhost:8001/transcribe
$ curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3" \
    -F "diarize=true" \
    -F "num_speakers=2" \
    -F "punctuation=true" \
    -F "metadata_prob=true" \
    -F "summarize=sales_coaching" \
    -o result.json
Response — transcript + metadata per segment + AI analysis
{
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0, "end": 1.9,
      "metadata": {
        "emotion": "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role": "ROLE_INTERVIEWER",
        "age": "AGE_30_45",
        "gender": "GENDER_MALE"
      },
      "words": [{"word": "Hello", "start": 1.0, "end": 1.3}]
    }
  ],
  "analysis": {
    "overall_score": 78,
    "buyer_outcome": "Converted",
    "practices": { "followed": 6, "total": 8 },
    "highlights": [...]
  }
}
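As an illustration of consuming this response, the sketch below totals talk time per speaker and renders one readable line per segment. It uses a trimmed copy of the sample above and assumes `start` and `end` are in seconds, which the sample suggests but does not state. The function names are hypothetical.

```python
import json
from collections import defaultdict

# Trimmed copy of the sample response shown above.
SAMPLE = json.loads("""
{
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0, "end": 1.9,
      "metadata": {"emotion": "EMOTION_NEUTRAL", "behavior": "BEHAVIOR_DIRECT"}
    }
  ],
  "analysis": {"overall_score": 78, "practices": {"followed": 6, "total": 8}}
}
""")


def talk_time_by_speaker(response: dict) -> dict:
    """Sum segment durations (end - start) per speaker label."""
    totals = defaultdict(float)
    for seg in response.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    # Round to milliseconds to hide floating-point noise.
    return {speaker: round(total, 3) for speaker, total in totals.items()}


def transcript_lines(response: dict) -> list:
    """Render 'SPEAKER [start-end] (emotion): text' for each segment."""
    lines = []
    for seg in response.get("segments", []):
        emotion = seg.get("metadata", {}).get("emotion", "?")
        lines.append(
            f'{seg["speaker"]} [{seg["start"]:.1f}-{seg["end"]:.1f}] ({emotion}): {seg["text"]}'
        )
    return lines
```

With a real `result.json`, replace `SAMPLE` with `json.load(open("result.json"))`.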
Parameters
All parameters for POST /transcribe.
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file (MP3, WAV, FLAC, OGG, M4A) |
| language | string | auto | Language hint: en, hi, zh |
| diarize | bool | false | Speaker diarization |
| num_speakers | int | auto | Exact speaker count (if known) |
| punctuation | bool | true | Restore punctuation and capitalization |
| itn | bool | true | Inverse text normalization (numbers, currency) |
| use_lm | bool | true | KenLM language model beam search |
| metadata_prob | bool | false | Probability distributions for metadata |
| word_timestamps | bool | false | Per-word start/end timestamps |
| speech_analysis | bool | false | Speech patterns (pace, fillers, fluency) |
| summarize | string | — | AI analysis: true, sales_coaching, collections, or custom prompt |
| hotwords | string | — | Comma-separated hotwords for boosting |
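To show how the table maps onto a request, here is a sketch that builds the same kind of `curl` invocation from Python keyword arguments. The parameter names come from the table; `build_curl_args`, `form_value`, and `transcribe` are hypothetical helpers, and the default URL assumes the ASR service is reachable on `localhost:8001` as in the example above.

```python
import subprocess

# Parameter names taken from the table above (`file` is handled separately).
KNOWN_PARAMS = {
    "language", "diarize", "num_speakers", "punctuation", "itn", "use_lm",
    "metadata_prob", "word_timestamps", "speech_analysis", "summarize", "hotwords",
}


def form_value(value) -> str:
    """Render a Python value as a form field; booleans become 'true'/'false'."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_curl_args(audio_path: str,
                    url: str = "http://localhost:8001/transcribe",
                    **params) -> list:
    """Build the curl argument list for a multipart POST to /transcribe."""
    unknown = set(params) - KNOWN_PARAMS
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    args = ["curl", "-sS", "-X", "POST", url, "-F", f"file=@{audio_path}"]
    for name, value in params.items():
        args += ["-F", f"{name}={form_value(value)}"]
    return args


def transcribe(audio_path: str, **params) -> str:
    """Run curl and return the raw response body (needs a running gateway)."""
    result = subprocess.run(build_curl_args(audio_path, **params),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `build_curl_args("call.mp3", diarize=True, num_speakers=2)` yields the `-F "diarize=true"` and `-F "num_speakers=2"` fields used earlier. For free-text custom prompts, curl's `--form-string` option sends the value literally and may be the safer choice.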
AI analysis modes
Add -F "summarize=mode" to any transcription. The diarized transcript and its metadata are sent to Claude or Gemini for analysis.
sales_coaching
Sales Coaching
8 best practices scored. Rep/buyer identification. Highlights with timestamps. Behavior labels per segment. Overall score 0–100.
collections
Collections Compliance
Identity verification, reason stated, amount mentioned, no harassment. Call outcome (Promise to Pay / Dispute / Hardship). Next action.
true
General Summary
Overview, participants, key topics, emotional dynamics, entities, outcome. Markdown format.
your prompt here
Custom Prompt
Pass any prompt string. The LLM receives your instructions + full transcript with per-segment metadata.
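The four options above all travel in the same `summarize` field. A small sketch of how a client might tell them apart and read the sales-coaching score block from the sample response; exact string matching is an assumption (the server's handling of case or whitespace is not documented here), and both function names are hypothetical.

```python
# Built-in values for the `summarize` field, as listed above.
BUILTIN_MODES = {
    "sales_coaching": "Sales Coaching",
    "collections": "Collections Compliance",
    "true": "General Summary",
}


def describe_summarize(value: str) -> str:
    """Classify a `summarize` value as a built-in mode or a custom prompt."""
    if not value.strip():
        raise ValueError("summarize value must not be empty")
    # Anything that is not an exact built-in name is treated as a prompt.
    return BUILTIN_MODES.get(value, "Custom Prompt")


def practice_rate(analysis: dict) -> float:
    """Fraction of practices followed, using the `practices` block
    shown in the sample sales_coaching response."""
    practices = analysis["practices"]
    return practices["followed"] / practices["total"]
```

The `practices` block appears only in the sales-coaching sample; the fields returned by the other modes are not shown in this document, so `practice_rate` should not be assumed to work for them.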
Models
Each model extracts different metadata in a single ASR forward pass — no separate models or API calls.
en-in-tech-misc
485 MB · 120M params. 26 behavioral codes for coaching, therapy, interviews. 8 evaluation labels.
English · 6 heads, 51 classes
hinglish-loans
479 MB · 115M params. Debt collection intents: pay-back, disputes, hardship. Agent/Customer role detection.
Hindi-English · 5 heads, 26 classes
zh
627 MB · 160M params. Mandarin with North/South dialect detection.
Mandarin · 3 heads, 12 classes
whissle-large
2.4 GB · 600M params. Inline action tokens. 31 intent groups, 18K vocabulary.
23 languages · 5,500+ action tokens
Kokoro TTS
82 MB. Non-autoregressive text-to-speech. Sub-200 ms TTFB on CPU. Always included.
10 languages · Baked in
Punctuation + ITN
255 MB. Punctuation restoration and inverse text normalization.
EN + Hinglish · Auto-downloaded
Metadata per segment
Every segment includes these tags. Common tags appear on all models. Additional tags depend on the model.
| Tag | Values | Models |
|---|---|---|
| emotion | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE | All |
| age | AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60+ | All |
| gender | GENDER_MALE, GENDER_FEMALE | All |
| behavior | 26 types (BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION, BEHAVIOR_ACKNOWLEDGE, ...) | en-in-tech-misc |
| eval | EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP | en-in-tech-misc |
| role | ROLE_INTERVIEWER / ROLE_INTERVIEWEE or ROLE_AGENT / ROLE_CUSTOMER | en-in-tech-misc, hinglish-loans |
| intent | 13 collections intents or 31 general intents (INTENT_GREETING, INTENT_QUESTION, ...) | hinglish-loans, whissle-large |
| dialect | DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS | zh |
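The table can be encoded as a lookup so client code knows which tags to expect for a given model. This is a sketch based only on the rows above; the helper names are hypothetical, and real responses may omit tags the table lists.

```python
from collections import Counter

# Tag -> models that emit it, from the table above. None means "all models".
TAG_MODELS = {
    "emotion": None,
    "age": None,
    "gender": None,
    "behavior": {"en-in-tech-misc"},
    "eval": {"en-in-tech-misc"},
    "role": {"en-in-tech-misc", "hinglish-loans"},
    "intent": {"hinglish-loans", "whissle-large"},
    "dialect": {"zh"},
}


def expected_tags(model: str) -> set:
    """Tags the table lists for a model (common tags plus model-specific ones)."""
    return {tag for tag, models in TAG_MODELS.items()
            if models is None or model in models}


def short_label(value: str) -> str:
    """Drop the tag prefix: 'EMOTION_NEUTRAL' becomes 'neutral'."""
    _prefix, sep, rest = value.partition("_")
    return rest.lower() if sep else value.lower()


def tag_counts(segments: list, tag: str) -> Counter:
    """Count how often each value of `tag` appears across segments."""
    return Counter(seg["metadata"][tag] for seg in segments
                   if tag in seg.get("metadata", {}))
```

For instance, `tag_counts(response["segments"], "emotion")` gives an emotion histogram for a call.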
Variants
Choose your variant based on language and quality needs. Switch by changing VARIANT= and restarting. Cached models are reused.
| Variant | Languages | Download | Best for |
|---|---|---|---|
| hinglish | Hindi-English | ~515 MB | Debt collections, Hindi-English call centers |
| en-lite | English | ~500 MB | Quick testing, development |
| en-full ★ | English | ~2 GB | Sales coaching, interviews, therapy |
| multi-full | 23 languages | ~4 GB | Multilingual, highest quality |
| multi-zh | 23 langs + Mandarin | ~5 GB | Multilingual + dialect detection |
| all | All | ~6 GB | Maximum flexibility |
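Since download sizes range from about 500 MB to about 6 GB, it can help to check free disk space before choosing a variant. A rough sketch using the approximate sizes from the table (1 GB is taken as 1000 MB, and the 20% headroom factor is an arbitrary assumption; on-disk size after download may differ):

```python
import shutil

# Approximate download sizes from the table above, in MB (1 GB ~ 1000 MB).
VARIANT_SIZE_MB = {
    "hinglish": 515,
    "en-lite": 500,
    "en-full": 2000,
    "multi-full": 4000,
    "multi-zh": 5000,
    "all": 6000,
}


def variants_that_fit(free_mb: float, headroom: float = 1.2) -> list:
    """Variants whose download, padded by a safety factor, fits in free_mb."""
    return [name for name, size in VARIANT_SIZE_MB.items()
            if size * headroom <= free_mb]


def free_disk_mb(path: str = ".") -> float:
    """Free space on the filesystem containing `path`, in MB."""
    return shutil.disk_usage(path).free / 1_000_000
```

`variants_that_fit(free_disk_mb())` lists the candidates for the current machine. When Docker stores volumes on a different filesystem, pass that path instead.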
Runs everywhere
From your laptop (CPU) to data center GPUs. Same Docker, same API. Auto-detects GPU.
| Hardware | VRAM | Variant | Concurrent |
|---|---|---|---|
| MacBook / Laptop | CPU | Any | 1–3 |
| Mac Mini M4 Pro | 24 GB unified | en-full | 3–8 |
| NVIDIA T4 | 16 GB | en-lite | 5–10 |
| RTX 4090 | 24 GB | en-full | 20–50 |
| A100 40GB | 40 GB | multi-full | 50–80 |
| RTX 6000 Ada | 48 GB | all | 50–100 |
| H100 | 80 GB | all | 150–300 |
| DGX Spark | 128 GB unified | all | 30–60 |
| H200 | 141 GB | all | 250–500 |
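The concurrency ranges can be turned into a rough sizing estimate. The sketch below assumes the "Concurrent" column is a per-machine range for the variant listed in that row, and that capacity scales linearly with the number of machines; neither is stated in the table, so treat the result as a starting point for load testing.

```python
import math

# (low, high) concurrency ranges copied from the table above.
CONCURRENCY = {
    "MacBook / Laptop": (1, 3),
    "Mac Mini M4 Pro": (3, 8),
    "NVIDIA T4": (5, 10),
    "RTX 4090": (20, 50),
    "A100 40GB": (50, 80),
    "RTX 6000 Ada": (50, 100),
    "H100": (150, 300),
    "DGX Spark": (30, 60),
    "H200": (250, 500),
}


def instances_needed(hardware: str, target: int, conservative: bool = True) -> int:
    """Machines needed to reach `target` concurrency.

    With conservative=True the low end of the range is used per machine,
    otherwise the high end.
    """
    low, high = CONCURRENCY[hardware]
    per_machine = low if conservative else high
    return math.ceil(target / per_machine)
```

For a target of 120 concurrent sessions on RTX 4090 hardware, the conservative estimate is `instances_needed("RTX 4090", 120)`, which uses 20 per machine.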
| Docker Tag | Arch | Runtime |
|---|---|---|
| whissleasr/whissle-gateway:latest | amd64 | CPU — Mac (Rosetta), Linux, Windows |
| whissleasr/whissle-gateway:gpu | amd64 | NVIDIA CUDA 12.4 + onnxruntime-gpu |
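A deployment script has to pick one of the two tags. The sketch below chooses the `gpu` tag when `nvidia-smi` is on the `PATH`, which is only a heuristic for the presence of an NVIDIA driver and is an assumption of this example, not the gateway's own detection logic.

```python
import shutil
from typing import Optional

# Image tags copied from the table above.
CPU_IMAGE = "whissleasr/whissle-gateway:latest"
GPU_IMAGE = "whissleasr/whissle-gateway:gpu"


def has_nvidia_gpu() -> bool:
    """Heuristic: nvidia-smi on PATH suggests an NVIDIA driver is installed."""
    return shutil.which("nvidia-smi") is not None


def pick_image(gpu: Optional[bool] = None) -> str:
    """Return the image tag to pull; pass gpu explicitly to override detection."""
    if gpu is None:
        gpu = has_nvidia_gpu()
    return GPU_IMAGE if gpu else CPU_IMAGE
```

Passing `gpu=False` forces the CPU image, which is useful on machines that have a driver installed but no container GPU runtime.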
Architecture
┌──────────────────────────────────────────────────────────────┐
│ Docker Container │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ ASR │ │ TTS │ │ Pipecat │ │ Agent │ │
│ │ :8001 │ │ :8003 │ │ :8000 │ │ :8765 │ │
│ │ │ │ Kokoro │ │ │ │ Claude / │ │
│ │ ONNX │ │ 82M │ │ WebRTC │ │ Gemini API │ │
│ │ +KenLM │ │ 55 voice │ │ Twilio │ │ │ │
│ │ +ECAPA │ │ │ │ Voice AI │ │ Summarize │ │
│ │ +VAD │ │ │ │ │ │ Coach │ │
│ │ +Punct │ │ │ │ Auth │ │ Analyze │ │
│ │ +ITN │ │ │ │ Multi-org│ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────────┘ │
│ │ │
│ ┌──────────────┐ │
│ │ PostgreSQL │ │
│ │ :5432 │ │
│ └──────────────┘ │
│ │
│ /models (Docker volume — cached ASR models) │
│ /data (Docker volume — PostgreSQL, auth, conversations) │
└──────────────────────────────────────────────────────────────┘
whissle-models volume
ASR models, KenLM, punctuation, ITN. Downloaded on first run, cached forever. Survives container restarts.
whissle-data volume
Conversations, analytics, agent configs, auth tokens. Persists across restarts. Only deleted by docker volume rm.
Get started
One command. Models download automatically. Ready in 2 minutes.
Built for contact centers, sales intelligence, behavioral AI, and more.