# Whissle Gateway – Run Multi-Modal Voice AI Locally in a 500MB Docker

> Source: <https://whissle.ai/gateway>
> Published: 2026-06-13 08:16:09+00:00

What happens when you run it:

```
═══════════════════════════════════════════════
  Whissle Gateway — en-full
═══════════════════════════════════════════════
No GPU detected → using CPU

Shared models:
  ✓ speaker encoder + VAD           26 MB
  ✓ punctuation                    254 MB
  ✓ ITN (English + Hinglish)       1.5 MB

Variant: en-full
  ✓ en-in-tech-misc (485 MB)
  ✓ KenLM ENGLISH (1.5 GB)

Auth:
  Mode:    local
  Token:   wh_a1b2c3d4e5f6... (admin)
  Manage:  curl -H 'Authorization: Bearer ...' localhost:9000/auth/tokens

Starting services...
  PostgreSQL: :5432  ●
  ASR:        :8001  ●
  TTS:        :8003  ●
  Agent:      :8765  ●
  Pipecat:    :8000  ●
  Gateway:    :9000  ●
```
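The banner lists one port per service. As a minimal sketch, the following checks from the host whether each of those ports accepts a TCP connection. It assumes the container publishes the ports on `localhost`, which the source does not show; `SERVICES`, `port_open`, and `check_services` are illustrative names, not part of Whissle Gateway.

```python
import socket

# Ports copied from the startup banner. Whether they are reachable from the
# host depends on how the container was started (assumption).
SERVICES = {
    "PostgreSQL": 5432,
    "ASR": 8001,
    "TTS": 8003,
    "Agent": 8765,
    "Pipecat": 8000,
    "Gateway": 9000,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    """Probe every service port and report which ones accept connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```

Calling `check_services()` returns a mapping such as `{"ASR": True, ...}`. An open port only shows that something is listening; it does not confirm that models have finished loading.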

## API

Five interfaces — batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent.

`POST localhost:8001/transcribe`

``` bash
$ curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3" \
    -F "diarize=true" \
    -F "num_speakers=2" \
    -F "punctuation=true" \
    -F "metadata_prob=true" \
    -F "summarize=sales_coaching" \
    -o result.json
```
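The same upload can be sketched in Python with only the standard library. The helper below builds the `multipart/form-data` body by hand and returns an unsent request object. The function name `build_transcribe_request` is hypothetical, and the `application/octet-stream` part type is an assumption, since the source does not say which content type the server expects for the audio part.

```python
import urllib.request
import uuid


def build_transcribe_request(
    audio: bytes,
    filename: str,
    fields: dict[str, str],
    url: str = "http://localhost:8001/transcribe",
) -> urllib.request.Request:
    """Encode form fields plus one audio file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    chunks: list[bytes] = []

    # Plain text fields, e.g. {"diarize": "true", "num_speakers": "2"}.
    for name, value in fields.items():
        part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        chunks.append(part.encode("utf-8"))

    # The audio goes in the "file" field, matching -F "file=@call.mp3".
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"  # assumed part type
    )
    chunks.append(head.encode("utf-8") + audio + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        url,
        data=b"".join(chunks),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and decode the body with `json.load`. Keeping construction separate from sending makes the encoding easy to inspect without a running container.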

Response — transcript + metadata per segment + AI analysis

```
{
  "segments": [
    {
      "speaker":  "SPEAKER_00",
      "text":     "Hello, good morning.",
      "start":    1.0,  "end": 1.9,
      "metadata": {
        "emotion":  "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role":     "ROLE_INTERVIEWER",
        "age":      "AGE_30_45",
        "gender":   "GENDER_MALE"
      },
      "words": [{"word": "Hello", "start": 1.0, "end": 1.3}]
    }
  ],
  "analysis": {
    "overall_score": 78,
    "buyer_outcome": "Converted",
    "practices":     { "followed": 6, "total": 8 },
    "highlights":    [...]
  }
}
```
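Once `result.json` is loaded, the per-segment fields can be aggregated with a few lines of Python. This sketch assumes the response has exactly the shape shown above; `SAMPLE` is a trimmed copy of that response, and `talk_time` and `emotion_counts` are illustrative helper names.

```python
from collections import Counter, defaultdict

# Trimmed copy of the response shown above (one segment only).
SAMPLE = {
    "segments": [
        {
            "speaker": "SPEAKER_00",
            "text": "Hello, good morning.",
            "start": 1.0,
            "end": 1.9,
            "metadata": {"emotion": "EMOTION_NEUTRAL", "role": "ROLE_INTERVIEWER"},
        }
    ],
    "analysis": {"overall_score": 78, "practices": {"followed": 6, "total": 8}},
}


def talk_time(response: dict) -> dict[str, float]:
    """Sum segment durations (end - start, in seconds) per speaker label."""
    totals: dict[str, float] = defaultdict(float)
    for seg in response.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def emotion_counts(response: dict) -> Counter:
    """Count how many segments carry each emotion tag."""
    return Counter(
        seg["metadata"]["emotion"]
        for seg in response.get("segments", [])
        if "emotion" in seg.get("metadata", {})
    )
```

For the sample, `talk_time(SAMPLE)` reports roughly 0.9 seconds for `SPEAKER_00`. The same loop extends to `role`, `behavior`, or any other tag the chosen model emits.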

## Parameters

All parameters for `POST /transcribe`.

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file (MP3, WAV, FLAC, OGG, M4A) |
| language | string | auto | Language hint: en, hi, zh |
| diarize | bool | false | Speaker diarization |
| num_speakers | int | auto | Exact speaker count (if known) |
| punctuation | bool | true | Restore punctuation and capitalization |
| itn | bool | true | Inverse text normalization (numbers, currency) |
| use_lm | bool | true | KenLM language model beam search |
| metadata_prob | bool | false | Probability distributions for metadata |
| word_timestamps | bool | false | Per-word start/end timestamps |
| speech_analysis | bool | false | Speech patterns (pace, fillers, fluency) |
| summarize | string | — | AI analysis: true, sales_coaching, collections, or custom prompt |
| hotwords | string | — | Comma-separated hotwords for boosting |
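One way to keep client code aligned with this table is to mirror it in a small dataclass and emit only the fields that differ from the documented defaults. This is a sketch, not an official client: `TranscribeParams` and `to_form` are hypothetical names, and the `"true"`/`"false"` string encoding is inferred from the curl example.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TranscribeParams:
    # Defaults mirror the table; None stands for "auto" or "not set".
    language: Optional[str] = None
    diarize: bool = False
    num_speakers: Optional[int] = None
    punctuation: bool = True
    itn: bool = True
    use_lm: bool = True
    metadata_prob: bool = False
    word_timestamps: bool = False
    speech_analysis: bool = False
    summarize: Optional[str] = None
    hotwords: Optional[list[str]] = None

    def to_form(self) -> dict[str, str]:
        """Return form fields for every value that differs from its default."""
        defaults = TranscribeParams()
        form: dict[str, str] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None or value == getattr(defaults, f.name):
                continue
            if isinstance(value, bool):  # check bool before int: bool is an int
                form[f.name] = "true" if value else "false"
            elif isinstance(value, list):
                form[f.name] = ",".join(value)  # hotwords are comma-separated
            else:
                form[f.name] = str(value)
        return form
```

For example, `TranscribeParams(diarize=True, num_speakers=2).to_form()` yields `{"diarize": "true", "num_speakers": "2"}`. Omitting default values keeps requests short, at the cost of relying on the server defaults staying as documented.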

## AI analysis modes

Add `-F "summarize=mode"` to any transcription. The diarized transcript and its metadata are sent to Claude or Gemini for analysis.

### Sales Coaching

`summarize=sales_coaching`

8 best practices scored. Rep/buyer identification. Highlights with timestamps. Behavior labels per segment. Overall score 0–100.

### Collections Compliance

`summarize=collections`

Identity verification, reason stated, amount mentioned, no harassment. Call outcome (Promise to Pay / Dispute / Hardship). Next action.

### General Summary

`summarize=true`

Overview, participants, key topics, emotional dynamics, entities, outcome. Markdown format.

### Custom Prompt

`summarize=your prompt here`

Pass any prompt string. The LLM receives your instructions + full transcript with per-segment metadata.
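The four modes share one parameter, so a client can tell them apart by value alone. The sketch below shows one plausible reading of that rule: three reserved values, with everything else treated as a custom prompt. How the server actually matches values (case sensitivity, surrounding whitespace) is not stated in the source, and `classify_summarize` is a hypothetical helper.

```python
from typing import Optional

# Reserved values taken from this section; any other string is a custom prompt.
BUILT_IN = {
    "true": "General Summary",
    "sales_coaching": "Sales Coaching",
    "collections": "Collections Compliance",
}


def classify_summarize(value: str) -> tuple[str, Optional[str]]:
    """Return (mode name, custom prompt or None) for a summarize value."""
    if not value:
        raise ValueError("summarize value must be a non-empty string")
    if value in BUILT_IN:
        return BUILT_IN[value], None
    return "Custom Prompt", value
```

A check like this is useful mainly for logging or UI labels, for instance to show which analysis a queued call will receive before it is uploaded.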

## Models

Each model extracts different metadata in a single ASR forward pass — no separate models or API calls.

### en-in-tech-misc

485 MB · 120M params. 26 behavioral codes for coaching, therapy, interviews. 8 evaluation labels.

English · 6 heads, 51 classes

### hinglish-loans

479 MB · 115M params. Debt collection intents: pay-back, disputes, hardship. Agent/Customer role detection.

Hindi-English · 5 heads, 26 classes

### zh

627 MB · 160M params. Mandarin with North/South dialect detection.

Mandarin · 3 heads, 12 classes

### whissle-large

2.4 GB · 600M params. Inline action tokens. 31 intent groups, 18K vocabulary.

23 languages · 5,500+ action tokens

### Kokoro TTS

82 MB. Non-autoregressive text-to-speech. Sub-200 ms time to first byte (TTFB) on CPU. Always included.

10 languages · Baked in

### Punctuation + ITN

255 MB. Punctuation restoration and inverse text normalization.

EN + Hinglish · Auto-downloaded

## Metadata per segment

Every segment includes these tags. Common tags appear on all models. Additional tags depend on the model.

| Tag | Values | Models |
|---|---|---|
| emotion | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE | All |
| age | AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60+ | All |
| gender | GENDER_MALE, GENDER_FEMALE | All |
| behavior | 26 types (BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION, BEHAVIOR_ACKNOWLEDGE, ...) | en-in-tech-misc |
| eval | EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP | en-in-tech-misc |
| role | ROLE_INTERVIEWER / ROLE_INTERVIEWEE or ROLE_AGENT / ROLE_CUSTOMER | en-in-tech-misc, hinglish-loans |
| intent | 13 collections intents or 31 general intents (INTENT_GREETING, INTENT_QUESTION, ...) | hinglish-loans, whissle-large |
| dialect | DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS | zh |
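The table can be encoded as a lookup so client code knows which tags to expect from a given model. This is a sketch built only from the rows above; `COMMON_TAGS`, `MODEL_TAGS`, `tags_for`, and `split_tag` are illustrative names, and the assumption is that every label follows the `CATEGORY_VALUE` pattern seen in the listed values.

```python
# Tags that the table marks as present on all models.
COMMON_TAGS = {"emotion", "age", "gender"}

# Model-specific tags, read off the "Models" column.
MODEL_TAGS = {
    "en-in-tech-misc": {"behavior", "eval", "role"},
    "hinglish-loans": {"role", "intent"},
    "whissle-large": {"intent"},
    "zh": {"dialect"},
}


def tags_for(model: str) -> set[str]:
    """Return every tag a model is documented to emit (KeyError if unknown)."""
    return COMMON_TAGS | MODEL_TAGS[model]


def split_tag(label: str) -> tuple[str, str]:
    """Split 'AGE_30_45' into ('age', '30_45') at the first underscore."""
    category, _, value = label.partition("_")
    return category.lower(), value
```

Splitting at the first underscore keeps multi-part values such as `30_45` intact, which a plain `split("_")` would break apart.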

## Variants

Choose your variant based on language and quality needs. Switch by changing `VARIANT=` and restarting. Cached models are reused.

| Variant | Languages | Download | Best for |
|---|---|---|---|
| `hinglish` | Hindi-English | ~515 MB | Debt collections, Hindi-English call centers |
| `en-lite` | English | ~500 MB | Quick testing, development |
| `en-full` ★ | English | ~2 GB | Sales coaching, interviews, therapy |
| `multi-full` | 23 languages | ~4 GB | Multilingual, highest quality |
| `multi-zh` | 23 langs + Mandarin | ~5 GB | Multilingual + dialect detection |
| `all` | All | ~6 GB | Maximum flexibility |
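When disk space or bandwidth is the constraint, the download column can drive the choice. The sketch below lists the variants that fit a given budget. The sizes are the approximate figures from the table converted to gigabytes (1 GB taken as 1000 MB), and `VARIANTS` and `variants_within` are hypothetical names.

```python
# (languages, approximate download in GB), copied from the table above.
VARIANTS = {
    "hinglish": ("Hindi-English", 0.515),
    "en-lite": ("English", 0.5),
    "en-full": ("English", 2.0),
    "multi-full": ("23 languages", 4.0),
    "multi-zh": ("23 langs + Mandarin", 5.0),
    "all": ("All", 6.0),
}


def variants_within(budget_gb: float) -> list[str]:
    """Return variant names whose approximate download fits the budget."""
    return [name for name, (_, size) in VARIANTS.items() if size <= budget_gb]
```

Because the table sizes are rounded, it is safer to leave some headroom rather than sizing a volume to the exact figure.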

## Runs everywhere

From your laptop (CPU) to data center GPUs. Same Docker image, same API. Auto-detects GPU.

| Hardware | VRAM | Variant | Concurrent |
|---|---|---|---|
| MacBook / Laptop | CPU | `Any` | 1–3 |
| Mac Mini M4 Pro | 24 GB unified | `en-full` | 3–8 |
| NVIDIA T4 | 16 GB | `en-lite` | 5–10 |
| RTX 4090 | 24 GB | `en-full` | 20–50 |
| A100 40GB | 40 GB | `multi-full` | 50–80 |
| RTX 6000 Ada | 48 GB | `all` | 50–100 |
| H100 | 80 GB | `all` | 150–300 |
| DGX Spark | 128 GB unified | `all` | 30–60 |
| H200 | 141 GB | `all` | 250–500 |
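A batch client can respect the "Concurrent" column by capping how many requests it has in flight. The sketch below uses a bounded thread pool for that. It assumes the column refers to simultaneous requests or streams, which the source does not define, and it takes the actual upload function as a parameter, so `transcribe_all` and `transcribe` are illustrative names only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def transcribe_all(
    paths: Iterable[str],
    transcribe: Callable[[str], dict],
    max_concurrent: int = 3,
) -> list[dict]:
    """Run transcribe(path) for each path with at most max_concurrent in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # map() yields results in input order, regardless of completion order.
        return list(pool.map(transcribe, paths))
```

On a laptop the table suggests a cap of 1 to 3; on larger GPUs the cap can be raised to match the listed range. Threads are a reasonable fit here because each worker spends its time waiting on HTTP I/O.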

| Docker Tag | Arch | Runtime |
|---|---|---|
| whissleasr/whissle-gateway:latest | amd64 | CPU — Mac (Rosetta), Linux, Windows |
| whissleasr/whissle-gateway:gpu | amd64 | NVIDIA CUDA 12.4 + onnxruntime-gpu |

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                     Docker Container                        │
│                                                             │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐   │
│  │ ASR      │ │ TTS      │ │ Pipecat  │ │ Agent        │   │
│  │ :8001    │ │ :8003    │ │ :8000    │ │ :8765        │   │
│  │          │ │ Kokoro   │ │          │ │ Claude /     │   │
│  │ ONNX     │ │ 82M      │ │ WebRTC   │ │ Gemini API   │   │
│  │ +KenLM   │ │ 55 voice │ │ Twilio   │ │              │   │
│  │ +ECAPA   │ │          │ │ Voice AI │ │ Summarize    │   │
│  │ +VAD     │ │          │ │          │ │ Coach        │   │
│  │ +Punct   │ │          │ │ Auth     │ │ Analyze      │   │
│  │ +ITN     │ │          │ │ Multi-org│ │              │   │
│  └──────────┘ └──────────┘ └──────────┘ └──────────────┘   │
│                     │                                       │
│              ┌──────────────┐                               │
│              │ PostgreSQL   │                               │
│              │ :5432        │                               │
│              └──────────────┘                               │
│                                                             │
│  /models  (Docker volume — cached ASR models)               │
│  /data    (Docker volume — PostgreSQL, auth, conversations) │
└──────────────────────────────────────────────────────────────┘
```

### whissle-models volume

ASR models, KenLM, punctuation, ITN. Downloaded on first run, cached forever. Survives container restarts.

### whissle-data volume

Conversations, analytics, agent configs, auth tokens. Persists across restarts. Only deleted by `docker volume rm`.

## Get started

One command. Models download automatically. Ready in 2 minutes.

Built for contact centers, sales intelligence, behavioral AI, and more.
