# I built an open source VAD that beats Silero, Pyannote, and WebRTC

> Source: <https://github.com/monishmal3375/nova-vad>
> Published: 2026-06-24 03:06:22+00:00

Noise-robust, Optimized, eXplainable Voice Activity Detector

NOVA-VAD is a lightweight, explainable Voice Activity Detector that outperforms WebRTC VAD, Pyannote VAD, and Silero VAD on real-world noisy audio, without requiring a GPU or PyTorch.

It is an open-source attempt at a problem that has persisted in speech processing for 15+ years: existing VADs are accurate, lightweight, or explainable, but never all three.

Tested on 100 held-out files from UrbanSound8K (traffic, sirens, jackhammers, AC units, construction noise):

| Model | Accuracy | Precision | Recall | F1 | Lightweight | Explainable |
|---|---|---|---|---|---|---|
| WebRTC VAD | 58.0% | 57.69% | 60.0% | 58.82% | ✅ | ❌ |
| Pyannote VAD | 62.0% | 57.32% | 94.0% | 71.21% | ❌ | ❌ |
| Silero VAD | 87.0% | 86.27% | 88.0% | 87.13% | ❌ | ❌ |
| NOVA-VAD | 93.0% | 97.78% | 88.0% | 92.63% | ✅ | ✅ |
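
The four metric columns follow directly from confusion-matrix counts. The sketch below recomputes the NOVA-VAD row from one set of counts that is consistent with the reported figures, assuming the 100 test files split evenly into 50 speech and 50 noise clips. That split and the counts are inferred for illustration; the benchmark does not state them.

```python
def vad_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 as percentages."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    values = [("accuracy", accuracy), ("precision", precision),
              ("recall", recall), ("f1", f1)]
    return {name: round(100 * value, 2) for name, value in values}

# Assumed counts (speech is the positive class): 44 speech clips detected,
# 6 missed, 1 noise clip flagged as speech, 49 noise clips rejected.
nova = vad_metrics(tp=44, fp=1, fn=6, tn=49)
print(nova)
```

Read this way, the high precision means NOVA-VAD rarely labels noise as speech, while its recall matches Silero's: both miss the same share of speech clips.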

| Feature | WebRTC | Silero | Pyannote | NOVA-VAD |
|---|---|---|---|---|
| Accurate on noisy audio | ❌ | Partial | Partial | ✅ |
| Lightweight (no PyTorch) | ✅ | ❌ | ❌ | ✅ |
| Fully open source | ✅ | Partial | ✅ | ✅ |
| Explains every decision | ❌ | ❌ | ❌ | ✅ |
| Retrainable on custom data | ❌ | ❌ | ❌ | ✅ |
| Confidence scores | ❌ | ❌ | ❌ | ✅ |

Raw Audio → Denoiser → 150+ Features → Ensemble Classifier → SPEECH / NO SPEECH + Explanation

- MFCCs + deltas (78 features) — spectral shape and change over time
- Zero Crossing Rate — speech is more consistent than noise
- RMS Energy pattern — speech rises and falls rhythmically
- Spectral Flux — speech transitions smoothly, noise changes randomly
- Harmonic/Percussive ratio — human voice is mostly harmonic
- Tempo/rhythm — speech has a syllable rhythm that noise does not
- Mel Spectrogram statistics — energy distribution across frequency bands
- Silence ratio — proportion of frames below energy threshold
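
Three of the simpler features above can be sketched with the standard library alone. This is a minimal illustration, not the project's `classifier.py`: the function names, the 400-sample frame length, and the 0.02 energy threshold are all assumptions chosen for the example.

```python
import math

def frames(samples, frame_len=400):
    """Split samples into non-overlapping frames, dropping the remainder."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)

def rms(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def silence_ratio(samples, frame_len=400, threshold=0.02):
    """Proportion of frames whose RMS energy falls below the threshold."""
    energies = [rms(f) for f in frames(samples, frame_len)]
    return sum(e < threshold for e in energies) / len(energies)

# Synthetic clip at 16 kHz: 0.5 s of a 200 Hz tone followed by 0.5 s of silence.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(sr // 2)]
clip = tone + [0.0] * (sr // 2)

print(silence_ratio(clip))            # half of the frames are silent: 0.5
print(zero_crossing_rate(tone[:400]))
print(rms(tone[:400]))
```

A real feature vector would add summary statistics (mean, standard deviation) of these per-frame values alongside the MFCC and spectral features.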

Random Forest + Gradient Boosting voting together.
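
One common way for two classifiers to vote is soft voting: average their predicted speech probabilities and threshold the result. Whether NOVA-VAD uses soft or hard voting, and with what weights, is not stated here, so the `soft_vote` function, its equal default weights, and the sample probabilities below are assumptions for illustration.

```python
def soft_vote(probabilities, weights=None, threshold=0.5):
    """Combine per-model speech probabilities into one label and confidence.

    probabilities: speech probability from each model, e.g. [rf_prob, gb_prob].
    weights: optional per-model weights; equal weighting when omitted.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    speech_prob = sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)
    label = "SPEECH" if speech_prob >= threshold else "NO SPEECH"
    # Confidence is the probability of whichever class was chosen.
    confidence = speech_prob if label == "SPEECH" else 1.0 - speech_prob
    return label, confidence

# Hypothetical outputs from a random forest (0.96) and gradient boosting (0.90).
label, confidence = soft_vote([0.96, 0.90])
print(label, f"{confidence:.2%}")
```

In scikit-learn, `VotingClassifier` with `voting="soft"` applies the same averaging to its member estimators.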

Every prediction includes confidence score + top 10 features that drove the decision in plain English.
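
A plausible way to produce such an explanation is to rank features by the trained model's importance scores and report the top entries. The sketch below assumes importances arrive as a name-to-weight mapping (tree ensembles typically expose one). The `top_features` helper is hypothetical, and the sample weights are borrowed from the example explanation output in this README.

```python
def top_features(importances, k=10):
    """Return the k most influential features as (name, percent) pairs."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [(name, round(100 * weight, 2)) for name, weight in ranked[:k]]

# Illustrative importances; a real model would supply one entry per feature.
importances = {
    "Silence ratio": 0.0592,
    "MFCC Delta 1 std": 0.1063,
    "Mel mean": 0.0350,
    "MFCC Delta 2 std": 0.0614,
}

for name, percent in top_features(importances, k=3):
    print(f"{name} ({percent:.2f}%)")
```

Turning each ranked feature into a plain-English phrase such as "HIGH spectral change rate" would need an extra lookup step that compares the feature's value against typical speech and noise ranges.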

```
git clone https://github.com/monishmal3375/nova-vad.git
cd nova-vad
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 download_data.py
python3 -m src.pipeline
python3 -m src.explainer data/clean_speech/speech_001.wav
python3 -m src.benchmark
```

=======================================================

NOVA-VAD EXPLANATION

File: speech_001.wav

Prediction: SPEECH

Confidence: 93.47%

Why this decision was made:

- MFCC Delta 1 std (10.63%) → HIGH spectral change rate — dynamic audio like speech
- MFCC Delta 2 std (6.14%) → HIGH acceleration — rapidly changing audio, speech-like
- Silence ratio (5.92%) → 56% silence — mix of speech and pauses
- Spectral centroid std (4.27%) → HIGH variation — shifting frequency center
- Mel mean (3.50%) → MODERATE energy — normal speech level

nova-vad/
├── data/
│   ├── speech/          # raw speech files
│   ├── noise/           # raw noise files
│   ├── clean_speech/    # denoised speech
│   └── clean_noise/     # denoised noise
├── src/
│   ├── denoiser.py      # noise reduction pipeline
│   ├── vad.py           # WebRTC VAD baseline
│   ├── classifier.py    # NOVA-VAD 150+ features + ensemble
│   ├── explainer.py     # explainability layer
│   ├── benchmark.py     # head-to-head comparison
│   └── pipeline.py      # end-to-end runner
├── models/              # saved trained models
├── download_data.py     # automated dataset downloader

**Existing VADs fail in three ways:**

- They break in noisy environments — WebRTC gets 58% on real-world noise
- They are black boxes — no explanation of why a decision was made
- They are too heavy for edge devices — Silero needs PyTorch (200MB+)

NOVA-VAD solves all three simultaneously. No existing open-source tool does this.

- Denoiser pipeline
- WebRTC VAD baseline
- 150+ feature MFCC classifier
- Ensemble model (RF + GBT)
- Explainability layer
- Benchmark vs Silero, Pyannote, WebRTC
- Real-time streaming audio support
- pip install nova-vad packaging
- Research paper

**Monish**

MIT License — free to use, modify, and distribute.