
I built an open source VAD that beats Silero, Pyannote, and WebRTC

Developer Monish released NOVA-VAD, an open-source voice activity detector that achieves 93% accuracy on noisy audio, outperforming Silero, Pyannote, and WebRTC while remaining lightweight and explainable. The tool uses an ensemble classifier with 150+ features and provides confidence scores and decision explanations, addressing a long-standing trade-off in speech processing.

Published Jun 24, 2026

Noise-robust, Optimized, eXplainable Voice Activity Detector

NOVA-VAD is a lightweight, explainable Voice Activity Detector that outperforms every major open-source alternative on real-world noisy audio, without requiring a GPU or PyTorch.

It was built as an open-source contribution toward a problem that has persisted in speech processing for 15+ years: existing VADs are accurate, lightweight, or explainable, but never all three.

Tested on 100 held-out files from UrbanSound8K (traffic, sirens, jackhammers, AC units, construction noise):

| Model | Accuracy | Precision | Recall | F1 | Lightweight | Explainable |
|---|---|---|---|---|---|---|
| WebRTC VAD | 58.0% | 57.69% | 60.0% | 58.82% | ✅ | ❌ |
| Pyannote VAD | 62.0% | 57.32% | 94.0% | 71.21% | ❌ | ❌ |
| Silero VAD | 87.0% | 86.27% | 88.0% | 87.13% | ❌ | ❌ |
| NOVA-VAD | 93.0% | 97.78% | 88.0% | 92.63% | ✅ | ✅ |

| Feature | WebRTC | Silero | Pyannote | NOVA-VAD |
|---|---|---|---|---|
| Accurate on noisy audio | ❌ | Partial | Partial | ✅ |
| Lightweight (no PyTorch) | ✅ | ❌ | ❌ | ✅ |
| Fully open source | ✅ | Partial | ✅ | ✅ |
| Explains every decision | ❌ | ❌ | ❌ | ✅ |
| Retrainable on custom data | ❌ | ❌ | ❌ | ✅ |
| Confidence scores | ❌ | ❌ | ❌ | ✅ |
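
The accuracy, precision, recall, and F1 figures in the benchmark table are all derived from the same confusion matrix, so they can be cross-checked against each other. The sketch below recomputes them from raw counts. The counts are not published in the source: they are back-derived here under the assumption that the 100 test files split evenly into 50 speech and 50 non-speech clips, an assumption that happens to reproduce every figure in the table.

```python
def vad_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 as percentages."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * m, 2) for m in (accuracy, precision, recall, f1))


# Assumed counts (tp, fp, fn, tn), inferred from the table rather than reported.
ASSUMED_COUNTS = {
    "WebRTC VAD": (30, 22, 20, 28),
    "Pyannote VAD": (47, 35, 3, 15),
    "Silero VAD": (44, 7, 6, 43),
    "NOVA-VAD": (44, 1, 6, 49),
}

for name, counts in ASSUMED_COUNTS.items():
    print(name, vad_metrics(*counts))
```

Read this way, NOVA-VAD and Silero would detect the same number of speech clips (identical recall); the gap in accuracy would come from NOVA-VAD raising far fewer false alarms on noise.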

Raw Audio → Denoiser → 150+ Features → Ensemble Classifier → SPEECH / NO SPEECH + Explanation

  • MFCCs + deltas (78 features): spectral shape and how it changes over time
  • Zero Crossing Rate: more consistent for speech than for noise
  • RMS Energy pattern: speech rises and falls rhythmically
  • Spectral Flux: speech transitions smoothly, noise changes randomly
  • Harmonic/Percussive ratio: the human voice is mostly harmonic
  • Tempo/rhythm: speech has a syllable rhythm that noise lacks
  • Mel Spectrogram statistics: energy distribution across frequency bands
  • Silence ratio: proportion of frames below an energy threshold
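
To make a few of these features concrete, here is a minimal pure-Python sketch of zero crossing rate, RMS energy, and silence ratio computed over fixed-length frames. It is an illustration of the definitions, not NOVA-VAD's own feature code; the function names, the 50 ms frame length, and the 0.01 energy threshold are all assumptions chosen for the example.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into frames (overlapping when hop < frame_len)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def silence_ratio(frames, threshold):
    """Proportion of frames whose RMS energy is below the threshold."""
    quiet = sum(1 for f in frames if rms_energy(f) < threshold)
    return quiet / len(frames)


# Synthetic 1-second clip at 8 kHz: half a second of a 200 Hz tone, then silence.
SR = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SR) for n in range(SR // 2)]
clip = tone + [0.0] * (SR // 2)

frames = frame_signal(clip, frame_len=400, hop=400)  # 50 ms, non-overlapping
print("frames:", len(frames))
print("silence ratio:", silence_ratio(frames, threshold=0.01))
print("ZCR of first frame:", zero_crossing_rate(frames[0]))
```

Per-frame values like these are typically summarized per clip (mean, standard deviation, and so on) before being fed to a classifier, which is one plausible way a handful of signal descriptors grows into 150+ features.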

The classifier is an ensemble: a Random Forest and a Gradient Boosting model vote together.
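
The source does not say how the two models' votes are combined. One common scheme is soft voting, where each model's predicted probability of speech is averaged and the result is thresholded (scikit-learn's `VotingClassifier` offers this as `voting="soft"`). The sketch below shows that idea in plain Python; the `soft_vote` helper and the probabilities passed to it are hypothetical.

```python
def soft_vote(model_probs, weights=None):
    """Average each model's P(speech) and threshold at 0.5.

    model_probs: list of P(speech) values, one per model.
    Returns (label, confidence), where confidence is the averaged
    probability of the winning class.
    """
    if weights is None:
        weights = [1.0] * len(model_probs)
    p_speech = sum(w * p for w, p in zip(weights, model_probs)) / sum(weights)
    if p_speech >= 0.5:
        return "SPEECH", p_speech
    return "NO SPEECH", 1.0 - p_speech


# Hypothetical outputs from a Random Forest and a Gradient Boosting model.
rf_p, gb_p = 0.90, 0.70
print(soft_vote([rf_p, gb_p]))
```

With only two models, averaging probabilities avoids the ties that a simple majority vote would produce, and the averaged probability doubles as a confidence score.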

Every prediction includes a confidence score and the top 10 features that drove the decision, described in plain English.

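# Clone the repository and enter it
git clone https://github.com/monishmal3375/nova-vad.git
cd nova-vad
# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Download the datasets
python3 download_data.py
# Run the end-to-end pipeline
python3 -m src.pipeline
# Explain the prediction for a single file
python3 -m src.explainer data/clean_speech/speech_001.wav
# Run the head-to-head benchmark
python3 -m src.benchmark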

NOVA-VAD EXPLANATION
File: speech_001.wav

Prediction: SPEECH
Confidence: 93.47%

Why this decision was made:

MFCC Delta 1 std (10.63%) → HIGH spectral change rate - dynamic audio like speech
MFCC Delta 2 std ( 6.14%) → HIGH acceleration - rapidly changing audio, speech-like
Silence ratio ( 5.92%) → 56% silence - mix of speech and s
Spectral centroid std ( 4.27%) → HIGH variation - shifting frequency center
Mel mean ( 3.50%) → MODERATE energy - normal speech level
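
A report like the sample output above can be assembled by ranking features by their share of the decision and printing the top few with a short description. The sketch below is one way to do that, not the project's `explainer.py`; the `render_explanation` function and its tuple layout are hypothetical, while the feature names and percentages are copied from the sample output (descriptions shortened).

```python
def render_explanation(filename, label, confidence, contributions, top_k=10):
    """Format a plain-text report listing the top_k contributing features.

    contributions: list of (feature_name, share_percent, description) tuples.
    """
    ranked = sorted(contributions, key=lambda c: c[1], reverse=True)[:top_k]
    lines = [
        f"File: {filename}",
        f"Prediction: {label}",
        f"Confidence: {confidence:.2f}%",
        "Why this decision was made:",
    ]
    for name, share, description in ranked:
        lines.append(f"  {name} ({share:.2f}%) -> {description}")
    return "\n".join(lines)


# Values copied from the sample output, deliberately given out of order.
sample = [
    ("Mel mean", 3.50, "MODERATE energy"),
    ("MFCC Delta 1 std", 10.63, "HIGH spectral change rate"),
    ("Silence ratio", 5.92, "56% silence"),
    ("MFCC Delta 2 std", 6.14, "HIGH acceleration"),
    ("Spectral centroid std", 4.27, "HIGH variation"),
]
print(render_explanation("speech_001.wav", "SPEECH", 93.47, sample, top_k=3))
```

Sorting before truncating matters here: the input order is arbitrary, so the report always leads with the feature that contributed most.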

nova-vad/
├── data/
│   ├── speech/          # raw speech files
│   ├── noise/           # raw noise files
│   ├── clean_speech/    # denoised speech
│   └── clean_noise/     # denoised noise
├── src/
│   ├── denoiser.py      # noise reduction pipeline
│   ├── vad.py           # WebRTC VAD baseline
│   ├── classifier.py    # NOVA-VAD 150+ features + ensemble
│   ├── explainer.py     # explainability layer
│   ├── benchmark.py     # head-to-head comparison
│   └── pipeline.py      # end-to-end runner
├── models/              # saved trained models
├── download_data.py     # automated dataset down

Existing VADs fail in three ways:

  • They break in noisy environments: WebRTC reaches only 58% accuracy on real-world noise
  • They are black boxes: they give no explanation of why a decision was made
  • They are too heavy for edge devices: Silero needs PyTorch (200MB+)

NOVA-VAD solves all three simultaneously. No existing open-source tool does this.

  • Denoiser pipeline
  • WebRTC VAD baseline
  • 150+ feature MFCC classifier
  • Ensemble model (RF + GBT)
  • Explainability layer
  • Benchmark vs Silero, Pyannote, WebRTC
  • Real-time streaming audio support
  • pip install nova-vad packaging
  • Research paper

Monish

MIT License: free to use, modify, and distribute.
