I built an open source VAD that beats Silero, Pyannote, and WebRTC

Developer Monish released NOVA-VAD, an open-source voice activity detector that achieves 93% accuracy on noisy audio, outperforming Silero, Pyannote, and WebRTC while remaining lightweight and explainable. The tool uses an ensemble classifier with 150+ features and provides confidence scores and decision explanations, addressing a long-standing trade-off in speech processing.

Noise-robust, Optimized, eXplainable Voice Activity Detector

NOVA-VAD is a lightweight, explainable Voice Activity Detector that outperforms every major open-source alternative on real-world noisy audio, without requiring a GPU or PyTorch. Built as an open-source contribution to solving a problem that has existed in speech processing for 15+ years: existing VADs are either accurate OR lightweight OR explainable. Never all three.

Tested on 100 held-out files from UrbanSound8K (traffic, sirens, jackhammers, AC units, construction noise):

| Model | Accuracy | Precision | Recall | F1 | Lightweight | Explainable |
|---|---|---|---|---|---|---|
| WebRTC VAD | 58.0% | 57.69% | 60.0% | 58.82% | ✅ | ❌ |
| Pyannote VAD | 62.0% | 57.32% | 94.0% | 71.21% | ❌ | ❌ |
| Silero VAD | 87.0% | 86.27% | 88.0% | 87.13% | ❌ | ❌ |
| NOVA-VAD | 93.0% | 97.78% | 88.0% | 92.63% | ✅ | ✅ |

| Feature | WebRTC | Silero | Pyannote | NOVA-VAD |
|---|---|---|---|---|
| Accurate on noisy audio | ❌ | Partial | Partial | ✅ |
| Lightweight (no PyTorch) | ✅ | ❌ | ❌ | ✅ |
| Fully open source | ✅ | Partial | ✅ | ✅ |
| Explains every decision | ❌ | ❌ | ❌ | ✅ |
| Retrainable on custom data | ❌ | ❌ | ❌ | ✅ |
| Confidence scores | ❌ | ❌ | ❌ | ✅ |

Raw Audio → Denoiser → 150+ Features → Ensemble Classifier → SPEECH / NO SPEECH + Explanation

- MFCCs + deltas (78 features): spectral shape and change over time
- Zero Crossing Rate: speech is more consistent than noise
- RMS Energy pattern: speech rises and falls rhythmically
- Spectral Flux: speech transitions smoothly, noise changes randomly
- Harmonic/Percussive ratio: human voice is mostly harmonic
- Tempo/rhythm: speech has syllable rhythm, noise does not
- Mel Spectrogram statistics: energy distribution across frequency bands
- Silence ratio: proportion of frames below energy threshold

Random Forest + Gradient Boosting voting together. Every prediction includes a confidence score and the top 10 features that drove the decision, in plain English.
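Three of the simpler features listed above (Zero Crossing Rate, RMS energy, and silence ratio) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the code in `src/classifier.py`: the function names, the 25 ms frame size, and the `0.01` energy threshold are all assumptions made for the example, and the input is a synthetic tone rather than real speech.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into frames; an incomplete trailing frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def silence_ratio(frames, threshold):
    """Proportion of frames whose RMS energy falls below the threshold."""
    quiet = sum(1 for f in frames if rms_energy(f) < threshold)
    return quiet / len(frames)

# Synthetic input (assumption): 0.5 s of a 200 Hz tone, then 0.5 s of silence, at 16 kHz.
SAMPLE_RATE = 16000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 2)]
signal = tone + [0.0] * (SAMPLE_RATE // 2)

# 25 ms non-overlapping frames (400 samples each); the frame size is an assumption.
frames = frame_signal(signal, frame_len=400, hop=400)

features = {
    "zcr_first_frame": zero_crossing_rate(frames[0]),
    "rms_first_frame": rms_energy(frames[0]),
    "silence_ratio": silence_ratio(frames, threshold=0.01),
}
print(features)
```

In a real pipeline these per-frame values would typically be summarized (mean, standard deviation, and so on) and concatenated with the spectral features before being passed to the classifier.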
    git clone https://github.com/monishmal3375/nova-vad.git
    cd nova-vad
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    python3 download_data.py
    python3 -m src.pipeline
    python3 -m src.explainer data/clean_speech/speech_001.wav
    python3 -m src.benchmark

Example explainer output:

    =======================================================
    NOVA-VAD EXPLANATION
    File: speech_001.wav
    Prediction: SPEECH
    Confidence: 93.47%
    Why this decision was made:
      MFCC Delta 1 std       10.63% → HIGH spectral change rate — dynamic audio like speech
      MFCC Delta 2 std        6.14% → HIGH acceleration — rapidly changing audio, speech-like
      Silence ratio           5.92% → 56% silence — mix of speech and pauses
      Spectral centroid std   4.27% → HIGH variation — shifting frequency center
      Mel mean                3.50% → MODERATE energy — normal speech level

Project layout:

    nova-vad/
    ├── data/
    │   ├── speech/          # raw speech files
    │   ├── noise/           # raw noise files
    │   ├── clean_speech/    # denoised speech
    │   └── clean_noise/     # denoised noise
    ├── src/
    │   ├── denoiser.py      # noise reduction pipeline
    │   ├── vad.py           # WebRTC VAD baseline
    │   ├── classifier.py    # NOVA-VAD 150+ features + ensemble
    │   ├── explainer.py     # explainability layer
    │   ├── benchmark.py     # head-to-head comparison
    │   └── pipeline.py      # end-to-end runner
    ├── models/              # saved trained models
    ├── download_data.py     # automated dataset downloader

Existing VADs fail in three ways:

- They break in noisy environments: WebRTC gets 58% on real-world noise
- They are black boxes: no explanation of why a decision was made
- They are too heavy for edge devices: Silero needs PyTorch (200MB+)

NOVA-VAD solves all three simultaneously. No existing open-source tool does this.

Roadmap:

- Denoiser pipeline
- WebRTC VAD baseline
- 150+ feature MFCC classifier
- Ensemble model (RF + GBT)
- Explainability layer
- Benchmark vs Silero, Pyannote, WebRTC
- Real-time streaming audio support
- pip install nova-vad packaging
- Research paper

Author: Monish

MIT License: free to use, modify, and distribute.