Noise-robust, Optimized, eXplainable Voice Activity Detector
NOVA-VAD is a lightweight, explainable Voice Activity Detector that outperforms WebRTC VAD, Pyannote VAD, and Silero VAD on accuracy, precision, and F1 in the real-world noisy-audio benchmark below, without requiring a GPU or PyTorch.
It was built as an open-source contribution toward a problem that has persisted in speech processing for 15+ years: existing VADs are accurate, lightweight, or explainable, but never all three.
Tested on 100 held-out files from UrbanSound8K (traffic, sirens, jackhammers, AC units, construction noise):
| Model | Accuracy | Precision | Recall | F1 | Lightweight | Explainable |
|---|---|---|---|---|---|---|
| WebRTC VAD | 58.0% | 57.69% | 60.0% | 58.82% | ✅ | ❌ |
| Pyannote VAD | 62.0% | 57.32% | 94.0% | 71.21% | ❌ | ❌ |
| Silero VAD | 87.0% | 86.27% | 88.0% | 87.13% | ❌ | ❌ |
| NOVA-VAD | 93.0% | 97.78% | 88.0% | 92.63% | ✅ | ✅ |
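
The four metric columns all derive from one confusion matrix per model. The sketch below shows the arithmetic. The README does not publish confusion counts or the speech/noise split, so the counts here are assumptions: they suppose 50 speech and 50 noise files and were chosen because they reproduce the reported percentages. The function name `vad_metrics` is hypothetical.

```python
def vad_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, F1) as percentages."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)          # of files flagged as speech, how many were speech
    recall = tp / (tp + fn)             # of real speech files, how many were found
    f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall
    return tuple(round(100 * m, 2) for m in (accuracy, precision, recall, f1))


# ASSUMPTION: hypothetical counts for a 50 speech / 50 noise test set.
# They are not taken from the project; they merely match the table.
counts = {
    "WebRTC VAD": dict(tp=30, fp=22, tn=28, fn=20),
    "Pyannote VAD": dict(tp=47, fp=35, tn=15, fn=3),
    "Silero VAD": dict(tp=44, fp=7, tn=43, fn=6),
    "NOVA-VAD": dict(tp=44, fp=1, tn=49, fn=6),
}

for name, c in counts.items():
    print(name, vad_metrics(**c))
```

Under these assumed counts, Pyannote's high recall comes with 35 false positives, while NOVA-VAD's high precision corresponds to a single false positive.
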
| Feature | WebRTC | Silero | Pyannote | NOVA-VAD |
|---|---|---|---|---|
| Accurate on noisy audio | ❌ | Partial | Partial | ✅ |
| Lightweight (no PyTorch) | ✅ | ❌ | ❌ | ✅ |
| Fully open source | β | Partial | β | β |
| Explains every decision | ❌ | ❌ | ❌ | ✅ |
| Retrainable on custom data | β | β | β | β |
| Confidence scores | β | β | β | β |
Raw Audio → Denoiser → 150+ Features → Ensemble Classifier → SPEECH / NO SPEECH + Explanation
- MFCCs + deltas (78 features): spectral shape and change over time
- Zero Crossing Rate: speech is more consistent than noise
- RMS Energy pattern: speech rises and falls rhythmically
- Spectral Flux: speech transitions smoothly, noise changes randomly
- Harmonic/Percussive ratio: human voice is mostly harmonic
- Tempo/rhythm: speech has a syllable rhythm; noise does not
- Mel Spectrogram statistics: energy distribution across frequency bands
- Silence ratio: proportion of frames below an energy threshold
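
Three of the simpler features in this list (zero crossing rate, RMS energy, and silence ratio) can be sketched in plain Python. This is an illustration only, not the project's extraction code: the function names, the 400-sample frame size, and the 0.01 energy threshold are all assumptions made for the example.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into frames of frame_len, stepping by hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def silence_ratio(frames, threshold):
    """Proportion of frames whose RMS energy falls below the threshold."""
    quiet = sum(1 for f in frames if rms_energy(f) < threshold)
    return quiet / len(frames)


# Toy signal: 0.5 s of silence followed by 0.5 s of a 200 Hz tone at 16 kHz.
sr = 16000
silence = [0.0] * (sr // 2)
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(sr // 2)]

# ASSUMPTION: 25 ms non-overlapping frames and a 0.01 threshold.
frames = frame_signal(silence + tone, frame_len=400, hop=400)
print(silence_ratio(frames, threshold=0.01))  # 0.5
```

Per-frame values like these are typically summarized (mean, standard deviation, and so on) over a whole file before being passed to a classifier; which statistics NOVA-VAD uses is not shown here.
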
A Random Forest and a Gradient Boosting classifier vote together as an ensemble.
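
How two classifiers can vote is sketched below. Whether NOVA-VAD uses hard or soft voting, and with what weights, is not stated here, so this example assumes soft voting with equal weights: each model contributes its estimated probability of speech and the average decides the label. The function name `soft_vote` and the 0.5 decision threshold are hypothetical. scikit-learn's `VotingClassifier` with `voting="soft"` offers the same idea for fitted estimators.

```python
def soft_vote(prob_rf, prob_gb, weights=(0.5, 0.5)):
    """Combine two models' P(speech) by weighted average.

    Returns (label, confidence), where confidence is the averaged
    probability of the winning class.
    """
    w_rf, w_gb = weights
    p_speech = (w_rf * prob_rf + w_gb * prob_gb) / (w_rf + w_gb)
    # ASSUMPTION: a 0.5 decision threshold on the averaged probability.
    label = "SPEECH" if p_speech >= 0.5 else "NO SPEECH"
    confidence = p_speech if label == "SPEECH" else 1.0 - p_speech
    return label, confidence


# The forest is fairly sure it hears speech; boosting is less sure.
label, confidence = soft_vote(0.9, 0.7)
print(label, f"{confidence:.2%}")
```

Averaging probabilities rather than counting votes is what makes a graded confidence score possible with only two models, since a plain majority vote between two classifiers can tie.
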
Every prediction includes a confidence score and the top 10 features that drove the decision, described in plain English.
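
The ranking step behind such an explanation can be sketched as follows. This is a minimal illustration, not the code in `src/explainer.py`: the function name `explain` is hypothetical, the importance values are supplied by hand (borrowed from the sample output in this README), and the step that turns each feature value into a plain-English phrase is omitted.

```python
def explain(label, confidence, importances, top_k=10):
    """Format a prediction with its top_k most influential features.

    importances maps feature name -> share of the decision (0..1).
    """
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    lines = [
        f"Prediction: {label}",
        f"Confidence: {confidence:.2%}",
        "Why this decision was made:",
    ]
    for name, weight in ranked:
        lines.append(f"  {name} ({weight:.2%})")
    return "\n".join(lines)


# ASSUMPTION: hand-entered importances, in no particular order.
importances = {
    "Mel mean": 0.0350,
    "MFCC Delta 1 std": 0.1063,
    "Silence ratio": 0.0592,
    "MFCC Delta 2 std": 0.0614,
    "Spectral centroid std": 0.0427,
}
print(explain("SPEECH", 0.9347, importances, top_k=3))
```

With tree ensembles, per-feature importances of this kind are commonly read from the fitted model, for example scikit-learn's `feature_importances_` attribute; whether NOVA-VAD derives its percentages that way is an assumption.
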
git clone https://github.com/monishmal3375/nova-vad.git
cd nova-vad
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 download_data.py
python3 -m src.pipeline
python3 -m src.explainer data/clean_speech/speech_001.wav
python3 -m src.benchmark
NOVA-VAD EXPLANATION
File: speech_001.wav
Prediction: SPEECH
Confidence: 93.47%
Why this decision was made:
MFCC Delta 1 std (10.63%) → HIGH spectral change rate → dynamic audio like speech
MFCC Delta 2 std ( 6.14%) → HIGH acceleration → rapidly changing audio, speech-like
Silence ratio ( 5.92%) → 56% silence → mix of speech and s
Spectral centroid std ( 4.27%) → HIGH variation → shifting frequency center
Mel mean ( 3.50%) → MODERATE energy → normal speech level
nova-vad/
├── data/
│   ├── speech/ # raw speech files
│   ├── noise/ # raw noise files
│   ├── clean_speech/ # denoised speech
│   └── clean_noise/ # denoised noise
├── src/
│   ├── denoiser.py # noise reduction pipeline
│   ├── vad.py # WebRTC VAD baseline
│   ├── classifier.py # NOVA-VAD 150+ features + ensemble
│   ├── explainer.py # explainability layer
│   ├── benchmark.py # head-to-head comparison
│   └── pipeline.py # end-to-end runner
├── models/ # saved trained models
└── download_data.py # automated dataset down
Existing VADs fail in three ways:
- They break in noisy environments: WebRTC gets 58% on real-world noise
- They are black boxes: no explanation of why a decision was made
- They are too heavy for edge devices: Silero needs PyTorch (200MB+)
NOVA-VAD addresses all three at once, which none of the open-source tools benchmarked above does.
- Denoiser pipeline
- WebRTC VAD baseline
- 150+ feature MFCC classifier
- Ensemble model (RF + GBT)
- Explainability layer
- Benchmark vs Silero, Pyannote, WebRTC
- Real-time streaming audio support
- pip install nova-vad packaging
- Research paper
Monish
MIT License: free to use, modify, and distribute.