{"slug": "i-built-an-open-source-vad-that-beats-silero-pyannote-and-webrtc", "title": "I built an open source VAD that beats Silero, Pyannote, and WebRTC", "summary": "Developer Monish released NOVA-VAD, an open-source voice activity detector that achieves 93% accuracy on noisy audio, outperforming Silero, Pyannote, and WebRTC while remaining lightweight and explainable. The tool uses an ensemble classifier with 150+ features and provides confidence scores and decision explanations, addressing a long-standing trade-off in speech processing.", "body_md": "Noise-robust, Optimized, eXplainable Voice Activity Detector\n\nNOVA-VAD is a lightweight, explainable Voice Activity Detector that outperforms every major open-source alternative on real-world noisy audio — without requiring a GPU or PyTorch.\n\nBuilt as an open-source contribution to solving a problem that has existed in speech processing for 15+ years: existing VADs are either accurate OR lightweight OR explainable. Never all three.\n\nTested on 100 held-out files from UrbanSound8K (traffic, sirens, jackhammers, AC units, construction noise):\n\n| Model | Accuracy | Precision | Recall | F1 | Lightweight | Explainable |\n|---|---|---|---|---|---|---|\n| WebRTC VAD | 58.0% | 57.69% | 60.0% | 58.82% | ✅ | ❌ |\n| Pyannote VAD | 62.0% | 57.32% | 94.0% | 71.21% | ❌ | ❌ |\n| Silero VAD | 87.0% | 86.27% | 88.0% | 87.13% | ❌ | ❌ |\n| NOVA-VAD | 93.0% | 97.78% | 88.0% | 92.63% | ✅ | ✅ |\n\n| Feature | WebRTC | Silero | Pyannote | NOVA-VAD |\n|---|---|---|---|---|\n| Accurate on noisy audio | ❌ | Partial | Partial | ✅ |\n| Lightweight (no PyTorch) | ✅ | ❌ | ❌ | ✅ |\n| Fully open source | ✅ | Partial | ✅ | ✅ |\n| Explains every decision | ❌ | ❌ | ❌ | ✅ |\n| Retrainable on custom data | ❌ | ❌ | ❌ | ✅ |\n| Confidence scores | ❌ | ❌ | ❌ | ✅ |\n\nRaw Audio → Denoiser → 150+ Features → Ensemble Classifier → SPEECH / NO SPEECH + Explanation\n\n- MFCCs + deltas (78 features) — spectral shape and change over time\n- Zero 
Crossing Rate — speech is more consistent than noise\n- RMS Energy pattern — speech rises and falls rhythmically\n- Spectral Flux — speech transitions smoothly, noise changes randomly\n- Harmonic/Percussive ratio — human voice is mostly harmonic\n- Tempo/rhythm — speech has a syllable rhythm that noise does not\n- Mel Spectrogram statistics — energy distribution across frequency bands\n- Silence ratio — proportion of frames below energy threshold\n\nA Random Forest and a Gradient Boosting classifier vote together.\n\nEvery prediction includes a confidence score and the top 10 features that drove the decision, described in plain English.\n\n```\ngit clone https://github.com/monishmal3375/nova-vad.git\ncd nova-vad\npython3 -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\npython3 download_data.py\npython3 -m src.pipeline\npython3 -m src.explainer data/clean_speech/speech_001.wav\npython3 -m src.benchmark\n```\n\n=======================================================\n\nNOVA-VAD EXPLANATION\n\nFile: speech_001.wav\n\nPrediction: SPEECH\n\nConfidence: 93.47%\n\nWhy this decision was made:\n\n- MFCC Delta 1 std (10.63%) → HIGH spectral change rate — dynamic audio like speech\n- MFCC Delta 2 std ( 6.14%) → HIGH acceleration — rapidly changing audio, speech-like\n- Silence ratio ( 5.92%) → 56% silence — mix of speech and pauses\n- Spectral centroid std ( 4.27%) → HIGH variation — shifting frequency center\n- Mel mean ( 3.50%) → MODERATE energy — normal speech level\n\nnova-vad/\n\n├── data/\n\n│ ├── speech/ # raw speech files\n\n│ ├── noise/ # raw noise files\n\n│ ├── clean_speech/ # denoised speech\n\n│ └── clean_noise/ # denoised noise\n\n├── src/\n\n│ ├── denoiser.py # noise reduction pipeline\n\n│ ├── vad.py # WebRTC VAD baseline\n\n│ ├── classifier.py # NOVA-VAD 150+ features + ensemble\n\n│ ├── explainer.py # explainability layer\n\n│ ├── benchmark.py # head-to-head comparison\n\n│ └── pipeline.py # end-to-end runner\n\n├── models/ # saved trained models\n\n├── download_data.py # automated dataset 
downloader\n\n**Existing VADs fail in three ways:**\n\n- They break in noisy environments — WebRTC gets 58% on real-world noise\n- They are black boxes — no explanation of why a decision was made\n- They are too heavy for edge devices — Silero needs PyTorch (200MB+)\n\nNOVA-VAD solves all three simultaneously. No existing open-source tool does this.\n\n- Denoiser pipeline\n- WebRTC VAD baseline\n- 150+ feature MFCC classifier\n- Ensemble model (RF + GBT)\n- Explainability layer\n- Benchmark vs Silero, Pyannote, WebRTC\n- Real-time streaming audio support\n- pip install nova-vad packaging\n- Research paper\n\n**Monish**\n\nMIT License — free to use, modify, and distribute.", "url": "https://wpnews.pro/news/i-built-an-open-source-vad-that-beats-silero-pyannote-and-webrtc", "canonical_source": "https://github.com/monishmal3375/nova-vad", "published_at": "2026-06-24 03:06:22+00:00", "updated_at": "2026-06-24 03:14:10.099817+00:00", "lang": "en", "topics": ["machine-learning", "ai-tools"], "entities": ["Monish", "NOVA-VAD", "Silero", "Pyannote", "WebRTC", "UrbanSound8K"], "alternates": {"html": "https://wpnews.pro/news/i-built-an-open-source-vad-that-beats-silero-pyannote-and-webrtc", "markdown": "https://wpnews.pro/news/i-built-an-open-source-vad-that-beats-silero-pyannote-and-webrtc.md", "text": "https://wpnews.pro/news/i-built-an-open-source-vad-that-beats-silero-pyannote-and-webrtc.txt", "jsonld": "https://wpnews.pro/news/i-built-an-open-source-vad-that-beats-silero-pyannote-and-webrtc.jsonld"}}