Inside Hoovik: Building a Real-Time Multimodal Emotion AI Pipeline

wpnews.pro

When I started building Hoovik, a distributed video conferencing platform, I expected WebRTC signaling and transcription pipelines to be the hardest problems. They weren't. The real engineering challenge was building a production-ready, real-time multimodal emotion inference engine capable of processing live video meetings under strict latency constraints.

Unlike offline ML systems, live meeting environments are unstable by default, and unlike research notebooks, production systems need to survive that instability without freezing the application loop. This article breaks down how I designed Hoovik's emotion recognition backend.

The production configuration currently operates with:

    seq_len = 10
    audio_dim = 1024
    face_dim = 326
    d_model = 256
    nhead = 8
    encoder_layers = 3

To separate lightweight websocket orchestration from heavy ML workloads, Hoovik runs on a distributed multi-cloud topology that splits signaling from compute. This separation keeps signaling latency stable while allowing compute services to scale independently.

The emotion_service runs as an asynchronous FastAPI + Socket.IO server. A naive implementation that runs everything inside one async path quickly breaks under load, so I isolated the workload into dedicated executor pools.

      +---------------------------------------+
      |        Per-Participant Sockets        |
      +---------------------------------------+
              /                       \
[audio_chunk (Float32 PCM)]   [emotion.frame (JPEG)]
              /                       \
             v                         v
+----------------------+    +-----------------------+
| _audio_executor (2T) |    | _face_executor (2T)   |
+----------------------+    +-----------------------+
             \                         /
              v                       v
            +---------------------------+
            | _inference_executor (1T)  |
            |   - Transformer           |
            |   - XGBoost               |
            |   - Isolation Forest      |
            +---------------------------+
                          |
                          v
                    emotion.result

_audio_executor (2 threads) handles the incoming audio chunks, _face_executor (2 threads) handles the incoming video frames, and _inference_executor (1 thread) runs the Transformer, XGBoost, and Isolation Forest models. Inference is intentionally serialized to guarantee thread safety without adding explicit locking overhead around model state.

Real meetings are noisy and inconsistent. If one stream disappears for more than 0.4 seconds, the backend automatically switches between three execution profiles:

    both
    audio_only
    video_only

This prevents runtime crashes and preserves stable predictions during degraded sessions.

The engine combines synchronized audio and facial features into aligned temporal embeddings. Incoming audio chunks are processed with audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim, and acoustic features are extracted using centered 0.6-second windows.

MediaPipe extracts 326 facial features per frame. All landmarks are normalized relative to inter-ocular distance so the model remains invariant to camera proximity. The backend stores rolling temporal sequences of seq_len = 10 steps, allowing the model to track facial motion across time.

The primary deep learning model is a custom multimodal transformer architecture using:

    d_model = 256
    nhead = 8
    encoder_layers = 3

The network learns bidirectional attention between the audio and facial streams. A learned cross_gate dynamically balances voice tone against facial motion.

Training multimodal systems becomes unstable when modalities disappear randomly. To solve this, I trained the network in 3 phases.

Phase 1 (epochs 1–15): the model trains only on complete pairs in which both modalities are present. This forces cross-attention layers to learn joint representations first.

Phase 2 (epochs 16–55): the training pipeline introduces degraded and missing modalities. This teaches the model to preserve embeddings even when streams disappear entirely.
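One common way to teach a model to survive disappearing streams is modality dropout, where a training sample occasionally has one stream blanked out. The sketch below illustrates that idea under stated assumptions and is not Hoovik's actual training code. The function names, the zero-filling of a dropped stream, and the choice to keep the probability constant from epoch 16 onward are all assumptions; only the phase boundary at epoch 16 and the value modality_drop_prob = 0.068 come from the article.

```python
import random

MODALITY_DROP_PROB = 0.068   # value quoted in the article's training config
FIRST_DEGRADED_EPOCH = 16    # epochs 1-15 train on complete pairs only


def drop_prob_for_epoch(epoch):
    """Return the drop probability for a 1-based epoch (assumed schedule)."""
    return 0.0 if epoch < FIRST_DEGRADED_EPOCH else MODALITY_DROP_PROB


def apply_modality_dropout(audio_seq, face_seq, drop_prob, rng=random):
    """Blank at most one modality of a sample with probability drop_prob.

    audio_seq and face_seq are lists of per-step feature lists.
    Returns (audio, face, profile), where profile is one of
    "both", "audio_only", or "video_only".
    """
    if rng.random() >= drop_prob:
        return audio_seq, face_seq, "both"
    # Drop exactly one stream so every sample keeps some signal.
    if rng.random() < 0.5:
        blank_audio = [[0.0] * len(step) for step in audio_seq]
        return blank_audio, face_seq, "video_only"
    blank_face = [[0.0] * len(step) for step in face_seq]
    return audio_seq, blank_face, "audio_only"
```

In a training loop, a call such as apply_modality_dropout(audio, face, drop_prob_for_epoch(epoch)) would run once per sample before batching, so the first 15 epochs see only complete pairs.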

Phase 3 (epochs 56–90): the final phase smooths the final weights and improves generalization.

Training currently uses:

    batch_size = 64
    mixup_alpha = 0.166
    modality_drop_prob = 0.068
    label_smoothing = 0.077
    grad_clip = 1.0

Deep networks are excellent at latent representation learning, but tree ensembles remain extremely effective at learning explicit statistical boundaries. To complement the transformer, the backend engineers an 8,149-dimensional feature vector every inference cycle. Before training, features are compressed with PCA to 512 dimensions. The XGBoost model handles missing modalities natively through missing=np.nan, which makes degraded stream inference extremely resilient.

Current XGBoost configuration:

    n_estimators = 3150
    max_depth = 5
    learning_rate = 0.0308
    tree_method = hist

Figure: Top predictive markers extracted from the engineered XGBoost feature space.

While the strongest features currently remain PCA-compressed latent representations, the distribution shows the ensemble learning stable statistical boundaries across temporal motion patterns.

Figure: Normalized confusion matrix for the calibrated ensemble classifier.

Softer affective states show heavier overlap due to lower facial motion intensity and acoustic ambiguity. The strongest normalized recalls were:

    angry → angry = 0.81
    happy → happy = 0.77
    neutral/calm → neutral/calm = 0.73

Instead of averaging probabilities directly, Hoovik uses weighted ensemble calibration, with the weights optimized using Optuna. Final probability output:

    P_final = 0.455 × P_Transformer + 0.545 × P_XGBoost

Both models' outputs are calibrated before fusion. This significantly reduced overconfident predictions and improved calibration stability across degraded modalities.

Real-world meeting environments are chaotic, and without safeguards, models generate confident garbage predictions. To solve this, every feature vector passes through dedicated Isolation Forest pipelines.
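A minimal sketch of how a profile-aware anomaly gate of this kind could look. It assumes each Isolation Forest yields a score where lower means more anomalous (the convention used by scikit-learn's decision_function) and that a sample is flagged when its score falls below the threshold for the active execution profile. The threshold values are the deployed ones reported in this article. The function names and the "emotion" and "profile" payload fields are hypothetical; only the "anomaly" flag appears in the article.

```python
# Deployed thresholds as reported in the article, keyed by execution profile.
ANOMALY_THRESHOLDS = {
    "both": 0.0525,
    "audio_only": 0.0884,
    "video_only": -0.0264,
}


def is_anomalous(score, profile):
    """Flag a sample whose anomaly score falls below its profile's threshold.

    Assumes lower scores mean "more anomalous". Raises KeyError for an
    unknown profile.
    """
    return score < ANOMALY_THRESHOLDS[profile]


def build_result(prediction, score, profile):
    """Attach the anomaly flag so a client can hide unreliable predictions."""
    return {
        "emotion": prediction,   # hypothetical field name
        "profile": profile,      # hypothetical field name
        "anomaly": is_anomalous(score, profile),
    }
```

Keeping one threshold per profile means the same raw score can be acceptable for a video-only sample yet flagged for a bimodal one, which matches the different variance of the three paths.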
Four Isolation Forest models are deployed:

    iso_both
    iso_audio_only
    iso_video_only
    iso_global_fallback

Each model is calibrated separately. The deployed thresholds currently operate at:

    both = 0.0525
    audio_only = 0.0884
    video_only = -0.0264

The negative threshold for video_only reflects the inherently noisier variance distribution of isolated visual streams.

Figure: Variance spreads across bimodal and unimodal execution paths. Samples falling below calibrated thresholds are flagged as anomalous.

If facial landmarks collapse, the backend emits:

{
"anomaly": true
}

allowing the frontend to suppress unreliable emotional predictions.

One hidden production problem was frame burst overload. If browsers uploaded frames continuously at high FPS, executor queues eventually grew without bound. To prevent memory collapse, the backend continuously monitors queue depth. If the _face_executor queue reaches:

    queue_depth >= 3

the backend emits a websocket backpressure event. The frontend immediately throttles its frame uploads until queues recover. This dramatically stabilized long-running sessions under burst traffic.

The service exposes live telemetry endpoints:

    GET /stats
    GET /stats/json

These report runtime metrics without blocking inference workers.

Early inference profiling and websocket load testing were performed locally on Apple Silicon hardware using Locust-based stress testing and concurrent Socket.IO session simulation. Observed runtime characteristics included the degraded execution profiles (audio_only / video_only).

The calibrated ensemble currently achieves:

    74.34%
    73.84%
    78.68%
    74.25%
    66.03%

Final probability output:

    P_final = 0.455 × P_Transformer + 0.545 × P_XGBoost

with temperature calibration fixed at T = 0.3.

Training only on perfect multimodal samples causes catastrophic failure when streams disappear. Progressive modality degradation training was essential.

Trying to parallelize model execution aggressively proved unstable. Thread-isolated extraction combined with serialized inference proved significantly more stable.

Backpressure protection, anomaly detection, and graceful degradation ended up being just as important as raw model accuracy.

Hoovik is fully open-source and actively looking for contributors.

GitHub: https://github.com/AnupamKumar-1/Hoovik
If you enjoy systems engineering, real-time ML infrastructure, or multimodal AI pipelines, contributions are welcome.
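For readers who want to experiment with the fusion rule quoted in this article, here is a minimal sketch of the weighted average on its own. The weights are the ones reported above. The function name and the three-class example vectors are hypothetical, and the calibration step applied before fusion is deliberately left out because its exact form is not described.

```python
W_TRANSFORMER = 0.455  # ensemble weights quoted in the article
W_XGBOOST = 0.545


def fuse(p_transformer, p_xgboost):
    """Weighted average of two class-probability vectors of equal length."""
    if len(p_transformer) != len(p_xgboost):
        raise ValueError("probability vectors must have the same length")
    return [
        W_TRANSFORMER * a + W_XGBOOST * b
        for a, b in zip(p_transformer, p_xgboost)
    ]


# Hypothetical three-class outputs (angry, happy, neutral/calm).
p_final = fuse([0.70, 0.20, 0.10], [0.50, 0.30, 0.20])
```

Because the two weights sum to 1, the fused vector remains a probability distribution whenever both inputs are.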

source & further reading

dev.to: original article
