Inside Hoovik: Building a Real-Time Multimodal Emotion AI Pipeline

Engineering challenges and architecture behind Hoovik, a real-time multimodal emotion AI pipeline for video conferencing. The system processes live audio and video streams using a distributed multi-cloud topology, employing dedicated executor pools for feature extraction (Wav2Vec2 for audio, MediaPipe for facial landmarks) and a serialized inference pipeline with a custom transformer and XGBoost model. To handle unstable meeting conditions, the backend automatically switches between audio-only, video-only, or full multimodal execution profiles if a stream drops for more than 0.4 seconds, and the model was trained in three phases with modality dropout to ensure robustness.

When I started building Hoovik, a distributed video conferencing platform, I expected WebRTC signaling and transcription pipelines to be the hardest problems. They weren’t.

The real engineering challenge was building a production-ready real-time multimodal emotion inference engine capable of processing live video meetings under strict latency constraints.

Unlike offline ML systems, live meeting environments are unstable by default. And unlike research notebooks, production systems need to survive all of it without freezing the application loop.

This article breaks down how I designed Hoovik’s emotion recognition backend.

The production configuration currently operates with:

seq_len = 10
audio_dim = 1024
face_dim = 326
d_model = 256
nhead = 8
encoder layers = 3

To separate lightweight websocket orchestration from heavy ML workloads, Hoovik runs on a distributed multi-cloud topology. This separation keeps signaling latency stable while allowing compute services to scale independently.

The emotion service runs as an asynchronous FastAPI + Socket.IO server.
A naive implementation quickly breaks under load. Instead of running everything inside one async path, I isolated the workload into dedicated executor pools.

            +---------------------------------------+
            |        Per-Participant Sockets        |
            +---------------------------------------+
                   /                        \
   audio chunk Float32 PCM          emotion.frame JPEG
                 /                            \
                v                              v
   +----------------------+        +-----------------------+
   | audio executor 2T    |        | face executor 2T      |
   | - Wav2Vec2 Features  |        | - MediaPipe Tracking  |
   +----------------------+        +-----------------------+
                 \                            /
                  \                          /
                   --- Normalization ---
                              |
                              v
               +---------------------------+
               | inference executor 1T     |
               | - Transformer             |
               | - XGBoost                 |
               | - Isolation Forest        |
               +---------------------------+
                              |
                              v
                       emotion.result

The audio executor (2 threads) handles Wav2Vec2 feature extraction, the face executor (2 threads) handles MediaPipe tracking, and the inference executor (1 thread) runs the transformer, XGBoost, and Isolation Forest models.

Inference is intentionally serialized to guarantee thread safety without adding explicit locking overhead around model state.

Real meetings are noisy and inconsistent. If one stream disappears for more than 0.4 seconds, the backend automatically switches execution profiles:

both
audio_only
video_only

This prevents runtime crashes and preserves stable predictions during degraded sessions.

The engine combines synchronized audio and facial features into aligned temporal embeddings.

Incoming audio chunks are processed with the audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim model. The system extracts features using centered 0.6-second windows.

MediaPipe extracts 326 facial features per frame. All landmarks are normalized relative to inter-ocular distance so the model remains invariant to camera proximity.

The backend stores rolling temporal sequences of seq_len = 10, allowing the model to track facial motion across time.

The primary deep learning model is a custom multimodal transformer architecture using:

d_model = 256
nhead = 8
encoder layers = 3

The network learns bidirectional attention between the two modalities. A learned cross gate dynamically balances voice tone against facial motion.
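The profile switching described above can be sketched in a few lines. This is a minimal illustration, not Hoovik's actual code: the `StreamClock` class, its method names, and the choice to return `None` when no stream is alive are all assumptions made for the example. Only the 0.4-second timeout and the three profile names come from the article.

```python
from dataclasses import dataclass
from typing import Optional

# From the article: a stream missing for more than 0.4 s counts as dropped.
STREAM_TIMEOUT_S = 0.4


@dataclass
class StreamClock:
    """Hypothetical helper that remembers when each modality last delivered data."""

    last_audio: Optional[float] = None
    last_video: Optional[float] = None

    def mark_audio(self, now: float) -> None:
        self.last_audio = now

    def mark_video(self, now: float) -> None:
        self.last_video = now

    @staticmethod
    def _alive(last_seen: Optional[float], now: float) -> bool:
        # A stream is alive if it has been seen and is not older than the timeout.
        return last_seen is not None and (now - last_seen) <= STREAM_TIMEOUT_S

    def profile(self, now: float) -> Optional[str]:
        """Pick the execution profile for the current inference cycle."""
        audio_ok = self._alive(self.last_audio, now)
        video_ok = self._alive(self.last_video, now)
        if audio_ok and video_ok:
            return "both"
        if audio_ok:
            return "audio_only"
        if video_ok:
            return "video_only"
        # Assumption: with no live stream there is nothing to infer on.
        return None


clock = StreamClock()
clock.mark_audio(10.0)
clock.mark_video(10.0)
profile_full = clock.profile(10.1)       # both streams are fresh

clock.mark_audio(10.5)                   # audio keeps arriving, video has stalled
profile_degraded = clock.profile(10.5)   # video was last seen 0.5 s ago
```

In a live server the `now` argument would typically come from `time.monotonic()`, so that wall-clock adjustments cannot make a healthy stream look stale.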
Training multimodal systems becomes unstable when modalities disappear randomly. To solve this, I trained the network in 3 phases.

Epochs 1–15: The model trains only on complete multimodal pairs. This forces cross-attention layers to learn joint representations first.

Epochs 16–55: The training pipeline introduces modality dropout. This teaches the model to preserve embeddings even when streams disappear entirely.

Epochs 56–90: The final phase smooths the final weights and improves generalization.

Training currently uses:

batch_size = 64
mixup_alpha = 0.166
modality_drop_prob = 0.068
label_smoothing = 0.077
grad_clip = 1.0

Deep networks are excellent at latent representation learning. But tree ensembles remain extremely effective at learning explicit statistical boundaries.

To complement the transformer, the backend engineers an 8,149-dimensional feature vector every inference cycle. Before training, features are compressed using PCA to 512 dimensions.

The XGBoost model handles missing modalities natively through missing=np.nan, which makes degraded-stream inference resilient.

Current XGBoost configuration:

n_estimators = 3150
max_depth = 5
learning_rate = 0.0308
tree_method = hist

Figure: Top predictive markers extracted from the engineered XGBoost feature space.

While the strongest features currently remain PCA-compressed latent representations, the distribution clearly shows the ensemble learning stable statistical boundaries across temporal motion patterns.

Figure: Normalized confusion matrix for the calibrated ensemble classifier.

Softer affective states show heavier overlap due to lower facial motion intensity and acoustic ambiguity. The strongest normalized recalls were:

angry → angry = 0.81
happy → happy = 0.77
neutral/calm → neutral/calm = 0.73

Instead of averaging probabilities directly, Hoovik uses weighted ensemble calibration optimized with Optuna.
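As a rough illustration of a weighted, temperature-calibrated blend, the sketch below uses the weights (0.455 and 0.545) and the temperature (T = 0.3) reported in this article. Everything else is an assumption: the function names are hypothetical, and the article does not say how the temperature is applied, so the sketch assumes the common form softmax(log p / T) on each model's probability vector. Under that form a temperature below 1 sharpens the distribution, so the real calibration step may be defined differently.

```python
import math
from typing import List, Sequence

# Values reported in the article.
W_TRANSFORMER = 0.455
W_XGBOOST = 0.545
TEMPERATURE = 0.3


def temperature_scale(probs: Sequence[float], temperature: float) -> List[float]:
    """Rescale a probability vector as softmax(log(p) / T).

    Assumption: this is one common way to apply a temperature to
    probabilities; the article does not specify the exact form.
    """
    eps = 1e-12  # avoid log(0)
    scaled = [math.log(max(p, eps)) / temperature for p in probs]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(
    p_transformer: Sequence[float],
    p_xgboost: Sequence[float],
    temperature: float = TEMPERATURE,
) -> List[float]:
    """Calibrate both models' probabilities, then blend them with fixed weights."""
    a = temperature_scale(p_transformer, temperature)
    b = temperature_scale(p_xgboost, temperature)
    return [W_TRANSFORMER * x + W_XGBOOST * y for x, y in zip(a, b)]


# Hypothetical three-class outputs from the two models.
fused = fuse([0.6, 0.3, 0.1], [0.5, 0.4, 0.1])
```

Because the two weights sum to 1 and each calibrated vector sums to 1, the fused output is itself a valid probability distribution and needs no renormalization.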
Final probability output:

P_final = 0.455 × P_Transformer + 0.545 × P_XGBoost

Both models pass through temperature calibration before fusion. This significantly reduced overconfident predictions and improved calibration stability across degraded modalities.

Real-world meeting environments are chaotic. Without safeguards, models generate confident garbage predictions.

To solve this, every feature vector passes through dedicated Isolation Forest pipelines:

iso_both
iso_audio_only
iso_video_only
iso_global_fallback

Each model has its own calibrated threshold. The deployed thresholds currently operate at:

both = 0.0525
audio_only = 0.0884
video_only = -0.0264

The negative threshold for video_only reflects the inherently noisier variance distribution of isolated visual streams.

Figure: Variance spreads across bimodal and unimodal execution paths.

Samples falling below calibrated thresholds are flagged as anomalous. If facial landmarks collapse, the backend emits { "anomaly": true }, allowing the frontend to suppress unreliable emotional predictions.

One hidden production problem was frame burst overload. If browsers uploaded frames continuously at high FPS, executor queues eventually exploded.

To prevent memory collapse, the backend continuously monitors queue depth. If the face executor's queue depth exceeds 3, the backend emits a websocket backpressure event. The frontend immediately throttles its uploads until queues recover.

This stabilized long-running sessions under burst traffic.

The service exposes live telemetry endpoints, GET /stats and GET /stats/json, which report runtime metrics without blocking inference workers.

Early inference profiling and websocket load testing were performed locally on Apple Silicon hardware using Locust-based stress testing and concurrent Socket.IO session simulation.
Observed runtime characteristics covered the audio only / video only profiles.

The calibrated ensemble currently achieves:

74.34%
73.84%
78.68%
74.25%
66.03%

Final probability output:

P_final = 0.455 × P_Transformer + 0.545 × P_XGBoost

with temperature calibration fixed at T = 0.3.

Training only on perfect multimodal samples causes catastrophic failure when streams disappear. Progressive modality degradation training was essential.

Trying to parallelize model execution aggressively created instability. Thread-isolated extraction + serialized inference proved significantly more stable.

Backpressure protection, anomaly detection, and graceful degradation ended up being just as important as raw model accuracy.

Hoovik is fully open-source and actively looking for contributors.

GitHub: https://github.com/AnupamKumar-1/Hoovik

If you enjoy systems engineering, real-time ML infrastructure, or multimodal AI pipelines, contributions are welcome.