{"slug": "inside-hoovik-building-a-real-time-multimodal-emotion-ai-pipeline", "title": "Inside Hoovik: Building a Real-Time Multimodal Emotion AI Pipeline", "summary": "Engineering challenges and architecture behind Hoovik, a real-time multimodal emotion AI pipeline for video conferencing. The system processes live audio and video streams using a distributed multi-cloud topology, employing dedicated executor pools for feature extraction (Wav2Vec2 for audio, MediaPipe for facial landmarks) and a serialized inference pipeline with a custom transformer and XGBoost model. To handle unstable meeting conditions, the backend automatically switches between audio-only, video-only, or full multimodal execution profiles if a stream drops for more than 0.4 seconds, and the model was trained in three phases with modality dropout to ensure robustness.", "body_md": "When I started building Hoovik — a distributed video conferencing platform — I expected WebRTC signaling and transcription pipelines to be the hardest problems.\nThey weren’t.\nThe real engineering challenge was building a production-ready real-time multimodal emotion inference engine capable of processing live video meetings under strict latency constraints.\nUnlike offline ML systems, live meeting environments are unstable by default:\nAnd unlike research notebooks, production systems need to survive all of it without freezing the application loop.\nThis article breaks down how I designed Hoovik’s emotion recognition backend using:\nThe production configuration currently operates with:\nseq_len = 10\naudio_dim = 1024\nface_dim = 326\nd_model = 256\nnhead = 8\n3 encoder layers\nTo separate lightweight websocket orchestration from heavy ML workloads, Hoovik runs on a distributed multi-cloud topology.\nHandles:\nHandles:\nThis separation keeps signaling latency stable while allowing compute services to scale independently.\nThe emotion_service\nruns as an asynchronous FastAPI + Socket.IO server.\nA naive implementation 
quickly breaks under load because:\nInstead of running everything inside one async path, I isolated the workload into dedicated executor pools.\n+---------------------------------------+\n| Per-Participant Sockets |\n+---------------------------------------+\n/ \\\n[audio_chunk (Float32 PCM)] [emotion.frame (JPEG)]\n/ \\\nv v\n+----------------------+ +-----------------------+\n| _audio_executor (2T) | | _face_executor (2T) |\n| - Wav2Vec2 Features | | - MediaPipe Tracking |\n+----------------------+ +-----------------------+\n\\ /\n\\ /\n---> Normalization --->\n|\nv\n+---------------------------+\n| _inference_executor (1T) |\n| - Transformer |\n| - XGBoost |\n| - Isolation Forest |\n+---------------------------+\n|\nv\nemotion.result\n_audio_executor\nHandles:\n_face_executor\nHandles:\n_inference_executor\nRuns:\nInference is intentionally serialized to guarantee thread safety without adding explicit locking overhead around model state.\nReal meetings are noisy and inconsistent.\nUsers:\nIf one stream disappears for more than 0.4 seconds\n, the backend automatically switches execution profiles:\nboth\naudio_only\nvideo_only\nThis prevents runtime crashes and preserves stable predictions during degraded sessions.\nThe engine combines synchronized audio and facial features into aligned temporal embeddings.\nIncoming audio chunks are:\naudeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\nThe system extracts:\nusing centered 0.6-second\nwindows.\nMediaPipe extracts 326\nfacial features per frame:\nAll landmarks are normalized relative to inter-ocular distance so the model remains invariant to camera proximity.\nThe backend stores rolling temporal sequences of:\nseq_len = 10\nallowing the model to track facial motion across time.\nThe primary deep learning model is a custom multimodal transformer architecture using:\nd_model = 256\nnhead = 8\nencoder_layers = 3\nThe network learns bidirectional attention:\nA learned cross_gate\ndynamically balances voice tone 
against facial motion.\nTraining multimodal systems becomes unstable when modalities disappear randomly.\nTo solve this, I trained the network in 3 phases.\nEpochs: 1–15\nThe model trains only on complete:\npairs.\nThis forces cross-attention layers to learn joint representations first.\nEpochs: 16–55\nThe training pipeline introduces:\nThis teaches the model to preserve embeddings even when streams disappear entirely.\nEpochs: 56–90\nFinal training uses:\nto smooth final weights and improve generalization.\nTraining currently uses:\nbatch_size = 64\nmixup_alpha = 0.166\nmodality_drop_prob = 0.068\nlabel_smoothing = 0.077\ngrad_clip = 1.0\nDeep networks are excellent at latent representation learning.\nBut tree ensembles remain extremely effective at learning explicit statistical boundaries.\nTo complement the transformer, the backend engineers an 8,149-dimensional\nfeature vector every inference cycle.\nBefore training, features are compressed using:\nPCA → 512 dimensions\nThe XGBoost model handles missing modalities naturally using:\nmissing=np.nan\nwhich makes degraded stream inference extremely resilient.\nCurrent XGBoost configuration:\nn_estimators = 3150\nmax_depth = 5\nlearning_rate = 0.0308\ntree_method = hist\nTop predictive markers extracted from the engineered XGBoost feature space.\nWhile the strongest features currently remain PCA-compressed latent representations, the distribution clearly shows the ensemble learning stable statistical boundaries across temporal motion patterns.\nNormalized confusion matrix for the calibrated ensemble classifier.\nThe model performs strongest on:\nwhile softer affective states such as:\nshow heavier overlap due to lower facial motion intensity and acoustic ambiguity.\nThe strongest normalized recalls were:\nangry → angry = 0.81\nhappy → happy = 0.77\nneutral/calm → neutral/calm = 0.73\nInstead of averaging probabilities directly, Hoovik uses weighted ensemble calibration optimized using Optuna.\nFinal probability 
output:\nP_final = 0.455 × P_Transformer + 0.545 × P_XGBoost\nBoth models pass through:\nbefore fusion.\nThis significantly reduced overconfident predictions and improved calibration stability across degraded modalities.\nReal-world meeting environments are chaotic:\nWithout safeguards, models generate confident garbage predictions.\nTo solve this, every feature vector passes through dedicated Isolation Forest pipelines.\niso_both\niso_audio_only\niso_video_only\niso_global_fallback\nEach model is calibrated against:\nThe deployed thresholds currently operate at:\nboth = 0.0525\naudio_only = 0.0884\nvideo_only = -0.0264\nThe negative threshold for video_only\nreflects the inherently noisier variance distribution of isolated visual streams.\nVariance spreads across bimodal and unimodal execution paths.\nSamples falling below calibrated thresholds are flagged as anomalous.\nIf facial landmarks collapse due to:\nthe backend emits:\n{\n\"anomaly\": true\n}\nallowing the frontend to suppress unreliable emotional predictions.\nOne hidden production problem was frame burst overload.\nIf browsers uploaded frames continuously at high FPS, executor queues eventually exploded.\nTo prevent memory collapse, the backend continuously monitors queue depth.\nIf _face_executor\nexceeds:\nqueue_depth >= 3\nthe backend emits a websocket backpressure\nevent.\nThe frontend immediately reduces:\nuntil queues recover.\nThis dramatically stabilized long-running sessions under burst traffic.\nThe service exposes live telemetry endpoints:\nGET /stats\nGET /stats/json\ntracking:\nwithout blocking inference workers.\nEarly inference profiling and websocket load testing were performed locally on Apple Silicon hardware using Locust-based stress testing and concurrent Socket.IO session simulation.\nObserved runtime characteristics included:\naudio_only\n/ video_only\n)The calibrated ensemble currently achieves:\n74.34%\n73.84%\n78.68%\n74.25%\n66.03%\nFinal probability output:\nP_final = 0.455 × 
P_Transformer + 0.545 × P_XGBoost\nwith temperature calibration fixed at:\nT = 0.3\nTraining only on perfect multimodal samples causes catastrophic failure when streams disappear.\nProgressive modality degradation training was essential.\nTrying to parallelize model execution aggressively created:\nThread-isolated extraction + serialized inference proved significantly more stable.\nBackpressure protection, anomaly detection, and graceful degradation ended up being just as important as raw model accuracy.\nHoovik is fully open-source and actively looking for contributors around:\nGitHub: https://github.com/AnupamKumar-1/Hoovik\nIf you enjoy systems engineering, real-time ML infrastructure, or multimodal AI pipelines, contributions are welcome.", "url": "https://wpnews.pro/news/inside-hoovik-building-a-real-time-multimodal-emotion-ai-pipeline", "canonical_source": "https://dev.to/anupam_kumar/inside-hoovik-building-a-real-time-multimodal-emotion-ai-pipeline-5267", "published_at": "2026-05-20 07:05:38+00:00", "updated_at": "2026-05-20 07:34:03.186851+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "cloud-computing", "startups", "developer-tools"], "entities": ["Hoovik", "WebRTC", "FastAPI", "Socket.IO"], "alternates": {"html": "https://wpnews.pro/news/inside-hoovik-building-a-real-time-multimodal-emotion-ai-pipeline", "markdown": "https://wpnews.pro/news/inside-hoovik-building-a-real-time-multimodal-emotion-ai-pipeline.md", "text": "https://wpnews.pro/news/inside-hoovik-building-a-real-time-multimodal-emotion-ai-pipeline.txt", "jsonld": "https://wpnews.pro/news/inside-hoovik-building-a-real-time-multimodal-emotion-ai-pipeline.jsonld"}}