MoodStream

MoodStream, an open-source real-time emotion recognition module, has been released to enable machines to detect human facial expressions during interactions. The system processes video frames through a pipeline that captures, detects, and classifies emotions into six categories, designed to run on resource-constrained hardware like robots or laptops. The tool aims to improve human-AI communication by providing machines with non-verbal emotional cues that words alone can conceal.

Introducing MoodStream

The real-time emotion recognition module for human-AI interaction

Discover MoodStream here: MoodStream Repository on GitHub https://github.com/EnricoZanetti/moodstream

Important: MoodStream was designed and built for the sole purpose of benefiting humans: not for surveillance, profiling, or behavioral control, nor for replacing people’s roles. Emotion recognition is a powerful and sensitive technology, and responsible use means consent, transparency, and purpose limitation.

Most people view language as the primary means of communication. It’s undoubtedly the most controlled and widely used form, yet we often underestimate the power of non-verbal communication. All of us constantly rely on it, yet we’re not as skilled at conveying it as we are with words. Examples of non-verbal communication are facial expressions, gestures, body posture, eye contact, proxemics, tone of voice, and micro-expressions: we send all of this simultaneously, mostly without thinking about it.

Now think about how we communicate with machines. Machines are everywhere around us, on our desks and in our pockets, and they are becoming ever more naturally integrated into our lives. There’s going to be a day, and I don’t think it’s far off, when robots, humanoid or otherwise, are part of our everyday lives. They’ll help with household chores, do small repairs, and maybe even become someone we can talk to without feeling judged, as many people are already doing with AI chatbots (https://journals.sagepub.com/doi/10.1177/20552076251351088).

Whatever the task, you’ll have to interact with these machines. Their brains will most likely be powered by LLMs that receive your speech as transcribed text. However, your future robot won’t know if you’re angry, sad, anxious, or just exhausted, because all it gets is the words you chose to say, words you can carefully select to hide what you actually feel.
The robot will be incredibly easy to fool, and as a result, it won’t be able to adapt its behavior, tone, or responses to what you actually need.

MoodStream is built to close that gap. It’s an open-source module that classifies your facial expression in real time and streams your emotional state through a pipeline so other systems can use it. Your robot, application, dashboard, or research tool gets a more honest version of you: not just what you’re saying, but what your face says about your emotional state while you say it. The detected emotion can be visualized on a live dashboard, and the data is stored in a database for later analysis. MoodStream is designed to run on resource-constrained hardware, like a tiny camera mounted on a robot, but it also works on your laptop using a regular webcam.

How it works

The system is built as a pipeline of small, independent stages, each doing one job and passing its output to the next. Here’s the end-to-end journey of a single video frame:

- Capture. A frame is acquired from a video source: a webcam via OpenCV (https://github.com/opencv/opencv/tree/master), an embedded camera module (OpenMV Cam H7+) connected over UART, or a synthetic source for testing. The active source is selected at startup via CLI flags.
- Face detection. The frame is converted to grayscale and passed to a Haar Cascade classifier, which locates one or more faces and returns their bounding boxes. If no face is found, the loop simply advances to the next frame.
- Cropping & preprocessing. Each detected face is cropped out, resized to 48×48 pixels, and normalized to [0, 1] as float32. Because the input is already grayscale at this stage, the final reshape just adds the batch and channel dimensions expected by the model: (1, 48, 48, 1).
- Emotion classification. The preprocessed face is fed into a quantized TFLite CNN, which outputs a softmax probability distribution over 6 emotion classes (happy, sad, angry, neutral, surprised, fearful). The top class and its confidence score are returned; the rest of the distribution is not passed downstream.
- Publishing. The emotion label is published as an MQTT message to a Mosquitto broker. From there, Node-RED picks it up, enriches it with a timestamp and emoji, and fans it out to two destinations: InfluxDB for time-series persistence, and Grafana for live visualization.

The broker, Node-RED, InfluxDB, and Grafana all run as Docker containers, so the only dependency you need on the host is Python. A single docker compose up brings the entire stack online.

Moreover, because each stage is decoupled, you can swap any of them without rewriting the rest. For example, if you want a different face detector, you can plug it in; if you want to send the output to a robot’s behavior controller instead of a dashboard, you can add another consumer at the publishing stage.

Model and training data

The current model is a lightweight convolutional neural network trained on FER2013 (https://www.kaggle.com/datasets/msambare/fer2013), a public dataset of around 35,000 grayscale 48×48 face images. The Disgust class from the original dataset is excluded, leaving six classes: anger, fear, happiness, sadness, surprise, and neutral. The model is exported to TensorFlow Lite (float16 quantized), which keeps inference fast and the memory footprint small: fast enough to run in real time on a CPU, without needing a GPU.

On a held-out validation set it reaches an overall accuracy of 62.9%. To put that in context: FER2013 is a notoriously noisy benchmark, and human annotators themselves only reach ~65% on it (https://cs230.stanford.edu/projects_winter_2020/reports/32610274.pdf), because many images are ambiguous or mislabeled. That puts MoodStream’s model close to human-level performance on this benchmark. Despite these limitations, the model is good enough to be useful in practice: try it yourself.
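To make the classification and publishing stages from the pipeline walkthrough more concrete, here is a minimal sketch in plain Python. It is not MoodStream’s actual code: the label order, the payload fields, the emoji mapping, and the function names (top_emotion, build_payload, enrich, fan_out) are all assumptions made for illustration. Plain Python callables stand in for the MQTT broker and the Node-RED flow, so the sketch runs without any external service.

```python
import json
import time

# Assumed label order, for illustration only; the real order is fixed by the
# trained model's output layer and may differ.
LABELS = ["angry", "fearful", "happy", "neutral", "sad", "surprised"]

# Hypothetical emoji mapping standing in for the Node-RED enrichment step.
EMOJI = {
    "angry": "\N{ANGRY FACE}",
    "fearful": "\N{FEARFUL FACE}",
    "happy": "\N{GRINNING FACE}",
    "neutral": "\N{NEUTRAL FACE}",
    "sad": "\N{CRYING FACE}",
    "surprised": "\N{ASTONISHED FACE}",
}


def top_emotion(probabilities):
    """Return (label, confidence) for the most probable class."""
    if len(probabilities) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} probabilities")
    best = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return LABELS[best], float(probabilities[best])


def build_payload(label, confidence):
    """Serialize only the top class; the full distribution is dropped."""
    return json.dumps({"emotion": label, "confidence": round(confidence, 4)})


def enrich(message, now=None):
    """Add a timestamp and emoji, mimicking the downstream enrichment."""
    enriched = dict(message)
    enriched["timestamp"] = time.time() if now is None else now
    enriched["emoji"] = EMOJI[message["emotion"]]
    return enriched


def fan_out(message, consumers):
    """Deliver one message to every consumer, independent of the others."""
    for consumer in consumers:
        consumer(message)


if __name__ == "__main__":
    # Made-up softmax output for a single face crop.
    label, confidence = top_emotion([0.05, 0.02, 0.71, 0.15, 0.04, 0.03])
    message = enrich(json.loads(build_payload(label, confidence)))
    dashboard, robot_controller = [], []  # stand-ins for real consumers
    fan_out(message, [dashboard.append, robot_controller.append])
    print(json.dumps(message))
```

Modeling each consumer as a callable mirrors the decoupling described above: adding a robot behavior controller is just one more entry in the consumer list, and in a real deployment it would be one more subscriber on the MQTT topic.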
Hardware versatility

MoodStream runs on any machine with a camera: a laptop’s built-in webcam and an external USB camera both work out of the box through OpenCV’s VideoCapture. The only real hardware requirement is a CPU; no GPU is needed. TFLite keeps the model small enough that inference runs in real time on commodity hardware, and the preprocessing step resizes the face crop to the fixed 48×48 input regardless of the camera’s native resolution, so frame size doesn’t matter. A separate embedded path exists for the OpenMV Cam H7+, which runs its own on-device firmware and streams results over UART rather than running the Python pipeline.

This matters because a solution that only works on a powerful workstation isn’t going to end up inside an assistive robot, a classroom tool, or a low-cost monitoring device in a care facility.

Where MoodStream can help today

While building MoodStream, I had a few concrete scenarios in mind where this kind of system can provide real value to people.

- Education. Helping children identify and name their own emotions, and recognize them in others, is one of the foundational skills of emotional intelligence. MoodStream could power educational tools that turn this into a visual, interactive experience: a child makes a face, the system names the emotion, and the child learns to associate feeling with label.
- Healthcare and mental health. Early detection of distress signals can be life-saving. People struggling with depression, isolation, or suicidal ideation often hide what they’re feeling, especially in words. A non-intrusive monitoring system in a care facility, therapeutic setting, or assisted-living environment could flag emotional deterioration before it becomes a crisis.

Roadmap

MoodStream could evolve into something more ambitious, and during development I thought about some potential improvements.

- Better training dataset. Migrating the training data to AffectNet (https://mohammadmahoor.com/pages/databases/affectnet/), a much larger dataset of around 400,000 images collected in real-world conditions, should significantly improve accuracy and reduce the gap between training conditions and actual webcam use.
- Beyond faces. Faces are just one non-verbal channel. The next step is body language: pose estimation and body segmentation to capture posture, gestures, and stance. This would bring MoodStream much closer to a complete non-verbal understanding layer.
- An ethical layer. A built-in module that inspects every input and output of a human–machine interaction, flagging potential misuse and reporting it. This layer could be designed to be reusable in other AI systems operating in sensitive environments, not just MoodStream.
- Mood history. Storing emotional states over time turns the data from a stream into a signal. Patterns could emerge that give people genuinely useful insight into their own well-being.

Try it

MoodStream is open source and designed to be easy to spin up. If you’re reading this from a laptop with a webcam, you can have the live dashboard running in a few minutes. Clone the repo, run docker compose up to spin up the stack, launch the Python pipeline, point the camera at your face, and see what it sees.

I’d love feedback and contributions: let me know if it sparked any reflections or ideas.