Hi everyone,
I’ve been building AlphaAvatar, an open-source and self-hostable realtime full-multimodal personal AI assistant runtime.
The idea behind AlphaAvatar is simple: I don’t think personal AI assistants should remain stateless chatbots forever.
Most assistants today still work like this:
User asks something
↓
Assistant replies
↓
Session ends
↓
Most useful context is lost
AlphaAvatar is my attempt to explore what a persistent personal AI butler could look like — an assistant that can talk, see, remember, understand who it is interacting with, retrieve knowledge, call tools, manage tasks, and act across different channels over time.
Below is the current high-level architecture:
AlphaAvatar combines voice, text, visual input, face identity, speaker detection, memory, persona, MCP tools, RAG, DeepResearch, status feedback, model orchestration, and channel integrations into one assistant runtime.
The goal is not just to build another chat UI.
The goal is to build a runtime layer for long-term personal assistance.
At a high level, AlphaAvatar includes:
Interaction layer for realtime voice, text, camera input, and external channels
Core runtime / agent layer for session state, context management, and orchestration
Memory + Persona layer for persistent user context and identity-aware interaction
Tool / knowledge layer with MCP, RAG, and DeepResearch
Model layer for OpenAI-compatible LLMs, multimodal models, STT, TTS, speaker detection, and face recognition
Storage / data layer for self-hosted memory, documents, vector storage, and tool APIs
Output layer for realtime voice, text, avatar responses, tool actions, and status updates
One important direction of AlphaAvatar is that “multimodal” should not only mean accepting voice, text, and camera input.
The goal is to make the entire assistant runtime fully multimodal.
That means multimodal context should flow through the core modules of the system:
Memory should be able to learn from text, voice, visual frames, face identity, speaker identity, user actions, tool results, and recurring routines.
Persona should understand the user not only from written preferences, but also from interaction style, voice behavior, identity signals, and multimodal context.
MCP tools should be selected and called based on the full runtime context, not only the latest text prompt.
RAG / DeepResearch should work with documents, user context, tool results, and future visual/event memories.
Status feedback should expose what the assistant is doing across modalities, especially during long-running tool, retrieval, or research workflows.
Channel plugins should allow the same assistant runtime to work across voice, web, avatar UI, WhatsApp, Discord, and future channels.
So the long-term goal is not simply:
text + voice + camera → chatbot
but rather:
text + voice + vision + identity + memory + persona + tools + channels
↓
full-multimodal personal assistant runtime
This is why AlphaAvatar treats Memory, Persona, MCP, RAG, DeepResearch, Status, Voice, Avatar, and Channel integrations as composable runtime plugins.
Each plugin should eventually be able to consume, produce, or update multimodal context.
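To make "consume, produce, or update multimodal context" a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not AlphaAvatar's actual API: `MultimodalContext`, `ContextPlugin`, `IdentityPlugin`, `MemoryPlugin`, and `run_plugins` are hypothetical names, and the in-memory dict stands in for a real memory backend.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Protocol


@dataclass
class MultimodalContext:
    """Hypothetical shared context; not AlphaAvatar's actual data model."""
    text: str = ""
    face_id: Optional[str] = None       # e.g. output of face recognition
    speaker_id: Optional[str] = None    # e.g. output of speaker detection
    memories: List[str] = field(default_factory=list)
    extras: Dict[str, Any] = field(default_factory=dict)


class ContextPlugin(Protocol):
    """Assumed plugin contract: take a context, return the updated context."""
    def update(self, ctx: MultimodalContext) -> MultimodalContext: ...


class IdentityPlugin:
    """Produces a user id, preferring face identity over speaker identity."""

    def update(self, ctx: MultimodalContext) -> MultimodalContext:
        ctx.extras["user"] = ctx.face_id or ctx.speaker_id or "unknown"
        return ctx


class MemoryPlugin:
    """Consumes the resolved user id and injects that user's stored memories."""

    def __init__(self, store: Dict[str, List[str]]) -> None:
        self.store = store  # toy stand-in for a real memory backend

    def update(self, ctx: MultimodalContext) -> MultimodalContext:
        user = ctx.extras.get("user", "unknown")
        ctx.memories.extend(self.store.get(user, []))
        return ctx


def run_plugins(ctx: MultimodalContext, plugins: List[ContextPlugin]) -> MultimodalContext:
    """Pass the context through each plugin in order."""
    for plugin in plugins:
        ctx = plugin.update(ctx)
    return ctx
```

In this sketch the plugin order matters: the identity step has to run before the memory step, because memory lookup depends on knowing who is interacting.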
A real personal assistant should not only answer questions.
It should be able to:
remember useful long-term context
understand the user’s preferences and routines
know who is currently interacting with it
work across voice, text, camera, and external channels
retrieve documents and knowledge when needed
call tools and external services
provide progress updates during long-running actions
gradually become more useful as it learns from past interactions
For the same reason, memory, persona, tools, and multimodal context are first-class runtime components in AlphaAvatar, rather than small add-ons around a chatbot.
One of the key design choices is to keep the system modular.
AlphaAvatar is organized around a realtime runtime powered by components such as AgentSession and AvatarEngine, while capabilities are added through plugins.
Current plugin directions include:
Memory Plugin — extracts, stores, retrieves, and injects long-term user context
Persona Plugin — tracks preferences, identity state, interaction style, and user-related context
MCP Plugin — provides a unified tool interface for external actions
RAG Plugin — connects the assistant to documents and knowledge bases
DeepResearch Plugin — supports longer research workflows
Status Plugin — exposes intermediate progress during long-running actions
Character / Avatar Plugin — supports avatar-style interaction
Channel Plugins — connect the assistant to external channels such as WhatsApp
This plugin-based architecture makes the system easier to extend. Adding a new channel, tool, model provider, memory backend, or avatar interface should not require rewriting the core assistant runtime.
AlphaAvatar is designed for realtime interaction, not only text-based chat.
The current direction includes:
realtime voice interaction via LiveKit RTC
text interaction
sampled camera / visual input
face detection and recognition
speaker / voice target detection
avatar-style response UI
status-aware feedback during tool execution
For realtime assistants, silence during long-running tool calls feels unnatural.
So AlphaAvatar also includes a status-aware feedback loop. For example, when the assistant is retrieving memory, calling MCP tools, reading documents, or running a DeepResearch workflow, it can expose intermediate status updates instead of making the user wait without feedback.
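The status loop can be sketched with plain `asyncio`. This is a simplified illustration under my own assumptions, not the Status Plugin's real interface: `run_with_status`, `slow_tool`, and the `emit` callback are hypothetical, and a real runtime would route the updates to voice or UI instead of a list.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_status(
    task_name: str,
    work: Callable[[], Awaitable[str]],
    emit: Callable[[str], None],
    interval: float = 0.05,
) -> str:
    """Run `work` and emit a status line every `interval` seconds until it ends."""
    emit(f"{task_name}: started")
    task = asyncio.ensure_future(work())
    while True:
        # Wait for the task, but wake up after `interval` to report progress.
        done, _pending = await asyncio.wait({task}, timeout=interval)
        if done:
            break
        emit(f"{task_name}: still working")
    try:
        result = task.result()
    except Exception as exc:
        emit(f"{task_name}: failed ({exc})")
        raise
    emit(f"{task_name}: done")
    return result


async def slow_tool() -> str:
    """Stand-in for a long-running tool call (hypothetical)."""
    await asyncio.sleep(0.2)
    return "search results"


def demo() -> List[str]:
    updates: List[str] = []
    asyncio.run(run_with_status("web_search", slow_tool, updates.append))
    return updates
```

Calling `demo()` returns the list of status lines that a user would otherwise have experienced as silence.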
A major part of AlphaAvatar is the idea that memory should not just be a chat summary.
Memory should become part of the assistant’s operating context.
The Memory module is designed to extract useful long-term information from interactions and retrieve relevant context when needed.
The Persona module tracks user-related context such as:
preferences
identity state
interaction style
session-level persona information
temporary-user to real-user identity merging
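As an illustration of the last item, temporary-user to real-user merging could look roughly like the sketch below. The data model and the conflict policy (values already on the confirmed profile win) are assumptions made for the example, and `PersonaProfile` and `merge_temporary_into_real` are hypothetical names rather than the Persona Plugin's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class PersonaProfile:
    """Hypothetical persona record; not AlphaAvatar's actual schema."""
    user_id: str
    preferences: Dict[str, str] = field(default_factory=dict)
    aliases: Set[str] = field(default_factory=set)


def merge_temporary_into_real(temp: PersonaProfile, real: PersonaProfile) -> PersonaProfile:
    """Fold a temporary profile into a confirmed one.

    Assumed policy: on conflicting preference keys, the confirmed
    profile's value is kept; the temporary id is remembered as an alias.
    """
    merged_preferences = {**temp.preferences, **real.preferences}
    return PersonaProfile(
        user_id=real.user_id,
        preferences=merged_preferences,
        aliases=real.aliases | temp.aliases | {temp.user_id},
    )
```

Keeping the temporary id as an alias means anything recorded earlier under that id can still be traced back to the real user.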
The next step is to push this further into multimodal memory.
Instead of only extracting memory from text conversations, AlphaAvatar should be able to build structured memory from:
visual frames
voice signals
face identity
speaker identity
user actions
environment changes
tool execution history
recurring routines
The long-term direction is event-style multimodal memory: connecting faces, voices, objects, places, actions, documents, tools, and time into a more useful personal memory space.
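A rough sketch of what an event-style record might look like, compared with an isolated frame caption. This is a design illustration only: `MemoryEvent`, `EventMemory`, and their fields are hypothetical, and the linear scan stands in for whatever indexed or vector storage a real implementation would use.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class MemoryEvent:
    """Hypothetical event linking who, what, where, and when."""
    timestamp: datetime
    action: str
    person: Optional[str] = None       # e.g. from face or speaker identity
    place: Optional[str] = None
    objects: Tuple[str, ...] = ()
    source: str = "vision"             # which modality produced the event


class EventMemory:
    """Toy in-memory store with simple filtered lookup."""

    def __init__(self) -> None:
        self._events: List[MemoryEvent] = []

    def add(self, event: MemoryEvent) -> None:
        self._events.append(event)

    def query(
        self,
        person: Optional[str] = None,
        obj: Optional[str] = None,
        since: Optional[datetime] = None,
    ) -> List[MemoryEvent]:
        """Return matching events in chronological order."""
        matches = []
        for event in self._events:
            if person is not None and event.person != person:
                continue
            if obj is not None and obj not in event.objects:
                continue
            if since is not None and event.timestamp < since:
                continue
            matches.append(event)
        return sorted(matches, key=lambda e: e.timestamp)
```

Because each event carries person, place, objects, and time together, a question like "where did Alice last leave her keys?" becomes a filter over events rather than a search through unrelated captions.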
AlphaAvatar is designed to be self-hostable because personal assistants will eventually handle very sensitive data.
A real personal AI butler may know about your routines, documents, tasks, conversations, visual history, voice identity, face identity, preferences, and personal workflows.
That kind of data should not be locked inside a closed black-box service by default.
In AlphaAvatar, the persistent memory and storage layer can stay on the user’s own personal server, while model inference can run locally, on another private server, or through an optional OpenAI-compatible external model provider.
The model runtime and the personal data layer do not have to live on the same machine.
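To illustrate that split, here is a small configuration sketch. The field names, URLs, and the `placement` helper are all made up for the example and do not reflect AlphaAvatar's real configuration format; the point is only that the memory store and the model endpoint are configured independently.

```python
from dataclasses import dataclass
from typing import Dict
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


@dataclass(frozen=True)
class Deployment:
    """Hypothetical deployment config with two independent endpoints."""
    memory_store_url: str   # where persistent personal data lives
    model_base_url: str     # any OpenAI-compatible endpoint


def placement(deployment: Deployment) -> Dict[str, str]:
    """Report whether each component runs on this host or another one."""
    def where(url: str) -> str:
        host = urlparse(url).hostname
        return "same-host" if host in LOCAL_HOSTS else "other-host"

    return {
        "memory": where(deployment.memory_store_url),
        "model": where(deployment.model_base_url),
    }


# Example values are placeholders, not real AlphaAvatar defaults.
personal_server = Deployment(
    memory_store_url="postgresql://localhost:5432/memory",
    model_base_url="https://models.example.com/v1",
)
```

Here `placement(personal_server)` reports the memory store on the same host and the model on another host, which matches the setup where personal data stays on the user's own server and only inference is delegated.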
The next stage is pushing AlphaAvatar toward fuller multimodal support.
Some directions I’m working on:
deeper integration of visual input into Memory
expanding Persona with face / speaker / identity-aware context
improving realtime status feedback for long-running tool workflows
building event-style multimodal memory instead of isolated frame captions
connecting memory, tools, planning, reminders, and cross-channel workflows
making the assistant feel more like a persistent personal AI butler than a session-based chatbot
Docs: https://docs.alphaavatar.io
Website: https://alphaavatar.ai
Demo: https://www.alphaavatar.ai/demo
Community: AlphaAvatar
I’d love to hear feedback from people working on realtime agents, OpenAI-compatible assistants, multimodal models, memory systems, MCP tools, RAG, voice AI, avatar interaction, or self-hosted AI infrastructure.
If anyone is interested in contributing or building in this direction together, collaboration is very welcome.