AlphaAvatar: a self-hostable realtime full-multimodal personal AI assistant runtime

AlphaAvatar, an open-source and self-hostable realtime full-multimodal personal AI assistant runtime, has been announced. The system integrates voice, text, visual input, face identity, speaker detection, memory, persona, MCP tools, RAG, DeepResearch, status feedback, model orchestration, and channel integrations into a persistent assistant runtime. The goal is to move beyond stateless chatbots toward a long-term personal AI butler that can remember context, understand users, and act across multiple modalities and channels.

Hi everyone,

I’ve been building AlphaAvatar, an open-source and self-hostable realtime full-multimodal personal AI assistant runtime.

The idea behind AlphaAvatar is simple: I don’t think personal AI assistants should stay as stateless chatbots forever. Most assistants today still work like this:

    User asks something
        ↓
    Assistant replies
        ↓
    Session ends
        ↓
    Most useful context is lost

AlphaAvatar is my attempt to explore what a persistent personal AI butler could look like: an assistant that can talk, see, remember, understand who it is interacting with, retrieve knowledge, call tools, manage tasks, and act across different channels over time.

Below is the current high-level architecture: AlphaAvatar combines voice, text, visual input, face identity, speaker detection, memory, persona, MCP tools, RAG, DeepResearch, status feedback, model orchestration, and channel integrations into one assistant runtime.

The goal is not just to build another chat UI. The goal is to build a runtime layer for long-term personal assistance.

At a high level, AlphaAvatar includes:

- Interaction layer for realtime voice, text, camera input, and external channels
- Core runtime / agent layer for session state, context management, and orchestration
- Memory + Persona layer for persistent user context and identity-aware interaction
- Tool / knowledge layer with MCP, RAG, and DeepResearch
- Model layer for OpenAI-compatible LLMs, multimodal models, STT, TTS, speaker detection, and face recognition
- Storage / data layer for self-hosted memory, documents, vector storage, and tool APIs
- Output layer for realtime voice, text, avatar responses, tool actions, and status updates

One important direction of AlphaAvatar is that “multimodal” should not only mean accepting voice, text, and camera input. The goal is to make the entire assistant runtime full-multimodal.
That means multimodal context should flow through the core modules of the system:

- Memory should be able to learn from text, voice, visual frames, face identity, speaker identity, user actions, tool results, and recurring routines.
- Persona should understand the user not only from written preferences, but also from interaction style, voice behavior, identity signals, and multimodal context.
- MCP tools should be selected and called based on the full runtime context, not only the latest text prompt.
- RAG / DeepResearch should work with documents, user context, tool results, and future visual/event memories.
- Status feedback should expose what the assistant is doing across modalities, especially during long-running tool, retrieval, or research workflows.
- Channel plugins should allow the same assistant runtime to work across voice, web, avatar UI, WhatsApp, Discord, and future channels.

So the long-term goal is not simply:

    text + voice + camera → chatbot

but rather:

    text + voice + vision + identity + memory + persona + tools + channels
        ↓
    full-multimodal personal assistant runtime

This is why AlphaAvatar treats Memory, Persona, MCP, RAG, DeepResearch, Status, Voice, Avatar, and Channel integrations as composable runtime plugins. Each plugin should eventually be able to consume, produce, or update multimodal context.

A real personal assistant should not only answer questions. It should be able to:

- remember useful long-term context
- understand the user’s preferences and routines
- know who is currently interacting with it
- work across voice, text, camera, and external channels
- retrieve documents and knowledge when needed
- call tools and external services
- provide progress updates during long-running actions
- gradually become more useful as it learns from past interactions

This is why AlphaAvatar treats memory, persona, tools, and multimodal context as first-class runtime components, rather than small add-ons around a chatbot.
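
To make the idea of plugins that consume, produce, or update multimodal context a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: `RuntimeContext`, `Plugin`, `PersonaPlugin`, `MemoryPlugin`, and `run_plugins` are hypothetical names made up for this example, not the actual AlphaAvatar API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Protocol


@dataclass
class RuntimeContext:
    """Hypothetical shared context that every plugin can read and update."""
    text: str = ""
    speaker_id: Optional[str] = None  # would come from speaker detection
    face_id: Optional[str] = None     # would come from face recognition
    notes: Dict[str, Any] = field(default_factory=dict)


class Plugin(Protocol):
    """Hypothetical plugin interface: one hook that sees the full context."""
    name: str

    def update(self, ctx: RuntimeContext) -> None: ...


class PersonaPlugin:
    name = "persona"

    def __init__(self, preferences: Dict[str, Dict[str, str]]) -> None:
        self.preferences = preferences

    def update(self, ctx: RuntimeContext) -> None:
        # Identity signals (face first, then voice) select the user's preferences.
        user = ctx.face_id or ctx.speaker_id
        ctx.notes["persona"] = self.preferences.get(user, {})


class MemoryPlugin:
    name = "memory"

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def update(self, ctx: RuntimeContext) -> None:
        # Store an event that links what was said to who said it.
        self.events.append({"who": ctx.face_id or ctx.speaker_id, "text": ctx.text})
        ctx.notes["memory_events"] = len(self.events)


def run_plugins(plugins: List[Plugin], ctx: RuntimeContext) -> RuntimeContext:
    """Pass the same context through each plugin in order."""
    for plugin in plugins:
        plugin.update(ctx)
    return ctx


ctx = run_plugins(
    [PersonaPlugin({"alice": {"tone": "brief"}}), MemoryPlugin()],
    RuntimeContext(text="remind me at 9", face_id="alice"),
)
print(ctx.notes)
```

The point of the sketch is the shape, not the details: the core loop only knows about the shared context and the `update` hook, so adding another plugin does not require touching the existing ones.
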

One of the key design choices is to keep the system modular. AlphaAvatar is organized around a realtime runtime powered by components such as AgentSession and AvatarEngine, while capabilities are added through plugins. Current plugin directions include:

- Memory Plugin: extracts, stores, retrieves, and injects long-term user context
- Persona Plugin: tracks preferences, identity state, interaction style, and user-related context
- MCP Plugin: provides a unified tool interface for external actions
- RAG Plugin: connects the assistant to documents and knowledge bases
- DeepResearch Plugin: supports longer research workflows
- Status Plugin: exposes intermediate progress during long-running actions
- Character / Avatar Plugin: supports avatar-style interaction
- Channel Plugins: connect the assistant to external channels such as WhatsApp

This plugin-based architecture makes the system easier to extend. It should be possible to add a new channel, tool, model provider, memory backend, or avatar interface without rewriting the core assistant runtime.

AlphaAvatar is designed for realtime interaction, not only text-based chat. The current direction includes:

- realtime voice interaction via LiveKit RTC
- text interaction
- sampled camera / visual input
- face detection and recognition
- speaker / voice target detection
- avatar-style response UI
- status-aware feedback during tool execution

For realtime assistants, silence during long-running tool calls feels unnatural, so AlphaAvatar also includes a status-aware feedback loop. For example, when the assistant is retrieving memory, calling MCP tools, reading documents, or running a DeepResearch workflow, it can expose intermediate status updates instead of making the user wait without feedback.

A major part of AlphaAvatar is the idea that memory should not just be a chat summary. Memory should become part of the assistant’s operating context.
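
The two ideas above, status-aware feedback and memory as operating context, can be sketched together in a few lines of Python. This is a toy example under my own assumptions, not AlphaAvatar code: the in-memory `MEMORIES` list, the keyword matching, and the function names are all hypothetical, and `asyncio.sleep` stands in for a real vector search or database call.

```python
import asyncio
from typing import List, Set, Tuple

# Hypothetical in-memory store; a real deployment would use a persistent backend.
MEMORIES = [
    "User prefers short spoken answers.",
    "User has a weekly team meeting on Monday morning.",
    "User's dog is named Miso.",
]


def keywords(text: str) -> Set[str]:
    """Lowercased words longer than three characters, punctuation stripped."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


async def retrieve_memory(query: str, status: asyncio.Queue) -> List[str]:
    """Simulated slow retrieval that reports progress instead of staying silent."""
    await status.put("Searching long-term memory...")
    await asyncio.sleep(0.05)  # stands in for a slow search call
    hits = [m for m in MEMORIES if keywords(query) & keywords(m)]
    await status.put(f"Memory matches found: {len(hits)}")
    return hits


async def report_status(status: asyncio.Queue, spoken: List[str]) -> None:
    """Would forward updates to voice or UI; here it just records them."""
    while True:
        message = await status.get()
        if message is None:  # sentinel: the turn is finished
            break
        spoken.append(message)


async def handle_turn(user_text: str) -> Tuple[str, List[str]]:
    status: asyncio.Queue = asyncio.Queue()
    spoken: List[str] = []
    reporter = asyncio.create_task(report_status(status, spoken))
    hits = await retrieve_memory(user_text, status)
    await status.put(None)
    await reporter
    # Retrieved memory becomes part of the context sent to the model.
    context = "Known about the user:\n" + "\n".join(f"- {h}" for h in hits)
    return f"{context}\n\nUser: {user_text}", spoken


prompt, spoken = asyncio.run(handle_turn("When is my team meeting?"))
print(spoken)
print(prompt)
```

The status queue is deliberately separate from the retrieval result, so the same reporter could serve MCP tool calls or a DeepResearch workflow without changing how their results are returned.
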
The Memory module is designed to extract useful long-term information from interactions and retrieve relevant context when needed. The Persona module tracks user-related context such as:

- preferences
- identity state
- interaction style
- session-level persona information
- temporary-user to real-user identity merging

The next step is to push this further into multimodal memory. Instead of only extracting memory from text conversations, AlphaAvatar should be able to build structured memory from:

- visual frames
- voice signals
- face identity
- speaker identity
- user actions
- environment changes
- tool execution history
- recurring routines

The long-term direction is event-style multimodal memory: connecting faces, voices, objects, places, actions, documents, tools, and time into a more useful personal memory space.

AlphaAvatar is designed to be self-hostable because personal assistants will eventually handle very sensitive data. A real personal AI butler may know about your routines, documents, tasks, conversations, visual history, voice identity, face identity, preferences, and personal workflows. That kind of data should not be locked inside a closed black-box service by default.

In AlphaAvatar, the persistent memory and storage layer can stay on the user’s own personal server, while model inference can run locally, on another private server, or through an optional OpenAI-compatible external model provider. The model runtime and the personal data layer do not have to live on the same machine.

The next stage is pushing AlphaAvatar toward fuller multimodal support.
Some directions I’m working on:

- deeper integration of visual input into Memory
- expanding Persona with face / speaker / identity-aware context
- improving realtime status feedback for long-running tool workflows
- building event-style multimodal memory instead of isolated frame captions
- connecting memory, tools, planning, reminders, and cross-channel workflows
- making the assistant feel more like a persistent personal AI butler than a session-based chatbot

Links:

- GitHub: https://github.com/AlphaAvatar/AlphaAvatar
- Docs: https://docs.alphaavatar.io
- Website: https://alphaavatar.ai
- Demo: https://www.alphaavatar.ai/demo
- Community: https://discord.gg/RVBWbb8Xy

I’d love to hear feedback from people working on realtime agents, OpenAI-compatible assistants, multimodal models, memory systems, MCP tools, RAG, voice AI, avatar interaction, or self-hosted AI infrastructure. If anyone is interested in contributing or building in this direction together, collaboration is very welcome.