AlphaAvatar: a self-hostable realtime full-multimodal personal AI assistant runtime


Hi everyone

I’ve been building AlphaAvatar, an open-source and self-hostable realtime full-multimodal personal AI assistant runtime.

The idea behind AlphaAvatar is simple: I don’t think personal AI assistants should stay as stateless chatbots forever.

Most assistants today still work like this:

User asks something
↓
Assistant replies
↓
Session ends
↓
Most useful context is lost

AlphaAvatar is my attempt to explore what a persistent personal AI butler could look like — an assistant that can talk, see, remember, understand who it is interacting with, retrieve knowledge, call tools, manage tasks, and act across different channels over time.

Below is the current high-level architecture:

AlphaAvatar combines voice, text, visual input, face identity, speaker detection, memory, persona, MCP tools, RAG, DeepResearch, status feedback, model orchestration, and channel integrations into one assistant runtime.

The goal is not just to build another chat UI.

The goal is to build a runtime layer for long-term personal assistance.

At a high level, AlphaAvatar includes:

Interaction layer for realtime voice, text, camera input, and external channels

Core runtime / agent layer for session state, context management, and orchestration

Memory + Persona layer for persistent user context and identity-aware interaction

Tool / knowledge layer with MCP, RAG, and DeepResearch

Model layer for OpenAI-compatible LLMs, multimodal models, STT, TTS, speaker detection, and face recognition

Storage / data layer for self-hosted memory, documents, vector storage, and tool APIs

Output layer for realtime voice, text, avatar responses, tool actions, and status updates

One important direction of AlphaAvatar is that “multimodal” should not only mean accepting voice, text, and camera input.

The goal is to make the entire assistant runtime fully multimodal.

That means multimodal context should flow through the core modules of the system:

Memory should be able to learn from text, voice, visual frames, face identity, speaker identity, user actions, tool results, and recurring routines.

Persona should understand the user not only from written preferences, but also from interaction style, voice behavior, identity signals, and multimodal context.

MCP tools should be selected and called based on the full runtime context, not only the latest text prompt.

RAG / DeepResearch should work with documents, user context, tool results, and future visual/event memories.

Status feedback should expose what the assistant is doing across modalities, especially during long-running tool, retrieval, or research workflows.

Channel plugins should allow the same assistant runtime to work across voice, web, avatar UI, WhatsApp, Discord, and future channels.

So the long-term goal is not simply:

text + voice + camera → chatbot

but rather:

text + voice + vision + identity + memory + persona + tools + channels
        ↓
full-multimodal personal assistant runtime

This is why AlphaAvatar treats Memory, Persona, MCP, RAG, DeepResearch, Status, Voice, Avatar, and Channel integrations as composable runtime plugins.

Each plugin should eventually be able to consume, produce, or update multimodal context.

A real personal assistant should not only answer questions.

It should be able to:

remember useful long-term context

understand the user’s preferences and routines

know who is currently interacting with it

work across voice, text, camera, and external channels

retrieve documents and knowledge when needed

call tools and external services

provide progress updates during long-running actions

gradually become more useful as it learns from past interactions

This is why AlphaAvatar treats memory, persona, tools, and multimodal context as first-class runtime components, rather than small add-ons around a chatbot.

One of the key design choices is to keep the system modular.

AlphaAvatar is organized around a realtime runtime powered by components such as AgentSession and AvatarEngine, while capabilities are added through plugins.

Current plugin directions include:

Memory Plugin — extracts, stores, retrieves, and injects long-term user context

Persona Plugin — tracks preferences, identity state, interaction style, and user-related context

MCP Plugin — provides a unified tool interface for external actions

RAG Plugin — connects the assistant to documents and knowledge bases

DeepResearch Plugin — supports longer research workflows

Status Plugin — exposes intermediate progress during long-running actions

Character / Avatar Plugin — supports avatar-style interaction

Channel Plugins — connect the assistant to external channels such as WhatsApp

This plugin-based architecture makes the system easier to extend. It should be possible to add a new channel, tool, model provider, memory backend, or avatar interface without rewriting the core assistant runtime.
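To make the plugin idea concrete, here is a minimal sketch of a core loop that only knows a plugin interface. It is not AlphaAvatar's actual API: `RuntimeContext`, `Plugin`, `Runtime`, `EchoMemoryPlugin`, and `StatusPlugin` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class RuntimeContext:
    """Hypothetical shared context passed through every plugin on each turn."""
    user_text: str = ""
    data: dict[str, Any] = field(default_factory=dict)


class Plugin(Protocol):
    """Hypothetical plugin interface: anything with a name and an on_turn hook."""
    name: str

    def on_turn(self, ctx: RuntimeContext) -> None:
        ...


class Runtime:
    """Hypothetical core: it depends on the Plugin interface, not on concrete plugins."""

    def __init__(self) -> None:
        self._plugins: list[Plugin] = []

    def register(self, plugin: Plugin) -> None:
        self._plugins.append(plugin)

    def handle_turn(self, user_text: str) -> RuntimeContext:
        ctx = RuntimeContext(user_text=user_text)
        for plugin in self._plugins:  # plugins run in registration order
            plugin.on_turn(ctx)
        return ctx


class EchoMemoryPlugin:
    name = "memory"

    def __init__(self) -> None:
        self.seen: list[str] = []

    def on_turn(self, ctx: RuntimeContext) -> None:
        ctx.data["memory"] = list(self.seen)  # inject what was seen before this turn
        self.seen.append(ctx.user_text)       # then remember the current turn


class StatusPlugin:
    name = "status"

    def on_turn(self, ctx: RuntimeContext) -> None:
        count = len(ctx.data.get("memory", []))
        ctx.data["status"] = f"processed turn with {count} memories"


runtime = Runtime()
runtime.register(EchoMemoryPlugin())
runtime.register(StatusPlugin())
first = runtime.handle_turn("I prefer tea")
second = runtime.handle_turn("What do I like?")
```

In this sketch, adding a new capability means writing one more class with an `on_turn` method and registering it. The `Runtime` class itself does not change.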

AlphaAvatar is designed for realtime interaction, not only text-based chat.

The current direction includes:

realtime voice interaction via LiveKit RTC

text interaction

sampled camera / visual input

face detection and recognition

speaker / voice target detection

avatar-style response UI

status-aware feedback during tool execution

For realtime assistants, silence during long-running tool calls feels unnatural.

So AlphaAvatar also includes a status-aware feedback loop. For example, when the assistant is retrieving memory, calling MCP tools, reading documents, or running a DeepResearch workflow, it can expose intermediate status updates instead of making the user wait without feedback.
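A status-aware loop like this can be sketched with plain `asyncio`. This is only an illustration under my own assumptions, not the Status plugin's real implementation: `run_with_status`, `slow_tool`, and the message strings are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

StatusCallback = Callable[[str], None]


async def run_with_status(
    label: str,
    work: Awaitable[str],
    emit: StatusCallback,
    interval: float = 0.05,
) -> str:
    """Await `work`, emitting a status line every `interval` seconds until it finishes."""
    task = asyncio.ensure_future(work)
    emit(f"{label}: started")
    while True:
        # Wait up to `interval` seconds; an empty `done` set means the task is still running.
        done, _ = await asyncio.wait({task}, timeout=interval)
        if done:
            break
        emit(f"{label}: still working")
    emit(f"{label}: done")
    return task.result()


async def slow_tool() -> str:
    """Stand-in for a long-running tool call (hypothetical)."""
    await asyncio.sleep(0.2)
    return "3 documents found"


updates: list[str] = []
result = asyncio.run(run_with_status("rag_search", slow_tool(), updates.append))
```

In a voice assistant the `emit` callback could feed a short spoken or on-screen update instead of appending to a list. The key point is that the user-facing channel keeps receiving signals while the tool is still running.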

A major part of AlphaAvatar is the idea that memory should not just be a chat summary.

Memory should become part of the assistant’s operating context.

The Memory module is designed to extract useful long-term information from interactions and retrieve relevant context when needed.
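As a rough sketch of the store, retrieve, and inject steps, the example below uses simple keyword overlap as a stand-in for whatever retrieval the Memory module actually uses. The extraction step is left out. `MemoryStore`, `tokens`, and `build_prompt` are hypothetical names.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word set used for the toy overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class MemoryStore:
    """Hypothetical in-memory store; a real backend would persist and embed facts."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def store(self, fact: str) -> None:
        if fact not in self._items:  # skip exact duplicates
            self._items.append(fact)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = tokens(query)
        scored = [(len(q & tokens(item)), item) for item in self._items]
        # Keep only facts sharing at least one word, highest overlap first.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [item for _, item in ranked[:k]]


def build_prompt(store: MemoryStore, user_text: str) -> str:
    """Inject retrieved memories ahead of the user's message."""
    memories = store.retrieve(user_text)
    lines = ["Known user context:"] + [f"- {m}" for m in memories]
    lines += ["", f"User: {user_text}"]
    return "\n".join(lines)


memory = MemoryStore()
memory.store("User prefers green tea in the morning")
memory.store("User's dentist appointment is on Friday")
prompt = build_prompt(memory, "What tea should I make this morning?")
```

Only the tea preference is injected here, because the dentist fact shares no words with the question.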

The Persona module tracks user-related context such as:

preferences

identity state

interaction style

session-level persona information

temporary-user to real-user identity merging
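The last item, merging a temporary user into a real user, can be sketched as follows. The field names and the conflict rule (values already stored for the real user win) are my assumptions for illustration, not the Persona plugin's documented behavior.

```python
from dataclasses import dataclass, field


@dataclass
class PersonaProfile:
    """Hypothetical persona record for one (possibly not yet identified) user."""
    user_id: str
    is_temporary: bool = True
    preferences: dict[str, str] = field(default_factory=dict)
    interaction_style: list[str] = field(default_factory=list)


def merge_profiles(temp: PersonaProfile, real: PersonaProfile) -> PersonaProfile:
    """Fold a temporary profile into a known one; real-user values win on conflict."""
    merged_prefs = {**temp.preferences, **real.preferences}
    merged_style = list(real.interaction_style)
    for trait in temp.interaction_style:
        if trait not in merged_style:  # keep order, avoid duplicates
            merged_style.append(trait)
    return PersonaProfile(real.user_id, False, merged_prefs, merged_style)


# A guest session is later recognized (e.g. by face or voice) as a known user.
temp = PersonaProfile(
    "tmp-42",
    preferences={"language": "en", "drink": "coffee"},
    interaction_style=["brief"],
)
real = PersonaProfile("alice", False, {"drink": "tea"}, ["formal"])
merged = merge_profiles(temp, real)
```

What the guest session learned (`language`) is preserved, while the conflicting `drink` preference keeps the value already stored for the known user.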

The next step is to push this further into multimodal memory.

Instead of only extracting memory from text conversations, AlphaAvatar should be able to build structured memory from:

visual frames

voice signals

face identity

speaker identity

user actions

environment changes

tool execution history

recurring routines

The long-term direction is event-style multimodal memory: connecting faces, voices, objects, places, actions, documents, tools, and time into a more useful personal memory space.
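One possible shape for such an event record, as a sketch only: the `MemoryEvent` fields and the `EventStore` query interface below are hypothetical and do not describe AlphaAvatar's current schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class MemoryEvent:
    """Hypothetical event record linking who, what, where, and when."""
    timestamp: datetime
    summary: str
    people: set[str] = field(default_factory=set)
    objects: set[str] = field(default_factory=set)
    place: Optional[str] = None
    source_modalities: set[str] = field(default_factory=set)


class EventStore:
    """Hypothetical store that filters events by person, object, and time."""

    def __init__(self) -> None:
        self._events: list[MemoryEvent] = []

    def add(self, event: MemoryEvent) -> None:
        self._events.append(event)

    def find(
        self,
        person: Optional[str] = None,
        obj: Optional[str] = None,
        since: Optional[datetime] = None,
    ) -> list[MemoryEvent]:
        matches = [
            e for e in self._events
            if (person is None or person in e.people)
            and (obj is None or obj in e.objects)
            and (since is None or e.timestamp >= since)
        ]
        return sorted(matches, key=lambda e: e.timestamp)  # oldest first


store = EventStore()
store.add(MemoryEvent(datetime(2025, 1, 6, 8, 0), "Alice made coffee",
                      {"alice"}, {"coffee machine"}, "kitchen", {"vision", "face"}))
store.add(MemoryEvent(datetime(2025, 1, 7, 8, 5), "Alice made coffee again",
                      {"alice"}, {"coffee machine"}, "kitchen", {"vision", "voice"}))
store.add(MemoryEvent(datetime(2025, 1, 7, 19, 0), "Bob watered the plants",
                      {"bob"}, {"plants"}, "balcony", {"vision"}))

coffee_events = store.find(person="alice", obj="coffee machine")
recent = store.find(since=datetime(2025, 1, 7))
```

Compared with isolated frame captions, a record like this lets a recurring routine ("Alice makes coffee around 8 am") be found by querying across people, objects, and time.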

AlphaAvatar is designed to be self-hostable because personal assistants will eventually handle very sensitive data.

A real personal AI butler may know about your routines, documents, tasks, conversations, visual history, voice identity, face identity, preferences, and personal workflows.

That kind of data should not be locked inside a closed black-box service by default.

In AlphaAvatar, the persistent memory and storage layer can stay on the user’s own personal server, while model inference can run locally, on another private server, or through an optional OpenAI-compatible external model provider.

The model runtime and the personal data layer do not have to live on the same machine.
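The split between data layer and model runtime could be expressed in configuration roughly as below. This is a sketch with made-up setting names and placeholder hosts, not AlphaAvatar's real config format: `DeploymentConfig`, `hosts_outside`, and all URLs are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class DeploymentConfig:
    """Hypothetical settings: where personal data lives vs. where inference runs."""
    memory_store_url: str
    vector_store_url: str
    llm_base_url: str  # any OpenAI-compatible endpoint


def hosts_outside(cfg: DeploymentConfig, private_hosts: set[str]) -> dict[str, str]:
    """Map each setting name to its host when that host is not in `private_hosts`."""
    outside: dict[str, str] = {}
    for name in ("memory_store_url", "vector_store_url", "llm_base_url"):
        host = urlparse(getattr(cfg, name)).hostname or ""
        if host not in private_hosts:
            outside[name] = host
    return outside


# Placeholder hosts: storage on a home server, inference on an external provider.
cfg = DeploymentConfig(
    memory_store_url="http://home-server.lan:5432",
    vector_store_url="http://home-server.lan:6333",
    llm_base_url="https://llm.example.com/v1",
)
external = hosts_outside(cfg, private_hosts={"home-server.lan"})
```

A check like `hosts_outside` makes it easy to see at startup which parts of the setup leave the user's own machines. Here only the model endpoint does.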

The next stage is pushing AlphaAvatar toward fuller multimodal support.

Some directions I’m working on:

deeper integration of visual input into Memory

expanding Persona with face / speaker / identity-aware context

improving realtime status feedback for long-running tool workflows

building event-style multimodal memory instead of isolated frame captions

connecting memory, tools, planning, reminders, and cross-channel workflows

making the assistant feel more like a persistent personal AI butler than a session-based chatbot

GitHub: AlphaAvatar/AlphaAvatar (a real-time interactive Omni Avatar built on LiveKit, which allows you to seamlessly integrate with any open source Avatar components: real-time model, visual, voice, memory, search, etc.)

Docs: https://docs.alphaavatar.io

Website: https://alphaavatar.ai

Demo: https://www.alphaavatar.ai/demo

Community: AlphaAvatar

I’d love to hear feedback from people working on realtime agents, OpenAI-compatible assistants, multimodal models, memory systems, MCP tools, RAG, voice AI, avatar interaction, or self-hosted AI infrastructure.

If anyone is interested in contributing or building in this direction together, collaboration is very welcome.

source & further reading

discuss.huggingface.co (original article)