# AlphaAvatar: a self-hostable realtime full-multimodal personal AI assistant runtime

> Source: <https://discuss.huggingface.co/t/alphaavatar-a-self-hostable-realtime-full-multimodal-personal-ai-assistant-runtime/176928#post_1>
> Published: 2026-06-18 03:16:58+00:00

Hi everyone

I’ve been building **AlphaAvatar**, an open-source and self-hostable **realtime full-multimodal personal AI assistant runtime**.

The idea behind AlphaAvatar is simple: I don’t think personal AI assistants should stay as stateless chatbots forever.

Most assistants today still work like this:

```
User asks something
↓
Assistant replies
↓
Session ends
↓
Most useful context is lost
```

AlphaAvatar is my attempt to explore what a persistent **personal AI butler** could look like — an assistant that can talk, see, remember, understand who it is interacting with, retrieve knowledge, call tools, manage tasks, and act across different channels over time.

The current high-level architecture combines **voice, text, visual input, face identity, speaker detection, memory, persona, MCP tools, RAG, DeepResearch, status feedback, model orchestration, and channel integrations** into one assistant runtime.

The goal is not just to build another chat UI.

The goal is to build a runtime layer for long-term personal assistance.

At a high level, AlphaAvatar includes:

- **Interaction layer** for realtime voice, text, camera input, and external channels
- **Core runtime / agent layer** for session state, context management, and orchestration
- **Memory + Persona layer** for persistent user context and identity-aware interaction
- **Tool / knowledge layer** with MCP, RAG, and DeepResearch
- **Model layer** for OpenAI-compatible LLMs, multimodal models, STT, TTS, speaker detection, and face recognition
- **Storage / data layer** for self-hosted memory, documents, vector storage, and tool APIs
- **Output layer** for realtime voice, text, avatar responses, tool actions, and status updates

One important direction of AlphaAvatar is that “multimodal” should not only mean accepting voice, text, and camera input.

The goal is to make the entire assistant runtime **full-multimodal**, not just its input side.

That means multimodal context should flow through the core modules of the system:

- **Memory** should be able to learn from text, voice, visual frames, face identity, speaker identity, user actions, tool results, and recurring routines.
- **Persona** should understand the user not only from written preferences, but also from interaction style, voice behavior, identity signals, and multimodal context.
- **MCP tools** should be selected and called based on the full runtime context, not only the latest text prompt.
- **RAG / DeepResearch** should work with documents, user context, tool results, and future visual/event memories.
- **Status feedback** should expose what the assistant is doing across modalities, especially during long-running tool, retrieval, or research workflows.
- **Channel plugins** should allow the same assistant runtime to work across voice, web, avatar UI, WhatsApp, Discord, and future channels.

So the long-term goal is not simply:

```
text + voice + camera → chatbot
```

but rather:

```
text + voice + vision + identity + memory + persona + tools + channels
        ↓
full-multimodal personal assistant runtime
```

This is why AlphaAvatar treats **Memory, Persona, MCP, RAG, DeepResearch, Status, Voice, Avatar, and Channel integrations** as composable runtime plugins.

Each plugin should eventually be able to consume, produce, or update multimodal context.
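As a rough illustration of what "consume, produce, or update multimodal context" could mean in code, here is a minimal Python sketch. It is not AlphaAvatar's actual API: `MultimodalContext`, `RuntimePlugin`, `run_plugins`, and the two toy plugins are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol


@dataclass
class MultimodalContext:
    """Hypothetical shared context passed through every plugin each turn."""
    text: str = ""
    speaker_id: Optional[str] = None          # e.g. from speaker detection
    face_id: Optional[str] = None             # e.g. from face recognition
    frames: list[bytes] = field(default_factory=list)    # sampled camera frames
    notes: dict[str, Any] = field(default_factory=dict)  # values produced by plugins


class RuntimePlugin(Protocol):
    name: str

    def update(self, ctx: MultimodalContext) -> None:
        """Read from ctx and/or write results back into ctx.notes."""
        ...


class PersonaPlugin:
    name = "persona"

    def __init__(self) -> None:
        self.styles: dict[str, str] = {}  # user id -> preferred reply style

    def update(self, ctx: MultimodalContext) -> None:
        # Identity signals take priority over anonymous text-only input.
        user = ctx.face_id or ctx.speaker_id or "anonymous"
        ctx.notes["persona"] = {"user": user, "style": self.styles.get(user, "neutral")}


class StatusPlugin:
    name = "status"

    def update(self, ctx: MultimodalContext) -> None:
        # Report which modalities were present in this turn.
        present = []
        if ctx.text:
            present.append("text")
        if ctx.speaker_id:
            present.append("voice")
        if ctx.frames or ctx.face_id:
            present.append("vision")
        ctx.notes["status"] = "modalities: " + ", ".join(present)


def run_plugins(plugins: list[RuntimePlugin], ctx: MultimodalContext) -> MultimodalContext:
    # Each plugin sees what earlier plugins wrote, so order matters.
    for plugin in plugins:
        plugin.update(ctx)
    return ctx
```

The point of the sketch is only that every plugin receives the same context object, so a persona or status component can react to face or speaker identity rather than to the text prompt alone.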

A real personal assistant should not only answer questions.

It should be able to:

- remember useful long-term context
- understand the user’s preferences and routines
- know who is currently interacting with it
- work across voice, text, camera, and external channels
- retrieve documents and knowledge when needed
- call tools and external services
- provide progress updates during long-running actions
- gradually become more useful as it learns from past interactions

This is why AlphaAvatar treats memory, persona, tools, and multimodal context as first-class runtime components, rather than small add-ons around a chatbot.

One of the key design choices is to keep the system modular.

AlphaAvatar is organized around a realtime runtime powered by components such as **AgentSession** and **AvatarEngine**, while capabilities are added through plugins.

Current plugin directions include:

- **Memory Plugin**: extracts, stores, retrieves, and injects long-term user context
- **Persona Plugin**: tracks preferences, identity state, interaction style, and user-related context
- **MCP Plugin**: provides a unified tool interface for external actions
- **RAG Plugin**: connects the assistant to documents and knowledge bases
- **DeepResearch Plugin**: supports longer research workflows
- **Status Plugin**: exposes intermediate progress during long-running actions
- **Character / Avatar Plugin**: supports avatar-style interaction
- **Channel Plugins**: connect the assistant to external channels such as WhatsApp

This plugin-based architecture makes the system easier to extend. It should be possible to add a new channel, tool, model provider, memory backend, or avatar interface without rewriting the core assistant runtime.
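One common way to get this kind of extensibility is a plugin registry: the core only looks plugins up by name, and new plugins register themselves. The Python sketch below shows the general pattern under that assumption. `PLUGIN_REGISTRY`, `register_plugin`, `build_plugins`, and the two channel classes are hypothetical and are not taken from the AlphaAvatar codebase.

```python
from typing import Callable

# Maps a plugin kind (e.g. "channel.whatsapp") to a zero-argument factory.
PLUGIN_REGISTRY: dict[str, Callable[[], object]] = {}


def register_plugin(kind: str):
    """Class decorator that records a plugin under a unique name."""
    def decorator(factory):
        if kind in PLUGIN_REGISTRY:
            raise ValueError(f"plugin {kind!r} is already registered")
        PLUGIN_REGISTRY[kind] = factory
        return factory
    return decorator


@register_plugin("channel.whatsapp")
class WhatsAppChannel:
    def send(self, text: str) -> str:
        # A real channel would call an external API here.
        return f"[whatsapp] {text}"


@register_plugin("channel.discord")
class DiscordChannel:
    def send(self, text: str) -> str:
        return f"[discord] {text}"


def build_plugins(enabled: list[str]) -> dict[str, object]:
    """Instantiate only the plugins named in the user's configuration."""
    return {kind: PLUGIN_REGISTRY[kind]() for kind in enabled}


if __name__ == "__main__":
    plugins = build_plugins(["channel.discord"])
    print(plugins["channel.discord"].send("hello"))
```

Adding a third channel in this sketch means writing one new decorated class; `build_plugins` and the rest of the core stay unchanged.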

AlphaAvatar is designed for realtime interaction, not only text-based chat.

The current direction includes:

- realtime voice interaction via **LiveKit RTC**
- text interaction
- sampled camera / visual input
- face detection and recognition
- speaker / voice target detection
- avatar-style response UI
- status-aware feedback during tool execution

For realtime assistants, silence during long-running tool calls feels unnatural.

So AlphaAvatar also includes a status-aware feedback loop. For example, when the assistant is retrieving memory, calling MCP tools, reading documents, or running a DeepResearch workflow, it can expose intermediate status updates instead of making the user wait without feedback.
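The general shape of such a feedback loop can be sketched with plain `asyncio`: run the slow operation as a task and emit a status message at a fixed interval until it finishes. This is an illustrative pattern only, not the Status plugin's real implementation. `slow_tool`, `run_with_status`, and the message strings are made up for the example.

```python
import asyncio
from typing import Awaitable, Callable


async def slow_tool(query: str) -> str:
    """Stand-in for a long-running tool call or retrieval step."""
    await asyncio.sleep(0.05)
    return f"results for {query!r}"


async def run_with_status(
    label: str,
    work: Awaitable[str],
    emit: Callable[[str], None],
    interval: float = 0.02,
) -> str:
    """Await `work`, calling `emit` periodically so the user is never left in silence."""
    task = asyncio.ensure_future(work)
    emit(f"{label}: started")
    while True:
        # Wait up to `interval` seconds; `done` is empty if the task is still running.
        done, _pending = await asyncio.wait({task}, timeout=interval)
        if done:
            break
        emit(f"{label}: still working")
    emit(f"{label}: done")
    return task.result()


async def main():
    updates: list[str] = []
    result = await run_with_status("search", slow_tool("weather"), updates.append)
    return result, updates


if __name__ == "__main__":
    result, updates = asyncio.run(main())
    print(updates)
    print(result)
```

In a voice assistant, `emit` would typically feed a short spoken or on-screen update instead of appending to a list. The number of "still working" messages depends on timing, so it is not fixed.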

A major part of AlphaAvatar is the idea that memory should not just be a chat summary.

Memory should become part of the assistant’s operating context.

The Memory module is designed to extract useful long-term information from interactions and retrieve relevant context when needed.
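To make the extract / store / retrieve / inject cycle concrete, here is a deliberately tiny Python sketch. It assumes a keyword heuristic for extraction and word overlap for retrieval, where a real system would more likely use an LLM and a vector store. `MemoryStore`, `MemoryItem`, and the marker phrases are hypothetical and do not describe AlphaAvatar's Memory plugin.

```python
import re
from dataclasses import dataclass


@dataclass
class MemoryItem:
    user_id: str
    text: str


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class MemoryStore:
    # Toy heuristic: sentences containing these phrases look like durable facts.
    MARKERS = ("i prefer", "i like", "i always", "my ")

    def __init__(self) -> None:
        self.items: list[MemoryItem] = []

    def extract(self, user_id: str, utterance: str) -> list[MemoryItem]:
        """Store only the sentences that look worth remembering."""
        kept = []
        for sentence in re.split(r"(?<=[.!?])\s+", utterance.strip()):
            if any(marker in sentence.lower() for marker in self.MARKERS):
                item = MemoryItem(user_id, sentence)
                self.items.append(item)
                kept.append(item)
        return kept

    def retrieve(self, user_id: str, query: str, k: int = 2) -> list[str]:
        """Return up to k memories for this user, ranked by word overlap with the query."""
        q = tokens(query)
        scored = [(len(q & tokens(i.text)), i) for i in self.items if i.user_id == user_id]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [item.text for _score, item in scored[:k]]

    def inject(self, user_id: str, query: str) -> str:
        """Build a prompt that places retrieved memories ahead of the user's message."""
        memories = self.retrieve(user_id, query)
        header = "\n".join(f"- {m}" for m in memories) or "- (none)"
        return f"Known about this user:\n{header}\n\nUser: {query}"
```

The part that matters is the separation of steps: extraction decides what is worth keeping, retrieval is scoped per user, and injection is the only place where memory touches the prompt.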

The Persona module tracks user-related context such as:

- preferences
- identity state
- interaction style
- session-level persona information
- temporary-user to real-user identity merging
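The last item, merging a temporary user into a real one, is easy to picture with a small Python sketch. It assumes a session starts with an anonymous persona and is later matched to a known user (for example through face or speaker recognition). The `Persona` fields and the rule that the known user's values win on conflict are assumptions for illustration, not AlphaAvatar's documented behavior.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Persona:
    user_id: str
    is_temporary: bool = True
    preferences: dict[str, str] = field(default_factory=dict)
    interaction_style: Optional[str] = None


def merge_personas(real: Persona, temp: Persona) -> Persona:
    """Fold what was learned about a temporary user into the matching real user.

    Assumed policy: on conflicts the real user's stored values win, and the
    temporary persona only fills in gaps.
    """
    merged_prefs = {**temp.preferences, **real.preferences}
    return Persona(
        user_id=real.user_id,
        is_temporary=False,
        preferences=merged_prefs,
        interaction_style=real.interaction_style or temp.interaction_style,
    )


if __name__ == "__main__":
    guest = Persona("tmp-42", preferences={"language": "en", "units": "metric"},
                    interaction_style="casual")
    alice = Persona("alice", is_temporary=False, preferences={"units": "imperial"})
    print(merge_personas(alice, guest))
```

Other policies are just as plausible, for instance preferring the most recent observation, so the conflict rule is the main design decision here.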

The next step is to push this further into multimodal memory.

Instead of only extracting memory from text conversations, AlphaAvatar should be able to build structured memory from:

- visual frames
- voice signals
- face identity
- speaker identity
- user actions
- environment changes
- tool execution history
- recurring routines

The long-term direction is **event-style multimodal memory**: connecting faces, voices, objects, places, actions, documents, tools, and time into a more useful personal memory space.
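One possible shape for an event-style record is sketched below in Python: a single event ties together time, people, objects, place, and tools, so it can be queried along any of those axes instead of being stored as an isolated frame caption. `MemoryEvent` and `find_events` are hypothetical names, and the sample data is invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class MemoryEvent:
    when: datetime
    summary: str
    people: frozenset = frozenset()   # identities from face / speaker recognition
    objects: frozenset = frozenset()  # things seen in visual frames
    place: Optional[str] = None
    tools: tuple = ()                 # tools called while the event happened


def find_events(
    events: Iterable[MemoryEvent],
    person: Optional[str] = None,
    obj: Optional[str] = None,
    since: Optional[datetime] = None,
) -> list[MemoryEvent]:
    """Filter events by any combination of person, object, and start time."""
    hits = []
    for event in events:
        if person is not None and person not in event.people:
            continue
        if obj is not None and obj not in event.objects:
            continue
        if since is not None and event.when < since:
            continue
        hits.append(event)
    return sorted(hits, key=lambda e: e.when)


EVENTS = [
    MemoryEvent(datetime(2026, 6, 1, 8, 0), "Alice left her keys on the desk",
                people=frozenset({"alice"}), objects=frozenset({"keys", "desk"}),
                place="office"),
    MemoryEvent(datetime(2026, 6, 2, 19, 30), "Bob asked for a dinner recipe",
                people=frozenset({"bob"}), place="kitchen", tools=("web_search",)),
]
```

With records like these, a question such as "where did Alice leave her keys?" becomes a query on person and object instead of a search through captions.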

AlphaAvatar is designed to be self-hostable because personal assistants will eventually handle very sensitive data.

A real personal AI butler may know about your routines, documents, tasks, conversations, visual history, voice identity, face identity, preferences, and personal workflows.

That kind of data should not be locked inside a closed black-box service by default.

In AlphaAvatar, the persistent memory and storage layer can stay on the user’s own personal server, while model inference can run locally, on another private server, or through an optional OpenAI-compatible external model provider.

The model runtime and the personal data layer do not have to live on the same machine.
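A small Python sketch of that separation, assuming the deployment is described by two independent endpoints: one for the personal data layer and one for an OpenAI-compatible model server. The `Deployment` class, the `inference_is_private` helper, and all hostnames and ports are placeholders invented for this example, not AlphaAvatar configuration keys.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Deployment:
    memory_store_url: str  # where persistent memory and documents live
    llm_base_url: str      # any OpenAI-compatible endpoint


def host_of(url: str) -> str:
    return urlparse(url).hostname or ""


def inference_is_private(dep: Deployment, trusted_hosts: set[str]) -> bool:
    """True if model inference stays on machines the user controls."""
    return host_of(dep.llm_base_url) in trusted_hosts


TRUSTED = {"home-server.local", "gpu-box.local"}

# Data and models on two different private machines.
split_private = Deployment("http://home-server.local:6333", "http://gpu-box.local:8000/v1")

# Data stays at home; inference goes to an optional external provider.
external_llm = Deployment("http://home-server.local:6333", "https://api.example.com/v1")

if __name__ == "__main__":
    for dep in (split_private, external_llm):
        print(host_of(dep.memory_store_url), host_of(dep.llm_base_url),
              inference_is_private(dep, TRUSTED))
```

In both example deployments the memory store stays on the same home server. Only the inference endpoint changes, which is the trade-off the paragraph above describes.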

The next stage is pushing AlphaAvatar toward fuller multimodal support.

Some directions I’m working on:

- deeper integration of visual input into Memory
- expanding Persona with face / speaker / identity-aware context
- improving realtime status feedback for long-running tool workflows
- building event-style multimodal memory instead of isolated frame captions
- connecting memory, tools, planning, reminders, and cross-channel workflows
- making the assistant feel more like a persistent personal AI butler than a session-based chatbot

GitHub: [AlphaAvatar/AlphaAvatar](https://github.com/AlphaAvatar/AlphaAvatar), a real-time interactive Omni Avatar built on LiveKit that integrates with open-source avatar components (real-time model, visual, voice, memory, search, etc.)

Docs: [https://docs.alphaavatar.io](https://docs.alphaavatar.io/)

Website: [https://alphaavatar.ai](https://alphaavatar.ai/)

Demo: [https://www.alphaavatar.ai/demo](https://www.alphaavatar.ai/demo)

Community: [AlphaAvatar](https://discord.gg/RVBWbb8Xy)

I’d love to hear feedback from people working on realtime agents, OpenAI-compatible assistants, multimodal models, memory systems, MCP tools, RAG, voice AI, avatar interaction, or self-hosted AI infrastructure.

If anyone is interested in contributing or building in this direction together, collaboration is very welcome.
