AVTR-1: Open-weight real-time flow-matching transformer for audio-driven avatars

AVTR-1, an open-weight flow-matching transformer for audio-driven avatars, has been released for real-time dialogue generation. The model renders lip-synced speech and active listening at 25 frames per second on a single GPU, given a portrait image and dual-stream audio. The release includes model weights, TensorRT-accelerated inference code, and a production-ready backend available as an API or for self-hosting.

AVTR-1 is a flow-matching-based autoregressive model for live dialogue. Given a portrait image and dual-stream audio, it renders lip-synced speech and active listening at 25 fps on a single GPU. It is built for production deployment: model weights, TensorRT-accelerated inference, and the live-session backend, available as an API or fully self-hosted.

- Model weights
- Inference code
- Interactive streaming demo
- Technical report (coming soon)
- Production-ready back-end (coming soon)

Requirements:

- Linux
- NVIDIA GPU (Ampere or later recommended)
- CUDA 12.x + TensorRT 10.x
- [pixi](https://prefix.dev/): `curl -fsSL https://pixi.sh/install.sh | sh`

Clone the repository and install the environment:

    git clone https://github.com/avaturn-live/avtr-1.git
    cd avtr-1
    pixi install

    export AVTR1_LOCAL_STORAGE=/path/to/avtr1_storage

`AVTR1_LOCAL_STORAGE` is where all downloaded weights and built engines go. When unset, it defaults to `<project_root>/artifacts/` (the repo checkout, not the caller's working directory).

    pixi run download

The first run will prompt for a HuggingFace login via `hf auth login` (automatically invoked as a dependency of `download`).

Weights are pulled from two public HF repos by the previous step: AVTR-1 weights from [avaturn-live/avtr-1](https://huggingface.co/avaturn-live/avtr-1) and LivePortrait weights repackaged as ONNX graphs from [digital-avatar/ditto-talkinghead](https://huggingface.co/digital-avatar/ditto-talkinghead).

TRT engines are compute-capability specific and built locally. Run the scripts below once per machine; outputs land under `$AVTR1_LOCAL_STORAGE`.

    # Build everything at once
    pixi run build-trt-engines

    # Or individually
    pixi run build-trt-engines-avtr1
    pixi run build-trt-engines-renderer
    pixi run build-trt-engines-hubert

Launch the interactive streaming demo:

    pixi run interactive-demo

**Single speaker.** Avatar lip-syncs the given audio track.

    pixi run generate_offline --speech example/speaker_1.ogg

    # with a custom avatar and background:
    pixi run generate_offline --speech example/speaker_1.ogg --avatar maria --bg minimal_office

**Two-speaker dialogue.**
Avatar voices `--speech` and reacts (active listening) to the peer audio on `--listen`. Run twice with the tracks swapped to render both sides of the conversation.

    # avatar = speaker 1 (elena)
    pixi run generate_offline --speech example/speaker_1.ogg --listen example/speaker_2.ogg --avatar elena --out elena.mp4

    # avatar = speaker 2 (marcus)
    pixi run generate_offline --speech example/speaker_2.ogg --listen example/speaker_1.ogg --avatar marcus --out marcus.mp4

    # stitch both sides into a single side-by-side video:
    ffmpeg -i elena.mp4 -i marcus.mp4 -filter_complex \
      "[0:v][1:v]hstack=inputs=2[v];[0:a][1:a]amix=inputs=2[a]" \
      -map "[v]" -map "[a]" dialogue.mp4

**Silence / idle motion.** No audio: renders idle micro-motion for the given duration.

    pixi run generate_offline --duration 10

Available avatars are the filenames (without `.png`) inside `$AVTR1_LOCAL_STORAGE/v1/avatars_artifacts/reference_frames/` after downloading.

AVTR-1 generates motion in 5-frame chunks end-to-end. At 25 fps that's 200 ms of output per chunk, so any GPU whose per-chunk latency is below 200 ms runs in real time.

| GPU | Latency / 5-frame chunk | Real-time factor |
|---|---|---|
| L40 | 84 ms | 2.4× |
| A100 | 91 ms | 2.2× |
| RTX 4060 Ti | 166 ms | 1.2× |
| RTX 3070 | 181 ms | 1.1× |
| L4 | 202 ms | 0.99× |
| RTX 3060 Ti | 206 ms | 0.97× |
| RTX 4060 | 232 ms | 0.86× |

Real-time factor = 200 ms / latency. A value ≥ 1.0× means the GPU keeps up with 25 fps.

TURN server setup (optional)

ICE tries direct UDP first (host candidates plus STUN-reflexive candidates from a public STUN server) and only needs a TURN relay when the network in between can't pass UDP between browser and streamer. This is typical when the streamer lives on a cloud VM whose security group blocks inbound UDP, or when one peer is behind symmetric NAT. If direct UDP works for your setup you can skip this section entirely.
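To make the candidate types concrete, here is a minimal Python sketch that classifies ICE candidate lines by type (`host`, `srflx` for STUN-reflexive, `relay` for TURN). It is not part of the AVTR-1 codebase: `parse_candidate` and `summarize` are hypothetical helper names, the sample addresses are documentation placeholders, and the field layout assumed is the standard SDP `candidate` attribute that WebRTC exposes.

```python
from dataclasses import dataclass

# Candidate types defined by the ICE specification.
CANDIDATE_TYPES = {"host", "srflx", "prflx", "relay"}


@dataclass(frozen=True)
class IceCandidate:
    transport: str  # "udp" or "tcp"
    address: str
    port: int
    kind: str       # one of CANDIDATE_TYPES


def parse_candidate(line: str) -> IceCandidate:
    """Parse one SDP candidate attribute (hypothetical helper).

    Assumed layout:
    candidate:<foundation> <component> <transport> <priority>
        <address> <port> typ <type> [extensions...]
    """
    text = line.strip()
    if text.startswith("a="):  # tolerate the raw SDP "a=" prefix
        text = text[2:]
    fields = text.split()
    if len(fields) < 8 or not fields[0].startswith("candidate:") or fields[6] != "typ":
        raise ValueError(f"not an ICE candidate line: {line!r}")
    kind = fields[7]
    if kind not in CANDIDATE_TYPES:
        raise ValueError(f"unknown candidate type: {kind!r}")
    return IceCandidate(
        transport=fields[2].lower(),
        address=fields[4],
        port=int(fields[5]),
        kind=kind,
    )


def summarize(lines: list[str]) -> dict[str, bool]:
    """Report which candidate types were gathered at all."""
    kinds = {parse_candidate(line).kind for line in lines}
    return {name: name in kinds for name in ("host", "srflx", "relay")}


if __name__ == "__main__":
    sample = [
        "candidate:1 1 udp 2122260223 192.168.1.10 54321 typ host",
        "candidate:2 1 udp 1686052607 203.0.113.7 54321 typ srflx "
        "raddr 192.168.1.10 rport 54321",
    ]
    print(summarize(sample))
```

Note that this only reports which candidate types were gathered. A gathered `srflx` candidate does not prove that direct UDP to the streamer will succeed, and the absence of a `relay` entry simply means no TURN allocation was obtained.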
The browser's connectivity card (after the engine dropdown) tells you which path ICE actually picked, and the same UI links back here when the verdict is "only TURN works" or "nothing worked".

The project is wired for Cloudflare's Realtime TURN. The free tier is generous enough for development; no credit card required.

**1. Create a TURN application on Cloudflare**

- Sign in to [dash.cloudflare.com](https://dash.cloudflare.com).
- Navigate to Realtime → TURN Server.
- Click Create TURN App, give it a name (e.g. `avtr1-dev`), and submit.

**2. Copy the two credential values**

On the application's detail page you'll see:

- Turn Key ID: short identifier (looks like a UUID without dashes).
- API Token: long secret shown only once at creation. Save it before navigating away.

**3. Put them in `.env`**

    CLOUDFLARE_TURN_KEY_ID="<Turn Key ID>"
    CLOUDFLARE_TURN_KEY_TOKEN="<API Token>"

That's it. On the next `/ice-servers` request the streamer mints a fresh, short-lived TURN credential per session via Cloudflare's `/v1/turn/keys/{kid}/credentials/generate` endpoint; the long-lived API token never leaves the server. You can verify it picked up the keys by watching the streamer log for `ice: using Cloudflare TURN` on the first browser request.

The browser-side connectivity probe (the small status card under the controls) tells you which ICE path actually wins:

- ✓ host: the browser saw its own local interface; always present.
- ✓ server-reflexive (via STUN): the browser learned its public IP via STUN; this doesn't prove the streamer is reachable on UDP from the browser.
- ✓ relay (via TURN): the browser successfully allocated a Cloudflare TURN relay; required when direct UDP can't traverse the network in between.

If the relay check fails while TURN is configured, the most likely cause is wrong credentials. Re-check that you copied the full API Token (not the Key ID twice) into `CLOUDFLARE_TURN_KEY_TOKEN`.

**Alternatives.** Anything that speaks the standard TURN protocol works.
Set `TURN_URL` (and optionally `TURN_USERNAME` / `TURN_CREDENTIAL`) instead of the Cloudflare variables and `resolve_ice_servers` will use it verbatim, e.g. a self-hosted [coturn](https://github.com/coturn/coturn) on a small VM. STUN-only also works if you can open the appropriate UDP port range inbound on whatever firewall sits in front of the streamer.

This repository contains three separately licensed components:

- `scripts/`: build and demo tooling, released under the AVTR-1 Community License ([LICENSE-MODEL.md](/avaturn-live/avtr-1/blob/main/LICENSE-MODEL.md)). Permits commercial use by entities under USD 10M annual revenue; entities at or above that threshold need a commercial agreement. The same license governs the AVTR-1 weights distributed at [avaturn-live/avtr-1](https://huggingface.co/avaturn-live/avtr-1).
- `src/avtr1_renderer/`: Avaturn Renderer (inference pipeline), released under the PolyForm Noncommercial License 1.0.0 with a Required Notice ([LICENSE-RENDERER.md](/avaturn-live/avtr-1/blob/main/LICENSE-RENDERER.md)). Noncommercial use only, regardless of revenue; any commercial use needs a separate Renderer Commercial License.
- `src/avaturn_live_streamer/`: Avaturn Streamer (orchestration backend), released under the PolyForm Noncommercial License 1.0.0 with a Required Notice and patent reservation ([LICENSE-STREAMER.md](/avaturn-live/avtr-1/blob/main/LICENSE-STREAMER.md), [PATENTS.md](/avaturn-live/avtr-1/blob/main/PATENTS.md)). Noncommercial use only, regardless of revenue; any commercial use needs a separate Streamer Commercial License.

See [LICENSE.md](/avaturn-live/avtr-1/blob/main/LICENSE.md) for the full component map and the consequences of the multi-license structure. In any conflict between this summary and the underlying license files, the license files control.

The pipeline uses InsightFace's pretrained SCRFD detector and 2D106 landmark model, which are licensed for non-commercial research use only.
To use AVTR-1 commercially you must either obtain a commercial license from InsightFace ([deepinsight@gmail.com](mailto:deepinsight@gmail.com)) or replace these models with permissively-licensed alternatives (e.g., MediaPipe). See [THIRD-PARTY-NOTICES.md](/avaturn-live/avtr-1/blob/main/THIRD-PARTY-NOTICES.md) for the full picture.

Commercial inquiries: [hello@avaturn.me](mailto:hello@avaturn.me)