AVTR-1: Open-weight real-time flow-matching transformer for audio-driven avatars


AVTR-1 is a flow-matching-based autoregressive model for live dialogue. Given a portrait image and dual-stream audio, it renders lip-synced speech and active listening at 25 fps on a single GPU. It is built for production deployment: the model weights, TensorRT-accelerated inference, and the live-session backend are available as an API or fully self-hosted.


Model weights
Inference code
Interactive streaming demo
Technical report (Coming soon)
Production-ready back-end (Coming soon)
Linux
NVIDIA GPU (Ampere or later recommended)
CUDA 12.x + TensorRT 10.x
pixi, installed with: curl -fsSL https://pixi.sh/install.sh | sh

git clone https://github.com/avaturn-live/avtr-1.git
cd avtr-1
pixi install
export AVTR1_LOCAL_STORAGE=/path/to/avtr1_storage

All downloaded weights and built engines go here. When unset, it defaults to <project_root>/artifacts/ (relative to the repo checkout, not the caller's working directory).
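The fallback rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the project's code: the function name `resolve_storage_dir` is hypothetical, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path


def resolve_storage_dir(project_root, env=None):
    """Return the storage directory described above.

    Hypothetical helper: uses AVTR1_LOCAL_STORAGE when it is set,
    otherwise falls back to <project_root>/artifacts.
    """
    env = os.environ if env is None else env
    override = env.get("AVTR1_LOCAL_STORAGE")
    if override:  # assumption: an empty value counts as unset
        return Path(override).expanduser()
    # The default is anchored at the repo checkout, not the current directory.
    return Path(project_root) / "artifacts"
```

Passing the project root explicitly is what keeps the default independent of the caller's working directory.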

pixi run download

The first run prompts for a Hugging Face login via `hf auth login` (invoked automatically as a dependency of `download`).

The previous step pulls weights from two public Hugging Face repos: AVTR-1 weights from `avaturn-live/avtr-1`, and LivePortrait weights repackaged as ONNX graphs from `digital-avatar/ditto-talkinghead`. TensorRT engines are specific to the GPU's compute capability and are built locally, so run the scripts below once per machine; outputs land under `$AVTR1_LOCAL_STORAGE`.

pixi run build-trt-engines

pixi run build-trt-engines-avtr1
pixi run build-trt-engines-renderer
pixi run build-trt-engines-hubert
pixi run interactive-demo

Single speaker. Avatar lip-syncs the given audio track.

pixi run generate_offline --speech example/speaker_1.ogg

pixi run generate_offline --speech example/speaker_1.ogg --avatar maria --bg minimal_office

Two-speaker dialogue. The avatar voices `--speech` and reacts (active listening) to the peer audio on `--listen`. Run twice with the tracks swapped to render both sides of the conversation.

pixi run generate_offline --speech example/speaker_1.ogg --listen example/speaker_2.ogg --avatar elena  --out elena.mp4
pixi run generate_offline --speech example/speaker_2.ogg --listen example/speaker_1.ogg --avatar marcus --out marcus.mp4

ffmpeg -i elena.mp4 -i marcus.mp4 -filter_complex \
  "[0:v][1:v]hstack=inputs=2[v];[0:a][1:a]amix=inputs=2[a]" \
  -map "[v]" -map "[a]" dialogue.mp4

Silence / idle motion. No audio — renders idle micro-motion for the given duration.

pixi run generate_offline --duration 10

Available avatars are the filenames (without `.png`) inside `$AVTR1_LOCAL_STORAGE/v1/avatars_artifacts/reference_frames/` after the download step.
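To see which names can be passed to `--avatar`, you can list that directory. A minimal sketch, assuming the layout described above; `list_avatars` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def list_avatars(storage_dir):
    """Return avatar names: the .png filenames in reference_frames, minus the extension."""
    frames_dir = Path(storage_dir) / "v1" / "avatars_artifacts" / "reference_frames"
    # Path.stem drops the ".png" suffix, leaving the name used with --avatar.
    return sorted(p.stem for p in frames_dir.glob("*.png"))
```

For example, `list_avatars(os.environ["AVTR1_LOCAL_STORAGE"])` would return the names found on your machine once the weights are downloaded.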

AVTR-1 generates motion in 5-frame chunks end-to-end. At 25 fps that's 200 ms of output per chunk, so any GPU that processes a chunk in under 200 ms runs in real time.

GPU	Latency / 5-frame chunk	Real-time factor
L40	84 ms	2.4×
A100	91 ms	2.2×
RTX 4060 Ti	166 ms	1.2×
RTX 3070	181 ms	1.1×
L4	202 ms	0.99×
RTX 3060 Ti	206 ms	0.97×
RTX 4060	232 ms	0.86×

Real-time factor = 200 ms / latency. ≥ 1.0× means the GPU keeps up with 25 fps.
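The arithmetic behind the table can be reproduced in a few lines. The latencies are copied from the table above; the constant and function names are illustrative only.

```python
FPS = 25
FRAMES_PER_CHUNK = 5

# Each chunk covers 5 frames at 25 fps, i.e. 200 ms of video.
CHUNK_BUDGET_MS = 1000 * FRAMES_PER_CHUNK / FPS


def real_time_factor(latency_ms):
    """Milliseconds of video produced per millisecond of compute."""
    return CHUNK_BUDGET_MS / latency_ms


# Per-chunk latencies in ms, taken from the table above.
LATENCIES_MS = {
    "L40": 84,
    "A100": 91,
    "RTX 4060 Ti": 166,
    "RTX 3070": 181,
    "L4": 202,
    "RTX 3060 Ti": 206,
    "RTX 4060": 232,
}

for gpu, latency in LATENCIES_MS.items():
    factor = real_time_factor(latency)
    verdict = "real-time" if factor >= 1.0 else "below real-time"
    print(f"{gpu}: {factor:.2f}x ({verdict})")
```

To check a GPU that is not listed, measure its per-chunk latency and pass it to `real_time_factor`.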

TURN server setup (optional)

ICE tries direct UDP first (host candidates + STUN-reflexive candidates from a public STUN server) and only needs a TURN relay when the network in between can't pass UDP between browser and streamer — typical when the streamer lives on a cloud VM whose security group blocks inbound UDP, or when one peer is behind symmetric NAT.

If direct UDP works for your setup you can skip this section entirely. The browser's connectivity card after the engine dropdown tells you which path ICE actually picked, and the same UI links back here when the verdict is "only TURN works" or "nothing worked".

The project is wired for Cloudflare's Realtime TURN. The free tier is generous enough for development; no credit card required.

1. Create a TURN application on Cloudflare

- Sign in to dash.cloudflare.com.
- Navigate to Realtime → TURN Server.
- Click Create TURN App, give it a name (e.g. `avtr1-dev`), and submit.

2. Copy the two credential values

On the application's detail page you'll see:

- Turn Key ID: short identifier (looks like a UUID without dashes).
- API Token: long secret shown only once at creation. Save it before navigating away.

3. Put them in .env

CLOUDFLARE_TURN_KEY_ID="<Turn Key ID>"
CLOUDFLARE_TURN_KEY_TOKEN="<API Token>"

That's it. On the next `/ice-servers` request the streamer mints a fresh, short-lived TURN credential per session via Cloudflare's `/v1/turn/keys/{kid}/credentials/generate` endpoint, so the long-lived API token never leaves the server. You can verify that it picked up the keys by watching the streamer log for `ice: using Cloudflare TURN` on the first browser request.
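For orientation, the credential request could look roughly like the sketch below. It is not the streamer's code. The function names are hypothetical, and the host `rtc.live.cloudflare.com`, the bearer-token header, and the `ttl` body field follow Cloudflare's public TURN documentation rather than this README, which only gives the path.

```python
import json
import urllib.request

# Assumption: API host from Cloudflare's Realtime TURN docs (the README only names the path).
CLOUDFLARE_RTC_BASE = "https://rtc.live.cloudflare.com"


def build_credential_request(key_id, api_token, ttl_seconds=3600):
    """Build (but do not send) the POST that asks for a short-lived TURN credential."""
    url = f"{CLOUDFLARE_RTC_BASE}/v1/turn/keys/{key_id}/credentials/generate"
    body = json.dumps({"ttl": ttl_seconds}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # The long-lived API token is only ever sent server-to-Cloudflare.
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


def mint_turn_credential(key_id, api_token, ttl_seconds=3600):
    """Send the request and return the parsed JSON response."""
    request = build_credential_request(key_id, api_token, ttl_seconds)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Only the short-lived credential in the response is forwarded to the browser; the API token stays in the server's environment.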

The browser-side connectivity probe (the small status card under the controls) tells you which ICE path actually wins:

- ✓ host: the browser saw its own local interface; always present.
- ✓ server-reflexive via STUN: the browser learned its public IP via STUN; this doesn't prove the streamer is reachable on UDP from the browser.
- ✓ relay via TURN: the browser successfully allocated a Cloudflare TURN relay; required when direct UDP can't traverse the network in between.
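These three checks correspond to the standard ICE candidate types `host`, `srflx`, and `relay`, which appear after the `typ` keyword in a candidate line. The sketch below shows how such a line can be classified; it illustrates the candidate format, not the demo's actual probe, and the names are hypothetical.

```python
# Maps the ICE candidate type token to the label used in the list above.
CANDIDATE_LABELS = {
    "host": "host",
    "srflx": "server-reflexive via STUN",
    "relay": "relay via TURN",
}


def candidate_type(candidate_line):
    """Return the token following 'typ' in an ICE candidate line, or None."""
    tokens = candidate_line.split()
    try:
        return tokens[tokens.index("typ") + 1]
    except (ValueError, IndexError):
        return None


def describe_candidate(candidate_line):
    """Return the human-readable label, or None for other or malformed candidates."""
    return CANDIDATE_LABELS.get(candidate_type(candidate_line))
```

For instance, a line such as `candidate:1 1 udp 1677729535 203.0.113.5 46154 typ srflx raddr 10.0.0.2 rport 46154` is classified as server-reflexive.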

If the relay check fails while TURN is configured, the most likely cause is wrong credentials: re-check that you copied the full API Token (not the Key ID twice) into `CLOUDFLARE_TURN_KEY_TOKEN`.

Alternatives. Anything that speaks the standard TURN protocol works. Set `TURN_URL` (and optionally `TURN_USERNAME` / `TURN_CREDENTIAL`) instead of the Cloudflare variables and `resolve_ice_servers()` will use it verbatim, e.g. a self-hosted coturn on a small VM. STUN-only also works if you can open the appropriate UDP port range inbound on whatever firewall sits in front of the streamer.
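As a rough illustration of the generic-TURN path, the sketch below turns those variables into the `urls` / `username` / `credential` entries that WebRTC expects. It is not the project's `resolve_ice_servers()`: the function name is hypothetical, the STUN URL is an assumed public server, and always including a STUN entry is an assumption.

```python
# Assumption: a well-known public STUN server; the README does not say which one is used.
DEFAULT_STUN_URL = "stun:stun.l.google.com:19302"


def ice_servers_from_env(env):
    """Build an iceServers list from TURN_URL / TURN_USERNAME / TURN_CREDENTIAL.

    Hypothetical helper for illustration; the Cloudflare path is not covered.
    """
    servers = [{"urls": [DEFAULT_STUN_URL]}]
    turn_url = env.get("TURN_URL")
    if turn_url:
        entry = {"urls": [turn_url]}  # used verbatim, as described above
        if env.get("TURN_USERNAME"):
            entry["username"] = env["TURN_USERNAME"]
        if env.get("TURN_CREDENTIAL"):
            entry["credential"] = env["TURN_CREDENTIAL"]
        servers.append(entry)
    return servers
```

With no TURN variables set, the result is the STUN-only configuration, which is enough when inbound UDP to the streamer is open.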

This repository contains three separately licensed components:

- `scripts/`: build and demo tooling, released under the AVTR-1 Community License (LICENSE-MODEL.md). Permits commercial use by entities under USD 10M annual revenue; entities at or above that threshold need a commercial agreement. The same license governs the AVTR-1 weights distributed at avaturn-live/avtr-1.
- `src/avtr1_renderer/`: Avaturn Renderer (inference pipeline), released under the PolyForm Noncommercial License 1.0.0 with a Required Notice (LICENSE-RENDERER.md). Noncommercial use only, regardless of revenue; any commercial use needs a separate Renderer Commercial License.
- `src/avaturn_live_streamer/`: Avaturn Streamer (orchestration backend), released under the PolyForm Noncommercial License 1.0.0 with a Required Notice and patent reservation (LICENSE-STREAMER.md, PATENTS.md). Noncommercial use only, regardless of revenue; any commercial use needs a separate Streamer Commercial License.

See LICENSE.md for the full component map and the consequences of the multi-license structure. In any conflict between this summary and the underlying license files, the license files control.

The pipeline uses InsightFace's pretrained SCRFD detector and 2D106 landmark model, which are licensed for non-commercial research use only. To use AVTR-1 commercially you must either obtain a commercial license from InsightFace (deepinsight@gmail.com) or replace these models with permissively-licensed alternatives (e.g., MediaPipe). See THIRD-PARTY-NOTICES.md for the full picture.

Commercial inquiries: hello@avaturn.me

source & further reading

github.com — original article
