AVTR-1 is a flow-matching-based autoregressive model for live dialogue. Given a portrait image and dual-stream audio, it renders lip-synced speech and active listening at 25 fps on a single GPU. Built for production deployment: model weights, TensorRT-accelerated inference, and the live-session backend are available as an API or fully self-hosted.
- Model weights
- Inference code
- Interactive streaming demo
- Technical report (Coming soon)
- Production-ready back-end (Coming soon)

Requirements:

- Linux
- NVIDIA GPU (Ampere or later recommended)
- CUDA 12.x + TensorRT 10.x
- pixi (installed by the first command below)
curl -fsSL https://pixi.sh/install.sh | sh
git clone https://github.com/avaturn-live/avtr-1.git
cd avtr-1
pixi install
export AVTR1_LOCAL_STORAGE=/path/to/avtr1_storage
All downloaded weights and built engines go here. Defaults to <project_root>/artifacts/
(the repo checkout, not the caller's working directory) when unset.
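The lookup rule above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `resolve_storage_dir` is a hypothetical helper, and treating an empty value as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_storage_dir(project_root: Path,
                        env: Optional[Mapping[str, str]] = None) -> Path:
    """Return AVTR1_LOCAL_STORAGE if set, else <project_root>/artifacts."""
    env = os.environ if env is None else env
    # Assumption: an empty or whitespace-only value counts as "unset".
    override = env.get("AVTR1_LOCAL_STORAGE", "").strip()
    if override:
        return Path(override).expanduser()
    # The default is anchored at the repo checkout, not at Path.cwd().
    return project_root / "artifacts"
```

Passing the repo root explicitly is what keeps the default independent of the directory the command is run from.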
pixi run download
The first run prompts for a HuggingFace login via `hf auth login` (invoked automatically as a dependency of `download`).
Weights are pulled from two public HF repos by the previous step: AVTR-1 weights from avaturn-live/avtr-1 and LivePortrait weights repackaged as ONNX graphs from digital-avatar/ditto-talkinghead.
TRT engines are compute-capability specific and built locally. Run the scripts below once per machine; outputs land under `$AVTR1_LOCAL_STORAGE`.
pixi run build-trt-engines
pixi run build-trt-engines-avtr1
pixi run build-trt-engines-renderer
pixi run build-trt-engines-hubert
pixi run interactive-demo
Single speaker. Avatar lip-syncs the given audio track.
pixi run generate_offline --speech example/speaker_1.ogg
pixi run generate_offline --speech example/speaker_1.ogg --avatar maria --bg minimal_office
Two-speaker dialogue. The avatar voices `--speech` and reacts (active listening) to the peer audio on `--listen`. Run twice with the tracks swapped to render both sides of the conversation.
pixi run generate_offline --speech example/speaker_1.ogg --listen example/speaker_2.ogg --avatar elena --out elena.mp4
pixi run generate_offline --speech example/speaker_2.ogg --listen example/speaker_1.ogg --avatar marcus --out marcus.mp4
ffmpeg -i elena.mp4 -i marcus.mp4 -filter_complex \
"[0:v][1:v]hstack=inputs=2[v];[0:a][1:a]amix=inputs=2[a]" \
-map "[v]" -map "[a]" dialogue.mp4
Silence / idle motion. With no audio input, the avatar renders idle micro-motion for the given duration.
pixi run generate_offline --duration 10
Available avatars are the filenames (without `.png`) inside `$AVTR1_LOCAL_STORAGE/v1/avatars_artifacts/reference_frames/` after the download step.
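A minimal sketch of listing those names programmatically. `list_avatars` is a hypothetical helper, not part of the repository; it only assumes the directory layout described above.

```python
from pathlib import Path
from typing import List


def list_avatars(storage_dir: Path) -> List[str]:
    """Return avatar names: the .png filenames, without extension, sorted."""
    frames_dir = storage_dir / "v1" / "avatars_artifacts" / "reference_frames"
    if not frames_dir.is_dir():
        # Nothing downloaded yet (or the storage path is wrong).
        return []
    return sorted(p.stem for p in frames_dir.glob("*.png"))
```

Each returned name is a candidate value for `--avatar`.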
AVTR-1 generates motion in 5-frame chunks end-to-end. At 25 fps that's 200 ms of output per chunk, so any GPU whose per-chunk latency stays under 200 ms runs in real time.
| GPU | Latency / 5-frame chunk | Real-time factor |
|---|---|---|
| L40 | 84 ms | 2.4× |
| A100 | 91 ms | 2.2× |
| RTX 4060 Ti | 166 ms | 1.2× |
| RTX 3070 | 181 ms | 1.1× |
| L4 | 202 ms | 0.99× |
| RTX 3060 Ti | 206 ms | 0.97× |
| RTX 4060 | 232 ms | 0.86× |
Real-time factor = 200 ms / latency. A value ≥ 1.0× means the GPU keeps up with 25 fps.
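To check your own hardware against the same budget, the arithmetic can be sketched as follows. The latencies are copied from the table above; the function and constant names are illustrative only.

```python
FPS = 25
CHUNK_FRAMES = 5
# Output duration covered by one chunk: 5 frames at 25 fps = 200 ms.
CHUNK_BUDGET_MS = CHUNK_FRAMES * 1000 / FPS


def real_time_factor(latency_ms: float) -> float:
    """Ratio of chunk duration to chunk latency; >= 1.0 keeps up with 25 fps."""
    return CHUNK_BUDGET_MS / latency_ms


# Per-chunk latencies (ms) from the benchmark table.
LATENCY_MS = {
    "L40": 84, "A100": 91, "RTX 4060 Ti": 166, "RTX 3070": 181,
    "L4": 202, "RTX 3060 Ti": 206, "RTX 4060": 232,
}

for gpu, latency in LATENCY_MS.items():
    rtf = real_time_factor(latency)
    verdict = "real-time" if rtf >= 1.0 else "below real-time"
    print(f"{gpu:12s} {latency:4d} ms  {rtf:.2f}x  {verdict}")
```

Replace the dictionary with a latency measured on your own machine to see where it falls relative to the 200 ms line.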
TURN server setup (optional)
ICE tries direct UDP first (host candidates + STUN-reflexive candidates from a public STUN server) and only needs a TURN relay when the network in between can't pass UDP between browser and streamer. This is typical when the streamer lives on a cloud VM whose security group blocks inbound UDP, or when one peer is behind symmetric NAT.
If direct UDP works for your setup you can skip this section entirely. The browser's connectivity card after the engine dropdown tells you which path ICE actually picked, and the same UI links back here when the verdict is "only TURN works" or "nothing worked".
The project is wired for Cloudflare's Realtime TURN. The free tier is generous enough for development; no credit card required.
1. Create a TURN application on Cloudflare
- Sign in to dash.cloudflare.com.
- Navigate to Realtime → TURN Server.
- Click Create TURN App, give it a name (e.g. `avtr1-dev`), and submit.
2. Copy the two credential values
On the application's detail page you'll see:
- **Turn Key ID**: short identifier (looks like a UUID without dashes).
- **API Token**: long secret shown only once at creation. Save it before navigating away.
3. Put them in .env
CLOUDFLARE_TURN_KEY_ID="<Turn Key ID>"
CLOUDFLARE_TURN_KEY_TOKEN="<API Token>"
That's it. On the next `/ice-servers` request the streamer mints a fresh, short-lived TURN credential per session via Cloudflare's `/v1/turn/keys/{kid}/credentials/generate` endpoint, so the long-lived API token never leaves the server. You can verify it picked up the keys by watching the streamer log for `ice: using Cloudflare TURN` on the first browser request.
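For reference, the credential-minting call might look roughly like the sketch below. This is not the streamer's code. The host `rtc.live.cloudflare.com`, the bearer-token header, and the `{"ttl": ...}` body are assumptions based on Cloudflare's public TURN documentation, and both function names are hypothetical. Check Cloudflare's current API reference before relying on it.

```python
import json
import urllib.request

# Assumption: API host taken from Cloudflare's public TURN docs.
CLOUDFLARE_RTC_BASE = "https://rtc.live.cloudflare.com"


def build_credential_request(key_id: str, api_token: str,
                             ttl_seconds: int = 3600) -> urllib.request.Request:
    """Build (but do not send) the request that mints a short-lived credential."""
    url = f"{CLOUDFLARE_RTC_BASE}/v1/turn/keys/{key_id}/credentials/generate"
    # Assumption: the lifetime is passed as a JSON "ttl" field, in seconds.
    body = json.dumps({"ttl": ttl_seconds}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


def mint_turn_credentials(key_id: str, api_token: str,
                          ttl_seconds: int = 3600) -> dict:
    """Send the request server-side and return the decoded JSON response."""
    request = build_credential_request(key_id, api_token, ttl_seconds)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Only the short-lived credential in the response should be forwarded to the browser; the API token stays in the server's environment.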
The browser-side connectivity probe (the small status card under the controls) tells you which ICE path actually wins:
- host: the browser saw its own local interface; always present.
- server-reflexive via STUN: the browser learned its public IP via STUN; this doesn't prove the streamer is reachable on UDP from the browser.
- relay via TURN: the browser successfully allocated a Cloudflare TURN relay; required when direct UDP can't traverse the network in between.
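The three checks map onto the standard ICE candidate types `host`, `srflx`, and `relay`, which appear after the `typ` keyword in each candidate line. The sketch below classifies gathered candidates the same way. It is an illustration of the idea, not the demo's probe code, and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Optional

# The candidate type follows the "typ" keyword in an ICE candidate line.
CANDIDATE_TYPE = re.compile(r"\btyp\s+(host|srflx|prflx|relay)\b")


def candidate_type(candidate_line: str) -> Optional[str]:
    """Return 'host', 'srflx', 'prflx', or 'relay', or None if unparseable."""
    match = CANDIDATE_TYPE.search(candidate_line)
    return match.group(1) if match else None


def summarize_paths(candidate_lines: Iterable[str]) -> Dict[str, bool]:
    """Report which of the three probe checks the gathered candidates satisfy."""
    seen = {candidate_type(line) for line in candidate_lines}
    return {
        "host": "host" in seen,
        "stun": "srflx" in seen,   # server-reflexive, learned via STUN
        "turn": "relay" in seen,   # relayed, allocated on a TURN server
    }
```

A result with `turn` set to `False` while TURN is configured points at the credential problem described next.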
If the relay check fails while TURN is configured, the most likely cause is wrong credentials. Re-check that you copied the full API Token (not the Key ID twice) into `CLOUDFLARE_TURN_KEY_TOKEN`.
Alternatives. Anything that speaks the standard TURN protocol works. Set `TURN_URL` (and optionally `TURN_USERNAME` / `TURN_CREDENTIAL`) instead of the Cloudflare variables and `resolve_ice_servers()` will use it verbatim, e.g. a self-hosted coturn on a small VM. STUN-only also works if you can open the appropriate UDP port range inbound on whatever firewall sits in front of the streamer.
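One way such a selection could work is sketched below. This is not the repository's `resolve_ice_servers()`. The function name, the default STUN URL, the precedence of `TURN_URL` over the Cloudflare variables when both are set, and the `mint_cloudflare` callback are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Mapping, Optional

# Assumption: any public STUN server; the project's actual default is not stated.
DEFAULT_STUN_URL = "stun:stun.l.google.com:19302"


def resolve_ice_servers_sketch(
    env: Mapping[str, str],
    mint_cloudflare: Optional[Callable[[str, str], Dict]] = None,
) -> List[Dict]:
    """Build an iceServers list from environment-style configuration."""
    servers: List[Dict] = [{"urls": [DEFAULT_STUN_URL]}]

    turn_url = env.get("TURN_URL")
    if turn_url:
        # Generic TURN server, used verbatim; credentials are optional.
        server: Dict = {"urls": [turn_url]}
        if env.get("TURN_USERNAME"):
            server["username"] = env["TURN_USERNAME"]
        if env.get("TURN_CREDENTIAL"):
            server["credential"] = env["TURN_CREDENTIAL"]
        servers.append(server)
        return servers

    key_id = env.get("CLOUDFLARE_TURN_KEY_ID")
    token = env.get("CLOUDFLARE_TURN_KEY_TOKEN")
    if key_id and token and mint_cloudflare is not None:
        # Delegate to a caller-supplied function that mints a short-lived entry.
        servers.append(mint_cloudflare(key_id, token))

    # With neither configured, the result is STUN-only.
    return servers
```

For a self-hosted coturn, `TURN_URL` would take a form like `turn:turn.example.com:3478` (the hostname is a placeholder).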
This repository contains three separately licensed components:
- `scripts/`: build and demo tooling, released under the AVTR-1 Community License (LICENSE-MODEL.md). Permits commercial use by entities under USD 10M annual revenue; entities at or above that threshold need a commercial agreement. The same license governs the AVTR-1 weights distributed at avaturn-live/avtr-1.
- `src/avtr1_renderer/`: Avaturn Renderer (inference pipeline), released under the PolyForm Noncommercial License 1.0.0 with a Required Notice (LICENSE-RENDERER.md). Noncommercial use only, regardless of revenue; any commercial use needs a separate Renderer Commercial License.
- `src/avaturn_live_streamer/`: Avaturn Streamer (orchestration backend), released under the PolyForm Noncommercial License 1.0.0 with a Required Notice and patent reservation (LICENSE-STREAMER.md, PATENTS.md). Noncommercial use only, regardless of revenue; any commercial use needs a separate Streamer Commercial License.
See LICENSE.md for the full component map and the consequences of the multi-license structure. In any conflict between this summary and the underlying license files, the license files control.
The pipeline uses InsightFace's pretrained SCRFD detector and 2D106 landmark model, which are licensed for non-commercial research use only. To use AVTR-1 commercially you must either obtain a commercial license from InsightFace (deepinsight@gmail.com) or replace these models with permissively-licensed alternatives (e.g., MediaPipe). See THIRD-PARTY-NOTICES.md for the full picture.
Commercial inquiries: hello@avaturn.me