{"slug": "avtr-1-open-weight-real-time-flow-matching-transformer-for-audio-driven-avatars", "title": "AVTR-1: Open-weight real-time flow-matching transformer for audio-driven avatars", "summary": "AVTR-1, an open-weight flow-matching transformer for audio-driven avatars, has been released for real-time dialogue generation. The model renders lip-synced speech and active listening at 25 frames per second on a single GPU, given a portrait image and dual-stream audio. The release includes model weights, TensorRT-accelerated inference code, and a production-ready backend available as an API or for self-hosting.", "body_md": "**AVTR-1** is a flow-matching-based autoregressive model for live dialogue. Given a portrait image and dual-stream audio, it renders lip-synced speech and active listening at 25 fps on a single GPU. It is built for production deployment: model weights, TensorRT-accelerated inference, and the live-session backend, available as an API or fully self-hosted.\n\nThe release includes:\n\n- Model weights\n- Inference code\n- Interactive streaming demo\n- Technical report (coming soon)\n- Production-ready backend (coming soon)\n\nRequirements:\n\n- Linux\n- NVIDIA GPU (Ampere or later recommended)\n- CUDA 12.x + TensorRT 10.x\n- [pixi](https://prefix.dev/): `curl -fsSL https://pixi.sh/install.sh | sh`\n\n```\ngit clone https://github.com/avaturn-live/avtr-1.git\ncd avtr-1\npixi install\nexport AVTR1_LOCAL_STORAGE=/path/to/avtr1_storage\n```\n\nAll downloaded weights and built engines go under `$AVTR1_LOCAL_STORAGE`. 
Defaults to `<project_root>/artifacts/` (the repo checkout, not the caller's working directory) when unset.\n\n```\npixi run download\n```\n\nThe first run prompts for a HuggingFace login via `hf auth login` (automatically invoked as a dependency of `download`).\n\nWeights are pulled from two public HF repos by the previous step:\nAVTR-1 weights from [avaturn-live/avtr-1](https://huggingface.co/avaturn-live/avtr-1)\nand LivePortrait weights repackaged as ONNX graphs from\n[digital-avatar/ditto-talkinghead](https://huggingface.co/digital-avatar/ditto-talkinghead).\nTRT engines are compute-capability specific and built locally, so run the scripts\nbelow once per machine; outputs land under `$AVTR1_LOCAL_STORAGE`.\n\n```\n# Build everything at once\npixi run build-trt-engines\n\n# Or individually\npixi run build-trt-engines-avtr1\npixi run build-trt-engines-renderer\npixi run build-trt-engines-hubert\n\n# Launch the interactive demo\npixi run interactive-demo\n```\n\n**Single speaker.** The avatar lip-syncs the given audio track.\n\n```\npixi run generate_offline --speech example/speaker_1.ogg\n\n# with a custom avatar and background:\npixi run generate_offline --speech example/speaker_1.ogg --avatar maria --bg minimal_office\n```\n\n**Two-speaker dialogue.** The avatar voices `--speech` and reacts (active listening) to the peer audio on `--listen`. 
Run twice with the tracks swapped to render both sides of the conversation.\n\n```\n# avatar = speaker 1 (elena)\npixi run generate_offline --speech example/speaker_1.ogg --listen example/speaker_2.ogg --avatar elena  --out elena.mp4\n# avatar = speaker 2 (marcus)\npixi run generate_offline --speech example/speaker_2.ogg --listen example/speaker_1.ogg --avatar marcus --out marcus.mp4\n\n# stitch both sides into a single side-by-side video:\nffmpeg -i elena.mp4 -i marcus.mp4 -filter_complex \\\n  \"[0:v][1:v]hstack=inputs=2[v];[0:a][1:a]amix=inputs=2[a]\" \\\n  -map \"[v]\" -map \"[a]\" dialogue.mp4\n```\n\n**Silence / idle motion.** With no audio, the avatar renders idle micro-motion for the given duration.\n\n```\npixi run generate_offline --duration 10\n```\n\nAvailable avatars are the filenames (without `.png`) inside\n`$AVTR1_LOCAL_STORAGE/v1/avatars_artifacts/reference_frames/` after downloading.\n\nAVTR-1 generates motion in 5-frame chunks end-to-end. At 25 fps that's 200 ms of output per chunk, so any GPU whose per-chunk latency stays under 200 ms runs in real time.\n\n| GPU | Latency / 5-frame chunk | Real-time factor |\n|---|---|---|\n| L40 | 84 ms | 2.4× |\n| A100 | 91 ms | 2.2× |\n| RTX 4060 Ti | 166 ms | 1.2× |\n| RTX 3070 | 181 ms | 1.1× |\n| L4 | 202 ms | 0.99× |\n| RTX 3060 Ti | 206 ms | 0.97× |\n| RTX 4060 | 232 ms | 0.86× |\n\nReal-time factor = 200 ms / latency. ≥ 1.0× means the GPU keeps up with 25 fps.\n\n**TURN server setup** (optional)\n\nICE tries direct UDP first (host candidates + STUN-reflexive candidates from a public STUN server) and only needs a TURN relay when the network in between can't pass UDP between browser and streamer. This is typical when the streamer lives on a cloud VM whose security group blocks inbound UDP, or when one peer is behind symmetric NAT.\n\nIf direct UDP works for your setup you can skip this section entirely. 
The browser's connectivity card after the engine dropdown tells you which path ICE actually picked, and the same UI links back here when the verdict is \"only TURN works\" or \"nothing worked\".\n\nThe project is wired for **Cloudflare's Realtime TURN**. The free tier is\ngenerous enough for development; no credit card required.\n\n**1. Create a TURN application on Cloudflare**\n\n- Sign in to [dash.cloudflare.com](https://dash.cloudflare.com).\n- Navigate to **Realtime → TURN Server**.\n- Click **Create TURN App**, give it a name (e.g. `avtr1-dev`), and submit.\n\n**2. Copy the two credential values**\n\nOn the application's detail page you'll see:\n\n- **Turn Key ID**: short identifier (looks like a UUID without dashes).\n- **API Token**: long secret shown only once at creation. Save it before navigating away.\n\n**3. Put them in `.env`**\n\n```\nCLOUDFLARE_TURN_KEY_ID=\"<Turn Key ID>\"\nCLOUDFLARE_TURN_KEY_TOKEN=\"<API Token>\"\n```\n\nThat's it. On the next `/ice-servers` request the streamer mints a fresh,\nshort-lived TURN credential per session via Cloudflare's\n`/v1/turn/keys/{kid}/credentials/generate` endpoint, so the long-lived API\ntoken never leaves the server. You can verify it picked up the keys by\nwatching the streamer log for `ice: using Cloudflare TURN` on the first\nbrowser request.\n\nThe browser-side connectivity probe (the small status card under the controls) tells you which ICE path actually wins:\n\n- ✓ **host**: the browser saw its own local interface; always present.\n- ✓ **server-reflexive via STUN**: the browser learned its public IP via STUN; doesn't prove the streamer is reachable on UDP from the browser.\n
- ✓ **relay via TURN**: the browser successfully allocated a Cloudflare TURN relay; required when direct UDP can't traverse the network in between.\n\nIf the relay check fails while TURN is configured, the most likely cause is\nwrong credentials. Re-check that you copied the full **API Token** (not\nthe Key ID twice) into `CLOUDFLARE_TURN_KEY_TOKEN`.\n\n**Alternatives.** Anything that speaks the standard TURN protocol works.\nSet `TURN_URL` (and optionally `TURN_USERNAME` / `TURN_CREDENTIAL`) instead\nof the Cloudflare variables and `resolve_ice_servers()` will use it\nverbatim, e.g. a self-hosted [coturn](https://github.com/coturn/coturn) on\na small VM. STUN-only also works *if* you can open the appropriate UDP\nport range inbound on whatever firewall sits in front of the streamer.\n\nThis repository contains three separately licensed components:\n\n- `scripts/`: build and demo tooling, released under the **AVTR-1 Community License** ([LICENSE-MODEL.md](/avaturn-live/avtr-1/blob/main/LICENSE-MODEL.md)). Permits commercial use by entities under USD 10M annual revenue; entities at or above that threshold need a commercial agreement. 
The same license governs the AVTR-1 weights distributed at [avaturn-live/avtr-1](https://huggingface.co/avaturn-live/avtr-1).\n- `src/avtr1_renderer/`: Avaturn Renderer (inference pipeline), released under the **PolyForm Noncommercial License 1.0.0** with a Required Notice ([LICENSE-RENDERER.md](/avaturn-live/avtr-1/blob/main/LICENSE-RENDERER.md)). **Noncommercial use only**, regardless of revenue; any commercial use needs a separate Renderer Commercial License.\n- `src/avaturn_live_streamer/`: Avaturn Streamer (orchestration backend), released under the **PolyForm Noncommercial License 1.0.0** with a Required Notice and patent reservation ([LICENSE-STREAMER.md](/avaturn-live/avtr-1/blob/main/LICENSE-STREAMER.md), [PATENTS.md](/avaturn-live/avtr-1/blob/main/PATENTS.md)). **Noncommercial use only**, regardless of revenue; any commercial use needs a separate Streamer Commercial License.\n\nSee [LICENSE.md](/avaturn-live/avtr-1/blob/main/LICENSE.md) for the full component map and the consequences\nof the multi-license structure. In any conflict between this summary and the\nunderlying license files, the license files control.\n\nThe pipeline uses InsightFace's pretrained SCRFD detector and 2D106 landmark\nmodel, which are licensed for **non-commercial research use only**. To use\nAVTR-1 commercially, you must either obtain a commercial license from\nInsightFace ([deepinsight@gmail.com](mailto:deepinsight@gmail.com)) or replace these models with\npermissively licensed alternatives (e.g., MediaPipe). 
See\n[THIRD-PARTY-NOTICES.md](/avaturn-live/avtr-1/blob/main/THIRD-PARTY-NOTICES.md) for the full picture.\n\n**Commercial inquiries:** [hello@avaturn.me](mailto:hello@avaturn.me)", "url": "https://wpnews.pro/news/avtr-1-open-weight-real-time-flow-matching-transformer-for-audio-driven-avatars", "canonical_source": "https://github.com/avaturn-live/avtr-1", "published_at": "2026-05-27 14:26:33+00:00", "updated_at": "2026-05-27 14:46:50.258783+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "generative-ai", "computer-vision", "ai-products"], "entities": ["AVTR-1", "avaturn-live", "HuggingFace", "LivePortrait", "TensorRT", "NVIDIA"], "alternates": {"html": "https://wpnews.pro/news/avtr-1-open-weight-real-time-flow-matching-transformer-for-audio-driven-avatars", "markdown": "https://wpnews.pro/news/avtr-1-open-weight-real-time-flow-matching-transformer-for-audio-driven-avatars.md", "text": "https://wpnews.pro/news/avtr-1-open-weight-real-time-flow-matching-transformer-for-audio-driven-avatars.txt", "jsonld": "https://wpnews.pro/news/avtr-1-open-weight-real-time-flow-matching-transformer-for-audio-driven-avatars.jsonld"}}