LoongForge: A high-performance training framework for LLM, VLM, DIT, VLA models

Baidu Baige has released LoongForge, an open-source training framework for large language models, vision-language models, diffusion transformers, and vision-language-action models, claiming up to 5.04× training speedup over mainstream baselines. The framework supports both NVIDIA GPUs and Kunlun XPUs, and has already powered production training for enterprise customers in education, computer vision, and embodied AI, delivering 30% to 50% speedup on customer workloads. LoongForge, previously developed as AIAK-Training-LLM, is now publicly available on GitHub and has been used to train models including LLaVA-OneVision-2.0 and Wan 2.2.

English | [中文](https://github.com/baidu-baige/LoongForge/blob/master/README_zh.md)

A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models.

🚀 Up to 5.04× training speedup · 🌐 Native NVIDIA GPU & Kunlun XPU support

🐉 LoongForge is part of Baidu Baige's Loong open-source series, named after the traditional Chinese loong boat (龙舟), a symbol of coordinated power and forward momentum.

LoongForge is a unified training framework for LLMs, VLMs, VLAs, and diffusion models, covering pre-training, continued pre-training, and SFT. Built upon Megatron-LM with deep systemic enhancements across model coverage, training performance, and hardware support, it delivers significant speedups over mainstream open-source baselines.

Before going open-source, LoongForge was developed as AIAK-Training-LLM, Baidu Baige's training acceleration stack. It has supported production training for enterprise customers across Education, Computer Vision, and Embodied AI, typically delivering 30%~50% speedup over customer baselines, with the largest production runs reaching 5,000+ XPUs.

- 2026/05 ⚡ Accelerated Wan 2.2 training by 116%, and added context parallelism (CP) and data packing support.
- 2026/05 ✨ Added training support for Kimi K2.5 / K2.6, and introduced INT4 / NVFP4 PTQ.
- 2026/05 🎉 v0.1.0: first official tagged release of LoongForge.
- 2026/05 🌟 Powered the training and public release of LLaVA-OneVision-2.0.
- 2026/05 🤖 Expanded VLA coverage with GR00T N1.6; 60%+ speedup on Pi0.5 and GR00T training.
- 2026/04 🧩 Added training support for MiniMax-M2.7 on both NVIDIA GPU and Kunlun XPU.
- 2026/04 🚀 LoongForge source code publicly available on GitHub. [blog](https://baidu-baige.github.io/LoongForge/blog/2026-04-announcing-loongforge.html)
- 2025/10 🌟 Powered the training and public release of LLaVA-OneVision-1.5 under AIAK-Training-LLM, the predecessor of LoongForge. [blog](https://baidu-baige.github.io/LoongForge/blog/2025-10-llava-onevision-case-study.html)

See the full documentation for installation, tutorials, and advanced usage: [English](https://loongforge.readthedocs.io/en/latest/index.html) · [中文](https://loongforge.readthedocs.io/zh-cn/latest/index.html).

1. Install via [Docker](https://github.com/baidu-baige/LoongForge/blob/master/docker) (prebuilt images coming soon) or source build:
   - NVIDIA GPU: [Installation Guide](https://loongforge.readthedocs.io/en/latest/get_started/installation.html)
   - Kunlun XPU: [Installation Guide](https://loongforge.readthedocs.io/en/latest/kunlun_tutorial/install_p800.html)
2. Launch your first training run by following a tutorial for your target hardware and modality:
   - NVIDIA GPU: [LLM](https://loongforge.readthedocs.io/en/latest/llm_tutorial/quick_start_llm_pretrain.html) · [VLM](https://loongforge.readthedocs.io/en/latest/vlm_tutorial/quick_start_vlm_pretrain.html) · [VLA](https://loongforge.readthedocs.io/en/latest/vla_tutorial/quick_start_pi05_training.html) · [Diffusion (WAN)](https://loongforge.readthedocs.io/en/latest/wan_tutorial/quick_start_wan_training.html)
   - Kunlun XPU: [Kunlun XPU Tutorials](https://loongforge.readthedocs.io/en/latest/kunlun_tutorial/README.html)
3. Explore: browse [configs/models/](https://github.com/baidu-baige/LoongForge/blob/master/configs/models) and [examples/](https://github.com/baidu-baige/LoongForge/blob/master/examples) / [examples_xpu/](https://github.com/baidu-baige/LoongForge/blob/master/examples_xpu) for ready-to-run scripts.

- 🧩 Flexible Multi-Modal Composition: Configuration-driven assembly of VLMs from interchangeable ViT and LLM components.
- ⚡ Heterogeneous Parallelism: Independent TP / DP / recompute per model component (e.g., ViT vs. LLM) for optimal throughput and memory. [blog](https://baidu-baige.github.io/LoongForge/blog/2026-05-loongforge-heterogeneous-parallel-training.html)
- 🔀 Decoupled Encoder-Decoder Training: Separates ViT and LLM into independent tasks, eliminating encoder-induced pipeline bubbles.
- ⚖️ DP Load Balancing: Load-aware data redistribution mitigates sequence-packing imbalance, improving multi-node scaling efficiency. [blog](https://baidu-baige.github.io/LoongForge/blog/2026-05-loongforge-dp-load-balancing.html)
- 🚀 MoE-Native Optimization: Overlapped All2All / activation offload / compute, with further memory reduction beyond upstream Megatron-LM on DeepSeek-V3, Qwen3-MoE, etc.
- 🔬 Adaptive FP8 Training: End-to-end FP8 for LLMs and VLMs with standard blockwise FP8; an optional adaptive mode picks per-operator precision by GEMM shape and efficiency.
- 🔧 Custom Fused Operators: Fused kernels like FusedDSA for DSA-style models. The TileLang version is open-sourced; a high-performance CUDA version is available on the Baidu Baige platform.
- 🔁 Flexible Checkpointing: Offline bidirectional Megatron ↔ HuggingFace conversion plus native online HF load/save, with no format barriers across your workflow.
- 🧰 Versatile Pipelines & Data Tools: Out-of-the-box Pretrain / MidTrain / SFT / LoRA, with built-in dataset format conversion and sequence packing.
- 🌐 Heterogeneous Hardware: Native support for NVIDIA GPUs and Kunlun XPUs via a minimally-intrusive plugin design.
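To make the DP load-balancing idea concrete, here is a minimal sketch of load-aware redistribution across data-parallel ranks. It is not LoongForge's implementation: the function names, the greedy longest-first heuristic, and the cost model (a linear term plus a quadratic attention term scaled by 4096) are all assumptions chosen for illustration.

```python
import heapq
from typing import List, Sequence


def attention_cost(seq_len: int) -> float:
    """Hypothetical per-sample cost: linear work plus a quadratic attention term."""
    return seq_len + (seq_len ** 2) / 4096.0


def balance_across_dp_ranks(seq_lens: Sequence[int], num_ranks: int) -> List[List[int]]:
    """Greedily assign sample indices to DP ranks, most expensive samples first."""
    # Min-heap of (accumulated cost, rank id); the least-loaded rank is on top.
    heap = [(0.0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    buckets: List[List[int]] = [[] for _ in range(num_ranks)]
    # Placing expensive samples first lets cheaper ones fill the remaining gaps.
    order = sorted(
        range(len(seq_lens)),
        key=lambda i: attention_cost(seq_lens[i]),
        reverse=True,
    )
    for idx in order:
        load, rank = heapq.heappop(heap)
        buckets[rank].append(idx)
        heapq.heappush(heap, (load + attention_cost(seq_lens[idx]), rank))
    return buckets


def rank_loads(seq_lens: Sequence[int], buckets: List[List[int]]) -> List[float]:
    """Total estimated cost per rank for a given assignment."""
    return [sum(attention_cost(seq_lens[i]) for i in bucket) for bucket in buckets]


if __name__ == "__main__":
    lens = [8192, 512, 8192, 512, 4096, 1024, 4096, 1024]
    round_robin = [list(range(r, len(lens), 2)) for r in range(2)]
    balanced = balance_across_dp_ranks(lens, 2)
    for name, plan in (("round-robin", round_robin), ("balanced", balanced)):
        loads = rank_loads(lens, plan)
        print(name, "max/mean load:", max(loads) / (sum(loads) / len(loads)))
```

Because every DP rank waits for the slowest one at gradient synchronization, the quantity worth watching is the ratio of the maximum rank load to the mean; a real system would replace the assumed cost model with measured step times.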
📖 Deep-dive: LLM features · VLM features

Measured on v0.1.1 across LLM, VLM, VLA and DIT workloads against mainstream open-source training baselines:

| Model | Type | Baseline | Configuration | Speedup |
|---|---|---|---|---|
| Qwen3-30B-A3B | MoE | Megatron-LM† | 32 × A800‡ · GBS 1024 · 32K | 1.16× |
| DeepSeek-V3.2 Lite§ | MoE + DSA | Megatron-LM† | Reduced-layer · GBS 128 · 8K | 5.04× |
| Qwen3-VL-30B-A3B | VLM | VeOmni† | 32 × A800‡ · GBS 128 · 32K | 1.45× |
| GR00T N1.6 | VLA | LeRobot† | 8 × A800‡ · GBS 128 · 224×224 | 2.31× |
| Pi0.5 | VLA | OpenPI† | 8 × A800‡ · GBS 112 · 224×224 | 1.65× |
| Wan2.2 | DIT | DiffSynth† | 8 × A800‡ · 480×832×49 | 2.16× |

§ Due to test-bed scale limits, DeepSeek-V3.2 was validated separately on a reduced-layer configuration. LoongForge's DSA CUDA kernel optimizations still deliver ~5× speedup over Megatron-LM and reach 64K sequence length (the baseline OOMs beyond 8K).

† Numbers reflect baseline and LoongForge versions at the time of measurement, and may evolve as implementations change.

‡ Validation on additional hardware is rolling out in upcoming releases.

- [LLaVA-OneVision-2.0](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-2): Next-generation multimodal model, with new VideoCaption and Spatial datasets.
- [LLaVA-OneVision-1.5](https://arxiv.org/abs/2509.23661): Fully open framework for democratized multimodal training.
- [Qianfan-VL](https://github.com/baidubce/Qianfan-VL): Domain-Enhanced Vision-Language Models for Enterprise, 3B to 70B parameters.

LoongForge supports a broad range of [state-of-the-art models](https://loongforge.readthedocs.io/en/latest/get_started/support_model.html) across LLM, VLM, diffusion, and VLA.
| Modality | Architectures | Models |
|---|---|---|
| LLM | DeepSeek-V2 | deepseek-v2-lite, deepseek-v2 |
| | DeepSeek-V3 | deepseek-v3, deepseek-v32 |
| | LLaMA2 | llama2-7b, llama2-13b, llama2-70b |
| | LLaMA3 | llama3-8b, llama3-70b |
| | LLaMA3.1 | llama3.1-8b, llama3.1-70b, llama3.1-405b |
| | Qwen | qwen-1.8b → qwen-72b |
| | Qwen1.5 | qwen1.5-0.5b → qwen1.5-72b |
| | Qwen2 | qwen2-0.5b → qwen2-72b |
| | Qwen2.5 | qwen2.5-0.5b → qwen2.5-72b |
| | Qwen3 | qwen3-0.6b → qwen3-480b-a35b, qwen3-coder-30b-a3b |
| | Qwen3-Next | qwen3-next-80b-a3b |
| | MiniMax | minimax-m2.1, minimax-m2.5, minimax-m2.7 |
| | MIMO | mimo-7b |
| | GLM | glm5 |
| VLM | Qwen2.5-VL | qwen2.5-vl-3b → qwen2.5-vl-72b |
| | Qwen3-VL | qwen3-vl-30b-a3b, qwen3-vl-235b-a22b |
| | Qwen3.5 | qwen3.5-0.8b → qwen3.5-397b-a17b |
| | Qwen3.6 | qwen3.6-27b, qwen3.6-35b-a3b |
| | Kimi-K2.5 | kimi-k2.5, kimi-k2.6 |
| | ERNIE4.5-VL | ernie4.5vl-28b-a3b |
| | LLaVA-OneVision-1.5 | llava-onevision-1.5-4b |
| | InternVL2.5 | internvl2.5-8b → internvl2.5-78b |
| | InternVL3.5 | internvl3.5-8b → internvl3.5-241b-a28b |
| | CustomCombinedModel | Flexible ViT + LLM backbone configuration |

Diffusion and VLA model entries are listed on the supported-models page.

Model Support

- LLM / VLM: ongoing validation and release of new models (e.g., DeepSeek-V4)
- Embodied AI: expanded WAM coverage (e.g., DreamZero, LingBot VA)

Performance & Scaling

- Adopt next-generation techniques introduced with DeepSeek-V4
- Advanced MoE load-balancing strategies
- Long-sequence training with ChunkPipe scheduling and Context Parallelism
- Further diffusion-model acceleration (e.g., WAN)
- INT4 quantization-aware training
- MTP (Multi-Token Prediction) scaling for speculative decoding

📁 Directory tree

    LoongForge/
    ├── loongforge/                  # Core training framework
    │   ├── train/                   # Training entry points & trainers
    │   │   ├── pretrain/            # Pretrain (LLM, VLM)
    │   │   ├── sft/                 # SFT (LLM, VLM, InternVL, ERNIE)
    │   │   ├── diffusion/           # Diffusion (WAN)
    │   │   └── embodied/            # Embodied AI (Pi0.5, GR00T)
    │   ├── models/                  # Unified model abstractions
    │   │   ├── foundation/          # LLM backbones (LLaMA, Qwen, DeepSeek, ...)
    │   │   ├── encoder/             # Vision encoders (ViT, Qwen-VL, InternVL, ...)
    │   │   ├── omni_models/         # Multi-modal composition
    │   │   ├── diffusion/           # Diffusion models (WAN)
    │   │   ├── embodied/            # Embodied models (Pi0.5, GR00T)
    │   │   └── common/              # Shared layers and utilities
    │   ├── data/                    # Data pipelines (multi-modal, video, DP balance)
    │   ├── tokenizer/               # Tokenizers
    │   └── utils/                   # Config map, constants, etc.
    ├── third_party/Loong-Megatron/  # Patched Megatron-LM (git submodule)
    ├── configs/                     # Hydra YAML configs (models, data)
    ├── examples/                    # GPU launch scripts
    ├── examples_xpu/                # Kunlun XPU launch scripts
    ├── tools/                       # Checkpoint conversion, data preprocessing
    ├── ops/                         # Custom fused operators (incl. open-sourced TileLang)
    ├── patches/                     # TransformerEngine patches
    ├── docker/                      # Dockerfiles (GPU & XPU)
    ├── tests/                       # E2E test suite (YAML-driven)
    └── docs/                        # Documentation

We warmly welcome community contributions: bug reports, feature proposals, and PRs alike. Please read our [Contributing Guidelines](https://github.com/baidu-baige/LoongForge/blob/master/CONTRIBUTING.md) before submitting.

LoongForge is released under the [Apache License 2.0](https://github.com/baidu-baige/LoongForge/blob/master/LICENSE). Some files are derived from third-party open-source projects; please refer to the specific file headers for their respective copyright and attribution.

    @software{LoongForge2026,
      title  = {LoongForge: A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models},
      author = {{The LoongForge Authors}},
      year   = {2026},
      url    = {https://github.com/baidu-baige/LoongForge}
    }

LoongForge is built upon NVIDIA's Megatron-LM.
We also drew inspiration from several excellent open-source projects, including but not limited to HuggingFace Transformers, LLaMA-Factory, and Megatron-Bridge. We sincerely thank these communities for their outstanding contributions.

Open a GitHub issue for questions, feedback, or feature requests. You can also join our [Slack community](https://join.slack.com/t/baiduloongforge/shared_invite/zt-3ys3kaq2p-cmdw0nDoaHGOcKibgys5Yw) or scan the WeChat QR code below to join our developer community.