English | 中文
A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models.
Up to 5.04× training speedup · Native NVIDIA GPU & Kunlun XPU support

LoongForge is part of Baidu Baige's Loong open-source series, named after the traditional Chinese loong boat (龙舟), a symbol of coordinated power and forward momentum.
LoongForge is a unified training framework for LLMs, VLMs, VLAs, and diffusion models, covering pre-training, continued pre-training, and SFT. Built upon Megatron-LM with deep systemic enhancements across model coverage, training performance, and hardware support, it delivers significant speedups over mainstream open-source baselines.
Before going open-source, LoongForge was developed as AIAK-Training-LLM, Baidu Baige's training acceleration stack. It has supported production training for enterprise customers across Education, Computer Vision, and Embodied AI, typically delivering 30%~50% speedup over customer baselines, with the largest production runs reaching 5,000+ XPUs.
- [2026/05] Accelerated **Wan 2.2** training by **116%**, and added CP and data packing support.
- [2026/05] Added training support for **Kimi K2.5 / K2.6**, and introduced **INT4 / NVFP4** PTQ.
- [2026/05] **v0.1.0**: first official tagged release of LoongForge.
- [2026/05] Powered the training and public release of **LLaVA-OneVision-2.0**.
- [2026/05] Expanded VLA coverage with **GR00T N1.6**; **60%+ speedup** on Pi0.5 and GR00T training.
- [2026/04] Added training support for **MiniMax-M2.7** on both NVIDIA GPU and Kunlun XPU.
- [2026/04] LoongForge source code publicly available on GitHub. [blog]
- [2025/10] Powered the training and public release of **LLaVA-OneVision-1.5** under **AIAK-Training-LLM**, the predecessor of LoongForge. [blog]
See the full documentation for installation, tutorials, and advanced usage: English · 中文.
1. Install via Docker (prebuilt images coming soon) or source build:
   - **NVIDIA GPU**: Installation Guide
   - **Kunlun XPU**: Installation Guide
2. Launch your first training run by following a tutorial for your target hardware and modality:
   - **NVIDIA GPU**: LLM · VLM · VLA · Diffusion (WAN)
   - **Kunlun XPU**: Kunlun XPU Tutorials
3. Explore: browse configs/models/, examples/, and examples_xpu/.
- **Flexible Multi-Modal Composition**: Configuration-driven assembly of VLMs from interchangeable ViT and LLM components.
- **Heterogeneous Parallelism**: Independent TP / DP / recompute per model component (e.g., ViT vs. LLM) for optimal throughput and memory. [blog]
- **Decoupled Encoder-Decoder Training**: Separates ViT and LLM into independent tasks, eliminating encoder-induced pipeline bubbles.
- **DP Load Balancing**: Load-aware data redistribution mitigates sequence-packing imbalance, improving multi-node scaling efficiency. [blog]
- **MoE-Native Optimization**: Overlapped All2All / activation offload / compute, with **further memory reduction** beyond upstream Megatron-LM on DeepSeek-V3, Qwen3-MoE, etc.
- **Adaptive FP8 Training**: End-to-end FP8 for LLMs and VLMs with **standard blockwise FP8**; optional **adaptive** mode picks per-operator precision by GEMM shape and efficiency.
- **Custom Fused Operators**: Fused kernels like **FusedDSA** for DSA-style models. The TileLang version is open-sourced; a high-performance CUDA version is available on the Baidu Baige platform.
- **Flexible Checkpointing**: Offline bidirectional **Megatron ↔ HuggingFace** conversion plus native online HF load/save, with no format barriers across your workflow.
- **Versatile Pipelines & Data Tools**: Out-of-the-box **Pretrain / MidTrain / SFT / LoRA**, with built-in dataset format conversion and sequence packing.
- **Heterogeneous Hardware**: Native support for **NVIDIA GPUs** and **Kunlun XPUs** via a minimally-intrusive plugin design.

Deep-dive: [LLM features] · [VLM features]
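As a rough illustration of the DP load-balancing idea described above, the sketch below assigns packed samples to data-parallel ranks so that per-rank cost is as even as possible. It is not LoongForge's implementation: the cost model (sum of squared sequence lengths, a common proxy for attention cost), the greedy longest-first strategy, and every function name here are assumptions made for the example.

```python
import heapq
from typing import List, Sequence, Tuple


def attention_cost(pack: Sequence[int]) -> int:
    """Proxy cost of one packed sample: sum of squared sequence lengths.

    Two packs with the same token count can still differ a lot in cost,
    which is the imbalance that load-aware redistribution targets.
    """
    return sum(n * n for n in pack)


def balance_across_dp_ranks(
    packs: List[List[int]], dp_size: int
) -> List[List[List[int]]]:
    """Greedy assignment: give the most expensive pack to the lightest rank."""
    # Min-heap of (accumulated cost, rank index).
    heap: List[Tuple[int, int]] = [(0, rank) for rank in range(dp_size)]
    heapq.heapify(heap)
    assignment: List[List[List[int]]] = [[] for _ in range(dp_size)]
    for pack in sorted(packs, key=attention_cost, reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(pack)
        heapq.heappush(heap, (load + attention_cost(pack), rank))
    return assignment


def rank_loads(assignment: List[List[List[int]]]) -> List[int]:
    """Total proxy cost per DP rank."""
    return [sum(attention_cost(p) for p in packs) for packs in assignment]


if __name__ == "__main__":
    # Four packs of 8192 tokens each, built from different sequence lengths.
    packs = [[8192], [4096, 4096], [2048] * 4, [1024] * 8]
    round_robin = [packs[0::2], packs[1::2]]
    balanced = balance_across_dp_ranks(packs, dp_size=2)
    print("round-robin loads:", rank_loads(round_robin))
    print("balanced loads:   ", rank_loads(balanced))
```

In this toy input every pack holds the same number of tokens, yet the single long sequence costs eight times as much as the pack of short ones under the quadratic proxy. A real system would likely use a measured or model-specific cost function instead.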
Measured on v0.1.1 across LLM, VLM, VLA and DIT workloads against mainstream open-source training baselines:
| Model | Type | Baseline | Configuration | Speedup |
|---|---|---|---|---|
| Qwen3-30B-A3B | MoE | Megatron-LM† | 32 × A800‡ · GBS 1024 · 32K | 1.16× |
| DeepSeek-V3.2 Lite § | MoE + DSA | Megatron-LM† | Reduced-layer · GBS 128 · 8K | 5.04× |
| Qwen3-VL-30B-A3B | VLM | VeOmni† | 32 × A800‡ · GBS 128 · 32K | 1.45× |
| GR00T N1.6 | VLA | LeRobot† | 8 × A800‡ · GBS 128 · 224×224 | 2.31× |
| Pi0.5 | VLA | OpenPI† | 8 × A800‡ · GBS 112 · 224×224 | 1.65× |
| Wan2.2 | DIT | DiffSynth† | 8 × A800‡ · 480×832×49 | 2.16× |
§ Due to test-bed scale limits, DeepSeek-V3.2 was validated separately on a reduced-layer configuration. LoongForge's DSA CUDA kernel optimizations still deliver ~5× speedup over Megatron-LM and reach 64K sequence length (the baseline OOMs beyond 8K).

† Numbers reflect baseline and LoongForge versions at the time of measurement, and may evolve as implementations change.

‡ Validation on additional hardware is rolling out in upcoming releases.
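This README quotes speedups both as multipliers (2.16× for Wan2.2 in the table) and as percentages (116% for Wan 2.2 in the news list). The sketch below converts between the two forms. It assumes a speedup is the ratio of LoongForge throughput to baseline throughput, which the README does not define explicitly; the function names are illustrative only.

```python
def percent_gain(speedup: float) -> float:
    """Percentage throughput gain implied by a multiplicative speedup."""
    return (speedup - 1.0) * 100.0


def time_reduction(speedup: float) -> float:
    """Fraction of wall-clock time saved for a fixed amount of work."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Speedups copied from the table above.
TABLE_SPEEDUPS = {
    "Qwen3-30B-A3B": 1.16,
    "DeepSeek-V3.2 Lite": 5.04,
    "Qwen3-VL-30B-A3B": 1.45,
    "GR00T N1.6": 2.31,
    "Pi0.5": 1.65,
    "Wan2.2": 2.16,
}

if __name__ == "__main__":
    for model, s in TABLE_SPEEDUPS.items():
        print(
            f"{model}: {s:.2f}x = +{percent_gain(s):.0f}% throughput, "
            f"{time_reduction(s):.1%} less time for the same work"
        )
```

Under this reading, a 2.16× speedup corresponds to a 116% throughput gain, while the time saved for a fixed workload is smaller than the percentage gain suggests because it follows 1 - 1/speedup.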
- **LLaVA-OneVision-2.0**: Next-generation multimodal model, with new VideoCaption and Spatial datasets.
- **LLaVA-OneVision-1.5**: Fully open framework for democratized multimodal training.
- **Qianfan-VL**: Domain-Enhanced Vision-Language Models for Enterprise, 3B to 70B parameters.
LoongForge supports a broad range of state-of-the-art models across LLM, VLM, diffusion, and VLA.
| Modality | Architectures | Models |
|---|---|---|
| LLM | DeepSeek-V2 | deepseek-v2-lite, deepseek-v2 |
| LLM | DeepSeek-V3 | deepseek-v3, deepseek-v32 |
| LLM | LLaMA2 | llama2-7b, llama2-13b, llama2-70b |
| LLM | LLaMA3 | llama3-8b, llama3-70b |
| LLM | LLaMA3.1 | llama3.1-8b, llama3.1-70b, llama3.1-405b |
| LLM | Qwen | qwen-1.8b to qwen-72b |
| LLM | Qwen1.5 | qwen1.5-0.5b to qwen1.5-72b |
| LLM | Qwen2 | qwen2-0.5b to qwen2-72b |
| LLM | Qwen2.5 | qwen2.5-0.5b to qwen2.5-72b |
| LLM | Qwen3 | qwen3-0.6b to qwen3-480b-a35b, qwen3-coder-30b-a3b |
| LLM | Qwen3-Next | qwen3-next-80b-a3b |
| LLM | MiniMax | minimax-m2.1, minimax-m2.5, minimax-m2.7 |
| LLM | MIMO | mimo-7b |
| LLM | GLM | glm5 |
| VLM | Qwen2.5-VL | qwen2.5-vl-3b to qwen2.5-vl-72b |
| VLM | Qwen3-VL | qwen3-vl-30b-a3b, qwen3-vl-235b-a22b |
| VLM | Qwen3.5 | qwen3.5-0.8b to qwen3.5-397b-a17b |
| VLM | Qwen3.6 | qwen3.6-27b, qwen3.6-35b-a3b |
| VLM | Kimi-K2.5 | kimi-k2.5, kimi-k2.6 |
| VLM | ERNIE4.5-VL | ernie4.5vl-28b-a3b |
| VLM | LLaVA-OneVision-1.5 | llava-onevision-1.5-4b |
| VLM | InternVL2.5 | internvl2.5-8b to internvl2.5-78b |
| VLM | InternVL3.5 | internvl3.5-8b to internvl3.5-241b-a28b |
| VLM | CustomCombinedModel | Flexible ViT + LLM backbone configuration ( |
Diffusion · VLA

Model Support
- LLM / VLM: ongoing validation and release of new models (e.g., DeepSeek-V4)
- Embodied AI: expanded WAM coverage (e.g., DreamZero, LingBot VA)
Performance & Scaling
- Adopt next-generation techniques introduced with DeepSeek-V4
- Advanced MoE load-balancing strategies
- Long-sequence training with ChunkPipe scheduling and Context Parallelism
- Further diffusion-model acceleration (e.g., WAN)
- INT4 quantization-aware training
- MTP (Multi-Token Prediction) scaling for speculative decoding
Directory tree

LoongForge/
├── loongforge/                     # Core training framework
│   ├── train/                      # Training entry points & trainers
│   │   ├── pretrain/               # Pretrain (LLM, VLM)
│   │   ├── sft/                    # SFT (LLM, VLM, InternVL, ERNIE)
│   │   ├── diffusion/              # Diffusion (WAN)
│   │   └── embodied/               # Embodied AI (Pi0.5, GR00T)
│   ├── models/                     # Unified model abstractions
│   │   ├── foundation/             # LLM backbones (LLaMA, Qwen, DeepSeek, ...)
│   │   ├── encoder/                # Vision encoders (ViT, Qwen-VL, InternVL, ...)
│   │   ├── omni_models/            # Multi-modal composition
│   │   ├── diffusion/              # Diffusion models (WAN)
│   │   ├── embodied/               # Embodied models (Pi0.5, GR00T)
│   │   └── common/                 # Shared layers and utilities
│   ├── data/                       # Data pipelines (multi-modal, video, DP balance)
│   ├── tokenizer/                  # Tokenizers
│   └── utils/                      # Config map, constants, etc.
├── third_party/Loong-Megatron/     # Patched Megatron-LM (git submodule)
├── configs/                        # Hydra YAML configs (models, data)
├── examples/                       # GPU launch scripts
├── examples_xpu/                   # Kunlun XPU launch scripts
├── tools/                          # Checkpoint conversion, data preprocessing
├── ops/                            # Custom fused operators (incl. open-sourced TileLang)
├── patches/                        # TransformerEngine patches
├── docker/                         # Dockerfiles (GPU & XPU)
├── tests/                          # E2E test suite (YAML-driven)
└── docs/                           # Documentation
We warmly welcome community contributions: bug reports, feature proposals, and PRs alike. Please read our Contributing Guidelines before submitting.
LoongForge is released under the Apache License 2.0. Some files are derived from third-party open-source projects; please refer to the specific file headers for their respective copyright and attribution.
@software{LoongForge2026,
title = {LoongForge: A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models},
author = {{The LoongForge Authors}},
year = {2026},
url = {https://github.com/baidu-baige/LoongForge}
}
LoongForge is built upon NVIDIA's Megatron-LM. We also drew inspiration from several excellent open-source projects, including but not limited to HuggingFace Transformers, LLaMA-Factory, and Megatron-Bridge. We sincerely thank these communities for their outstanding contributions.
Open a GitHub issue for questions, feedback, or feature requests. You can also join our Slack community or scan the WeChat QR code below to join our developer community.