LoongForge: A high-performance training framework for LLM, VLM, DIT, VLA models



A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models.

🚀 Up to 5.04× training speedup · 🌐 Native NVIDIA GPU & Kunlun XPU support

🐉 LoongForge is part of Baidu Baige's Loong open-source series, named after the traditional Chinese loong boat (龙舟), a symbol of coordinated power and forward momentum.

LoongForge is a unified training framework for LLMs, VLMs, VLAs, and diffusion models, covering pre-training, continued pre-training, and SFT. Built upon Megatron-LM with deep systemic enhancements across model coverage, training performance, and hardware support, it delivers significant speedups over mainstream open-source baselines.

Before going open-source, LoongForge was developed as AIAK-Training-LLM, Baidu Baige's training acceleration stack. It has supported production training for enterprise customers across Education, Computer Vision, and Embodied AI, typically delivering 30%~50% speedup over customer baselines, with the largest production runs reaching 5,000+ XPUs.

[2026/05] ⚡ Accelerated Wan 2.2 training by 116%, and added CP and data packing support.
[2026/05] ✨ Added training support for Kimi K2.5 / K2.6, and introduced INT4 / NVFP4 PTQ.
[2026/05] 🎉 v0.1.0: first official tagged release of LoongForge.
[2026/05] 🌟 Powered the training and public release of LLaVA-OneVision-2.0.
[2026/05] 🤖 Expanded VLA coverage with GR00T N1.6; 60%+ speedup on Pi0.5 and GR00T training.
[2026/04] 🧩 Added training support for MiniMax-M2.7 on both NVIDIA GPU and Kunlun XPU.
[2026/04] 🚀 LoongForge source code publicly available on GitHub. [blog]
[2025/10] 🌟 Powered the training and public release of LLaVA-OneVision-1.5 under AIAK-Training-LLM, the predecessor of LoongForge. [blog]

See the full documentation for installation, tutorials, and advanced usage — English · 中文.

1. Install via Docker (prebuilt images coming soon) or source build:

NVIDIA GPU: Installation Guide
Kunlun XPU: Installation Guide

2. Launch your first training run by following a tutorial for your target hardware and modality:

NVIDIA GPU: LLM · VLM · VLA · Diffusion (WAN)
Kunlun XPU: Kunlun XPU Tutorials

3. Explore: browse configs/models/ and examples/ / examples_xpu/ for ready-to-run scripts.
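As a starting point for step 3, here is a minimal sketch that lists the YAML files under the configs/models/ directory of a local checkout. The checkout path, the `.yaml` / `.yml` extensions, and the `list_model_configs` helper are assumptions for illustration only; the actual layout of the repository may differ.

```python
from pathlib import Path


def list_model_configs(repo_root: str, subdir: str = "configs/models") -> list[str]:
    """Return YAML config paths (relative to repo_root) found under subdir.

    Hypothetical helper, not part of LoongForge. Assumes configs use a
    .yaml or .yml extension.
    """
    root = Path(repo_root)
    base = root / subdir
    if not base.is_dir():
        # Nothing to list if the directory is missing in this checkout.
        return []
    found = [
        p for p in base.rglob("*")
        if p.is_file() and p.suffix in {".yaml", ".yml"}
    ]
    return sorted(p.relative_to(root).as_posix() for p in found)


if __name__ == "__main__":
    # Run from the root of a checkout to print every model config found.
    for path in list_model_configs("."):
        print(path)
```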

🧩 Flexible Multi-Modal Composition: Configuration-driven assembly of VLMs from interchangeable ViT and LLM components.
⚡ Heterogeneous Parallelism: Independent TP / DP / recompute per model component (e.g., ViT vs. LLM) for optimal throughput and memory. [blog]
🔀 Decoupled Encoder-Decoder Training: Separates ViT and LLM into independent tasks, eliminating encoder-induced pipeline bubbles.
⚖️ DP Load Balancing: Load-aware data redistribution mitigates sequence-packing imbalance, improving multi-node scaling efficiency. [blog]
🚀 MoE-Native Optimization: Overlapped All2All / activation offload / compute, with further memory reduction beyond upstream Megatron-LM on DeepSeek-V3, Qwen3-MoE, etc.
🔬 Adaptive FP8 Training: End-to-end FP8 for LLMs and VLMs with standard blockwise FP8; optional adaptive mode picks per-operator precision by GEMM shape and efficiency.
🔧 Custom Fused Operators: Fused kernels like FusedDSA for DSA-style models. The TileLang version is open-sourced; a high-performance CUDA version is available on the Baidu Baige platform.
🔁 Flexible Checkpointing: Offline bidirectional Megatron ↔ HuggingFace conversion plus native online HF load/save, with no format barriers across your workflow.
🧰 Versatile Pipelines & Data Tools: Out-of-the-box Pretrain / MidTrain / SFT / LoRA, with built-in dataset format conversion and sequence packing.
🌐 Heterogeneous Hardware: Native support for NVIDIA GPUs and Kunlun XPUs via a minimally-intrusive plugin design.
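To make the DP load-balancing idea concrete, here is a minimal sketch of one common heuristic: greedy longest-first assignment of variable-length sequences to data-parallel ranks. It is not LoongForge's actual algorithm, which this README does not describe. The function name `balance_across_ranks` and the use of raw token count as the cost are assumptions for illustration.

```python
import heapq


def balance_across_ranks(seq_lens: list[int], num_ranks: int) -> list[list[int]]:
    """Assign sequence indices to ranks so per-rank token totals stay close.

    Greedy longest-first heuristic (hypothetical helper): visit sequences
    from longest to shortest and give each one to the currently lightest rank.
    """
    if num_ranks <= 0:
        raise ValueError("num_ranks must be positive")
    # Min-heap of (total_tokens, rank_id); the lightest rank is popped first.
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment: list[list[int]] = [[] for _ in range(num_ranks)]
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    for idx in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + seq_lens[idx], rank))
    return assignment


if __name__ == "__main__":
    lens = [8192, 512, 4096, 1024, 2048, 6144, 256, 3072]
    buckets = balance_across_ranks(lens, num_ranks=4)
    for rank, idxs in enumerate(buckets):
        # Print each rank, its sequence indices, and its total token count.
        print(rank, idxs, sum(lens[i] for i in idxs))
```

With these example lengths, the heaviest rank carries 8192 tokens, whereas a naive round-robin split would put 10240 tokens on one rank. The slowest rank sets the step time, so a lower maximum load generally means less idle time on the other ranks.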

📖 Deep-dive: [LLM features] · [VLM features]
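The blockwise FP8 scheme mentioned in the feature list can be illustrated with a small sketch of per-block scale computation. This is a generic illustration, not LoongForge's kernel: the block size of 128 and the helper names are assumptions, and the code stops short of rounding values to the FP8 grid. The only format fact it relies on is that the largest finite FP8 E4M3 value is 448.

```python
E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


def blockwise_scales(values: list[float], block_size: int = 128) -> list[float]:
    """Return one scale per block so that value / scale fits in [-448, 448]."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # An all-zero block gets scale 1.0 to avoid dividing by zero later.
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_blocks(values: list[float], scales: list[float], block_size: int = 128) -> list[float]:
    """Divide each value by its block's scale; a real kernel would then cast to FP8."""
    return [v / scales[i // block_size] for i, v in enumerate(values)]


if __name__ == "__main__":
    data = [1.0, -2.0, 0.5, 896.0]
    s = blockwise_scales(data, block_size=2)
    print(s)
    print(scale_blocks(data, s, block_size=2))
```

Using one scale per small block rather than one per tensor keeps a single outlier from squeezing every other value into the low end of the FP8 range.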

Measured on v0.1.1 across LLM, VLM, VLA and DIT workloads against mainstream open-source training baselines:

| Model | Type | Baseline | Setup | Speedup |
|---|---|---|---|---|
| Qwen3-30B-A3B | MoE | Megatron-LM† | 32 × A800‡ · GBS 1024 · 32K | 1.16× |
| DeepSeek-V3.2 Lite § | MoE + DSA | Megatron-LM† | Reduced-layer · GBS 128 · 8K | 5.04× |
| Qwen3-VL-30B-A3B | VLM | VeOmni† | 32 × A800‡ · GBS 128 · 32K | 1.45× |
| GR00T N1.6 | VLA | LeRobot† | 8 × A800‡ · GBS 128 · 224×224 | 2.31× |
| Pi0.5 | VLA | OpenPI† | 8 × A800‡ · GBS 112 · 224×224 | 1.65× |
| Wan2.2 | DIT | DiffSynth† | 8 × A800‡ · 480×832×49 | 2.16× |

§ Due to test-bed scale limits, DeepSeek-V3.2 was validated separately on a reduced-layer configuration. LoongForge's DSA CUDA kernel optimizations still deliver ~5× speedup over Megatron-LM and reach a 64K sequence length (the baseline OOMs beyond 8K).

† Numbers reflect baseline and LoongForge versions at the time of measurement, and may evolve as implementations change.

‡ Validation on additional hardware is rolling out in upcoming releases.
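The speedup factors in the table can be read two other ways: as a percentage throughput gain, and as a reduction in wall-clock time for a fixed workload. The sketch below shows the conversion. It assumes speedup means baseline time divided by LoongForge time, which this README does not state explicitly; the helper names are illustrative.

```python
def throughput_gain_pct(speedup: float) -> float:
    """Percentage increase in throughput for a given speedup factor."""
    return (speedup - 1.0) * 100.0


def time_saved_pct(speedup: float) -> float:
    """Percentage of baseline wall-clock time saved on a fixed workload."""
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    # Speedup factors copied from the benchmark table.
    results = [
        ("Qwen3-30B-A3B", 1.16),
        ("DeepSeek-V3.2 Lite", 5.04),
        ("Qwen3-VL-30B-A3B", 1.45),
        ("GR00T N1.6", 2.31),
        ("Pi0.5", 1.65),
        ("Wan2.2", 2.16),
    ]
    for name, s in results:
        print(f"{name}: +{throughput_gain_pct(s):.0f}% throughput, "
              f"{time_saved_pct(s):.0f}% less time")
```

Under that assumption, the 2.16× figure for Wan2.2 matches the 116% acceleration reported in the news list, and corresponds to roughly 54% less wall-clock time for the same amount of training.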

LLaVA-OneVision-2.0: Next-generation multimodal model, with new VideoCaption and Spatial datasets.
LLaVA-OneVision-1.5: Fully open framework for democratized multimodal training.
Qianfan-VL: Domain-Enhanced Vision-Language Models for Enterprise, 3B to 70B parameters.

LoongForge supports a broad range of state-of-the-art models across LLM, VLM, diffusion, and VLA.

| Modality | Architectures | Models |
|---|---|---|
| LLM | DeepSeek-V2 | deepseek-v2-lite, deepseek-v2 |
| | DeepSeek-V3 | deepseek-v3, deepseek-v32 |
| | LLaMA2 | llama2-7b, llama2-13b, llama2-70b |
| | LLaMA3 | llama3-8b, llama3-70b |
| | LLaMA3.1 | llama3.1-8b, llama3.1-70b, llama3.1-405b |
| | Qwen | qwen-1.8b → qwen-72b |
| | Qwen1.5 | qwen1.5-0.5b → qwen1.5-72b |
| | Qwen2 | qwen2-0.5b → qwen2-72b |
| | Qwen2.5 | qwen2.5-0.5b → qwen2.5-72b |
| | Qwen3 | qwen3-0.6b → qwen3-480b-a35b, qwen3-coder-30b-a3b |
| | Qwen3-Next | qwen3-next-80b-a3b |
| | MiniMax | minimax-m2.1, minimax-m2.5, minimax-m2.7 |
| | MIMO | mimo-7b |
| | GLM | glm5 |
| VLM | Qwen2.5-VL | qwen2.5-vl-3b → qwen2.5-vl-72b |
| | Qwen3-VL | qwen3-vl-30b-a3b, qwen3-vl-235b-a22b |
| | Qwen3.5 | qwen3.5-0.8b → qwen3.5-397b-a17b |
| | Qwen3.6 | qwen3.6-27b, qwen3.6-35b-a3b |
| | Kimi-K2.5 | kimi-k2.5, kimi-k2.6 |
| | ERNIE4.5-VL | ernie4.5vl-28b-a3b |
| | LLaVA-OneVision-1.5 | llava-onevision-1.5-4b |
| | InternVL2.5 | internvl2.5-8b → internvl2.5-78b |
| | InternVL3.5 | internvl3.5-8b → internvl3.5-241b-a28b |
| | CustomCombinedModel | Flexible ViT + LLM backbone configuration (…) |

Diffusion

VLA

Model Support

LLM / VLM: ongoing validation and release of new models (e.g., DeepSeek-V4)
Embodied AI: expanded WAM coverage (e.g., DreamZero, LingBot VA)

Performance & Scaling

Adopt next-generation techniques introduced with DeepSeek-V4
Advanced MoE load-balancing strategies
Long-sequence training with ChunkPipe scheduling and Context Parallelism
Further diffusion-model acceleration (e.g., WAN)
INT4 quantization-aware training
MTP (Multi-Token Prediction) scaling for speculative decoding

📁 Directory tree

LoongForge/
├── loongforge/                   # Core training framework
│   ├── train/                    # Training entry points & trainers
│   │   ├── pretrain/             #   Pretrain (LLM, VLM)
│   │   ├── sft/                  #   SFT (LLM, VLM, InternVL, ERNIE)
│   │   ├── diffusion/            #   Diffusion (WAN)
│   │   └── embodied/             #   Embodied AI (Pi0.5, GR00T)
│   ├── models/                   # Unified model abstractions
│   │   ├── foundation/           #   LLM backbones (LLaMA, Qwen, DeepSeek, ...)
│   │   ├── encoder/              #   Vision encoders (ViT, Qwen-VL, InternVL, ...)
│   │   ├── omni_models/          #   Multi-modal composition
│   │   ├── diffusion/            #   Diffusion models (WAN)
│   │   ├── embodied/             #   Embodied models (Pi0.5, GR00T)
│   │   └── common/               #   Shared layers and utilities
│   ├── data/                     # Data pipelines (multi-modal, video, DP balance)
│   ├── tokenizer/                # Tokenizers
│   └── utils/                    # Config map, constants, etc.
├── third_party/Loong-Megatron/   # Patched Megatron-LM (git submodule)
├── configs/                      # Hydra YAML configs (models, data)
├── examples/                     # GPU launch scripts
├── examples_xpu/                 # Kunlun XPU launch scripts
├── tools/                        # Checkpoint conversion, data preprocessing
├── ops/                          # Custom fused operators (incl. open-sourced TileLang)
├── patches/                      # TransformerEngine patches
├── docker/                       # Dockerfiles (GPU & XPU)
├── tests/                        # E2E test suite (YAML-driven)
└── docs/                         # Documentation

We warmly welcome community contributions — bug reports, feature proposals, and PRs alike. Please read our Contributing Guidelines before submitting.

LoongForge is released under the Apache License 2.0. Some files are derived from third-party open-source projects; please refer to the specific file headers for their respective copyright and attribution.

@software{LoongForge2026,
  title  = {LoongForge: A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models},
  author = {{The LoongForge Authors}},
  year   = {2026},
  url    = {https://github.com/baidu-baige/LoongForge}
}

LoongForge is built upon NVIDIA's Megatron-LM. We also drew inspiration from several excellent open-source projects, including but not limited to HuggingFace Transformers, LLaMA-Factory, and Megatron-Bridge. We sincerely thank these communities for their outstanding contributions.

Open a GitHub issue for questions, feedback, or feature requests. You can also join our Slack community or scan the WeChat QR code below to join our developer community.
