
LoongForge: A high-performance training framework for LLM, VLM, DIT, and VLA models

Baidu Baige has released LoongForge, an open-source training framework for large language models, vision-language models, diffusion transformers, and vision-language-action models, claiming up to 5.04× training speedup over mainstream baselines. The framework supports both NVIDIA GPUs and Kunlun XPUs, and has already powered production training for enterprise customers in education, computer vision, and embodied AI, delivering 30% to 50% speedup on customer workloads. LoongForge, previously developed as AIAK-Training-LLM, is now publicly available on GitHub and has been used to train models including LLaVA-OneVision-2.0 and Wan 2.2.

7 min read · Published May 27, 2026


A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models.

🚀 Up to 5.04× training speedup · 🌐 Native NVIDIA GPU & Kunlun XPU support

🐉 LoongForge is part of Baidu Baige's Loong open-source series, named after the traditional Chinese loong boat (龙舟, "dragon boat"), a symbol of coordinated power and forward momentum.

LoongForge is a unified training framework for LLMs, VLMs, VLAs, and diffusion models, covering pre-training, continued pre-training, and SFT. Built upon Megatron-LM with deep systemic enhancements across model coverage, training performance, and hardware support, it delivers significant speedups over mainstream open-source baselines.

Before going open-source, LoongForge was developed as AIAK-Training-LLM, Baidu Baige's training acceleration stack. It has supported production training for enterprise customers across education, computer vision, and embodied AI, typically delivering a 30% to 50% speedup over customer baselines, with the largest production runs reaching 5,000+ XPUs.

  • [2026/05] ⚡ Accelerated Wan 2.2 training by 116%, and added CP and data packing support.
  • [2026/05] ✨ Added training support for Kimi K2.5 / K2.6, and introduced INT4 / NVFP4 PTQ.
  • [2026/05] 🎉 v0.1.0: first official tagged release of LoongForge.
  • [2026/05] 🌟 Powered the training and public release of LLaVA-OneVision-2.0.
  • [2026/05] 🤖 Expanded VLA coverage with GR00T N1.6; 60%+ speedup on Pi0.5 and GR00T training.
  • [2026/04] 🧩 Added training support for MiniMax-M2.7 on both NVIDIA GPU and Kunlun XPU.
  • [2026/04] 🚀 LoongForge source code publicly available on GitHub. [blog]
  • [2025/10] 🌟 Powered the training and public release of LLaVA-OneVision-1.5 under AIAK-Training-LLM, the predecessor of LoongForge. [blog]

See the full documentation for installation, tutorials, and advanced usage (English · Chinese).

1. Install via Docker (prebuilt images coming soon) or source build:

  • NVIDIA GPU: Installation Guide
  • Kunlun XPU: Installation Guide

2. Launch your first training run by following a tutorial for your target hardware and modality:

  • NVIDIA GPU: LLM · VLM · VLA · Diffusion (WAN)
  • Kunlun XPU: Kunlun XPU Tutorials

3. Explore: browse configs/models/ and examples/ / examples_xpu/ for ready-to-run scripts.
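As a starting point for step 3, a short script can enumerate the YAML configs in a local checkout. This is a minimal sketch: it assumes the repository has been cloned locally and that configs use the .yaml or .yml extension. The helper name `list_configs` and the path in the usage lines are illustrative and are not part of LoongForge.

```python
from pathlib import Path


def list_configs(root: str) -> list[str]:
    """Return sorted relative paths of YAML files under `root`.

    Hypothetical helper for browsing a local checkout; it only walks the
    filesystem and does not import or depend on LoongForge.
    """
    base = Path(root)
    if not base.is_dir():
        return []
    found = [
        p.relative_to(base).as_posix()
        for p in base.rglob("*")
        if p.is_file() and p.suffix in {".yaml", ".yml"}
    ]
    return sorted(found)


if __name__ == "__main__":
    # Assumed location of a local clone; adjust to your own path.
    for name in list_configs("LoongForge/configs/models"):
        print(name)
```

The function returns an empty list when the directory does not exist, so it can be pointed at a path before the clone is in place without raising.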

🧩 Flexible Multi-Modal Compositionβ€” Configuration-driven assembly of VLMs from interchangeable ViT and LLM components.⚑ Heterogeneous Parallelismβ€” Independent TP / DP / recompute per model component (e.g., ViT vs. LLM) for optimal throughput and memory. [blog]πŸ”€ Decoupled Encoder-Decoder Trainingβ€” Separates ViT and LLM into independent tasks, eliminating encoder-induced pipeline bubbles.βš–οΈ DP Load Balancingβ€” Load-aware data redistribution mitigates sequence-packing imbalance, improving multi-node scaling efficiency. [blog]πŸš€ MoE-Native Optimizationβ€” Overlapped All2All / activation offload / compute, with** further memory reductionbeyond upstream Megatron-LM on DeepSeek-V3, Qwen3-MoE, etc.πŸ”¬ Adaptive FP8 Trainingβ€” End-to-end FP8 for LLMs and VLMs with standard blockwise FP8**; optional** adaptivemode picks per-operator precision by GEMM shape and efficiency.πŸ”§ Custom Fused Operatorsβ€” Fused kernels like FusedDSAfor DSA-style models β€” TileLang version open-sourced, high-performance CUDA version available on Baidu Baige platform.πŸ” Flexible Checkpointingβ€” Offline bidirectional Megatron ↔ HuggingFaceconversion plus native online HF load/save β€” no format barriers across your workflow.🧰 Versatile Pipelines & Data Toolsβ€” Out-of-the-box Pretrain / MidTrain / SFT / LoRA**, with built-in dataset format conversion and sequence packing.🌐 Heterogeneous Hardwareβ€” Native support for** NVIDIA GPUsand Kunlun XPUs**via a minimally-intrusive plugin design.

📖 Deep-dive: [LLM features] · [VLM features]
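The heterogeneous-parallelism feature lets the ViT and the LLM use different TP and DP degrees on the same set of devices. As a rough illustration of the bookkeeping, the sketch below derives the data-parallel size from the world size and the model-parallel degrees, using the usual Megatron-style relation world_size = TP × PP × CP × DP. The `ComponentParallelism` class, its field names, and the example degrees are hypothetical and are not LoongForge configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentParallelism:
    """Parallel degrees for one model component (hypothetical structure)."""

    name: str
    tp: int = 1  # tensor-parallel degree
    pp: int = 1  # pipeline-parallel degree
    cp: int = 1  # context-parallel degree

    def dp_size(self, world_size: int) -> int:
        """Data-parallel size implied by world_size = tp * pp * cp * dp."""
        model_parallel = self.tp * self.pp * self.cp
        if model_parallel <= 0 or world_size % model_parallel != 0:
            raise ValueError(
                f"{self.name}: world size {world_size} is not divisible by "
                f"tp*pp*cp = {model_parallel}"
            )
        return world_size // model_parallel


if __name__ == "__main__":
    world_size = 32  # example device count, chosen for illustration
    vit = ComponentParallelism("vit", tp=1)
    llm = ComponentParallelism("llm", tp=4, pp=2)
    print(vit.name, vit.dp_size(world_size))
    print(llm.name, llm.dp_size(world_size))
```

In this example the small encoder runs as 32 data-parallel replicas while the LLM runs as 4 replicas of an 8-way model-parallel group, which is the kind of per-component split the feature describes.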

Measured on v0.1.1 across LLM, VLM, VLA and DIT workloads against mainstream open-source training baselines:

| Model | Type | Baseline | Configuration | Speedup |
|---|---|---|---|---|
| Qwen3-30B-A3B | MoE | Megatron-LM† | 32 × A800‡ · GBS 1024 · 32K | 1.16× |
| DeepSeek-V3.2 Lite§ | MoE + DSA | Megatron-LM† | Reduced-layer · GBS 128 · 8K | 5.04× |
| Qwen3-VL-30B-A3B | VLM | VeOmni† | 32 × A800‡ · GBS 128 · 32K | 1.45× |
| GR00T N1.6 | VLA | LeRobot† | 8 × A800‡ · GBS 128 · 224×224 | 2.31× |
| Pi0.5 | VLA | OpenPI† | 8 × A800‡ · GBS 112 · 224×224 | 1.65× |
| Wan2.2 | DIT | DiffSynth† | 8 × A800‡ · 480×832×49 | 2.16× |

§ Due to test-bed scale limits, DeepSeek-V3.2 was validated separately on a reduced-layer configuration. LoongForge's DSA CUDA kernel optimizations still deliver ~5× speedup over Megatron-LM and reach 64K sequence length (the baseline OOMs beyond 8K).

† Numbers reflect baseline and LoongForge versions at the time of measurement, and may evolve as implementations change.

‡ Validation on additional hardware is rolling out in upcoming releases.

  • LLaVA-OneVision-2.0: Next-generation multimodal model, with new VideoCaption and Spatial datasets.
  • LLaVA-OneVision-1.5: Fully open framework for democratized multimodal training.
  • Qianfan-VL: Domain-enhanced vision-language models for enterprise, 3B to 70B parameters.

LoongForge supports a broad range of state-of-the-art models across LLM, VLM, diffusion, and VLA.

| Modality | Architectures | Models |
|---|---|---|
| LLM | DeepSeek-V2 | deepseek-v2-lite, deepseek-v2 |
| | DeepSeek-V3 | deepseek-v3, deepseek-v32 |
| | LLaMA2 | llama2-7b, llama2-13b, llama2-70b |
| | LLaMA3 | llama3-8b, llama3-70b |
| | LLaMA3.1 | llama3.1-8b, llama3.1-70b, llama3.1-405b |
| | Qwen | qwen-1.8b → qwen-72b |
| | Qwen1.5 | qwen1.5-0.5b → qwen1.5-72b |
| | Qwen2 | qwen2-0.5b → qwen2-72b |
| | Qwen2.5 | qwen2.5-0.5b → qwen2.5-72b |
| | Qwen3 | qwen3-0.6b → qwen3-480b-a35b, qwen3-coder-30b-a3b |
| | Qwen3-Next | qwen3-next-80b-a3b |
| | MiniMax | minimax-m2.1, minimax-m2.5, minimax-m2.7 |
| | MIMO | mimo-7b |
| | GLM | glm5 |
| VLM | Qwen2.5-VL | qwen2.5-vl-3b → qwen2.5-vl-72b |
| | Qwen3-VL | qwen3-vl-30b-a3b, qwen3-vl-235b-a22b |
| | Qwen3.5 | qwen3.5-0.8b → qwen3.5-397b-a17b |
| | Qwen3.6 | qwen3.6-27b, qwen3.6-35b-a3b |
| | Kimi-K2.5 | kimi-k2.5, kimi-k2.6 |
| | ERNIE4.5-VL | ernie4.5vl-28b-a3b |
| | LLaVA-OneVision-1.5 | llava-onevision-1.5-4b |
| | InternVL2.5 | internvl2.5-8b → internvl2.5-78b |
| | InternVL3.5 | internvl3.5-8b → internvl3.5-241b-a28b |
| | CustomCombinedModel | Flexible ViT + LLM backbone configuration |

Model Support

  • LLM / VLM: ongoing validation and release of new models (e.g., DeepSeek-V4)
  • Embodied AI: expanded WAM coverage (e.g., DreamZero, LingBot VA)

Performance & Scaling

  • Adopt next-generation techniques introduced with DeepSeek-V4
  • Advanced MoE load-balancing strategies
  • Long-sequence training with ChunkPipe scheduling and Context Parallelism
  • Further diffusion-model acceleration (e.g., WAN)
  • INT4 quantization-aware training
  • MTP (Multi-Token Prediction) scaling for speculative decoding

📁 Directory tree

LoongForge/
├── loongforge/                   # Core training framework
│   ├── train/                    # Training entry points & trainers
│   │   ├── pretrain/             #   Pretrain (LLM, VLM)
│   │   ├── sft/                  #   SFT (LLM, VLM, InternVL, ERNIE)
│   │   ├── diffusion/            #   Diffusion (WAN)
│   │   └── embodied/             #   Embodied AI (Pi0.5, GR00T)
│   ├── models/                   # Unified model abstractions
│   │   ├── foundation/           #   LLM backbones (LLaMA, Qwen, DeepSeek, ...)
│   │   ├── encoder/              #   Vision encoders (ViT, Qwen-VL, InternVL, ...)
│   │   ├── omni_models/          #   Multi-modal composition
│   │   ├── diffusion/            #   Diffusion models (WAN)
│   │   ├── embodied/             #   Embodied models (Pi0.5, GR00T)
│   │   └── common/               #   Shared layers and utilities
│   ├── data/                     # Data pipelines (multi-modal, video, DP balance)
│   ├── tokenizer/                # Tokenizers
│   └── utils/                    # Config map, constants, etc.
├── third_party/Loong-Megatron/   # Patched Megatron-LM (git submodule)
├── configs/                      # Hydra YAML configs (models, data)
├── examples/                     # GPU launch scripts
├── examples_xpu/                 # Kunlun XPU launch scripts
├── tools/                        # Checkpoint conversion, data preprocessing
├── ops/                          # Custom fused operators (incl. open-sourced TileLang)
├── patches/                      # TransformerEngine patches
├── docker/                       # Dockerfiles (GPU & XPU)
├── tests/                        # E2E test suite (YAML-driven)
└── docs/                         # Documentation

We warmly welcome community contributions: bug reports, feature proposals, and PRs alike. Please read our Contributing Guidelines before submitting.

LoongForge is released under the Apache License 2.0. Some files are derived from third-party open-source projects; please refer to the specific file headers for their respective copyright and attribution.

@software{LoongForge2026,
  title  = {LoongForge: A modular, scalable, high-performance training framework for LLMs, VLMs, diffusion, and embodied models},
  author = {{The LoongForge Authors}},
  year   = {2026},
  url    = {https://github.com/baidu-baige/LoongForge}
}

LoongForge is built upon NVIDIA's Megatron-LM. We also drew inspiration from several excellent open-source projects, including but not limited to HuggingFace Transformers, LLaMA-Factory, and Megatron-Bridge. We sincerely thank these communities for their outstanding contributions.

Open a GitHub issue for questions, feedback, or feature requests. You can also join our Slack community or scan the WeChat QR code below to join our developer community.
