Miles: A PyTorch-Native Stack for Large-Scale LLM RL Post-Training

TL;DR

Miles is RadixArk’s open-source framework for large-scale LLM RL post-training. It composes SGLang for rollout, NVIDIA Megatron-LM for training, Ray orchestration, and PyTorch-native extensibility behind a small, pluggable trainer. Unified low-precision recipes, MoE-aware rollout/training alignment, fast NVIDIA NCCL/RDMA weight synchronization, observability, and fault tolerance are built in, making frontier-scale LLM RL easier to build, reproduce, and operate.

Why Miles?

Reinforcement learning has become a central part of post-training large language models. But as models become larger, transition from dense to mixture-of-experts (MoE), and run across more distributed and specialized hardware (e.g. NVIDIA Blackwell and Hopper series), RL post-training is no longer just a training loop. It is a distributed systems problem.

A modern LLM RL framework needs to coordinate several moving pieces:

Rollout workers must generate samples at high throughput.
Trainers must consume those samples efficiently and compute stable policy updates.
The rollout policy and training policy must stay synchronized.
Large MoE models introduce routing behavior that must remain aligned across rollout and training.
Low-precision recipes need to work consistently across the full pipeline.
Long-running jobs need observability, checkpointing, and fault tolerance from the start.

Miles was built for this setting.

Miles is RadixArk’s open-source reinforcement learning framework for LLM post-training. It is built natively on SGLang for high-throughput rollout, integrates deeply with Megatron-LM for scalable training, uses Ray to orchestrate the distributed system, and keeps PyTorch as the common programming and numerical layer throughout the stack.

The goal is simple: make large-scale LLM RL training more composable, reproducible, and easier to scale, while keeping the core trainer small enough for researchers and infrastructure teams to customize.

The Miles Architecture

Miles follows a small-core, many-edges philosophy.

The core training loop is intentionally compact. The pieces that users most often want to change — rollout logic, reward computation, loss functions, sample filtering, metrics, and training-loop hooks — are attached at launch time through user-supplied Python modules. This lets teams adapt the system to new algorithms and production constraints without forking the framework.
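As a rough illustration of attaching user code at launch time, the sketch below resolves a string spec to a Python callable with `importlib`. The `module:attribute` spec format, the `load_callable` helper, and the extension-point names are assumptions made for this example, not Miles' documented interface; standard-library functions stand in for user-supplied modules.

```python
import importlib
from typing import Callable


def load_callable(spec: str) -> Callable:
    """Resolve a 'package.module:attribute' string to a Python callable.

    The spec format is an assumption for this sketch, not Miles' own syntax.
    """
    module_name, sep, attr_name = spec.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    target = getattr(module, attr_name)
    if not callable(target):
        raise TypeError(f"{spec!r} does not resolve to a callable")
    return target


# Hypothetical launch-time configuration: extension-point name -> spec string.
# Standard-library functions stand in for user-supplied code.
extension_specs = {
    "reward_fn": "math:fsum",
    "sample_filter": "builtins:any",
}
extensions = {name: load_callable(spec) for name, spec in extension_specs.items()}
```

Resolving the callables once at startup keeps the core loop unaware of where the user code lives, which is the property that makes forking unnecessary.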

Underneath that small core, Miles composes four major systems:

SGLang for high-throughput rollout generation.
Megatron-LM for scalable distributed training.
Ray for cluster orchestration, actor lifecycle, scheduling, and supervision.
PyTorch for models, autograd, distributed primitives, dtype support, extensibility, and profiling.

This composition is important. RL post-training requires generation and training to work together, but the two phases have very different performance profiles: rollout is memory-bandwidth-bound (KV-cache and parameter reads dominate during decoding), while training is compute-bound and communication-heavy. Weight synchronization, sample transfer, checkpoint conversion, routing consistency, and low-precision behavior all need to be handled carefully across the boundary.

The rest of this post walks through how Miles handles each piece of that boundary — orchestration with Ray, scaling with Megatron-LM, extensibility with PyTorch, and what comes out of the box.

Ray: Orchestrating Long-Running RL Jobs

Miles is built directly on the Ray distributed runtime. In a Miles run, every long-lived process is represented as a Ray actor: trainer ranks, SGLang rollout servers, routing proxies, and asynchronous rollout workers all live inside Ray’s actor model.

This gives Miles a natural foundation for cluster-scale RL workloads.

Placing workers on GPUs

Miles uses Ray’s GPU-aware scheduler and placement groups for actor placement, supporting both disaggregated layouts (rollout and training on separate nodes) and colocated layouts (rollout and training on the same nodes) through launch-time Ray placement specs. Process placement should also be rack-aware: this makes careful colocation and the reservation of spare nodes possible, and it is key to error isolation, because pinpointing a fault within a rack (for example, telling a single bad GPU apart from a rack-wide issue) is not always straightforward.
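To make the two layouts concrete, here is a minimal sketch that builds per-role resource bundles in the dictionary shape Ray placement groups accept. The `build_bundles` helper, the role names, and the one-CPU-per-GPU sizing are hypothetical choices for illustration; Miles' actual placement specs may look different.

```python
from typing import Dict, List

Bundle = Dict[str, int]


def build_bundles(layout: str, train_gpus: int, rollout_gpus: int) -> Dict[str, List[Bundle]]:
    """Return one bundle list per role for a colocated or disaggregated layout."""

    def gpu_bundles(count: int) -> List[Bundle]:
        # One GPU and one CPU per bundle (an arbitrary choice for this sketch).
        return [{"GPU": 1, "CPU": 1} for _ in range(count)]

    if layout == "disaggregated":
        # Training and rollout get separate GPU pools.
        return {"train": gpu_bundles(train_gpus), "rollout": gpu_bundles(rollout_gpus)}
    if layout == "colocated":
        # Both phases share one pool, sized to the larger of the two needs.
        return {"shared": gpu_bundles(max(train_gpus, rollout_gpus))}
    raise ValueError(f"unknown layout: {layout!r}")


disaggregated = build_bundles("disaggregated", train_gpus=8, rollout_gpus=4)
colocated = build_bundles("colocated", train_gpus=8, rollout_gpus=8)

# With Ray installed, each list could be handed to
# ray.util.placement_group(bundles=..., strategy="PACK"); not executed here.
```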

Moving data across the RL pipeline

Prompts, samples, and updated weights cycle continuously between rollout actors and trainer ranks, and Miles uses Ray actors and tasks to coordinate that flow. For bulk weight transfer, Ray handles the control path while the tensor bytes move over dedicated NCCL/RDMA channels, giving Miles both Ray-level programmability and a fast path for large data.

Supervising long-running jobs

Because a Miles run is a Ray job end-to-end, it inherits Ray’s operator surface — job submission, worker supervision, log aggregation, and dashboard visibility — without bolt-on infrastructure. With fault tolerance enabled, Miles can recover failed ranks and keep week-long workloads moving on top of the same Ray substrate.

Supporting fully asynchronous RL

Because Ray actors are persistent, hold their own state, and are scheduled independently, Miles can run a fully asynchronous mode in which rollout and training no longer block on each other — rollout actors continuously stream samples into a queue that the trainer drains at its own pace.
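The producer/consumer shape of this mode can be sketched with plain Python threads and a bounded queue. This is a stand-in for Ray actors, not Miles code: the function names, the sample dictionaries, and the queue size are all assumptions made for illustration.

```python
import queue
import threading
from typing import Dict, List


def rollout_worker(worker_id: int, num_samples: int, out_queue: "queue.Queue") -> None:
    """Stand-in for a rollout actor: streams samples without waiting for training."""
    for index in range(num_samples):
        out_queue.put({"worker": worker_id, "index": index})


def trainer(in_queue: "queue.Queue", total: int, batch_size: int) -> List[List[Dict[str, int]]]:
    """Stand-in for the trainer: drains the queue at its own pace, forming batches."""
    batches, batch = [], []
    for _ in range(total):
        batch.append(in_queue.get())
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)  # final partial batch
    return batches


def run_async(num_workers: int = 2, samples_per_worker: int = 4, batch_size: int = 3):
    # A bounded queue applies back-pressure if rollout runs far ahead of training.
    sample_queue: "queue.Queue" = queue.Queue(maxsize=4)
    workers = [
        threading.Thread(target=rollout_worker, args=(w, samples_per_worker, sample_queue))
        for w in range(num_workers)
    ]
    for thread in workers:
        thread.start()
    batches = trainer(sample_queue, num_workers * samples_per_worker, batch_size)
    for thread in workers:
        thread.join()
    return batches


batches = run_async()
```

The order in which samples from different workers interleave is not deterministic, which is exactly the point: neither side waits for the other to finish an iteration.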

Megatron-LM: Scaling the Training Backend

Miles uses Megatron-LM as its production training backend, plugging directly into Megatron’s argument parser, model-construction pipeline, training loop, parallelism primitives, and distributed checkpoint format rather than wrapping it as a black-box library. That gives Miles the infrastructure needed for frontier-scale dense and MoE training while preserving a clean user-facing workflow.

One argument surface

Megatron-LM already exposes a large distributed-training configuration surface — sequence length, rotary embeddings, grouped GEMM, all flavors of parallelism, optimizer settings, activation checkpointing, and more — and Miles reuses it directly rather than wrapping or re-declaring it. Users configure a Miles run through one launch script that combines Miles-specific options with standard Megatron options, avoiding duplicated configuration layers and keeping the training setup close to upstream Megatron behavior.

Model specs instead of long-lived forks

Frontier architectures change quickly, with new attention blocks, routing mechanisms, and expert layouts arriving across model families, so Miles handles them through plug-in model specs — small spec files that insert custom PyTorch components (for example, a gated attention-output module, a Gated-Delta-Net block, or a model-specific MoE router) directly into Megatron’s model pipeline. This lets Miles support new architectures — for example DeepSeek-V3/V4, GLM-4.7, and Qwen3 MoE variants — without maintaining a long-lived Megatron fork that constantly diverges from upstream.

Parallelism-aware checkpointing

Miles uses Megatron’s parallelism-aware distributed checkpoint format, so a model can be converted from Hugging Face once and then loaded across different tensor / pipeline / context / expert parallel configurations without re-converting weights from scratch. For teams operating large training jobs, this means checkpoint conversion and parallelism changes don’t become a separate engineering project every time the model or cluster shape changes.

Extending training without patching the backend

Miles exposes hooks at well-defined points in the training loop — after model initialization, before log-probability computation, and before each training step — so users can add auxiliary losses, custom metrics, sample-level diagnostics, clipping rules, or algorithm-specific behavior without editing Megatron internals. The design goal is simple: keep the backend powerful, but keep user customization outside it.
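A hook mechanism of this kind can be pictured as a small registry keyed by hook point. The sketch below is an illustration only: the `HookRegistry` class, the hook-point identifiers, and the shape of the `state` dictionary are hypothetical and do not reflect Miles' real hook signatures.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical identifiers for the three hook points named in the text.
HOOK_POINTS = ("after_model_init", "before_log_prob", "before_train_step")

Hook = Callable[[Dict[str, Any]], None]


class HookRegistry:
    """Maps each hook point to the user callbacks registered for it."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = defaultdict(list)

    def register(self, point: str, fn: Hook) -> None:
        if point not in HOOK_POINTS:
            raise ValueError(f"unknown hook point: {point!r}")
        self._hooks[point].append(fn)

    def run(self, point: str, state: Dict[str, Any]) -> None:
        if point not in HOOK_POINTS:
            raise ValueError(f"unknown hook point: {point!r}")
        # Callbacks run in registration order and may mutate the shared state.
        for fn in self._hooks[point]:
            fn(state)


def count_long_samples(state: Dict[str, Any]) -> None:
    """Example diagnostic: count responses longer than 512 tokens."""
    state["metrics"]["long_samples"] = sum(1 for n in state["response_lengths"] if n > 512)


registry = HookRegistry()
registry.register("before_train_step", count_long_samples)

state = {"response_lengths": [100, 600, 1024], "metrics": {}}
registry.run("before_train_step", state)
```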

PyTorch: The Common Layer for Models, Numerics, and Extensibility

PyTorch is the common programming model inside Miles: model components are regular torch.nn.Module instances, losses are standard autograd graphs, and mixed precision, gradient checkpointing, distributed primitives, and profiling all stay inside familiar PyTorch workflows. This matters because LLM RL post-training changes fast: teams need to add new rewards, losses, routers, model modules, and debugging tools without learning a new abstraction each time.

PyTorch-native model extensibility

Miles’ plug-in model-spec mechanism is built around torch.nn.Module, so supporting a new architecture means writing the new component as ordinary PyTorch code and connecting it into Megatron’s model pipeline; autograd, mixed precision, gradient checkpointing, and module lifecycle all keep working the way PyTorch users expect. Teams don’t have to translate the model into a separate intermediate abstraction to get it running on Miles.

PyTorch-native RL customization

The same principle applies to RL algorithms: rollout functions, rewards, loss functions, sample filters, metrics, and training-loop hooks are all customized through Python modules provided at launch time, using standard PyTorch operations that compose with the rest of the training graph. A team can start from an existing recipe and replace the reward, add an auxiliary loss, change sample filtering, or instrument new diagnostics without rewriting the trainer.
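For a sense of how small such customizations can be, the sketch below pairs a rule-based reward with a sample filter. It uses plain Python rather than PyTorch tensors to stay dependency-free, and the function names, signatures, and the "drop groups with uniform rewards" rule are assumptions for illustration, not recipes shipped with Miles.

```python
from typing import Dict, List, Tuple


def exact_match_reward(response: str, reference: str) -> float:
    """Rule-based reward: 1.0 when the response matches the reference, else 0.0."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def keep_group(rewards: List[float]) -> bool:
    """Sample filter: keep a prompt's group only if its rewards differ.

    When every response to a prompt gets the same reward, a group-relative
    objective has nothing to compare, so this sketch drops such groups.
    """
    return max(rewards) > min(rewards)


# Toy data: prompt -> (sampled responses, reference answer).
groups: Dict[str, Tuple[List[str], str]] = {
    "prompt_a": (["4", "5", " 4 "], "4"),
    "prompt_b": (["7", "8"], "9"),
}

scored = {
    prompt: [exact_match_reward(response, reference) for response in responses]
    for prompt, (responses, reference) in groups.items()
}
kept = [prompt for prompt, rewards in scored.items() if keep_group(rewards)]
```

Here `prompt_b` is filtered out because both of its responses are wrong and therefore score identically.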

Low-precision recipes across the pipeline

Miles builds its low-precision pipeline on PyTorch’s dtype system, with BF16, FP8, MXFP8, and INT4-QAT recipes that span training and rollout rather than living as isolated backend-only features. This consistency matters for RL because the policy used to generate samples and the policy used to compute training log probabilities must stay aligned, and Miles is designed to make those numerical choices explicit and reproducible.
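One way to check that alignment in practice is to compare the per-token log probabilities reported by the rollout engine with those recomputed by the trainer for the same tokens. The diagnostic below is a hedged sketch: the `logprob_mismatch` function and its metric names are hypothetical, and it uses plain Python floats rather than tensors.

```python
import math
from typing import Dict, Sequence


def logprob_mismatch(
    rollout_logprobs: Sequence[float], train_logprobs: Sequence[float]
) -> Dict[str, float]:
    """Summarize how far the trainer's log-probs drift from the rollout's."""
    if not rollout_logprobs or len(rollout_logprobs) != len(train_logprobs):
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [t - r for r, t in zip(rollout_logprobs, train_logprobs)]
    # exp(train - rollout) is the per-token probability ratio between the two policies.
    ratios = [math.exp(d) for d in diffs]
    return {
        "mean_abs_logprob_diff": sum(abs(d) for d in diffs) / len(diffs),
        "max_abs_logprob_diff": max(abs(d) for d in diffs),
        "mean_ratio": sum(ratios) / len(ratios),
    }


# Identical log-probs: no drift, so the ratio is exactly 1.
aligned = logprob_mismatch([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5])

# A single token where the two sides disagree by 0.5 nats.
drifted = logprob_mismatch([-1.0], [-0.5])
```

Tracking a statistic like this per precision recipe makes numerical divergence visible before it shows up as unstable policy updates.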

Profiling and debugging in familiar tools

Large-scale RL performance issues can surface anywhere — rollout latency, training compute, collective communication, data movement, weight synchronization, sample filtering, or scheduling — so Miles wires in the PyTorch profiler to capture Chrome traces of training phases for inspection in standard tooling. Combined with Megatron’s PyTorch-based backend and graph-compile paths where supported, this keeps debugging and performance work inside the familiar PyTorch ecosystem.

What Miles Provides Out of the Box

Miles is designed to provide the core systems features needed for large-scale LLM RL post-training:

Rollout and training integration: Connects SGLang rollout with Megatron-LM training, with both disaggregated and colocated execution to fit different GPU budgets and utilization targets.
Asynchronous execution: Fully async mode decouples rollout from training: rollout actors stream samples continuously into a queue that the trainer drains at its own pace, eliminating the per-iteration blocking between the two phases.
Fast weight synchronization: After each training update, fresh weights flow to rollout workers over dedicated NCCL/RDMA channels, with Ray handling only the control path so bulk tensor bytes stay off the Python data path.
MoE-aware rollout/training alignment: Rollout Routing Replay preserves routing decisions across the rollout/training boundary, reducing the trainer-vs-rollout routing mismatch that would otherwise destabilize MoE RL.
Low-precision support: A unified BF16 / FP8 / MXFP8 / INT4-QAT pipeline designed as part of the end-to-end RL stack rather than as isolated training-only recipes.
LoRA across rollout and training: LoRA is supported in both rollout and training paths, enabling parameter-efficient post-training that reduces cost and speeds up iteration on large base models.
Fault tolerance and observability: Ray’s job and actor model provide supervision, log aggregation, and dashboard visibility, while rank-level fault tolerance keeps week-long training runs moving; PyTorch profiler integration covers the training-level view.
Broad model and hardware support: Miles ships ready-to-run recipes for frontier and open-source models including DeepSeek-V4, Kimi K2.5 / K2.6, GLM-5 / 5.1, and Qwen3.5 / 3.6, with support for NVIDIA flagship Hopper / Blackwell GPUs.
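The routing-alignment feature in the list above can be pictured with a toy top-k router. The sketch assumes that replay means recording the expert indices chosen during rollout and reusing them in the trainer instead of recomputing top-k; that reading, along with the function names and the logit values, is an assumption for illustration rather than a description of Miles' implementation.

```python
from typing import List, Optional, Sequence


def top_k_experts(gate_logits: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest gate logits, in ascending index order."""
    ranked = sorted(range(len(gate_logits)), key=lambda e: gate_logits[e], reverse=True)
    return sorted(ranked[:k])


def route(
    gate_logits: Sequence[float], k: int, replay: Optional[Sequence[int]] = None
) -> List[int]:
    """Pick experts for one token, reusing a recorded choice when one is given."""
    if replay is not None:
        return list(replay)
    return top_k_experts(gate_logits, k)


# Rollout side: gate logits as computed by the inference engine.
rollout_logits = [0.90, 1.00, 0.10, 0.99]
recorded = route(rollout_logits, k=2)

# Trainer side: small numerical differences reorder experts 0 and 3.
train_logits = [1.00, 1.01, 0.10, 0.98]
without_replay = route(train_logits, k=2)
with_replay = route(train_logits, k=2, replay=recorded)
```

Without replay the trainer would send this token to experts 0 and 1, although it was generated using experts 1 and 3; with replay both sides agree.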

A Small Core with Many Extension Points

One of Miles’ most important design choices is that the core trainer stays small.

Instead of forcing users to fork the framework for every new algorithm or model family, Miles exposes explicit extension points:

Rollout functions for custom generation behavior.
Reward functions for task-specific supervision.
Loss functions for new RL objectives.
Sample filters for data selection and rejection.
Training hooks for metrics, diagnostics, auxiliary losses, and custom update logic.
Model specs for architecture-specific modules.

These extension points make Miles useful across a range of post-training workflows: classic RLHF-style training, rule-based reward training, code and agentic tasks, MoE post-training, low-precision experiments, and production pipelines that need custom observability or safety checks.

In short, Miles makes the systems-level decisions — placement, weight sync, fault tolerance, low-precision recipes — so that user code can focus on algorithm and product logic.

Looking Ahead

LLM post-training is moving quickly toward larger models, longer contexts, more MoE, and more asynchronous, agentic, system-intensive RL pipelines, and Miles is built for that trajectory. By composing SGLang, Ray, Megatron-LM, and PyTorch behind a small pluggable trainer, it gives researchers and infrastructure teams a PyTorch-native path from algorithm experimentation to large-scale RL runs. That is why we are open-sourcing Miles: to make frontier-scale LLM RL post-training easier to reproduce, extend, and operate.

source & further reading

pytorch.org (original article)