FeynRL: Don't let systems swallow the algorithm

FeynRL, an algorithm-first framework for post-training and fine-tuning large models, has been released as an open-source tool supporting supervised fine-tuning, preference learning, and reinforcement learning methods. The framework prioritizes clarity and locality of change over built-in features, enabling researchers to implement and modify algorithms without fighting infrastructure while scaling from single-GPU debugging to multi-node distributed runs. The release aims to provide a foundation for developing new post-training methods with community collaboration.

Algorithm-first post-training framework for large models.

"What I cannot create, I do not understand." (Richard Feynman)

FeynRL (pronounced "FineRL") is an algorithm-first framework for post-training and fine-tuning large models. It supports supervised fine-tuning (SFT), preference learning (e.g., DPO), and reinforcement learning (e.g., PPO, GRPO, CISPO, P3O), and is built for researchers and engineers who want to understand, modify, and develop new methods without fighting the infrastructure.

The main goal of FeynRL is simple: make new algorithms easy to implement, easy to debug, and still possible to train at scale. The codebase is designed so that algorithmic logic stays local and systems logic stays explicit, which makes the framework easier to reason about, easier to extend, and more reliable to debug.

FeynRL is a good fit if your goal is not only to run an existing recipe, but to build and test new post-training methods.

- Algorithm-first design: Most method changes stay local. You can add new objectives, rewards, baselines, or update rules without reshaping the full stack.
- Clear separation of concerns: Algorithm code stays algorithmic, and systems code stays systems. That keeps the codebase easier to understand, test, and extend.
- One framework across post-training: SFT, DPO, and RL share the same workflow and configuration system, making comparisons easier and reducing duplicated infrastructure.
- Scales beyond toy settings: Use the same framework for local single-GPU debugging or large multi-node distributed runs.

FeynRL may not be the best fit if your main priority is the largest built-in feature surface out of the box, or if you mainly want a framework already optimized around a narrow workflow and do not expect to modify it much.

There are already several strong open-source frameworks for post-training large models. Many are powerful and feature-rich, but they are often optimized around a narrower set of methods or execution patterns, and can become hard to modify once you want to try something new.

FeynRL was built to make a different trade-off. Instead of optimizing first for the largest feature surface, it optimizes first for clarity, locality of change, and algorithm development. The codebase is structured so that algorithmic ideas are easy to implement and reason about, while the distributed systems layer remains explicit rather than hidden behind heavy abstractions. In practice, implementing a new algorithm typically means writing a single file with its own loss and update logic, not threading changes through the orchestration, rollout, and data layers.

The framework is designed for scale from the start. It supports large-scale training with DeepSpeed, Ray, and vLLM, including sync and async execution modes, adaptive weight synchronization, and multi-node runs. The goal is to make it possible to do both: move fast on algorithms and still run realistic experiments at scale.

This is the first public release, so expect rough edges. We are open-sourcing FeynRL not just as a library, but as a foundation for building new post-training methods with the community.

For a detailed breakdown of the architecture, see the Architecture Overview.
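As a rough illustration of what "a single file with its own loss and update logic" can look like, here is a minimal GRPO-style sketch in plain Python. It is not taken from the FeynRL codebase: the function names, the per-response log-probability inputs, and the default `clip` and `eps` values are all assumptions made for this example, and a real implementation would operate on token-level tensors.

```python
"""Hypothetical single-file algorithm sketch; not FeynRL's actual module layout or API."""
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip=0.2):
    """PPO-style clipped objective over per-response log-probabilities."""
    losses = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        # Take the pessimistic (smaller) objective, then negate to get a loss.
        losses.append(-min(ratio * adv, clipped * adv))
    return mean(losses)
```

The point of the sketch is locality: the reward normalization and the objective live side by side in one place, so swapping the baseline or the clipping rule touches only this file.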
- 🧪 Training paradigms: RL (PPO, GRPO, CISPO, P3O), preference-based learning (DPO), and supervised fine-tuning (SFT)
- 🖥️ Distributed training: Multi-GPU and multi-node via DeepSpeed ZeRO Stage 1/2/3
- 🎲 Rollouts / inference: vLLM-powered rollout engines with tensor parallelism
- 🛰️ Orchestration: Ray for scheduling training and rollout workers across nodes
- 🔀 Training-rollout scheduling: Sync and overlap (async) modes. In overlap mode, rollout generation and training run concurrently on separate GPU pools to reduce idle time, with a configurable staleness budget bounding how off-policy the replay data can drift.
- 🔄 Weight sync: NCCL broadcast (sync mode supports direct/disk fallbacks; async mode is NCCL-only at runtime, with a built-in NCCL watchdog and fail-fast on communicator destruction).
- 🧷 Parameter-efficient fine-tuning: LoRA via PEFT
- 🔢 Mixed-dataset sampling: Configurable multi-dataset sampling with ratios within a single training run
- 📈 Experiment tracking: MLflow and Weights & Biases support
- 🏅 Evaluation: Standalone eval pipeline with vLLM engines

For RL, Ray orchestrates the full training loop: it schedules DeepSpeed training workers and vLLM rollout workers across nodes, and coordinates weight synchronization between them.

In sync mode, each epoch generates all rollouts, trains on them, syncs weights, and repeats. This is fully on-policy and easy to reason about.

In overlap mode (also called async mode), rollout generation and training run concurrently on separate GPU pools so training GPUs don't sit idle waiting for rollouts. Generation is continuous across epoch boundaries and checkpoint saves; the only pauses are brief drains during weight sync, which runs once at the end of every non-final epoch. A configurable staleness budget bounds how off-policy the replay data can drift. Async mode uses NCCL for weight sync; sync mode supports a three-tier NCCL/direct/disk fallback chain.
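One way to picture the staleness budget is a replay buffer that tags every rollout with the policy version that produced it and discards samples that lag too far behind the current weights. The sketch below is only an illustration under that assumption: `Rollout`, `StalenessBoundedBuffer`, `max_staleness`, and `sample_fresh` are hypothetical names, not FeynRL's API, and the framework's actual policy for handling stale data may differ.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    response: str
    policy_version: int  # weight version that generated this sample


class StalenessBoundedBuffer:
    """Hypothetical replay buffer; names are illustrative, not FeynRL's API."""

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self._items = deque()

    def add(self, rollout: Rollout) -> None:
        self._items.append(rollout)

    def sample_fresh(self, current_version: int, batch_size: int) -> list:
        """Pop up to batch_size rollouts, discarding ones that are too stale."""
        batch = []
        while self._items and len(batch) < batch_size:
            item = self._items.popleft()
            if current_version - item.policy_version <= self.max_staleness:
                batch.append(item)
            # Anything older than the budget is dropped as too off-policy.
        return batch
```

With `max_staleness=0` this reduces to fully on-policy training, which matches the behavior described for sync mode; larger budgets trade some off-policy drift for less idle time.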
SFT and DPO are simpler because they only require a single model and no rollout workers, so they run directly on DeepSpeed without Ray. All paradigms support full fine-tuning and LoRA, and plug into mixed-dataset sampling, experiment tracking, and standalone evaluation without changing the overall workflow.

The repository is organized so that algorithmic changes usually stay local:

- algs/: Algorithm and optimization logic. Each algorithm (PPO, GRPO, CISPO, P3O, DPO, SFT) has its own module with a README documenting the math and pseudocode.
- rollouts/: Rollout generation, vLLM engine wrappers, weight sync, and replay buffer.
- rewards/: Pluggable reward functions (GSM8K, math verification, and custom).
- data_feeds/: Data loading, sampling, and mixed-dataset support.
- data_prep/: Dataset preparation scripts.
- configs/: YAML configs for RL, SFT, DPO, and evaluation, with a [full parameter reference](/FeynRL-project/FeynRL/blob/main/configs/README.md).
- unit_tests/: Unit and integration tests.

- 2026-04-27: FeynRL is now publicly announced. Since the preview, we've added a new async engine and a collection of tricks and ideas, many not easily found elsewhere, that materially improve training stability and reliability. Thanks to everyone who tried the preview and shared feedback.
- 2026-03-03: We're excited to publicly release FeynRL as a preview. Some features and documentation are still evolving. We welcome feedback, bug reports, and contributions as we continue to build this together.

- Installation & Setup: Configure your environment and dependencies.
- Quickstart & How-To: Learn how to launch jobs and run experiments.
- Experiments: Reference experiment results and the canonical example configs used to reproduce them.
- Configuration Reference: Full parameter guide for RL, SFT, DPO, and evaluation configs.
- Troubleshooting: Diagnose and fix common issues.

Contributions are welcome! Please see our Contributing Guidelines for details on how to get involved.

Check out the [FAQ](/FeynRL-project/FeynRL/blob/main/docs/FAQ.md) for common questions and answers.