{"slug": "feynrl-don-t-let-systems-swallow-the-algorithm", "title": "FeynRL- Don't let systems swallow the algorithm", "summary": "FeynRL, an algorithm-first framework for post-training and fine-tuning large models, has been released as an open-source tool supporting supervised fine-tuning, preference learning, and reinforcement learning methods. The framework prioritizes clarity and locality of change over built-in features, enabling researchers to implement and modify algorithms without fighting infrastructure while scaling from single-GPU debugging to multi-node distributed runs. The release aims to provide a foundation for developing new post-training methods with community collaboration.", "body_md": "Algorithm-first post-training framework for large models.\n\n*\"What I cannot create, I do not understand.\"* - Richard Feynman\n\n**FeynRL** (pronounced \"FineRL\") is an **algorithm-first** framework for **post-training and fine-tuning** large models. It supports supervised fine-tuning (SFT), preference learning (e.g., DPO), and reinforcement learning (e.g., PPO, GRPO, CISPO, P3O), and is built for researchers and engineers who want to understand, modify, and develop new methods without fighting the infrastructure.\n\nThe main goal of FeynRL is simple: make new algorithms easy to implement, easy to debug, and still possible to train at scale. The codebase is designed so that **algorithmic logic stays local** and **systems logic stays explicit**, which makes the framework easier to reason about, easier to extend, and more reliable to debug.\n\nFeynRL is a good fit if your goal is not only to run an existing recipe, but to **build and test new post-training methods**.\n\n- **Algorithm-first design**: Most method changes stay local. You can add new objectives, rewards, baselines, or update rules without reshaping the full stack.\n- **Clear separation of concerns**: Algorithm code stays algorithmic, and systems code stays systems. 
That keeps the codebase easier to understand, test, and extend.\n- **One framework across post-training**: SFT, DPO, and RL share the same workflow and configuration system, making comparisons easier and reducing duplicated infrastructure.\n- **Scales beyond toy settings**: Use the same framework for local single-GPU debugging or large multi-node distributed runs.\n\nFeynRL may not be the best fit if your main priority is the largest built-in feature surface out of the box, or if you mainly want a framework already optimized around a narrow workflow and do not expect to modify it much.\n\nThere are already several strong open-source frameworks for post-training large models. Many are powerful and feature-rich, but they are often optimized around a narrower set of methods or execution patterns, and can become hard to modify once you want to try something new.\n\nFeynRL was built to make a different trade-off. Instead of optimizing first for the largest feature surface, it optimizes first for **clarity, locality of change, and algorithm development**. The codebase is structured so that algorithmic ideas are easy to implement and reason about, while the distributed systems layer remains explicit rather than hidden behind heavy abstractions. In practice, implementing a new algorithm typically means writing a single file with its own loss and update logic, not threading changes through the orchestration, rollout, and data layers.\n\nThe framework is designed for scale from the start. It supports large-scale training with DeepSpeed, Ray, and vLLM, including sync and async execution modes, adaptive weight synchronization, and multi-node runs. The goal is to make it possible to do both: **move fast on algorithms and still run realistic experiments at scale**.\n\nThis is the first public release, so expect rough edges. 
We are open-sourcing FeynRL not just as a library, but as a foundation for building new post-training methods with the community.\n\nFor a detailed breakdown of the architecture, see the **Architecture Overview**.\n\n- 🧪 **Training paradigms**: RL (PPO, GRPO, CISPO, P3O), preference-based learning (DPO), and supervised fine-tuning (SFT)\n- 🖥️ **Distributed training**: Multi-GPU and multi-node via DeepSpeed (ZeRO Stage 1/2/3)\n- 🎲 **Rollouts / inference**: vLLM-powered rollout engines with tensor parallelism\n- 🛰️ **Orchestration**: Ray for scheduling training and rollout workers across nodes\n- 🔀 **Training-rollout scheduling**: Sync and overlap (async) modes. In overlap mode, rollout generation and training run concurrently on separate GPU pools to reduce idle time, with a configurable staleness budget bounding how off-policy the replay data can drift.\n- 🔄 **Weight sync**: NCCL broadcast (sync mode supports direct/disk fallbacks; async mode is NCCL-only at runtime, with a built-in NCCL watchdog and fail-fast on communicator destruction).\n- 🧷 **Parameter-efficient fine-tuning**: LoRA via PEFT\n- 🔢 **Mixed-dataset sampling**: Configurable multi-dataset sampling with ratios within a single training run\n- 📈 **Experiment tracking**: MLflow and Weights & Biases support\n- 🏅 **Evaluation**: Standalone eval pipeline with vLLM engines\n\nFor RL, Ray orchestrates the full training loop: it schedules DeepSpeed training workers and vLLM rollout workers across nodes, and coordinates weight synchronization between them. In **sync mode**, each epoch generates all rollouts, trains on them, syncs weights, and repeats, which keeps training fully on-policy and easy to reason about. In **overlap mode** (also called async mode), rollout generation and training run concurrently on separate GPU pools so training GPUs don't sit idle waiting for rollouts. 
Generation is continuous across epoch boundaries and checkpoint saves; the only pauses are brief drains during weight sync, which runs once at the end of every non-final epoch. A configurable staleness budget bounds how off-policy the replay data can drift. Async mode uses NCCL for weight sync; sync mode supports a three-tier NCCL/direct/disk fallback chain. SFT and DPO are simpler because they only require a single model and no rollout workers, so they run directly on DeepSpeed without Ray. All paradigms support full fine-tuning and LoRA, and plug into mixed-dataset sampling, experiment tracking, and standalone evaluation without changing the overall workflow.\n\nThe repository is organized so that algorithmic changes usually stay local:\n\n- `algs/`: Algorithm and optimization logic. Each algorithm (PPO, GRPO, CISPO, P3O, DPO, SFT) has its own module with a README documenting the math and pseudocode.\n- `rollouts/`: Rollout generation, vLLM engine wrappers, weight sync, and replay buffer.\n- `rewards/`: Pluggable reward functions (GSM8K, math verification, and custom).\n- `data_feeds/`: Data loading, sampling, and mixed-dataset support.\n- `data_prep/`: Dataset preparation scripts.\n- `configs/`: YAML configs for RL, SFT, DPO, and evaluation, with full [parameter reference](/FeynRL-project/FeynRL/blob/main/configs/README.md).\n- `unit_tests/`: Unit and integration tests.\n\n- **2026-04-27**: FeynRL is now publicly announced! Since the preview, we've added a new async engine and a collection of tricks and ideas, many not easily found elsewhere, that materially improve training stability and reliability. Thanks to everyone who tried the preview and shared feedback.\n- **2026-03-03**: We're excited to publicly release FeynRL as a preview! Some features and documentation are still evolving. 
We welcome feedback, bug reports, and contributions as we continue to build this together.\n\n**Installation & Setup**: Configure your environment and dependencies.\n\n**Quickstart & How-To**: Learn how to launch jobs and run experiments.\n\n**Experiments**: Reference experiment results and the canonical example configs used to reproduce them.\n\n**Configuration Reference**: Full parameter guide for RL, SFT, DPO, and evaluation configs.\n\n**Troubleshooting**: Diagnose and fix common issues.\n\nContributions are welcome! Please see our **Contributing Guidelines** for details on how to get involved.\n\nCheck out the [FAQ](/FeynRL-project/FeynRL/blob/main/docs/FAQ.md) for common questions and answers.", "url": "https://wpnews.pro/news/feynrl-don-t-let-systems-swallow-the-algorithm", "canonical_source": "https://github.com/FeynRL-project/FeynRL", "published_at": "2026-06-02 21:32:35+00:00", "updated_at": "2026-06-02 21:48:52.580914+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "artificial-intelligence", "ai-research", "ai-tools"], "entities": ["FeynRL", "Richard Feynman", "SFT", "DPO", "PPO", "GRPO", "CISPO", "P3O"], "alternates": {"html": "https://wpnews.pro/news/feynrl-don-t-let-systems-swallow-the-algorithm", "markdown": "https://wpnews.pro/news/feynrl-don-t-let-systems-swallow-the-algorithm.md", "text": "https://wpnews.pro/news/feynrl-don-t-let-systems-swallow-the-algorithm.txt", "jsonld": "https://wpnews.pro/news/feynrl-don-t-let-systems-swallow-the-algorithm.jsonld"}}