RLP: Reinforcement as a Pretraining Objective

Source: research.nvidia.com · Published: 2026-05-16T18:14Z · Topic: artificial-intelligence

Researchers have developed RLP, an information-driven reinforcement pretraining objective that integrates exploration and chain-of-thought reasoning into the pretraining phase of large language models, rather than reserving reinforcement learning for post-training. The approach rewards models for generating reasoning chains that improve next-token prediction, enabling verifier-free dense reward signals during pretraining on ordinary text. Pretraining with RLP on Qwen3-1.7B-Base improved overall math-and-science benchmark performance by 19%, while application to Nemotron-Nano-12B-v2 raised average scores from 42.81% to 61.32%, demonstrating significant gains in reasoning-heavy tasks across model scales.

The dominant paradigm for training large reasoning models starts with pretraining using a next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful in scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. Dominant as it is, is this an optimal way of training?

In this paper, we present RLP, an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning, exploration, to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed from the information gain it provides for predicting future tokens. This objective encourages the model to think for itself before predicting what comes next, teaching independent thinking behavior earlier in pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both the context and a sampled reasoning chain, compared to conditioning on the context alone. This yields a verifier-free, dense reward signal, allowing efficient training on the full document stream during pretraining. RLP thus reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning.

Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
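The reward described above, the gain in next-token log-likelihood from conditioning on a sampled reasoning chain, can be illustrated with a small sketch. This is a toy illustration and not the authors' implementation: the function names, the use of raw logit lists, and the choice to score both conditions with plain softmax are all assumptions, and the article does not say how the context-only baseline is obtained in practice.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def information_gain_reward(logits_with_thought, logits_context_only, next_token_id):
    """Hypothetical helper: log p(next | context, thought) - log p(next | context).

    Positive when the sampled reasoning chain makes the observed next token
    more likely, negative when it makes it less likely.
    """
    lp_with = log_softmax(logits_with_thought)[next_token_id]
    lp_without = log_softmax(logits_context_only)[next_token_id]
    return lp_with - lp_without


def dense_rewards(positions):
    """One reward per position; each item is (logits_with, logits_without, token_id).

    No external verifier is involved: the only supervision is the observed
    next token from ordinary text.
    """
    return [information_gain_reward(w, c, t) for (w, c, t) in positions]


if __name__ == "__main__":
    # Toy 3-token vocabulary; the observed next token has id 0.
    helpful = information_gain_reward([2.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0)
    useless = information_gain_reward([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], 0)
    print(f"helpful thought reward: {helpful:.4f}")
    print(f"useless thought reward: {useless:.4f}")
```

Because the reward depends only on the observed next token, it is defined at every position of ordinary text, which is what makes the signal dense and verifier-free in the sense the abstract describes.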

source & further reading

Read original on research.nvidia.com → research.nvidia.com/publication/2026-04_rlp-rein…
