RLVR

mentions 8, type Organization

// recent coverage 8 mentions

04:00

2026-07-07

arxiv.org

large-language-models

Reinforcement Learning for Evidence-Seeking Diagnostic Reasoning with Large Language Models

Researchers formalized medical diagnosis as an iterative evidence-seeking task and used reinforcement learning with verifiable rewards to train large language models to autonomously gather clinical ev…

04:00

2026-07-07

arxiv.org

machine-learning

Reinforcement Learning for Data-Efficient Code-Switched ASR

Researchers propose a reinforcement learning with verifiable rewards (RLVR) method for data-efficient adaptation of audio-language models to code-switched automatic speech recognition (ASR). Using Qwe…

04:00

2026-06-29

arxiv.org

large-language-models

Tandem Reinforcement Learning with Verifiable Rewards

Researchers propose Tandem Reinforcement Learning (TRL), extending the tandem training paradigm to reinforcement learning with verifiable rewards (RLVR). Training Qwen3-4B-Instruct on competition math…

19:40

2026-06-27

lesswrong.com

large-language-models

Neuralese is Actually Probably Good for Alignment

Reinforcement Learning with Verifiable Rewards (RLVR) allows language models to bootstrap beyond human-level capabilities on exactly graded problems like coding and formal proofs, but alignment-flavor…

00:00

2026-06-22

aclanthology.org

large-language-models

A Few Bad Apples Spoil the Bunch: Preventing Global Entropy Collapse Driven by a Small Set of Tokens in LLM Reasoning

Researchers at ACL 2026 found that entropy collapse in LLM reasoning, which undermines test-time scaling, is driven by premature overconfidence at a small set of critical tokens. They proposed SCOPE, …

04:00

2026-06-16

arxiv.org

large-language-models

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA released Nemotron 3 Ultra, a 550B-parameter hybrid Mamba-Transformer model with 55B active parameters, achieving up to 6x higher inference throughput than state-of-the-art LLMs while maintainin…

04:00

2026-06-04

arxiv.org

machine-learning

Self-Distilled Policy Gradient

Researchers introduced SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with normalized standard deviation and full-vocabulary on-policy self-distillat…

21:11

2026-05-20

vmax.ai

artificial-intelligence

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

PopuLoRA is a method for training large language models (LLMs) that uses co-evolving populations of teacher and student adapters to generate and solve verifiable reasoning tasks, such as code and math…

// co-occurs with top 8 entities

GRPO (2), PopuLoRA (1), LLM (1), NVIDIA (1), Nemotron 3 Ultra (1), HuggingFace (1), LatentMoE (1), Multi Token Prediction (1)