{"slug": "policy-conditioned-counterfactual-credit-for-verifiable-reinforcement-learning", "title": "Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents", "summary": "Researchers have developed CVT-RL, a constrained policy-gradient algorithm that uses policy-conditioned counterfactual credit assignment to reduce unsupported evidence chains and shortcut actions in long-horizon language agents. Across long-context QA, ALFWorld, ScienceWorld, and web/tool tasks, CVT-RL raised average task success to 78.9%, up from 71.8% for a compute-matched non-causal RL baseline and 75.4% for an information-matched counterfactual-process baseline, and reduced measured hacking from 7.2% to 3.9%. The approach combines dense verifiable rewards with intervention-validity gating and a selection-adjusted doubly robust estimator that estimates each step's contribution to final verified success under specified interventions.", "body_md": "arXiv:2606.05263v1 Announce Type: new\nAbstract: Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing process rewards are mostly correlational: they reward retrieval-, reflection-, or verification-like steps without estimating whether the step contributes to final verified success under a specified intervention. We propose CVT-RL, a constrained policy-gradient algorithm with dense verifiable rewards, intervention-validity gating, and a policy-conditioned counterfactual contribution (PCCC) estimator. Deletion, semantic substitution, evidence substitution, and tool-output perturbation define separate controlled interventions; continuations are sampled from a frozen reference policy, and a selection-adjusted doubly robust estimator augments the advantage. 
Belief control uses only prefix-observable labels, while an augmented Lagrangian constrains unsupported claims, skipped verification, tool tampering, and unsafe calls. On long-context QA, ALFWorld, ScienceWorld, and web/tool tasks, CVT-RL improves average task success from 71.8% for compute-matched non-causal RL and 75.4% for an information-matched counterfactual-process baseline to 78.9%, improves evidence F1 from 78.9 to 82.8 over the information-matched baseline, and reduces measured hacking from 7.2% to 3.9%. Independent human audit estimates 4.6% hacking for CVT-RL versus 8.1% for the information-matched baseline, and adaptive detector-evasion attacks raise hacking only to 7.1%. Stratified bootstrap and mixed-effects tests give p<0.01 after Holm correction for all primary metrics. Carefully scoped counterfactual credit, paired with validity gating, diagnostics, and verifiable constraints, provides a reproducible route toward more reliable long-horizon RL for language agents.", "url": "https://wpnews.pro/news/policy-conditioned-counterfactual-credit-for-verifiable-reinforcement-learning", "canonical_source": "https://arxiv.org/abs/2606.05263", "published_at": "2026-06-05 04:00:00+00:00", "updated_at": "2026-06-05 04:37:04.189129+00:00", "lang": "en", "topics": ["machine-learning", "large-language-models", "ai-agents", "ai-safety", "natural-language-processing"], "entities": ["CVT-RL", "PCCC", "ALFWorld", "ScienceWorld"], "alternates": {"html": "https://wpnews.pro/news/policy-conditioned-counterfactual-credit-for-verifiable-reinforcement-learning", "markdown": "https://wpnews.pro/news/policy-conditioned-counterfactual-credit-for-verifiable-reinforcement-learning.md", "text": "https://wpnews.pro/news/policy-conditioned-counterfactual-credit-for-verifiable-reinforcement-learning.txt", "jsonld": "https://wpnews.pro/news/policy-conditioned-counterfactual-credit-for-verifiable-reinforcement-learning.jsonld"}}