Self-CTRL: Self-Consistency Training with Reinforcement Learning

wpnews.pro

Researchers introduced Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that aligns language models' self-explanations with their actual behavior. In tests, the approach improved correlation between self-reported and measured biases from R²=0.24 to R²=0.64 in probabilistic reasoning, and boosted refusal prediction accuracy from 36% to 92% in constitutional AI scenarios while reducing HarmBench failure rate from 15.0% to 0.5%. The technique offers a pathway to safer, more transparent AI systems.

Published Jun 18, 2026

Abstract (arXiv:2606.18327v1): Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that optimizes for consistency between an LM's self-explanations and its behavior on related inputs, either by updating explanations to better predict behavior or by updating behavior to better match explanations. We apply our method in two domains. First, we study a formal probabilistic reasoning task in which LMs must learn to imitate a family of biased samplers and are evaluated on their ability to report the associated biases. We find that consistency training improves the correlation between self-reported and behaviorally-measured latent biases from $R^2=0.24$ to $R^2=0.64$ on a set of held-out distributions, matching the generalization of direct ground-truth supervision. Second, we study a constitutional AI domain in which LMs must describe when they will refuse or comply with user requests. Here, Self-CTRL produces rules that faithfully describe the model's behavior on held-out requests, improving the refusal predictions of a third-party auditor model from $36\%$ to $92\%$. In the other direction, behavior updates improve alignment, reducing HarmBench failure rate from $15.0\%$ to $0.5\%$ without substantially increasing refusal on harmless prompts. By aligning explanations and behavior, our work provides a general recipe for training AI models to be safer, more transparent, and more controllable.
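To make the two reported quantities concrete, here is a minimal Python sketch of how such scores could be computed on toy data. It is not the paper's implementation: the function names `agreement_score` and `r_squared` are hypothetical, the toy labels and bias values are made up for illustration, and it assumes that $R^2$ means the squared Pearson correlation, which the abstract does not state. The agreement score is the fraction of held-out requests where a prediction derived from the model's stated rules matches what the model actually did; a quantity of this kind could plausibly serve as a consistency signal, but the abstract does not specify the reward used.

```python
from statistics import mean


def agreement_score(predicted, observed):
    """Fraction of held-out inputs where the explanation-based prediction
    (e.g. an auditor reading the model's stated rules) matches the
    model's observed behavior. Hypothetical helper, not from the paper."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


def r_squared(reported, measured):
    """Squared Pearson correlation between self-reported and
    behaviorally measured values. Assumption: the abstract's R^2 may be
    defined differently (e.g. as a coefficient of determination)."""
    if len(reported) != len(measured) or len(reported) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(reported), mean(measured)
    sxy = sum((x - mx) * (y - my) for x, y in zip(reported, measured))
    sxx = sum((x - mx) ** 2 for x in reported)
    syy = sum((y - my) ** 2 for y in measured)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


# Toy data (illustrative only): auditor predictions vs. actual decisions.
predicted = ["refuse", "comply", "comply", "refuse"]
observed = ["refuse", "comply", "refuse", "refuse"]
score = agreement_score(predicted, observed)  # 3 of 4 match

# Toy data (illustrative only): self-reported vs. measured sampler biases.
reported_bias = [0.1, 0.4, 0.5, 0.9]
measured_bias = [0.2, 0.3, 0.6, 0.8]
fit = r_squared(reported_bias, measured_bias)
```

In this framing, updating explanations corresponds to changing the `predicted` side so that it tracks behavior, while updating behavior corresponds to changing the `observed` side so that it tracks the stated rules.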

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.18327
