Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation

A new AI-driven workflow uses detailed "constitutions" to define labeling categories and a frontier LLM to interpret them. In content moderation tasks, it reduced cross-model inconsistency by a factor of up to 57 compared with standard paragraph definitions. The approach, tested on harassment, hate speech, and non-violent crime categories, shifts human responsibility from individual labeling decisions to high-level oversight of what each category means. The work also introduces a dual-axis safety evaluation that scores intent and content independently across full conversations, so downstream consumers can act on either dimension or both.
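The summary does not say how cross-model inconsistency is computed. As a rough sketch, assuming it is something like the share of model-pair comparisons in which two models give the same item different labels, it could be measured as follows. The function names, model names, and label values are all made up for illustration and are not taken from the paper.

```python
from itertools import combinations


def pairwise_disagreement(labels_by_model):
    """Fraction of (model pair, item) comparisons where the two labels differ.

    ASSUMPTION: this is one plausible reading of "cross-model inconsistency";
    the source does not define the metric.
    """
    models = list(labels_by_model)
    n_items = len(labels_by_model[models[0]])
    if any(len(labels_by_model[m]) != n_items for m in models):
        raise ValueError("every model must label the same items")

    total = 0
    differing = 0
    for a, b in combinations(models, 2):
        for x, y in zip(labels_by_model[a], labels_by_model[b]):
            total += 1
            differing += int(x != y)
    return differing / total if total else 0.0


def contested_items(labels_by_model):
    """Indices of items on which the models do not all agree.

    These are the candidates a human might review as possible
    specification gaps.
    """
    columns = zip(*labels_by_model.values())
    return [i for i, col in enumerate(columns) if len(set(col)) > 1]


# Hypothetical binary labels (1 = violates the category) from three models.
paragraph_labels = {
    "model_a": [1, 0, 1, 1, 0, 0],
    "model_b": [1, 1, 0, 1, 0, 1],
    "model_c": [0, 0, 1, 1, 1, 0],
}
constitution_labels = {
    "model_a": [1, 0, 1, 1, 0, 0],
    "model_b": [1, 0, 1, 1, 0, 0],
    "model_c": [1, 0, 1, 1, 1, 0],
}

print(pairwise_disagreement(paragraph_labels))
print(pairwise_disagreement(constitution_labels))
print(contested_items(constitution_labels))
```

Under this reading, the items returned by `contested_items` are where a reviewer would look first when deciding whether the written definition needs another clause.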

Published May 26, 2026

arXiv:2605.24247v1

Abstract: Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent use case. Simple category definitions are not detailed enough for labelers to produce the accurate, consistent golden labels these pipelines require. One solution is to write a prescriptive definition that settles enough real boundary cases that labelers cannot disagree with the written interpretation. In practice, definitions at that level of detail exceed what a human annotator can hold in working memory, so annotators fall back on intuition and the labels drift from the written rules, regressing on accuracy and consistency. We propose and demonstrate the efficacy of an AI-driven workflow in which AI helps write a per-category constitution that defines the label in enough detail to cover edge cases, and a frontier LLM interprets it on each input to produce the golden label more consistently and accurately than humans reading the same document. We evaluate on three content moderation categories (harassment, hate speech, non-violent crime) and show that the approach reduces cross-model inconsistency by up to 57x compared to paragraph definitions, with cross-model disagreement diagnosing specification gaps and the human responsible for high-level decisions about what each category should mean rather than individual labeling calls. For the safety evaluation, we introduce a dual-axis formulation scoring intent and content independently over the full conversation, so downstream consumers can act on either axis or both.
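The dual-axis formulation in the last sentence of the abstract can be pictured as a conversation-level record with two independent scores, which a consumer thresholds on one axis or both. The sketch below is only an illustration: the abstract gives no score scale, field names, or decision rule, so the 0.0 to 1.0 range, the `DualAxisScore` type, and the `should_flag` helper are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DualAxisScore:
    """Hypothetical conversation-level result with two independent axes."""
    intent: float   # ASSUMED scale: 0.0 (benign intent) to 1.0 (clearly harmful)
    content: float  # ASSUMED scale: 0.0 (benign content) to 1.0 (clearly harmful)


def should_flag(
    score: DualAxisScore,
    intent_threshold: Optional[float] = None,
    content_threshold: Optional[float] = None,
) -> bool:
    """Flag when any axis the consumer opted into meets its threshold.

    Passing None for a threshold means the consumer ignores that axis.
    """
    hits = []
    if intent_threshold is not None:
        hits.append(score.intent >= intent_threshold)
    if content_threshold is not None:
        hits.append(score.content >= content_threshold)
    return any(hits)


# Made-up example: harmful intent expressed through innocuous-looking text.
score = DualAxisScore(intent=0.9, content=0.1)

intent_only = should_flag(score, intent_threshold=0.5)
content_only = should_flag(score, content_threshold=0.5)
either_axis = should_flag(score, intent_threshold=0.5, content_threshold=0.5)
```

Keeping the two scores separate, instead of collapsing them into one number, is what lets one consumer act on intent while another acts only on content for the same conversation.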

source & further reading

arxiv.org — original article
