Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

wpnews.pro


[ARTICLE · art-14902] src=arxiv.org ↗ pub=2026-05-27T04:00Z topic=ai-agents verified=true sentiment=· neutral


Researchers introduced Anchor, a task-generation pipeline that mitigates artifact drift in AI agent benchmarks by formalizing domain experts' business workflow specifications into constraint optimization programs. From a single parametric specification, the system jointly produces an instruction, an environment configuration, a solver-certified solution, and a state-based verifier, so varying the parameters yields new tasks with controlled difficulty and known optimal solutions. Applied to build ERP-Bench, a benchmark of 300 long-horizon enterprise tasks, the approach showed that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials.

Published May 27, 2026

arXiv:2605.26321v1. Abstract: AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs. From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at erpbench.ai.
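To make the idea of generating every artifact from one specification concrete, here is a minimal Python sketch. It is not the authors' implementation: the toy procurement task, the names ProcurementSpec, Supplier, render_instruction, render_environment, solve, and verify, and the exhaustive search standing in for a real constraint solver are all assumptions made for illustration. The point it tries to show is that when the instruction, environment, reference solution, and verifier all read the same parameters, they cannot disagree about what the task requires.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Supplier:
    name: str
    unit_price: int  # price per unit, in whole currency units
    capacity: int    # maximum units this supplier can deliver


@dataclass(frozen=True)
class ProcurementSpec:
    """Hypothetical parametric specification: the single source of truth."""
    demand: int
    suppliers: tuple[Supplier, ...]


def render_instruction(spec):
    """Natural-language instruction derived from the spec."""
    lines = [f"Purchase exactly {spec.demand} units at minimum total cost."]
    for s in spec.suppliers:
        lines.append(f"- {s.name}: {s.unit_price} per unit, at most {s.capacity} units")
    return "\n".join(lines)


def render_environment(spec):
    """Initial environment state derived from the same spec."""
    catalog = {s.name: {"unit_price": s.unit_price, "capacity": s.capacity}
               for s in spec.suppliers}
    return {"suppliers": catalog, "purchase_orders": {}}


def is_feasible(spec, orders):
    """Explicit constraints: known suppliers, capacity limits, exact demand."""
    known = {s.name: s for s in spec.suppliers}
    if any(name not in known for name in orders):
        return False
    if any(qty < 0 or qty > known[name].capacity for name, qty in orders.items()):
        return False
    return sum(orders.values()) == spec.demand


def total_cost(spec, orders):
    prices = {s.name: s.unit_price for s in spec.suppliers}
    return sum(prices[name] * qty for name, qty in orders.items())


def solve(spec):
    """Exhaustive search over order quantities (assumed stand-in for a solver)."""
    best = None
    for qtys in product(*(range(s.capacity + 1) for s in spec.suppliers)):
        orders = {s.name: q for s, q in zip(spec.suppliers, qtys)}
        if not is_feasible(spec, orders):
            continue
        cost = total_cost(spec, orders)
        if best is None or cost < best[1]:
            best = (orders, cost)
    if best is None:
        raise ValueError("specification is infeasible")
    return best


def verify(spec, end_state):
    """State-based check: inspects only the final purchase orders."""
    orders = end_state["purchase_orders"]
    feasible = is_feasible(spec, orders)
    optimal = feasible and total_cost(spec, orders) == solve(spec)[1]
    return {"feasible": feasible, "optimal": optimal}


# Usage: one spec drives the instruction, the environment, and the verifier.
spec = ProcurementSpec(demand=10,
                       suppliers=(Supplier("A", 5, 6), Supplier("B", 7, 10)))
print(render_instruction(spec))
state = render_environment(spec)
state["purchase_orders"] = {"B": 10}  # meets the constraints but costs more
print(verify(spec, state))
```

In this sketch, changing demand or the supplier list produces a new task whose optimum is recomputed automatically. The verifier reports feasibility and optimality separately, which loosely mirrors the two rates quoted in the abstract (constraints satisfied versus fully optimal).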

source & further reading


Read original on arxiv.org → arxiv.org/abs/2605.26321


