
Design and Report Benchmarks for Knowledge Work

Researchers have identified a fundamental flaw in how AI agents are evaluated for knowledge work, finding that higher benchmark scores do not reliably indicate real-world performance. The team proposes a three-step framework requiring explicit definition of work activities, tested settings, and appropriate work products, drawing from occupational studies to create an inventory of 18 work activities. Their analysis of three existing benchmarks reveals how current design choices create gaps between benchmarked tasks and the work claims their scores are meant to support.

1 min read · published May 25, 2026

arXiv:2605.23262v1 (announce type: new). Abstract: The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.
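The three reporting steps named in the abstract (work activity, tested setting, scored work product) could be captured as a small structured record. The following is only an illustrative sketch, not the paper's own schema: the class names, field names, the `unspecified_fields` helper, and the "ExampleBench" entry are all hypothetical, introduced here to show how a report might flag which parts of the mapping were left unstated.

```python
from dataclasses import dataclass, field


@dataclass
class TestedSetting:
    """Step 2 (hypothetical schema): the conditions under which the task ran."""
    materials: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


@dataclass
class BenchmarkReport:
    """Hypothetical report card tying a score to the work claim it supports."""
    benchmark: str
    work_activity: str      # Step 1: the work activity under evaluation
    setting: TestedSetting  # Step 2: materials, tools, roles, constraints
    scored_product: str     # Step 3: the work product that is scored
    work_claim: str         # the broader claim the score is meant to support

    def unspecified_fields(self) -> list[str]:
        """Return the names of reporting fields that were left empty."""
        missing = []
        if not self.work_activity.strip():
            missing.append("work_activity")
        for name in ("materials", "tools", "roles", "constraints"):
            if not getattr(self.setting, name):
                missing.append(f"setting.{name}")
        if not self.scored_product.strip():
            missing.append("scored_product")
        if not self.work_claim.strip():
            missing.append("work_claim")
        return missing


# "ExampleBench" is a made-up benchmark used purely for illustration.
report = BenchmarkReport(
    benchmark="ExampleBench",
    work_activity="analyze documents to answer a question",
    setting=TestedSetting(materials=["source documents"]),
    scored_product="final answer string",
    work_claim="",
)
print(report.unspecified_fields())
```

A record like this does not measure anything by itself; it only makes visible where a benchmark description leaves the tested setting or the work claim unspecified.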
