Design and Report Benchmarks for Knowledge Work

wpnews.pro

[ARTICLE · art-13573] src=arxiv.org ↗ pub=2026-05-25T04:00Z topic=artificial-intelligence verified=true sentiment=· neutral

Researchers have identified a fundamental flaw in how AI agents are evaluated for knowledge work, finding that higher benchmark scores do not reliably indicate real-world performance. The team proposes a three-step framework requiring explicit definition of work activities, tested settings, and appropriate work products, drawing from occupational studies to create an inventory of 18 work activities. Their analysis of three existing benchmarks reveals how current design choices create gaps between benchmarked tasks and the work claims their scores are meant to support.

Published May 25, 2026

arXiv:2605.23262v1

Abstract: The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.
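The three reporting steps named in the abstract (work activity, tested setting, scored work product) can be pictured as a small record that a benchmark report fills in. The sketch below is illustrative only and is not taken from the paper: the class names `TestedSetting` and `BenchmarkReport`, the `missing_fields` helper, and the example values are all hypothetical. Only the four setting dimensions (materials, tools, roles, constraints) come from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class TestedSetting:
    """Step 2 (hypothetical structure): conditions under which the task was run."""
    materials: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


@dataclass
class BenchmarkReport:
    """Hypothetical report card tying a score to the work claim it supports."""
    benchmark: str
    work_activity: str       # Step 1: the work activity under evaluation
    setting: TestedSetting   # Step 2: the tested setting
    scored_product: str      # Step 3: the work product that is scored
    work_claim: str          # the broader claim attached to the score

    def missing_fields(self) -> list[str]:
        """Return the reporting elements that were left unspecified."""
        missing = []
        if not self.work_activity.strip():
            missing.append("work_activity")
        for name in ("materials", "tools", "roles", "constraints"):
            if not getattr(self.setting, name):
                missing.append(f"setting.{name}")
        if not self.scored_product.strip():
            missing.append("scored_product")
        return missing


# Made-up example: a report that names tools and a scored product only.
report = BenchmarkReport(
    benchmark="ExampleBench",  # hypothetical benchmark name
    work_activity="",
    setting=TestedSetting(tools=["spreadsheet"]),
    scored_product="final answer string",
    work_claim="can perform document analysis",
)
print(report.missing_fields())
```

A check like this only flags which elements a report leaves unstated. Judging whether the stated activity, setting, and product actually support the work claim is the analysis the paper carries out in its case studies.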

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2605.23262
