LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

wpnews.pro


Source: arxiv.org · Published: 2026-06-12T04:00Z · Topic: artificial-intelligence

Researchers have introduced LoHoSearch, a benchmark of 544 human-verified questions across 11 domains, designed to push long-horizon search agents past the difficulty ceiling of human-authored benchmarks. It is built by an automated pipeline over a knowledge graph of more than 7 million Wikipedia entities. Even the strongest model evaluated reaches only 34.74% accuracy, whereas the strongest models exceed 90% on saturated benchmarks such as BrowseComp. The authors present LoHoSearch as a more demanding standard for evaluating long-horizon reasoning and context management in search agents.


arXiv:2606.12837v1

Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
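The abstract mentions two ideas that a toy example may make more concrete: the size of the search space a relation leaves open, and checking against the knowledge graph that a multi-constraint question has exactly one answer. The sketch below is only an illustration under assumptions. The triples, the relation names, and the helper functions (`build_index`, `search_space_size`, `answers`, `has_unique_answer`) are all hypothetical and are not taken from the paper's pipeline.

```python
from collections import defaultdict

# Toy knowledge graph as (subject, relation, object) triples.
# Every entity and relation here is invented for illustration only.
TRIPLES = [
    ("alice", "born_in", "springfield"),
    ("bob", "born_in", "springfield"),
    ("carol", "born_in", "shelbyville"),
    ("alice", "works_at", "acme"),
    ("bob", "works_at", "globex"),
    ("carol", "works_at", "acme"),
]


def build_index(triples):
    """Map each (relation, object) pair to the set of subjects that satisfy it."""
    index = defaultdict(set)
    for subj, rel, obj in triples:
        index[(rel, obj)].add(subj)
    return index


def search_space_size(index, constraint):
    """Number of candidate entities that a single constraint leaves open."""
    return len(index.get(constraint, set()))


def answers(index, constraints):
    """Entities that satisfy every (relation, object) constraint at once."""
    candidate_sets = [index.get(c, set()) for c in constraints]
    return set.intersection(*candidate_sets) if candidate_sets else set()


def has_unique_answer(index, constraints):
    """True when the graph admits exactly one answer to the question."""
    return len(answers(index, constraints)) == 1


index = build_index(TRIPLES)
question = [("born_in", "springfield"), ("works_at", "acme")]

print(search_space_size(index, question[0]))  # 2 candidates for the first clue
print(answers(index, question))               # {'alice'}
print(has_unique_answer(index, question))     # True
```

In this toy setting each clue on its own is ambiguous (two candidates), and only the combination pins down a single entity. A real pipeline would presumably work over millions of entities and much larger candidate sets, which this sketch does not attempt to model.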

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.12837
