{"slug": "lohosearch-benchmarking-long-horizon-search-agents-beyond-the-human-difficulty", "title": "LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling", "summary": "Researchers introduced LoHoSearch, a benchmark of 544 human-verified questions across 11 domains designed to test long-horizon search agents beyond the human difficulty ceiling. On the benchmark, which is built from a knowledge graph of over 7 million Wikipedia entities, the strongest model reaches only 34.74% accuracy, compared with over 90% on saturated benchmarks such as BrowseComp. LoHoSearch provides a more demanding standard for evaluating search agents' long-horizon reasoning and context management capabilities.", "body_md": "arXiv:2606.12837v1\nAbstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.", "url": "https://wpnews.pro/news/lohosearch-benchmarking-long-horizon-search-agents-beyond-the-human-difficulty", "canonical_source": "https://arxiv.org/abs/2606.12837", "published_at": "2026-06-12 04:00:00+00:00", "updated_at": "2026-06-12 04:56:17.499971+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-research", "ai-agents", "large-language-models", "natural-language-processing"], "entities": ["LoHoSearch", "BrowseComp", "Wikipedia"], "alternates": {"html": "https://wpnews.pro/news/lohosearch-benchmarking-long-horizon-search-agents-beyond-the-human-difficulty", "markdown": "https://wpnews.pro/news/lohosearch-benchmarking-long-horizon-search-agents-beyond-the-human-difficulty.md", "text": "https://wpnews.pro/news/lohosearch-benchmarking-long-horizon-search-agents-beyond-the-human-difficulty.txt", "jsonld": "https://wpnews.pro/news/lohosearch-benchmarking-long-horizon-search-agents-beyond-the-human-difficulty.jsonld"}}