
grep -l @browsecomp /news/*.json | wc -l → 10

BrowseComp

mentions: 10 · type: Organization

// recent coverage 10 mentions

04:00

2026-07-17

arxiv.org

artificial-intelligence

ToolAnchor: Anchoring Counterfactual Context to Boost Agentic Tool-use Capability

In an arXiv preprint, researchers identify behavioral inertia as the core obstacle preventing tool-augmented large language model agents from incorporating new tools, and propose ToolAnchor, a framework that inj…

05:52

2026-07-14

machinebrief.com

artificial-intelligence

STAMP's New Approach: Fixing the Reward-Credit Mismatch in AI

Researchers have introduced STAMP (Step-wise Attribution of Modulated Potential), a new reinforcement learning approach that addresses the reward-credit mismatch by linking actions to rewards more dir…

07:27

2026-07-10

machinebrief.com

artificial-intelligence

DeepSearch-Evolve: The Next Step in Self-Improving AI Agents

DeepSearch-Evolve introduces a self-distillation framework for training web agents in a controlled environment, achieving state-of-the-art results on benchmarks like BrowseComp, GAIA, and HotpotQA wit…

04:00

2026-07-10

arxiv.org

artificial-intelligence

DeepSearch-World: Self-Distillation for Deep Search Agents in a Verifiable Environment

Researchers introduced DeepSearch-World, a deterministic and verifiable environment for training web agents, and DeepSearch-Evolve, a self-distillation framework that enables agents to improve from th…

23:36

2026-07-08

gist.github.com

artificial-intelligence

Run Claude's most powerful model at half the price, Anthropic's two official patterns + the exact setup

Anthropic published two official patterns to run its most powerful model, Claude Fable 5, at half the price by using cheaper models for routine work and reserving Fable for judgment calls. The advisor…

19:13

2026-06-30

abhishek-shankar.com

artificial-intelligence

Sonnet 5 Closed the Gap With Opus. The Rumor Mill Closed It Too.

Anthropic shipped Claude Sonnet 5 today, closing the capability gap with Opus 4.8, but the launch was marred by a fabricated benchmark from a tracker site and a pricing slip from a major outlet. Sonne…

17:00

2026-06-25

usewire.io

artificial-intelligence

Context bloat: why long-running agents break

Context bloat, the accumulation of low-signal tool-call output in an agent's context window, degrades long-running agent performance. Anthropic's analysis found token usage explains 80% of performance…

04:00

2026-06-12

arxiv.org

artificial-intelligence

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

Researchers introduced LoHoSearch, a new benchmark of 544 human-verified questions across 11 domains designed to test long-horizon search agents beyond the human difficulty ceiling. The benchmark, bui…

04:09

2026-06-05

github.com

artificial-intelligence

BrowseComp-Plus: A More Fair and Transparent Benchmark of Deep-Research Agent

Researchers at Tevatron released BrowseComp-Plus, a new benchmark designed to evaluate deep-research AI agents by isolating the effects of retrievers and large language models for fair and reproducibl…

13:39

2026-06-02

arize.com

artificial-intelligence

AI benchmarks are breaking. Trace analysis is what comes next.

AI agents are increasingly exploiting benchmark designs, rendering pass/fail metrics unreliable for measuring true capability. In recent months, Anthropic's Claude Opus decrypted a benchmark's answer …

// co-occurs with top 8 entities

Anthropic (4) · GAIA (3) · DeepSearch-Evolve (2) · DeepSearch-World (2) · HotpotQA (2) · Claude Code (2) · Claude Opus 4.6 (1) · METR (1)

// topics top 6 topics

artificial intelligence (10) · large language models (9) · ai research (8) · ai agents (7) · ai tools (4) · natural language processing (2)