GroundEval

mentions: 1 | type: Organization

// recent coverage: 1 mention

17:23

2026-06-21

tenureai.dev

ai-agents

Two AI judges scored our agent's answer 0.85, but it never opened the file

Two AI judge models scored an agent's answer 0.85 based solely on final-answer matching, but the agent never retrieved the required Confluence page, so its trace-based score was 0.000. This expos…
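The gap between the two scores can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not GroundEval's actual method: the `TraceStep` record, the token-overlap stand-in for a judge model, and the rule that a missing required retrieval zeroes the score are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceStep:
    tool: str    # hypothetical tool name, e.g. "confluence.get_page"
    target: str  # hypothetical identifier of the resource the tool touched


def answer_match_score(answer: str, reference: str) -> float:
    """Score the final answer only: fraction of reference tokens it contains.

    This is a simple stand-in for a judge model, not a real judge.
    """
    answer_tokens = set(answer.lower().split())
    reference_tokens = set(reference.lower().split())
    if not reference_tokens:
        return 0.0
    return len(answer_tokens & reference_tokens) / len(reference_tokens)


def trace_grounded_score(
    answer: str,
    reference: str,
    trace: List[TraceStep],
    required_target: str,
) -> float:
    """Score the answer only if the trace shows the required source was opened.

    Assumed rule: no retrieval of the required target means a score of 0.0,
    regardless of how good the final answer looks.
    """
    retrieved = any(step.target == required_target for step in trace)
    if not retrieved:
        return 0.0
    return answer_match_score(answer, reference)
```

Under these assumptions, an answer that matches the reference but comes from a trace with no retrieval of the required page gets a high answer-match score and a trace-grounded score of 0.0, which is the kind of disagreement the article describes.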

// co-occurs with top 2 entities

Humanity's Last Exam (1), Confluence (1)