17:23
2026-06-21
tenureai.dev
ai-agents
Two AI judges scored our agent's answer 0.85, but it never opened the file
Two AI judge models scored an agent's answer 0.85 based solely on final-answer matching, but the agent never retrieved the required Confluence page, so its trace-based score was 0.000. This expos…
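To make the contrast between the two scores concrete, here is a minimal sketch of a trace-gated score. It is not the method used in the post. The `ToolCall` schema, the resource identifiers, and the choice to multiply the judge score by trace coverage are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step in an agent trace (hypothetical schema)."""
    tool: str
    resource: str


def trace_coverage(trace: list[ToolCall], required: set[str]) -> float:
    """Fraction of required resources that the agent actually retrieved."""
    if not required:
        return 1.0
    retrieved = {call.resource for call in trace}
    return len(required & retrieved) / len(required)


def gated_score(judge_score: float, trace: list[ToolCall], required: set[str]) -> float:
    """Scale an answer-only judge score by trace coverage (assumed gating rule)."""
    return judge_score * trace_coverage(trace, required)


# Hypothetical trace: the agent ran a search but never opened the required page.
trace = [ToolCall(tool="search", resource="confluence:search-results")]
required = {"confluence:required-page"}

print(gated_score(0.85, trace, required))  # 0.0
```

Under this assumed rule, a high answer-only score cannot survive a trace that skips the required retrieval, which is the gap the summary describes.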