04:00
2026-07-01
arxiv.org
large-language-models
BayesBench: Evaluating LLM Belief Trajectories Under Multi-Turn Evidence Accumulation
Researchers introduced BayesBench, a suite of simulation environments to evaluate how large language models update beliefs under multi-turn evidence accumulation. Testing seven LLMs from 3B to 70B par…
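The summary does not say how belief trajectories are scored. As one illustration, suppose an environment of this kind compares a model's stated probability after each turn with an exact Bayesian posterior. A reference trajectory could then be computed as in the minimal sketch below. The two-hypothesis coin setup, the mean-absolute-gap score, and the names `bayesian_trajectory` and `trajectory_error` are hypothetical choices made for this example. They are not BayesBench's actual environments or metric.

```python
from typing import Sequence


def bayesian_trajectory(prior: float, p_one_h1: float, p_one_h0: float,
                        observations: Sequence[int]) -> list[float]:
    """Return P(H1 | evidence so far) after each turn (hypothetical setup).

    prior: P(H1) before any evidence is shown.
    p_one_h1 / p_one_h0: probability of observing a 1 under H1 / H0.
    observations: 0/1 outcomes, one revealed per conversational turn.
    """
    belief = prior
    trajectory = []
    for obs in observations:
        like_h1 = p_one_h1 if obs == 1 else 1.0 - p_one_h1
        like_h0 = p_one_h0 if obs == 1 else 1.0 - p_one_h0
        # Bayes' rule: the posterior is proportional to likelihood times prior.
        numerator = like_h1 * belief
        belief = numerator / (numerator + like_h0 * (1.0 - belief))
        trajectory.append(belief)
    return trajectory


def trajectory_error(model_beliefs: Sequence[float],
                     reference: Sequence[float]) -> float:
    """Mean absolute gap between stated beliefs and the Bayesian reference.

    This is an assumed scoring rule for illustration only.
    """
    if len(model_beliefs) != len(reference) or not reference:
        raise ValueError("trajectories must be non-empty and equal in length")
    return sum(abs(m - r) for m, r in zip(model_beliefs, reference)) / len(reference)


if __name__ == "__main__":
    # Fair prior, H1: P(1) = 0.8, H0: P(1) = 0.5, evidence revealed over 3 turns.
    reference = bayesian_trajectory(0.5, 0.8, 0.5, [1, 1, 0])
    print([round(p, 3) for p in reference])
    # A hypothetical model's reported beliefs at the same three turns.
    print(round(trajectory_error([0.6, 0.9, 0.8], reference), 3))
```

Updating sequentially like this yields the same final posterior as a single batch update, because the observations are treated as independent given each hypothesis. A model whose stated beliefs drift away from this curve across turns would, under these assumptions, score a larger gap.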