04:00
2026-06-16
arxiv.org
large-language-models
Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning
Researchers introduced IRTS-ToolBench, a benchmark of 1,700 questions across 10 task types and 13 domains, to evaluate large language models and AI agents on irregular time series question answering. …