{"slug": "constructing-evaluation-datasets-for-procedural-reasoning-balancing-naturalness", "title": "Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage", "summary": "A paper posted to arXiv introduces a framework for generating and validating question-answer datasets that test procedural reasoning in AI-supported learning systems. Comparing three generation strategies across 23 instructional topics and 690 question-answer pairs, the authors found that strict generation from Task-Method-Knowledge (TMK) models achieved the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produced more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yielded high raw multi-hop coverage but lower grounding. The findings show that procedural richness and natural phrasing do not guarantee representational grounding, which motivates explicit representation-aware validation for evaluation datasets.", "body_md": "arXiv:2606.12767v1\nAbstract: Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning.\nWe compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. 
The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning.\nAcross 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.", "url": "https://wpnews.pro/news/constructing-evaluation-datasets-for-procedural-reasoning-balancing-naturalness", "canonical_source": "https://arxiv.org/abs/2606.12767", "published_at": "2026-06-12 04:00:00+00:00", "updated_at": "2026-06-12 04:53:07.871341+00:00", "lang": "en", "topics": ["artificial-intelligence", "natural-language-processing", "ai-research"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/constructing-evaluation-datasets-for-procedural-reasoning-balancing-naturalness", "markdown": "https://wpnews.pro/news/constructing-evaluation-datasets-for-procedural-reasoning-balancing-naturalness.md", "text": "https://wpnews.pro/news/constructing-evaluation-datasets-for-procedural-reasoning-balancing-naturalness.txt", "jsonld": "https://wpnews.pro/news/constructing-evaluation-datasets-for-procedural-reasoning-balancing-naturalness.jsonld"}}