{"slug": "how-fine-grained-should-a-rag-benchmark-be-a-hierarchical-framework-for-question", "title": "How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation", "summary": "An arXiv preprint introduces HieraRAG, a hierarchical framework for determining optimal granularity in RAG benchmark construction, evaluated on 5,872 synthetic QA pairs generated from FineWeb-10BT across three dimensions and three granularity levels. With a single BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: question complexity benefits from fine-grained distinctions, while answer type and linguistic variation peak at medium granularity. The paper also offers a portable procedure for practitioners to determine evaluation granularity in their own RAG settings.", "body_md": "arXiv:2606.12789v1\nAbstract: Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). 
Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.", "url": "https://wpnews.pro/news/how-fine-grained-should-a-rag-benchmark-be-a-hierarchical-framework-for-question", "canonical_source": "https://arxiv.org/abs/2606.12789", "published_at": "2026-06-12 04:00:00+00:00", "updated_at": "2026-06-12 04:55:53.756589+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "natural-language-processing", "ai-research"], "entities": ["HieraRAG", "FineWeb-10BT", "BM25", "Falcon-3-10B"], "alternates": {"html": "https://wpnews.pro/news/how-fine-grained-should-a-rag-benchmark-be-a-hierarchical-framework-for-question", "markdown": "https://wpnews.pro/news/how-fine-grained-should-a-rag-benchmark-be-a-hierarchical-framework-for-question.md", "text": "https://wpnews.pro/news/how-fine-grained-should-a-rag-benchmark-be-a-hierarchical-framework-for-question.txt", "jsonld": "https://wpnews.pro/news/how-fine-grained-should-a-rag-benchmark-be-a-hierarchical-framework-for-question.jsonld"}}