Test-Time Verification for Text-to-SQL via Outcome Reward Models

Researchers introduced GradeSQL, a framework for training Outcome Reward Models (ORMs) to improve test-time verification in Text-to-SQL tasks. ORM-based selection outperformed execution-based Best-of-N and Majority Voting, achieving gains of up to +4.33% on BIRD and +2.10% on Spider benchmarks. The approach provides a scalable alternative to heuristic strategies for structured query generation.

Published Jul 1, 2026 · 1 min read

arXiv:2606.30851v1

Abstract: Improving the reliability of large language models (LLMs) at inference time is a central challenge in structured reasoning tasks such as Text-to-SQL. Common test-time inference strategies, including Best-of-N sampling and Majority Voting, rely on heuristic signals such as execution success or output frequency, which provide limited semantic discrimination across candidate outputs. In this work, we study Outcome Reward Models (ORMs) as learned semantic scoring functions for test-time verification in Text-to-SQL. While ORMs have been previously explored for test-time scaling and alignment, their application to structured query generation remains underexplored. We introduce GradeSQL, a scalable framework for training task-specific ORMs via automated candidate generation and execution-based labeling, enabling verifier training without manual annotation. We integrate ORMs into a verification-driven Best-of-N pipeline and evaluate our approach on the BIRD and Spider benchmarks across multiple open-source LLM families. ORM-based selection consistently outperforms execution-based Best-of-N and Majority Voting, with gains of up to +4.33% on BIRD and +2.10% on Spider. We further show that ORMs scale effectively with larger candidate sets and yield stronger improvements on complex queries. Overall, our results demonstrate that ORM-based verification provides a simple, effective, and scalable alternative to heuristic test-time selection strategies for Text-to-SQL. Code, datasets, and models are publicly available.
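
To make the comparison in the abstract concrete, here is a minimal Python sketch of the three selection strategies and of execution-based labeling. It is an illustration under stated assumptions, not the GradeSQL implementation: the function names, the toy `singers` table, the candidate queries, and the scores in `toy_scores` are all hypothetical, and a fixed lookup table stands in for a trained ORM.

```python
import sqlite3
from collections import Counter
from typing import Callable, Optional


def execute(conn: sqlite3.Connection, sql: str) -> Optional[frozenset]:
    """Run a query; return its rows as an order-insensitive set, or None on error.
    (Simplification: duplicate rows and row order are ignored.)"""
    try:
        return frozenset(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def label_candidates(conn, candidates, gold_sql):
    """Execution-based labeling: 1 if a candidate's result matches the gold result."""
    gold = execute(conn, gold_sql)
    return [(sql, int(gold is not None and execute(conn, sql) == gold))
            for sql in candidates]


def execution_best_of_n(conn, candidates):
    """Heuristic baseline: return the first candidate that executes without error."""
    return next((sql for sql in candidates if execute(conn, sql) is not None), None)


def majority_vote(conn, candidates):
    """Heuristic baseline: return a candidate whose result is the most frequent one."""
    results = {sql: execute(conn, sql) for sql in candidates}
    votes = Counter(r for r in results.values() if r is not None)
    if not votes:
        return None
    winner, _ = votes.most_common(1)[0]
    return next(sql for sql in candidates if results[sql] == winner)


def orm_best_of_n(question: str, candidates, score: Callable[[str, str], float]):
    """ORM-style selection: return the candidate with the highest verifier score."""
    return max(candidates, key=lambda sql: score(question, sql))


# Hypothetical toy database and sampled candidates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singers (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singers VALUES (?, ?)",
                 [("Ann", 30), ("Bob", 45), ("Cy", 52)])

question = "How many singers are older than 40?"
gold_sql = "SELECT COUNT(*) FROM singers WHERE age > 40"
candidates = [
    "SELECT COUNT(*) FROM singers",                  # runs, wrong answer
    "SELECT COUNT(*) FROM singers WHERE age >= 30",  # runs, wrong answer
    "SELECT COUNT(*) FROM singers WHERE age > 40",   # runs, correct
    "SELECT COUNT(*) FROM singer WHERE age > 40",    # fails: no such table
]

# Stand-in for a trained ORM: made-up scores, for illustration only.
toy_scores = dict(zip(candidates, [0.2, 0.3, 0.9, 0.6]))

labels = label_candidates(conn, candidates, gold_sql)
by_execution = execution_best_of_n(conn, candidates)
by_majority = majority_vote(conn, candidates)
by_orm = orm_best_of_n(question, candidates, lambda q, sql: toy_scores[sql])
```

In this contrived case both heuristics select the first candidate: it executes, and its result is shared by two candidates, yet it answers the wrong question. The learned scorer is the only selector that can prefer the third candidate, which is the kind of semantic discrimination the abstract attributes to ORMs. The `labels` list shows how pairs of candidate and 0/1 label could be produced without manual annotation when a gold query is available.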
