Test-Time Verification for Text-to-SQL via Outcome Reward Models

wpnews.pro


[ARTICLE · art-45912] src=arxiv.org ↗ pub=2026-07-01T04:00Z topic=large-language-models verified=true sentiment=↑ positive


Researchers introduced GradeSQL, a framework for training Outcome Reward Models (ORMs) to improve test-time verification in Text-to-SQL tasks. ORM-based selection outperformed execution-based Best-of-N and Majority Voting, achieving gains of up to +4.33% on BIRD and +2.10% on Spider benchmarks. The approach provides a scalable alternative to heuristic strategies for structured query generation.

Published Jul 1, 2026 · 1 min read

arXiv:2606.30851v1 · Announce Type: new

Abstract: Improving the reliability of large language models (LLMs) at inference time is a central challenge in structured reasoning tasks such as Text-to-SQL. Common test-time inference strategies, including Best-of-N sampling and Majority Voting, rely on heuristic signals such as execution success or output frequency, which provide limited semantic discrimination across candidate outputs. In this work, we study Outcome Reward Models (ORMs) as learned semantic scoring functions for test-time verification in Text-to-SQL. While ORMs have been previously explored for test-time scaling and alignment, their application to structured query generation remains underexplored. We introduce GradeSQL, a scalable framework for training task-specific ORMs via automated candidate generation and execution-based labeling, enabling verifier training without manual annotation. We integrate ORMs into a verification-driven Best-of-N pipeline and evaluate our approach on the BIRD and Spider benchmarks across multiple open-source LLM families. ORM-based selection consistently outperforms execution-based Best-of-N and Majority Voting, with gains of up to +4.33% on BIRD and +2.10% on Spider. We further show that ORMs scale effectively with larger candidate sets and yield stronger improvements on complex queries. Overall, our results demonstrate that ORM-based verification provides a simple, effective, and scalable alternative to heuristic test-time selection strategies for Text-to-SQL. Code, datasets, and models are publicly available.
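To make the comparison in the abstract concrete, here is a minimal Python sketch of the three selection strategies plus execution-based labeling, run against a toy SQLite table. It is an illustration under stated assumptions, not GradeSQL's code: the function names, the `singer` table, and the candidate queries are invented for the example, the ORM is replaced by a hard-coded table of hypothetical scores, and "execution-based Best-of-N" is taken to mean "first candidate that executes without error", which may differ from the paper's exact baseline.

```python
import sqlite3
from collections import Counter
from typing import Callable, Hashable, Optional


def run_sql(conn: sqlite3.Connection, sql: str) -> Optional[Hashable]:
    """Run a query; return an order-insensitive, hashable result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def label_candidates(conn, candidates, gold_sql):
    """Execution-based labeling: 1 if a candidate's result equals the gold result."""
    gold = run_sql(conn, gold_sql)
    results = [run_sql(conn, sql) for sql in candidates]
    return [int(r is not None and r == gold) for r in results]


def majority_vote(conn, candidates):
    """Pick a candidate whose execution result is the most frequent one."""
    valid = [(sql, r) for sql in candidates if (r := run_sql(conn, sql)) is not None]
    if not valid:
        return None
    top_result, _ = Counter(r for _, r in valid).most_common(1)[0]
    return next(sql for sql, r in valid if r == top_result)


def execution_best_of_n(conn, candidates):
    """Assumed baseline: the first candidate that executes without error."""
    return next((sql for sql in candidates if run_sql(conn, sql) is not None), None)


def orm_best_of_n(question: str, candidates, score: Callable[[str, str], float]):
    """ORM-based selection: keep the candidate with the highest verifier score."""
    return max(candidates, key=lambda sql: score(question, sql)) if candidates else None


# Toy database and question (hypothetical, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("A", 25), ("B", 35), ("C", 40), ("D", 30)],
)

question = "How many singers are older than 30?"
gold_sql = "SELECT COUNT(*) FROM singer WHERE age > 30"
candidates = [
    "SELECT COUNT(*) FROM singer WHERE age >= 30",     # runs, wrong boundary
    "SELECT COUNT(name) FROM singer WHERE age >= 30",  # runs, same wrong answer
    "SELECT COUNT(*) FROM singer WHERE age > 30",      # correct
    "SELECT COUNT(*) FROM singers WHERE age > 30",     # fails: no such table
]

# Stand-in for a trained ORM: fixed, made-up scores per candidate.
hypothetical_scores = dict(zip(candidates, [0.35, 0.30, 0.90, 0.05]))


def toy_orm(question: str, sql: str) -> float:
    return hypothetical_scores[sql]


labels = label_candidates(conn, candidates, gold_sql)
picked_mv = majority_vote(conn, candidates)
picked_exec = execution_best_of_n(conn, candidates)
picked_orm = orm_best_of_n(question, candidates, toy_orm)

print("labels:", labels)
print("majority vote:", picked_mv)
print("execution BoN:", picked_exec)
print("ORM BoN:", picked_orm)
```

In this contrived case the two heuristic strategies both settle on a query that executes but answers the wrong question, while a verifier that scores the correct query highest recovers it. The labels from `label_candidates` are the kind of signal that could serve as training targets for a verifier without manual annotation; the real pipeline presumably works with far larger candidate pools and a learned scorer.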

source & further reading

arxiv.org — original article


Read original on arxiv.org → arxiv.org/abs/2606.30851




