Excited to share our paper: Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning
We introduce SEVRA, a serving-layer controller that decides when a frozen reasoning model should keep its first answer and when it should run active verification.
The main finding: verification helps, but it is not always worth the extra compute. On MATH500, selective verification improves over always verifying while reducing both harmful answer flips and verification tokens. On GSM8K, it verifies only a small fraction of examples and still improves accuracy. However, a longer initial solve can sometimes reach the same accuracy with fewer total tokens.
So our practical takeaway is:
Tune the initial reasoning budget first; then use selective verification when explicit checks, bounded retries, auditability, or regression-risk control matter.
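To make the quantities above concrete, here is a minimal sketch of how accuracy, answer flips, and token cost could be tallied for one policy. It is not the SEVRA evaluation code: `Record`, its fields, and `summarize` are hypothetical names, and the sketch assumes a harmful flip means the first answer was correct and the final answer is wrong.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    """One evaluated example (hypothetical schema, not from the SEVRA repo)."""
    first_correct: bool   # was the first answer correct?
    final_correct: bool   # was the answer returned by the policy correct?
    verified: bool        # did the policy run verification on this example?
    solve_tokens: int     # tokens spent on the initial solve
    verify_tokens: int    # tokens spent on verification (0 if not verified)


def summarize(records: List[Record]) -> Dict[str, float]:
    """Tally accuracy, verification rate, flips, and token cost."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")
    # Assumed definitions: a harmful flip turns a correct first answer into
    # a wrong final answer; a helpful flip does the opposite.
    harmful = sum(r.first_correct and not r.final_correct for r in records)
    helpful = sum(not r.first_correct and r.final_correct for r in records)
    return {
        "accuracy": sum(r.final_correct for r in records) / n,
        "verify_rate": sum(r.verified for r in records) / n,
        "harmful_flips": harmful,
        "helpful_flips": helpful,
        "verify_tokens": sum(r.verify_tokens for r in records),
        "total_tokens": sum(r.solve_tokens + r.verify_tokens for r in records),
    }
```

Running `summarize` on an always-verify run, a selective run, and a longer-initial-solve run of the same examples gives a like-for-like comparison of accuracy against total tokens.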
A few questions we would love feedback on:
When should a reasoning system verify instead of simply thinking longer?
Should harmful answer flips be reported more often in test-time compute papers?
Are cheap serving signals like token count and completion status enough for routing, or do we need learned controllers?
What is the best way to evaluate test-time reasoning policies beyond accuracy and token cost?
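As a starting point for the routing question, here is a minimal sketch of a hand-written gate that uses only token count and completion status. It is an illustration, not SEVRA's controller: `FirstPass`, `should_verify`, and the `long_fraction` threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirstPass:
    """Cheap serving signals from the initial solve (hypothetical schema)."""
    answer: Optional[str]  # parsed final answer, or None if none was found
    tokens_used: int       # tokens generated in the first pass
    finished: bool         # False if generation stopped at the token budget


def should_verify(fp: FirstPass, budget: int, long_fraction: float = 0.8) -> bool:
    """Return True if the first answer should go to verification.

    Assumed heuristic: verify when the first pass produced no answer, was cut
    off by the budget, or used most of the budget. Otherwise keep the answer.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if fp.answer is None or not fp.finished:
        return True
    return fp.tokens_used >= long_fraction * budget


# Usage: a short, completed solve keeps its first answer.
print(should_verify(FirstPass("42", tokens_used=200, finished=True), budget=1000))
```

A learned controller would replace the fixed threshold with a model over the same signals, so a rule like this can serve as a baseline for that comparison.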
Paper: [Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning](https://huggingface.co/papers/2606.19808)
Code: [Sajib-006/SEVRA](https://github.com/Sajib-006/SEVRA) (selective verification for budget-aware LLM reasoning, with reusable routing, gate training, and policy evaluation)
Feedback and discussion are very welcome.