04:00
2026-06-05
arxiv.org
large-language-models
Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference
Researchers have developed PRECISE, a method that combines small human-labeled datasets with large sets of LLM-generated judgments to produce bias-corrected estimates of ranking evaluation metrics. The approa…
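
For readers unfamiliar with prediction-powered inference, the sketch below shows the generic mean estimator the technique is usually introduced with. It is not taken from the PRECISE paper: the function name `ppi_mean`, its parameters, and the choice of a per-query metric as the input are all assumptions made for illustration. The idea is to average the LLM-derived metric over the large unlabeled pool, then add a correction ("rectifier") measured on the small set where human labels are also available.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean(human_labeled, llm_on_labeled, llm_on_unlabeled, alpha=0.05):
    """Generic prediction-powered estimate of a mean metric (illustrative only).

    human_labeled    : per-query metric values computed from human labels (small set)
    llm_on_labeled   : per-query metric values from LLM judgments on the same queries
    llm_on_unlabeled : per-query metric values from LLM judgments on the large pool
    Returns (estimate, (ci_low, ci_high)) using a normal approximation.
    """
    if len(human_labeled) != len(llm_on_labeled):
        raise ValueError("labeled inputs must be paired query by query")
    n = len(human_labeled)
    big_n = len(llm_on_unlabeled)
    if n < 2 or big_n < 2:
        raise ValueError("need at least two observations in each set")

    # Rectifier: how far the LLM-based metric is from the human-based one.
    rectifier = [h - m for h, m in zip(human_labeled, llm_on_labeled)]

    # Bias-corrected point estimate.
    estimate = fmean(llm_on_unlabeled) + fmean(rectifier)

    # The two samples are independent, so their variances add.
    std_err = sqrt(variance(llm_on_unlabeled) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


if __name__ == "__main__":
    # Toy numbers, invented for the example.
    est, (low, high) = ppi_mean(
        human_labeled=[1.0, 0.0, 1.0, 1.0],
        llm_on_labeled=[0.5, 0.5, 0.5, 0.5],
        llm_on_unlabeled=[0.4, 0.6, 0.4, 0.6],
    )
    print(f"estimate={est:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

When the LLM judgments track the human labels closely, the rectifier has low variance and the interval is driven mostly by the large pool, which is where the gain over using human labels alone comes from.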