Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

A systematic human audit of the NL-to-FOL benchmarks FOLIO (validation split) and MALLS (a subset of test instances) found that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations, alongside ambiguous natural language sentences and, in FOLIO, incorrect NLI labels. Evaluating state-of-the-art LLMs against the corrected ground truths raised their measured accuracy by 9 to 22 percentage points on a reference task. The researchers also released an LLM-assisted framework that directs human reviewers to the most error-prone instances, reaching 90% dataset accuracy after reviewing fewer than 24% of entries, compared with over 70% for unguided review.

arXiv:2606.02837v1. Abstract: Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential, yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for these datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.
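The idea of guided versus unguided review can be illustrated with a toy sketch. It is not the authors' framework: the suspicion scores, the function name `reviews_needed`, the ten-entry dataset, and the assumption that a human review always fixes a wrong entry are all hypothetical, and the abstract does not say how the paper scores instances or defines dataset accuracy. The sketch only counts how many reviews are needed to reach a target accuracy under two review orders.

```python
from typing import Sequence


def reviews_needed(is_wrong: Sequence[bool], order: Sequence[int], target: float) -> int:
    """Count the reviews needed, in the given order, before the fraction of
    correct entries reaches `target`.

    Assumption (hypothetical): a review always fixes a wrong entry and
    never damages a correct one.
    """
    n = len(is_wrong)
    correct = sum(1 for wrong in is_wrong if not wrong)
    if correct / n >= target:
        return 0
    for reviewed, idx in enumerate(order, start=1):
        if is_wrong[idx]:
            correct += 1  # the reviewer repairs this entry
        if correct / n >= target:
            return reviewed
    raise ValueError("target not reachable with this review order")


# Toy data (hypothetical): 10 entries, 4 of them wrong.
is_wrong = [False, True, False, False, True, False, True, False, False, True]
# Hypothetical suspicion scores, e.g. from an LLM judge; higher = more suspect.
suspicion = [0.1, 0.9, 0.2, 0.6, 0.8, 0.1, 0.7, 0.3, 0.2, 0.4]

# Guided review: most suspicious entries first.
guided = sorted(range(len(is_wrong)), key=lambda i: suspicion[i], reverse=True)
# Unguided review: plain dataset order.
unguided = list(range(len(is_wrong)))

guided_cost = reviews_needed(is_wrong, guided, target=0.9)
unguided_cost = reviews_needed(is_wrong, unguided, target=0.9)
print(guided_cost, unguided_cost)
```

On this toy data the guided order reaches the target with fewer reviews than dataset order. The gap depends entirely on how well the made-up scores track the true errors, so it says nothing about the percentages reported in the abstract.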