{"slug": "fixing-folio-and-malls-verified-annotations-and-an-llm-assisted-framework-to", "title": "Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling", "summary": "A systematic human audit of the NL-to-FOL benchmarks FOLIO and MALLS found that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations, with additional errors in ambiguous natural language sentences and NLI labels. Correcting these ground truths boosted the accuracy of state-of-the-art LLMs by 9 to 22 percentage points on a reference task. The researchers released an LLM-assisted framework that focuses human reviewers on the most error-prone instances, achieving 90% dataset accuracy after reviewing fewer than 24% of entries.", "body_md": "arXiv:2606.02837v1 Announce Type: new\nAbstract: Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential -- yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for such datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. 
By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.", "url": "https://wpnews.pro/news/fixing-folio-and-malls-verified-annotations-and-an-llm-assisted-framework-to", "canonical_source": "https://arxiv.org/abs/2606.02837", "published_at": "2026-06-03 04:00:00+00:00", "updated_at": "2026-06-03 04:23:01.707287+00:00", "lang": "en", "topics": ["natural-language-processing", "large-language-models", "artificial-intelligence", "machine-learning", "ai-research"], "entities": ["FOLIO", "MALLS", "Gemma 4 31B-it", "Qwen3-30B-A3B", "GPT-4o-mini"], "alternates": {"html": "https://wpnews.pro/news/fixing-folio-and-malls-verified-annotations-and-an-llm-assisted-framework-to", "markdown": "https://wpnews.pro/news/fixing-folio-and-malls-verified-annotations-and-an-llm-assisted-framework-to.md", "text": "https://wpnews.pro/news/fixing-folio-and-malls-verified-annotations-and-an-llm-assisted-framework-to.txt", "jsonld": "https://wpnews.pro/news/fixing-folio-and-malls-verified-annotations-and-an-llm-assisted-framework-to.jsonld"}}