Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

wpnews.pro

Source: arxiv.org · Published: 2026-06-03T04:00Z · Topic: natural-language-processing

A systematic human audit of two NL-to-FOL (natural language to first-order logic) benchmarks, covering the validation split of FOLIO and a subset of MALLS test instances, found that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations. The audit also found ambiguous natural language sentences (16.4% and 48%) and, in FOLIO, incorrect NLI labels (8.4%). Evaluating three state-of-the-art LLMs against the corrected ground truths raised their measured accuracy by 9 to 22 percentage points on a reference task. The researchers also released an LLM-assisted framework that directs human reviewers to the most error-prone instances, reaching 90% dataset accuracy after review of fewer than 24% of entries, compared with over 70% for unguided review.

arXiv:2606.02837v1

Abstract: Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential, yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of FOLIO and a subset of MALLS test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in FOLIO (8.4%). Our second contribution is to develop and release corrected ground truths for these datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma 4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.
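To make the review-budget comparison concrete, here is a minimal Python sketch of why reviewing in a prioritized order can reach a target dataset accuracy with fewer reviews than an unguided pass. It is not the paper's framework. The function names, the toy labels, and the suspicion scores are all hypothetical, and the sketch assumes that every reviewed entry ends up correct. Under that same assumption, and assuming errors are spread uniformly, a back-of-the-envelope formula gives the expected fraction that an unguided review must cover.

```python
from typing import Sequence


def random_review_fraction(error_rate: float, target: float) -> float:
    """Expected fraction to review at random so that accuracy >= target.

    Assumption (not from the paper): errors are uniformly spread and each
    reviewed entry is fixed, so accuracy = 1 - error_rate * (1 - f).
    """
    if error_rate <= 1 - target:
        return 0.0
    return 1 - (1 - target) / error_rate


def reviews_needed(is_wrong: Sequence[bool], order: Sequence[int], target: float) -> int:
    """Count reviews, taken in `order`, until dataset accuracy >= target."""
    n = len(is_wrong)
    correct = n - sum(is_wrong)
    if correct / n >= target:
        return 0
    for k, idx in enumerate(order, start=1):
        if is_wrong[idx]:
            correct += 1  # assumed: a reviewed wrong entry is always fixed
        if correct / n >= target:
            return k
    raise ValueError("target not reachable with the given review order")


# Hypothetical toy dataset: 10 entries, 4 of them wrong.
is_wrong = [True, False, False, True, False, True, False, False, True, False]
# Hypothetical suspicion scores, standing in for any LLM-derived signal.
scores = [0.9, 0.2, 0.1, 0.8, 0.3, 0.7, 0.6, 0.1, 0.4, 0.2]

# Guided review: most suspicious entries first.
guided_order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
# Unguided review: simply walk the dataset in file order.
unguided_order = list(range(len(scores)))

print(reviews_needed(is_wrong, guided_order, 0.9))    # 3
print(reviews_needed(is_wrong, unguided_order, 0.9))  # 6
print(round(random_review_fraction(0.39, 0.90), 3))   # 0.744
```

With the roughly 39% error rate reported for FOLIO, the formula gives about 74% for random review, which is in line with the "over 70%" figure in the abstract. This is only a consistency check under the stated assumptions, and the paper's own measurement setup may differ.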

Read original on arxiv.org → arxiv.org/abs/2606.02837
