Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

wpnews.pro


Source: arxiv.org · Published: 2026-06-18T04:00Z · Topic: natural-language-processing

Researchers propose a fully local AI cascade for de-identifying educational dialogue that achieves 0.958 macro F1 on math tutoring transcripts, outperforming a commercial API (0.706) and LLM-only baselines (0.767), while running entirely on a single laptop. The system reframes de-identification as constrained privacy triage, using a recall-first proposer and context-aware reviewer to distinguish personal names from curricular terms like 'Riemann'.
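For readers unfamiliar with the headline metric, the following is a minimal sketch of macro F1, assuming it is the unweighted mean of per-label F1 scores. The label set (`Redact`/`Keep`) and the toy data are illustrative assumptions; the summary does not state which labels or spans the reported 0.958 is averaged over.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy data (hypothetical): one personal name was missed by the system.
gold = ["Redact", "Redact", "Keep", "Keep"]
pred = ["Redact", "Keep", "Keep", "Keep"]
print(round(macro_f1(gold, pred), 3))  # 0.733
```

Because every label counts equally, a system that over-redacts curricular terms is penalized on the `Keep` side even if it catches every name.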

arXiv:2606.18372v1

Abstract: Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where "Riemann" may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.
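The propose-then-review shape described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the two proposers are simple regex stand-ins for the lightweight encoders and deterministic rules, the reviewer is a hand-written heuristic standing in for a context-aware model, and every name here (`NAME_HINTS`, `CURRICULAR_CUES`, `propose`, `review`, `deidentify`) is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    text: str


# Hypothetical stand-ins; the paper's encoders, rules, and reviewer are not given here.
NAME_HINTS = ("riemann", "maria")
CURRICULAR_CUES = {"sum", "integral", "hypothesis", "theorem"}


def propose_capitalized(utterance):
    """Deterministic rule: every capitalized word is a candidate (over-generates)."""
    return {Span(m.start(), m.end(), m.group())
            for m in re.finditer(r"\b[A-Z][a-z]+\b", utterance)}


def propose_name_hints(utterance):
    """Stand-in for an encoder: case-insensitive match against a small name list."""
    pattern = r"\b(?:" + "|".join(map(re.escape, NAME_HINTS)) + r")\b"
    return {Span(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, utterance, re.IGNORECASE)}


def propose(utterance):
    """Recall-first union: a span is a candidate if any proposer emits it."""
    return propose_capitalized(utterance) | propose_name_hints(utterance)


def review(span, utterance, speaker_role):
    """Binary Redact/Keep decision from local context and speaker role."""
    before = utterance[:span.start].rstrip()
    after = utterance[span.end:].lstrip()
    next_word = re.match(r"\W*(\w+)", after)
    # "Riemann sum", "Riemann hypothesis": the following word marks curricular use.
    if next_word and next_word.group(1).lower() in CURRICULAR_CUES:
        return "Keep"
    # A tutor addressing someone by name: the span is set off by punctuation.
    vocative = before.endswith(",") and after[:1] in {",", ".", "!", "?"}
    if speaker_role == "tutor" and vocative:
        return "Redact"
    return "Keep"


def deidentify(speaker_role, utterance):
    """Replace every span the reviewer marks Redact with a placeholder."""
    flagged = [s for s in propose(utterance)
               if review(s, utterance, speaker_role) == "Redact"]
    out = utterance
    # Edit right to left so earlier offsets stay valid.
    for s in sorted(flagged, key=lambda s: s.start, reverse=True):
        out = out[:s.start] + "[NAME]" + out[s.end:]
    return out


print(deidentify("tutor", "Nice work, Riemann, now try problem two."))
# Nice work, [NAME], now try problem two.
print(deidentify("tutor", "The Riemann sum uses rectangles."))
# The Riemann sum uses rectangles.
```

The point of the split is that the proposer can afford to be noisy ("Nice" and "The" are both proposed above) because the reviewer only has to answer a narrow yes/no question per candidate. The toy reviewer defaults to Keep and would miss many real names; a practical reviewer would need a far stronger model of context.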

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.18372
