A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text

Researchers have built a reproducible NLP pipeline for Katharevousa Greek, a historical variety of Greek used in Greek parliamentary archives, by creating a Universal Dependencies-style parsing resource from 1,697 sentences of parliamentary questions from the early post-junta period. The pipeline, which includes OCR-aware reconstruction and LLM-assisted annotation, reached a best labeled attachment score (LAS) of 0.5162 with an XLM-R model, an absolute gain of 0.0980 over the strongest off-the-shelf baseline, spaCy Greek, at 0.4183 LAS. The code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports are publicly released to enable auditable syntactic analysis of historical parliamentary texts.

Published May 25, 2026

arXiv:2605.22978v1

Abstract: Katharevousa Greek remains poorly served by contemporary NLP pipelines despite its importance for legal, administrative, and parliamentary archives. We present a reproducible workflow for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary questions from Greece's early post-junta period. The pipeline links OCR-aware reconstruction, schema-constrained LLM-assisted annotation, automatic validation, deterministic CoNLL-U snapshotting, fixed-split evaluation, and model-family comparison. The frozen automatically validated reference set contains 1,697 sentences, split into 1,357 training sentences and 340 held-out test sentences. We compare off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, mBERT, XLM-R, and custom Stanza training under the same scoring protocol. Off-the-shelf systems show substantial register mismatch: the strongest external baseline, spaCy Greek, reaches 0.4183 LAS. The best structural parser, an XLM-R model, reaches 0.8893 UPOS accuracy, 0.7250 dependency-relation F1, 0.6098 UAS, and 0.5162 LAS, an absolute LAS gain of 0.0980 over the best external baseline. The feature-based model remains competitive for UPOS and relation labeling, indicating that transparent lexical-context features still matter at this data scale. Beyond scores, the paper contributes an auditable methodology for turning difficult historical parliamentary OCR into reusable syntactic NLP infrastructure. The entire pipeline (code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports) is released as an open-access companion to this paper.
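For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how UPOS accuracy, UAS, and LAS are conventionally computed from a gold and a predicted CoNLL-U file with identical tokenization. It is not the paper's scoring protocol, which may treat punctuation, relation subtypes, or tokenization mismatches differently. The helper names (`row`, `read_conllu`, `score`) and the four-token toy sentence are hypothetical and do not come from the released dataset. Dependency-relation F1 is a separate metric and is not computed here.

```python
from typing import Dict, List, Tuple

# (form, upos, head, deprel) for one syntactic word
Token = Tuple[str, str, int, str]


def row(idx: int, form: str, upos: str, head: int, deprel: str) -> str:
    """Build one 10-column CoNLL-U line; unused columns are '_' (toy helper)."""
    cols = [str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"]
    return "\t".join(cols)


def read_conllu(text: str) -> List[List[Token]]:
    """Parse CoNLL-U text into sentences of (form, upos, head, deprel)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append((cols[1], cols[3], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def score(gold: List[List[Token]], pred: List[List[Token]]) -> Dict[str, float]:
    """Token-level UPOS accuracy, UAS, and LAS over aligned sentences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted files differ in sentence count")
    total = upos_ok = uas_ok = las_ok = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("tokenization mismatch; this sketch assumes none")
        for (_, g_upos, g_head, g_rel), (_, p_upos, p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            upos_ok += g_upos == p_upos
            head_ok = g_head == p_head
            uas_ok += head_ok                      # correct head only
            las_ok += head_ok and g_rel == p_rel   # correct head and label
    return {
        "total": total,
        "upos": upos_ok / total,
        "uas": uas_ok / total,
        "las": las_ok / total,
    }


# Toy sentence with placeholder forms w1..w4 (not real corpus data).
gold_text = "\n".join([
    row(1, "w1", "DET", 2, "det"),
    row(2, "w2", "NOUN", 3, "nsubj"),
    row(3, "w3", "VERB", 0, "root"),
    row(4, "w4", "ADV", 3, "advmod"),
]) + "\n\n"

pred_text = "\n".join([
    row(1, "w1", "DET", 2, "det"),      # fully correct
    row(2, "w2", "NOUN", 3, "obj"),     # right head, wrong label
    row(3, "w3", "NOUN", 0, "root"),    # wrong UPOS, right head and label
    row(4, "w4", "ADV", 2, "advmod"),   # wrong head
]) + "\n\n"

result = score(read_conllu(gold_text), read_conllu(pred_text))
print(result)
```

On the toy input, three of four tags and three of four heads match, but only two tokens have both the correct head and the correct label, so the sketch yields UPOS 0.75, UAS 0.75, and LAS 0.50. The same ordering (LAS at or below UAS) appears in the reported XLM-R results, since LAS adds the label condition on top of the head condition.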

Read the original on arxiv.org: arxiv.org/abs/2605.22978


