
LLM-free, layout-aware PDF chunker in pure Rust

Developer Matthias Nordwig released pdf-struct-chunker, an open-source PDF chunker written in pure Rust that splits documents into semantically meaningful chunks by analyzing layout features like X/Y coordinates, font sizes, and bold detection, without relying on LLMs or cloud APIs. The tool aims to improve RAG pipelines by preserving document structure and metadata, processing a 100-page PDF in under a second on a standard laptop.

Published Jun 29, 2026 · 5 min read

Split PDFs into semantically meaningful chunks — without LLMs, without APIs, without cloud dependencies.

Author: Matthias Nordwig · programmiere.de

Most RAG chunkers blindly split documents by token count or character limit. This destroys document structure — headings, sections, and paragraphs get ripped apart. The result: your vector search returns incoherent fragments with no context about where they came from.

pdf-struct-chunker solves this by analyzing the actual layout of a PDF: X/Y coordinates, font sizes, and bold detection. It understands where a heading starts, where a paragraph ends, and where a new section begins. Each chunk carries structured metadata (section, heading, page) so your RAG pipeline knows exactly what it's looking at.

No LLM needed. No API calls. Runs offline. Written in pure Rust.

❌ Standard RAG (Fixed-size overlap)

// Chunk 1
"This regulation applies to all companies. § 2 De-"

// Chunk 2
"finitions. In this regulation, the following terms"

Result: Words are cut in half, headings are disconnected from their content.
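
To see why this happens, here is a minimal sketch of a fixed-size splitter. The function name `fixed_size_chunks` and the 48-character window are illustrative assumptions, not taken from any particular RAG library, and overlap is left out for brevity.

```rust
/// Naive splitter: cut the text every `size` characters, ignoring words,
/// sentences, and headings. (Hypothetical helper, for illustration only.)
fn fixed_size_chunks(text: &str, size: usize) -> Vec<String> {
    assert!(size > 0, "chunk size must be positive");
    // Work on characters rather than bytes so multi-byte symbols like '§' stay intact.
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(size)
        .map(|window| window.iter().collect())
        .collect()
}

fn main() {
    let text = "This regulation applies to all companies. § 2 Definitions. \
                In this regulation, the following terms";
    for (i, chunk) in fixed_size_chunks(text, 48).iter().enumerate() {
        println!("// Chunk {}\n{:?}\n", i + 1, chunk);
    }
}
```

With a 48-character window, the first chunk ends in the middle of "Definitions" and the second begins with "finitions.", which is the failure shown above.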

✅ pdf-struct-chunker (Layout-aware)

// Chunk 1
{
  "metadata": { "section": "§ 1", "heading": "Scope" },
  "text": "This regulation applies to all companies."
}
// Chunk 2
{
  "metadata": { "section": "§ 2", "heading": "Definitions" },
  "text": "In this regulation, the following terms..."
}

Result: Clean, semantic chunks with perfect metadata.

Built for speed and Edge-AI scenarios:

- No GPU required: pure CPU processing.
- Extremely fast: processes a 100-page PDF in under 1 second on a standard laptop.
- Low memory footprint: operates entirely in-memory without creating temporary files.

A sample PDF is included — you can try it immediately after cloning:

git clone https://github.com/matthiasnordwig/pdf-struct-chunker.git
cd pdf-struct-chunker
cargo run --release -- -i fixtures/sample.pdf --format json --pretty

Output:

[
  {
    "index": 0,
    "char_start": 0,
    "char_end": 441,
    "text": "§ 1 Anwendungsbereich\nDiese Verordnung gilt für alle Unternehmen ...",
    "signature": "§ 1 Anwendungsbereich\nDiese Verordnung gilt für alle Unternehmen",
    "metadata": {
      "section": "§ 1",
      "heading": "Anwendungsbereich",
      "page": 2
    }
  },
  {
    "index": 1,
    "text": "§ 2 Begriffsbestimmungen\nIm Sinne dieser Verordnung ...",
    "metadata": {
      "section": "§ 2",
      "heading": "Begriffsbestimmungen",
      "page": 2
    }
  }
]

Every chunk knows its section, heading, and page number — ready for embedding.
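
One way a retrieval layer might use this metadata is to attach a readable source label to every search hit. The sketch below is an assumption about downstream usage, not part of pdf-struct-chunker: the `Metadata` struct only mirrors the fields visible in the JSON output, and `source_label` is a hypothetical helper.

```rust
/// Hypothetical mirror of the `metadata` object in the JSON output.
struct Metadata {
    section: String,
    heading: String,
    page: Option<u32>,
}

/// Build a human-readable source label from section, heading, and page.
fn source_label(meta: &Metadata) -> String {
    let title = format!("{} {}", meta.section, meta.heading);
    // Trim in case section or heading is empty.
    let title = title.trim();
    match meta.page {
        Some(page) => format!("{title} (p. {page})"),
        None => title.to_string(),
    }
}

fn main() {
    let meta = Metadata {
        section: "§ 1".to_string(),
        heading: "Anwendungsbereich".to_string(),
        page: Some(2),
    };
    println!("{}", source_label(&meta));
}
```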

PDF bytes ──► pdf_oxide (extract characters with X/Y positions + font sizes)
                │
                ▼
          Line Classification
          (match lines against your regex profiles, or fall back to font-size heuristics)
                │
                ▼
          Chunk Assembly
          (split at headings, merge small fragments, split overflow at sentence boundaries)
                │
                ▼
          Vec<Chunk> { text, section, heading, page }

The chunker processes each PDF page by extracting character-level bounding boxes, reconstructing lines from Y-coordinates, classifying them using configurable regex patterns (or font-size heuristics as fallback), and assembling them into semantically coherent chunks with structural metadata.
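
The line-reconstruction and font-size fallback steps can be sketched roughly as follows. This is not the project's actual code and does not use the pdf_oxide API: the `Glyph` and `Line` types, the Y tolerance, and the 1.2x font-size threshold are all assumptions chosen for illustration. Real PDFs also often omit space glyphs, so a real implementation would have to infer word gaps from X distances.

```rust
/// One extracted character with its position and font size.
/// (Hypothetical type; the real extractor's structs may look different.)
#[derive(Debug, Clone)]
struct Glyph {
    ch: char,
    x: f32,
    y: f32,
    font_size: f32,
}

/// A reconstructed text line.
#[derive(Debug)]
struct Line {
    text: String,
    font_size: f32, // largest font size seen on the line
}

/// Group glyphs into lines: glyphs whose Y differs from the first glyph of
/// the current row by at most `y_tolerance` share a line.
fn reconstruct_lines(mut glyphs: Vec<Glyph>, y_tolerance: f32) -> Vec<Line> {
    // PDF user space has its origin at the bottom-left, so larger Y is higher up.
    glyphs.sort_by(|a, b| b.y.total_cmp(&a.y));

    let mut rows: Vec<Vec<Glyph>> = Vec::new();
    for g in glyphs {
        let same_row = rows
            .last()
            .map_or(false, |row| (row[0].y - g.y).abs() <= y_tolerance);
        if same_row {
            rows.last_mut().unwrap().push(g);
        } else {
            rows.push(vec![g]);
        }
    }

    rows.into_iter()
        .map(|mut row| {
            // Within a line, read left to right.
            row.sort_by(|a, b| a.x.total_cmp(&b.x));
            Line {
                text: row.iter().map(|g| g.ch).collect(),
                font_size: row.iter().map(|g| g.font_size).fold(0.0_f32, f32::max),
            }
        })
        .collect()
}

/// Fallback heuristic (assumed threshold): a line counts as a heading when
/// its font is at least 20% larger than the body font.
fn is_heading(line: &Line, body_font_size: f32) -> bool {
    line.font_size >= body_font_size * 1.2
}

/// Test helper: lay out a string as evenly spaced glyphs on one baseline.
fn glyphs_for(text: &str, y: f32, font_size: f32) -> Vec<Glyph> {
    text.chars()
        .enumerate()
        .map(|(i, ch)| Glyph { ch, x: 72.0 + i as f32 * font_size * 0.5, y, font_size })
        .collect()
}

fn main() {
    // Deliberately feed the body line first to show that input order does not matter.
    let mut glyphs = glyphs_for("This regulation applies to all companies.", 680.0, 10.0);
    glyphs.extend(glyphs_for("§ 1 Scope", 700.0, 14.0));

    for line in reconstruct_lines(glyphs, 2.0) {
        println!("{:?} heading={}", line.text, is_heading(&line, 10.0));
    }
}
```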

git clone https://github.com/matthiasnordwig/pdf-struct-chunker.git
cd pdf-struct-chunker
cargo build --release

Add this to your Cargo.toml:

[dependencies]
pdf-struct-chunker = { git = "https://github.com/matthiasnordwig/pdf-struct-chunker" }

pdf-struct-chunker [OPTIONS] --input <INPUT>

| Flag | Description | Default |
| --- | --- | --- |
| -i, --input <FILE> | Path to the input PDF file | Required |
| -p, --profile <FILE> | Path to a JSON profile with custom regex rules (see below) | Built-in defaults |
| -o, --output <FILE> | Output file path | stdout |
| --format <FORMAT> | Output format: jsonl or json | jsonl |
| --pretty | Pretty-print JSON output | false |
| --stats | Print chunk statistics instead of the chunks themselves | false |

pdf-struct-chunker -i document.pdf -o result.jsonl

pdf-struct-chunker -i document.pdf --format json --pretty

pdf-struct-chunker -i document.pdf --stats

pdf-struct-chunker -i document.pdf -p my_rules.json --format json --pretty

The core function operates entirely in memory, with no file I/O and no temp files. Feed it bytes from anywhere (file, HTTP upload, S3, database) and get chunks back:

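use pdf_struct_chunker::{chunk_pdf, Profile};

fn main() {
    // Read the whole PDF into memory; chunk_pdf works on the raw bytes.
    let bytes = std::fs::read("document.pdf").unwrap();

    // None: no custom profile is passed, so the built-in defaults apply.
    let chunks = chunk_pdf(&bytes, None).unwrap();

    // Print one summary line per chunk: section, heading, and page.
    for chunk in &chunks {
        println!("[{}] {} (p.{})",
            chunk.metadata.section,
            chunk.metadata.heading,
            chunk.metadata.page.unwrap_or(0), // prints 0 when no page is recorded
        );
    }
}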

By default, the chunker uses built-in heuristics optimized for legal and regulatory documents (detecting §, Article, Chapter, etc.). You can override this with your own regex rules.

Create a .json file (e.g., my_rules.json) and pass it via --profile:

pdf-struct-chunker -i document.pdf -p my_rules.json

The simplest profile just removes unwanted lines:

{
  "patterns": [
    {
      "role": "ignore",
      "regex": "Page \\d+",
      "flags": "i",
      "priority": 100
    }
  ]
}

This removes every line matching "Page 1", "Page 2", etc. from the output.

{
  "min_chunk_chars": 200,
  "max_chunk_chars": 1500,
  "patterns": [
    {
      "role": "ignore",
      "regex": "(?:Page|Footer text)",
      "flags": "i",
      "priority": 200
    },
    {
      "role": "heading_l1",
      "regex": "^((?:Chapter|Section)\\s*[\\d]+)\\s*(.*)",
      "flags": "i",
      "priority": 100
    },
    {
      "role": "definition",
      "regex": "\\b(?:means|shall mean|is defined as)",
      "flags": "i",
      "priority": 50
    }
  ]
}

| Role | What it does |
| --- | --- |
| heading_l1 | Starts a new chunk. Regex capture group 1 becomes metadata.section (e.g., "Chapter 3"), group 2 becomes metadata.heading (e.g., "Data Protection"). |
| definition | Triggers a soft split. If the current chunk has already reached min_chunk_chars, the chunker flushes it and starts a new one. |
| ignore | Removes the line entirely. Use this for page numbers, footers, headers, or any boilerplate you don't want in your chunks. |

| Field | Description | Default |
| --- | --- | --- |
| min_chunk_chars | Minimum chunk size before a "soft" split (at definitions or list items) is allowed | 200 |
| max_chunk_chars | Maximum chunk size; forces a split at the nearest sentence boundary | 1500 |
| patterns[].regex | Regular expression matched against each text line | |
| patterns[].role | One of: heading_l1, definition, ignore | |
| patterns[].flags | "i" = case-insensitive, "m" = multiline | "" |
| patterns[].priority | Higher value = evaluated first when multiple patterns match the same line | 0 |
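
How the roles and size limits interact can be sketched as below. This is a simplified model under stated assumptions, not the crate's implementation: lines arrive already classified, the `Role`, `Limits`, `assemble`, `flush`, and `sentence_cut` names are hypothetical, "nearest sentence boundary" is taken to mean the last '.', '!' or '?' before the limit, and merging of small fragments is omitted.

```rust
/// Line roles after classification (hypothetical enum; Body = no pattern matched).
#[derive(Debug, Clone, Copy)]
enum Role {
    Heading,
    Definition,
    Body,
    Ignore,
}

struct Limits {
    min_chunk_chars: usize,
    max_chunk_chars: usize,
}

/// Push the current chunk (if it has content) and start an empty one.
fn flush(current: &mut String, chunks: &mut Vec<String>) {
    if current.trim().is_empty() {
        current.clear();
    } else {
        chunks.push(std::mem::take(current));
    }
}

/// Byte index just after the last sentence end within the first `max_chars`
/// characters; falls back to a hard cut at `max_chars` if there is none.
fn sentence_cut(text: &str, max_chars: usize) -> usize {
    let limit = text.char_indices().nth(max_chars).map_or(text.len(), |(i, _)| i);
    match text[..limit].rfind(|c: char| c == '.' || c == '!' || c == '?') {
        Some(i) => i + 1, // the matched punctuation is one byte wide
        None => limit,
    }
}

fn assemble(lines: &[(Role, &str)], limits: &Limits) -> Vec<String> {
    assert!(limits.max_chunk_chars > 0, "max_chunk_chars must be positive");
    let mut chunks = Vec::new();
    let mut current = String::new();

    for (role, text) in lines {
        match role {
            Role::Ignore => continue,                             // drop the line
            Role::Heading => flush(&mut current, &mut chunks),    // hard split
            Role::Definition => {
                // Soft split: only once the chunk is big enough.
                if current.chars().count() >= limits.min_chunk_chars {
                    flush(&mut current, &mut chunks);
                }
            }
            Role::Body => {}
        }
        if !current.is_empty() {
            current.push('\n');
        }
        current.push_str(text);

        // Overflow: cut at a sentence boundary until the remainder fits.
        while current.chars().count() > limits.max_chunk_chars {
            let cut = sentence_cut(&current, limits.max_chunk_chars);
            let rest = current.split_off(cut);
            flush(&mut current, &mut chunks);
            current = rest.trim_start().to_string();
        }
    }
    flush(&mut current, &mut chunks);
    chunks
}

fn main() {
    let limits = Limits { min_chunk_chars: 20, max_chunk_chars: 80 };
    let lines = [
        (Role::Heading, "§ 1 Scope"),
        (Role::Body, "This regulation applies to all companies."),
        (Role::Ignore, "Page 1"),
        (Role::Heading, "§ 2 Definitions"),
        (Role::Body, "In this regulation, the following terms apply."),
        (Role::Definition, "\"Company\" means any legal entity."),
    ];
    for chunk in assemble(&lines, &limits) {
        println!("{chunk:?}");
    }
}
```

In this model a heading always starts a new chunk, while a definition line only does so once the running chunk has reached min_chunk_chars, which keeps short definitions attached to the text that introduces them.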

If you have any questions, feature requests, or just want to say hi, feel free to open an issue or reach out via my website, programmiere.de.

If this tool saved you time and you'd like to support its development, you can buy me a coffee via PayPal. ☕

MIT © Matthias Nordwig
