# LLM-free, layout-aware PDF chunker in pure Rust

> Source: <https://github.com/matthiasnordwig/pdf-struct-chunker>
> Published: 2026-06-29 09:06:32+00:00

**Split PDFs into semantically meaningful chunks — without LLMs, without APIs, without cloud dependencies.**

🌐

Author: Matthias Nordwig·[programmiere.de]

Most RAG chunkers blindly split documents by token count or character limit. This destroys document structure — headings, sections, and paragraphs get ripped apart. The result: your vector search returns incoherent fragments with no context about where they came from.

**pdf-struct-chunker** solves this by analyzing the actual *layout* of a PDF: X/Y coordinates, font sizes, and bold detection. It understands where a heading starts, where a paragraph ends, and where a new section begins. Each chunk carries structured metadata (`section`

, `heading`

, `page`

) so your RAG pipeline knows exactly what it's looking at.

No LLM needed. No API calls. Runs offline. Written in pure Rust.

**❌ Standard RAG (Fixed-size overlap)**

```
// Chunk 1
"This regulation applies to all companies. § 2 De-"

// Chunk 2
"finitions. In this regulation, the following terms"
```

*Result: Words are cut in half, headings are disconnected from their content.*

**✅ pdf-struct-chunker (Layout-aware)**

```
// Chunk 1
{
  "metadata": { "section": "§ 1", "heading": "Scope" },
  "text": "This regulation applies to all companies."
}
// Chunk 2
{
  "metadata": { "section": "§ 2", "heading": "Definitions" },
  "text": "In this regulation, the following terms..."
}
```

*Result: Clean, semantic chunks with perfect metadata.*

Built for speed and Edge-AI scenarios:

**No GPU required**(pure CPU processing)** Extremely fast**: Processes a 100-page PDF in`< 1 second`

on a standard laptop.**Low memory footprint**: Operates entirely in-memory without creating temporary files.

A sample PDF is included — you can try it immediately after cloning:

```
git clone https://github.com/matthiasnordwig/pdf-struct-chunker.git
cd pdf-struct-chunker
cargo run --release -- -i fixtures/sample.pdf --format json --pretty
```

Output:

```
[
  {
    "index": 0,
    "char_start": 0,
    "char_end": 441,
    "text": "§ 1 Anwendungsbereich\nDiese Verordnung gilt für alle Unternehmen ...",
    "signature": "§ 1 Anwendungsbereich\nDiese Verordnung gilt für alle Unternehmen",
    "metadata": {
      "section": "§ 1",
      "heading": "Anwendungsbereich",
      "page": 2
    }
  },
  {
    "index": 1,
    "text": "§ 2 Begriffsbestimmungen\nIm Sinne dieser Verordnung ...",
    "metadata": {
      "section": "§ 2",
      "heading": "Begriffsbestimmungen",
      "page": 2
    }
  }
]
```

Every chunk knows its section, heading, and page number — ready for embedding.

```
PDF bytes ──► pdf_oxide (extract characters with X/Y positions + font sizes)
                │
                ▼
          Line Classification
          (match lines against your regex profiles, or fall back to font-size heuristics)
                │
                ▼
          Chunk Assembly
          (split at headings, merge small fragments, split overflow at sentence boundaries)
                │
                ▼
          Vec<Chunk> { text, section, heading, page }
```

The chunker processes each PDF page by extracting character-level bounding boxes, reconstructing lines from Y-coordinates, classifying them using configurable regex patterns (or font-size heuristics as fallback), and assembling them into semantically coherent chunks with structural metadata.

```
git clone https://github.com/matthiasnordwig/pdf-struct-chunker.git
cd pdf-struct-chunker
cargo build --release
```

Add this to your `Cargo.toml`

:

```
[dependencies]
pdf-struct-chunker = { git = "https://github.com/matthiasnordwig/pdf-struct-chunker" }
pdf-struct-chunker [OPTIONS] --input <INPUT>
```

| Flag | Description | Default |
|---|---|---|
`-i, --input <FILE>` |
Path to the input PDF file | Required |
`-p, --profile <FILE>` |
Path to a JSON profile with custom regex rules (see below) | Built-in defaults |
`-o, --output <FILE>` |
Output file path | `stdout` |
`--format <FORMAT>` |
Output format: `jsonl` or `json` |
`jsonl` |
`--pretty` |
Pretty-print JSON output | `false` |
`--stats` |
Print chunk statistics instead of the chunks themselves | `false` |

```
# Chunk a PDF and save as JSONL
pdf-struct-chunker -i document.pdf -o result.jsonl

# Pretty-print JSON to the console
pdf-struct-chunker -i document.pdf --format json --pretty

# See how many chunks were created and their sizes
pdf-struct-chunker -i document.pdf --stats

# Use your own regex rules
pdf-struct-chunker -i document.pdf -p my_rules.json --format json --pretty
```

The core function operates entirely in-memory — no file I/O, no temp files. Feed it bytes from anywhere (file, HTTP upload, S3, database) and get chunks back instantly:

``` js
use pdf_struct_chunker::{chunk_pdf, Profile};

fn main() {
    let bytes = std::fs::read("document.pdf").unwrap();

    let chunks = chunk_pdf(&bytes, None).unwrap();

    for chunk in &chunks {
        println!("[{}] {} (p.{})",
            chunk.metadata.section,
            chunk.metadata.heading,
            chunk.metadata.page.unwrap_or(0),
        );
    }
}
```

By default, the chunker uses built-in heuristics optimized for legal and regulatory documents (detecting `§`

, `Article`

, `Chapter`

, etc.). You can override this with your own regex rules.

Create a `.json`

file (e.g., `my_rules.json`

) and pass it via `--profile`

:

```
pdf-struct-chunker -i document.pdf -p my_rules.json
```

The simplest profile just removes unwanted lines:

```
{
  "patterns": [
    {
      "role": "ignore",
      "regex": "Page \\d+",
      "flags": "i",
      "priority": 100
    }
  ]
}
```

This removes every line matching "Page 1", "Page 2", etc. from the output.

```
{
  "min_chunk_chars": 200,
  "max_chunk_chars": 1500,
  "patterns": [
    {
      "role": "ignore",
      "regex": "(?:Page|Footer text)",
      "flags": "i",
      "priority": 200
    },
    {
      "role": "heading_l1",
      "regex": "^((?:Chapter|Section)\\s*[\\d]+)\\s*(.*)",
      "flags": "i",
      "priority": 100
    },
    {
      "role": "definition",
      "regex": "\\b(?:means|shall mean|is defined as)",
      "flags": "i",
      "priority": 50
    }
  ]
}
```

| Role | What it does |
|---|---|
`heading_l1` |
Starts a new chunk. Regex capture group 1 becomes `metadata.section` (e.g., "Chapter 3"), group 2 becomes `metadata.heading` (e.g., "Data Protection"). |
`definition` |
Triggers a soft split. If the current chunk has already reached `min_chunk_chars` , the chunker flushes it and starts a new one. |
`ignore` |
Removes the line entirely. Use this for page numbers, footers, headers, or any boilerplate you don't want in your chunks. |

| Field | Description | Default |
|---|---|---|
`min_chunk_chars` |
Minimum chunk size before a "soft" split (at definitions or list items) is allowed | `200` |
`max_chunk_chars` |
Maximum chunk size — forces a split at the nearest sentence boundary | `1500` |
`patterns[].regex` |
Regular expression matched against each text line | — |
`patterns[].role` |
One of: `heading_l1` , `definition` , `ignore` |
— |
`patterns[].flags` |
`"i"` = case-insensitive, `"m"` = multiline |
`""` |
`patterns[].priority` |
Higher value = evaluated first when multiple patterns match the same line | `0` |

If you have any questions, feature requests, or just want to say hi, feel free to open an issue or reach out via my website:

If this tool saved you time and you'd like to support its development, you can [buy me a coffee via PayPal](https://www.paypal.me/MatthiasNordwig). ☕

MIT © Matthias Nordwig
