Split PDFs into semantically meaningful chunks — without LLMs, without APIs, without cloud dependencies.
Author: Matthias Nordwig · [programmiere.de]
Most RAG chunkers blindly split documents by token count or character limit. This destroys document structure — headings, sections, and paragraphs get ripped apart. The result: your vector search returns incoherent fragments with no context about where they came from.
pdf-struct-chunker solves this by analyzing the actual layout of a PDF: X/Y coordinates, font sizes, and bold detection. It detects where a heading starts, where a paragraph ends, and where a new section begins. Each chunk carries structured metadata (`section`, `heading`, `page`) so your RAG pipeline knows exactly what it's looking at.
No LLM needed. No API calls. Runs offline. Written in pure Rust.
❌ Standard RAG (Fixed-size overlap)
// Chunk 1
"This regulation applies to all companies. § 2 De-"
// Chunk 2
"finitions. In this regulation, the following terms"
Result: Words are cut in half, headings are disconnected from their content.
✅ pdf-struct-chunker (Layout-aware)
// Chunk 1
{
  "metadata": { "section": "§ 1", "heading": "Scope" },
  "text": "This regulation applies to all companies."
}
// Chunk 2
{
  "metadata": { "section": "§ 2", "heading": "Definitions" },
  "text": "In this regulation, the following terms..."
}
Result: Clean, semantic chunks, each tagged with its section and heading.
Built for speed and Edge-AI scenarios:
- **No GPU required** (pure CPU processing)
- **Extremely fast**: Processes a 100-page PDF in < 1 second on a standard laptop.
- **Low memory footprint**: Operates entirely in-memory without creating temporary files.
A sample PDF is included — you can try it immediately after cloning:
git clone https://github.com/matthiasnordwig/pdf-struct-chunker.git
cd pdf-struct-chunker
cargo run --release -- -i fixtures/sample.pdf --format json --pretty
Output:
[
  {
    "index": 0,
    "char_start": 0,
    "char_end": 441,
    "text": "§ 1 Anwendungsbereich\nDiese Verordnung gilt für alle Unternehmen ...",
    "signature": "§ 1 Anwendungsbereich\nDiese Verordnung gilt für alle Unternehmen",
    "metadata": {
      "section": "§ 1",
      "heading": "Anwendungsbereich",
      "page": 2
    }
  },
  {
    "index": 1,
    "text": "§ 2 Begriffsbestimmungen\nIm Sinne dieser Verordnung ...",
    "metadata": {
      "section": "§ 2",
      "heading": "Begriffsbestimmungen",
      "page": 2
    }
  }
]
Every chunk knows its section, heading, and page number — ready for embedding.
PDF bytes ──► pdf_oxide (extract characters with X/Y positions + font sizes)
│
▼
Line Classification
(match lines against your regex profiles, or fall back to font-size heuristics)
│
▼
Chunk Assembly
(split at headings, merge small fragments, split overflow at sentence boundaries)
│
▼
Vec<Chunk> { text, section, heading, page }
The chunker processes each PDF page by extracting character-level bounding boxes, reconstructing lines from Y-coordinates, classifying them using configurable regex patterns (or font-size heuristics as fallback), and assembling them into semantically coherent chunks with structural metadata.
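The line-reconstruction step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the crate's actual code: the `Glyph` struct, the `reconstruct_lines` function, and the `y_tolerance` parameter are hypothetical names, and the real extractor (pdf_oxide) exposes its own types.

```rust
/// Hypothetical stand-in for one extracted character; not the pdf_oxide type.
#[derive(Debug, Clone)]
struct Glyph {
    ch: char,
    x: f32,
    y: f32,
}

/// Groups glyphs into text lines. A glyph joins the current line when its Y
/// differs from the line's first glyph by at most `y_tolerance`; each line is
/// then ordered left to right by X.
fn reconstruct_lines(glyphs: &[Glyph], y_tolerance: f32) -> Vec<String> {
    let mut sorted: Vec<&Glyph> = glyphs.iter().collect();
    // Assumption: PDF-style coordinates, where a larger Y is higher on the page.
    sorted.sort_by(|a, b| b.y.total_cmp(&a.y));

    let mut lines: Vec<Vec<&Glyph>> = Vec::new();
    for g in sorted {
        let same_line = lines
            .last()
            .map_or(false, |line| (line[0].y - g.y).abs() <= y_tolerance);
        if same_line {
            lines.last_mut().unwrap().push(g);
        } else {
            lines.push(vec![g]);
        }
    }

    lines
        .into_iter()
        .map(|mut line| {
            line.sort_by(|a, b| a.x.total_cmp(&b.x));
            line.iter().map(|g| g.ch).collect::<String>()
        })
        .collect()
}

fn main() {
    let glyphs = vec![
        Glyph { ch: 'b', x: 20.0, y: 700.2 },
        Glyph { ch: 'a', x: 10.0, y: 700.0 },
        Glyph { ch: 'c', x: 10.0, y: 680.0 },
    ];
    for line in reconstruct_lines(&glyphs, 2.0) {
        println!("{line}");
    }
}
```

The tolerance absorbs small baseline jitter between glyphs on the same line; a real implementation would likely derive it from the font size rather than use a fixed value.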
git clone https://github.com/matthiasnordwig/pdf-struct-chunker.git
cd pdf-struct-chunker
cargo build --release
Add this to your `Cargo.toml`:
[dependencies]
pdf-struct-chunker = { git = "https://github.com/matthiasnordwig/pdf-struct-chunker" }
pdf-struct-chunker [OPTIONS] --input <INPUT>
| Flag | Description | Default |
|---|---|---|
| `-i, --input <FILE>` | Path to the input PDF file | Required |
| `-p, --profile <FILE>` | Path to a JSON profile with custom regex rules (see below) | Built-in defaults |
| `-o, --output <FILE>` | Output file path | stdout |
| `--format <FORMAT>` | Output format: `jsonl` or `json` | `jsonl` |
| `--pretty` | Pretty-print JSON output | `false` |
| `--stats` | Print chunk statistics instead of the chunks themselves | `false` |
pdf-struct-chunker -i document.pdf -o result.jsonl
pdf-struct-chunker -i document.pdf --format json --pretty
pdf-struct-chunker -i document.pdf --stats
pdf-struct-chunker -i document.pdf -p my_rules.json --format json --pretty
The core function operates entirely in memory, with no file I/O and no temp files. Feed it bytes from anywhere (file, HTTP upload, S3, database) and get chunks back:
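use pdf_struct_chunker::chunk_pdf;

fn main() {
    // Any byte source works; reading a file is just the simplest one.
    let bytes = std::fs::read("document.pdf").expect("failed to read document.pdf");
    // No profile passed: the chunker falls back to its built-in heuristics.
    let chunks = chunk_pdf(&bytes, None).expect("failed to chunk PDF");
    for chunk in &chunks {
        println!(
            "[{}] {} (p.{})",
            chunk.metadata.section,
            chunk.metadata.heading,
            // `page` is optional; print 0 when it is missing.
            chunk.metadata.page.unwrap_or(0),
        );
    }
}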
By default, the chunker uses built-in heuristics optimized for legal and regulatory documents (detecting `§`, `Article`, `Chapter`, etc.). You can override this with your own regex rules.
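To give a feel for what a heading rule has to do, here is a small standard-library sketch that splits a heading line into a section marker and a title. It is an illustration only: `parse_heading` and its marker list are assumptions made for this example, the built-in rules themselves are not shown in this README, and the real chunker works with regexes rather than prefix checks.

```rust
/// Splits a heading line such as "§ 2 Definitions" into (section, heading).
/// Returns None when the line does not start with a known marker followed
/// by a number. Matching is case-sensitive in this sketch.
fn parse_heading(line: &str) -> Option<(String, String)> {
    // Hypothetical marker list, taken from the examples named in the text.
    const MARKERS: [&str; 3] = ["§", "Article", "Chapter"];

    let line = line.trim();
    let marker = MARKERS.iter().find(|m| line.starts_with(**m))?;
    let rest = line[marker.len()..].trim_start();

    // The section number is the leading run of ASCII digits.
    let digits = rest.chars().take_while(|c| c.is_ascii_digit()).count();
    if digits == 0 {
        return None;
    }
    let (number, title) = rest.split_at(digits);
    Some((format!("{marker} {number}"), title.trim().to_string()))
}

fn main() {
    for line in ["§ 2 Begriffsbestimmungen", "Chapter 3 Data Protection", "Plain body text"] {
        println!("{line:?} -> {:?}", parse_heading(line));
    }
}
```

The two returned strings correspond to what a profile expresses with regex capture groups 1 and 2.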
Create a `.json` file (e.g., `my_rules.json`) and pass it via `--profile`:
pdf-struct-chunker -i document.pdf -p my_rules.json
The simplest profile just removes unwanted lines:
{
  "patterns": [
    {
      "role": "ignore",
      "regex": "Page \\d+",
      "flags": "i",
      "priority": 100
    }
  ]
}
This removes every line matching "Page 1", "Page 2", etc. from the output.
{
  "min_chunk_chars": 200,
  "max_chunk_chars": 1500,
  "patterns": [
    {
      "role": "ignore",
      "regex": "(?:Page|Footer text)",
      "flags": "i",
      "priority": 200
    },
    {
      "role": "heading_l1",
      "regex": "^((?:Chapter|Section)\\s*[\\d]+)\\s*(.*)",
      "flags": "i",
      "priority": 100
    },
    {
      "role": "definition",
      "regex": "\\b(?:means|shall mean|is defined as)",
      "flags": "i",
      "priority": 50
    }
  ]
}
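When a line matches more than one pattern, the `priority` field decides which role applies. The sketch below illustrates that selection under simplifying assumptions: the `Pattern` struct and `classify` function are hypothetical, and a case-insensitive substring check stands in for real regex matching so the example runs on the standard library alone.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    HeadingL1,
    Definition,
    Ignore,
}

/// Simplified pattern: `needle` is a plain substring standing in for a regex.
struct Pattern {
    role: Role,
    needle: &'static str,
    priority: i32,
}

/// Returns the role of the highest-priority pattern that matches `line`,
/// or None if no pattern matches.
fn classify(line: &str, patterns: &[Pattern]) -> Option<Role> {
    let lower = line.to_lowercase();
    patterns
        .iter()
        .filter(|p| lower.contains(&p.needle.to_lowercase()))
        .max_by_key(|p| p.priority)
        .map(|p| p.role)
}

fn main() {
    let patterns = [
        Pattern { role: Role::Ignore, needle: "page", priority: 200 },
        Pattern { role: Role::HeadingL1, needle: "chapter", priority: 100 },
        Pattern { role: Role::Definition, needle: "means", priority: 50 },
    ];
    // Matches both "page" and "means"; the higher priority wins.
    println!("{:?}", classify("Page 4 means nothing here", &patterns));
}
```

Giving `ignore` rules the highest priority, as in the profile above, keeps boilerplate lines from being picked up as headings or definitions.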
| Role | What it does |
|---|---|
| `heading_l1` | Starts a new chunk. Regex capture group 1 becomes `metadata.section` (e.g., "Chapter 3"), group 2 becomes `metadata.heading` (e.g., "Data Protection"). |
| `definition` | Triggers a soft split. If the current chunk has already reached `min_chunk_chars`, the chunker flushes it and starts a new one. |
| `ignore` | Removes the line entirely. Use this for page numbers, footers, headers, or any boilerplate you don't want in your chunks. |
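The interplay of the three roles can be sketched as a small assembly loop. This is an assumption-laden illustration, not the crate's implementation: the `assemble` function and the extra `Body` variant (for lines that match no pattern) are invented for this example, and lines are assumed to be classified already.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    HeadingL1,
    Definition,
    Ignore,
    Body, // hypothetical: a line that matched no pattern
}

/// Assembles pre-classified lines into chunk texts.
fn assemble(lines: &[(Role, &str)], min_chunk_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();

    for &(role, text) in lines {
        let split = match role {
            Role::Ignore => continue, // drop the line entirely
            Role::HeadingL1 => true,  // hard split: always start a new chunk
            // Soft split: only once the current chunk is large enough.
            Role::Definition => current.chars().count() >= min_chunk_chars,
            Role::Body => false,
        };
        if split && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push('\n');
        }
        current.push_str(text);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let lines = [
        (Role::HeadingL1, "§ 1 Scope"),
        (Role::Body, "Applies to all."),
        (Role::Ignore, "Page 1"),
        (Role::Definition, "X means Y."),
        (Role::HeadingL1, "§ 2 Definitions"),
        (Role::Body, "Terms."),
    ];
    for chunk in assemble(&lines, 10) {
        println!("---\n{chunk}");
    }
}
```

With a large `min_chunk_chars`, the definition line stays attached to the preceding text instead of opening its own chunk, which is the point of a soft split.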
| Field | Description | Default |
|---|---|---|
| `min_chunk_chars` | Minimum chunk size before a "soft" split (at definitions or list items) is allowed | `200` |
| `max_chunk_chars` | Maximum chunk size; forces a split at the nearest sentence boundary | `1500` |
| `patterns[].regex` | Regular expression matched against each text line | (none) |
| `patterns[].role` | One of: `heading_l1`, `definition`, `ignore` | (none) |
| `patterns[].flags` | `"i"` = case-insensitive, `"m"` = multiline | `""` |
| `patterns[].priority` | Higher value = evaluated first when multiple patterns match the same line | `0` |
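The effect of `max_chunk_chars` can be illustrated with a short sketch of overflow splitting. It is a simplified approximation, not the crate's algorithm: `split_overflow` is a hypothetical function, a sentence boundary is assumed to be a period followed by a space, and the cut is made at the last such boundary before the limit (with a hard cut if there is none).

```rust
/// Splits `text` into pieces of at most `max_chars` characters, preferring
/// to cut after the last ". " inside each window.
fn split_overflow(text: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    let mut pieces = Vec::new();
    let mut rest = text.trim();

    while rest.chars().count() > max_chars {
        // Byte offset just past the first `max_chars` characters.
        let window_end = rest
            .char_indices()
            .nth(max_chars)
            .map_or(rest.len(), |(i, _)| i);
        let window = &rest[..window_end];
        // Cut right after the last sentence-ending period; hard cut if none.
        let cut = window.rfind(". ").map_or(window_end, |i| i + 1);
        pieces.push(rest[..cut].trim().to_string());
        rest = rest[cut..].trim_start();
    }
    if !rest.is_empty() {
        pieces.push(rest.to_string());
    }
    pieces
}

fn main() {
    for piece in split_overflow("One two. Three four. Five six.", 25) {
        println!("{piece}");
    }
}
```

Abbreviations such as "e.g. " would be treated as sentence ends by this naive rule; legal texts in particular tend to need a more careful boundary check.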
If you have any questions, feature requests, or just want to say hi, feel free to open an issue or reach out via my website (programmiere.de).
If this tool saved you time and you'd like to support its development, you can buy me a coffee via PayPal. ☕
MIT © Matthias Nordwig