LLM-free, layout-aware PDF chunker in pure Rust

Developer Matthias Nordwig released pdf-struct-chunker, an open-source PDF chunker written in pure Rust that splits documents into semantically meaningful chunks by analyzing layout features like X/Y coordinates, font sizes, and bold detection, without relying on LLMs or cloud APIs. The tool aims to improve RAG pipelines by preserving document structure and metadata, processing a 100-page PDF in under a second on a standard laptop.

Split PDFs into semantically meaningful chunks: without LLMs, without APIs, without cloud dependencies.

🌐 Author: Matthias Nordwig · programmiere.de

Most RAG chunkers blindly split documents by token count or character limit. This destroys document structure: headings, sections, and paragraphs get ripped apart. The result is that your vector search returns incoherent fragments with no context about where they came from.

pdf-struct-chunker solves this by analyzing the actual layout of a PDF: X/Y coordinates, font sizes, and bold detection. It understands where a heading starts, where a paragraph ends, and where a new section begins. Each chunk carries structured metadata (`section`, `heading`, `page`) so your RAG pipeline knows exactly what it's looking at.

No LLM needed. No API calls. Runs offline. Written in pure Rust.

❌ Standard RAG (fixed-size overlap)

    // Chunk 1
    "This regulation applies to all companies. § 2 De-"
    // Chunk 2
    "finitions. In this regulation, the following terms"

Result: Words are cut in half, and headings are disconnected from their content.

✅ pdf-struct-chunker (layout-aware)

    // Chunk 1
    {
      "metadata": { "section": "§ 1", "heading": "Scope" },
      "text": "This regulation applies to all companies."
    }
    // Chunk 2
    {
      "metadata": { "section": "§ 2", "heading": "Definitions" },
      "text": "In this regulation, the following terms..."
    }

Result: Clean, semantic chunks that carry their section and heading metadata.

Built for speed and Edge-AI scenarios:

- No GPU required: pure CPU processing.
- Extremely fast: processes a 100-page PDF in < 1 second on a standard laptop.
- Low memory footprint: operates entirely in-memory without creating temporary files.
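To make the layout signals above more concrete, here is a minimal sketch of a heading heuristic based on font size and bold detection. The `Line` struct, the function names, and the thresholds (1.15x the body size, 80 characters) are assumptions for illustration only; they are not the crate's actual types or tuning.

```rust
/// A text line with the layout features mentioned above.
/// Hypothetical type for illustration, not the crate's real data model.
#[derive(Debug, Clone)]
struct Line {
    text: String,
    font_size: f32,
    bold: bool,
}

/// Median font size across all lines, used as the "body text" baseline.
fn body_font_size(lines: &[Line]) -> f32 {
    let mut sizes: Vec<f32> = lines.iter().map(|l| l.font_size).collect();
    sizes.sort_by(|a, b| a.total_cmp(b));
    if sizes.is_empty() {
        0.0
    } else {
        sizes[sizes.len() / 2]
    }
}

/// Treat a line as a heading if it is clearly larger than body text,
/// or if it is bold and short. Both thresholds are illustrative guesses.
fn looks_like_heading(line: &Line, body_size: f32) -> bool {
    let larger = line.font_size >= body_size * 1.15;
    let short_bold = line.bold && line.text.chars().count() <= 80;
    larger || short_bold
}

fn main() {
    // Sample data made up for this sketch.
    let lines = vec![
        Line { text: "§ 1 Scope".into(), font_size: 14.0, bold: true },
        Line { text: "This regulation applies to all companies.".into(), font_size: 10.0, bold: false },
        Line { text: "It applies regardless of company size.".into(), font_size: 10.0, bold: false },
        Line { text: "§ 2 Definitions".into(), font_size: 14.0, bold: true },
        Line { text: "In this regulation, the following terms apply.".into(), font_size: 10.0, bold: false },
    ];

    let body = body_font_size(&lines);
    for line in &lines {
        let tag = if looks_like_heading(line, body) { "HEADING" } else { "body" };
        println!("{tag:>7}  {}", line.text);
    }
}
```

Using the median as the baseline keeps a few large headings from skewing what counts as body text, which is one plausible reason to prefer it over the mean.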
A sample PDF is included, so you can try it immediately after cloning:

    git clone https://github.com/matthiasnordwig/pdf-struct-chunker.git
    cd pdf-struct-chunker
    cargo run --release -- -i fixtures/sample.pdf --format json --pretty

Output:

    [
      {
        "index": 0,
        "char_start": 0,
        "char_end": 441,
        "text": "§ 1 Anwendungsbereich\nDiese Verordnung gilt für alle Unternehmen ...",
        "signature": "§ 1 Anwendungsbereich\nDiese Verordnung gilt für alle Unternehmen",
        "metadata": { "section": "§ 1", "heading": "Anwendungsbereich", "page": 2 }
      },
      {
        "index": 1,
        "text": "§ 2 Begriffsbestimmungen\nIm Sinne dieser Verordnung ...",
        "metadata": { "section": "§ 2", "heading": "Begriffsbestimmungen", "page": 2 }
      }
    ]

Every chunk knows its section, heading, and page number, and is ready for embedding.

    PDF bytes ──► pdf_oxide              extract characters with X/Y positions + font sizes
                      │
                      ▼
               Line Classification      match lines against your regex profiles,
                      │                 or fall back to font-size heuristics
                      ▼
               Chunk Assembly           split at headings, merge small fragments,
                      │                 split overflow at sentence boundaries
                      ▼
               Vec<Chunk { text, section, heading, page }>

The chunker processes each PDF page by extracting character-level bounding boxes, reconstructing lines from Y-coordinates, classifying them using configurable regex patterns (or font-size heuristics as fallback), and assembling them into semantically coherent chunks with structural metadata.
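The line-reconstruction step described above can be sketched as follows: group characters whose Y-coordinates are close together, then order each group by X. The `Glyph` type, the function name, and the tolerance value are hypothetical; the real extraction is done by pdf_oxide, whose types may look quite different.

```rust
/// One extracted character with its position on the page.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy)]
struct Glyph {
    ch: char,
    x: f32,
    y: f32,
}

/// Group glyphs into text lines. A glyph joins the current line if its Y
/// differs from the line's first glyph by at most `tolerance`.
/// Assumes PDF-style coordinates where Y grows upward, so larger Y is
/// nearer the top of the page.
fn reconstruct_lines(glyphs: &[Glyph], tolerance: f32) -> Vec<String> {
    // Visit glyphs from the top of the page to the bottom.
    let mut sorted = glyphs.to_vec();
    sorted.sort_by(|a, b| b.y.total_cmp(&a.y));

    let mut lines: Vec<Vec<Glyph>> = Vec::new();
    for g in sorted {
        let same_line = lines
            .last()
            .map_or(false, |line| (line[0].y - g.y).abs() <= tolerance);
        if same_line {
            lines.last_mut().unwrap().push(g);
        } else {
            lines.push(vec![g]);
        }
    }

    // Within each line, order glyphs left to right and join them.
    lines
        .into_iter()
        .map(|mut line| {
            line.sort_by(|a, b| a.x.total_cmp(&b.x));
            line.iter().map(|g| g.ch).collect::<String>()
        })
        .collect()
}

fn main() {
    // Glyphs arrive in arbitrary order, with slight vertical jitter.
    let glyphs = [
        Glyph { ch: 'i', x: 15.0, y: 699.8 },
        Glyph { ch: 'Y', x: 10.0, y: 680.0 },
        Glyph { ch: 'H', x: 10.0, y: 700.0 },
        Glyph { ch: 'o', x: 15.0, y: 680.3 },
    ];
    for line in reconstruct_lines(&glyphs, 2.0) {
        println!("{line}");
    }
}
```

Grouping by Y before sorting by X matters: sorting on both keys at once would let small baseline jitter scramble the reading order within a line.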
Build from source:

    git clone https://github.com/matthiasnordwig/pdf-struct-chunker.git
    cd pdf-struct-chunker
    cargo build --release

Add this to your `Cargo.toml`:

    [dependencies]
    pdf-struct-chunker = { git = "https://github.com/matthiasnordwig/pdf-struct-chunker" }

Usage:

    pdf-struct-chunker [OPTIONS] --input <INPUT>

| Flag | Description | Default |
|---|---|---|
| `-i, --input <FILE>` | Path to the input PDF file | Required |
| `-p, --profile <FILE>` | Path to a JSON profile with custom regex rules (see below) | Built-in defaults |
| `-o, --output <FILE>` | Output file path | stdout |
| `--format <FORMAT>` | Output format: `jsonl` or `json` | `jsonl` |
| `--pretty` | Pretty-print JSON output | false |
| `--stats` | Print chunk statistics instead of the chunks themselves | false |

    # Chunk a PDF and save as JSONL
    pdf-struct-chunker -i document.pdf -o result.jsonl

    # Pretty-print JSON to the console
    pdf-struct-chunker -i document.pdf --format json --pretty

    # See how many chunks were created and their sizes
    pdf-struct-chunker -i document.pdf --stats

    # Use your own regex rules
    pdf-struct-chunker -i document.pdf -p my_rules.json --format json --pretty

The core function operates entirely in-memory, with no file I/O and no temp files. Feed it bytes from anywhere (file, HTTP upload, S3, database) and get chunks back:

    use pdf_struct_chunker::{chunk_pdf, Profile};

    fn main() {
        // Read the PDF into memory; any byte source works here.
        let bytes = std::fs::read("document.pdf").unwrap();
        // `None` selects the built-in default profile.
        let chunks = chunk_pdf(&bytes, None).unwrap();

        for chunk in &chunks {
            println!(
                "[{}] {} (p.{})",
                chunk.metadata.section,
                chunk.metadata.heading,
                chunk.metadata.page.unwrap_or(0),
            );
        }
    }

By default, the chunker uses built-in heuristics optimized for legal and regulatory documents (detecting `§`, `Article`, `Chapter`, etc.). You can override this with your own regex rules.
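As a rough illustration of what detecting markers such as `§`, `Article`, and `Chapter` can involve, the sketch below splits a heading line into a section label and a heading title using only the standard library. The function name and parsing rules are assumptions made for this example; the crate's built-in heuristics are not documented here and may work differently.

```rust
/// Split a heading line such as "§ 2 Definitions" into (section, heading).
/// Returns `None` if the line does not start with a known marker followed
/// by a number. Hypothetical helper, not part of the crate's API.
fn parse_section_heading(line: &str) -> Option<(String, String)> {
    const MARKERS: [&str; 3] = ["§", "Article", "Chapter"];

    let mut words = line.split_whitespace();
    let marker = words.next()?;
    if !MARKERS.contains(&marker) {
        return None;
    }

    // The section number must start with a digit, e.g. "2" or "12a".
    let number = words.next()?;
    if !number.starts_with(|c: char| c.is_ascii_digit()) {
        return None;
    }

    // Everything after the number is treated as the heading title.
    let heading = words.collect::<Vec<_>>().join(" ");
    Some((format!("{marker} {number}"), heading))
}

fn main() {
    for line in [
        "§ 2 Definitions",
        "Article 5 Data Protection",
        "This regulation applies to all companies.",
    ] {
        match parse_section_heading(line) {
            Some((section, heading)) => println!("section={section:?} heading={heading:?}"),
            None => println!("not a heading: {line:?}"),
        }
    }
}
```

This simple version expects whitespace between the marker and the number, so a line like "§2 Definitions" would not be recognized; a regex-based rule handles such variants more comfortably.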
Create a `.json` file (e.g., `my_rules.json`) and pass it via `--profile`:

    pdf-struct-chunker -i document.pdf -p my_rules.json

The simplest profile just removes unwanted lines:

    {
      "patterns": [
        { "role": "ignore", "regex": "Page \\d+", "flags": "i", "priority": 100 }
      ]
    }

This removes every line matching "Page 1", "Page 2", etc. from the output.

A fuller profile combines several rules:

    {
      "min_chunk_chars": 200,
      "max_chunk_chars": 1500,
      "patterns": [
        { "role": "ignore", "regex": "(?:Page|Footer text)", "flags": "i", "priority": 200 },
        { "role": "heading_l1", "regex": "^((?:Chapter|Section)\\s*\\d+)\\s*(.*)", "flags": "i", "priority": 100 },
        { "role": "definition", "regex": "\\b(?:means|shall mean|is defined as)", "flags": "i", "priority": 50 }
      ]
    }

| Role | What it does |
|---|---|
| `heading_l1` | Starts a new chunk. Regex capture group 1 becomes `metadata.section` (e.g., "Chapter 3"), group 2 becomes `metadata.heading` (e.g., "Data Protection"). |
| `definition` | Triggers a soft split. If the current chunk has already reached `min_chunk_chars`, the chunker flushes it and starts a new one. |
| `ignore` | Removes the line entirely. Use this for page numbers, footers, headers, or any boilerplate you don't want in your chunks. |

| Field | Description | Default |
|---|---|---|
| `min_chunk_chars` | Minimum chunk size before a "soft" split (at definitions or list items) is allowed | 200 |
| `max_chunk_chars` | Maximum chunk size; forces a split at the nearest sentence boundary | 1500 |
| `patterns[].regex` | Regular expression matched against each text line | none |
| `patterns[].role` | One of: `heading_l1`, `definition`, `ignore` | none |
| `patterns[].flags` | `"i"` = case-insensitive, `"m"` = multiline | `""` |
| `patterns[].priority` | Higher value = evaluated first when multiple patterns match the same line | 0 |

If you have any questions, feature requests, or just want to say hi, feel free to open an issue or reach out via my website.

If this tool saved you time and you'd like to support its development, you can buy me a coffee via PayPal: https://www.paypal.me/MatthiasNordwig
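For readers who want to see how the role and priority rules in the profile reference fit together, here is a minimal sketch under stated assumptions. To stay free of external crates, plain function pointers stand in for the regexes, and `max_chunk_chars` splitting is left out. All type and function names (`Pattern`, `classify`, `assemble`, `demo_patterns`) are hypothetical and are not the crate's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    HeadingL1,
    Definition,
    Ignore,
}

/// Stand-in for a profile pattern. The real tool matches a regex; a plain
/// function pointer keeps this sketch dependency-free.
struct Pattern {
    role: Role,
    priority: i32,
    matches: fn(&str) -> bool,
}

/// Return the role of the highest-priority pattern that matches the line.
fn classify(line: &str, patterns: &[Pattern]) -> Option<Role> {
    patterns
        .iter()
        .filter(|p| (p.matches)(line))
        .max_by_key(|p| p.priority)
        .map(|p| p.role)
}

/// Move the current chunk into the output if it is non-empty.
fn flush(current: &mut String, chunks: &mut Vec<String>) {
    if !current.is_empty() {
        chunks.push(std::mem::take(current));
    }
}

/// Assemble lines into chunks: headings always split, definitions split only
/// once the current chunk has reached `min_chunk_chars`, ignored lines vanish.
fn assemble(lines: &[&str], patterns: &[Pattern], min_chunk_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for &line in lines {
        match classify(line, patterns) {
            Some(Role::Ignore) => continue,
            Some(Role::HeadingL1) => flush(&mut current, &mut chunks),
            Some(Role::Definition) if current.chars().count() >= min_chunk_chars => {
                flush(&mut current, &mut chunks)
            }
            _ => {}
        }
        if !current.is_empty() {
            current.push('\n');
        }
        current.push_str(line);
    }
    flush(&mut current, &mut chunks);
    chunks
}

/// Example rules loosely mirroring the fuller profile, with simplified matchers.
fn demo_patterns() -> Vec<Pattern> {
    vec![
        Pattern { role: Role::Ignore, priority: 200, matches: |l: &str| l.starts_with("Page ") },
        Pattern { role: Role::HeadingL1, priority: 100, matches: |l: &str| l.starts_with("Chapter ") },
        Pattern { role: Role::Definition, priority: 50, matches: |l: &str| l.contains(" means ") },
    ]
}

fn main() {
    let lines = [
        "Chapter 1 Scope",
        "This regulation applies to all companies.",
        "Page 1",
        "Chapter 2 Definitions",
        "\"Company\" means any legal entity.",
        "\"Group\" means a parent and its subsidiaries.",
    ];
    for (i, chunk) in assemble(&lines, &demo_patterns(), 40).iter().enumerate() {
        println!("--- chunk {i} ---\n{chunk}");
    }
}
```

With a threshold of 40 characters, the first definition stays attached to its "Chapter 2" heading because that chunk is still too short, while the second definition starts a chunk of its own.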
MIT © Matthias Nordwig