LLM-free, layout-aware PDF chunker in pure Rust

Developer Matthias Nordwig released pdf-struct-chunker, an open-source PDF chunker written in pure Rust. It splits documents into semantically meaningful chunks by analyzing layout features such as X/Y coordinates, font sizes, and bold text, without relying on LLMs or cloud APIs. The tool aims to improve RAG pipelines by preserving document structure and metadata, and it processes a 100-page PDF in under a second on a standard laptop.

Split PDFs into semantically meaningful chunks: without LLMs, without APIs, without cloud dependencies.

🌐 Author: Matthias Nordwig · programmiere.de

Most RAG chunkers split documents blindly by token count or character limit. This destroys document structure: headings, sections, and paragraphs get ripped apart. As a result, your vector search returns incoherent fragments with no context about where they came from.

pdf-struct-chunker solves this by analyzing the actual layout of a PDF: X/Y coordinates, font sizes, and bold detection. It detects where a heading starts, where a paragraph ends, and where a new section begins. Each chunk carries structured metadata (section, heading, page), so your RAG pipeline knows exactly what it's looking at.

No LLM needed. No API calls. Runs offline. Written in pure Rust.

❌ Standard RAG (fixed-size overlap)

    // Chunk 1
    "This regulation applies to all companies. § 2 De-"
    // Chunk 2
    "finitions. In this regulation, the following terms"

Result: Words are cut in half, and headings are disconnected from their content.

✅ pdf-struct-chunker (layout-aware)

    // Chunk 1
    {
      "metadata": { "section": "§ 1", "heading": "Scope" },
      "text": "This regulation applies to all companies."
    }
    // Chunk 2
    {
      "metadata": { "section": "§ 2", "heading": "Definitions" },
      "text": "In this regulation, the following terms..."
    }

Result: Clean, semantic chunks, each carrying its section and heading as metadata.
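To make the contrast above concrete, here is a minimal Rust sketch of heading-aware splitting. It is not the actual implementation of pdf-struct-chunker. The `Line` and `Chunk` types, the median-based estimate of the body font size, and the 1.15 size threshold are all assumptions chosen for illustration. The sketch assumes a layout extractor has already produced one record per text line with its font size and bold flag.

```rust
/// One text line as a layout extractor might report it (hypothetical type).
#[derive(Debug, Clone)]
struct Line {
    text: String,
    font_size: f32,
    bold: bool,
    page: u32,
}

/// A chunk together with the heading it belongs to (hypothetical type).
#[derive(Debug, PartialEq)]
struct Chunk {
    heading: Option<String>,
    page: u32,
    text: String,
}

/// Median font size, used here as an estimate of the body-text size.
fn body_font_size(lines: &[Line]) -> f32 {
    let mut sizes: Vec<f32> = lines.iter().map(|l| l.font_size).collect();
    sizes.sort_by(|a, b| a.total_cmp(b));
    sizes.get(sizes.len() / 2).copied().unwrap_or(0.0)
}

/// Assumed heuristic: a line is a heading if it is bold or clearly
/// larger than the body text (the factor 1.15 is arbitrary).
fn is_heading(line: &Line, body: f32) -> bool {
    line.bold || line.font_size > body * 1.15
}

/// Start a new chunk at every heading and attach the following lines to it.
fn chunk_by_headings(lines: &[Line]) -> Vec<Chunk> {
    let body = body_font_size(lines);
    let mut chunks: Vec<Chunk> = Vec::new();
    for line in lines {
        if is_heading(line, body) {
            chunks.push(Chunk {
                heading: Some(line.text.clone()),
                page: line.page,
                text: String::new(),
            });
        } else if let Some(chunk) = chunks.last_mut() {
            // Body line: append to the chunk opened by the last heading.
            if !chunk.text.is_empty() {
                chunk.text.push(' ');
            }
            chunk.text.push_str(&line.text);
        } else {
            // Text before the first heading gets a chunk without a heading.
            chunks.push(Chunk {
                heading: None,
                page: line.page,
                text: line.text.clone(),
            });
        }
    }
    chunks
}
```

Calling `chunk_by_headings` on the lines of the regulation example yields one chunk per section. Each heading is kept as metadata instead of being cut off from its text. A real chunker would also need the fragment merging and overflow splitting that this sketch leaves out.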
Built for speed and Edge-AI scenarios:

- No GPU required: pure CPU processing.
- Extremely fast: processes a 100-page PDF in < 1 second on a standard laptop.
- Low memory footprint: operates entirely in memory without creating temporary files.

A sample PDF is included, so you can try it immediately after cloning:

    git clone https://github.com/matthiasnordwig/pdf-struct-chunker.git
    cd pdf-struct-chunker
    cargo run --release -- -i fixtures/sample.pdf --format json --pretty

Output:

    {
      "index": 0,
      "char_start": 0,
      "char_end": 441,
      "text": "§ 1 Anwendungsbereich\nDiese Verordnung gilt für alle Unternehmen ...",
      "signature": "§ 1 Anwendungsbereich\nDiese Verordnung gilt für alle Unternehmen",
      "metadata": { "section": "§ 1", "heading": "Anwendungsbereich", "page": 2 }
    },
    {
      "index": 1,
      "text": "§ 2 Begriffsbestimmungen\nIm Sinne dieser Verordnung ...",
      "metadata": { "section": "§ 2", "heading": "Begriffsbestimmungen", "page": 2 }
    }

Every chunk knows its section, heading, and page number, so it is ready for embedding.

    PDF bytes ──► pdf_oxide             extract characters with X/Y positions + font sizes
                      │
                      ▼
                  Line Classification   match lines against your regex profiles,
                                        or fall back to font-size heuristics
                      │
                      ▼
                  Chunk Assembly        split at headings, merge small fragments,
                                        split overflow at sentence boundaries
                      │
                      ▼
                  Vec