{"slug": "llm-free-layout-aware-pdf-chunker-in-pure-rust", "title": "LLM-free, layout-aware PDF chunker in pure Rust", "summary": "Developer Matthias Nordwig released pdf-struct-chunker, an open-source PDF chunker written in pure Rust that splits documents into semantically meaningful chunks by analyzing layout features like X/Y coordinates, font sizes, and bold detection, without relying on LLMs or cloud APIs. The tool aims to improve RAG pipelines by preserving document structure and metadata, processing a 100-page PDF in under a second on a standard laptop.", "body_md": "**Split PDFs into semantically meaningful chunks — without LLMs, without APIs, without cloud dependencies.**\n\nAuthor: Matthias Nordwig · programmiere.de\n\nMost RAG chunkers blindly split documents by token count or character limit. This destroys document structure — headings, sections, and paragraphs get ripped apart. The result: your vector search returns incoherent fragments with no context about where they came from.\n\n**pdf-struct-chunker** solves this by analyzing the actual *layout* of a PDF: X/Y coordinates, font sizes, and bold detection. It understands where a heading starts, where a paragraph ends, and where a new section begins. Each chunk carries structured metadata (`section`, `heading`, `page`) so your RAG pipeline knows exactly what it's looking at.\n\nNo LLM needed. No API calls. Runs offline. Written in pure Rust.\n\n**❌ Standard RAG (Fixed-size overlap)**\n\n```\n// Chunk 1\n\"This regulation applies to all companies. § 2 De-\"\n\n// Chunk 2\n\"finitions. 
In this regulation, the following terms\"\n```\n\n*Result: Words are cut in half, headings are disconnected from their content.*\n\n**✅ pdf-struct-chunker (Layout-aware)**\n\n```\n// Chunk 1\n{\n  \"metadata\": { \"section\": \"§ 1\", \"heading\": \"Scope\" },\n  \"text\": \"This regulation applies to all companies.\"\n}\n// Chunk 2\n{\n  \"metadata\": { \"section\": \"§ 2\", \"heading\": \"Definitions\" },\n  \"text\": \"In this regulation, the following terms...\"\n}\n```\n\n*Result: Clean, semantic chunks with perfect metadata.*\n\nBuilt for speed and Edge-AI scenarios:\n\n- **No GPU required** (pure CPU processing)\n- **Extremely fast**: Processes a 100-page PDF in `< 1 second` on a standard laptop.\n- **Low memory footprint**: Operates entirely in-memory without creating temporary files.\n\nA sample PDF is included — you can try it immediately after cloning:\n\n```\ngit clone https://github.com/matthiasnordwig/pdf-struct-chunker.git\ncd pdf-struct-chunker\ncargo run --release -- -i fixtures/sample.pdf --format json --pretty\n```\n\nOutput:\n\n```\n[\n  {\n    \"index\": 0,\n    \"char_start\": 0,\n    \"char_end\": 441,\n    \"text\": \"§ 1 Anwendungsbereich\\nDiese Verordnung gilt für alle Unternehmen ...\",\n    \"signature\": \"§ 1 Anwendungsbereich\\nDiese Verordnung gilt für alle Unternehmen\",\n    \"metadata\": {\n      \"section\": \"§ 1\",\n      \"heading\": \"Anwendungsbereich\",\n      \"page\": 2\n    }\n  },\n  {\n    \"index\": 1,\n    \"text\": \"§ 2 Begriffsbestimmungen\\nIm Sinne dieser Verordnung ...\",\n    \"metadata\": {\n      \"section\": \"§ 2\",\n      \"heading\": \"Begriffsbestimmungen\",\n      \"page\": 2\n    }\n  }\n]\n```\n\nEvery chunk knows its section, heading, and page number — ready for embedding.\n\n```\nPDF bytes ──► pdf_oxide (extract characters with X/Y positions + font sizes)\n                │\n                ▼\n          Line Classification\n          (match lines against your regex profiles, or fall back to font-size 
heuristics)\n                │\n                ▼\n          Chunk Assembly\n          (split at headings, merge small fragments, split overflow at sentence boundaries)\n                │\n                ▼\n          Vec<Chunk> { text, section, heading, page }\n```\n\nThe chunker processes each PDF page by extracting character-level bounding boxes, reconstructing lines from Y-coordinates, classifying them using configurable regex patterns (or font-size heuristics as fallback), and assembling them into semantically coherent chunks with structural metadata.\n\nTo build the CLI from source:\n\n```\ngit clone https://github.com/matthiasnordwig/pdf-struct-chunker.git\ncd pdf-struct-chunker\ncargo build --release\n```\n\nAdd this to your `Cargo.toml`:\n\n```\n[dependencies]\npdf-struct-chunker = { git = \"https://github.com/matthiasnordwig/pdf-struct-chunker\" }\n```\n\nCLI usage:\n\n```\npdf-struct-chunker [OPTIONS] --input <INPUT>\n```\n\n| Flag | Description | Default |\n|---|---|---|\n| `-i, --input <FILE>` | Path to the input PDF file | Required |\n| `-p, --profile <FILE>` | Path to a JSON profile with custom regex rules (see below) | Built-in defaults |\n| `-o, --output <FILE>` | Output file path | `stdout` |\n| `--format <FORMAT>` | Output format: `jsonl` or `json` | `jsonl` |\n| `--pretty` | Pretty-print JSON output | `false` |\n| `--stats` | Print chunk statistics instead of the chunks themselves | `false` |\n\n```\n# Chunk a PDF and save as JSONL\npdf-struct-chunker -i document.pdf -o result.jsonl\n\n# Pretty-print JSON to the console\npdf-struct-chunker -i document.pdf --format json --pretty\n\n# See how many chunks were created and their sizes\npdf-struct-chunker -i document.pdf --stats\n\n# Use your own regex rules\npdf-struct-chunker -i document.pdf -p my_rules.json --format json --pretty\n```\n\nThe core function operates entirely in-memory — no file I/O, no temp files. 
Feed it bytes from anywhere (file, HTTP upload, S3, database) and get chunks back:\n\n```rust\nuse pdf_struct_chunker::chunk_pdf;\n\nfn main() {\n    // Read the PDF into memory; the bytes could equally come from an upload or object store.\n    let bytes = std::fs::read(\"document.pdf\").unwrap();\n\n    // No custom profile is passed, so the built-in heuristics are used.\n    let chunks = chunk_pdf(&bytes, None).unwrap();\n\n    // Print section, heading, and page for every chunk.\n    for chunk in &chunks {\n        println!(\"[{}] {} (p.{})\",\n            chunk.metadata.section,\n            chunk.metadata.heading,\n            chunk.metadata.page.unwrap_or(0),\n        );\n    }\n}\n```\n\nBy default, the chunker uses built-in heuristics optimized for legal and regulatory documents (detecting `§`, `Article`, `Chapter`, etc.). You can override this with your own regex rules.\n\nCreate a `.json` file (e.g., `my_rules.json`) and pass it via `--profile`:\n\n```\npdf-struct-chunker -i document.pdf -p my_rules.json\n```\n\nThe simplest profile just removes unwanted lines:\n\n```\n{\n  \"patterns\": [\n    {\n      \"role\": \"ignore\",\n      \"regex\": \"Page \\\\d+\",\n      \"flags\": \"i\",\n      \"priority\": 100\n    }\n  ]\n}\n```\n\nThis removes every line matching \"Page 1\", \"Page 2\", etc. from the output.\n\nA fuller profile sets chunk-size limits and combines several roles:\n\n```\n{\n  \"min_chunk_chars\": 200,\n  \"max_chunk_chars\": 1500,\n  \"patterns\": [\n    {\n      \"role\": \"ignore\",\n      \"regex\": \"(?:Page|Footer text)\",\n      \"flags\": \"i\",\n      \"priority\": 200\n    },\n    {\n      \"role\": \"heading_l1\",\n      \"regex\": \"^((?:Chapter|Section)\\\\s*[\\\\d]+)\\\\s*(.*)\",\n      \"flags\": \"i\",\n      \"priority\": 100\n    },\n    {\n      \"role\": \"definition\",\n      \"regex\": \"\\\\b(?:means|shall mean|is defined as)\",\n      \"flags\": \"i\",\n      \"priority\": 50\n    }\n  ]\n}\n```\n\n| Role | What it does |\n|---|---|\n| `heading_l1` | Starts a new chunk. Regex capture group 1 becomes `metadata.section` (e.g., \"Chapter 3\"), group 2 becomes `metadata.heading` (e.g., \"Data Protection\"). |\n| `definition` | Triggers a soft split. 
If the current chunk has already reached `min_chunk_chars`, the chunker flushes it and starts a new one. |\n| `ignore` | Removes the line entirely. Use this for page numbers, footers, headers, or any boilerplate you don't want in your chunks. |\n\n| Field | Description | Default |\n|---|---|---|\n| `min_chunk_chars` | Minimum chunk size before a \"soft\" split (at definitions or list items) is allowed | `200` |\n| `max_chunk_chars` | Maximum chunk size — forces a split at the nearest sentence boundary | `1500` |\n| `patterns[].regex` | Regular expression matched against each text line | — |\n| `patterns[].role` | One of: `heading_l1`, `definition`, `ignore` | — |\n| `patterns[].flags` | `\"i\"` = case-insensitive, `\"m\"` = multiline | `\"\"` |\n| `patterns[].priority` | Higher value = evaluated first when multiple patterns match the same line | `0` |\n\nIf you have any questions, feature requests, or just want to say hi, feel free to open an issue or reach out via my website.\n\nIf this tool saved you time and you'd like to support its development, you can [buy me a coffee via PayPal](https://www.paypal.me/MatthiasNordwig). ☕\n\nMIT © Matthias Nordwig", "url": "https://wpnews.pro/news/llm-free-layout-aware-pdf-chunker-in-pure-rust", "canonical_source": "https://github.com/matthiasnordwig/pdf-struct-chunker", "published_at": "2026-06-29 09:06:32+00:00", "updated_at": "2026-06-29 09:28:34.848035+00:00", "lang": "en", "topics": ["ai-tools", "developer-tools", "natural-language-processing"], "entities": ["Matthias Nordwig", "pdf-struct-chunker", "Rust", "pdf_oxide"], "alternates": {"html": "https://wpnews.pro/news/llm-free-layout-aware-pdf-chunker-in-pure-rust", "markdown": "https://wpnews.pro/news/llm-free-layout-aware-pdf-chunker-in-pure-rust.md", "text": "https://wpnews.pro/news/llm-free-layout-aware-pdf-chunker-in-pure-rust.txt", "jsonld": "https://wpnews.pro/news/llm-free-layout-aware-pdf-chunker-in-pure-rust.jsonld"}}