{"slug": "markdown-comes-to-liteparse", "title": "Markdown Comes to Liteparse", "summary": "LlamaIndex released LiteParse v2.1, an open-source PDF-to-markdown pipeline that achieved top scores on three benchmarks against model-free approaches. The tool uses a heuristic rule-based approach with a custom PDFium fork to classify text into markdown elements, prioritizing speed over accuracy compared to AI-based parsers.", "body_md": "A few weeks ago, we launched LiteParse 2.0 as the fastest tool for converting PDFs to text. However, two questions kept coming up: Where are the benchmarks? Does it output markdown?\n\n[LiteParse v2.1](https://developers.llamaindex.ai/liteparse/) answers both by delivering the fastest open-source, model-free PDF-to-markdown pipeline. We measured performance on three standard benchmarks and achieved the top overall score on each when compared against other model-free approaches: opendataloader-bench at 0.875, olmOCR-bench at 0.391, and ParseBench at 0.3279.\n\n[Visit the demo site](https://www.llamaindex.ai/liteparse-demo) (running in-browser with WASM) or install the latest version today!\n\n``` bash\n$ pip install liteparse\n$ lit parse doc.pdf --format markdown\n```\n\n``` python\nfrom liteparse import LiteParse\n\nlp = LiteParse(output_format=\"markdown\")\nresult = lp.parse(\"doc.pdf\")\nprint(result.text)\n```\n\n## How Does it Work?\n\nBuilding a heuristic pipeline for markdown comes down to two parts: the signals you can detect, and the types of output elements that respond to those signals. As with any machine-learning model, this essentially boils down to inputs, weights, and activations!\n\nPDFs carry a ton of data: font family, font size, text location, and more. 
All of these are treated as input signals for classifying text into specific markdown elements like paragraphs, tables, lists, and headers.\n\nLiteParse uses a custom PDFium fork to capture as much signal as possible, then combines that with signals from our existing grid-projection algorithm to produce the best markdown output we can with a purely heuristic, rule-based approach.\n\nWe expect this mode to keep improving. There’s an extremely long tail of PDFs in the wild, and adapting the heuristics to more of them over time is the main way this mode gets better.\n\n## Measuring Markdown Performance\n\nIt turns out that markdown is not only a highly requested output option; it's also very hard to benchmark PDF parsing tools without it.\n\nThe existing benchmarks (ParseBench, olmOCR-bench, opendataloader-bench) are all built around scoring markdown output. By building this markdown pipeline, we delivered an entirely new output mode and gained a way to measure and improve our overall extraction quality.\n\nIn the spirit of “Lite”-ness, we built the markdown mode in LiteParse to be as light and fast as possible. This approach prioritizes speed, but it also accepts an upper bound on accuracy (we aren’t going to do better than LlamaParse with this approach).\n\nTo compare fairly, we scoped our comparisons to open-source tools that do not rely on larger AI models for parsing. This means OCR and other model integrations are disabled when benchmarking.\n\n## Benchmark Results\n\n### ParseBench\n\n[We’ve written a lot about ParseBench already](https://www.llamaindex.ai/blog/parsebench). It covers 2000+ documents measured across 5 key metrics that end users actually care about. These are intentionally hard documents, so for tools without larger AI models, these scores are quite impressive.\n\nLiteParse leads the Overall score. The Charts and Visual Grounding rows are effectively noise for every model-free tool here. 
ParseBench scores charts (and parts of its layout/visual-grounding metrics) by comparing structured data extracted from the chart, which fundamentally requires an ML model to recover. A heuristic engine has nothing to emit there, so all model-free tools cluster near zero. We're reporting those rows for completeness only.\n\n| Category | LiteParse | pymupdf4llm | opendataloader | pdf-inspector | markitdown |\n|---|---|---|---|---|---|\n| Overall | 0.328 | 0.310 | 0.294 | 0.266 | 0.186 |\n| Tables | 0.403 | 0.373 | 0.352 | 0.266 | 0.158 |\n| Content Faithfulness | 0.686 | 0.609 | 0.661 | 0.561 | 0.645 |\n| Semantic Formatting | 0.409 | 0.446 | 0.341 | 0.351 | 0.009 |\n| Charts* | 0.034 | 0.015 | 0.001 | 0.053 | 0.020 |\n| Visual Grounding* | 0.107 | 0.107 | 0.108 | 0.099 | 0.099 |\n\n\\* These numbers are mostly noise; none of the tools here output the data needed to benchmark properly on these metrics.\n\n### opendataloader-bench\n\n[opendataloader-bench is a small benchmark of 200 docs](https://github.com/opendataloader-project/opendataloader-bench). It measures three main things: Reading Order Similarity (NID), Table Structure Similarity (TEDS), and Heading-Level Similarity (MHS). You can read more about these metrics in its GitHub repo.\n\nHere, LiteParse leads across all categories. The official repo also reports scores from actual AI models, and LiteParse is quite competitive there as well, but for this blog post we are only comparing against similar model-free OSS tools.\n\n| Category | LiteParse | pymupdf4llm | opendataloader | pdf-inspector | markitdown |\n|---|---|---|---|---|---|\n| Overall | 0.871 | 0.732 | 0.831 | 0.792 | 0.589 |\n| NID (Reading Order) | 0.908 | 0.885 | 0.902 | 0.876 | 0.844 |\n| TEDS (Tables) | 0.693 | 0.401 | 0.483 | 0.630 | 0.273 |\n| MHS (Headers) | 0.816 | 0.412 | 0.739 | 0.602 | 0.000 |\n\n### olmOCR-bench\n\nLiteParse leads in most categories in olmOCR-bench. 
Some of its rule checks don’t always reflect desired output, and sometimes disagree with each other ([which we’ve written about before](https://www.llamaindex.ai/blog/olmocr-bench-review-insights-and-pitfalls-on-an-ocr-benchmark)), but the benchmark is a useful signal nonetheless.\n\nLiteParse scores well on the baseline sanity checks and makes a strong showing on the headers/footers, multi-column, and table tests. Low scores on old scans and math are expected, as these typically require OCR. The rest of the scores are close to those of the other tools.\n\n| Category | LiteParse | pymupdf4llm | opendataloader | pdf-inspector | markitdown |\n|---|---|---|---|---|---|\n| Overall | 39.2% | 32.9% | 32.7% | 30.5% | 28.7% |\n| baseline | 99.9% | 84.5% | 86.9% | 82.9% | 86.8% |\n| headers_footers | 55.9% | 39.5% | 37.5% | 52.0% | 38.8% |\n| multi_column | 67.1% | 66.7% | 62.8% | 38.2% | 39.3% |\n| table_tests | 48.0% | 46.3% | 25.7% | 40.0% | 19.9% |\n| long_tiny_text | 29.2% | 12.7% | 35.1% | 17.4% | 31.2% |\n| old_scans | 13.3% | 13.3% | 13.3% | 13.3% | 13.3% |\n| arxiv_math | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |\n| old_scans_math | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |\n\n### Speed Tests\n\nSpeed was measured on a fixed set of PDFs with varying layouts, page counts, and content types. The times reported are the average time taken to process a single page across the entire test set. 
You can find [the source data here](https://huggingface.co/datasets/llamaindex/liteparse_bench_small/tree/main/bench-docs) and [the benchmark code here](https://github.com/run-llama/liteparse/tree/main/dataset_eval_utils).\n\n| Provider | ms/page (agg) |\n|---|---|\n| liteparse | 3.16 ms |\n| pdf-inspector | 3.83 ms |\n| opendataloader | 66.3 ms |\n| pymupdf4llm-md | 141.5 ms |\n| markitdown | 182.5 ms |\n\n### Licensing & Portability\n\nThe tools tested come with a mix of licenses and supported runtimes.\n\nLiteParse is permissively licensed (Apache-2.0) and runs as a single engine across four ecosystems, including natively in the browser via WASM. The Python-only tools can't go where a browser or a Node service needs them, and pymupdf4llm inherits PyMuPDF's AGPL-3.0 copyleft, which is a non-starter for many commercial codebases without a paid license.\n\n| Tool | License | Languages / Runtimes |\n|---|---|---|\n| LiteParse | Apache-2.0 | Rust, Python, Node, WASM (browser) |\n| pymupdf4llm | AGPL-3.0 (commercial available) | Python |\n| markitdown | MIT | Python |\n| opendataloader | Apache-2.0 | Java core (+ Python, Node.js wrappers) |\n| pdf-inspector | MIT | Rust |\n\n## A Note on v2.1 Scope\n\nThese three benchmarks don't always agree on what \"good\" markdown looks like. We repeatedly found that tuning output to win one benchmark (e.g. olmOCR-bench) would regress another (e.g. ParseBench), and vice versa. When inspecting the output by eye, we often saw results that “scored well” but did not look good. Rather than benchmaxxing any single harness, we kept v2.1 tuned for solid, balanced performance across all three. 
There's plenty of headroom to push individual sub-categories over time (and we will!).\n\n## Try it Today!\n\nLiteParse runs everywhere and v2.1 is available now:\n\n``` bash\n# Node Library + CLI\nnpm i @llamaindex/liteparse\n\n# Python Library + CLI\npip install liteparse\n\n# Rust Library + CLI\ncargo install liteparse\n\n# WASM Library\nnpm i @llamaindex/liteparse-wasm\n```\n\nOr, use it with your favourite coding agent directly as a skill:\n\n``` bash\n# Claude Code, Codex, OpenCode, etc.\nnpx skills add run-llama/llamaparse-agent-skills --skill liteparse\n\n# Pi Coding Agent Extension\npi install npm:@llamaindex/liteparse-pi-extension@latest\n```\n\nFollow these links for docs and details on source code:", "url": "https://wpnews.pro/news/markdown-comes-to-liteparse", "canonical_source": "https://www.llamaindex.ai/blog/markdown-comes-to-liteparse", "published_at": "2026-06-19 05:33:10+00:00", "updated_at": "2026-06-19 06:01:16.915382+00:00", "lang": "en", "topics": ["developer-tools", "ai-tools", "natural-language-processing"], "entities": ["LlamaIndex", "LiteParse", "PDFium", "ParseBench", "olmOCR-bench", "opendataloader-bench"], "alternates": {"html": "https://wpnews.pro/news/markdown-comes-to-liteparse", "markdown": "https://wpnews.pro/news/markdown-comes-to-liteparse.md", "text": "https://wpnews.pro/news/markdown-comes-to-liteparse.txt", "jsonld": "https://wpnews.pro/news/markdown-comes-to-liteparse.jsonld"}}