# Baidu's Unlimited OCR: Ditching the Split-and-Stitch Document Pipeline

Baidu released Unlimited OCR, an open-source 3-billion-parameter model that uses Reference Sliding Window Attention to parse multi-page documents in a single pass, eliminating the need for page-by-page OCR pipelines. The model maintains a constant KV cache during decoding, enabling efficient long-document parsing on commodity hardware.

*By decoupling reference attention from generation history, Unlimited OCR makes single-pass multi-page document parsing practical on commodity hardware.*

By Priya Nair

Anyone who has built a document ingestion pipeline knows the standard, painful ritual. To parse a fifty-page PDF, you render each page to an image, run an OCR model on each image individually, stitch the resulting text blocks back together, and attempt to reconstruct the layout before feeding it to a vector database or a retrieval-augmented generation (RAG) system.

This page-by-page loop is a fragile engineering workaround. It destroys document-level context, struggles with tables that span pages, and relies on external orchestration scripts to keep track of the document's state.

Baidu's open-source release of [Unlimited OCR](https://github.com/baidu/Unlimited-OCR) aims to replace this fragmented workflow. Built as an evolution of DeepSeek-OCR, this 3-billion-parameter model introduces an attention mechanism called Reference Sliding Window Attention (R-SWA). By decoupling the visual reference tokens from the generated output history, the model can parse dozens of pages in a single pass while keeping the key-value (KV) cache at a constant size during decoding.
Released under the MIT license, with open weights available on [Hugging Face](https://huggingface.co) and [ModelScope](https://www.modelscope.cn), it represents a practical shift in how we handle long-horizon document parsing.

## The Architectural Shift: How R-SWA Solves the KV Cache Bottleneck

In standard autoregressive transformer decoders, the memory footprint of the KV cache, and the cost of attending over it, scale linearly with the length of the generated sequence. As the model outputs more text, the KV cache expands, consuming valuable VRAM and progressively slowing down token generation. For long-horizon tasks like transcribing a multi-page document, this scaling bottleneck quickly becomes prohibitive.

Vanilla Sliding Window Attention (SWA) and linear attention mechanisms can limit memory growth, but they break down on OCR tasks. With vanilla SWA, the visual tokens at the start of the sequence fall out of the window once enough text has been generated, so the decoder loses sight of the page it is transcribing. Linear attention models rely on recurrent state updates that progressively blur visual features, which rapidly degrades character recognition accuracy over long sequences.

```mermaid
flowchart TD
    A[Input Document Pages] --> B[Visual Encoder]
    B --> C["Reference Tokens: Visual + Prompt"]
    C --> D[R-SWA Decoder]
    E[Generated Tokens] -->|Sliding Window: Default 128| D
    D --> F[Next Token Generation]
```

R-SWA bypasses this trade-off by splitting the attention mechanism into two distinct pathways:

- **Global Reference Attention**: The model maintains full, uncompromised attention over all reference tokens (the visual tokens produced by the image encoder plus the initial prompt) throughout the entire generation process.
- **Local Causal Attention**: For the generated output tokens, the model limits its attention to a sliding window of the preceding $n$ tokens (128 by default).

Because the visual reference tokens remain static after the initial prefill phase, and attention over the output history is capped at a fixed window, the KV cache stays at a fixed size for the entire decoding phase.
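To make the bookkeeping concrete, here is a minimal counting sketch of the two pathways. It is not the model's implementation: the function names are hypothetical, and it only counts which cached positions a newly generated token could attend to, assuming the reference tokens occupy positions `0` to `num_ref - 1` and the generated tokens follow them.

```python
def full_cache_entries(num_ref: int, generated: int) -> int:
    """Full attention: every reference and generated token stays cached."""
    return num_ref + generated


def rswa_cache_entries(num_ref: int, generated: int, window: int = 128) -> int:
    """R-SWA as described above: all reference tokens plus at most
    `window` of the most recently generated tokens."""
    return num_ref + min(generated, window)


def rswa_visible_positions(num_ref: int, generated: int, window: int = 128) -> list[int]:
    """Positions the next token may attend to (illustrative layout only)."""
    total = num_ref + generated
    # The window never reaches back into the reference region.
    window_start = max(num_ref, total - window)
    return list(range(num_ref)) + list(range(window_start, total))


# Toy case: 3 reference tokens, 5 generated tokens, window of 2.
print(rswa_visible_positions(3, 5, window=2))  # [0, 1, 2, 6, 7]

# 1,000 reference tokens and 5,000 generated tokens.
print(full_cache_entries(1000, 5000))  # 6000
print(rswa_cache_entries(1000, 5000))  # 1128
```

Under these assumptions the entry count stops growing once the first `window` tokens have been generated, while the full-attention count keeps climbing with every new token.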
A fixed-size cache keeps generation speed from degrading as the output grows, much as a human copyist glances back at the source document while keeping only the immediate textual context in short-term memory.

## Developer Angle: Running Unlimited OCR in Production

At 3 billion parameters, the model's weights occupy roughly 6 GB on disk in BF16 format. This small footprint means you do not need an enterprise-grade cluster; a single mid-range NVIDIA GPU with 16 GB or 24 GB of VRAM is enough to host the model.

There are two primary ways to run Unlimited OCR: direct inference via Hugging Face Transformers, or high-throughput serving using the [SGLang](https://github.com/sgl-project/sglang) runtime. For production pipelines, the SGLang path is the better fit, as it supports FlashAttention-3, custom logit processors, and an OpenAI-compatible streaming API.

To set up the environment, you can use a `uv`-managed virtual environment to install the required dependencies, including [PyMuPDF](https://pymupdf.io) for handling PDF-to-image conversions:

```bash
uv venv --python 3.12
source .venv/bin/activate
uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7
uv pip install pymupdf==1.27.2.2
```

You can then launch the SGLang server with the following configuration, which pins the context length to 32,768 tokens and enables the custom logit processor:

```bash
python -m sglang.launch_server \
  --model baidu/Unlimited-OCR \
  --served-model-name Unlimited-OCR \
  --attention-backend fa3 \
  --page-size 1 \
  --mem-fraction-static 0.8 \
  --context-length 32768 \
  --enable-custom-logit-processor \
  --disable-overlap-schedule \
  --skip-server-warmup \
  --host 0.0.0.0 \
  --port 10000
```

When parsing long documents, transformer decoders often fall into repetitive loops.
To mitigate this, the Unlimited OCR repository includes a specialized logit processor. Below is a Python client snippet demonstrating how to send a multi-page PDF to the server and stream the response while applying this repetition defense:

```python
import base64
import json
import os
import tempfile

import fitz  # PyMuPDF
import requests
from sglang.srt.sampling.custom_logit_processor import (
    DeepseekOCRNoRepeatNGramLogitProcessor,
)

server_url = "http://127.0.0.1:10000"
session = requests.Session()


def pdf_to_images(pdf_path, dpi=300):
    """Render every page of the PDF to a PNG file and return the file paths."""
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
    mat = fitz.Matrix(dpi / 72, dpi / 72)  # PDF units are 72 per inch
    image_paths = []
    for i, page in enumerate(doc):
        image_path = os.path.join(tmp_dir, f"page_{i + 1:04d}.png")
        page.get_pixmap(matrix=mat).save(image_path)
        image_paths.append(image_path)
    doc.close()
    return image_paths


def encode_image(image_path):
    """Wrap a PNG file as a base64 data URL in an image_url content part."""
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{data}"}}


def parse_document(pdf_path):
    """Send all pages in one request and return the streaming response."""
    images = pdf_to_images(pdf_path)
    content = [{"type": "text", "text": "Multi page parsing."}] + [
        encode_image(img) for img in images
    ]
    payload = {
        "model": "Unlimited-OCR",
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": "base"},
        # Serialized processor that blocks repeated n-grams during decoding
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {
            "ngram_size": 35,
            "window_size": 1024,
        },
        "stream": True,
    }
    response = session.post(
        f"{server_url}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        stream=True,
    )
    response.raise_for_status()
    return response
```

## Trade-offs and Reality Checks

Unlimited OCR scores 93% on the OmniDocBench v1.5 benchmark, a 6% improvement over the DeepSeek-OCR baseline. However, developers should not mistake "unlimited" for infinite capacity.
While the R-SWA mechanism ensures that the KV cache does not grow during the generation phase, the initial prefill phase is still bound by physical hardware limits. Encoding dozens of high-resolution pages simultaneously generates a massive number of visual tokens, and these visual tokens must fit within the model's 32,768-token context window.

If you attempt to feed a 500-page technical manual into the model in a single pass, you will still overflow the context window during the prefill stage. The practical sweet spot for this model is documents ranging from 10 to 50 pages. For anything larger, you will still need to chunk the document, though you will be chunking by chapters or large sections rather than individual pages.

Additionally, the model operates in two distinct modes: `gundam` and `base`. The `gundam` mode is optimized for single-image analysis, using a 1024-pixel base resolution with dynamic cropping to capture fine details. The `base` mode is designed for multi-page and PDF parsing, maintaining a standard 1024-pixel resolution without cropping. When processing complex, multi-column PDFs with tiny fonts in `base` mode, the lack of dynamic cropping can occasionally lead to missed characters compared to specialized single-page parsers.

## The Verdict

Baidu's Unlimited OCR is a highly practical release that addresses a real-world bottleneck in document processing. By replacing standard full attention with R-SWA, it shows that we do not need to brute-force context windows or accept steadily degrading decoding speed to parse long documents.

For teams building RAG pipelines, legal tech applications, or financial analysis tools, this model offers a way to simplify ingestion infrastructure. It allows you to throw away complex page-stitching scripts and replace them with a single, high-throughput model call that preserves document layout and structural flow.
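The chunk-by-section advice from the trade-offs discussion comes down to a token budget. The sketch below is a rough planner, not part of the Unlimited OCR repository: `plan_chunks`, `tokens_per_page`, and `reserved_tokens` are hypothetical names, and the 640 tokens per page used in the example is a placeholder rather than a measured figure. It also assumes that the prompt and the generated output need some headroom inside the same 32,768-token window.

```python
def plan_chunks(
    num_pages: int,
    tokens_per_page: int,
    context_length: int = 32768,
    reserved_tokens: int = 2048,
) -> list[tuple[int, int]]:
    """Split a document into half-open page ranges [start, end) whose
    visual tokens fit in the context window.

    tokens_per_page and reserved_tokens are assumptions to be measured
    for your own documents; they are not values published for the model.
    """
    budget = context_length - reserved_tokens
    if tokens_per_page <= 0 or budget < tokens_per_page:
        raise ValueError("a single page does not fit in the token budget")
    pages_per_chunk = budget // tokens_per_page
    return [
        (start, min(start + pages_per_chunk, num_pages))
        for start in range(0, num_pages, pages_per_chunk)
    ]


# A 500-page manual at a placeholder 640 visual tokens per page.
chunks = plan_chunks(500, 640)
print(len(chunks))   # 11
print(chunks[0])     # (0, 48)
print(chunks[-1])    # (480, 500)
```

The planner only gives an upper bound on pages per call. In practice you would likely pull each boundary back to the nearest chapter or section break so that tables and paragraphs are not cut mid-flow.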
## Sources & further reading

- [Unlimited OCR: One-shot long-horizon parsing](https://github.com/baidu/Unlimited-OCR) (github.com)
- [Unlimited OCR Works Welcome the Era of One-shot Long-horizon Parsing](https://arxiv.org/html/2606.23050v1) (arxiv.org)
- [Baidu Releases Unlimited OCR for One-shot Long-horizon Parsing | Trending Stories | HyperAI](https://hyper.ai/en/stories/896c32c8cf649dc179e249b10e47d840) (hyper.ai)
- [Unlimited-OCR by Baidu: Open Source OCR for Long PDFs](https://pasqualepillitteri.it/en/news/6063/unlimited-ocr-baidu-long-pdfs) (pasqualepillitteri.it)

Priya Nair (https://www.devclubhouse.com/u/priya nair) · AI & Developer Experience Writer

Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.