By decoupling reference attention from generation history, Unlimited OCR makes single-pass multi-page document parsing practical on commodity hardware.
Anyone who has built a document ingestion pipeline knows the standard, painful ritual. To parse a fifty-page PDF, you render each page to an image, run an OCR model on each image individually, stitch the resulting text blocks back together, and attempt to reconstruct the layout before feeding it to a vector database or a retrieval-augmented generation (RAG) system.
This page-by-page loop is a fragile engineering workaround. It destroys document-level context, struggles with tables that span pages, and relies on external orchestration scripts to keep track of the document's state.
Baidu's open-source release of Unlimited OCR aims to replace this fragmented workflow. Built as an evolution of DeepSeek-OCR, this 3-billion-parameter model introduces an attention mechanism called Reference Sliding Window Attention (R-SWA). By decoupling the visual reference tokens from the generated output history, the model can parse dozens of pages in a single pass while keeping the key-value (KV) cache at a constant size during decoding. Released under the MIT license, with open weights available on Hugging Face and ModelScope, it represents a practical shift in how we handle long-horizon document parsing.
The Architectural Shift: How R-SWA Solves the KV Cache Bottleneck #
In standard autoregressive transformer decoders, the computational cost and memory footprint of the KV cache scale linearly with the length of the generated sequence. As the model outputs more text, the KV cache expands, consuming valuable VRAM and progressively slowing down token generation.
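To make the linear growth concrete, here is a back-of-the-envelope sketch for a standard full-attention decoder. The layer count, KV head count, and head dimension are illustrative placeholders, not Unlimited OCR's published configuration; the 2-byte element size corresponds to BF16.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `seq_len` tokens.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


# Hypothetical decoder shape, chosen only for illustration.
ASSUMED_SHAPE = dict(num_layers=24, num_kv_heads=8, head_dim=128)

for tokens in (1_000, 10_000, 30_000):
    mib = kv_cache_bytes(tokens, **ASSUMED_SHAPE) / 2**20
    print(f"{tokens:>6} tokens -> {mib:8.1f} MiB of KV cache")
```

Whatever the real shape is, the cache size is a constant multiplied by the sequence length, so doubling the transcript doubles the memory.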
For long-horizon tasks like transcribing a multi-page document, this scaling bottleneck quickly becomes prohibitive. While vanilla Sliding Window Attention (SWA) or linear attention mechanisms can limit memory growth, they fail in OCR tasks. With vanilla SWA, direct attention to the visual tokens is lost once generation moves past the window, so the decoder can no longer look at the page it is transcribing. Linear attention models rely on recurrent state updates that progressively blur visual features, which rapidly degrades character recognition accuracy over long sequences.
flowchart TD
A[Input Document Pages] --> B[Visual Encoder]
B --> C[Reference Tokens: Visual + Prompt]
C --> D[R-SWA Decoder]
E[Generated Tokens] -->|Sliding Window: Default 128| D
D --> F[Next Token Generation]
R-SWA bypasses this trade-off by splitting the attention mechanism into two distinct pathways:
Global Reference Attention: The model maintains full, uncompromised attention over all reference tokens (the visual tokens generated by the image encoder and the initial prompt) throughout the entire generation process.
Local Causal Attention: For the generated output tokens, the model limits its attention to a sliding window of the preceding $n$ tokens (defaulting to 128).
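The two pathways can be pictured as a single attention mask. The following is a minimal sketch in plain Python, not the model's actual kernel; the function name is hypothetical, and whether the current token counts toward the window is an assumption made here for simplicity.

```python
def rswa_mask(num_ref, num_gen, window):
    """Build a boolean mask with num_gen rows and num_ref + num_gen columns.

    mask[i][j] is True when generated token i may attend to position j.
    Columns 0..num_ref-1 are reference tokens (visual tokens and prompt);
    the remaining columns are generated tokens.
    """
    mask = []
    for i in range(num_gen):
        # Global reference attention: every reference token is always visible.
        row = [True] * num_ref
        for j in range(num_gen):
            # Local causal attention: only the current token and the tokens
            # just before it, up to `window` positions in total (assumption).
            row.append(0 <= i - j < window)
        mask.append(row)
    return mask


# Small example: 3 reference tokens, 5 generated tokens, window of 2.
for row in rswa_mask(num_ref=3, num_gen=5, window=2):
    print("".join("x" if visible else "." for visible in row))
```

Each printed row keeps every reference column marked, while the marked generated columns form a narrow band that slides to the right as decoding advances.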
Because the visual reference tokens remain static after the initial prefill phase, and the output history attention is capped at a fixed window, the size of the KV cache stays constant throughout decoding. This keeps generation speed from degrading, much as a human copyist glances back at the source document while holding only the immediate textual context in short-term memory.
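A toy model of that bounded cache is sketched below. It assumes a simple "fixed reference entries plus a rolling buffer" layout; the class name is hypothetical and a real runtime would store tensors rather than Python objects, possibly pre-allocating the window up front.

```python
from collections import deque


class RollingKVCache:
    """Toy cache: fixed reference entries plus a bounded window of generated entries."""

    def __init__(self, reference_entries, window):
        self.reference = list(reference_entries)  # written once during prefill
        self.generated = deque(maxlen=window)     # oldest entry is evicted automatically

    def append(self, entry):
        """Record the cache entry for one newly generated token."""
        self.generated.append(entry)

    def __len__(self):
        return len(self.reference) + len(self.generated)


# 1,000 reference entries and the article's default window of 128.
cache = RollingKVCache(reference_entries=range(1000), window=128)
sizes = []
for step in range(500):
    cache.append(step)
    sizes.append(len(cache))

# The size grows only while the window fills, then stays flat.
print(sizes[0], sizes[127], sizes[499])
```

In this toy version the total never exceeds the reference count plus the window, no matter how many tokens are generated.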
Developer Angle: Running Unlimited OCR in Production #
At 3 billion parameters, the model's weights occupy roughly 6 GB on disk in BF16 format. This small footprint means you do not need an enterprise-grade cluster; a single mid-range NVIDIA GPU with 16 GB or 24 GB of VRAM is more than enough to host the model.
There are two primary ways to run Unlimited OCR: direct inference via Hugging Face Transformers, or high-throughput serving using the SGLang runtime. For production pipelines, the SGLang deployment path is highly recommended as it supports FlashAttention-3, custom logit processors, and an OpenAI-compatible streaming API.
To set up the environment, you can use a uv-managed virtual environment to install the required dependencies, including PyMuPDF for handling PDF-to-image conversions:
uv venv --python 3.12
source .venv/bin/activate
uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7
uv pip install pymupdf==1.27.2.2
You can then launch the SGLang server with the following configuration, which pins the context length to 32,768 tokens and enables the custom logit processor:
python -m sglang.launch_server \
--model baidu/Unlimited-OCR \
--served-model-name Unlimited-OCR \
--attention-backend fa3 \
--page-size 1 \
--mem-fraction-static 0.8 \
--context-length 32768 \
--enable-custom-logit-processor \
--disable-overlap-schedule \
--skip-server-warmup \
--host 0.0.0.0 \
--port 10000
When parsing long documents, transformer decoders often fall into repetitive loops. To mitigate this, the Unlimited OCR repository includes a specialized logit processor. Below is a Python client snippet demonstrating how to stream a multi-page PDF to the server while applying this repetition defense:
import base64
import json
import os
import tempfile

import fitz  # PyMuPDF
import requests
from sglang.srt.sampling.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor

server_url = "http://127.0.0.1:10000"
session = requests.Session()


def pdf_to_images(pdf_path, dpi=300):
    """Render every page of the PDF to a PNG file and return the file paths."""
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
    # PDF user space is 72 points per inch, so dpi / 72 is the zoom factor.
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    image_paths = []
    for i, page in enumerate(doc):
        image_path = os.path.join(tmp_dir, f"page_{i + 1:04d}.png")
        page.get_pixmap(matrix=mat).save(image_path)
        image_paths.append(image_path)
    doc.close()
    return image_paths


def encode_image(image_path):
    """Wrap a PNG file as a base64 data-URL content part."""
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{data}"}}


def parse_document(pdf_path):
    """Send all pages in one request and return the streaming HTTP response."""
    images = pdf_to_images(pdf_path)
    # One text instruction followed by every page image, in page order.
    content = [{"type": "text", "text": "Multi page parsing."}] + [encode_image(img) for img in images]
    payload = {
        "model": "Unlimited-OCR",
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": "base"},
        # Repetition defense: serialized logit processor plus its parameters.
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {
            "ngram_size": 35,
            "window_size": 1024,
        },
        "stream": True,
    }
    response = session.post(
        f"{server_url}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        stream=True,
    )
    response.raise_for_status()
    return response
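The `ngram_size` and `window_size` parameters in the payload suggest a no-repeat n-gram filter. The sketch below illustrates that general technique in plain Python. It is not the repository's `DeepseekOCRNoRepeatNGramLogitProcessor`, whose exact behaviour may differ, and both function names are hypothetical.

```python
def banned_next_tokens(output_ids, ngram_size, window_size):
    """Return token ids that would complete an n-gram already present
    among the last `window_size` generated tokens."""
    if ngram_size < 2:
        raise ValueError("ngram_size must be at least 2")
    if len(output_ids) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens are the prefix of the n-gram being formed.
    prefix = tuple(output_ids[-(ngram_size - 1):])
    start = max(0, len(output_ids) - window_size)
    banned = set()
    # Slide over every complete n-gram inside the recent window.
    for i in range(start, len(output_ids) - ngram_size + 1):
        ngram = tuple(output_ids[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


def apply_ban(logits, banned):
    """Set banned token logits to -inf so they cannot be selected."""
    return [float("-inf") if i in banned else x for i, x in enumerate(logits)]


history = [1, 2, 3, 4, 1, 2]  # the model has just repeated "1 2"
banned = banned_next_tokens(history, ngram_size=3, window_size=1024)
print(banned)                 # emitting token 3 would repeat the 3-gram (1, 2, 3)
print(apply_ban([0.0] * 5, banned))
```

A large `ngram_size` such as 35 only intervenes on long verbatim repeats, which leaves legitimately recurring short phrases (table headers, units) untouched, while `window_size` limits how far back the search looks.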
Trade-offs and Reality Checks #
Unlimited OCR achieves an impressive 93% on the OmniDocBench v1.5 benchmark, representing a 6% improvement over the DeepSeek-OCR baseline. However, developers should not mistake "unlimited" for infinite capacity.
While the R-SWA mechanism ensures that the KV cache does not grow during the generation phase, the initial prefill phase is still bound by physical hardware limits. Encoding dozens of high-resolution pages simultaneously generates a massive number of visual tokens. These visual tokens must fit within the model's 32,768-token context window.
If you attempt to feed a 500-page technical manual into the model in a single pass, you will still overflow the context window during the prefill stage. The practical sweet spot for this model is documents ranging from 10 to 50 pages. For anything larger, you will still need to chunk the document, though you will be chunking by chapters or large sections rather than individual pages.
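A rough way to plan such chunks is to budget visual tokens against the context window. The sketch below is a planning aid under stated assumptions: `tokens_per_page` is a placeholder you would measure for your own rendering settings, and `reserved_tokens` is a hypothetical allowance for the prompt and output. Neither value comes from the model's documentation.

```python
def plan_chunks(num_pages, tokens_per_page, context_length=32_768, reserved_tokens=2_048):
    """Split page indices into consecutive chunks whose visual tokens fit the budget.

    Returns a list of `range` objects, one per chunk, covering pages 0..num_pages-1.
    """
    budget = context_length - reserved_tokens
    if tokens_per_page <= 0 or tokens_per_page > budget:
        raise ValueError("a single page does not fit in the token budget")
    pages_per_chunk = budget // tokens_per_page
    return [
        range(start, min(start + pages_per_chunk, num_pages))
        for start in range(0, num_pages, pages_per_chunk)
    ]


# Illustrative only: 600 visual tokens per page is an assumed figure.
chunks = plan_chunks(num_pages=500, tokens_per_page=600)
print(len(chunks), "chunks; first:", chunks[0], "last:", chunks[-1])
```

In practice you would nudge each boundary to the nearest chapter or section break so that tables and paragraphs are not split mid-flow.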
Additionally, the model operates in two distinct modes: gundam and base. The gundam mode is optimized for single-image analysis, utilizing a 1024-pixel base resolution with dynamic cropping to capture fine details. The base mode is designed for multi-page and PDF parsing, maintaining a standard 1024-pixel resolution without cropping. When processing complex, multi-column PDFs with tiny fonts in base mode, the lack of dynamic cropping can occasionally lead to missed characters compared to single-page specialized parsers.
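A client can encode that choice in one small helper. This is a sketch with a hypothetical function name; the `images_config` / `image_mode` keys follow the request payload shown earlier, but passing `"gundam"` as the value is an assumption based on the mode's name, since only `"base"` appears in that payload.

```python
def images_config_for(num_images):
    """Pick the image mode for a request based on how many images it carries."""
    if num_images < 1:
        raise ValueError("at least one image is required")
    # Single image: dynamic cropping for fine detail (assumed value "gundam").
    # Multiple pages: fixed resolution without cropping ("base").
    mode = "gundam" if num_images == 1 else "base"
    return {"image_mode": mode}


print(images_config_for(1))
print(images_config_for(12))
```

For dense multi-column pages where base mode misses small characters, one option is to re-run only the affected pages individually in the single-image mode.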
The Verdict #
Baidu's Unlimited OCR is a highly practical release that addresses a real-world bottleneck in document processing. By replacing standard full attention with R-SWA, it shows that we do not need to brute-force context windows or accept generation that slows as the transcript grows in order to parse long documents.
For teams building RAG pipelines, legal tech applications, or financial analysis tools, this model offers a way to simplify ingestion infrastructure. It allows you to throw away complex page-stitching scripts and replace them with a single, high-throughput model call that preserves document layout and structural flow.
Priya Nair · AI & Developer Experience Writer
Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.