Unlimited OCR: One-shot long-horizon parsing

Baidu released Unlimited-OCR, a one-shot long-horizon parsing model that extends DeepSeek-OCR, on June 22, 2026. The open-source model supports single-image and multi-page PDF parsing with configurable image sizes and is available on GitHub and ModelScope.
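
The "configurable image sizes" mentioned above correspond to two named single-image configurations that the usage examples in this README call `gundam` and `base`. As a minimal sketch, the size parameters can be kept in one lookup table so they are not retyped on every call. The values are copied from this README's examples; the names `SINGLE_IMAGE_PRESETS` and `single_image_kwargs` are hypothetical and not part of the model's API.

```python
# Hypothetical helper: the dictionary and function names are illustrative only.
# The numeric values are the ones listed in this README's inference examples.
SINGLE_IMAGE_PRESETS = {
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
}


def single_image_kwargs(mode: str) -> dict:
    """Return a copy of the size-related keyword arguments for a named preset."""
    if mode not in SINGLE_IMAGE_PRESETS:
        expected = sorted(SINGLE_IMAGE_PRESETS)
        raise ValueError(f"unknown mode {mode!r}; expected one of {expected}")
    # Copy so callers can modify the result without touching the table.
    return dict(SINGLE_IMAGE_PRESETS[mode])


if __name__ == "__main__":
    print(single_image_kwargs("gundam"))
    print(single_image_kwargs("base"))
```

Assuming `model.infer` accepts these as keyword arguments, as the README's single-image example suggests, the result could be unpacked with `**single_image_kwargs("gundam")`. Multi-page and PDF parsing use the `base` size only, so no lookup is needed there.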

- 2026/06/23 📄 Our paper is now available on [arXiv](https://arxiv.org/abs/2606.23050).
- 2026/06/23 🤝 Thanks to the ModelScope community for their support. Our model is now available at [ModelScope](https://modelscope.cn/models/PaddlePaddle/Unlimited-OCR).
- 2026/06/22 🚀 We present [Unlimited-OCR](https://github.com/baidu/Unlimited-OCR), aiming to push [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR) one step further.

Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements, tested on Python 3.12.3 + CUDA 12.9:

```
torch==2.10.0
torchvision==0.25.0
transformers==4.57.1
Pillow==12.1.1
matplotlib==3.10.8
einops==0.8.2
addict==2.4.0
easydict==1.13
pymupdf==1.27.2.2
psutil==7.2.2
```

```python
import os
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'baidu/Unlimited-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()

# ── Single image (supports two configs: gundam or base) ──
# gundam: base_size=1024, image_size=640,  crop_mode=True
# base:   base_size=1024, image_size=1024, crop_mode=False
model.infer(
    tokenizer,
    prompt='<image>document parsing.',
    image_file='your_image.jpg',
    output_path='your/output/dir',
    base_size=1024,
    image_size=640,
    crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=128,
    save_results=True,
)

# ── Multi page / PDF (only uses base, image_size=1024) ──
model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=1024,
    save_results=True,
)

# ── PDF (convert pages to images, then multi-page parsing) ──
import tempfile
import fitz  # PyMuPDF

def pdf_to_images(pdf_path, dpi=300):
    """Render each PDF page to a PNG in a temporary directory and return the paths."""
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
    # PDF user space is 72 points per inch, so dpi / 72 is the zoom factor.
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    paths = []
    for i, page in enumerate(doc):
        out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
        page.get_pixmap(matrix=mat).save(out)
        paths.append(out)
    doc.close()
    return paths

model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=pdf_to_images('your_doc.pdf', dpi=300),
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=1024,
    save_results=True,
)
```

Set up the environment (uv-managed virtualenv). Install the local SGLang wheel first, then pin `kernels==0.11.7` and install PyMuPDF for PDF-to-image conversion:

```bash
uv venv --python 3.12
source .venv/bin/activate
uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7
uv pip install pymupdf==1.27.2.2
```

Start the SGLang server:

```bash
python -m sglang.launch_server \
    --model baidu/Unlimited-OCR \
    --served-model-name Unlimited-OCR \
    --attention-backend fa3 \
    --page-size 1 \
    --mem-fraction-static 0.8 \
    --context-length 32768 \
    --enable-custom-logit-processor \
    --disable-overlap-schedule \
    --skip-server-warmup \
    --host 0.0.0.0 \
    --port 10000
```

Send streaming requests to the OpenAI-compatible API:

```python
import base64
import json
import os
import tempfile

import fitz
import requests
from sglang.srt.sampling.custom_logit_processor import (
    DeepseekOCRNoRepeatNGramLogitProcessor,
)

server_url = "http://127.0.0.1:10000"
session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment


def pdf_to_images(pdf_path, dpi=300):
    """Render each PDF page to a PNG in a temporary directory and return the paths."""
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    image_paths = []
    for i, page in enumerate(doc):
        image_path = os.path.join(tmp_dir, f"page_{i + 1:04d}.png")
        page.get_pixmap(matrix=mat).save(image_path)
        image_paths.append(image_path)
    doc.close()
    return image_paths


def encode_image(image_path):
    """Wrap an image file as a base64 data-URL content part."""
    ext = os.path.splitext(image_path)[1].lower()
    mime = "image/jpeg" if ext in (".jpg", ".jpeg") else f"image/{ext.lstrip('.')}"
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}


def build_content(prompt, image_paths):
    """Build the message content: the text prompt followed by one part per image."""
    return [{"type": "text", "text": prompt}] + [
        encode_image(path) for path in image_paths
    ]


def generate(prompt, image_paths, image_mode, ngram_window):
    payload = {
        "model": "Unlimited-OCR",
        "messages": [
            {"role": "user", "content": build_content(prompt, image_paths)}
        ],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": image_mode},
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {
            "ngram_size": 35,
            "window_size": ngram_window,
        },
        "stream": True,
    }

    response = session.post(
        f"{server_url}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=1200,
        stream=True,
    )
    response.raise_for_status()

    # Read the server-sent event stream and print tokens as they arrive.
    chunks = []
    for line in response.iter_lines(chunk_size=1, decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        event = json.loads(data)
        delta = event["choices"][0].get("delta", {}).get("content", "")
        if delta:
            print(delta, end="", flush=True)
            chunks.append(delta)
    print()
    return "".join(chunks)


# Single image supports two configs: gundam or base. Example below uses gundam.
generate("document parsing.", ["your_image.jpg"], image_mode="gundam", ngram_window=128)

# Multi image (base only)
generate("Multi page parsing.", ["page1.png", "page2.png"], image_mode="base", ngram_window=1024)

# PDF (base only)
generate("Multi page parsing.", pdf_to_images("your_doc.pdf", dpi=300), image_mode="base", ngram_window=1024)
```

For batch inference, `infer.py` starts the SGLang server automatically and sends concurrent requests for an image directory or PDF:

```bash
# Image directory
python infer.py \
    --image_dir ./examples/images \
    --output_dir ./outputs \
    --concurrency 8 \
    --image_mode gundam

# PDF pages
python infer.py \
    --pdf ./examples/document.pdf \
    --output_dir ./outputs \
    --concurrency 8 \
    --image_mode gundam
```

Useful options:

```bash
--model_dir baidu/Unlimited-OCR        # Local path or Hugging Face model ID
--gpu 0                                # CUDA_VISIBLE_DEVICES value
--server_log ./log/sglang_server.log
```

We would like to thank [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR), [DeepSeek-OCR-2](https://github.com/deepseek-ai/DeepSeek-OCR-2), and [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) for their valuable models and ideas.

```bibtex
@misc{yin2026unlimitedocrworks,
  title={Unlimited OCR Works},
  author={Youyang Yin and Huanhuan Liu and YY and Qunyi Xie and Chaorun Liu and Shiqi Yang and Shaohua Wang and Zhanlong Liu and Hao Zou and Jinyue Chen and Shu Wei and Jingjing Wu and Mingxin Huang and Zhen Wu and Guibin Wang and Tengyu Du and Lei Jia},
  year={2026},
  eprint={2606.23050},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.23050},
}
```
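
The streaming client in the SGLang section reads the response line by line and parses it in the same loop as the network I/O. The sketch below separates the parsing step so it can be checked offline against sample lines, without a running server. It assumes the stream follows the usual OpenAI-style server-sent event layout (`data: {json}` lines ending with `data: [DONE]`); the function name `iter_stream_text` is hypothetical and not part of this repository.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent event lines (assumed format)."""
    prefix = "data: "
    for line in lines:
        # Skip keep-alive blanks, comments, and any non-data lines.
        if not line or not line.startswith(prefix):
            continue
        data = line[len(prefix):]
        if data.strip() == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(data)
        choices = event.get("choices") or []
        if not choices:
            continue
        # Some chunks carry only a role or a finish reason, with no content.
        delta = choices[0].get("delta", {}).get("content") or ""
        if delta:
            yield delta


if __name__ == "__main__":
    sample = [
        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "# Title"}}]}',
        'data: {"choices": [{"delta": {"content": " and body"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_stream_text(sample)))
```

With a live server, the same function could be fed `response.iter_lines(decode_unicode=True)` from the `requests` call used in the client example.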