Snap any image, screenshot, or webpage into plaintext. No GPU. No cloud. One command.
textsnap screenshot.png
That's it. You get a .txt next to your shell, recognized on your CPU, from a screenshot, a photo, an image URL, or even a webpage.
- **Runs on CPU.** A 0.9B PaddleOCR-VL-1.5 vision-language model, quantized to q4 ONNX, parses full pages on a plain laptop. No CUDA. No M-series-only tricks. Plain old cores, pinned to your physical-core count.
- **Images, screenshots, URLs, webpages.** Point it at a local file, a direct image URL, or a full article URL; it isolates the main content and OCRs the most prominent image. Or OCR straight from your clipboard with no argument at all, and get the text put back on your clipboard, ready to paste.
- **Offline after first run.** ~890 MB of ONNX downloads once to your cache and stays there. No API keys. No quotas. Your images never leave your machine.
- **Portable.** Drop the model files next to the script and the whole folder becomes a self-contained, copy-anywhere tool: no install, no download, no flags.
- **One file.** The whole tool is a single Python module. Dependencies install themselves on first run if missing.
- **Markdown or plaintext.** Default output is the model's native markdown (tables, headings, structure preserved). Add --plaintext to flatten it.
pip install textsnap
textsnap screenshot.png
textsnap https://example.com/article --plaintext
textsnap photo.jpg -o ~/notes/receipt.txt
The first run downloads the model (~890 MB). Every run after is offline.
| Source | Example |
|---|---|
| Clipboard | textsnap (no argument) |
| Local image file | textsnap path/to/img.png |
| Direct image URL | textsnap https://example.com/x.png |
| Webpage URL | textsnap https://example.com/article |
Local files cover anything Pillow can decode: .png, .jpg, .jpeg, .webp, .bmp, .gif, .tiff, and friends. For webpage URLs, textsnap uses readability to isolate the main content, then picks the most prominent image on the page and OCRs that.
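The four input sources in the table can be told apart from the argument alone. The following is a minimal sketch of that dispatch, not textsnap's actual code: the function name `classify_input` and the suffix set are assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse

# Illustrative subset of image suffixes (assumption, not textsnap's real list).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".gif", ".tiff"}


def classify_input(arg):
    """Return 'clipboard', 'image_url', 'webpage_url', or 'local_file'."""
    if arg is None:
        # No argument at all: read the image from the clipboard.
        return "clipboard"
    parsed = urlparse(arg)
    if parsed.scheme in ("http", "https"):
        # A URL whose path ends in an image suffix is treated as a direct image.
        suffix = Path(parsed.path).suffix.lower()
        return "image_url" if suffix in IMAGE_SUFFIXES else "webpage_url"
    # Anything else is assumed to be a path on disk.
    return "local_file"
```

A real implementation would probably also inspect the response's Content-Type header, since not every image URL ends in a recognizable suffix.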
Run textsnap with no argument and it reads the image currently on your clipboard. The recognized text is then copied straight back to the clipboard, so a screenshot-to-text round trip is just: snap → textsnap → paste.
The .txt file is still written as well (and its path still printed to stdout), so nothing about scripting changes; the clipboard copy is a pure convenience layered on top.
Clipboard-out uses your platform's native tool, pbcopy (macOS), clip (Windows), or wl-copy / xclip / xsel (Linux), so it needs no extra Python package. If none of those is installed, textsnap simply skips the clipboard copy; the .txt file is always there regardless. (Run with -v to see whether the copy succeeded.)
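The fallback chain described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the tool's implementation: the helper names `clipboard_commands` and `copy_to_clipboard` are hypothetical, and the exact command-line arguments textsnap passes to each tool are not documented here.

```python
import shutil
import subprocess
import sys


def clipboard_commands(platform):
    """Candidate clipboard-writer commands for a platform, in preference order."""
    if platform == "darwin":
        return [["pbcopy"]]
    if platform.startswith("win"):
        return [["clip"]]
    # Linux and other Unix-likes: Wayland first, then the two X11 tools.
    return [
        ["wl-copy"],
        ["xclip", "-selection", "clipboard"],
        ["xsel", "--clipboard", "--input"],
    ]


def copy_to_clipboard(text, platform=sys.platform):
    """Pipe text into the first available tool; return False if none worked."""
    for cmd in clipboard_commands(platform):
        if shutil.which(cmd[0]) is None:
            continue  # tool not installed, try the next candidate
        try:
            # Assumption: UTF-8 on stdin. Windows' clip may expect another encoding.
            subprocess.run(cmd, input=text.encode("utf-8"), check=True)
            return True
        except (OSError, subprocess.CalledProcessError):
            continue
    return False  # caller skips the clipboard copy; the .txt file still exists
```

Returning a boolean rather than raising matches the behaviour described above, where a missing clipboard tool is a skipped convenience and not an error.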
By default textsnap downloads its model files to an OS cache directory (~/.cache/textsnap/). But if it finds the model files sitting next to the script, it uses those directly: no download, no --model-dir flag, no setup at all.
"Next to the script" means a layout like:
textsnap/
├── textsnap.py
├── onnx/
│   ├── vision_encoder_q4.onnx
│   ├── decoder_q4.onnx
│   └── embedding.onnx
└── tokenizer.json
Drop those files in, and you can copy the entire textsnap/ folder to any machine (a USB stick, an air-gapped box, a fresh laptop) and run it immediately, fully offline, with zero install steps.
Model-directory resolution order:
1. **--model-dir DIR**: if you pass it explicitly, it always wins.
2. **Portable**: model files found next to the script.
3. **OS cache**: ~/.cache/textsnap/, downloaded on first run if needed.
Like --model-dir, portable-mode files are not SHA-256 verified; files you placed there yourself are trusted by definition. Integrity verification applies to files textsnap downloads. See Security.
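The resolution order can be expressed as a short function. This is a sketch under assumptions: `resolve_model_dir` and `has_model_files` are hypothetical names, and the file list simply mirrors the portable layout shown earlier.

```python
from pathlib import Path

# Files from the portable layout shown earlier (assumed to be the required set).
REQUIRED_FILES = (
    "onnx/vision_encoder_q4.onnx",
    "onnx/decoder_q4.onnx",
    "onnx/embedding.onnx",
    "tokenizer.json",
)


def has_model_files(directory):
    """True if every required model file exists under `directory`."""
    return all((Path(directory) / name).is_file() for name in REQUIRED_FILES)


def resolve_model_dir(explicit, script_dir, cache_dir):
    """Return (directory, mode) following: explicit flag, portable, OS cache."""
    if explicit is not None:
        return Path(explicit), "explicit"   # --model-dir always wins
    if has_model_files(script_dir):
        return Path(script_dir), "portable"  # files sitting next to the script
    return Path(cache_dir), "cache"          # download here on first run if needed
```

Returning the mode alongside the path lets the caller decide whether SHA-256 verification applies, since only the cache mode involves downloaded files.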
pip install textsnap
Installs two equivalent commands on your PATH: **textsnap** (canonical) and ocr (alias, for when the name slips your mind).
To install from a local source checkout instead:
pip install .
For a reproducible install with exact pinned dependency versions:
pip install -r requirements-lock.txt
pip install .
**Clipboard note.** Reading images from the clipboard relies on Pillow's ImageGrab; on Linux you may need xclip or wl-clipboard installed. Writing recognized text back to the clipboard uses pbcopy / clip / wl-copy / xclip / xsel. macOS and Windows work out of the box.
textsnap
textsnap path/to/screenshot.png
textsnap "https://example.com/diagram.png"
textsnap "https://example.com/article"
textsnap input.png --plaintext
textsnap input.png -o ./out/extracted.txt
textsnap dense-page.png --max-tokens 4096
textsnap input.png --max-pixels 250000
textsnap input.png --model-dir ~/models/paddleocr-vl
Plaintext, UTF-8. Default location is ./textsnaps/ (created if missing) under the current working directory; override with -o. The filename is derived from the image filename stem (for example, receipt_ocr.txt), or from the webpage slug for URL inputs.
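As an illustration of the naming rule, the sketch below derives a default output path from a source argument. The function name `default_output_path` is hypothetical, and the way a "slug" is taken from a URL (last path segment, falling back to the host) is an assumption; the source only states that a slug is used.

```python
from pathlib import Path
from urllib.parse import urlparse


def default_output_path(source, out_dir="textsnaps"):
    """Build <out_dir>/<name>_ocr.txt from a file path or URL."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Assumed slug rule: last non-empty path segment, else the host name.
        segments = [s for s in parsed.path.split("/") if s]
        stem = Path(segments[-1]).stem if segments else parsed.netloc
    else:
        # Local file: drop the directory and the extension.
        stem = Path(source).stem
    return Path(out_dir) / f"{stem}_ocr.txt"
```

A caller would still need to create the directory first, for instance with `path.parent.mkdir(parents=True, exist_ok=True)`.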
textsnap is quiet by default, Unix-style: the only thing printed to stdout is the path to the file it wrote, so it composes cleanly:
OUT=$(textsnap receipt.png) # capture the path
textsnap receipt.png | xargs cat # print the recognized text
When the input is the clipboard, the recognized text is also placed on the clipboard โ see Clipboard in, clipboard out.
Pass -v to send progress diagnostics (input type, image size, decode speed, token counts) to stderr; stdout stays just the path either way.
Default file output is the model's native markdown, which preserves tables, headings, and document structure:
| Region | Revenue |
| ------ | ------- |
| EMEA | $1.2M |
| APAC | $0.9M |
With **--plaintext**, markdown is flattened to bare text:
Quarterly Report
Region Revenue
EMEA $1.2M
APAC $0.9M
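The flattening shown in this example could be approximated with a few regular expressions. This is a rough sketch, not the rules textsnap actually applies: `flatten_markdown` is a hypothetical name, and it only handles headings, pipe tables, bold markers, and backticks.

```python
import re


def flatten_markdown(md):
    """Reduce simple markdown (headings, pipe tables, bold, code marks) to text."""
    lines = []
    for line in md.splitlines():
        stripped = line.strip()
        # Drop table separator rows such as "| ------ | ------- |".
        if "-" in stripped and re.fullmatch(r"\|?[\s:|-]+\|?", stripped):
            continue
        if stripped.startswith("|"):
            # Table row: keep the cell text, joined by single spaces.
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            lines.append(" ".join(cells))
            continue
        stripped = re.sub(r"^#{1,6}\s+", "", stripped)  # heading markers
        stripped = stripped.replace("**", "").replace("`", "")  # bold / code marks
        lines.append(stripped)
    return "\n".join(lines)
```

Feeding it a heading followed by the table above yields one line per row with the pipes removed, which is the shape of the plaintext sample.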
| Flag | Description |
|---|---|
| -o, --output | Output .txt path. Default: ./textsnaps/<name>_ocr.txt. |
| -v, --verbose | Print progress diagnostics to stderr. Off by default. |
| --plaintext | Flatten the model's native markdown to plain text. |
| --model-dir | Use ONNX/config files from this directory. Overrides portable mode and the OS cache. |
| --max-tokens | Cap generated tokens. Default 2048. Raise it for very dense pages. |
| --max-pixels | Image pixel budget fed to the vision encoder. Default is the model's maximum. Lower trades accuracy for speed; too low makes the model hallucinate. The image is only ever shrunk, never enlarged. |
| --no-verify | Skip SHA-256 verification of downloaded model files (not advised). |
| --generate-checksums | Download the pinned model files, write a fresh manifest, and exit. |
An environment variable, TEXTSNAP_DECODE_THREADS, overrides the decoder's intra-op thread count if you want to tune CPU decode for a specific machine. Left unset, textsnap picks a sensible default based on your physical core count.
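A minimal sketch of how such an override might be resolved follows. The function name `decode_threads` is hypothetical, and the fallback of half the logical CPU count is only a rough stand-in for a physical-core count (it assumes two hardware threads per core); how textsnap actually counts physical cores is not stated here.

```python
import os


def decode_threads(env=None, default_cores=None):
    """Pick a thread count: valid env override first, then a core-based default."""
    env = os.environ if env is None else env
    raw = env.get("TEXTSNAP_DECODE_THREADS", "").strip()
    if raw:
        try:
            value = int(raw)
            if value > 0:
                return value  # explicit, valid override wins
        except ValueError:
            pass  # ignore non-numeric values and fall through to the default
    if default_cores is not None:
        return default_cores
    # Assumption: approximate physical cores as half the logical CPUs.
    logical = os.cpu_count() or 1
    return max(1, logical // 2)
```

With ONNX Runtime, a value like this would typically be assigned to `SessionOptions.intra_op_num_threads` before the decoder session is created.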
textsnap auto-downloads ~890 MB of model weights from the Hugging Face Hub on first run, so it treats those files as untrusted until proven otherwise:
- **Pinned model revision.** Downloads are pinned to a specific repo revision, so a moved or retagged main can't silently swap the weights.
- **SHA-256 verification.** Every downloaded file is hashed and checked against known-good digests before it's loaded. A mismatch aborts the run with a clear error rather than executing unverified weights. Digests live in model_checksums.sha256 and are also embedded in the script as a fallback, so verification works whether you install from source or from a wheel.
- **Pinned dependencies.** requirements-lock.txt pins exact dependency versions for reproducible installs; the file documents how to add per-wheel --hash entries with pip-compile --generate-hashes for full supply-chain pinning.
Verification applies to files textsnap downloads. Model files you supply yourself, via --model-dir or portable mode, are trusted as-is and not re-hashed; you are responsible for their provenance.
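The hash-then-compare step can be sketched with `hashlib`. This is an illustration, not textsnap's verifier: the helper names are hypothetical, and the manifest format is assumed to follow the common `sha256sum` layout (`<hex digest>  <relative path>` per line), which the source does not confirm.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Parse assumed sha256sum-style lines into {relative_path: digest}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(None, 1)
        expected[name.lstrip("*").strip()] = digest.lower()
    return expected


def verify_files(model_dir, expected):
    """Raise ValueError on the first file whose digest does not match."""
    for name, digest in expected.items():
        actual = sha256_of(Path(model_dir) / name)
        if actual != digest:
            raise ValueError(
                f"SHA-256 mismatch for {name}: expected {digest}, got {actual}"
            )
```

Raising before any model is loaded mirrors the behaviour described above, where a mismatch aborts the run instead of executing unverified weights.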
Regenerate the checksum manifest after a deliberate model-revision bump:
textsnap --generate-checksums
To bypass verification (for local experimentation with a modified model), pass --no-verify.
1. **Load.** From the clipboard, a local file, a direct image URL, or, for a webpage URL, the most prominent image inside the page's main content (readability + a prominence heuristic).
2. **Preprocess.** The image is run through PaddleOCR-VL's Qwen2-VL-style smart-resize and patchify, producing the pixel-value tensor and grid the vision encoder expects. Smart-resize bounds the image to the model's pixel budget (tunable with --max-pixels) and snaps it to the patch grid; textsnap does not pre-shrink beyond that, since starving the encoder of resolution makes the model hallucinate rather than degrade gracefully.
3. **Recognize.** Three ONNX components run on CPU: a vision encoder (q4), a token-embedding model (fp32), and an autoregressive decoder (q4) with a wired-up KV cache bound via ONNX Runtime IOBinding to avoid copying the cache each step. Greedy decode, guarded against runaway repetition by an n-gram block (it refuses to re-emit an n-gram it has already produced) plus a loop detector that trims any cycle that slips through.
4. **Format.** Native markdown by default; --plaintext reduces it to bare text.
No image is sent anywhere. No state is kept between runs except the cached model.
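To make the preprocess step more concrete, here is a simplified sketch in the spirit of Qwen2-VL's smart-resize: snap both sides to a grid, then shrink proportionally if the result exceeds the pixel budget. The grid size of 28 (a 14-pixel patch times a 2x spatial merge) is an assumption carried over from Qwen2-VL, not a confirmed PaddleOCR-VL value, and this version omits the minimum-pixel branch of the original.

```python
import math


def smart_resize(height, width, max_pixels, grid=28):
    """Return (new_height, new_width), both multiples of `grid`."""
    # Snap each side to the nearest grid multiple, keeping at least one cell.
    h = max(grid, round(height / grid) * grid)
    w = max(grid, round(width / grid) * grid)
    if h * w > max_pixels:
        # Over budget: scale both sides by the same factor, rounding down
        # so the snapped result stays within max_pixels.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(grid, math.floor(height / scale / grid) * grid)
        w = max(grid, math.floor(width / scale / grid) * grid)
    return h, w
```

Note that snapping to the grid can nudge a side by up to half a cell in either direction, so "only ever shrunk" refers to the budget-driven scaling, not to this rounding.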
The PaddleOCR-VL-1.5 ONNX components are downloaded on first run to ~/.cache/textsnap/:
- onnx/vision_encoder_q4.onnx: vision encoder + spatial-merge projector
- onnx/decoder_q4.onnx: autoregressive decoder
- onnx/embedding.onnx: token embeddings (fp32; no q4 variant exists)
- tokenizer.json, config.json

Together ~890 MB. To use your own copy, either point --model-dir at a directory containing the same onnx/ files plus tokenizer.json and config.json, or place those files next to the script for portable mode.
- **First run is the slow one**: it downloads ~890 MB. After that, textsnap is fully offline.
- **CPU decode is sequential.** Dense, full-page documents take longer than a short screenshot. textsnap pins thread counts to your physical cores and prints a live tokens/sec readout so a slow run is visibly alive, not hung.
- **--max-tokens caps the output.** Very dense pages can hit the default 2048-token cap and truncate; raise it if the tail of a page is missing.
- **--max-pixels is a speed/accuracy dial.** Lowering it speeds up the vision encoder but feeds the model a coarser image; set it too low and recognition quality drops sharply. The default (the model's full budget) is the safe choice.
- **Webpage inputs OCR one image**: the most prominent one in the main content, not the whole rendered page.
- **Greedy decoding** can occasionally loop on repetitive layouts; an n-gram block prevents most loops outright and a detector trims any that remain.
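The n-gram block mentioned for greedy decoding follows a well-known pattern: before picking the next token, ban any token that would complete an n-gram already present in the output. The sketch below shows that idea on plain Python lists; the function names and the choice of n = 3 are illustrative assumptions, not textsnap's actual settings.

```python
def banned_next_tokens(tokens, n):
    """Tokens that would recreate an n-gram already present in `tokens`."""
    if n < 2 or len(tokens) < n:
        return set()
    prefix = tuple(tokens[-(n - 1):])  # the last n-1 tokens generated so far
    banned = set()
    # Scan every earlier n-gram; if it starts with the current prefix,
    # its final token would repeat that n-gram, so ban it.
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


def greedy_pick(scores, tokens, n=3):
    """Greedy argmax over `scores`, skipping tokens banned by the n-gram block."""
    banned = banned_next_tokens(tokens, n)
    candidates = [t for t in range(len(scores)) if t not in banned]
    if not candidates:
        candidates = list(range(len(scores)))  # nothing left: fall back to plain argmax
    return max(candidates, key=lambda t: scores[t])
```

For example, with the history `[1, 2, 3, 1, 2]` and n = 3, token 3 is banned because `1, 2, 3` has already been produced, so the pick falls to the best remaining token.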
MIT for this project; see LICENSE.
The model is PaddleOCR-VL-1.5, distributed under Apache-2.0 by PaddlePaddle; textsnap pulls the ONNX export from onnx-community/PaddleOCR-VL-1.5-ONNX. See the original model card for model terms. Powered by onnxruntime and