Quicktok: A faster tokenizer

Developer dmatth1 released quicktok, a fast exact BPE tokenizer written in C++ that runs 2–3.5× faster than bpe-openai and 4–11× faster than tiktoken on CPU. The tokenizer is byte-identical to tiktoken and available as a Python library and C++ source, targeting large-scale data processing and inference serving.

Been working on this for a while. It should be useful for anyone trying to speed up their tokenization flow.

Introducing quicktok (native C++ and Python)

quicktok is a fast, exact BPE tokenizer written in C++. Token ids are byte-identical to tiktoken, and encoding runs 2–3.5× faster than bpe-openai (the fastest alternative I know of) and 4–11× faster than tiktoken itself. I believe it's the fastest exact CPU tokenizer available today for these encodings. It ships cl100k, o200k, GPT-OSS (o200k_harmony), Llama-3, and Qwen2.5/3, all byte-exact, plus bring-your-own Llama-4.

This is useful for anyone doing large amounts of CPU-bound data processing (search indexing, ingesting corpora, token counting/billing) and can significantly reduce the time and cost of data ingestion. It can also be used for online request serving, such as CPU-bound inference paths (token counting, embedding serving).

I'm releasing it as a Python library (`pip install quicktok-v1`), and it's also available as C++ source. Repo: https://github.com/dmatth1/quicktok

Measured on 3 public corpora on my Apple M1, single thread, in MB/s. Every encoder's output was verified token-for-token identical against tiktoken before timing.

cl100k_base (GPT-3.5 / GPT-4)

| encoder | The Pile | GitHub code | Common Crawl |
|---|---|---|---|
| quicktok | 116.1 | 144.2 | 75.2 |
| bpe-openai | 36.5 | 41.6 | 29.2 |
| tiktoken-rs | 15.3 | 14.3 | 13.5 |
| tiktoken (Python) | 14.7 | 13.2 | 12.3 |
| TokenDagger | 11.5 | 12.0 | 11.2 |

o200k_base (GPT-4o)

| encoder | The Pile | GitHub code | Common Crawl |
|---|---|---|---|
| quicktok | 100.6 | 117.1 | 59.2 |
| bpe-openai | 36.1 | 40.1 | 29.9 |
| tiktoken-rs | 23.1 | 20.9 | 17.9 |
| tiktoken (Python) | 21.6 | 19.3 | 16.3 |
| TokenDagger | 11.0 | 11.7 | 10.2 |

quicktok also beats llama.cpp's tokenizer on the Llama-3 vocab by ~14×.
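The measurement procedure described above (verify token ids against the reference first, then time a single thread and report MB/s) can be sketched as a small harness. This is a minimal sketch, not the repo's benchmark: the function names `exactness_gate` and `throughput_mb_s` are hypothetical, the two toy byte-level encoders stand in for real tokenizers, and 1 MB is assumed to mean 10^6 UTF-8 bytes.

```python
import time
from typing import Callable, List, Sequence

Encoder = Callable[[str], List[int]]


def exactness_gate(reference: Encoder, candidate: Encoder, docs: Sequence[str]) -> None:
    """Raise AssertionError if the candidate's ids differ from the reference on any doc."""
    for i, doc in enumerate(docs):
        if candidate(doc) != reference(doc):
            raise AssertionError(f"token ids differ on document {i}")


def throughput_mb_s(encode: Encoder, docs: Sequence[str], repeats: int = 3) -> float:
    """Best-of-`repeats` single-thread throughput, in MB/s (assumed 1 MB = 1e6 bytes)."""
    total_bytes = sum(len(d.encode("utf-8")) for d in docs)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for d in docs:
            encode(d)
        best = min(best, time.perf_counter() - start)
    return total_bytes / best / 1e6


# Toy stand-ins (hypothetical): both map text to its UTF-8 byte values.
def reference_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def candidate_encode(text: str) -> List[int]:
    return [b for b in text.encode("utf-8")]


docs = ["hello world", "naïve café", "def f(x):\n    return x\n"] * 1000

# Gate first: timing an encoder that produces different ids would be meaningless.
exactness_gate(reference_encode, candidate_encode, docs)
print(f"candidate: {throughput_mb_s(candidate_encode, docs):.1f} MB/s")
```

To compare real tokenizers, pass their encode callables instead of the toy ones, for example `tiktoken.get_encoding("cl100k_base").encode_ordinary` as the reference. The post does not show quicktok's Python API, so it is not assumed here.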
The parallel `encode_batch` reaches 706 MB/s native on 8 cores; from Python it sustains 550 MB/s, which is 24× tiktoken's batch API. The speedups hold on other architectures like x86.

To keep the comparison fair, each encoder is called through the same raw API its own benchmark uses. TokenDagger's README claims 2–4× over tiktoken, but that's on Llama-4/Mistral vocabs on AMD EPYC; on cl100k/o200k it lands around Python tiktoken's level. To reproduce, run `make bench-compare` in the repo.

The fundamental algorithm is the same as bpe-openai's (exact backtracking BPE); see their blog post: https://github.blog/ai-and-ml/llms/so-many-tokens-so-little-time-introducing-a-faster-more-flexible-byte-pair-tokenizer/ . Much of the speedup over bpe-openai comes from data-structure engineering aimed at reducing memory accesses.

- All comparisons are single-threaded by design. Parallel/batch is available, but single-threaded is fair for comparison.
- Multilingual text (Common Crawl) is definitely the weakest ratio.
- Numbers above are from an M1 and were cross-checked on an x86 Xeon. The ordering holds on both, but absolute MB/s moves with corpus and host.
- Table numbers are from the native C++ build (`-march=native`). The prebuilt PyPI wheels are portable-ABI and land lower, roughly 1.1–1.6× bpe-openai depending on corpus. Building from source recovers the table numbers.
- The full methodology (corpus fetching, the exactness gate, raw-API rules) is in the bench README in the repo.

If you find an input where quicktok's ids differ from tiktoken's, that's definitely a bug, so please report it.
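For readers wondering what "exact" means here: a tokenizer is exact when it returns the same ids as the reference BPE merge procedure, which repeatedly merges the adjacent pair with the lowest rank. The sketch below shows that naive reference procedure on a hypothetical five-entry vocabulary (`TOY_RANKS`). It is not quicktok's or bpe-openai's backtracking algorithm, and it omits the regex pre-splitting that tiktoken applies before merging.

```python
from typing import Dict, List

# Hypothetical toy vocabulary: token bytes -> rank. As in tiktoken-style
# vocabularies, the rank doubles as the token id, and every single byte
# that can appear in the input has an entry.
TOY_RANKS: Dict[bytes, int] = {b"a": 0, b"b": 1, b"c": 2, b"bc": 3, b"ab": 4}


def bpe_encode_naive(piece: bytes, ranks: Dict[bytes, int]) -> List[int]:
    """Naive reference BPE: merge the lowest-ranked adjacent pair until none is left."""
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        best_rank, best_i = None, -1
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            # Strict '<' keeps the leftmost pair when ranks tie.
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_rank is None:
            break  # no adjacent pair is in the vocabulary
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [ranks[p] for p in parts]


print(bpe_encode_naive(b"abc", TOY_RANKS))   # [0, 3]  ("a", "bc")
print(bpe_encode_naive(b"abab", TOY_RANKS))  # [4, 4]  ("ab", "ab")
```

The first result shows why a simple greedy longest-match tokenizer is not exact: scanning left to right it would pick `ab` then `c` (ids `[4, 2]`), while the rank-ordered merge gives `[0, 3]`. Any faster algorithm has to reproduce the latter on every input.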