{"slug": "why-cjk-support-in-rust-is-hard", "title": "Why CJK Support in Rust Is Hard", "summary": "Embedding and processing CJK (Chinese, Japanese, Korean) text in Rust is significantly more complex than handling Latin scripts due to the enormous size of CJK fonts, which require font subsetting to be practical. Subsetting CJK fonts introduces technical challenges like glyph ID remapping, CMap table reconstruction, and correct PDF object graph construction, areas where the Rust ecosystem still has gaps. Additionally, CJK text presents unique encoding problems involving Unicode normalization forms and compatibility ideographs, which can cause subtle issues in text search and rendering.", "body_md": "Most Rust developers don't think about CJK until they need it. Then they discover that embedding Japanese text in a PDF, building a search index over Chinese content, or normalizing Korean input involves a stack of interlocking problems that Latin-script tooling simply never had to solve.\nThis post breaks down why CJK is genuinely hard — not just \"different\" — and where the Rust ecosystem still has gaps.\nThe first thing that surprises developers: a full CJK font file is enormous.\nA Latin font like Inter Regular is around 300 KB. A full Japanese font — say, Noto Sans CJK JP — is over 15 MB. That's because Unicode encodes more than 90,000 CJK Unified Ideographs across the main block and its extensions (the main block alone holds about 21,000), and a production font has to cover tens of thousands of them.\nFor most use cases you don't need anywhere near that many glyphs. If you're generating a PDF invoice with a customer name and address, you might use 50 distinct CJK characters. But a naive approach embeds the entire font, making a simple document balloon to 15 MB.\nThe solution is font subsetting: extract only the glyphs actually used, rebuild a minimal font binary, and embed that. It sounds straightforward. It isn't.\nSubsetting a Latin font is well-understood. Subsetting a CJK font involves:\nGlyph ID remapping. 
A font maps Unicode code points to internal Glyph IDs (GIDs). After subsetting, the GID space is compacted: the 50 glyphs you kept, plus the mandatory .notdef glyph at GID 0, now occupy GIDs 0 to 50. Every reference to the old GIDs in the font binary and in your document needs to be updated.\nCMap table reconstruction. The font's cmap table maps Unicode → GID. After subsetting, this table must be rebuilt to reflect the new GID assignments. Get this wrong and the font renders garbage or fails to load entirely.\nAdvance width recalculation. Fonts store per-glyph advance widths (how far the cursor moves after each character). After GID remapping, the width table must be reindexed. In PDF specifically, the /W array in the CIDFont dictionary (the CID-keyed counterpart of the /Widths array used by simple fonts) must match the new GIDs exactly — a mismatch causes text spacing to break in subtle, hard-to-debug ways.\nType0/CIDFont object graph. PDF represents CJK fonts as a two-level structure: a Type0 (composite) font wrapping a CIDFont. The CIDFont references the embedded font stream and the ToUnicode CMap. Building this object graph correctly requires understanding the PDF spec at a level most developers would rather avoid.\nIn pure Rust, the allsorts crate handles TTF subsetting. It works well for TrueType fonts. OpenType CFF fonts (.otf files with PostScript outlines) are more complex and allsorts coverage is incomplete — this is a known gap in the Rust ecosystem.\nPDF separates rendering (which glyph to draw) from semantics (what Unicode character that glyph represents). Rendering uses GIDs. Semantics are stored in a separate stream called the ToUnicode CMap.\nWithout a ToUnicode CMap:\nThe CMap is a PostScript-like stream that maps GID ranges to Unicode code points. For CJK fonts with thousands of glyphs, generating this stream correctly — with proper range compression for consecutive code points — requires care. 
A naive one-entry-per-glyph approach technically works but produces unnecessarily large streams.\nCJK text has an encoding problem that Latin scripts largely don't: the same logical character can have multiple valid representations.\nUnicode normalization forms (NFC, NFD, NFKC, NFKD) affect how composed characters are stored. Japanese text in particular mixes hiragana, katakana, kanji, and Latin characters, each with their own normalization quirks. Fullwidth ASCII (Ａ, Ｂ, Ｃ) and halfwidth katakana (ｱ, ｲ, ｳ) are compatibility-equivalent, not canonically equivalent, to their standard forms: NFKC folds them, NFC leaves them untouched.\nCJK Compatibility Ideographs (U+F900–U+FAFF) exist for round-trip compatibility with legacy encodings. Most of them have canonical singleton decompositions: U+FA30 is canonically equivalent to U+4FAE (侮), so even NFC replaces it with the unified ideograph. Squared forms in the separate CJK Compatibility block (U+3300–U+33FF) behave differently: U+3314 (㌔) is only compatibility-equivalent to U+30AD U+30ED (キロ), so NFKC expands it and NFC does not. Depending on whether and how you normalize before indexing, the same string might or might not match a query.\nVariation selectors add another layer. A single CJK Unified Ideograph can have several visual forms, such as regional glyph shapes or registered variants of the same character. Unicode encodes specific variants with Variation Selectors — invisible code points that follow a base character to select a specific glyph. 葛 followed by VS17 (U+E0100) selects a specific variant used in place names. A text search that isn't VS-aware will fail to match these strings.\nFor fuzzy matching over CJK content, you need to decide which of these equivalences to collapse before indexing. 
The right answer depends on the use case: a legal document system probably wants exact glyph matching; a general search index probably wants NFKC normalization.\nModern CJK text is Unicode, but a significant amount of real-world content is still encoded in legacy formats:\nConverting these to Unicode isn't just a lookup table — legacy CJK encodings have overlapping code spaces, vendor extensions, and edge cases that differ between Windows, macOS, and Linux implementations.\nThe encoding_rs crate (originally written for Firefox) is the authoritative pure Rust implementation of the WHATWG Encoding Standard and handles most of these correctly. This is one area where the Rust ecosystem is actually in good shape.\nThe elephant in the room: most production CJK text processing still depends on C or C++ libraries.\nHarfBuzz — text shaping (converting Unicode to positioned glyphs) — is C++. For CJK, shaping is relatively simple compared to Arabic or Indic scripts (no complex ligatures or bidirectional reordering), but HarfBuzz is still the de facto standard.\nFreeType — font rasterization — is C. If you're rendering CJK text to a bitmap, you're almost certainly using FreeType bindings.\nICU (International Components for Unicode) — normalization, collation, locale-aware string comparison — is C++. The icu4x project is a ground-up Rust rewrite led by the Unicode Consortium, and it's making solid progress, but it's not yet a drop-in replacement for all ICU use cases.\nThe consequence for Rust developers: if you need CJK support and reach for crates that wrap these C libraries, you give up WASM compatibility, you complicate cross-compilation, and you add a build-time dependency on the system libraries or vendored C sources.\nHere's an honest assessment of the pure Rust ecosystem for CJK work:\nThe gaps are real. 
Text shaping in particular is a hard open problem for pure Rust — for simple CJK rendering you can get away without a full shaper, but for mixed CJK/Latin text with proper kerning and ligatures, you eventually need something HarfBuzz-level.\nIf you're building something that needs to handle CJK text in Rust:\nUse encoding_rs. Don't roll your own.\nUse unicode-normalization and decide up front which form you want. For search, NFKC is usually the right default.\nCJK support isn't a single feature — it's a stack of problems that compound. The good news is that the Rust ecosystem is making real progress on each layer. The bad news is that each layer requires understanding the layer below it, which is why CJK support tends to be either \"works perfectly\" or \"completely broken\" with little middle ground.\nIf you're working on any of these problems — subsetting, normalization, collation, shaping — I'd love to compare notes.\nI've run into most of these problems while building harumi, a pure Rust PDF library with CJK font subsetting. The gaps in the table above are the ones I've personally hit.", "url": "https://wpnews.pro/news/why-cjk-support-in-rust-is-hard", "canonical_source": "https://dev.to/kent-tokyo/why-cjk-support-in-rust-is-hard-5bcf", "published_at": "2026-05-20 11:30:23+00:00", "updated_at": "2026-05-20 11:32:24.591462+00:00", "lang": "en", "topics": ["developer-tools", "open-source", "research", "data"], "entities": ["Rust", "CJK", "Unicode", "Noto Sans CJK JP", "Inter Regular"], "alternates": {"html": "https://wpnews.pro/news/why-cjk-support-in-rust-is-hard", "markdown": "https://wpnews.pro/news/why-cjk-support-in-rust-is-hard.md", "text": "https://wpnews.pro/news/why-cjk-support-in-rust-is-hard.txt", "jsonld": "https://wpnews.pro/news/why-cjk-support-in-rust-is-hard.jsonld"}}