NVIDIA Introduces X-Token: Projection-Guided Cross-Tokenizer KD That Outperforms GOLD by +3.82 Average Points on Llama-3.2-1B

NVIDIA researchers introduced X-Token, a logit-distribution-based method for cross-tokenizer knowledge distillation that requires no auxiliary trainable components or architectural changes. The method outperforms the current state-of-the-art GOLD by +3.82 average points on Llama-3.2-1B by solving structural failures in GOLD’s design, including the misalignment of critical tokens and over-conservative matching. X-Token uses dynamic-programming span alignment and a projection matrix to enable distillation across incompatible tokenizers, allowing practitioners to leverage stronger teachers like Phi-4-mini or Qwen3-4B.

Knowledge distillation (KD) transfers “dark knowledge” from a large teacher model to a smaller student. The student learns from the teacher’s full output probability distribution over tokens, not just correct answers. This is done via per-position Kullback–Leibler (KL) divergence over next-token probability distributions.

This formulation requires a shared tokenizer. A practitioner committed to Llama-3.2-1B cannot leverage stronger teachers with incompatible tokenizers, such as Phi-4-mini or Qwen3-4B, because token positions do not correspond across vocabularies. This also prevents multi-teacher distillation across tokenizer families.

NVIDIA researchers introduced X-Token, a logit-distribution-based method for cross-tokenizer KD. It operates as a drop-in replacement for the standard KD loss, requiring no auxiliary trainable components and no architectural changes.

The Problem X-Token is Solving

Two prior approaches dominate cross-tokenizer KD. ULD (Universal Logit Distillation) sidesteps vocabulary alignment by rank-sorting both distributions and minimizing L1 distance. It discards token identity entirely. GOLD adds span alignment and a hybrid loss.
It partitions tokens into a 1-to-1 string-matched common subset, trained with KL divergence, and an uncommon remainder, trained with ULD-style rank matching. GOLD is the current state of the art.

The research team identifies two structural failures in GOLD’s design:

Failure 1: Uncommon-token failure. When tokenizers fragment text differently, critical tokens fall into the unmatched uncommon subset. Llama-3 packs multi-digit numbers as single tokens: “201” is one token. Qwen3 splits them digit by digit: “2”, “0”, “1”. Under GOLD, all 1,100 of Llama’s two- and three-digit numerals (100 two-digit, 1,000 three-digit) fall into the uncommon set when Qwen3-4B is the teacher. Those tokens receive two types of harmful signal: identity-agnostic noise from rank-based ULD matching, and suppressive gradients from the common-KL term acting through the full-vocabulary softmax. The result: GSM8k accuracy drops to 2.56 under GOLD with Qwen3-4B, compared to 12.89 for same-tokenizer KD from a weaker Llama-3.2-3B teacher.

Failure 2: Over-conservative matching. GOLD uses strict string equality to define the common subset. A student token “Hundreds” corresponds to teacher tokens “Hund” followed by “reds” under teacher-side re-tokenization, but strict matching discards this pair. Useful alignment signal is lost even when the correspondence is well-formed.

These two failures require opposite remedies: eliminate the partition when critical tokens are misaligned, and relax it when alignment is structurally sound.

How X-Token Works

X-Token has three components: span alignment, a projection matrix W, and two complementary loss formulations, P-KL and H-KL.

Span Alignment

Teacher and student tokenizers produce sequences of different lengths for the same text. X-Token uses dynamic-programming (DP) span alignment, grouping tokens into chunks where each chunk-pair decodes to the same underlying text substring.
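
To make the idea of chunk-pairs concrete, here is a minimal sketch of span alignment. It is not the dynamic-programming procedure from the paper: it is a simplified two-pointer pass that assumes both token lists concatenate to exactly the same string, and it closes a chunk whenever the two sides have consumed the same number of characters. The function name `align_spans` and the token strings are hypothetical, chosen only for illustration.

```python
def align_spans(student_tokens, teacher_tokens):
    """Group tokens into chunk pairs that decode to the same substring.

    Simplified two-pointer sketch (not the paper's DP alignment).
    Assumes both token lists concatenate to the same text.
    Returns a list of ((s_start, s_end), (t_start, t_end)) index ranges,
    with end indices exclusive.
    """
    if "".join(student_tokens) != "".join(teacher_tokens):
        raise ValueError("token sequences must decode to the same text")

    chunks = []
    i = j = 0            # next unread token on each side
    s_start = t_start = 0  # where the current chunk began on each side
    len_s = len_t = 0    # characters consumed so far on each side

    while i < len(student_tokens) or j < len(teacher_tokens):
        # Advance whichever side has consumed fewer characters.
        if len_s <= len_t and i < len(student_tokens):
            len_s += len(student_tokens[i])
            i += 1
        else:
            len_t += len(teacher_tokens[j])
            j += 1
        # Both sides cover the same text prefix: close the chunk.
        if len_s == len_t and i > s_start and j > t_start:
            chunks.append(((s_start, i), (t_start, j)))
            s_start, t_start = i, j
    return chunks


# Hypothetical token strings mirroring the article's examples:
# the student keeps "201" and " Hundreds" whole, the teacher splits them.
student = ["201", " Hundreds"]
teacher = ["2", "0", "1", " Hund", "reds"]
spans = align_spans(student, teacher)
```

In this example the first chunk pairs the single student token with three teacher tokens, and the second pairs one student token with two teacher tokens. A real implementation would also have to handle byte-level disagreements between tokenizers, which this sketch rejects with an error.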
A chain-rule merge then combines per-token probabilities within each chunk into a single chunk-level distribution for use in the distillation loss. The alignment is cached per sequence and adds no per-step training overhead.

The research team also identifies a failure in TRL’s surface-substring alignment, which is used in TRL’s GOLD trainer. TRL accumulates per-side decoded buffers and flushes only when both buffers match as equal raw strings. A byte-level disagreement, such as Llama-3 auto-prepending