Linear Ensembles Can Erase LLM Watermarks

A new study reveals that linear ensembles of just three to five independently trained models can effectively erase watermarks embedded in LLM outputs. The research shows that averaging probability distributions from multiple models cancels out the perturbations used for attribution, reducing detection z-scores below standard thresholds and cutting true-positive rates to under 50%. The findings challenge the reliability of current watermarking schemes for provenance tracking in multi-provider applications.

Watermarking schemes that embed distributional perturbations into LLM outputs are effectively broken by linear ensembles of a few independently trained models. Most provenance tools assume that a tiny bias introduced at generation time survives any downstream processing and stays detectable by a statistical test. In practice, that assumption collapses as soon as an application draws from more than one provider: the same ensemble that improves text quality also wipes out the very signal we trusted for attribution.

Before this work, the community treated watermark perturbations as immutable once rendered into the probability distribution. Methods such as the z-score detector or the binary-mask classifier were evaluated on single-model generations and consistently reported true-positive rates above 90% at a 5% false-positive budget. A z-score threshold of 4 became the de facto benchmark for a “detectable” watermark, and no prior study had examined how ensemble decoding would interact with that statistic.

A linear ensemble of just three to five models eradicates the watermark signal in practice. “Empirically, simply averaging 3-5 models cancels out these perturbations.” (https://arxiv.org/abs/2605.30501) The cancellation works because each provider injects an independent perturbation; averaging restores the underlying, unwatermarked distribution up to a second-order error term. No extra training or fine-tuning is required; plain probability averaging suffices.

Averaging three models drives detection z-scores below the standard threshold of 4, cuts the true-positive rate at a 5% false-positive rate to under 50%, and simultaneously improves text quality by 27.5% while running six times faster than the strongest baseline.
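The cancellation argument can be illustrated with a toy simulation. The sketch below is not the paper's code: it assumes a green-list style watermark (each provider adds a fixed bias `delta` to the logits of a secret, key-dependent half of the vocabulary), and every name in it (`green_list`, `watermarked_probs`, `excess_green_mass`) is hypothetical. It averages the next-token distributions of `k` providers with independent keys and measures how much extra probability mass the first provider's green tokens still carry.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def green_list(vocab_size, key, gamma=0.5):
    """Assumed scheme: a secret key selects a fraction gamma of the vocabulary."""
    rng = random.Random(key)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def watermarked_probs(logits, green, delta=2.0):
    """Add a bias delta to the logits of green tokens, then normalise."""
    return softmax([x + delta if i in green else x for i, x in enumerate(logits)])


def average_probs(dists):
    """Linear ensemble: plain per-token average of the probability vectors."""
    k = len(dists)
    return [sum(col) / k for col in zip(*dists)]


def green_mass(probs, green):
    return sum(p for i, p in enumerate(probs) if i in green)


def excess_green_mass(k, vocab_size=1000, seed=0):
    """Extra mass on provider 0's green list after averaging k providers."""
    rng = random.Random(seed)
    # Toy assumption: all providers share the same unwatermarked logits.
    logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
    base = softmax(logits)
    greens = [green_list(vocab_size, key=1000 + j) for j in range(k)]
    mixed = average_probs([watermarked_probs(logits, g) for g in greens])
    return green_mass(mixed, greens[0]) - green_mass(base, greens[0])


if __name__ == "__main__":
    for k in (1, 3, 5):
        print(k, round(excess_green_mass(k), 3))
```

Under these assumptions the bias seen by any single provider's detector shrinks roughly in proportion to 1/k, because the other providers' perturbations are uncorrelated with its key. The toy shares identical base logits across providers, which real independently trained models would not, so it illustrates the mechanism only and does not reproduce the reported numbers.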
The authors summarise the results: “Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (below the detection threshold of 4) and reduces TPR@5%FPR to below 50%, while improving quality by 27.5% and running 6× faster than the best baseline on the long sequence generation.” (https://arxiv.org/abs/2605.30501) The authors also introduce WASH, a lightweight pipeline that aligns vocabularies and tokenisers across heterogeneous models, making the ensemble practical even when the constituent LLMs differ in architecture.

The study leaves open several important questions. It evaluates only six watermarking schemes and three LLMs, so it is unclear whether more sophisticated, non-linear perturbations would survive averaging. WASH mitigates vocabulary mismatches, yet scaling to dozens of providers may introduce latency or memory bottlenecks not captured in the reported six-fold speedup. Moreover, the analysis assumes that perturbations are statistically independent; coordinated watermarking could deliberately inject correlated noise to resist cancellation, but the feasibility of such industry-wide coordination remains speculative.

Robust provenance tracking can no longer rely on simple distributional watermarks unless model providers adopt a universally shared signing key or shift to cryptographic signatures that survive ensemble decoding. The immediate consequence is that any service that aggregates outputs from multiple LLM APIs must treat watermark-based detection as unreliable and consider alternative attribution mechanisms.
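The role of the independence assumption can be sketched with a second toy experiment. As before, this is an illustration under stated assumptions rather than the study's method: it assumes a context-free green-list watermark and the one-proportion z-statistic commonly used with such schemes, and the helper names (`z_score`, `ensemble_z`) and the keys are hypothetical. It samples tokens from a three-model average and scores them with the first provider's detector, once with independent keys and once with a key shared by all providers.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def green_list(vocab_size, key, gamma=0.5):
    """Assumed scheme: a secret key selects a fraction gamma of the vocabulary."""
    rng = random.Random(key)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def z_score(green_count, total, gamma=0.5):
    """One-proportion z-statistic: observed green tokens vs. the chance rate gamma."""
    expected = gamma * total
    std = math.sqrt(total * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def ensemble_z(keys, n_tokens=200, vocab_size=500, delta=2.0, seed=0):
    """Sample from the averaged distribution and score with the first key's detector."""
    rng = random.Random(seed)
    greens = [green_list(vocab_size, key) for key in keys]
    hits = 0
    for _ in range(n_tokens):
        # Toy assumption: models agree on base logits up to small per-model jitter.
        base = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
        dists = []
        for green in greens:
            logits = [x + rng.gauss(0.0, 0.3) + (delta if i in green else 0.0)
                      for i, x in enumerate(base)]
            dists.append(softmax(logits))
        mixed = [sum(col) / len(dists) for col in zip(*dists)]
        token = rng.choices(range(vocab_size), weights=mixed, k=1)[0]
        hits += token in greens[0]
    return z_score(hits, n_tokens)


z_independent = ensemble_z(keys=[11, 22, 33])  # each provider has its own key
z_shared = ensemble_z(keys=[11, 11, 11])       # all providers share one key

if __name__ == "__main__":
    print("independent keys:", round(z_independent, 2))
    print("shared key:", round(z_shared, 2))
```

In this setup the shared-key ensemble should retain a markedly higher z-score than the independent-key one, since correlated perturbations do not average out. The absolute values depend on the toy parameters (`delta`, `n_tokens`, vocabulary size) and are not comparable to the z-scores reported in the paper.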