Just finished reading Saha et al. arXiv 2506.07001 on adversarial paraphrasing for AI detector evasion.
Key claim: detector-guided paraphrasing with RoBERTa as reward reduces TPR by 87.88 percent across Binoculars, Fast-DetectGPT, Ghostbuster, RADAR, GPTZero. Universal, training-free.
What surprised me: the approach works even on detectors that were trained with adversarial examples baked in. Suggests the discriminator signal is fundamentally narrower than the generator space.
Open questions: