Notes on adversarial paraphrasing: a paper review

A paper by Saha et al. (arXiv 2506.07001) demonstrates that detector-guided paraphrasing using RoBERTa as a reward reduces the true positive rate of AI-generated text detectors by 87.88% across Binoculars, Fast-DetectGPT, Ghostbuster, RADAR, and GPTZero. The approach is universal and training-free, and it remains effective even against detectors trained with adversarial examples, suggesting the discriminator signal is narrower than the generator space.

Just finished reading Saha et al. arXiv 2506.07001 on adversarial paraphrasing for AI detector evasion. Key claim: detector-guided paraphrasing with RoBERTa as reward reduces TPR by 87.88 percent across Binoculars, Fast-DetectGPT, Ghostbuster, RADAR, GPTZero. Universal, training-free. What surprised me: the approach works even on detectors that were trained with adversarial examples baked in. Suggests the discriminator signal is fundamentally narrower than the generator space. Open questions: