On the Learnability of Watermarks for Language Models
Abstract
Watermarking of language model outputs enables statistical detection of model-generated text, which has many applications in the responsible deployment of language models. Existing watermarking strategies operate by altering the decoder of an existing language model, and the ability for a language model to directly learn to generate the watermark would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, allowing open models to benefit from watermarking. Second, if watermarking is used to determine the provenance of generated text, an adversary can hurt the reputation of a victim model by spoofing its watermark and generating damaging watermarked text. To investigate the learnability of watermarks, we propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking. We test our approach on three distinct decoding-based watermarking strategies and various hyperparameter settings, finding that models can learn to generate watermarked text with high detectability. We also find limitations to learnability, including the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.
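To make the watermark distillation idea concrete, below is a minimal, illustrative sketch (not the paper's exact code). It assumes a KGW-style "green list" decoding watermark applied to a teacher model's logits, with the student trained to match the watermarked teacher distribution via KL divergence; the function names and hyperparameters (`gamma`, `delta`) are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def watermark_logits(logits: torch.Tensor, prev_token: int,
                     gamma: float = 0.25, delta: float = 2.0) -> torch.Tensor:
    """Bias a pseudorandom 'green' subset of the vocabulary, seeded by the
    previous token, as in decoding-based (KGW-style) watermarking."""
    vocab_size = logits.size(-1)
    gen = torch.Generator().manual_seed(prev_token)
    green = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green.to(logits.device)] += delta
    return biased

def distill_step(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                 tokens: torch.Tensor) -> torch.Tensor:
    """Distillation loss for one sequence: KL(watermarked teacher || student),
    averaged over positions (position t is seeded by token t-1)."""
    losses = []
    for t in range(1, tokens.size(0)):
        target = watermark_logits(teacher_logits[t], int(tokens[t - 1]))
        log_p_student = F.log_softmax(student_logits[t], dim=-1)
        log_p_teacher = F.log_softmax(target, dim=-1)
        losses.append(F.kl_div(log_p_student, log_p_teacher,
                               log_target=True, reduction="sum"))
    return torch.stack(losses).mean()
```

The key design point this sketch illustrates is that the student sees only ordinary next-token supervision targets; the watermarking rule is applied on the teacher side, so a successfully distilled student generates watermarked text with its unmodified decoder.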