Adaptively-tuned Llama-3.2-3B Paraphraser
This model is an adaptively fine-tuned version of Llama-3.2-3B-Instruct, optimized to evade the Unigram watermarking method while preserving text quality. It acts as a paraphraser: it keeps the semantic content of the input while altering the statistical patterns that watermark detectors rely on.
Model Details
Model Description
This model is a fine-tuned version of Llama-3.2-3B-Instruct that has been optimized with Direct Preference Optimization (DPO) to evade the Unigram watermarking method described in Zhao et al. (2023). The model preserves text quality while modifying the statistical patterns that watermarking methods rely on for detection.
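DPO trains the policy to prefer watermark-evading paraphrases over detectable ones without a separate reward model. The training code is not part of this card, but a minimal sketch of the DPO loss for one preference pair, assuming per-sequence log-probabilities are already available (the function name and inputs here are illustrative, not the authors' implementation):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward margin of the policy relative to the frozen reference model
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: the loss shrinks as the policy
    # assigns more probability to the preferred (evasive) paraphrase
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree (zero margin) the loss is log 2, and it decreases monotonically as the preferred paraphrase gains probability mass.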
- Model type: Decoder-only transformer language model
- Language(s): English
- Finetuned from model: meta-llama/Llama-3.2-3B-Instruct
Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Load the LoRA adapter
model = PeftModel.from_pretrained(model, "DDiaa/WM-Removal-Unigram-Llama-3.2-3B")

# Prepare the prompt
system_prompt = (
    "You are an expert copy-editor. Please rewrite the following text in your own voice and paraphrase all "
    "sentences.\n Ensure that the final output contains the same information as the original text and has "
    "roughly the same length.\n Do not leave out any important details when rewriting in your own voice. Do "
    "not include any information that is not present in the original text. Do not respond with a greeting or "
    "any other extraneous information. Skip the preamble. Just rewrite the text directly."
)

def paraphrase_text(text):
    # Build the chat prompt and open the paraphrase marker
    prompt = tokenizer.apply_chat_template(
        [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"\n[[START OF TEXT]]\n{text}\n[[END OF TEXT]]"},
        ],
        tokenize=False,
        add_generation_prompt=True,
    ) + "[[START OF PARAPHRASE]]\n"

    # Generate the paraphrase
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=1.0,
        do_sample=True,
        # Fall back to EOS if the tokenizer defines no pad token (Llama tokenizers often don't)
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )

    # Keep only the text between the paraphrase markers
    paraphrased = tokenizer.decode(outputs[0], skip_special_tokens=True)
    paraphrased = paraphrased.split("[[START OF PARAPHRASE]]")[1].split("[[END OF")[0].strip()
    return paraphrased
```
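The delimiter protocol in the generation code can be exercised without loading a model. A minimal sketch of the post-processing step in isolation, assuming a decoded string that contains the markers:

```python
def extract_paraphrase(decoded: str) -> str:
    # Keep only the text between the paraphrase markers, mirroring
    # the post-processing inside paraphrase_text above
    return decoded.split("[[START OF PARAPHRASE]]")[1].split("[[END OF")[0].strip()
```

Splitting on the truncated `"[[END OF"` prefix handles both a complete end marker and a generation that is cut off mid-marker by the token limit.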
Uses
Direct Use
The model is designed for research purposes to:
- Study the robustness of watermarking methods
- Evaluate the effectiveness of adaptive attacks against content watermarks
- Test and develop improved watermarking techniques
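For context on what the paraphraser is evading: the Unigram scheme of Zhao et al. (2023) uses a single fixed "green list" of tokens, and detection counts green tokens against their expected rate under unwatermarked text. A minimal sketch of that z-score test, assuming token IDs and a green set are given (this is a simplified illustration, not the reference detector):

```python
import math

def unigram_zscore(tokens, green_set, gamma=0.5):
    # Count tokens that fall in the fixed global green list
    green_hits = sum(1 for t in tokens if t in green_set)
    n = len(tokens)
    # z-score against the null hypothesis that each token is
    # green independently with probability gamma
    return (green_hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A successful paraphrase attack drives this score back toward zero by replacing green tokens with semantically equivalent non-green ones.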
Downstream Use
The model can be integrated into:
- Watermark robustness evaluation pipelines
- Research frameworks studying language model security
- Benchmark suites for watermarking methods
Out-of-Scope Use
This model should not be used for:
- Production environments requiring watermark compliance
- Generating deceptive or misleading content
- Evading legitimate content attribution systems
- Any malicious purposes that could harm individuals or society
Bias, Risks, and Limitations
- The model inherits biases from the base Llama-3.2-3B-Instruct model
- Performance varies based on text length and complexity
- Evasion capabilities may be reduced against newer watermarking methods
- May occasionally produce lower quality outputs compared to the base model
- Limited to English language texts
Recommendations
- Use only for research and evaluation purposes
- Always maintain proper content attribution
- Monitor output quality metrics
- Consider ethical implications when studying security measures
- Use in conjunction with other evaluation methods
Citation
BibTeX:
```bibtex
@article{diaa2024optimizing,
  title={Optimizing adaptive attacks against content watermarks for language models},
  author={Diaa, Abdulrahman and Aremu, Toluwani and Lukas, Nils},
  journal={arXiv preprint arXiv:2410.02440},
  year={2024}
}
```
Model Card Contact
For questions about this model, please file an issue on the GitHub repository: https://github.com/ML-Watermarking/ada-llm-wm