Extracting Prompts by Inverting LLM Outputs
Abstract
We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated them. We develop a new black-box method, output2prompt, that learns to extract prompts without access to the model's logits and without adversarial or jailbreaking queries. In contrast to previous work, output2prompt needs only the outputs of normal user queries. To improve memory efficiency, output2prompt employs a new sparse encoding technique. We measure the efficacy of output2prompt on a variety of user and system prompts and demonstrate zero-shot transferability across different LLMs.
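To make the inversion setup concrete, here is a minimal sketch of the core idea as we read it from the abstract: an encoder-decoder inverter is trained to map a set of LLM outputs back to the hidden prompt, and the sparse encoding is approximated by encoding each output independently so the decoder cross-attends to the concatenated hidden states. The choice of a T5-style model (t5-small), the truncation length, and the overall structure are our assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of output-to-prompt inversion with a T5-style
# inverter; the model choice and hyperparameters are assumptions, not
# the paper's released code.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Outputs sampled from the target LLM under some hidden prompt
# (illustrative strings, not real data from the paper).
llm_outputs = [
    "Sure! Here are three tips for writing a strong resume...",
    "Of course. A good resume should highlight relevant skills...",
]

# "Sparse" encoding, as we interpret it: encode each output on its own,
# so attention never spans two outputs and memory grows linearly with
# the number of outputs rather than quadratically with their total length.
states = []
for text in llm_outputs:
    enc = tok(text, return_tensors="pt", truncation=True, max_length=64)
    states.append(model.encoder(**enc).last_hidden_state)  # (1, len, d)

# The decoder cross-attends to the concatenation of all per-output states.
joint = BaseModelOutput(last_hidden_state=torch.cat(states, dim=1))

# At inference time (after training the inverter on output/prompt pairs),
# the predicted prompt is a generation conditioned on these encoder states.
pred = model.generate(encoder_outputs=joint, max_new_tokens=32)
print(tok.batch_decode(pred, skip_special_tokens=True))
```

Training such an inverter would minimize standard cross-entropy on the prompt tokens given these encoder states; the untrained t5-small above will not recover real prompts and only illustrates the data flow.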