Extracting Prompts by Inverting LLM Outputs
Abstract
We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated them. We develop a new black-box method, output2prompt, that learns to extract prompts without access to the model's logits and without adversarial or jailbreaking queries. In contrast to previous work, output2prompt needs only the outputs of normal user queries. To improve memory efficiency, output2prompt employs a new sparse encoding technique. We measure the efficacy of output2prompt on a variety of user and system prompts and demonstrate zero-shot transferability across different LLMs.
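To make the inversion setup concrete, here is a minimal sketch of the core idea as we read it from the abstract: an encoder-decoder inverter is trained to map a set of LLM outputs back to the hidden prompt, and the sparse encoding is approximated by encoding each output independently so the decoder cross-attends to the concatenated hidden states. The choice of a T5-style model (t5-small), the truncation length, and the overall structure are our assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of output-to-prompt inversion with a T5-style
# inverter; the model choice and hyperparameters are assumptions, not
# the paper's released code.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Outputs sampled from the target LLM under some hidden prompt
# (illustrative strings, not real data from the paper).
llm_outputs = [
    "Sure! Here are three tips for writing a strong resume...",
    "Of course. A good resume should highlight relevant skills...",
]

# "Sparse" encoding, as we interpret it: encode each output on its own,
# so attention never spans two outputs and memory grows linearly with
# the number of outputs rather than quadratically with their total length.
states = []
for text in llm_outputs:
    enc = tok(text, return_tensors="pt", truncation=True, max_length=64)
    states.append(model.encoder(**enc).last_hidden_state)  # (1, len, d)

# The decoder cross-attends to the concatenation of all per-output states.
joint = BaseModelOutput(last_hidden_state=torch.cat(states, dim=1))

# At inference time (after training the inverter on output/prompt pairs),
# the predicted prompt is a generation conditioned on these encoder states.
pred = model.generate(encoder_outputs=joint, max_new_tokens=32)
print(tok.batch_decode(pred, skip_special_tokens=True))
```

Training such an inverter would minimize standard cross-entropy on the prompt tokens given these encoder states; the untrained t5-small above will not recover real prompts and only illustrates the data flow.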