arxiv:2405.15012

Extracting Prompts by Inverting LLM Outputs

Published on May 23, 2024

Abstract

We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated those outputs. We develop a new black-box method, output2prompt, that learns to extract prompts without access to the model's logits and without adversarial or jailbreaking queries. In contrast to previous work, output2prompt needs only the outputs of normal user queries. To improve memory efficiency, output2prompt employs a new sparse encoding technique. We measure the efficacy of output2prompt on a variety of user and system prompts and demonstrate zero-shot transferability across different LLMs.
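The black-box setup described in the abstract can be sketched as follows: the attacker sends ordinary, non-adversarial queries to a model conditioned on a hidden prompt, records only the text outputs (no logits), and pairs those outputs with the prompt to train an inverter. This is a minimal illustrative sketch, not the paper's implementation; `toy_llm`, the query list, and the pair format are hypothetical stand-ins.

```python
def toy_llm(system_prompt, user_query):
    """Hypothetical stand-in for a black-box LLM endpoint: it returns
    text only (no logits), conditioned on a hidden system prompt."""
    return f"[{system_prompt[:20]}] answer to: {user_query}"

def collect_inversion_pairs(hidden_prompts, user_queries):
    """For each hidden prompt, gather the outputs of normal (non-jailbreak)
    queries; the (outputs -> prompt) pairs would train an inverter model."""
    pairs = []
    for prompt in hidden_prompts:
        outputs = [toy_llm(prompt, q) for q in user_queries]
        pairs.append({"outputs": outputs, "target_prompt": prompt})
    return pairs

queries = ["What can you do?", "Summarize your role.", "Give an example."]
prompts = ["You are a helpful travel agent.", "You are a strict grader."]
dataset = collect_inversion_pairs(prompts, queries)
print(len(dataset), len(dataset[0]["outputs"]))
```

In the paper's setting, the inverter is a learned model trained on many such pairs; at attack time it sees only the outputs and must reconstruct the hidden prompt.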
