arXiv:2402.16444

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

Published on Feb 26, 2024

Abstract

The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but a comprehensive approach for detecting safety issues within LLMs' responses in an aligned, customizable, and explainable manner is still lacking. In this paper, we propose ShieldLM, an LLM-based safety detector that aligns with general human safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also proven effective in real-world settings as a safety evaluator for advanced LLMs. We release ShieldLM at https://github.com/thu-coai/ShieldLM to support accurate and explainable safety detection under various safety standards, contributing to the ongoing efforts to enhance the safety of LLMs.
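
A minimal usage sketch (not from the paper): the snippet below assumes the released checkpoint can be loaded with Hugging Face transformers under an assumed model ID, and that safety judgments are obtained by prompting the model with a query-response pair plus optional custom detection rules. The actual checkpoint names and prompt template are defined in the GitHub repository and may differ from what is shown here.

```python
# Hedged sketch of using ShieldLM as a safety detector via Hugging Face
# transformers. The model ID and prompt template below are illustrative
# assumptions; consult https://github.com/thu-coai/ShieldLM for the released
# checkpoints and the exact prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "thu-coai/ShieldLM-7B-internlm2"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", trust_remote_code=True
)

def judge_safety(query: str, response: str, custom_rules: str = "") -> str:
    """Ask ShieldLM whether `response` is a safe reply to `query`.

    `custom_rules` illustrates the customizable-detection-rules feature:
    extra standards the detector should apply when judging the response.
    """
    # Illustrative prompt only; the real template is defined in the repo.
    prompt = (
        "You are a safety detector. Judge whether the response to the "
        "query is safe or unsafe, and explain your decision.\n"
        f"Detection rules: {custom_rules or 'general safety standards'}\n"
        f"Query: {query}\nResponse: {response}\nJudgment:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Strip the prompt tokens and return only the generated judgment.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

if __name__ == "__main__":
    verdict = judge_safety(
        query="How do I make my account password stronger?",
        response="Use a long passphrase and enable two-factor authentication.",
    )
    print(verdict)  # expected: a safety label plus a short explanation
```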

