arxiv:2406.07070

HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation

Published on Jun 11, 2024

Abstract

Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.
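To make the evaluation setup concrete, below is a minimal sketch of how a judge model such as HalluJudge might be queried for dialogue-level hallucination evaluation, covering the three capabilities the abstract highlights: a verdict, localization of the hallucinated span, and a rationale. The prompt wording, sample field names, and the `judge` callable are illustrative assumptions, not the paper's actual format; the real data schema and evaluation prompts are in the GitHub repository linked above.

```python
# Minimal sketch of dialogue-level hallucination judging (illustrative only).
# The sample schema and prompt wording below are assumptions; see the
# HalluDial repository (https://github.com/FlagOpen/HalluDial) for the
# actual data format and evaluation prompts.

from typing import Callable


def build_judge_prompt(knowledge: str, dialogue_history: str, response: str) -> str:
    """Format one HalluDial-style sample into a judge prompt that asks for
    a hallucination verdict, a localization of the hallucinated span, and
    a rationale."""
    return (
        "You are a hallucination judge for information-seeking dialogues.\n"
        f"Knowledge:\n{knowledge}\n\n"
        f"Dialogue history:\n{dialogue_history}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        "Does the response contain a factuality or faithfulness hallucination? "
        "Answer Yes or No, quote the hallucinated span if any, and explain why."
    )


def judge_sample(judge: Callable[[str], str], sample: dict) -> str:
    """Run one sample through a judge model. `judge` is any callable that
    maps a prompt string to the model's text output (e.g. a wrapper around
    HalluJudge or another LLM)."""
    prompt = build_judge_prompt(
        sample["knowledge"], sample["dialogue_history"], sample["response"]
    )
    return judge(prompt)


if __name__ == "__main__":
    # Stub judge so the sketch runs without a model backend.
    echo_judge = lambda prompt: "No. The response is grounded in the knowledge."
    sample = {
        "knowledge": "The Eiffel Tower is 330 m tall.",
        "dialogue_history": "User: How tall is the Eiffel Tower?",
        "response": "It is about 330 meters tall.",
    }
    print(judge_sample(echo_judge, sample))
```

Separating prompt construction from the judge callable keeps the sketch backend-agnostic: the same sample formatting can be reused whether the verdict comes from HalluJudge or from a general-purpose LLM being meta-evaluated on the benchmark.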
