arxiv:2406.07070

HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation

Published on Jun 11, 2024

Abstract

Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.
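To make the evaluation setup concrete, below is a minimal sketch of how a judge model such as HalluJudge might be queried for dialogue-level hallucination evaluation, covering the three capabilities the abstract highlights: a verdict, localization of the hallucinated span, and a rationale. The prompt wording, sample field names, and the `judge` callable are illustrative assumptions, not the paper's actual format; the real data schema and evaluation prompts are in the GitHub repository linked above.

```python
# Minimal sketch of dialogue-level hallucination judging (illustrative only).
# The sample schema and prompt wording below are assumptions; see the
# HalluDial repository (https://github.com/FlagOpen/HalluDial) for the
# actual data format and evaluation prompts.

from typing import Callable


def build_judge_prompt(knowledge: str, dialogue_history: str, response: str) -> str:
    """Format one HalluDial-style sample into a judge prompt that asks for
    a hallucination verdict, a localization of the hallucinated span, and
    a rationale."""
    return (
        "You are a hallucination judge for information-seeking dialogues.\n"
        f"Knowledge:\n{knowledge}\n\n"
        f"Dialogue history:\n{dialogue_history}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        "Does the response contain a factuality or faithfulness hallucination? "
        "Answer Yes or No, quote the hallucinated span if any, and explain why."
    )


def judge_sample(judge: Callable[[str], str], sample: dict) -> str:
    """Run one sample through a judge model. `judge` is any callable that
    maps a prompt string to the model's text output (e.g. a wrapper around
    HalluJudge or another LLM)."""
    prompt = build_judge_prompt(
        sample["knowledge"], sample["dialogue_history"], sample["response"]
    )
    return judge(prompt)


if __name__ == "__main__":
    # Stub judge so the sketch runs without a model backend.
    echo_judge = lambda prompt: "No. The response is grounded in the knowledge."
    sample = {
        "knowledge": "The Eiffel Tower is 330 m tall.",
        "dialogue_history": "User: How tall is the Eiffel Tower?",
        "response": "It is about 330 meters tall.",
    }
    print(judge_sample(echo_judge, sample))
```

Separating prompt construction from the judge callable keeps the sketch backend-agnostic: the same sample formatting can be reused whether the verdict comes from HalluJudge or from a general-purpose LLM being meta-evaluated on the benchmark.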
