# UTAustin-AIHealth

Welcome to **UTAustin-AIHealth** – a hub dedicated to advancing research in medical AI. This repo contains the **MedHallu** dataset, which underpins our recent work:

**MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models**

MedHallu is a rigorously designed benchmark for evaluating large language models' ability to detect hallucinations in medical question-answering tasks. The dataset is organized into two distinct splits:

- **pqa_labeled:** Contains 1,000 high-quality, human-annotated samples derived from PubMedQA.
- **pqa_artificial:** Contains 9,000 samples generated via an automated pipeline from PubMedQA.

---

## Setup Environment

To work with the MedHallu dataset, install the Hugging Face `datasets` library using pip:

```bash
pip install datasets
```

## How to Use MedHallu

**Downloading the Dataset:**

```python
from datasets import load_dataset

# Load the 'pqa_labeled' split: 1,000 high-quality, human-annotated samples.
medhallu_labeled = load_dataset("UTAustin-AIHealth/MedHallu", "pqa_labeled")

# Load the 'pqa_artificial' split: 9,000 samples generated via an automated pipeline.
medhallu_artificial = load_dataset("UTAustin-AIHealth/MedHallu", "pqa_artificial")
```

---

## License

This dataset and associated resources are distributed under the [MIT License](https://opensource.org/license/mit/).
## Citations

If you find MedHallu useful in your research, please consider citing our work:

```bibtex
@misc{pandit2025medhallucomprehensivebenchmarkdetecting,
      title={MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models},
      author={Shrey Pandit and Jiawei Xu and Junyuan Hong and Zhangyang Wang and Tianlong Chen and Kaidi Xu and Ying Ding},
      year={2025},
      eprint={2502.14302},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14302},
}
```

## Contact

For further information or inquiries about MedHallu, please reach out at shreypandit@utexas.edu.