arXiv:2402.01349

Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models

Published on Feb 2, 2024
Abstract

In the field of natural language processing (NLP), Large Language Models (LLMs) have precipitated a paradigm shift, markedly enhancing performance in natural language generation tasks. Despite these advancements, the comprehensive evaluation of LLMs remains a formidable challenge for the community. Recently, the use of Multiple Choice Question Answering (MCQA) as a benchmark for LLMs has gained considerable traction. This study first investigates the limitations of MCQA as an evaluation method for LLMs and then analyzes the fundamental reason behind those limitations: even when an LLM selects the correct answer, it may also regard other, incorrect options as correct. Finally, we propose MCQA+, a dataset augmentation method for Multiple-Choice Questions (MCQs) that more accurately reflects model performance, underscoring the need for more robust mechanisms for evaluating LLMs.
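
As a rough illustration of the underlying concern, the sketch below permutes the option order of an MCQ and credits a model only if its choice of the gold answer is stable across every ordering. This is an assumed, simplified stand-in for the kind of augmentation the abstract describes, not the paper's actual MCQA+ procedure, and `query_model` is a hypothetical placeholder for any LLM inference call.

```python
# Illustrative sketch: option-permutation augmentation for MCQA evaluation.
# Not the paper's MCQA+ method; `query_model` is a hypothetical stand-in
# for an LLM call that returns an option label such as "A" or "B".
import itertools
import string


def augment_question(stem, options, answer_idx):
    """Yield (stem, reordered options, new gold index) for every
    permutation of the original option order."""
    for perm in itertools.permutations(range(len(options))):
        reordered = [options[i] for i in perm]
        yield stem, reordered, perm.index(answer_idx)


def consistent_accuracy(stem, options, answer_idx, query_model):
    """Return 1 only if the model picks the gold answer under every
    option ordering; a model that merely prefers a position fails."""
    for q_stem, opts, gold in augment_question(stem, options, answer_idx):
        labels = string.ascii_uppercase[: len(opts)]
        prompt = q_stem + "\n" + "\n".join(
            f"{lab}. {opt}" for lab, opt in zip(labels, opts)
        )
        if query_model(prompt) != labels[gold]:
            return 0
    return 1
```

Scoring per question in this all-orderings fashion (rather than averaging over a single fixed ordering) is one simple way to expose models that select the correct option without reliably rejecting the incorrect ones.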
