Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation
Abstract
The predictions of question answering (QA) systems are typically evaluated against manually annotated finite sets of one or more answers. This leads to a coverage limitation that results in underestimating the true performance of systems, and is typically addressed by extending over exact match (EM) with pre-defined rules or with the token-level F1 measure. In this paper, we present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures. To this end, we define the asymmetric notion of answer equivalence (AE), accepting answers that are equivalent to or improve over the reference, and publish over 23k human judgments for candidates produced by multiple QA systems on SQuAD. Through a careful analysis of this data, we reveal and quantify several concrete limitations of the F1 measure, such as a false impression of graduality, or missing dependence on the question. Since collecting AE annotations for each evaluated model is expensive, we learn a BERT matching (BEM) measure to approximate this task. Being a simpler task than QA, we find BEM to provide significantly better AE approximations than F1, and to more accurately reflect the performance of systems. Finally, we demonstrate the practical utility of AE and BEM on the concrete application of minimal accurate prediction sets, reducing the number of required answers by up to 2.6x.
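To make the contrast concrete, below is a minimal Python sketch of the two evaluation styles discussed in the abstract: the standard token-level F1 over candidate and reference answers, and a BEM-style classifier that also conditions on the question. The checkpoint name (`bert-base-uncased`), the input packing, and the `bem_score` helper are illustrative assumptions, not the authors' released model; a real BEM scorer would be fine-tuned on answer-equivalence annotations such as those published with this paper.

```python
# Minimal sketch, assuming a generic BERT checkpoint; not the authors' released BEM model.
from collections import Counter

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def token_f1(candidate: str, reference: str) -> float:
    """Token-level F1 between candidate and reference (omitting SQuAD's answer normalization)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(cand_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Placeholder checkpoint: in practice this classifier would be fine-tuned on
# answer-equivalence (AE) judgments; "bert-base-uncased" is only an assumption here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()


def bem_score(question: str, reference: str, candidate: str) -> float:
    """Probability that the candidate is equivalent to (or improves on) the reference.

    The packing of question, reference, and candidate into two segments is an
    illustrative choice, not necessarily the paper's exact input format.
    """
    inputs = tokenizer(
        f"{question} {tokenizer.sep_token} {reference}",
        candidate,
        return_tensors="pt",
        truncation=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()


if __name__ == "__main__":
    # A paraphrase with zero token overlap scores 0.0 under F1, while a learned
    # equivalence measure can accept it (meaningful only after fine-tuning on AE data).
    print(token_f1("the Big Apple", "New York City"))
    print(bem_score("What city hosts the UN headquarters?", "New York City", "the Big Apple"))
```

The usage example at the bottom illustrates the coverage limitation described above: a candidate with no tokens in common with the reference is scored 0.0 by F1 even when it is a valid answer, whereas a classifier conditioned on the question and both answers can, once trained on AE judgments, accept such equivalents.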