Unsupervised Evaluation of Interactive Dialog with DialoGPT
Abstract
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric that uses DialoGPT without any fine-tuning or supervision. It also introduces the FED dataset, constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data, and (3) measures fine-grained dialog qualities at both the turn and whole-dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
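To make the idea concrete, below is a minimal sketch of FED-style turn scoring with an off-the-shelf DialoGPT model and no fine-tuning: the dialog context is scored by how strongly the model prefers positive follow-up utterances over negative ones for a given quality. The specific follow-up utterances, the quality name, and the helper functions are illustrative assumptions, not the exact set or code used in the paper.

```python
# Sketch: FED-style scoring of a dialog turn with DialoGPT (no fine-tuning).
# Follow-up utterances below are illustrative placeholders, not the paper's exact set.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
model.eval()


def followup_log_likelihood(context: str, followup: str) -> float:
    """Log-likelihood DialoGPT assigns to `followup` as the next turn after `context`."""
    context_ids = tokenizer.encode(context + tokenizer.eos_token, return_tensors="pt")
    followup_ids = tokenizer.encode(followup + tokenizer.eos_token, return_tensors="pt")
    input_ids = torch.cat([context_ids, followup_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Each position predicts the next token; score only the follow-up tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    per_token = log_probs[torch.arange(targets.size(0)), targets]
    return per_token[-followup_ids.size(-1):].sum().item()


def fed_style_score(context: str, positive: list[str], negative: list[str]) -> float:
    """Higher when DialoGPT prefers positive follow-ups over negative ones."""
    pos = sum(followup_log_likelihood(context, u) for u in positive) / len(positive)
    neg = sum(followup_log_likelihood(context, u) for u in negative) / len(negative)
    return pos - neg


# Illustrative usage for a hypothetical "interesting" turn-level quality.
context = "I went hiking in the Alps last weekend and saw a family of ibex."
score = fed_style_score(
    context,
    positive=["Wow, that is really interesting!"],
    negative=["That is not very interesting."],
)
print(f"interestingness score: {score:.3f}")
```

Because the score is computed entirely from the pretrained model's next-turn likelihoods, this kind of evaluation needs neither a ground-truth response nor any annotated training data, matching properties (1) and (2) claimed in the abstract.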