arxiv:2006.12719

Unsupervised Evaluation of Interactive Dialog with DialoGPT

Published on Jun 23, 2020
Authors: Shikib Mehri, Maxine Eskenazi
Abstract

It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset, which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data, and (3) measures fine-grained dialog qualities at both the turn and whole-dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
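The abstract does not spell out the scoring mechanism, but one way a reference-free, unsupervised DialoGPT-based quality score can work is to append hand-written positive and negative follow-up utterances to the dialog context and compare the likelihood DialoGPT assigns to each. The sketch below illustrates this idea using the Hugging Face transformers library; the `microsoft/DialoGPT-large` checkpoint, the follow-up strings, and the `interestingness_score` helper are illustrative assumptions, not necessarily the paper's exact procedure.

```python
# Minimal sketch of a FED-style, reference-free quality score:
# compare DialoGPT's likelihood of a positive vs. a negative follow-up
# utterance after the dialog context. Follow-up strings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
model.eval()

def follow_up_loglik(context: str, follow_up: str) -> float:
    """Average log-likelihood DialoGPT assigns to `follow_up` given `context`."""
    ctx_ids = tokenizer.encode(context + tokenizer.eos_token, return_tensors="pt")
    fu_ids = tokenizer.encode(follow_up + tokenizer.eos_token, return_tensors="pt")
    input_ids = torch.cat([ctx_ids, fu_ids], dim=-1)
    # Mask the context tokens so the loss is computed only over the follow-up.
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[-1]] = -100
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over follow-up tokens
    return -loss.item()

def interestingness_score(context: str) -> float:
    # Hypothetical follow-ups probing an "interesting" turn-level quality.
    positive = "Wow, that is really interesting!"
    negative = "That is really boring."
    return follow_up_loglik(context, positive) - follow_up_loglik(context, negative)

print(interestingness_score("I just got back from a trek across Iceland."))
```

Because the score is derived entirely from a pretrained model's likelihoods, it needs no ground-truth response and no training data, matching properties (1) and (2) claimed in the abstract.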
