Title Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?
Author(s) Ying Zhang, Stephan Vogel, Alex Waibel

Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Session P25-EW
Abstract Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST metric, are becoming increasingly important in MT. Yet, their behaviors are not fully understood. In this paper, we analyze some flaws in the BLEU/NIST metrics. With a better understanding of these problems, we can better interpret the reported BLEU/NIST scores. In addition, this paper reports a novel method of calculating the confidence intervals for BLEU/NIST scores using bootstrapping. With this method, we can determine whether two MT systems are significantly different from each other.
Keyword(s) Automatic evaluation, BLEU, NIST, confidence intervals, bootstrapping
Language(s) N/A
Full Paper 755.pdf
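
The abstract describes estimating confidence intervals for BLEU/NIST scores by bootstrapping the test set. The following is a minimal sketch of a percentile-bootstrap interval for a corpus-level MT metric, not the paper's actual implementation; `metric_fn`, `bootstrap_ci`, and the parameter defaults are illustrative placeholders for any scorer (e.g., BLEU or NIST) that maps a list of segment-level (hypothesis, reference) pairs to one corpus-level score.

```python
# Sketch only: percentile-bootstrap confidence interval for a corpus-level
# MT metric such as BLEU. `metric_fn` is a placeholder for any function that
# scores a whole (resampled) test set; it is not defined by the paper.

import random
from typing import Callable, List, Tuple


def bootstrap_ci(
    pairs: List[Tuple[str, str]],                           # (hypothesis, reference) per segment
    metric_fn: Callable[[List[Tuple[str, str]]], float],    # corpus-level scorer, e.g. BLEU
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Resample test-set segments with replacement and return the (1 - alpha) interval."""
    rng = random.Random(seed)
    n = len(pairs)
    scores = []
    for _ in range(n_resamples):
        # Draw a bootstrap replicate of the test set (same size, with replacement).
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        scores.append(metric_fn(sample))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

On one common reading of the abstract's claim, two systems scored on the same test set can be called significantly different when their bootstrap intervals do not overlap; a paired bootstrap over the score difference is a more direct variant of the same idea.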