Machine Translation Evaluation: N-grams to the Rescue
Kishore Papineni (IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A.)
Human judges weigh many subtle aspects of translation quality, but human evaluations are very expensive. Developers of machine translation systems need to evaluate quality constantly, so automatic methods that approximate human judgment are very useful. The main difficulty in automatic evaluation is that there are many correct translations that differ in choice and order of words; there is no single gold standard to compare a translation against. The closer a machine translation is to professional human translations, the better it is. We borrow precision and recall concepts from Information Retrieval to measure this closeness. The precision measure is applied to variable-length n-grams: unigram matches between the machine translation and the professional reference translations account for adequacy, while longer n-gram matches account for fluency. The n-gram precisions are aggregated across sentences and averaged, and a multiplicative brevity penalty prevents cheating with overly short candidates. The resulting metric correlates highly with human judgments of translation quality, and we test it for robustness across language families and across the spectrum of translation quality. We discuss BLEU, an automatic method to evaluate translation quality that is cheap, fast, and good.