Scaling the ISLE Framework: Use of Existing Corpus Resources for Validation of MT Evaluation Metrics across Languages


Michelle Vanni (U.S. Department of Defense, Fort Meade, MD 20755, USA)

Keith Miller (The MITRE Corporation, 7515 Colshire Drive, McLean, VA 22102-7508, USA)


EO4: MT Evaluation


This paper describes the next step in a machine translation (MT) evaluation (MTE) research program previously reported at MT Summit 2001. The development of this evaluation methodology has benefited from the availability of two collections of source language texts, together with the output produced by processing these texts with several commercial off-the-shelf (COTS) MT engines (DARPA 1994; Doyon, Taylor, & White 1999). The crucial characteristic of this methodology is the systematic development of a predictive relationship between discrete, well-defined metrics (a set of quality test scores) and the specific information processing tasks that can be reliably performed with the output of a given MT system. The intended outcomes are (1) a system for classifying MT output in terms of the information processing functions it can serve and (2) an indicator of research and development directions for MT designed to serve a specific information processing function.

Unlike the tests used in initial experiments on automated scoring to compare MT output with human-produced text (Jones and Rusk 2000), our method employs traditional measures of MT output quality, selected from the framework put forth by International Standards for Language Engineering (ISLE). These measures cover coherence, clarity, syntax, morphology, and general and domain-specific lexical robustness, explicitly including the translation of named entities.

Each test was originally evaluated, refined, and validated on MT output (from the 1994 DARPA MTE program) produced by three Spanish-to-English systems given a single input text. By contrast, the material used in the present work is taken from the MT Scale evaluation research program and consists of output produced by Japanese-to-English MT systems. Since Spanish and Japanese differ structurally on the morphological, syntactic, and discourse levels, among others, a comparison of scores on tests measuring these output qualities is expected to reveal how structural similarity, such as that between Spanish and English, and structural contrast, such as that between Japanese and English, affect the linguistic distinctions that MT systems must accommodate. Further, it is shown that the metrics developed using Spanish-English MT output are equally effective when applied to Japanese-English MT output.

The principal measures include coherence, clarity, and measures of syntax, morphology, and lexical coverage. The coherence metric draws on Mann and Thompson's Rhetorical Structure Theory (RST) (1981) and is based on impressions of the overall dynamic of a discourse. For the Spanish-English evaluation, the sentence was used as the unit of evaluation. Applying this test to Japanese texts was complicated by the fact that sentence boundaries are sometimes unclear in our sample Japanese texts. Clarity is measured on a four-point scale and differs from the coherence metric in that the sentence being evaluated does not need to make sense with respect to the rest of the discourse. Nor does the sentence have to be grammatically well-formed, as this feature of the output is measured separately by the syntax metric. Scores for clarity have been shown to covary with intuitive judgements of output quality. Scores for syntax are based on the minimal number of corrections needed to render a sentence grammatical; likewise, the morphology scores are based on the rate of strictly morphological errors present in the output text. Two complementary measures of lexical coverage and correctness have been developed and validated: one is concerned primarily with general and domain-specific lexical coverage, and the other with the handling of named entities. The latter is believed to be crucial in determining the suitability of MT output for use in downstream information extraction tasks.
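
As an illustration only, the sentence-level clarity, syntax, and morphology measures described above might be aggregated into text-level scores along the following lines. This is a minimal sketch, not the authors' scoring procedure: the record layout (`SentenceJudgment`), the 0 to 3 coding of the four-point clarity scale, and the choice of normalizers (per sentence for syntax, per token for morphology) are all assumptions introduced here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SentenceJudgment:
    """Hypothetical record of one evaluator's judgments for one output sentence."""
    tokens: int              # number of word tokens in the MT output sentence
    clarity: int             # four-point clarity rating (assumed coded 0..3)
    syntax_corrections: int  # minimal corrections needed to make it grammatical
    morph_errors: int        # count of strictly morphological errors


def score_text(judgments):
    """Aggregate sentence-level judgments into text-level scores.

    The normalizers are assumptions: the abstract does not specify them.
    """
    if not judgments:
        raise ValueError("at least one sentence judgment is required")
    for j in judgments:
        if not 0 <= j.clarity <= 3:
            raise ValueError("clarity must lie on the assumed 0..3 scale")
    total_tokens = sum(j.tokens for j in judgments)
    if total_tokens == 0:
        raise ValueError("the output text contains no tokens")
    return {
        # Mean clarity rating over all sentences.
        "clarity": mean(j.clarity for j in judgments),
        # Mean number of minimal grammatical corrections per sentence.
        "syntax_corrections_per_sentence": mean(
            j.syntax_corrections for j in judgments
        ),
        # Morphological errors per output token.
        "morph_error_rate": sum(j.morph_errors for j in judgments) / total_tokens,
    }
```

Keeping the three scores separate, rather than combining them into one number, mirrors the point made above that clarity is judged independently of grammatical well-formedness.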

The research described in this paper, namely the validation of the MT evaluation metrics on Japanese data, provides a basis for correlating these metrics with independently derived measures of the usefulness of the output texts for downstream information processing tasks (Doyon, Taylor, & White 1999).
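
One way such a correlation could be computed is sketched below. The abstract does not name a correlation statistic; Spearman's rank correlation is used here purely as an assumption, since it only requires that metric scores and task-usefulness measures be orderable. The function names and any inputs passed to them are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(metric_scores, task_scores):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(metric_scores) != len(task_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences of at least two items")
    return pearson(average_ranks(metric_scores), average_ranks(task_scores))
```

In use, `metric_scores` would hold one score per MT output text for a given metric, and `task_scores` the corresponding independently derived usefulness measure for the same texts.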


Machine translation, Machine translation evaluation, Task-based metrics, Automated metrics

Full Paper