Scaling the ISLE Framework: Use of Existing Corpus Resources for Validation of MT Evaluation Metrics across Languages
Michelle Vanni (U.S. Department of Defense Fort Meade, MD 20755 USA)
Keith Miller (The MITRE Corporation 7515 Colshire Drive McLean, VA 22102-7508 USA )
EO4: MT Evaluation
This paper describes the next step in a machine translation (MT) evaluation (MTE) research program previously reported on at MT Summit 2001. The development of this evaluation methodology has benefited from the availability of two collections of source-language texts and the results of processing these texts with several commercial off-the-shelf (COTS) MT engines (DARPA 1994; Doyon, Taylor, & White 1999). The crucial characteristic of this methodology is the systematic development of a predictive relationship between discrete, well-defined metrics (a set of quality test scores) and specific information processing tasks that can be reliably performed with the output of a given MT system. One might view the intended outcomes as (1) a system for classifying MT output in terms of the information processing functions it can serve and (2) an indicator of research and development directions for MT systems designed to serve a specific information processing function.
Unlike the tests used in initial experiments on automated scoring, which compare MT output with human-produced text (Jones and Rusk 2000), our method employs traditional measures of MT output quality, selected from the framework put forth by the International Standards for Language Engineering (ISLE) project. These measures cover coherence, clarity, syntax, morphology, and general and domain-specific lexical robustness, explicitly including the translation of named entities.
The coherence metric draws on Mann and Thompson's (1988) Rhetorical Structure Theory (RST) and is based on impressions of the overall dynamic of a discourse. For the Spanish-English evaluation, the sentence was used as the unit of evaluation. Applying this test to Japanese texts was complicated by the fact that sentence boundaries are sometimes unclear in our sample Japanese texts.
Clarity is measured on a four-point scale and is differentiated from the coherence metric in that the sentence being evaluated need not make sense with respect to the rest of the discourse. Nor need the sentence be grammatically well-formed, since that feature of the output is measured separately by the syntax metric.
Scores for clarity have been shown to covary with intuitive judgements of output quality. Scores for syntax are based on the minimal number
of corrections needed to render a sentence grammatical; likewise, the morphology scores are based on the rate of strictly morphological
errors present in the output text. Two complementary measures of lexical coverage and correctness have been developed and validated:
one concerns itself primarily with general and domain-specific lexical coverage, and the other with the handling of named entities. The
latter is believed to be crucial in determining the suitability of MT output for use in downstream information extraction tasks.
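As a rough illustration of how two of these measures could be operationalized, the sketch below scores syntax as the minimal number of word-level corrections needed to match a grammatically post-edited reference, and named-entity handling as simple coverage of reference entities in the output. The function names, the word-level edit operations, and the verbatim-match heuristic for entities are illustrative simplifying assumptions, not the exact procedures used in the evaluation.

```python
def syntax_score(output_tokens, corrected_tokens):
    """Minimal number of word-level corrections (insertions, deletions,
    substitutions) needed to turn the MT output into its grammatically
    corrected form -- a standard Levenshtein distance over tokens."""
    m, n = len(output_tokens), len(corrected_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if output_tokens[i - 1] == corrected_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]


def named_entity_coverage(reference_entities, output_text):
    """Fraction of reference named entities that appear verbatim in the
    MT output -- a crude proxy for the named-entity handling metric."""
    if not reference_entities:
        return 1.0
    found = sum(1 for entity in reference_entities if entity in output_text)
    return found / len(reference_entities)
```

In practice, entity matching would need to tolerate transliteration variants and partial matches, which is precisely why the named-entity measure was developed and validated separately from general lexical coverage.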
Machine translation, machine translation evaluation, task-based metrics, automated metrics