Building a Paraphrase Corpus for Speech Translation


Mitsuo Shimohata (1, 2), Eiichiro Sumita (1), Yuji Matsumoto (2)

(1) ATR Spoken Language Translation Research Laboratories; (2) Graduate School of Information Science, Nara Institute of Science and Technology




When a machine translation (MT) system receives spoken-language input, two types of sentences are difficult to translate: (1) long sentences and (2) sentences containing the redundant expressions often seen in spoken language. To reduce these difficulties, we are developing methods to paraphrase input sentences into more translatable ones. In this paper, we report on a preliminary Japanese paraphrase corpus. The corpus consists of original sentences derived from travel conversations and versions of them paraphrased by humans. We use three paraphrasing methods: plain, segment, and summary paraphrasing. Plain paraphrasing is applied to short sentences, replacing redundant expressions with plain ones. Segment and summary paraphrasing are applied to long sentences, converting each into one or several short sentences. We also compare machine translation quality between the original sentences and the paraphrased sentences, using two corpus-based machine translation systems in the experiment.


Corpus, Machine translation, Speech translation, Paraphrase

Language(s) Japanese