Automatic Alignment of Japanese and English Newspaper Articles using an MT System and a Bilingual Company Name Dictionary


Kenji Matsumoto (ATR Spoken Language Translation Research Laboratory)

Hideki Tanaka (ATR Spoken Language Translation Research Laboratory)


WP1: Corpora & Corpus Tools


One of the crucial parts of any corpus-based machine translation system is a large-scale bilingual corpus that is aligned at various levels, such as the sentence and phrase levels. Such a corpus, however, is not easy to obtain, and accordingly there is a great need for an efficient construction method. We approach this problem by integrating two large monolingual corpora in two different languages that share the same source of information. This situation is common in journalistic texts, where the same events are reported in many languages. Unfortunately, such corpora often lack article-level alignment information, and recovering it is the first problem to solve. In this paper, we report a method for automatically aligning Japanese and English newspaper articles in the financial and economic news domain. Whereas conventional methods require some manual work, the proposed method is fully automatic. We show that our method can align such newspaper articles with an accuracy of 97%.


Newspaper article alignment, Large-scale bilingual corpus, Bilingual company name dictionary, Dice's coefficient, Machine translation

Full Paper