Extracting French-Japanese Word Pairs from Bilingual Corpora based on Transliteration Rules
Keita Tsuji (National Institute of Informatics 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan)
Beatrice Dailley (University of Nantes IRIN, 2, rue de la Houssinière BP 92208, 44322 Nantes cedex 3, France)
Kyo Kageura (National Institute of Informatics 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan)
WP1: Corpora & Corpus Tools
It has been shown that using transliteration rules to extract Japanese Katakana and English word pairs is highly useful and promising. For Japanese-French pairs, however, the method is not guaranteed to work, because very few Japanese Katakana words are borrowed directly from French. In this paper we show that Japanese Katakana and French word pairs can be extracted on the basis of transliteration from loosely aligned Japanese-French bilingual corpora. The method applies all the existing transliteration rules to each mora unit in a Katakana word, and extracts as its translation the French word that matches or partially matches one of the resulting transliteration candidates. For instance, if we have `グラフ' in the Japanese part of a bilingual corpus, we generate such transliteration candidates as <graf>, <graphe>, <gulerph>,... and identify similar words in the French part of the corpus. The method performed reasonably well, achieving 80% precision at 20% recall. We also observed that Japanese-English transliteration rules worked well for extracting Katakana-French word pairs.
Transliteration rules, Word pairs
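
The candidate-generation and matching steps described in the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the rule table RULES is hypothetical and covers only three characters, the input word グラフ ("graph") is an assumed example, each Katakana character is treated as one mora (which ignores two-character morae such as キャ), and partial matching is approximated with the similarity ratio from Python's difflib rather than with the measure used in the paper.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical mora-to-Latin transliteration rules (illustration only;
# not the rule set used in the paper).
RULES = {
    "グ": ["g", "gu"],
    "ラ": ["ra", "la", "ler"],
    "フ": ["f", "ph", "phe"],
}


def candidates(katakana):
    """Return every string built by choosing one rule output per mora.

    Simplification: each Katakana character is treated as one mora.
    """
    options = [RULES[mora] for mora in katakana]
    return {"".join(parts) for parts in product(*options)}


def best_match(katakana, french_words, threshold=0.8):
    """Return the French word most similar to any candidate, or None.

    The similarity measure (difflib ratio) and the threshold are
    assumptions standing in for the paper's partial-matching criterion.
    """
    best_word, best_score = None, 0.0
    for cand in candidates(katakana):
        for word in french_words:
            score = SequenceMatcher(None, cand, word.lower()).ratio()
            if score > best_score:
                best_word, best_score = word, score
    return best_word if best_score >= threshold else None


if __name__ == "__main__":
    print(sorted(candidates("グラフ")))
    print(best_match("グラフ", ["graphe", "courbe", "table"]))
```

With these assumed rules, the candidate set for グラフ contains <graf>, <graphe> and <gulerph>, and the French word "graphe" is selected because it coincides exactly with one candidate. Raising or lowering the threshold trades recall against precision.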