Term Translations in Parallel Corpora: Discovery and Consistency Check


Dan Tufis

Research Institute for Artificial Intelligence of the Romanian Academy & University „A.I. Cuza” of Iasi




The paper describes a method for identifying term translations in parallel corpora, developed within the FF-POIROT European project. The project aims to build multilingual (Dutch, Italian, French and English) resources in the financial/legal domain for use in knowledge and information systems by investigative bodies and law enforcement agencies to detect, investigate or help prevent instances of actual or attempted financial fraud. The methodology builds on our word alignment procedure, which is based on translation equivalents extracted from parallel corpora. When a validated list of multiword terms is available in one language, the procedure provides their translations in any of the other languages present in the parallel corpus. Given that a term is usually semantically unambiguous, the translations found for different occurrences of the same term should be the same (modulo inflectional variation). If they are not, one may suspect that the original term has been translated inconsistently. When a manually compiled term list is not available, the system tries to discover term candidates by extracting sequences of words that occur together more frequently than expected by chance. Using the same alignment procedure, the occurrences of the candidate terms in one language are then linked to their translation equivalents in the other languages.
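The two ideas in the abstract, candidate discovery and the consistency check, can be illustrated with a minimal sketch. It is not the system described in the paper. The abstract does not name an association measure, so pointwise mutual information over adjacent word pairs is used here as a stand-in. Word alignment is assumed to have been done already and to be available as (source term, translation) pairs. Lowercasing stands in for the handling of inflectional variation, which would need proper lemmatisation. All function and parameter names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bigram_candidates(tokens, min_count=2, min_pmi=1.0):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI is an assumed association measure; the abstract only says the
    words should co-occur "more frequently than expected by chance".
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to estimate reliably
        p_pair = count / n_bi
        p_chance = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi = math.log2(p_pair / p_chance)
        if pmi >= min_pmi:
            scored[(w1, w2)] = pmi
    return scored


def inconsistent_translations(aligned_occurrences, normalise=str.lower):
    """Return the terms whose occurrences have more than one translation.

    aligned_occurrences: iterable of (source_term, translation) pairs,
    assumed to come from a word aligner that has already been run.
    normalise: placeholder for lemmatisation (lowercasing only here).
    """
    by_term = defaultdict(Counter)
    for term, translation in aligned_occurrences:
        by_term[term][normalise(translation)] += 1
    return {term: dict(counts)
            for term, counts in by_term.items() if len(counts) > 1}


if __name__ == "__main__":
    text = ("money laundering is a crime and money laundering is punished "
            "while the bank reports the crime")
    print(bigram_candidates(text.split()))

    pairs = [
        ("money laundering", "blanchiment d'argent"),
        ("money laundering", "Blanchiment d'argent"),
        ("money laundering", "blanchiment de capitaux"),
        ("tax fraud", "fraude fiscale"),
    ]
    print(inconsistent_translations(pairs))
```

On a real corpus the candidate list would also need a part-of-speech or stop-word filter, because frequent function-word sequences can pass a purely statistical threshold.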


corpus encoding, tokenisation, tagging, term extraction, term translation based on parallel corpora, translation equivalents in parallel corpora, word alignment

Language(s) English, French, Dutch, Italian
Full Paper