Translation memories enrichment by statistical bilingual segmentation


Francisco Nevado (1), Francisco Casacuberta (1), Josu Landa (2)

(1) Dept. de Sistemas Informaticos y Computacion, Camino de Vera s/n, 46022 Valencia, Spain; (2) Ametzagaiña AIE, Zirkuitu Ibilbidea 2-1, 20160 Lasarte-Oria, Spain




Most Machine Aided Translation systems are based on comparisons between a source sentence and reference sentences stored in Translation Memories (TMs). A translation is found by searching a database for sentences that are similar to the source sentence. TMs have two basic limitations: they depend on the repetition of complete sentences, and they are costly to build. Human translators do not only remember sentences from their previous translations; they also decompose the sentence to be translated and work with smaller units. It would therefore be desirable to enrich the TM database with smaller translation units, and to do so automatically so that the cost of building a TM does not increase. We propose applying two automatic bilingual segmentation techniques, based on statistical translation methods, to create new, shorter bilingual segments for inclusion in a TM database. The two techniques are evaluated on a bilingual Basque-Spanish task.
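
As an illustration of the sentence-level limitation described above, the following is a minimal sketch of a similarity-based TM lookup, and of how adding shorter segment pairs can turn a miss into a hit. It is not the method evaluated in the paper: the word-level similarity from Python's `difflib`, the 0.7 threshold, the function names, and the toy English-Spanish data are all assumptions made for illustration, and the shorter segments are written by hand rather than produced by a statistical segmenter.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Word-level similarity ratio in [0, 1] between two sentences."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def tm_lookup(tm, source, threshold=0.7):
    """Return (score, stored_source, stored_target) for the best entry
    whose similarity reaches the threshold, or None if there is none."""
    best = None
    for src, tgt in tm:
        score = similarity(source, src)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, src, tgt)
    return best


# Toy memory holding one complete sentence pair (hypothetical data).
tm = [("the house is red and the car is blue",
       "la casa es roja y el coche es azul")]

# Shorter bilingual segments of that pair. Hand-written here; in the
# setting described above they would come from automatic segmentation.
segments = [("the house is red", "la casa es roja"),
            ("the car is blue", "el coche es azul")]
enriched_tm = tm + segments

query = "the car is blue"
miss = tm_lookup(tm, query)          # full sentence is too dissimilar
hit = tm_lookup(enriched_tm, query)  # shorter segment matches
```

With the sentence-level memory alone, the query falls below the threshold and nothing is returned; with the enriched memory, the shorter segment is retrieved together with its translation.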


Statistical Bilingual Segmentation, Translation Memories, Statistical Machine Translation


Basque, Spanish

Full Paper