Summary of the paper

Title Experiments on Processing Overlapping Parallel Corpora
Authors Mark Fishel and Heiki-Jaan Kaalep
Abstract The number and size of parallel corpora keep growing, which makes automatic methods for processing them necessary: combining corpora, checking and improving their quality, etc. We introduce a method that supports many of these tasks by exploiting overlapping parallel corpora. The method finds the correspondence between sentence pairs in two corpora: first the corresponding language parts of the corpora are aligned, and then the two resulting alignments are compared. The method takes into account slight differences in the source documents, different levels of segmentation of the input corpora, encoding differences and other aspects of the task. The paper describes two experiments conducted to test the method. In the first experiment, the Estonian-English part of the JRC-Acquis corpus was combined with another corpus of legislation texts. In the second experiment, alternatively aligned versions of the JRC-Acquis were compared to each other, using all language pairs between English, Estonian and Latvian as an example. Several additional conclusions about the corpora can be drawn from the results. The method proves to be effective for several parallel corpora processing tasks.
Language Language-independent
Topics Corpus (creation, annotation, etc.), Tools, systems, applications, Validation of LRs
Full paper Experiments on Processing Overlapping Parallel Corpora
Slides Experiments on Processing Overlapping Parallel Corpora
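
Illustration The two-step idea in the abstract (align the corresponding language parts of two corpora, then compare the two resulting alignments) can be sketched roughly as follows. This is a minimal sketch and not the authors' implementation: the function names, the use of Python's difflib.SequenceMatcher as a stand-in monolingual aligner, and the NFKC-based normalization are all assumptions made for illustration. It only links sentences one-to-one and does not handle different segmentation levels, which the paper's method does address.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(sentence):
    """Reduce encoding, whitespace and case differences before comparison (assumed preprocessing)."""
    text = unicodedata.normalize("NFKC", sentence)
    return " ".join(text.split()).casefold()


def align_side(side_a, side_b):
    """Align two sentence lists in the same language.

    Returns a set of (i, j) links meaning sentence i of corpus A matches
    sentence j of corpus B. SequenceMatcher is a stand-in aligner here.
    """
    norm_a = [normalize(s) for s in side_a]
    norm_b = [normalize(s) for s in side_b]
    matcher = SequenceMatcher(None, norm_a, norm_b, autojunk=False)
    links = set()
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            links.add((block.a + k, block.b + k))
    return links


def corresponding_pairs(corpus_a, corpus_b):
    """Return index links (i, j) on which both language sides agree.

    Each corpus is a list of (source_sentence, target_sentence) tuples.
    """
    src_links = align_side([s for s, _ in corpus_a], [s for s, _ in corpus_b])
    tgt_links = align_side([t for _, t in corpus_a], [t for _, t in corpus_b])
    # Comparing the two alignments: keep only links confirmed on both sides.
    return sorted(src_links & tgt_links)


# Toy Estonian-English data, invented for this example.
corpus_a = [("Tere", "Hello"), ("Artikkel 1", "Article 1"), ("Lõpp", "End")]
corpus_b = [("Artikkel  1", "Article 1"), ("Lõpp", "The end")]
print(corresponding_pairs(corpus_a, corpus_b))
```

In this toy data the source sides agree on two sentences but the target sides agree on only one, so only that one pair is reported as shared. Links found on one side but not the other are the kind of disagreement that could point to alignment or quality problems in one of the corpora.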
