Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian
Božo Bekavac (1), Petya Osenova (2), Kiril Simov (2), Marko Tadić (1)
(1) Institute of Linguistics, Faculty of Philosophy, University of Zagreb, Ivana Lucica 3, 10000 Zagreb, Croatia, email@example.com, firstname.lastname@example.org; (2) BulTreeBank Project, Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Acad. G. Bonchev St. 25A, 1113 Sofia, Bulgaria, email@example.com, firstname.lastname@example.org
This paper describes the first steps towards the creation of a Bulgarian-Croatian comparable corpus. It is built from two newspaper subcorpora drawn from larger reference corpora of Bulgarian and Croatian. Initially we rely on parameters of similarity that are more extralinguistically oriented but methodologically cleaner: specific topics, a pre-defined time span, and data size. The idea of `light' and `hard' comparable corpora is introduced. At this stage we aim at producing a `light' bilingual comparable corpus. The algorithm for identifying lexical similarity and aligning linguistic units is presented, and the initial experiments are outlined.
comparable corpora, text alignment