Word Sense Disambiguation as a Wordnets' Validation Method in Balkanet


Dan Tufis(1,2), Radu Ion(1), Nancy Ide(3)

(1) Research Institute for Artificial Intelligence of the Romanian Academy, Bucharest; (2) University "A.I. Cuza" of Iasi; (3) Department of Computer Science, Vassar College




BalkaNet is a European project which aims at the development of monolingual wordnets for five languages in the Balkans area (Bulgarian, Greek, Romanian, Serbian, and Turkish) and at the improvement of the Czech wordnet developed in the EuroWordNet project. The wordnets are aligned to the Princeton Wordnet, according to the principles established by the EuroWordNet consortium. One of the main concerns of this project is the interlingual validation of the wordnets' alignment. To this end, we have developed a WSD system, based on parallel corpora, which exploits the common intuition that words which are reciprocal translations in parallel texts should be linked to the same (or closely related) interlingual concepts. An embedded word aligner provides the wordnet-based algorithm, described in the paper, with pairs of words which are reciprocal translations and which are then used to disambiguate each other. With wordnets under construction, our WSD system is useful mainly for validation, pinpointing wrong interlingual alignments and incomplete or missing synsets in one or the other of the wordnets. With robust wordnets, the system is a proper word sense disambiguation tool for parallel corpora. The sense granularity at which the WSD is achieved is that of the Princeton Wordnet. Besides its high accuracy and fine-grained disambiguation, the appeal of this approach is that it can be used to automatically sense-tag corpora in several languages at once, using the same sense inventory. The WSD is evaluated on a Romanian-English bitext, extracted from the multilingual parallel corpus "1984", against a hand sense-tagging used as a Gold Standard.
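
The core intuition can be illustrated with a minimal Python sketch. It is not the algorithm described in the paper: it assumes each wordnet is reduced to a plain mapping from a word to the set of interlingual concept identifiers its synsets are aligned to, and it ignores the "closely related concepts" case. The function name, the status labels, the toy entries, and the identifiers are all hypothetical, chosen only for illustration (Romanian diacritics are omitted).

```python
# Illustrative sketch only: names, data, and identifiers are hypothetical.

def disambiguate_pair(src_word, tgt_word, src_wordnet, tgt_wordnet):
    """Intersect the interlingual concepts of two translation equivalents.

    Returns a (status, concepts) pair:
      - "disambiguated": exactly one shared concept, usable as the sense tag
      - "ambiguous": several shared concepts remain
      - "check_wordnets": no shared concept, which hints at a wrong alignment
        or a missing synset in one of the wordnets
    """
    src_concepts = src_wordnet.get(src_word, set())
    tgt_concepts = tgt_wordnet.get(tgt_word, set())
    common = sorted(src_concepts & tgt_concepts)
    if len(common) == 1:
        return "disambiguated", common
    if common:
        return "ambiguous", common
    return "check_wordnets", common


# Toy wordnets (made-up concept identifiers, not real Princeton Wordnet ids).
ro_wordnet = {
    "carte": {"ili:book", "ili:card"},
    "banca": {"ili:bank_institution"},  # the "bench" sense is assumed missing
}
en_wordnet = {
    "book": {"ili:book", "ili:ledger"},
    "bench": {"ili:bench_seat"},
}

# Word pairs as an aligner might deliver them (assumed input).
for ro, en in [("carte", "book"), ("banca", "bench")]:
    print(ro, en, disambiguate_pair(ro, en, ro_wordnet, en_wordnet))
```

In this toy setting the first pair shares a single concept, so both words receive the same sense tag, while the second pair shares none and is flagged for inspection of the wordnets, which mirrors the validation use described above.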


parallel corpora, translation equivalents, validation, word alignment, wordnet, word sense disambiguation

Language(s) English, Romanian
