Tiered Tagging Revisited


 Dan Tufis (1,2), Liviu Dragomirescu (1)

(1) Research Institute for Artificial Intelligence of the Romanian Academy; (2) University „A.I. Cuza” of Iasi




 In this paper we describe a new baseline tagset induction algorithm which, unlike the one described in previous work, is fully automatic and produces tagsets that perform better than before. The algorithm is an information-lossless transformation of the MULTEXT-EAST compliant lexical tags (MSDs) into a reduced tagset that can be mapped back onto the lexicon tagset fully deterministically. Starting from the baseline tagsets, a corpus linguist who is an expert in the language in question may reduce them further, taking into account the distributional properties of the language. Since any further reduction of the baseline tagsets loses information, adequate recovery rules must be designed to ensure that the final tagging is expressed in terms of the lexicon encoding. The algorithm is described in detail, and the generated baseline tagsets for Czech, English, Estonian, Hungarian, Romanian and Slovene are evaluated. They are much smaller than the corresponding MSD tagsets and systematically ensure better tagging accuracy.
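
 The property of deterministic recovery can be illustrated with a minimal Python sketch. It is not the induction algorithm of the paper. It assumes that "information lossless" means that, for every word form in the lexicon, distinct MSDs map to distinct reduced tags, so that the word form together with its reduced tag identifies exactly one MSD. The function names, the toy lexicon and its MSD-like strings, and the reduction that keeps only the first two characters are all hypothetical.

```python
def is_lossless(lexicon, reduce_tag):
    """Return True if, for every word form, distinct MSDs get distinct reduced tags.

    lexicon: dict mapping a word form to the set of its possible MSDs.
    reduce_tag: function mapping an MSD string to a reduced tag string.
    """
    for msds in lexicon.values():
        reduced = {reduce_tag(msd) for msd in msds}
        if len(reduced) != len(set(msds)):
            return False  # two MSDs of the same word collapsed into one tag
    return True


def recover_msd(lexicon, reduce_tag, word, tag):
    """Map a (word, reduced tag) pair back to its unique MSD via the lexicon."""
    candidates = [msd for msd in lexicon.get(word, ()) if reduce_tag(msd) == tag]
    if len(candidates) != 1:
        raise ValueError(f"cannot recover a unique MSD for {word!r} / {tag!r}")
    return candidates[0]


# Toy lexicon with illustrative MSD-like strings (not taken from the paper).
toy_lexicon = {
    "house": {"Ncns", "Vmn"},
    "houses": {"Ncnp", "Vmip3s"},
}


def keep_two(msd):
    """Hypothetical reduction: keep only the first two characters of the MSD."""
    return msd[:2]


# The same reduction stops being lossless once a word needs the dropped attribute.
ambiguous_lexicon = dict(toy_lexicon, sheep={"Ncns", "Ncnp"})
```

 On the toy lexicon, `is_lossless(toy_lexicon, keep_two)` holds and `recover_msd(toy_lexicon, keep_two, "house", "Vm")` returns the full MSD, whereas the entry for "sheep" in `ambiguous_lexicon` breaks the property because number is the only attribute that separates its two MSDs. In this sketch, such cases are where a further, lossy reduction would need recovery rules.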


 tiered tagging, tagset design, tagset evaluation, tagset mapping rules


 Czech, English, Estonian, Hungarian, Romanian, Slovene
