LREC 2000 2nd International Conference on Language Resources & Evaluation


Title Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets
Authors Džeroski Sašo (Institute Jozef Stefan, Ljubljana, Slovenia)
Erjavec Tomaž (Dept. for Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia)
Zavrel Jakub (CNTS / Language Technology Group, University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium)
Keywords Evaluation, Slovene Language, Tagging
Session Session WP5 - Corpus Tagging
Full Paper, 146.pdf
Abstract The paper evaluates tagging techniques on a corpus of Slovene, where we are faced with a large number of possible word-class tags and only a small (hand-tagged) dataset. We report on training and testing of four different taggers on the Slovene MULTEXT-East corpus, which contains about 100,000 words and 1000 different morphosyntactic tags. Results show, first of all, that training times of the Maximum Entropy Tagger and the Rule Based Tagger are unacceptably long, while they are negligible for the Memory Based Taggers and the TnT trigram tagger. Results on a random split show that tagging accuracy varies between 86% and 89% overall, between 92% and 95% on known words, and between 54% and 55% on unknown words. The best results are obtained by TnT. The paper also investigates performance in relation to our EAGLES-based morphosyntactic tagset. Here we compare the per-feature accuracy on the full tagset with the accuracy on the same features when training on a reduced tagset. Results show that PoS accuracy is quite high, while accuracy on Case is lowest. Tagset reduction helps improve accuracy, but less than might be expected.
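The accuracy figures above (overall, known words, unknown words, and per-feature) can be illustrated with a minimal sketch. This is not the evaluation code used in the paper. The names `Token`, `split_accuracy`, and `feature_accuracy` are hypothetical, "known" is assumed to mean "word form seen in the training data", and the example projection assumes that the first character of a tag encodes the part of speech.

```python
from typing import Callable, Dict, Iterable, NamedTuple, Set


class Token(NamedTuple):
    """One test token: word form, hand-assigned tag, tagger output (hypothetical type)."""
    word: str
    gold: str
    predicted: str


def split_accuracy(tokens: Iterable[Token], train_vocab: Set[str]) -> Dict[str, float]:
    """Full-tag accuracy overall, on known words, and on unknown words.

    Assumption: a word is "known" if its form occurs in the training vocabulary.
    """
    counts = {"overall": [0, 0], "known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for tok in tokens:
        group = "known" if tok.word in train_vocab else "unknown"
        hit = int(tok.gold == tok.predicted)
        for key in ("overall", group):
            counts[key][0] += hit
            counts[key][1] += 1
    return {k: (c / n if n else float("nan")) for k, (c, n) in counts.items()}


def feature_accuracy(tokens: Iterable[Token], project: Callable[[str], str]) -> float:
    """Accuracy on a single feature: tags are compared after applying `project`."""
    toks = list(tokens)
    if not toks:
        return float("nan")
    hits = sum(project(t.gold) == project(t.predicted) for t in toks)
    return hits / len(toks)


# Toy data, invented for illustration only (not taken from the corpus).
train_vocab = {"je", "in"}
test_tokens = [
    Token("je", "Vcip3s--n", "Vcip3s--n"),  # known, fully correct
    Token("in", "Ccs", "Ccs"),              # known, fully correct
    Token("je", "Vcip3s--n", "Vcip3p--n"),  # known, PoS right but full tag wrong
    Token("zmaj", "Ncmsn", "Afpmsn"),       # unknown, PoS wrong
]

report = split_accuracy(test_tokens, train_vocab)
# Assumption: the first character of a tag is the part of speech.
pos_acc = feature_accuracy(test_tokens, lambda tag: tag[:1])
```

Scoring a feature through a projection function means that a tag with the right part of speech but the wrong Case still counts as correct for PoS, which is why per-feature accuracy can be much higher than full-tag accuracy.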