LREC 2000 2nd International Conference on Language Resources & Evaluation

Title Principled Hidden Tagset Design for Tiered Tagging of Hungarian
Authors Tufiş Dan (RACAI-Romanian Academy 13, “13 Septembrie”, Ro-74311, Bucharest 5, Romania)
Dienes Péter (Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest)
Oravecz Csaba (Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest)
Váradi Tamás (Linguistics Institute, Hungarian Academy of Sciences, H-1014 Budapest Színház u 5-9)
Keywords Corpus Annotation, Tagset Design, Tagset Reduction, Tiered Tagging
Session Session WO18 - Morphology in Lexical and Textual Resources
Full Paper, 249.pdf
Abstract For highly inflectional languages, the number of morpho-syntactic descriptions (MSDs) required to cover the content of a word-form lexicon tends to rise quite rapidly, approaching a thousand or more distinct codes. For the automatic disambiguation of arbitrary written texts, such large tagsets raise many problems, ranging from the implementation issues of making a tagger work with such a large tagset to the more theoretical difficulty of training-data sparseness. Tiered tagging is one way to alleviate this problem by reformulating it as follows: starting from a large set of MSDs, design a reduced tagset (the C-tagset) that is manageable for current tagging technology. We describe the details of the reduced tagset design for Hungarian, where the MSD-set cardinality is several thousand. Designing a manageable C-tagset therefore calls for a severe reduction in the number of MSD features, a process that requires careful evaluation of the features.
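The reduction idea can be illustrated with a minimal Python sketch. It is not the procedure of the paper: it assumes positional MSD strings, a toy lexicon with invented codes (not real Hungarian MSDs), and hypothetical helper names (`to_ctag`, `recovery_table`, `unrecoverable`). It projects each MSD onto a chosen subset of feature positions to obtain a reduced tag, then lists the word forms for which the dropped features could not be recovered from the lexicon alone.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word form -> set of positional MSD codes.
# The codes are invented for illustration and are not real Hungarian MSDs.
LEXICON = {
    "form_a": {"Nc-sn", "Nc-sa"},
    "form_b": {"Vmip3s", "Nc-sn"},
    "form_c": {"Vmip3s", "Vmip3p"},
}

def to_ctag(msd, keep_positions):
    """Project an MSD onto the kept positions to form a reduced tag."""
    return "".join(msd[i] for i in keep_positions if i < len(msd))

def recovery_table(lexicon, keep_positions):
    """Map each (word form, reduced tag) pair to the MSDs that collapse onto it."""
    table = defaultdict(set)
    for form, msds in lexicon.items():
        for msd in msds:
            table[(form, to_ctag(msd, keep_positions))].add(msd)
    return table

def unrecoverable(lexicon, keep_positions):
    """Return the pairs whose full MSD cannot be restored by lexicon lookup alone."""
    table = recovery_table(lexicon, keep_positions)
    return {key: msds for key, msds in table.items() if len(msds) > 1}

if __name__ == "__main__":
    # Keeping only the part-of-speech position loses too much information.
    print(len(unrecoverable(LEXICON, (0,))))
    # Keeping positions 0, 4 and 5 makes every toy MSD recoverable.
    print(unrecoverable(LEXICON, (0, 4, 5)))
```

In this sketch, a candidate set of kept positions is judged by how many (word form, reduced tag) pairs remain ambiguous. Evaluating features one by one in this manner is only one possible reading of the "careful evaluation" mentioned in the abstract.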