Combining symbolic and statistical methods in morphological analysis and unknown word guessing


Attila Novák (1,2), Viktor Nagy (1), Csaba Oravecz (1)

(1) Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Address: Benczúr u. 33, Budapest, Hungary, Email: {novak,nagyv,oravecz}@nytud.hu; (2) MorphoLogic Ltd., Budapest, Address: Orbánhegyi út 5, Budapest, Hungary




Highly inflectional or agglutinative languages such as Hungarian have so many possible word forms that automatic methods which learn morphosyntactic annotation from a training corpus often face data sparseness. One possible solution is to apply a comprehensive morphological analyzer that can analyze almost all word forms, alleviating the problem of unseen tokens. However, a smaller number of forms will still remain unknown even to the morphological analyzer, and these must be handled by a guesser mechanism. This paper describes a hybrid method that combines symbolic and statistical information to provide lemmatization and suffix analyses for unknown word forms. The method is evaluated on how well it induces the possible analyses of unknown word forms and their respective lexical probabilities in a part-of-speech tagging system.


morphosyntactic annotation, POS tagging, unknown word guessing


