Automated Morphological Segmentation and Evaluation
Uwe D. Reichel, Karl Weilhammer
Department of Phonetics and Speech Communication
In this paper we introduce (i) a new method for the morphological segmentation of German words and (ii) several measures related to the MDL principle for evaluating morphological segmentations. Our segmentation method is based on general knowledge about inflection, derivation, and morphotactics, together with part-of-speech information, all of which can be supplied with little effort. It can generate allomorphs, handle hierarchical structure, and retrieve morphemes that do not occur in isolation in the input data. Manual evaluation of 1400 segmented types, counting omissions and false insertions of morpheme boundaries, gave 87% recall and 98% precision. To obtain automatic evaluation measures for morphological segmentations, we tested (i) vocabulary size and entropy measures (the data size aspect of the MDL principle), (ii) model size, represented as the number of states of reduced deterministic finite state automata (DFSAs) that match exactly the models' outputs, and (iii) a linear combination of (i) and (ii). These measures were applied to segmentations of differing quality. A linear combination of vocabulary size and the size of the model-equivalent reduced DFSA turned out to be an appropriate measure for ranking segmentation models according to their quality.
morphological segmentation, automated evaluation, MDL, minimum description length, entropy, DFSA, automating, morphology
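As an illustration of the evaluation quantities named in the abstract, the following minimal sketch computes boundary precision and recall from omitted and falsely inserted morpheme boundaries, and the vocabulary size and entropy of a segmented word list. It is not the authors' implementation: the function names are hypothetical, the entropy is assumed here to be the unigram entropy over morph tokens (the abstract does not state the exact definition), and the example words are invented for demonstration.

```python
import math
from collections import Counter


def boundary_positions(segments):
    """Return the set of internal boundary offsets of one segmented word."""
    positions, offset = set(), 0
    for morph in segments[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def precision_recall(predicted, gold):
    """Boundary precision and recall over parallel lists of segmentations."""
    hits = inserted = omitted = 0
    for pred, ref in zip(predicted, gold):
        p, g = boundary_positions(pred), boundary_positions(ref)
        hits += len(p & g)       # boundaries found in both
        inserted += len(p - g)   # false insertions
        omitted += len(g - p)    # omissions
    precision = hits / (hits + inserted) if hits + inserted else 1.0
    recall = hits / (hits + omitted) if hits + omitted else 1.0
    return precision, recall


def vocabulary_size_and_entropy(segmentations):
    """Number of distinct morphs and unigram entropy (bits) of morph tokens.

    Assumption: entropy is taken over the relative frequencies of morph tokens.
    """
    counts = Counter(m for seg in segmentations for m in seg)
    total = sum(counts.values())
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return len(counts), entropy


# Invented toy data: the predicted segmentation misses one boundary.
gold = [["un", "glaub", "lich"], ["haus", "tür"]]
predicted = [["un", "glaublich"], ["haus", "tür"]]

precision, recall = precision_recall(predicted, gold)
vocab_size, entropy = vocabulary_size_and_entropy(predicted)
```

On this toy input the predicted segmentation inserts no false boundary but omits one of the three gold boundaries, so precision is 1.0 and recall is 2/3. The model size measure (the number of states of a reduced DFSA) and the weights of the linear combination are not sketched, since the abstract does not specify them.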