Implementation and Evaluation of PAROLE PoS in a National Context
Tilly Dutilh (Institute for Dutch Lexicology, P.O. Box 9515, 2300 RA Leiden, The Netherlands)
Truus Kruyt (Institute for Dutch Lexicology, P.O. Box 9515, 2300 RA Leiden, The Netherlands)
WP4: Corpus Annotation
We are annotating the complete 20-million-word Dutch PAROLE corpus with PoS and lemma. The morphosyntactic tagging of 250,000 words during the PAROLE project was the first confrontation of the fine-grained Dutch PAROLE tagset, and its 'functional' mode of application, with real corpus data. The correction of the manual tagging and the compilation of a 100,000-word training corpus for the automatic tagger initiated an evaluation of the suitability of the tagset and of the methodology of tag assignment; both topics are discussed in this paper. The reality of corpus data brought about a number of adaptations, linguistic restrictions and generalisations. The most salient tagger results are presented. Our experience is relevant for a new project: the Integrated Language Database of 8th-21st Century Dutch (ILD), which will contain a text corpus covering all these centuries. That corpus will be annotated with lemma and PoS, a process in which historical lexica will be used. We will therefore have to tailor the tagset and the methodology of tag assignment optimally to these purposes.
Customised PAROLE tagset, Tag methodology, PAROLE tagger, PAROLE and historical Dutch, Internet-accessible Dutch PAROLE corpus