The Hungarian National Corpus
Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences)
WP1: Corpora & Corpus Tools
This paper reports on the development of the Hungarian National Corpus (HNC), which was completed at the end of 2001 after four years of work. The HNC is designed as a balanced reference corpus of current written Hungarian, comprising 150 million words. The first half of the paper discusses basic design issues concerning the composition of the corpus, for which the HNC adopts a fairly pragmatic approach that focuses on five major text types. The second half describes the annotation and tagging system used.
Corpus annotation, Tagging, Disambiguation, Representative, Tiered tagging