Quantitative parameters in corpus design: Estimating the optimum text size in Modern Greek language. 


George Mikros (Department of Italian and Spanish Language and Literature, University of Athens & Institute for Language and Speech Processing)


WO8: Written Corpora


This paper investigates the major quantitative parameters that determine the optimum text size in Modern Greek corpus development. Using the Hellenic National Corpus (HNC) (Hatzigeorgiu et al., 2000) as a reference point, we estimated a number of critical statistical measures of feature counts at different text sizes. The results indicate that frequent linguistic features behave differently from medium-frequency and rare ones, and that increasing the text size does not affect all features uniformly.


Corpus design, Corpus size, Quantitative linguistics, Text size, Distribution of linguistic features, Multivariate statistics

