Measuring corpus homogeneity using a range of measures for inter-document distance


Gabriela Cavaglia (ITRI, University of Brighton, Lewes Road, Brighton BN2 4GJ, United Kingdom)


WP1: Corpora & Corpus Tools


With the ever more widespread use of corpora in language research, it is becoming increasingly important to be able to describe and compare corpora. The analysis of corpus homogeneity is a prerequisite for any quantitative approach to corpus comparison. We describe a method for text analysis based only on document-internal linguistic features, and a set of related homogeneity measures based on inter-document distance. We present a preliminary experiment to test the hypothesis that training an NLP system requires a smaller subcorpus when the corpus is homogeneous than when it is heterogeneous.


Corpus homogeneity, Corpus design, Corpus comparison

