Title	A Framework for Evaluating the Suitability of Non-English Corpora for Language Engineering
Author(s)	Avik Sarkar, Anne De Roeck (Department of Computing, The Open University, Milton Keynes, UK)
Session	O39-EW
Abstract	In this paper we develop a framework for fast profiling and quality verification of datasets for language engineering and information retrieval research. The profiling steps consist of an initial tokenization of the corpus to produce a frequency list, from which some basic statistics are derived. Manual sampling is carried out to detect obvious discrepancies. Two diagnostic tests are performed to check sparseness-related measures. The behaviour of the function words is traced to gauge the homogeneity of their distribution across documents.
Keyword(s)	Bengali language evaluation, corpus profiling
Language(s)	Bengali, Arabic, English
Full Paper	485.pdf
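
The first profiling step named in the abstract (tokenization, a frequency list, and basic statistics) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name profile_corpus, the whitespace tokenizer, the punctuation set, and the choice of statistics (token count, type count, type-token ratio, hapax legomena) are all assumptions made for illustration.

```python
import string
from collections import Counter

# Assumption: punctuation to strip from token edges, including the
# Devanagari/Bengali danda and double danda used as sentence terminators.
PUNCT = string.punctuation + "\u0964\u0965"

def profile_corpus(documents):
    """Tokenize documents on whitespace and derive a frequency list
    plus a few basic statistics (hypothetical helper, for illustration)."""
    tokens = []
    for doc in documents:
        for raw in doc.lower().split():
            tok = raw.strip(PUNCT)
            if tok:  # skip tokens that were pure punctuation
                tokens.append(tok)

    freq = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freq)
    return {
        "frequency_list": freq.most_common(),  # sorted by descending count
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "hapax_legomena": sum(1 for c in freq.values() if c == 1),
    }

if __name__ == "__main__":
    stats = profile_corpus(["the cat sat on the mat.", "the dog"])
    print(stats["tokens"], stats["types"], stats["hapax_legomena"])
```

Whitespace splitting is used here rather than a regular expression such as \w+, because Python's \w does not match the combining vowel signs found in Bengali script and would break words apart. A real profiling run would need a tokenizer suited to the language of the corpus.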