Title	Unexpected Productions May Well be Errors
Author(s)	Tylman Ule (1), Kiril Simov (2); (1) Seminar für Sprachwissenschaft, Universität Tübingen; (2) Linguistic Modelling Laboratory, Bulgarian Academy of Sciences
Session	P19-SW
Abstract	We present a method for detecting annotation errors in treebanks. It assumes that errors appear as unexpected small tree fragments. We compute statistics over configurations of these fragments using a standard statistical test. We then use the test result and the characteristics of the configurations' distributions as features for a machine-learning classifier that flags unseen configurations as likely errors. Evaluation shows that the resulting list of error candidates is reliable regardless of corpus size, annotation quality, and target language.
Keyword(s)	error detection, treebanks, manual annotation, language independent, machine learning
Language(s)	Bulgarian, German
Full Paper	483.pdf