Unexpected Productions May Well Be Errors
Tylman Ule (1), Kiril Simov (2)
(1) Seminar für Sprachwissenschaft, Universität Tübingen; (2) Linguistic Modelling Laboratory, Bulgarian Academy of Sciences
We present a method for detecting annotation errors in treebanks. It assumes that annotation errors surface as unexpected small tree fragments. We generate statistics over configurations of these fragments using a standard statistical test, and we use the test results and the characteristics of the fragment distributions as features to classify unseen configurations as likely errors via machine learning. Evaluation shows that the resulting list of error candidates is reliable, independent of corpus size, annotation quality, and target language.
error detection, treebanks, manual annotation, language independent, machine learning
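The pipeline the abstract describes — extract small local tree fragments, score how unexpected each configuration is, and rank the rarest ones as error candidates — could be sketched as follows. This is a toy illustration, not the authors' implementation: a simple relative-frequency surprise score stands in for the paper's statistical test and machine-learned classifier, and all tree data and function names are invented.

```python
from collections import Counter

# Toy treebank: each tree is (label, children); leaves are plain strings.
trees = [
    ("S", [("NP", ["the", "dog"]), ("VP", ["barks"])]),
    ("S", [("NP", ["a", "cat"]), ("VP", ["sleeps"])]),
    ("S", [("NP", ["the", "cat"]), ("VP", ["purrs"])]),
    # A deliberate annotation slip: VP and NP swapped under S.
    ("S", [("VP", ["runs"]), ("NP", ["the", "dog"])]),
]

def productions(tree):
    """Yield (parent label, child-label sequence) fragments of one tree."""
    if isinstance(tree, str):
        return
    label, children = tree
    yield (label, tuple(c[0] if isinstance(c, tuple) else "TOK"
                        for c in children))
    for c in children:
        yield from productions(c)

# Count every local configuration across the corpus.
counts = Counter(p for t in trees for p in productions(t))
total = sum(counts.values())

def surprise(prod):
    """Unexpectedness score: rarer configurations score higher."""
    return 1.0 - counts[prod] / total

# Rank configurations; the most unexpected ones are error candidates.
candidates = sorted(counts, key=surprise, reverse=True)
```

With this toy corpus, the swapped production `("S", ("VP", "NP"))` occurs only once against three occurrences of `("S", ("NP", "VP"))`, so it ranks first among error candidates — mirroring the paper's premise that unexpected productions may well be errors.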