LREC 2012 Proceedings

Summary of the paper

Title	Building a learner corpus
Authors	Jirka Hana, Alexandr Rosen, Barbora Štindlová and Petr Jäger
Abstract	The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked levels to cope with a wide range of error types present in the input. Each level corrects different types of errors; links between the levels allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a doubly-annotated sample of approx. 10,000 words with fair inter-annotator agreement results. We also explore applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even replace manual annotation.
Topics	Acquisition, Corpus (creation, annotation, etc.), LR Infrastructures and Architectures
Full paper	Building a learner corpus
Bibtex	@InProceedings{HANA12.992, author = {Jirka Hana and Alexandr Rosen and Barbora Štindlová and Petr Jäger}, title = {Building a learner corpus}, booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)}, year = {2012}, month = {may}, date = {23-25}, address = {Istanbul, Turkey}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-7-7}, language = {english} }