LREC 2016 Proceedings

Summary of the paper

Title	Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene
Authors	Nikola Ljubešić and Tomaž Erjavec
Abstract	In this paper we present a tagger developed for inflectionally rich languages for which both a training corpus and a lexicon are available. We do not constrain the tagger by the lexicon entries, which allows for both lexicon incompleteness and noise. By using the lexicon indirectly, through features, we allow known and unknown words to be tagged in the same manner. We test our tagger on Slovene data, obtaining a 25% error reduction over the best previous results on both known and unknown words. Given that Slovene is, in comparison to some other Slavic languages, a well-resourced language, we perform experiments on the impact of token (corpus) vs. type (lexicon) supervision, obtaining useful insights into how to balance the effort of extending resources to yield better tagging results.
Topics	Part-of-Speech Tagging, Tools, Systems, Applications, Corpus (Creation, Annotation, etc.)
Bibtex	@InProceedings{LJUBEI16.811,
  author = {Nikola Ljubešić and Tomaž Erjavec},
  title = {Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
}
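
The abstract's idea of using the lexicon "indirectly through features", rather than as a hard constraint on the candidate tags, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `token_features`, the feature names, and the toy `LEXICON` entries (including the tag strings) are all assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: word form -> set of possible tags.
# Entries and tag strings are made up for illustration only.
LEXICON: Dict[str, Set[str]] = {
    "hiša": {"Ncfsn"},
    "je": {"Va-r3s-n", "Pp3fsa--y"},
}


def token_features(tokens: List[str], i: int,
                   lexicon: Dict[str, Set[str]]) -> Dict[str, float]:
    """Build a feature dict for token i.

    The lexicon only contributes features; it never restricts which
    tags a downstream classifier may predict, so known and unknown
    words go through exactly the same code path.
    """
    word = tokens[i].lower()
    feats: Dict[str, float] = {"bias": 1.0, f"w={word}": 1.0}

    # Suffix features help with words missing from the lexicon.
    for k in range(1, 4):
        if len(word) >= k:
            feats[f"suf{k}={word[-k:]}"] = 1.0

    # Lexicon hypotheses become soft evidence, one feature per tag.
    tags = lexicon.get(word, set())
    if tags:
        for tag in sorted(tags):
            feats[f"lex={tag}"] = 1.0
    else:
        feats["lex=NONE"] = 1.0

    # Neighbouring word forms as simple context features.
    if i > 0:
        feats[f"w-1={tokens[i - 1].lower()}"] = 1.0
    if i + 1 < len(tokens):
        feats[f"w+1={tokens[i + 1].lower()}"] = 1.0
    return feats


sentence = ["Hiša", "je", "velika"]
known = token_features(sentence, 0, LEXICON)
unknown = token_features(sentence, 2, LEXICON)
```

Because a wrong or missing lexicon entry only changes feature values, a classifier trained on such dicts can still predict a tag the lexicon does not list, which is how a design of this kind tolerates an incomplete or noisy lexicon.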