Title The JOS Morphosyntactically Tagged Corpus of Slovene
Authors Tomaž Erjavec and Simon Krek
Abstract The JOS morphosyntactic resources for Slovene consist of the specifications, a lexicon, and two corpora: jos100k, a 100,000-word balanced monolingual sampled corpus annotated with hand-validated morphosyntactic descriptions (MSDs) and lemmas, and jos1M, a 1-million-word partially hand-validated corpus. Both corpora were sampled from the 600M-word Slovene reference corpus FidaPLUS. The JOS resources use a standardised encoding: the morphosyntactic specifications follow the MULTEXT-East format, and the corpora are encoded according to the Text Encoding Initiative Guidelines P5. The JOS resources are available as a dataset for research under a Creative Commons licence and are meant to facilitate the development of human language technologies (HLT) for Slovene.
Language Single language
Topics Corpus (creation, annotation, etc.), Tagging, Standards for LRs
Full paper The JOS Morphosyntactically Tagged Corpus of Slovene
