Summary of the paper

Title SYN2015: Representative Corpus of Contemporary Written Czech
Authors Michal Křen, Václav Cvrček, Tomáš Čapka, Anna Čermáková, Milena Hnátková, Lucie Chlumská, Tomáš Jelínek, Dominika Kováříková, Vladimír Petkevič, Pavel Procházka, Hana Skoumalová, Michal Škrabal, Petr Truneček, Pavel Vondřička and Adrian Jan Zasina
Abstract The paper concentrates on the design, composition and annotation of SYN2015, a new 100-million representative corpus of contemporary written Czech. SYN2015 is a sequel of the representative corpora of the SYN series that can be described as traditional (as opposed to the web-crawled corpora), featuring cleared copyright issues, well-defined composition, reliability of annotation and high-quality text processing. At the same time, SYN2015 is designed as a reflection of the variety of written Czech text production with necessary methodological and technological enhancements that include a detailed bibliographic annotation and text classification based on an updated scheme. The corpus has been produced using a completely rebuilt text processing toolchain called SynKorp. SYN2015 is lemmatized, morphologically and syntactically annotated with state-of-the-art tools. It has been published within the framework of the Czech National Corpus and it is available via the standard corpus query interface KonText as well as a dataset in shuffled format.
Topics Corpus (Creation, Annotation, etc.), LR National/International Projects, Infrastructural/Policy issues, Other
Full paper SYN2015: Representative Corpus of Contemporary Written Czech
Bibtex @InProceedings{KEN16.186,
  author = {Michal Křen and Václav Cvrček and Tomáš Čapka and Anna Čermáková and Milena Hnátková and Lucie Chlumská and Tomáš Jelínek and Dominika Kováříková and Vladimír Petkevič and Pavel Procházka and Hana Skoumalová and Michal Škrabal and Petr Truneček and Pavel Vondřička and Adrian Jan Zasina},
  title = {SYN2015: Representative Corpus of Contemporary Written Czech},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
}