Orthographic and Phonetic Annotation of Very Large Czech Corpora with Quality Assessment


Petr Pollák (1), Jan Černocký (2)

(1) Czech Technical University in Prague, CVUT FEL K13131, Technická 2, 16627 Praha 6, Czech Republic, E-mail: pollak@feld.cvut.cz; (2) Brno University of Technology, VUT FIT, Bozetechova 2, 612 66 Brno, Czech Republic, E-mail: cernocky@fit.vutbr.cz




Annotation is generally an integral part of a speech database. In this paper we present a common orthographic and phonetic annotation of large Czech databases. Phonetic annotation can be very important, as it gives more information than a pronunciation lexicon with possible pronunciation variants. Moreover, for Czech, phonetic annotation requires only a small additional effort beyond standard orthographic transcription. The tool FTP-Trascriber, developed for this purpose, is also presented. In the second part we present the quality assessment procedure applied to the annotation of large speech corpora collected at our laboratories. We describe semi-automated quality checks built on several fully automated pre-checks that reduce the additional manual effort required.


database annotation, orthographic transcription, phonetic transcription, annotation tool

Language(s) Czech
