LREC 2000 2nd International Conference on Language Resources & Evaluation


Title Orthographic Transcription of the Spoken Dutch Corpus
Authors Goedertier Wim (Electronics and Information Systems (ELIS), University Gent, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium)
Goddijn Simo (Speech Processing Expertise Centre (SPEX), Department of Language and Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands)
Martens Jean-Pierre (Electronics and Information Systems (ELIS), University Gent, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium)
Keywords Orthographic Transcription, Speech Corpora, Spoken Dutch, Spoken Language Resources
Session Session SP3 - Spoken Language Resources' Projects
Full Paper, 87.pdf
Abstract This paper focuses on the specification of the orthographic transcription task in the Spoken Dutch Corpus, the problems encountered in drawing up that specification, and the evaluation experiments carried out to assess transcription efficiency and inter-transcriber consistency. The orthographic transcriptions play a twofold role in the Spoken Dutch Corpus: on the one hand, they are important for future database users; on the other hand, they are indispensable to the development of the corpus itself. The main objectives of the transcription task are: (1) to obtain a verbatim transcription that can be made with a minimum level of interpretation of the utterances; (2) to obtain an alignment of the transcription to the speech signal at the level of relatively short chunks; (3) to obtain a transcription that is useful to researchers working in several research areas; and (4) to adhere to international standards for existing large speech corpora. In designing the transcription protocol and the transcription procedure, the aim was to establish the best compromise between the consistency, accuracy and usability of the output and the efficiency of the transcription task. For example, the transcription procedure always consists of a first transcription cycle followed by a verification cycle. Some efficiency and consistency statistics, derived from pilot experiments in which several students transcribed the same material, are presented at the end of the paper. In these experiments the transcribers were also asked to record the amount of time they spent on the different audio files and to report any difficulties they encountered in performing their task.