Methods and Tools for Speech Data Acquisition exploiting a Database of German Parliamentary Speeches and Transcripts from the Internet


Konstantin Biatov (Fraunhofer Institute for Media Communication Schloß Birlinghoven 53754 Sankt Augustin Germany)

Joachim Köhler (Fraunhofer Institute for Media Communication Schloß Birlinghoven 53754 Sankt Augustin Germany)


SP2: Speech Varieties And Multilingual ASR


This paper describes methods that exploit stenographic transcripts of the German Parliament to improve the acoustic models of a speech recognition system for this domain. The stenographic transcripts and the speech data are available on the Internet. Using data from the Internet avoids the costly collection and annotation of a large amount of data. The automatic data acquisition technique uses the stenographic transcripts and the acoustic data of the German parliamentary speeches, together with general acoustic models trained on different data. The idea of the technique is to generate special finite state automata from the stenographic transcripts. These finite state automata model the possible correspondences between the stenographic transcript and the spoken audio content, i.e., the accurate transcript. The first step is to recognize the speech data using the finite state automaton as a language model. The next step is to find, extract, and verify the matches between sections of recognized words and the actually spoken audio content. The automatically extracted and verified data can then be used for acoustic model training. Experiments show that, for a given recognition task from the German Parliament domain, the word error rate decreases by 20% absolute.
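The match-extraction step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the recognizer output and the stenographic transcript are already available as word lists, uses Python's standard `difflib.SequenceMatcher` as a stand-in for the alignment, and treats a run of at least `min_len` identical consecutive words as a verified match. The function name `matching_segments`, the threshold `min_len`, and the example sentences are hypothetical; the finite state automaton decoding and audio time stamps are not modelled.

```python
from difflib import SequenceMatcher


def matching_segments(transcript_words, recognized_words, min_len=3):
    """Return runs of consecutive words shared by both sequences.

    Each result is (start, end, words), where start and end index into
    recognized_words. Runs shorter than min_len are discarded, as a
    simple stand-in for the verification step (an assumption).
    """
    matcher = SequenceMatcher(a=transcript_words, b=recognized_words,
                              autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_len:
            start, end = block.b, block.b + block.size
            segments.append((start, end, recognized_words[start:end]))
    return segments


# Hypothetical example: the stenographer dropped a hesitation ("äh")
# and a repeated word ("die") that the speaker actually produced.
transcript = "sehr geehrte damen und herren ich eröffne die sitzung".split()
recognized = ("sehr geehrte damen und herren äh ich eröffne "
              "die die sitzung").split()

segments = matching_segments(transcript, recognized)
for start, end, words in segments:
    print(start, end, " ".join(words))
```

In a real pipeline the retained word ranges would be mapped back to audio time stamps, so that only the matching speech sections and their word sequences enter acoustic model training.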


Tools, Speech database
