Title Comparative Evaluations in the Domain of Automatic Speech Recognition 
Author(s) Alex Trutnev, Martin Rajman

LIA-ICC-SCC-EPFL

Session O32-ES
Abstract The goal of this contribution is to present the results of a comparative evaluation of several academic and commercial speech recognition engines (SREs). The evaluation was carried out at EPFL. The same test data sets were used for all systems, and the comparison was based on the Word Error Rate (WER) scores obtained. Besides producing WER scores for several SREs under identical conditions, one important objective of this work was to study the relative performance of HMM (Hidden Markov Model) and hybrid HMM-ANN (Hidden Markov Model-Artificial Neural Network) technologies, as used in state-of-the-art systems. A second important objective was to evaluate whether having control over the complete recognition process (in particular, being able to modify the acoustic and language models used by the SREs), as is often the case for academic SREs, leads to an observable advantage over commercial systems, which are most often delivered with predefined, non-modifiable models.
The evaluated SREs were all speaker-independent continuous speech recognition engines, either academic systems widely used in the research community or commercial tools currently available on the market. In this work, we considered 3 academic systems (HTK, Sirroco, and Strut/DRSpeech) and 2 commercial systems (SRE1 and SRE2). HTK and Sirroco are HMM-based systems, while Strut/DRSpeech uses a hybrid approach in which ANNs (multi-layer perceptrons) are used to estimate the phoneme probability distributions. The SRE1 and SRE2 engines are provided with their own models.
Various types of data, in two different languages (French and German), were used for the training (acoustic model and language model) and the evaluation of the systems.
More precisely, for the training, we used:
1. the Swiss French Polyphone (SFP) database (continuous read speech in French, ~3'000 utterances spoken by ~400 persons) to train the acoustic models of both HMM-based and ANN-based systems for French;
2. the German SpeechDat (SDGe) database (continuous read speech in German, ~5'000 sentences spoken by ~3'000 persons) to train the acoustic models of the HMM-based systems for German;
3. the transcriptions of spontaneous speech extracted either from dialogues recorded during the InfoVox project (yielding a corpus of ~20'000 words in French) or from the SDGe database (yielding a corpus of ~400'000 words in German) to train the French and German language models.
For the evaluation itself:
1. continuous free telephone-quality speech recordings, consisting of user utterances from human-machine dialogues with a telephone-based vocal information system about restaurants, and telephone-quality digit recordings extracted from the SFP database were used for the evaluation in French; the recordings of user utterances were produced during the 2 field-tests of the InfoVox project and correspond to a dataset of ~900 spoken utterances; the recorded digits correspond to a dataset of ~400 spoken utterances;
2. continuous microphone-quality read speech recordings, consisting of user utterances from human-machine dialogues with a smart home system (Inspire project), and telephone-quality digit recordings extracted from the SDGe database were used for the evaluation in German; the recordings of user utterances were produced during the Inspire project and correspond to a dataset of ~1'400 spoken utterances; the recorded digits correspond to a dataset of 500 spoken utterances; furthermore, different types of noise generation techniques corresponding to ~30 noise conditions were applied to the recorded user utterances, thereby yielding an augmented evaluation database.
The following evaluation protocol was used: 
1. training of the acoustic models: the SRE1 and SRE2 systems are provided with their own internal acoustic models; acoustic model training was therefore required only for the academic systems;
2. training of the language models: the SRE1 systems are provided with specific tools to train language models in their proprietary formats. For the other systems, the CMU Toolkit was used.
For both the acoustic models and the language models, the above-mentioned training data was used systematically, thus yielding comparable models and increasing the relevance of the comparative evaluation of recognition performance.
3. the training of the various meta-parameters (such as the language model scaling factor for HTK, or the inter-model transition penalties for HTK, Sirroco and Strut/DRSpeech) was performed for all systems on the basis of the same training data (~20% of the InfoVox and Inspire utterances);
4. for the evaluation as such, the above-mentioned test data (SFP and SDGe digits and ~80% of the InfoVox and Inspire utterances) was used for all systems; recognition performance was therefore evaluated for all 5 systems in French and in German using the MAPSSWE test implemented in the 'sclite' tool; WER scores for each of the SREs and for each of the test data sets were produced and compared.
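As a reminder of what the compared scores measure, WER is conventionally defined as the number of word substitutions, deletions and insertions needed to align the recognizer output with the reference transcription, divided by the number of reference words. The following is a minimal sketch of that definition, assuming whitespace-tokenized transcriptions and pooling errors over all utterances of a test set; it is not the 'sclite' implementation, and the function names `edit_distance` and `corpus_wer` are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, deletions and insertions
    needed to turn `ref` into `hyp` (Levenshtein distance over words)."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER in percent over (reference, hypothesis) transcription pairs,
    with errors and reference words summed over the whole test set."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        errors += edit_distance(ref, hypothesis.split())
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("the reference transcriptions contain no words")
    return 100.0 * errors / ref_words
```

Because insertions count as errors while the denominator only counts reference words, a WER computed this way can exceed 100% on very noisy output.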
The main results obtained during the experiments were the following:
1. the evaluation of the French HMM-based acoustic model on SFP digits shows a WER of 22.1% vs a WER of 36.0% obtained with the ANN-based acoustic model. Furthermore, the evaluation of the French acoustic models of the commercial systems on the SFP digits shows a WER of 8.0% for SRE2 and of 10.1% for SRE1;
2. the evaluation of the HMM-based HTK system on InfoVox utterances shows a WER of 61.5% for the first field-test and 63.3% for the second vs 72.5% for the first field-test and 76.6% for the second for the Strut/DRSpeech system. SRE2 shows a WER of 65.5% for the first field-test and 65.0% for the second. The corresponding results for SRE1 are 67.9% and 66.3% respectively;
3. the evaluation of German acoustic models on SDGe digits shows a WER of 26.6% for the HMM-based acoustic model, 0.4% for the SRE1 acoustic model and 5.8% for the SRE2 acoustic model. Furthermore, HTK shows a WER of 81.0% for Inspire utterances vs 35.8% obtained with SRE1. 
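The gaps between the WER figures listed above can also be read as relative reductions, which makes differences at different error levels easier to compare. The sketch below is a derived illustration only: the relative reduction is not a metric reported in this abstract, the helper name `relative_wer_reduction` is hypothetical, and the input values are the WER percentages quoted above.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction (in percent) of a system over a baseline.
    Positive values mean the system makes fewer errors than the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# (test set, baseline WER %, system WER %), taken from the results above:
# the hybrid HMM-ANN figures serve as baseline, the HMM figures as system.
comparisons = [
    ("SFP digits", 36.0, 22.1),
    ("InfoVox field-test 1", 72.5, 61.5),
    ("InfoVox field-test 2", 76.6, 63.3),
]

for test_set, ann_wer, hmm_wer in comparisons:
    reduction = relative_wer_reduction(ann_wer, hmm_wer)
    print(f"{test_set}: {reduction:.1f}% relative WER reduction")
```

On the SFP digits, for instance, this gives a relative reduction of about 38.6% for the HMM-based acoustic model over the ANN-based one.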
In conclusion, the results show that the HMM-based technology performs better than the hybrid approach. HMM performance also remains better than that of the commercial systems for continuous French speech. On the other hand, the commercial systems show better recognition accuracy for continuous German speech.
Keyword(s) Automatic Speech Recognition (ASR), Cross Validation, Word Error Rate (WER), acoustic model, language model, Hidden Markov Model (HMM), Artificial Neural Network (ANN)
Language(s) French, English, German
Full Paper 654.pdf