AV@CAR: A Spanish Multichannel Multimodal Corpus for In-Vehicle Automatic Audio-Visual Speech Recognition


Alfonso Ortega (1) , Federico Sukno (2), Eduardo LLeida (1), Alejandro Frangi (2), Antonio Miguel (1), Luis Buera (1), Ernesto Zacur (2)

(1) Communication Technologies Group and (2) Computer Vision Group - Aragon Institute of Engineering Research (I3A) University of Zaragoza, Spain




This paper describes the acquisition of the multichannel multimodal database AV@CAR for automatic audio-visual speech recognition in cars. Automatic speech recognition (ASR) plays an important role inside vehicles to keep the driver away from distraction. It is also known that visual information (lip-reading) can improve accuracy in ASR under adverse conditions as those within a car. The corpus described here is intended to provide training and testing material for several classes of audiovisual speech recognizers including isolated word system, word-spotting systems, vocabulary independent systems, and speaker dependent or speaker independent systems for a wide range of applications. The audio database is composed of seven audio channels including, clean speech (captured using a close talk microphone), noisy speech from several microphones placed on the overhead of the cabin, noise only signal coming from the engine compartment and information about the speed of the car. For the video database, a small video camera sensible to the visible and the near infrared bands is placed on the windscreen and used to capture the face of the driver. This is done under different light conditions both during the day and at night. Additionally, the same individuals are recorded in laboratory, under controlled environment conditions to obtain noise free speech signals, 2D images and 3D + texture face models.


Multimodal, Multichannel, Audio-Visual Automatic Speech Recognition, car, 3D images

Language(s) Spanish
Full Paper