Framework for data-driven, video-realistic audio-visual speech synthesis
Institut für Kommunikationsforschung und Phonetik, Universität Bonn
We present a framework for generating a video-realistic audio-visual “Talking Head” that can be integrated into applications as a natural human-computer interface, particularly where audio alone is not an appropriate output channel, such as in noisy environments. Our work combines 2D video-frame concatenative visual synthesis with a unit-selection-based text-to-speech system. To produce a synchronized audio-video stream containing novel utterances the speaker never made in the recorded corpus, we employ data-driven selection and concatenation techniques. The framework is organized into an offline and an online processing stage: the offline module prepares the data used at runtime, while the online module performs the audio-visual synthesis. The generated output stream resembles a camera-recorded video. Within the framework, the visual output can be combined with different speakers, or one speaker can be paired with different visual mappings. Our framework is built on German speech and video data but can easily be adapted to other languages.
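To illustrate the data-driven selection and concatenation the abstract refers to, the following is a minimal sketch of unit selection as a Viterbi-style dynamic-programming search. All names (`select_units`, `target_cost`, `join_cost`) and the toy costs are hypothetical illustrations of the general technique, not the paper's actual implementation: each target position offers candidate units from the corpus, and the search minimizes the summed target costs plus concatenation (join) costs between adjacent units.

```python
def select_units(candidates, target_cost, join_cost):
    """Pick one unit per target position, minimizing total cost.

    candidates  -- list of lists: candidate units per target position
    target_cost -- target_cost(i, unit): fit of unit at position i
    join_cost   -- join_cost(prev_unit, unit): concatenation smoothness
    """
    n = len(candidates)
    # cost[i][j]: best accumulated cost ending in candidate j at position i
    cost = [[target_cost(0, u) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]  # backpointers for backtracking
    for i in range(1, n):
        row, brow = [], []
        for u in candidates[i]:
            # best predecessor for this candidate
            k = min(range(len(candidates[i - 1])),
                    key=lambda k: cost[i - 1][k]
                    + join_cost(candidates[i - 1][k], u))
            row.append(cost[i - 1][k]
                       + join_cost(candidates[i - 1][k], u)
                       + target_cost(i, u))
            brow.append(k)
        cost.append(row)
        back.append(brow)
    # backtrack from the cheapest final candidate
    j = min(range(len(candidates[-1])), key=lambda k: cost[-1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]


# Toy usage: units are corpus frame indices; the join cost prefers
# consecutive frames (smooth video), the target cost prefers low indices.
candidates = [[0, 5], [1, 6], [2, 7]]
tc = lambda i, u: 0.0 if u < 5 else 0.5
jc = lambda a, b: abs(b - a - 1)
print(select_units(candidates, tc, jc))  # → [0, 1, 2]
```

In a real system the same search would operate on phone- or frame-sized units, with target costs derived from linguistic context and join costs from acoustic and visual discontinuity measures.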
Speech Synthesis, Computer Vision, Human-Computer Interaction, Multimodal Corpora, Dialog Systems