Evaluating Factors Impacting the Accuracy of Forced Alignments in a Multimodal Corpus


Lei Chen (1), Yang Liu (1,2), Mary Harper (1), Eduardo Maia (1), Susan McRoy (3)

(1) Electrical and Computer Engineering at Purdue University; (2) The International Computer Science Institute; (3) Computer Science at the University of Wisconsin-Milwaukee




When processing human-to-human communication, people draw on every available cue to understand it, including speech and information such as the time and location of an interlocutor's gesture and gaze. Speech and gesture are known to exhibit a synchronous relationship in human communication; however, the precise nature of that relationship requires further investigation. The construction of computer models of multimodal human communication would be enabled by the availability of multimodal communication corpora annotated with synchronized gesture and speech features. To investigate the temporal relationships among these knowledge sources, we have collected and are annotating several multimodal corpora with time-aligned features. Forced alignment between a speech file and its transcription is a crucial part of multimodal corpus production. This paper investigates a number of factors that may contribute to highly accurate forced alignments, and thereby support the rapid production of these multimodal corpora: the acoustic model, the match between the speech used to train the system and the speech to be force-aligned, the amount of data used to train the automatic speech recognition (ASR) system, the availability of speaker adaptation, and the duration of alignment segments.


Forced Alignment, Multimodal Dialog, Corpus Production, Time Alignment


