Reducing Segmental Duration Variation by Local Speech Rate Normalization of Large Spoken Language Resources


Hartmut R. Pfitzinger (Department of Phonetics, University of Munich)


SP1: Speech Resources


We developed a time-domain normalization procedure that takes a speech signal and its corresponding speech rate contour as input and produces the normalized speech signal. We then normalized the speech rate of a large spoken language resource of German read speech. We compared the resulting segment durations with the original durations using several three-way ANOVAs with phone type and speaker as independent variables, since we assume that segment duration variation is determined by segment type (intrinsic duration), by the speaker (speech rate, sociolect, idiolect, dialect, speech production variation), and by linguistic effects (context, syllable structure, accent, and stress). One important result of the statistical analysis was that the influence of the speaker on segment duration variation decreased dramatically (factor 0.54 for vowels, factor 0.29 for consonants) when speech rate was normalized, even though sociolect, idiolect, and dialect remained almost unchanged. Since the interaction between the independent variables speaker and phone type remained constant, we hypothesize that this interaction contains most of the speaker-specific information.
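
The effect of local speech rate normalization on segment durations can be illustrated with a minimal sketch. It is not the procedure used in the paper: it assumes a piecewise-linear rate contour, a hypothetical target rate, and a simple time warp in which each instant is stretched or compressed by the ratio of local rate to target rate. It only maps segment boundary times and does not resynthesize any audio. The names `warp_time` and `normalized_durations`, and all numeric values, are illustrative assumptions.

```python
def warp_time(t, rate_times, rate_values, target_rate):
    """Map an original time t to a normalized time.

    Assumption: the warp is the integral of local_rate / target_rate,
    with the rate contour linearly interpolated between its samples.
    """
    if not rate_times[0] <= t <= rate_times[-1]:
        raise ValueError("t lies outside the rate contour")
    warped = 0.0
    for i in range(len(rate_times) - 1):
        t0, t1 = rate_times[i], rate_times[i + 1]
        r0, r1 = rate_values[i], rate_values[i + 1]
        if t <= t0:
            break
        end = min(t, t1)
        # Rate at the end of the integrated stretch (linear interpolation).
        r_end = r0 + (r1 - r0) * (end - t0) / (t1 - t0)
        # Trapezoid area is exact for a linear rate segment.
        warped += 0.5 * (r0 + r_end) * (end - t0) / target_rate
    return warped


def normalized_durations(boundaries, rate_times, rate_values, target_rate):
    """Return segment durations after warping consecutive boundary times."""
    warped = [warp_time(b, rate_times, rate_values, target_rate)
              for b in boundaries]
    return [b - a for a, b in zip(warped, warped[1:])]


# Hypothetical contour: fast speech for one second, then slowing down.
rate_times = [0.0, 1.0, 2.0]       # seconds
rate_values = [6.0, 6.0, 3.0]      # arbitrary rate units (assumed)
boundaries = [0.0, 0.5, 1.0, 2.0]  # segment boundaries in seconds

durations = normalized_durations(boundaries, rate_times, rate_values,
                                 target_rate=4.0)
print(durations)
```

In this sketch, segments spoken faster than the target rate are lengthened and segments spoken more slowly are shortened, which is the mechanism by which rate-related duration variation would be reduced.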


Prosody, Speaker characteristics, Timing invariance and variability, Local speech rate

Full Paper