A data-driven adaptation of prosody in a multilingual TTS


Janez Stergar (1), Caglayan Erdem (2), Bogomir Horvat (1), Zdravko Kačič (1)

(1) University of Maribor, Faculty of Electrical Engineering and Computer Science Maribor, Slovenia, (2) Siemens Corporate Technology, Dept. CTIC 5, 81730 Munich, Germany, (janez.stergar, bogo.horvat, kacic)@uni-mb.si, caglayan.erdem@bmw.de




Proper accentuation and phrasing make the syntactic and semantic structure of a message more transparent to the listener. Good prosody modeling in a TTS system therefore has to be structured into appropriate levels. The implemented prosodic hierarchy should guide the listener’s attention and support the comprehension process. Since poorly modeled prosody can act as a distractor, the prosody module of a TTS system must be built very carefully. To improve naturalness, we introduce a selective hierarchical approach to prominence disambiguation and symbolic modeling. We discuss the selective, statistically based concept for prominence disambiguation and prediction, and describe how the neural network (NN) module for predicting symbolic tags is integrated into a multilingual TTS system. We conclude with prediction results and a suitability test of the selective approach, based on preliminary acoustic tests performed in a multilingual TTS system.


multilingual TTS, prosody, data-driven adaptation, Slovenian language, neural networks

Language(s) Slovenian
Full Paper