|Title||Perceptual Evaluation of Quality Deterioration Owing to Prosody Modification|
Kazuki Adachi (1), Tomoki Toda (2), Hiromichi Kawanami (1), Hiroshi Saruwatari (1), Kiyohiro Shikano (1)
(1) Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma-shi, Nara-ken, Japan; (2) Nagoya Institute of Technology, Japan / Carnegie Mellon University, U.S.A.
|Abstract||Our research goal is to construct a Japanese TTS (Text-to-Speech) system that can output speech with various kinds of prosody. Since such synthetic speech is useful in practice, many TTS systems implement global prosodic control. Fundamentally, however, they are designed to output speech with standard pitch and speech rate. We discuss a synthesis method for high-quality speech with extreme prosody (very high, low, fast, or slow) from the viewpoint of the speech database. As the speech synthesis method, we employ unit selection and concatenation. We also introduce an analysis-synthesis process to give precise target prosody to the output speech. Many studies have reported that speech quality deteriorates in proportion to the amount of prosody modification by analysis-synthesis or PSOLA. Following these reports, we take an approach that reduces the prosody modification applied to each speech segment. Nine Japanese speech databases with different prosodic characteristics are prepared. First, we confirm the relationship between speech quality deterioration and prosody modification through objective and subjective tests using synthetic speech. We also investigate the relationship between the deterioration tendency and each speech database. The results indicate that the tendencies depend on the prosodic features of the original speech.|
|Keyword(s)||Speech synthesis, prosody modification, speech database, analysis-synthesis, perceptual evaluation|