Title Automatic Generation of Compound Word Lexicon for Hindi Speech Synthesis
Author(s) Deepa S.R. (1), Kalika Bali (2), Partha Pratim Talukdar (2), A.G. Ramakrishnan (3)

(1) Birla Institute of Technology & Science, Pilani, Rajasthan, India, f2000073@bits-pilani.ac.in; (2) Hewlett-Packard Labs, 24 Salarpuria Arena, Hosur Road, Bangalore, India, {kalika.bali, partha.talukdar}@hp.com; (3) Department of Electrical Engineering, Indian Institute of Science, Bangalore, India, ramkiag@ee.iisc.ernet.in

Abstract This paper addresses the problem of Hindi compound word splitting and its relevance to developing a good-quality phonetizer for Hindi speech synthesis. The constituents of a Hindi compound word are not separated by a space or hyphen; hence, most existing compound splitting algorithms cannot be applied to Hindi. We propose a new technique for the automatic extraction of compound words from a Hindi corpus. In preliminary tests, the algorithm split 92 to 96% of the input compound words, and around 83 to 87% of these splits were correct. A few modifications are suggested that are expected to improve the accuracy of the splits. Finally, we observe an improvement of 1.6% in Hindi Grapheme-to-Phoneme (G2P) conversion as a result of using a phonetized compound word lexicon created by this technique.
Keyword(s) Compound word lexicon, speech synthesis, Hindi Grapheme-to-Phoneme (G2P) conversion, schwa deletion
Language(s) Hindi
