A Galician Textual Corpus for Morphosyntactic Tagging with Application to Text-to-Speech Synthesis
Lorena Seijo Pereiro (1), Ana Martínez Ínsua (1), Francisco Méndez Pazó (2), Francisco Campillo Díaz (2), Eduardo Rodríguez Banga (2)
(1) Centro Ramón Piñeiro para a Investigación en Humanidades. Xunta de Galicia. Santiago de Compostela. SPAIN; (2) Dpto. Teoría de la Señal y Comunicaciones. Universidad de Vigo. Vigo. SPAIN
This paper presents the morphosyntactic tagger and the corpus of contemporary written Galician that are being used to develop the Galician version of our text-to-speech synthesizer. Their quality and accuracy make them useful for speech technology applications and potential references for further research on the Galician language. In essence, the tagger automatically assigns morphosyntactic categories and additional labels to the words in the corpus by combining a small but highly reliable set of rules with a stochastic language model based on class n-grams, whose probabilities are trained on the corpus itself. The corpus texts are tagged with a bootstrapping technique: a small amount of text is first tagged automatically using a reduced set of linguistic rules, the output is revised manually, and an initial statistical model is built from the revised result. Tagging then proceeds as a sequence of consecutive stages, each comprising automatic tagging with the latest version of the statistical model, manual revision, and updating of the stochastic model with the correctly tagged text.
Galician corpus, morphosyntactic tagger, class n-grams, part-of-speech, text-to-speech
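The bootstrapping cycle summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a class bigram model (the abstract only says "class n-grams"), add-one smoothing, a Viterbi search, and rules reduced to a word-to-tag lookup. The names `train_bigram`, `tag_sentence`, `bootstrap`, `revise`, the tag labels, and the toy data are all hypothetical, and `revise` stands in for the manual revision step.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start class


def train_bigram(tagged_sentences):
    """Count class-bigram transitions and word emissions from revised text."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    return trans, emit


def _logp(counter, key, size):
    """Add-one smoothed log-probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + size))


def tag_sentence(words, trans, emit, rules):
    """Viterbi search over class bigrams; a matching rule fixes the tag."""
    tags = sorted(emit)
    if not tags:  # no statistical model yet: rules only
        return [(w, rules.get(w.lower(), "UNK")) for w in words]
    vocab_size = len({w for c in emit.values() for w in c}) + 1
    best = {START: (0.0, [])}  # tag -> (log-score, best path ending in tag)
    for word in words:
        key = word.lower()
        candidates = [rules[key]] if key in rules else tags
        step = {}
        for tag in candidates:
            e = _logp(emit.get(tag, Counter()), key, vocab_size)
            score, path = max(
                ((s + _logp(trans.get(p, Counter()), tag, len(tags)) + e, pth)
                 for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (score, path + [tag])
        best = step
    _, path = max(best.values(), key=lambda item: item[0])
    return list(zip(words, path))


def bootstrap(batches, rules, revise):
    """Tag each batch, revise it, and retrain the model on the growing corpus."""
    corpus = []
    trans, emit = train_bigram(corpus)
    for batch in batches:
        auto = [tag_sentence(s, trans, emit, rules) for s in batch]
        corpus.extend(revise(auto))  # manual revision in the real workflow
        trans, emit = train_bigram(corpus)
    return trans, emit, corpus


# Toy data, invented for illustration only.
rules = {"o": "DET", "a": "DET"}
gold = {"o": "DET", "a": "DET", "can": "NOUN", "gata": "NOUN",
        "ladra": "VERB", "dorme": "VERB"}


def revise(sentences):
    """Stand-in for a human annotator: replace every tag with the gold tag."""
    return [[(w, gold[w.lower()]) for w, _ in s] for s in sentences]


batches = [[["o", "can", "ladra"]], [["a", "gata", "dorme"]]]
trans, emit, corpus = bootstrap(batches, rules, revise)
tagged = tag_sentence(["o", "can", "ladra"], trans, emit, rules)
```

In this sketch the first batch is tagged by rules alone because no model exists yet, which mirrors the initial stage described in the abstract. Each later batch is tagged with the model trained on all previously revised text.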