LREC 2018 Proceedings

Summary of the paper

Title	Text Normalization Infrastructure that Scales to Hundreds of Language Varieties
Authors	Mason Chua, Daan Van Esch, Noah Coccaro, Eunjoon Cho, Sujeet Bhandari and Libin Jia
Abstract	We describe the automated multi-language text normalization infrastructure that prepares textual data to train language models used in Google’s keyboards and speech recognition systems, across hundreds of language varieties. Training corpora are sourced from various types of data sets, and the text is then normalized using a sequence of hand-written grammars and learned models. These systems need to scale to hundreds or thousands of language varieties in order to meet product needs. Frequent data refreshes, privacy considerations and simultaneous updates across such a high number of languages make manual inspection of the normalized training data infeasible, while there is ample opportunity for data normalization issues. By tracking metrics about the data and how it was processed, we are able to catch internal data processing issues and external data corruption issues that can be hard to notice using standard extrinsic evaluation methods. Showing the importance of paying attention to data normalization behavior in large-scale pipelines, these metrics have highlighted issues in Google’s real-world speech recognition system that have caused significant, but latent, quality degradation.
Topics	Industrial Systems
Bibtex	@InProceedings{CHUA18.8883,
  author = {Mason Chua and Daan Van Esch and Noah Coccaro and Eunjoon Cho and Sujeet Bhandari and Libin Jia},
  title = "{Text Normalization Infrastructure that Scales to Hundreds of Language Varieties}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
}