Summary of the paper

Title Text Normalization Infrastructure that Scales to Hundreds of Language Varieties
Authors Mason Chua, Daan Van Esch, Noah Coccaro, Eunjoon Cho, Sujeet Bhandari and Libin Jia
Abstract We describe the automated multi-language text normalization infrastructure that prepares textual data to train language models used in Google’s keyboards and speech recognition systems, across hundreds of language varieties. Training corpora are sourced from various types of data sets, and the text is then normalized using a sequence of hand-written grammars and learned models. These systems need to scale to hundreds or thousands of language varieties in order to meet product needs. Frequent data refreshes, privacy considerations and simultaneous updates across such a high number of languages make manual inspection of the normalized training data infeasible, while there is ample opportunity for data normalization issues. By tracking metrics about the data and how it was processed, we are able to catch internal data processing issues and external data corruption issues that can be hard to notice using standard extrinsic evaluation methods. Showing the importance of paying attention to data normalization behavior in large-scale pipelines, these metrics have highlighted issues in Google’s real-world speech recognition system that have caused significant, but latent, quality degradation.
Topics Industrial Systems
Bibtex
@InProceedings{CHUA18.8883,
  author = {Mason Chua and Daan Van Esch and Noah Coccaro and Eunjoon Cho and Sujeet Bhandari and Libin Jia},
  title = "{Text Normalization Infrastructure that Scales to Hundreds of Language Varieties}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
}