LREC 2018 Proceedings

Summary of the paper

Title	Discovering Canonical Indian English Accents: A Crowdsourcing-based Approach
Authors	Sunayana Sitaram, Varun Manjunath, Varun Bharadwaj, Monojit Choudhury, Kalika Bali and Michael Tjalve
Abstract	Automatic Speech Recognition (ASR) systems typically degrade in performance when recognizing an accent different from the accents in the training data. One way to overcome this problem without training new models for every accent is adaptation. India has over a hundred major languages, which leads to many variants in Indian English accents. Making an ASR system work well for Indian English would involve collecting data for all representative accents in Indian English and then adapting Acoustic Models for each of those accents. However, given the number of languages that exist in India and the lack of prior work in the literature on how many Indian English accents exist, it is difficult to come up with a set of canonical accents that could sufficiently capture the variations observed in Indian English. In addition, there is a lack of labeled corpora of accents in Indian English. We approach the problem of determining a set of canonical Indian English accents by taking a crowdsourcing-based approach. We conduct a mobile app-based user study in which we play audio samples collected from all over India and ask users to identify the geographical origin of the speaker. We measure the consensus among users to come up with a set of candidate accents in Indian English and identify which accents are best recognized and which ones are confusable. We extend our preliminary user study to a web app-based study that can potentially generate more labeled data for Indian English accents. We describe results and challenges encountered in a pilot study conducted using the web app and future work to scale up the study.
Topics	Speech Resource/Database, Multilinguality, Other
Full paper	Discovering Canonical Indian English Accents: A Crowdsourcing-based Approach
Bibtex	@InProceedings{SITARAM18.279, author = {Sunayana Sitaram and Varun Manjunath and Varun Bharadwaj and Monojit Choudhury and Kalika Bali and Michael Tjalve}, title = "{Discovering Canonical Indian English Accents: A Crowdsourcing-based Approach}", booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)}, year = {2018}, month = {May 7-12, 2018}, address = {Miyazaki, Japan}, editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga}, publisher = {European Language Resources Association (ELRA)}, isbn = {979-10-95546-00-9}, language = {english} }