LREC 2018 Proceedings

Summary of the paper

Title	CPJD Corpus: Crowdsourced Parallel Speech Corpus of Japanese Dialects
Authors	Shinnosuke Takamichi and Hiroshi Saruwatari
Abstract	Public parallel corpora of dialects can accelerate related studies such as spoken language processing. Various corpora have been collected in well-equipped recording environments, such as an anechoic room. However, because of geographical and cost constraints, it is impossible to use such an ideal recording environment to collect all existing dialects. To address this problem, we used web-based recording and crowdsourcing platforms to construct a crowdsourced parallel speech corpus of Japanese dialects (CPJD corpus), which includes parallel text and speech data for 21 Japanese dialects. We recruited native dialect speakers on the crowdsourcing platform, and the hired speakers recorded their dialect speech at home using their personal computers or smartphones. This paper presents the results of the data collection and analyzes the audio data in terms of signal-to-noise ratio and mispronunciations.
Topics	Corpus (Creation, Annotation, Etc.), Other
Bibtex	@InProceedings{TAKAMICHI18.67,
  author = {Shinnosuke Takamichi and Hiroshi Saruwatari},
  title = "{CPJD Corpus: Crowdsourced Parallel Speech Corpus of Japanese Dialects}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
}