LREC 2018 Proceedings

Summary of the paper

Title	JESC: Japanese-English Subtitle Corpus
Authors	Reid Pryzant, Youngjoo Chung, Dan Jurafsky and Denny Britz
Abstract	In this paper we describe the Japanese-English Subtitle Corpus (JESC). JESC is a large Japanese-English parallel corpus covering the underrepresented domain of conversational dialogue. It consists of more than 3.2 million examples, making it the largest freely available dataset of its kind. The corpus was assembled by crawling and aligning subtitles found on the web. The assembly process incorporates a number of novel preprocessing elements to ensure high monolingual fluency and accurate bilingual alignments. We summarize its contents and evaluate its quality using human experts and baseline machine translation (MT) systems.
Topics	Corpus (Creation, Annotation, Etc.), Other
Bibtex	@InProceedings{PRYZANT18.30,
  author = {Reid Pryzant and Youngjoo Chung and Dan Jurafsky and Denny Britz},
  title = "{JESC: Japanese-English Subtitle Corpus}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
}