Summary of the paper

Title Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair
Authors Nikola Ljubešić, Miquel Esplà-Gomis, Antonio Toral, Sergio Ortiz Rojas and Filip Klubička
Abstract "This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain "".hr"" and the Slovene top-level domain "".si"", and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs."
Topics Corpus (Creation, Annotation, etc.), Multilinguality, Machine Translation, SpeechToSpeech Translation
Full paper Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair
Bibtex @InProceedings{LJUBEI16.857,
  author = {Nikola Ljubešić and Miquel Esplà-Gomis and Antonio Toral and Sergio Ortiz Rojas and Filip Klubička},
  title = {Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor's Love Affair},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portoro┼ż, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
 }
Powered by ELDA © 2016 ELDA/ELRA