Summary of the paper

Title The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine
Authors Mariana Neves, Antonio Jimeno Yepes and Aurélie Névéol
Abstract The biomedical scientific literature is a rich source of information not only in the English language, for which it is more abundant, but also in other languages, such as Portuguese, Spanish and French. We present the first freely available parallel corpus of scientific publications for the biomedical domain. Documents from the “Biological Sciences” and “Health Sciences” categories were retrieved from the Scielo database and parallel titles and abstracts are available for the following language pairs: Portuguese/English (about 86,000 documents in total), Spanish/English (about 95,000 documents) and French/English (about 2,000 documents). Additionally, monolingual data was also collected for all four languages. Sentences in the parallel corpus were automatically aligned and a manual analysis of 200 documents by native experts found that a minimum of 79% of sentences were correctly aligned in all language pairs. We demonstrate the utility of the corpus by running baseline machine translation experiments. We show that for all language pairs, a statistical machine translation system trained on the parallel corpora achieves performance that rivals or exceeds the state of the art in the biomedical domain. Furthermore, the corpora are currently being used in the biomedical task in the First Conference on Machine Translation (WMT’16).
Topics Corpus (Creation, Annotation, etc.), Machine Translation, SpeechToSpeech Translation, Multilinguality
Full paper The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine
Bibtex @InProceedings{NEVES16.800,
  author = {Mariana Neves and Antonio Jimeno Yepes and Aurélie Névéol},
  title = {The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
}