Summary of the paper

Title The brWaC Corpus: A New Open Resource for Brazilian Portuguese
Authors Jorge Alberto Wagner Filho, Rodrigo Wilkens, Marco Idiart and Aline Villavicencio
Abstract In this work, we present the construction process of a large Web corpus for Brazilian Portuguese, aiming to achieve a size comparable to the state of the art in other languages. We also discuss our updated sentence-level approach for the strict removal of duplicated content. Following the pipeline methodology, more than 60 million pages were crawled and filtered, with 3.5 million being selected. The obtained multi-domain corpus, named brWaC, is composed of 2.7 billion tokens, and has been annotated with tagging and parsing information. The incidence of non-unique long sentences, an indication of replicated content, which reaches 9% in other Web corpora, was reduced to only 0.5%. Domain diversity was also maximized, with 120,000 different websites contributing content. We are making our new resource freely available to the research community, both for querying and downloading, in the expectation of aiding new advances in the processing of Brazilian Portuguese.
Topics Tools, Systems, Applications, Corpus (Creation, Annotation, Etc.), Other
Full paper The brWaC Corpus: A New Open Resource for Brazilian Portuguese
Bibtex @InProceedings{WAGNER FILHO18.599,
  author = {Jorge Alberto Wagner Filho and Rodrigo Wilkens and Marco Idiart and Aline Villavicencio},
  title = "{The brWaC Corpus: A New Open Resource for Brazilian Portuguese}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
