Summary of the paper

Title LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl
Authors Szymon Roziewski and Wojciech Stokowiec
Abstract The web data contains immense amount of data, hundreds of billion words are waiting to be extracted and used for language research. In this work we introduce our tool LanguageCrawl which allows NLP researchers to easily construct web-scale corpus from Common Crawl Archive: a petabyte scale, open repository of web crawl information. Three use-cases are presented: filtering Polish websites, building an N-gram corpora and training continuous skip-gram language model with hierarchical softmax. Each of them has been implemented within the LanguageCrawl toolkit, with the possibility to adjust specified language and N-gram ranks. Special effort has been put on high computing efficiency, by applying highly concurrent multitasking. We make our tool publicly available to enrich NLP resources. We strongly believe that our work will help to facilitate NLP research, especially in under-resourced languages, where the lack of appropriately sized corpora is a serious hindrance to applying data-intensive methods, such as deep neural networks.
Topics Corpus (Creation, Annotation, etc.), Text Mining, Tools, Systems, Applications
Full paper LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl
Bibtex @InProceedings{ROZIEWSKI16.489,
  author = {Szymon Roziewski and Wojciech Stokowiec},
  title = {LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portoro┼ż, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
Powered by ELDA © 2016 ELDA/ELRA