LREC 2014 Proceedings

Summary of the paper

Title	N-gram Counts and Language Models from the Common Crawl
Authors	Christian Buck, Kenneth Heafield and Bas Van Ooyen
Abstract	We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection of over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the corpus was processed, with emphasis on the problems that arise in working with data at this scale. Our unpruned Kneser-Ney English 5-gram language model, built on 975 billion deduplicated tokens, contains over 500 billion unique n-grams. We show gains of 0.5-1.4 BLEU by using large language models to translate into various languages.
Topics	Machine Translation, Speech-to-Speech Translation, Corpus (Creation, Annotation, etc.)
Full paper	N-gram Counts and Language Models from the Common Crawl
Bibtex	@InProceedings{BUCK14.1097,
  author = {Christian Buck and Kenneth Heafield and Bas Van Ooyen},
  title = {N-gram Counts and Language Models from the Common Crawl},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year = {2014},
  month = {may},
  date = {26-31},
  address = {Reykjavik, Iceland},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-8-4},
  language = {english}
}