LREC 2016 Proceedings

Summary of the paper

Title	Domain-Specific Corpus Expansion with Focused Webcrawling
Authors	Steffen Remus and Chris Biemann
Abstract	This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks. As input, it needs only N-grams or plain texts of a predefined domain, plus seed URLs as starting points. Two experiments demonstrate that our focused crawler is able to stay focused in domain and language. The first experiment shows that the crawler stays in a focused domain; the second demonstrates that language models trained on focused crawls obtain better perplexity scores on in-domain corpora. We distribute the focused crawler as open source software.
Topics	Acquisition, Corpus (Creation, Annotation, etc.), Language Modelling
Bibtex	@InProceedings{REMUS16.316,
  author = {Steffen Remus and Chris Biemann},
  title = {Domain-Specific Corpus Expansion with Focused Webcrawling},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
}
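
The abstract describes scoring documents by their relatedness to a domain using an N-gram language model, and evaluating with perplexity. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: it assumes a bigram model with add-one smoothing and whitespace tokenization, neither of which is stated in the abstract. The class name `BigramModel`, the helper `tokenize`, the sample texts, and the example.org URLs are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization; a stand-in for real preprocessing."""
    return text.lower().split()


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing (an assumption)."""

    def __init__(self, texts):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for text in texts:
            tokens = ["<s>"] + tokenize(text) + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        # One extra slot reserves probability mass for unseen words.
        self.vocab_size = len(self.unigrams) + 1

    def log_prob(self, prev, word):
        """Smoothed log P(word | prev)."""
        numerator = self.bigrams[(prev, word)] + 1
        denominator = self.unigrams[prev] + self.vocab_size
        return math.log(numerator / denominator)

    def perplexity(self, text):
        """Per-token perplexity of a text; lower means closer to the domain."""
        tokens = ["<s>"] + tokenize(text) + ["</s>"]
        pairs = list(zip(tokens, tokens[1:]))
        total = sum(self.log_prob(prev, word) for prev, word in pairs)
        return math.exp(-total / len(pairs))


# Hypothetical in-domain training texts.
domain_texts = [
    "language models estimate the relatedness of documents",
    "the crawler follows links to related documents",
]
model = BigramModel(domain_texts)

# Hypothetical candidate pages, keyed by URL.
candidates = {
    "http://example.org/out": "cheap flights and hotel deals this weekend",
    "http://example.org/in": "language models estimate the relatedness of documents",
}

# Rank candidates so that the lowest-perplexity (most in-domain) page comes first.
ranked = sorted(candidates, key=lambda url: model.perplexity(candidates[url]))
```

In a focused crawler, a ranking like `ranked` could be used to decide which pages to keep and which outgoing links to follow first. How the crawler described in the paper combines document and link scores is not specified in the abstract.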