Summary of the paper

Title: Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing
Authors: Noushin Rezapour Asheghi, Serge Sharoff and Katja Markert
Abstract: Research in Natural Language Processing often relies on large collections of manually annotated documents. However, there is currently no reliable genre-annotated corpus of web pages available for Automatic Genre Identification (AGI). In AGI, documents are classified by genre rather than by topic or subject. The major shortcoming of existing web genre collections is their relatively low inter-coder agreement, and the reliability of annotated data is essential for the reliability of any results built on it. In this paper, we present the first reliably annotated web genre corpus. We developed precise and consistent annotation guidelines consisting of well-defined and well-recognized categories. To annotate the corpus, we used crowd-sourcing, a novel approach in genre annotation. We computed chance-corrected inter-annotator agreement, both overall and for each individual category. The results show that the corpus has been annotated reliably.
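The abstract reports chance-corrected inter-annotator agreement over multiple annotators. The paper's exact coefficient is not stated here; as an illustration only, a minimal sketch of Fleiss' kappa, one common chance-corrected measure for more than two annotators, could look like:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a category-count matrix.

    counts[i][j] = number of annotators who assigned item i to
    category j; every row must sum to the same number of annotators n.
    """
    N = len(counts)          # number of annotated items
    n = sum(counts[0])       # annotators per item
    k = len(counts[0])       # number of genre categories
    # mean per-item observed agreement
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1))
        for row in counts
    ) / N
    # expected chance agreement from marginal category proportions
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```

With perfect agreement (e.g. all three annotators always choosing the same category) the function returns 1.0; values near 0 indicate chance-level agreement. Per-category agreement, as reported in the paper, is typically obtained by collapsing the matrix to "category j vs. rest" and recomputing.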
Topics: Crowdsourcing, Document Classification, Text Categorisation
Full paper: Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing
Bibtex @InProceedings{REZAPOURASHEGHI14.470,
  author = {Noushin Rezapour Asheghi and Serge Sharoff and Katja Markert},
  title = {Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year = {2014},
  month = {may},
  date = {26-31},
  address = {Reykjavik, Iceland},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-8-4},
  language = {english}
}