| Title | Textual Characteristics for Language Engineering | 
  
  | Authors | Mathias Bank, Robert Remus and Martin Schierle | 
  
  | Abstract | Language statistics are widely used to characterize and better understand language. In parallel, the amount of text mining and information retrieval methods grew rapidly within the last decades, with many algorithms evaluated on standardized corpora, often drawn from newspapers. However, up to now there were almost no attempts to link the areas of natural language processing and language statistics in order to properly characterize those evaluation corpora, and to help others to pick the most appropriate algorithms for their particular corpus. We believe no results in the field of natural language processing should be published without quantitatively describing the used corpora. Only then the real value of proposed methods can be determined and the transferability to corpora originating from different genres or domains can be estimated. We lay ground for a language engineering process by gathering and defining a set of textual characteristics we consider valuable with respect to building natural language processing systems. We carry out a case study for the analysis of automotive repair orders and explicitly call upon the scientific community to provide feedback and help to establish a good practice of corpus-aware evaluations. | 
  
  | Topics | Information Extraction, Information Retrieval, LR Infrastructures and Architectures, Tools, systems, applications | 
  
  | Full paper  | Textual Characteristics for Language Engineering | 
  
  | Bibtex | @InProceedings{BANK12.182, author =  {Mathias Bank and Robert Remus and Martin Schierle},
 title =  {Textual Characteristics for Language Engineering},
 booktitle =  {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
 year =  {2012},
 month =  {may},
 date =  {23-25},
 address =  {Istanbul, Turkey},
 editor =  {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
 publisher =  {European Language Resources Association (ELRA)},
 isbn =  {978-2-9517408-7-7},
 language =  {english}
 }
 |