LREC 2012 Proceedings

Summary of the paper

Title	Collecting and Analysing Chats and Tweets in SoNaR
Authors	Eric Sanders
Abstract	This paper describes a collection of chats and tweets from the Netherlands and Flanders. The chats and tweets are part of the freely available SoNaR corpus, a 500-million-word text corpus of the Dutch language. Recruitment, metadata, anonymisation and IPR issues are discussed. To illustrate how language use differs between the text types and across other parameters (such as gender and age), a simple text analysis in the form of unigram frequency lists is carried out. Furthermore, a website is presented with which users can retrieve their own frequency lists.
Topics	Corpus (creation, annotation, etc.), Metadata, Web Services
Full paper	Collecting and Analysing Chats and Tweets in SoNaR
Bibtex	@InProceedings{SANDERS12.416, author = {Eric Sanders}, title = {Collecting and Analysing Chats and Tweets in SoNaR}, booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)}, year = {2012}, month = {may}, date = {23-25}, address = {Istanbul, Turkey}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-7-7}, language = {english} }