LREC 2014 Proceedings

Summary of the paper

Title	TweetCaT: a Tool for Building Twitter Corpora of Smaller Languages
Authors	Nikola Ljubešić, Darja Fišer and Tomaž Erjavec
Abstract	This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages. Using the Twitter search API and a set of seed terms, the tool identifies users tweeting in the language of interest together with their friends and followers. By running the tool for 235 days, we tested it on the task of collecting two monitor corpora, one for Croatian and Serbian and the other for Slovene, thus also creating new and valuable resources for these languages. A post-processing step on the collected corpus is also described, which filters out users who tweet predominantly in a foreign language, thereby further cleaning the collected corpora. Finally, an experiment on discriminating between Croatian and Serbian Twitter users is reported.
Topics	Social Media Processing, Corpus (Creation, Annotation, etc.)
Full paper	TweetCaT: a Tool for Building Twitter Corpora of Smaller Languages
Bibtex	@InProceedings{LJUBEI14.834,
  author = {Nikola Ljubešić and Darja Fišer and Tomaž Erjavec},
  title = {TweetCaT: a Tool for Building Twitter Corpora of Smaller Languages},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year = {2014},
  month = {may},
  date = {26-31},
  address = {Reykjavik, Iceland},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-8-4},
  language = {english}
}
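
The two steps outlined in the abstract (finding users through seed terms and their friends and followers, then filtering out users who tweet predominantly in a foreign language) can be illustrated with a minimal Python sketch. This is not TweetCaT's actual code: the function names, the callables passed in (`search_users`, `get_neighbours`, `is_target_language`), the breadth-first order, the choice to expand only accepted users, and the 0.5 threshold are all assumptions made for illustration. No real Twitter API calls are made; the callables stand in for them.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def discover_users(
    seed_terms: Iterable[str],
    search_users: Callable[[str], Iterable[str]],
    get_neighbours: Callable[[str], Iterable[str]],
    is_target_language: Callable[[str], bool],
    max_users: int = 1000,
) -> Set[str]:
    """Collect users reachable from seed-term search hits.

    All three callables are hypothetical stand-ins for API requests and
    language identification; they are not part of TweetCaT.
    """
    queue = deque()
    seen: Set[str] = set()

    # Step 1: users whose tweets match a seed term become crawl candidates.
    for term in seed_terms:
        for user in search_users(term):
            if user not in seen:
                seen.add(user)
                queue.append(user)

    # Step 2: accept candidates in the target language and enqueue their
    # friends and followers (assumption: only accepted users are expanded).
    accepted: Set[str] = set()
    while queue and len(accepted) < max_users:
        user = queue.popleft()
        if not is_target_language(user):
            continue
        accepted.add(user)
        for neighbour in get_neighbours(user):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return accepted


def filter_foreign_users(
    tweet_langs: Dict[str, List[str]],
    target_langs: Set[str],
    min_share: float = 0.5,
) -> Set[str]:
    """Keep users whose share of target-language tweets is at least min_share.

    tweet_langs maps a user to one language label per tweet; the 0.5
    default is an illustrative threshold, not a value from the paper.
    """
    kept: Set[str] = set()
    for user, langs in tweet_langs.items():
        if not langs:
            continue  # no tweets, nothing to judge
        in_target = sum(1 for lang in langs if lang in target_langs)
        if in_target / len(langs) >= min_share:
            kept.add(user)
    return kept
```

In practice the callables would wrap rate-limited API requests and a language identifier, and the filter would run over the collected corpus as the post-processing step.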