Summary of the paper

Title TSix: A Human-involved-creation Dataset for Tweet Summarization
Authors Minh-Tien Nguyen, Dac Viet Lai, Huy-Tien Nguyen and Minh-Le Nguyen
Abstract We present a new dataset for tweet summarization. The dataset includes six events collected from Twitter from October 10 to November 9, 2016. Our dataset features two prominent properties. First, human-annotated gold-standard references allow extractive summarization methods to be evaluated correctly. Second, tweets are assigned to sub-topics divided by consecutive days, which facilitates incremental tweet stream summarization methods. To reveal the potential usefulness of our dataset, we compare several well-known summarization methods. Experimental results indicate that, among extractive approaches, hybrid term frequency -- document term frequency obtains competitive results in terms of ROUGE scores. The analysis also shows that polarity is an implicit factor of tweets in our dataset, suggesting that it can be exploited as a component besides tweet content quality in the summarization process.
Topics Summarisation, Information Extraction, Information Retrieval, Corpus (Creation, Annotation, Etc.)
Full paper TSix: A Human-involved-creation Dataset for Tweet Summarization
