What is my Style? Using Stylistic Features of Portuguese Web Texts to classify Web pages according to Users' Needs
Rachel Aires (1, 2), Aline Manfrin (1), Sandra Maria Aluísio (1), Diana Santos (2)
(1) NILC/ICMC - USP; (2) Linguateca, SINTEF
In this paper we investigate the use of stylistic features of Web texts in Portuguese to classify Web pages according to users’ needs, in order to improve Web Information Retrieval. We first describe a seven-category classification of users’ needs, which was the outcome of a qualitative analysis of two logs from TodoBr, a major Brazilian search engine. We then describe 46 shallow linguistic features, inspired by the work of Biber and Karlgren, and the compilation of the corpus used to train the classifier. Our aim is to obtain rules that can be applied to the classification of Web texts according to those seven users’ needs. We report experiments showing that at least some of the categories can be identified reliably.
Web Information Retrieval, stylistic features, users’ needs, Portuguese