Title Pumping Documents Through a Domain and Genre Classification Pipeline
Author(s) Udo Hahn, Joachim Wermter

Text Knowledge Engineering Lab, Freiburg University, Werthmannplatz 1, D-79098 Freiburg, Germany

Session O16-EW
Abstract We propose a simple yet effective pipeline architecture for document classification. The task is to classify large document streams with heterogeneous content into a layered nine-category system, which first distinguishes medical from non-medical texts and then sorts the medical texts into various subgenres. While document classification is often handled with computationally powerful and hence costly classifiers (e.g., Bayesian ones), we have gathered empirical evidence that a much simpler approach based on n-gram statistics achieves a comparable level of classification performance.
Keyword(s) text categorization, medical application, n-gram model, text genre, WWW
Language(s) German, language-independent
Full Paper 641.pdf
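
The layered scheme described in the abstract (first medical vs. non-medical, then subgenre within the medical branch) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' method: the abstract does not say which n-gram statistics are used, so the sketch picks character trigram counts compared by cosine similarity, and the names `ngram_profile`, `NgramClassifier`, `classify`, as well as the genre labels `clinical_report` and `textbook`, are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count overlapping character n-grams in lower-cased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


class NgramClassifier:
    """Assumed stand-in classifier: one aggregated n-gram profile per category."""

    def __init__(self, n=3):
        self.n = n
        self.profiles = {}

    def fit(self, labelled_texts):
        # labelled_texts: iterable of (label, text) pairs
        for label, text in labelled_texts:
            self.profiles.setdefault(label, Counter()).update(ngram_profile(text, self.n))
        return self

    def predict(self, text):
        # Return the category whose profile is most similar to the document's profile.
        profile = ngram_profile(text, self.n)
        return max(self.profiles, key=lambda label: cosine(profile, self.profiles[label]))


def classify(text, domain_clf, genre_clf):
    """Two-stage pipeline: domain first; genre only for documents labelled 'medical'."""
    domain = domain_clf.predict(text)
    if domain != "medical":
        return (domain, None)
    return (domain, genre_clf.predict(text))


# Toy training data, invented purely for illustration.
domain_clf = NgramClassifier().fit([
    ("medical", "the patient received treatment for chronic heart disease"),
    ("non-medical", "the football team won the league match on sunday"),
])
genre_clf = NgramClassifier().fit([
    ("clinical_report", "patient admitted with chest pain, discharged after treatment"),
    ("textbook", "the heart is a muscular organ that pumps blood"),
])

result = classify("patient admitted with chest pain", domain_clf, genre_clf)
```

In this sketch only documents that pass the first stage as `medical` reach the genre classifier, which mirrors the layered design; a real system would train each stage on a substantial labelled corpus and cover all nine categories.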