N-Gram Language Modeling for Robust Multi-Lingual Document Classification
German Research Center for Artificial Intelligence GmbH
Statistical n-gram language modeling is used in many domains like speech recognition, language identification, machine translation, character recognition and topic classification. Most language modeling approaches work on n-grams of terms. This paper reports about ongoing research in the MEMPHIS project which employs models based on character-level n-grams instead of term n-grams. The models are used for the multi-lingual classification of documents according to the topics of the MEMPHIS domains. We present methods capable of dealing robustly with large vocabularies and informal, erroneous texts in different languages. We also report on our results of using multi-lingual language models and experimenting with different classification parameters like smoothing techniques and n-grams lengths.
language modeling, multi-lingual document classification, character-level n-gram modeling, smoothing techniques