SUMMARY: Session P22-W
| Title | Open Source Corpus Analysis Tools for Malay |
|---|---|
| Authors | T. Baldwin, S. Awab |
| Abstract | Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital advancement of any language. In this paper, we present an open source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of each over a 26K-word sample of Malay text. |
| Keywords | sentence tokeniser, lemmatiser, Malay |
| Full paper | Open Source Corpus Analysis Tools for Malay |