LREC 2000 2nd International Conference on Language Resources & Evaluation

Title Issues from Corpus Analysis that have influenced the On-going Development of Various Haitian Creole Text- and Speech-based NLP Systems and Applications
Authors Mason Marilyn (Mason Integrated Technologies Ltd (MIT2), P.O. Box 181015, Boston, Massachusetts 02118 USA)
Keywords End-User Environments, Haitian Creole, Minority Languages, Natural Language Processing Systems, OCR, Orthography Conversion, Standardization, Vernacular Languages
Session Session WP7 - Corpus Projects
Full Paper, 342.pdf
Abstract This paper describes issues relevant to using small to large corpora for training and testing text- and speech-based natural language processing (NLP) systems for minority and vernacular languages. The R&D and commercial systems and applications already produced for the Haitian Creole (HC) language include machine translation, orthography conversion, optical character recognition, speech recognition, and speech synthesis. Few corpora for minority and vernacular languages have been created specifically for language resource distribution and for NLP system training. As a result, some of the only available corpora are those produced within real end-user environments. It is therefore of utmost importance that written language standards be created and then observed so that research on various text- and speech-based systems can be fruitful. Observing such standards also gives vernacular and minority languages the opportunity to have an impact on the globalization and advanced communication efforts of the modern world. Such technologies can significantly influence the status of these languages, yet the lack of standardization is a severe impediment to technological development. A number of relevant issues are discussed in this paper.