Summary of the paper

Title New language resources for the Pashto language
Authors Djamel Mostefa, Khalid Choukri, Sylvie Brunessaux, Karim Boudahmane
Abstract This paper reports on the development of new language resources for Pashto, a very low-resource language spoken in Afghanistan and Pakistan. In the scope of a multilingual data collection project, three large corpora are collected for Pashto. Firstly, a monolingual text corpus of 100 million words is produced. Secondly, a 100-hour speech database is recorded and manually transcribed. Finally, a bilingual Pashto-French parallel corpus of around 2 million is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto, with a special focus on Machine Translation.
Topics Corpus (creation, annotation, etc.), Machine Translation, Speech-to-Speech Translation, Speech resource/database
Full paper New language resources for the Pashto language
