Summary of the paper

Title New language resources for the Pashto language
Authors Djamel Mostefa, Khalid Choukri, Sylvie Brunessaux, Karim Boudahmane
Abstract This paper reports on the development of new language resources for the Pashto language, a very low-resource language spoken in Afghanistan and Pakistan. In the scope of a multilingual data collection project, three large corpora are collected for Pashto. Firstly, a monolingual text corpus of 100 million words is produced. Secondly, a 100-hour speech database is recorded and manually transcribed. Finally, a bilingual Pashto-French parallel corpus of around 2 million is produced by translating Pashto texts into French. These resources will be used to develop Human Language Technology systems for Pashto, with a special focus on Machine Translation.
Topics Corpus (creation, annotation, etc.), Machine Translation, SpeechToSpeech Translation, Speech resource/database
Bibtex @InProceedings{MOSTEFA12.824,
  author = {Djamel Mostefa and Khalid Choukri and Sylvie Brunessaux and Karim Boudahmane},
  title = {New language resources for the Pashto language},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}