Network of Data Centres (NetDC): BNSC, An Arabic Broadcast News Speech Corpus


Khalid Choukri, Mahtab Nikkhou, Niklas Paulsson

ELDA (Evaluation and Language Resources Distribution Agency), 55-57, rue Brillat-Savarin, 75013 Paris, France, {choukri, nikkhou, paulsson}@elda.fr, http://www.elda.fr

Session O23-SE

Broadcast news is a very rich source of Language Resources that has been exploited to develop and assess a wide range of Human Language Technologies. Examples include systems that automatically produce text transcriptions of spoken data; identify the language of a text; translate a text from one language to another; identify topics in the news and retrieve all stories discussing a target topic; and retrieve stories directly from the broadcast audio and extract summaries of their content. BNSC is a broadcast news speech corpus developed in the framework of the European-funded project Network of Data Centres (NetDC). The corpus contains more than 20 hours of news recordings in Modern Standard Arabic. The news was recorded over a period of 3 months and was transcribed in Arabic script. The project was carried out in cooperation with the LDC (Linguistic Data Consortium), which has produced a similar corpus of Voice of America Arabic broadcasts in the United States. This paper presents the production of the BNSC corpus, from data collection to final product.

Keyword(s) NetDC, broadcast news, speech corpus
Language(s) N/A
Full Paper 797.pdf