Comparative study of oral and written corpora automatically tagged with morpho-syntactic information


Véronique Gendner (TALANA-Lattice (Paris 7) / TLP-LIMSI (CNRS))


SO4: Annotation Tools For Speech LRs


In this paper, we investigate the automatic tagging of French corpora and compare the morpho-syntactic properties of spoken and written language across corpora from different sources. Morpho-syntactic properties are first described in terms of the distribution of the 8 main POS in five corpora of about 1 million words each. The automatic tagging was performed with a set of about a hundred tags; we describe the distinctions they allow and the reasons why they were chosen. We then discuss variation in the common/proper noun distinction and in some distinctions made within the verb category. For this comparison, corpora of about 40 million words were used. These larger corpora have also been used to study the influence of corpus size on vocabularies. Our study on French shows that sources in the news domain contain about 36% of noun-like items (nouns and pronouns). This figure agrees closely with Hudson's earlier studies on the English Brown and LOB corpora. A task-specific dialog corpus shows the highest proportion, with 43% of noun-like items. Spoken news shows about 5% fewer nouns and 5% more pronouns than written news.
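
As an illustration of the kind of measurement reported above, the following minimal sketch computes a POS distribution and the share of noun-like items from a tagged corpus. It is not the tagger or the tooling used in the study: the tag names (`NOUN`, `PRON`, etc.), the function names and the toy sample are all assumptions made for the example, and the actual tagset of about a hundred tags would first have to be mapped onto the main POS.

```python
from collections import Counter

# Hypothetical coarse tag names; the study's own tagset is not reproduced here.
NOUN_LIKE = {"NOUN", "PRON"}


def pos_distribution(tagged_tokens):
    """Return the relative frequency of each POS tag in (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def noun_like_share(tagged_tokens):
    """Return the proportion of tokens tagged as nouns or pronouns."""
    dist = pos_distribution(tagged_tokens)
    return sum(dist.get(tag, 0.0) for tag in NOUN_LIKE)


# Toy sample, invented for illustration only.
sample = [("le", "DET"), ("ministre", "NOUN"), ("il", "PRON"), ("parle", "VERB")]
print(pos_distribution(sample))
print(noun_like_share(sample))
```

Running the same two functions over each corpus would give figures comparable across sources, provided all corpora are tagged with the same tagset and tokenized in the same way.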


Automatic tagging, Spoken corpora, Morpho-Syntactic information
