205 What are Transcription Errors and Why are They Made? In recent work we compared transcriptions of German spontaneous dialogues from the VERBMOBIL corpus to ascertain differences between transcribers and in transcription quality. A better understanding of where inconsistencies occur, and of what kind, will help us to improve the working environment for transcribers, reduce the effort spent on correction passes, and ultimately yield better transcription quality. The results show that transcribers differ in how they perceive spontaneous speech phenomena, mainly prosodic phenomena such as pauses in speech and lengthening. During the correction pass 80% of these labels had to be inserted. Additionally, the annotation of non-grammatical phrases and pronunciation comments seems to need a better explanation in the convention manual. Here the correcting transcribers had to change 20% of the annotations.
180 What's in a Thesaurus? We first describe four varieties of thesaurus: (1) Roget-style, produced to help people find synonyms when they are writing; (2) WordNet and EuroWordNet; (3) thesauruses produced (manually) to support information retrieval systems; and (4) thesauruses produced automatically from corpora. We then contrast thesauruses and dictionaries, and present a small experiment in which we look at polysemy in relation to thesaurus structure. It has sometimes been assumed that different dictionary senses for a word that are close in meaning will be near neighbours in the thesaurus. This hypothesis is explored, using as inputs the hierarchical structure of WordNet 1.5 and a mapping between WordNet senses and the senses of another dictionary. The experiment shows that pairs of ‘lexicographically close’ meanings are frequently found in different parts of the hierarchy.
98 Where Opposites Meet. A Syntactic Meta-scheme for Corpus Annotation and Parsing Evaluation The paper describes the use of FAME, a functional annotation meta-scheme for comparison and evaluation of syntactic annotation schemes, i) as a flexible yardstick in multi-lingual and multi-modal parser evaluation campaigns and ii) for corpus annotation. We show that FAME complies with a variety of non-trivial methodological requirements, and has the potential to be used effectively as an “interlingua” between different syntactic representation formats.
76 Will Very Large Corpora Play For Semantic Disambiguation The Role That Massive Computing Power Is Playing For Other AI-Hard Problems? In this paper we formally analyze the relation between the number of (possibly noisy) examples provided to a word-sense classification algorithm and the performance of the classifier. In the first part of the paper, we show that Computational Learning Theory provides a suitable theoretical framework for establishing such a relation. In the second part, we apply our theoretical results to the case of a semantic disambiguation algorithm based on syntactic similarity.
31 With WORLDTREK Family, Create, Update and Browse your Terminological World Companies need to extract pertinent and coherent information from large collections of documents to be competitive and efficient. Structured terminologies are essential for better drafting, translation, and understanding of technical communication. WORLDTREK EDITION is a tool created to help the terminologist elaborate, browse and update structured terminologies in an ergonomic environment without changing his or her working method. This application can be entirely adapted to the “terminological habits” of the expert. Thus, the data loaded in the software is meta-data. Links, status, property names and domains can be customized. Moreover, the validation stage is facilitated by the use of templates, queries and filters. New terms and links can be easily created to enrich the domains and points of view. Properties such as definition, context, and equivalents in foreign languages are associated with the terms. WORLDTREK EDITION facilitates the comparison and merging of pre-existing networks. Together, these functions and the visualization techniques make up a tool that will help the terminologist to be more effective and productive.