LREC 2000 2nd International Conference on Language Resources & Evaluation

Papers and abstracts by paper title: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Papers and abstracts by ID number: 1-50, 51-100, 101-150, 151-200, 201-250, 251-300, 301-350, 351-377.

List of all papers and abstracts

Paper Paper Title Abstract
351 Modern Greek Corpus Taxonomy The aim of this paper is to explore the way in which different kind of linguistic variables can be used in order to discriminate text type in 240 preclassified press texts. Modern Greek (MG) language due to its past diglossic status exhibits extended variation in written texts across all linguistic levels and can be exploited in text categorization tasks. The research presented used Discriminant Function Analysis (DFA) as a text categorization method and explores the way different variable groups contribute to the text type discrimination.
353 Language Resources as by-Product of Evaluation: The MULTITAG Example In this paper, we show how the paradigm of evaluation can function as language resource producer for high quality and low cost validated language resources. First the paradigm of evaluation is presented, the main points of its history are recalled, from the first deployment that took place in the USA during the DARPA/NIST evaluation campaigns, up to latest efforts in Europe (SENSEVAL2/ROMANSEVAL2, CLEF, CLASS etc.). Then the principle behind the method used to produce high-quality validated language at low cost from the by-products of an evaluation campaign is exposed. It was inspired by the experiments (Recognizer Output Voting Error Recognition) performed during speech recognition evaluation campaigns in the USA and consists of combining the outputs of the participating sys-tems with a simple voting strategy to obtain higher performance results. Here we make a link with the existing strategies for system combination studied in machine learning. As an illustration we describe how the MULTITAG project funded by CNRS has built from the by-products of the GRACE evaluation campaign (French Part-Of-Speech tagging system evaluation campaign) a corpus of around 1 million words, annotated with a fine grained tagset derived from the EAGLES and MULTEXT projects. A brief presentation of the state of the art in Part-Of-Speech (POS) tagging and of the problem posed by its evaluation is given at the beginning, then the corpus itself is presented along with the procedure used to produce and validate it. In particular, the cost reduction brought by using this method instead of more classical methods is presented and its generalization to other control task is discussed in the conclusion.
355 Evaluation of Computational Linguistic Techniques for Identifying Significant Topics for Browsing Applications Evaluation of natural language processing tools and systems must focus on two complementary aspects: first, evaluation of the accuracy of the output, and second, evaluation of the functionality of the output as embedded in an application. This paper presents evaluations of two aspects of LinkIT, a tool for noun phrase identification linking, sorting and filtering. LinkIT [Evans 1998] uses a head sorting method [Wacholder 1998] to organize and rank simplex noun phrases (SNPs). LinkIT is to identify significant topics in domain-independent documents. The first evaluation, reported in D.K.Evans et al. 2000 compares the output of the Noun Phrase finder in LinkIT to two other systems. Issues of establishing a gold standard and criteria for matching are discussed. The second evaluation directly concerns the construction of the browsing application. We present results from Wacholder et al. 2000 on a qualitative evaluation which compares three shallow processing methods for extracting index terms, i.e., terms that can be used to model the content of documents. We analyze both quality and coverage. We discuss how experimental results such as these guide the building of an effective browsing applications.
356 Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition This paper reports on a project for collection of the sound scene data. The sound scene data is necessary for studies such as sound source localization, sound retrieval, sound recognition and hands-free speech recognition in real acoustical environments. There are many kinds of sound scenes in real environments. The sound scene is denoted by sound sources and room acoustics. The number of combination of the sound sources, source positions and rooms is huge in real acoustical environments. However, the sound in the environments can be simulated by convolution of the isolated sound sources and impulse responses. As an isolated sound source, a hundred kinds of non-speech sounds and speech sounds are collected. The impulse responses are collected in various acoustical environments. In this paper, progress of our sound scene database project and application to environment sound recognition are described.
357 Using Lexical Semantic Knowledge from Machine Readable Dictionaries for Domain Independent Language Modelling Machine Readable Dictionaries (MRDs) have been used in a variety of language processing tasks including word sense disambiguation, text segmentation, information retrieval and information extraction. In this paper we describe the utilization of semantic knowledge acquired from an MRD for language modelling tasks in relation to speech recognition applications. A semantic model of language has been derived using the dictionary definitions in order to compute the semantic association between the words. The model is capable of capturing phenomena of latent semantic dependencies between the words in texts and reducing the language ambiguity by a considerable factor. The results of experiments suggest that the semantic model can improve the word recognition rates in “noisy-channel” applications. This research provides evidence that limited or incomplete knowledge from lexical resources such as MRDs can be useful for domain independent language modelling.
358 Annotation of a Multichannel Noisy Speech Corpus This paper describes the activity of annotation of an Italian corpus of in-car speech material, with specific reference to the JavaSgram tool, developed with the purpose of annotating multichannel speech corpora. Some pre/post processing tools used with JavaSgram are briefly described together with a synthetic description of the annotation criteria which were adopted. The final objective is that of using the resulting corpus for training and testing a hands-free speech recognizer under development.
360 ARISTA Generative Lexicon for Compound Greek Medical Terms A Generative Lexicon for Compound Greek Medical Terms based on the ARISTA method is proposed in this paper. The concept of a representation independent definition-generating lexicon for compound words is introduced in this paper following the ARISTA method. This concept is used as a basis for developing a generative lexicon of Greek compound medical terminology using the senses of their component words expressed in natural language and not in a formal language. A Prolog program that was implemented for this task is presented that is capable of computing implicit relations between the components words in a sublanguage using linguistic and extra linguistic knowledge. An extra linguistic knowledge base containing knowledge derived from the domain or microcosm of the sublanguage is used for supporting the computation of the implicit relations. The performance of the system was evaluated by generating possible senses of the compound words automatically and judging the correctness of the results by comparing them with definitions given in a medical lexicon expressed in the language of the lexicographer.
362 A Self-Expanding Corpus Based on Newspapers on the Web A Unix-based system is presented which automatic collects newspaper articles from the web, converts the texts, and includes these texts in a newspaper corpus. This corpus can be searched from a web-browser. The corpus is currently 70 millions words and increases by 4 millions words each month.
363 A Web-based Advanced and User Friendly System: The Oslo Corpus of Tagged Norwegian Texts A general purpose text corpus meant for linguists and lexicographers needs to satify quality criteria at at least four different levels. The first two criteria are fairly well established; the corpus should have a wide variety of texts and be tagged according to a fine-grained system. The last two criteria are much less widely appreciated, unfortunately. One has to do with variety of search criteria: the user should be allowed to search for any information contained in the corpus, and with any combination possible. In addition, the search results should be presented in a choice of ways. The fourth criterion has to do with accessability. It is a rather surprising fact that while user interfaces tend to be simple and self explanatory in most areas of life represented electronically, corpus interfaces are still extremely user unfriendly. In this paper, we present a corpus whose interface we have given a lot of thought, and likewise the possible search options, viz. the Oslo Corpus of Tagged Norwegian Texts.
364 COCOSDA - a Progress Report This paper presents a review of the activities of COCOSDA, the International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques for Speech Input/Output. COCOSDA has a history of innovative actions which spawn national and regional consortia for the co-operative development of speech corpora and for the promotion of research in related topics. COCOSDA has recently undergone a change of organisation in order to meet the developing needs of the speech- and language-processing technologies and this paper summarises those changes.
366 The Treatment of Adjectives in SIMPLE: Theoretical Observations This paper discusses the issues that play a part in the characterization of adjectival meaning. It describes the SIMPLE ontology for adjectives and provides insight into the morphological, syntactic and semantic aspects that are included in the SIMPLE adjectival templates.
367 Cardinal, Nominal or Ordinal Similarity Measures in Comparative Evaluation of Information Retrieval Process Similarity measures are used to quantify the resemblance of two sets. Simplest ones are calculated by ratios of the document's number of the compared sets. These measures are simple and usually employed in first steps of evaluation studies, they are called cardinal measures. Others measures compare sets upon the number of common documents they have. They are usually employed in quantitative information retrieval evaluations, some examples are Jaccard, Cosine, Recall or Precision. These measures are called nominal ones. There are more or less adapted in function of the richness of the information system's answer. Indeed, in the past, they were sufficient because answers given by systems were only composed by an unordered set of documents. But usual systems improve the quality or the visibility of there answers by using a relevant ranking or a clustering presentation of documents. In this case, similarity measures aren't adapted. In this paper we present some solutions in the case of totally ordered and partially ordered answer.
368 Evaluating Multi-party Multi-modal Systems The MITRE Corporation ’s Evaluation Working Group has developed a methodology for evaluating multi-modal groupware systems and capturing data on human-human interactions.The methodology consists of a framework for describing collaborative systems, scenario-based evaluation approach,and evaluation metrics for the various components of collaborative systems.We designed and ran two sets of experiments to validate the methodology by evaluating collaborative systems.In one experiment,we compared two configurations of a multi-modal collaborative application using a map navigation scenario requiring information sharing and decision making.In the second experiment,we pplied the evaluation methodology to a loosely integrated set of collaborative tools,again using a scenario-based approach.In both experiments,multi-modal,multi-user data were collected,visualized,annotated,and analyzed.
369 Extension and Use of GermaNet, a Lexical-Semantic Database This paper describes GermaNet, a lexical-semantic network and on-line thesaurus for the German language, and outlines its future extension and use. GermaNet is structured along the same lines as the Princeton WordNet (Miller et al., 1990; Fellbaum, 1998), encoding the major semantic relations like synonymy, hyponymy, meronymy, etc. that hold among lexical items. Constructing semantic networks like GermaNet has become very popular in recent approaches to computational lexicography, since wordnets constitute important language resources for word sense disambiguation, which is a prerequisite for various applications in the field of natural language processing, like information retrieval, machine translation and the development of different language-learning tools.
370 Russian Monitor Corpora: Composition, Linguistic Encoding and Internet Publication The LinGO (Linguistic Grammars Online) project’s English Resource Grammar and the LKB grammar development environment are language resources which are freely available for download for any purpose, including commercial use (see Executable programs and source code are both included. In this paper, we give an outline of the LinGO English grammar and LKB system, and discuss the ways in which they are currently being used. The grammar and processing system can be used independently or combined to give a central component which can be exploited in a variety of ways. Our intention in writing this paper is to encourage more people to use the technology, which supports collaborative development on many levels.
371 An Open Source Grammar Development Environment and Broad-coverage English Grammar Using HPSG The LinGO (Linguistic Grammars Online) project's English Resource Grammar and the LKB grammar development environment are language resources which are freely available for download for any purpose, including commercial use (see Executable programs and source code are both included. In this paper, we give an outline of the LinGO English grammar and LKB system, and discuss the ways in which they are currently being used. The grammar and processing system can be used independently or combined to give a central component which can be exploited in a variety of ways. Our intention in writing this paper is to encourage more people to use the technology, which supports collaborative development on many levels.
372 Hua Yu: A Word-segmented and Part-Of-Speech Tagged Chinese Corpus As the outcome of a 3-year joint effort of Department of Computer Science, Tsinghua University and Language Information Processing Institute, Beijing Language and Culture University, Beijing, China, a word-segmented and part-of-speech tagged Chinese corpus with size of 2 million Chinese characters, named HuaYu, has been established. This paper firstly introduces some basics about HuaYu in brief, as its genre distribution, fundamental considerations in designing it, word segmentation and part-of-speech tagging standards. Then the complete list of tag set used in HuaYu is given, along with typical examples for each tag accordingly. Several pieces of annotated texts in each genre are also included at last for reader's reference.
373 SPEECHDAT-CAR. A Large Speech Database for Automotive Environments The aims of the SpeechDat-Car project are to develop a set of speech databases to support training and testing of multilingual speech recognition applications in the car environment. As a result, a total of ten (10) equivalent and similar resources will be created. The 10 languages are Danish, British English, Finnish, Flemish/Dutch, French, German, Greek, Italian, Spanish and American English. For each language 600 sessions will be recorded (from at least 300 speakers) in seven characteristic environments (low speed, high speed with audio equipment on, etc.). This paper gives an overview of the project with a focus on the production phases (recording platforms, speaker recruitment, annotation and distribution).
374 Addizionario: an Interactive Hypermedia Tool for Language Learning In this paper we present the hypermedia linguistic laboratory ''Addizionario'', an open and flexible software tool aimed at studying Italian either as native or as foreign language. The product is directed to various categories of users: school children who can perform in a pleasant and appealing manner various tasks generally considered difficult and boring, such as dictionary look-up, word definition and vocabulary expansion; teachers who can use it to prepare didactic units specifically designed to meet the needs of their students; psychologists and therapists who can use it as an aid to detect impaired development and learning in the child; and editors of children’s dictionaries who can access large quantities of material for the creation of attractive, easy-to-use tools which take into account the capacities, tastes and interests of their users.
377 Recent Developments within the European Language Resources Association (ELRA) The main achievement of ELRA (the most visible) is the growth of its catalogue. The ELRA catalogue as of April 2000 lists 111 speech resources, 50 monolingual lexica, 113 multilingual lexica, 24 written corpora and 275 terminological databases. However, many Language Resources (LRs) need to be identified and/or produced. To this effect, ELRA is active in promoting and funding the co-production of new LRs through several calls for proposals. As for the validity of the existence of ELRA for the distribution of language resources, the statistics from the past two years speak for themselves. The 1999 fiscal report showed a rise with the sale of 217 LRs (122 for research and 95 for commercial purposes; with speech databases representing nearly 45%), compared to the sale of 180 LRs (90 for research and 90 for commercial purposes; with speech databases representing nearly 65%), in 1998 and to 33 sold in 1997. The other visible action of ELRA is its membership drive: since its foundation, ELRA has attracted an increasing number of members (from 63 in 1995 to 95 in 1999). This article is updated from a paper presented at Eurospeech'99.