LREC 2000 2nd International Conference on Language Resources & Evaluation

Papers and abstracts by paper title: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Papers and abstracts by ID number: 1-50, 51-100, 101-150, 151-200, 201-250, 251-300, 301-350, 351-377.

List of all papers and abstracts

Paper Paper Title Abstract
301 Automatic Style Categorisation of Corpora in the Greek Language In this article, a system is proposed for the automatic style categorisation of text corpora in the Greek language. This categorisation is based to a large extent on the type of language used in the text, for example whether the language used is representative of formal Greek or not. To arrive to this categorisation, the highly inflectional nature of the Greek language is exploited. For each text, a vector of both structural and morphological characteristics is assembled. Categorisation is achieved by comparing this vector to given archetypes using a statistical-based method. Experimental resu
302 Automatic Extraction of Semantic Similarity of Words from Raw Technical Texts In this paper we address the problem of extracting semantic similarity relations between lexical entities based on context similarities as they appear in specialized text corpora. Only general-purpose linguistic tools are utilized in order to achieve portability across domains and languages. Lexical context is extended beyond immediate adjacency but is still confined by clause boundaries. Morfological and collocational information are employed in order to exploit the most of the contextual data. The extracted semantic similarity relations are transformed to semantic clusters which is a primal form of a domain-specific term thesaurus.
303 Predictive Performance of Dialog Systems This paper relates some of our experiments on the possibility of predictive performance measures of dialog systems. Experimenting dialog systems is often a very high cost procedure due to the necessity to carry out user trials. Obviously it is advantageous when evaluation can be carried out automatically. It would be helpfull if for each application we were able to measure the system performances by an objective cost function. This performance function can be used for making predictions about a future evolution of the systems without user interaction. Using the PARADISE paradigm, a performance function derived from the relative contribution of various factors is first obtained for one system developed at LIMSI: PARIS-SITI (kiosk for tourist information retrieval in Paris). A second experiment with PARIS-SITI with a new test population confirms that the most important predictors of user satisfaction are understanding accuracy, recognition accuracy and number of user repetitions. Futhermore, similar spoken dialog features appear as important features for the Arise system (train timetable telephone information system). We also explore different ways of measuring user satisfaction. We then discuss the introduction of subjective factors in the predictive coefficients.
306 Automatic Generation of Dictionary Definitions from a Computational Lexicon This paper presents an automatic Generator of dictionary definitions for concrete entities, based on information extracted from a Computational Lexicon (CL) containing semantic information. The aim of the adopted approach, combining NLG techniques with the exploitation of the formalised and systematic lexical information stored in CL, is to produce well formed dictionary definitions free from the shortcomings of traditional dictionaries. The architecture of the system is presented, focusing on the adaptation of the NLG techniques to the specific application requirements, and on the interface between the CL and the Generator. Emphasis is given on the appropriateness of the CL for the application purposes.
307 Regional Pronunciation Variants for Automatic Segmentation The goal of this paper is to create an extended rule corpus with approximately 2300 phonetic rules which model segmental variation of regional variants of German. The phonetic rules express at a broad-phonetic level phenomena of phonetic reduction in German that occurs within words and across word boundaries. In order to get an improvement in automatic segmentation of regional speech variants, these rules are clustered and implemented depending on regional specification in the Munich Automatic Segmentation System.
310 SegWin: a Tool for Segmenting, Annotating, and Controlling the Creation of a Database of Spoken Italian Varieties A number of actions have been recently proposed, aiming at filling the gap existing in the availability of speech annotated corpora of Italian regional varieties. A starting action is represented by the national project AVIP (Archivio delle Varietà di Italiano Parlato, Spoken Italian Varieties Archive), whose main challenge is a methodological one, namely finding annotation strategies and developing suitable software tools for coping with the inadequacy of linguistic models for Italian accent variations. Basically, these strategies consist in adopting an iterative process of labelling such that a description for each variety could be achieved by successive refinement stages without loosing intermediate stages information. To satisfy such requirements, a specific software system, called SegWin, has been developed by Politecnico di Bari, which: • “guides” the human transcribers in the annotation phases by a sort of “scheduled procedure”; • allows incremental addition of information at any stage of the database creation; • monitors/checks the consistency of the database during every stage of its creation The system has been extensively used by all the partners of the project AVIP and is continuously updated to take into account the project needs. The main characteristics of SegWin are here described, in relation to the above mentioned aspects.
312 Automotive Speech-Recognition - Success Conditions Beyond Recognition Rates From a car-manufacturer’s point of view it is very important to integrate evaluation procedures into the MMI development process. Focusing the usability evaluation of speech-input and speech-output systems aspects beyond recognition rates must be fulfilled. Two of these conditions will be discussed based upon user studies conducted in 1999: • Mental-workload and distraction • Learnability
313 The ISLE Corpus of Non-Native Spoken English For the purpose of developing pronunciation training tools for second language learning a corpus of non-native speech data has been collected, which consists of almost 18 hours of annotated speech signals spoken by Italian and German learners of English. The corpus is based on 250 utterances selected from typical second language learning exercises. It has been annotated at the word and the phone level, to highlight pronunciation errors such as phone realisation problems and misplaced word stress assignments. The data has been used to develop and evaluate several diagnostic components, which can be used to produce corrective feedback of unprecedented detail to a language learner.
314 A Graphical Parametric Language-Independent Tool for the Annotation of Speech Corpora Robust speech recognizers and synthesizers require well-annotated corpora in order to be trained and tested, thus making speech annotation tools crucial in speech technology. It is very important that these tools are parametric so that they can handle various directory and file structures and deal with different waveform and transcription formats. They should also be language-independent, provide a user-friendly interface or even interact with other kinds of speech processing software. In this paper we describe an efficient tool able to cope with the above requirements. It was first developed for the annotation of the SpeechDat-II recordings, and then it was extended to incorporate the additional features of the SpeechDat-Car project. Nevertheless, it has been parameterized so that it is not restricted to the SpeechDat format and Greek, and it can handle any other formalism and language.
315 The PAROLE Program The PAROLE project (Contract LE2-4017) was launched in May 1996 by the Commission of the European Communities, at the initiative of the DG XIII (Telecommunications, Information Market and Exploitation of Research). PAROLE is not just a project for gathering and creating a corpus. We are creating a true architectural model whose riches and quality will constitute strategic assets for European linguistic studies. This two-level architecture will link together two major morphological and syntactical component.
316 For a Repository of NLP Tools In this paper, we assume that the perspective which consists of identifying the NLP supply according to its different uses gives a general and efficient framework to understand the existing technological and industrial offer in a user-oriented approach. The main feature of this approach is to analyse how a specific technical product is really used by the users and not only to highlight how the developers expect the product to be used. To achieve this goal with NLP products, we first need to have a clear and quasi-exhaustive picture of the technical and industrial supply. During the 1998-1999 period, the European Language Resources Association (ELRA) conducted a study funded by the French Ministry of Research and Higher Education to produce a directory of language engineering tools and resources for French. In this paper, we present the main results of the study. The first part gives some information on the methodology adopted to conduct the study, the second part presents the main characteristics of the classification and the third part gives an overview of the applications which have been identified.
317 Survey of Language Engineering Needs: a Language Resources Perspective This paper describes the current state of an on-going survey that aims at determining the needs of users with respect to available and potentially available Language Resources (LRs). Following market monitoring strategies that have been outlined within the Language Resources- Packaging and Production project (LRsP&P LE4-8335), the main objective of this survey is to provide concrete figures for developing a more reliable and workable business plan for the European Language Resources Association (ELRA) and its Distribution Agency (ELDA), and to determine investment plans for sponsoring the production of new resources.
319 Interarbora and Thistle - Delivering Linguistic Structure by the Internet I describe an Internet service ''Interarbora'', which facilitates the visualization of tree structures. The service is built on top of a general purpose editor ''Thistle'', which allows the editing of diagrams and the generation of print format representations.
320 Automatically Augmenting Terminological Lexicons from Untagged Text Lexical resources play a crucial role in language technology but lexical acquisition can often be a time-consuming, laborious and costly exercise. In this paper, we describe a method for the automatic acquisition of technical terminology from domain restricted texts without the need for sophisticated natural language processing tools, such as taggers or parsers, or text corpora annotated with labelled cases. The method is based on the idea of using prior or seed knowledge in order to discover co-occurrence patterns for the terms in the texts. A bootstrapping algorithm has been developed that identifies patterns and new terms in an iterative manner. Experiments with scientific journal abstracts in the biology domain indicate an accuracy rate for the extracted terms ranging from 58% to 71%. The new terms have been found useful for improving the coverage of a system used for terminology identification tasks in the biology domain.
321 Annotating Events and Temporal Information in Newswire Texts If one is concerned with natural language processing applications such as information extraction (IE), which typically involve extracting information about temporally situated scenarios, the ability to accurately position key events in time is of great importance. To date only minimal work has been done in the IE community concerning the extraction of temporal information from text, and the importance, together with the difficulty of the task, suggest that a concerted effort be made to analyse how temporal information is actually conveyed in real texts. To this end we have devised an annotation scheme for annotating those features and relations in texts which enable us to determine the relative order and, if possible, the absolute time, of the events reported in them. Such a scheme could be used to construct an annotated corpus which would yield the benefits normally associated with the construction of such resources: a better understanding of the phenomena of concern, and a resource for the training and evaluation of adaptive algorithms to automatically identify features and relations of interest. We also describe a framework for evaluating the annotation and compute precision and recall for different responses.
327 Chinese-English Semantic Resource Construction We describe an approach to large-scale construction of a semantic lexicon for Chinese verbs. We leverage off of three existing resources— a classification of English verbs called EVCA (English Verbs Classes and Alternations) (Levin, 1993), a Chinese conceptual database called HowNet (Zhendong, 1988c; Zhendong, 1988b; Zhendong, 1988a) (http://www, and a large machine-readable dic-tionary called Optilex. The resulting lexicon is used for determining appropriate word senses in applications such as machine translation and cross-language information retrieval.
328 Production of NLP-oriented Bilingual Language Resources from Human-oriented dictionaries In this paper, the main features of manually produced bilingual dictionaries, which have been originally designed for human use, are considered. The problem is to find the way to use such kind of dictionaries in order to produce bilingual language resources that could make a base for automate text processing, such as machine translation, cross-lingual interrogation in text retrieval, etc. The transformation technology suggested hereby is based on XML-parsing of the file obtained from the source data by means of serial of special procedures. In order to produce well-formed XML-file, automatic procedures suffice. But in most cases, there are still semantic problems and inconveniencies that could be retired only in interactive way. However, the volume of this work can be minimized due to automatic pre-editing and suitable XML mark-up. The paper presents the results of R&D project which was carried out in the framework of ELRA’1999 Call for proposals on Language resources Production. The paper is based on the authors’ experience with English-Russian and French-Russian dictionaries, but the technology can be applied to other pairs of languages.
329 Developing a Multilingual Telephone Based Information System in African Languages This paper introduces the first project of its kind within the Southern African language engineering context. It focuses on the role of idiosyncratic linguistic and pragmatic features of the different languages concerned and how these features are to be accommodated within (a) the creation of applicable speech corpora and (b) the design of the system at large. An introduction to the multilingual realities of South Africa and its implications for the development of databases is followed by a description of the system and different options that may be implemented in the system.
330 Tuning Lexicons to New Operational Scenarios In this paper the role of the lexicon within typical application tasks based on NLP is analysed. A large scale semantic lexicon is studied within the framework of a NLP application. The coverage of the lexicon with respect the target domain and a (semi)automatic tuning approach have been evaluated. The impact of a corpus-driven inductive architecture aiming to compensate lacks in lexical information are thus measured and discussed.
331 SpeechDat-Car Fixed Platform SpeechDat-Car aims to develop a set of speech databases to support training and testing of multilingual speech recognition applications in the car environment. Two types of recordings compose the database. The first type consist of wideband audio signals recorded directly in the car while the second type is composed by GSM signals transmitted from the car and recorded simultaneously in a far-end. Therefore, two recording platforms were used, a ‘mobile’ recording platform installed inside the car and a ‘fixed’ recording platform located at the far-end fixed side of the GSM communications system. This paper describes the fixed platform software developed by the Universitat Politècnica de Catalunya (ADA-K). This software is able to work with standard inexpensive PC cards for ISDN lines.
333 Inter-annotator Agreement for a German Newspaper Corpus This paper presents the results of an investigation on inter-annotator agreement for the NEGRA corpus, consisting of German newspaper texts. The corpus is syntactically annotated with part-of-speech and structural information. Agreement for part-of-speech is 98.6%, the labeled F-score for structures is 92.4%. The two annotations are used to create a common final version by discussing differences and by several iterations of cleaning. Initial and final versions are compared. We identify categories causing large numbers of differences and categories that are handled inconsistently.
334 Interactive Corpus Annotation We present an easy-to-use graphical tool for syntactic corpus annotation. This tool, Annotate, interacts with a part-of-speech tagger and a parser running in the background. The parser incrementally suggests single phrases bottom-up based on cascaded Markov models. A human annotator confirms or rejects the parser’s suggestions. This semi-automatic process facilitates a very rapid and efficient annotation.
335 The Concede Model for Lexical Databases The value of language resources is greatly enhanced if they share a common markup with an explicit minimal semantics. Achieving this goal for lexical databases is difficult, as large-scale resources can realistically only be obtained by up-translation from pre-existing dictionaries, each with its own proprietary structure. This paper describes the approach we have taken in the Concede project, which aims to develop compatible lexical databases for six Central and Eastern European languages. Starting with sample entries from original presentation-oriented electronic representations of dictionaries, we transformed the data into an intermediate TEI-compatible represen-tation to provide a common baseline for evaluating and comparing the dictionaries. We then developed a more restrictive encoding, formalised as an XML DTD with a clearly-defined semantic interpretation. We present this DTD and discuss a sample conversion from TEI, together with an application which hyperlinks a HTML representation of the dictionary to on-line concordancing over a corpus.
336 Design and Implementation of the Online ILSP Greek Corpus This paper presents the Hellenic National (HNC), which is the corpus of Modern Greek developed by the Institute for Language and Speech Processing (ILSP). The presentation describes all stages of the creation of the corpus: collection of the material, tagging and tokenizing, construction of the database and the online implementation which aims at rendering the corpus accessible over Internet to the research community.
337 A Software Toolkit for Sharing and Accessing Corpora Over the Internet This paper describes the Translational English Corpus (TEC) and the software tools developed in order to enable the use of the corpus remotely, over the internet. The model underlying these tools is based on an extensible client-server architecture implemented in Java. We discuss the data and processing constraints which motivated the TEC architecture design and its impact on the efficiency and scalability of the system. We also suggest that the kind of distributed processing model adopted in TEC could play a role in fostering the availability of corpus linguistic resources to the research community.
338 Tools for the Generation of Morphological Entries in Dictionaries he lexicographer's tool introduced in the report represents a semiautomatic system to generate the section of morphological information for Estonian words in dictionary entries. Estonian is a language with a complicated morphology featuring (1) rich inflection and (2) marked and diverse morpheme variation, applying both to stems and formatives. The kernel of the system is a rule-based automatic morphology with separate program modules for every linguistic subsystem such as syllabification, recognition of part of speech and type of inflection, stem variation, morpheme and allomorph combinatorics. The modules function as rule interpreters applying formal grammars in an editable text format. The system enables generation of the following: (1) part of speech, (2) type of inflection, (3) inflected forms, (4) morphonological marking: degree of quantity, morpheme boundaries (stem+formative, component boundaries in compounds), (5) morphological references for inflected forms considerably different from the headword. The system permits of set-up, so that the inflected forms to be generated, the style of morphonological marking and the criteria for reference selection are all up to the user to choose. Full automation of the system application is restricted mainly by morphological homonymy.
340 Improving Lexical Databases with Collocational Information: Data from Portuguese This article focuses on ongoing work done for Portuguese concerning the phenomenon of lexical co-occurrence known as collocation (cf. Cruse, 1986, inter al.). Instances of the syntactic variety formed by noun plus adjective have been especially observed. Collocational instances are not lexical entries, and thus should not be stored in the lexicon as multiword lexical units. Their processing can be conceived through relations linking the lexical components. Mechanisms for dealing with the collocation-hood of the expressions are required to be included in the systems, topographically, in their lexical modules. Lexical databases like wordnets, with a general architecture typically structured on semantic relations, make room for the specification of this phenomenon. This can be handled through the definition of ad-hoc relations expressing the different semantic effects the adjectival modification bring to nominal phrases, collocationally.
341 Semi-automatic Construction of a Tree-annotated Corpus Using an Iterative Learning Statistical Language Model In this paper, we propose a method to construct a tree-annotated corpus, when a certain statistical parsing system exists and no tree-annotated corpus is available as training data. The basic idea of our method is to sequentially annotate plain text inputs with syntactic trees using a parser with a statistical language model, and iteratively retrain the statistical language model over the obtained annotated trees. The major characteristics of our method are as follows: (1)in the first step of the iterative learning process, we manually construct a tree-annotated corpus to initialize the statistical language model over, and (2) at each step of the parse tree annotation process, we use both syntactic statistics obtained from the iterative learning process and lexical statistics pre-derived from existing language resources, to choose the most probable parse tree.
342 Issues from Corpus Analysis that have influenced the On-going Development of Various Haitian Creole Text- and Speech-based NLP Systems and Applications This paper describes issues that are relevant to using small- to large-sized corpora for the training and testing of various text- and speech-based natural language processing (NLP) systems for minority and vernacular languages. These R&D and commercial systems and applications include machine translation, orthography conversion, optical character recognition, speech recognition, and speech synthesis that have already been produced for the Haitian Creole (HC) language. Few corpora for minority and vernacular languages have been created specifically for language resource distribution and for NLP system training. As a result, some of the only available corpora are those that are produced within real end-user environments. It is therefore of utmost importance that written language standards be created and then observed so that research on various text- and speech-based systems can be fruitful. In doing so, this also provides vernacular and minority languages with the opportunity to have an impact within the globalization and advanced communication needs efforts of the modern day world. Such technologies can significantly influence the status of these languages, yet the lack of standardization is a severe impediment to technological development. A number of relevant issues are discussed in this paper.
345 NaniTrans: a Speech Labelling Tool This paper deals with a description of NaniTrans, a tool for segmentation and labeling of speech. The tool is programmed to work on the MATLAB application interface, in any of the supported platforms (Unix, Windows, Macintosh). The tool has been designed to annotate large speech databases, which can be also partially preprocessed (but require manual supervision). It supports the definition of an environment of annotation: set of annotation levels (orthographic, phonetic, etc.), display mode (how to show information), graphic representation (waveform, spectrogram), keyboard short-cuts, etc. This configuration is then used on a speech database. A safe file locking system allows many annotators to work concurrently on the same speech database. The tool is very friendly and easy to use by non experienced annotators, and it is designed to optimize speed using both keyboard and mouse. New options or speech processing tools can be easily added by using any MATLAB or user defined function.
347 Acquisition of Linguistic Patterns for Knowledge-based Information Extraction In this paper we present a new method of automatic acquisition of linguistic patterns for Information Extraction, as implemented in the CICERO system. Our approach combines lexico-semantic information available from the WordNet database with collocating data extracted from training corpora. Due to the open-domain nature of the WordNet information and the immediate availability of large collections of texts, our method can be easily ported to open-domain Information Extraction.
348 A Platform for Dutch in Human Language Technologies As ICT increasingly forms a part of our daily life it becomes more and more important that all citizens can make use of their native languages in all communicative situations. For the development of successful applications and products for Dutch basic provisions are required. The development of the basic material that is lacking, is an expensive undertaking which exceeds the capacity of the individuals involved. Collaboration between the various agents (policy, knowledge infrastructure and industry) in the Netherlands and Flanders is required. The existence of the Dutch Language Union (Nederlandse Taalunie) facilitates this co-operation. The responsible ministers decided to set up a Dutch-Flemish platform for Dutch in Human Language Technologies. The purpose of the platform is the further construction of an adequate digital language infrastructure for Dutch so that the industry develops the required applications which must guarantee that the citizens in Holland and Flanders can use their own language in their communication within the information society and the Dutch language area remains a full player in a multi-lingual Europe. This paper will show some of the efforts that have been taken
349 Developing and Testing General Models of Spoken Dialogue System Peformance The design of methods for performance evaluation is a major open research issue in the area of spoken language dialogue systems. This paper presents the PARADISE methodology for developing predictive models of spoken dialogue performance, and shows how to evaluate the predictive power and generalizability of such models. To illustrate the methodology, we develop a number of models for predicting system usability (as measured by user satisfaction), based on the application of PARADISE to experimental data from two different spoken dialogue systems. We compare both linear and tree-based models. We then measure the extent to which the models generalize across different systems, different experimental conditions, and different user populations, by testing models trained on a subset of the corpus against a test set of dialogues. The results show that the models generalize well across the two systems, and are thus a first approximation towards a general performance model of system usability.
350 Using Few Clues Can Compensate the Small Amount of Resources Available for Word Sense Disambiguation Word Sense Disambiguation (WSD) is considered as one of the most difficult tasks in Natural Language Processing. Probabilistic methods have shown their efficiency in many NLP tasks, but they imply a training phase and very few resources are available for WSD. This paper aims at showing how to make the most of size-limited resources in order to partially overcome the knowledge acquisition bottleneck. Experiments are performed within the SENSEVAL test framework in order to evaluate the advantage of a lemmatized or stemmed context over an original context (inflected forms as they are observed in the rough text). Then, we measure the precision improvement (about 6 %) when looking at the inflected form of the word to be disambiguated. Lastly, we show that it is possible to reduce the ambiguity if the word to be disambiguated has a particular inflected form or occurs as part of a compound.