LREC 2000 2nd International Conference on Language Resources & Evaluation

Papers and abstracts by paper title: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Papers and abstracts by ID number: 1-50, 51-100, 101-150, 151-200, 201-250, 251-300, 301-350, 351-377.

List of all papers and abstracts

Paper Paper Title Abstract
251 An Optimised FS Pronunciation Resource Generator for Highly Inflecting Languages We report on a new approach to grapheme-phoneme transduction for large-scale German spoken language corpus resources using explicit morphotactic and graphotactic models. Finite state optimisation techniques are introduced to reduce lexicon development and production time, with a speed increase factor of 10. The motivation for this tool is the problem of creating large pronunciation lexica for highly inflecting languages using morphological out of vocabulary (MOOV) word modelling, a subset of the general OOV problem of non-attested word forms. A given spoken language system which uses fully inflected word forms performs much worse with highly inflecting languages (e.g. French, German, Russian) for a given stem lexicon size than with less highly inflecting languages (e.g. English) because of the `morphological handicap' (ratio of stems to inflected word forms), which for German is about 1:5. However, the problem is worse for current speech recogniser development techniques, because a specific corpus never contains all the inflected forms of a given stem. Non-attested MOOV forms must therefore be `projected' using a morphotactic grammar, plus table lookup for irregular forms. Enhancement with statistical methods is possible for regular forms, but does not help much with large, heterogeneous technical vocabularies, where extensive manual lexicon construction is still used. The problem is magnified by the need for defining pronunciation variants for inflected word forms; we also propose an efficient solution to this problem.
252 Sublanguage Dependent Evaluation: Toward Predicting NLP performances In Natural Language Processing (NLP) Evaluation, such as MUC (Hirshman, 98), TREC (Harman, 98), GRACE (Adda et al, 97), SENSEVAL (Kilgarriff98), performance results provided are often average made on the complete test set. That does not give any clues on the systems robustness. knowing which system performs better on average does not help us to find which is the best for a given subset of a language. In the present article, the existing approaches which take into account language heterogeneity and offer methods to identify sublanguages are presented. Then we propose a new metric to assess robustness and we study the effect of different sublanguages identified in the Penn Tree Bank Corpus on performance variations observed for POS tagging. The work we present here is a first step in the development of predictive evaluation methods, intended to propose new tools to help in determining in advance the range of performance that can be expected from a system on a given dataset.
253 The Universal XML Organizer: UXO The integrated editor UXO is the result of ongoing research and development of the text-technology group at Bielefeld. Being a full featured XML-based editing system, it also allows to combine the structured annotated data with information imported from relational databases by integrating a JDBC interface. The mapping processes between different levels of annotation can be programmed either by the integrated scheme interpreter, or by extending the functionality of UXO using the predefined Java API.
254 TyPTex: Inductive Typological Text Classification by Multivariate Statistical Analysis for NLP Systems Tuning/Evaluation The increasing use of methods in natural language processing (NLP) which are based on huge corpora require that the lexical, morpho-syntactic and syntactic homogeneity of texts be mastered. We have developed a methodology and associate tools for text calibration or ''profiling'' within the ELRA benchmark called ''Contribution to the construction of contemporary french corpora'' based on multivariate analysis of linguistic features. We have integrated these tools within a modular architecture based on a generic model allowing us on the one hand flexible annotation of the corpus with the output of NLP and statistical tools and on the other hand retracing the results of these tools through the annotation layers back to the primary textual data. This allows us to justify our interpretations.
256 An Approach to Lexical Development for Inflectional Languages We describe a method for the semi-automatic development of morphological lexicons. The method aims at using minimal pre-existing resources and only relies upon the existence of a raw text corpus and a database of inflectional classes. No lexicon or list of base forms is assumed. The method is based on a contrastive approach, which generates hypothetical entries based on evidence drawn form a corpus, and selects the best candidates by heuristically comparing the candidate entries. The reliance upon inflectional information and the use of minimal resources make this approach particularly suitable for highly inflectional, lower-density languages. A prototype tool has been developed for Modern Greek.
257 Some Language Resources and Tools for Computational Processing of Portuguese at INESC In the last few years automatic processing tools and studies based on corpora have became of a great importance for the community. The possibility of evaluating and developing such tools and studies depends on the availability of language resources. For the Portuguese language in its several national varieties these resources are not enough to meet the community needs. In this paper some valuable resources are presented, such as a multifunctional lexicon, general-purpose lexicons for European and Brazilian Portuguese and corpus processing tools.
258 Minimally Supervised Japanese Named Entity Recognition: Resources and Evaluation Approaches to named entity recognition that rely on hand-crafted rules and/or supervised learning techniques have limitations in terms of their portability into new domains as well as in the robustness over time. For the purpose of overcoming those limitations, this paper evaluates named entity chunking and classification techniques in Japanese named entity recognition in the context of minimally supervised learning. This experimental evaluation demonstrates that the minimally supervised learning method proposed here improved the performance of the seed knowledge on named entity chunking and classification. We also investigated the correlation between performance of the minimally supervised learning and the sizes of the training resources such as the seed set as well as the unlabeled training data.
259 Evaluation of a Generic Lexical Semantic Resource in Information Extraction We have created an information extraction system that allows users to train the system on a domain of interest. The system helps to maximize the effect of user training by applying WordNet to rule generation and validation. The results show that, with careful control, WordNet is helpful in generating useful rules to cover more instances and hence improve the overall performance. This is particularly true when the training set is small, where F-measure is increased from 65% to 72%. However, the impact of WordNet diminishes as the size of training data increases. This paper describes our experience in applying WordNet to this system and gives an evaluation of such an effort.
260 The Establishment of Motorola's Human Language Data Resource Center: Addressing the Criticality of Language Resources in the Industrial Setting Within the human language technology (HLT) field it is widely understood that the availability (and effective utilization) of voluminous, high quality language resources is both a critical need and a critical bottleneck in the advancement and deployment of cutting edge HLT applications. Recently formed (inter-)national human language resource (HLR) consortia (e.g., LDC, ELRA,...) have made great strides in addressing this challenge by distributing a rich array of pre-competitive HLRs. However, HLT application commercialization will continue to demand that HLRs specific to target products (and complementary to consortially available resources) be created. In recognition of the general criticality of HLRs, Motorola has recently formed the Human Language Data Resource Center (HLDRC) to streamline and leverage our HLR creation and utilization efforts. In this paper, we use the specific case of the Motorola HLDRC to help examine the goals and range of activities which fall into the purview of a company- internal HLR organization, look at ways in which such an organization differs from (and is similar to) HLR consortia, and explore some issues with respect to implementation of a wholly within-company HLR organization like the HLDRC.
261 IPA Japanese Dictation Free Software Project Large vocabulary continuous speech recognition (LVCSR) is an important basis for the application development of speech recognition technology. We had constructed Japanese common LVCSR speech database and have been developing sharable Japanese LVCSR programs/models by the volunteer-based efforts. We have been engaged in the following two volunteer-based activities. a) IPSJ (Information Processing Society of Japan) LVCSR speech database working group. b) IPA (Information Technology Promotion Agency) Japanese dictation free software project. IPA Japanese dictation free software project (April 1997 to March 2000) is aiming at building Japanese LVCSR free software/models based on the IPSJ LVCSR speech database (JNAS) and Mainichi newspaper article text corpus. The software repository as the product of the IPA project is available to the public. More than 500 CD-ROMs have been distributed. The performance evaluation was carried out for the simple version, the fast version, and the accurate version in February 2000. The evaluation uses 200 sentence utterances from 46 speakers. The gender-independent HMM models and 20k/60k language models are used for evaluation. The accurate version with the 2000 HMM states and 16 Gaussian mixtures shows 95.9 % word correct rate. The fast version with the phonetic tied mixture HMM and the 1/10 reduced language model shows 92.2 % word correct rate and realtime speed. The CD-ROM with the IPA Japanese dictation free software and its developing workbench will be distributed by the registration to or by sending e-mail to
262 Spontaneous Speech Corpus of Japanese Design issues of a spontaneous speech corpus is described. The corpus under compilation will contain 800-1000 hour spontaneously uttered Common Japanese speech and the morphologically annotated transcriptions. Also, segmental and intonation labeling will be provided for a subset of the corpus. The primary application domain of the corpus is speech recognition of spontaneous speech, but we plan to make it useful for natural language processing and phonetic/linguistic studies also.
263 Annotating Resources for Information Extraction Trained systems for NE extraction have shown significant promise because of their robustness to errorful input and rapid adaptability. However, these learning algorithms have transferred the cost of development from skilled computational linguistic expertise to data annotation, putting a new premium on effective ways to produce high-quality annotated resources at minimal cost. The paper reflects on BBN’s four years of experience in the annotation of training data for Named Entity (NE) extraction systems discussing useful techniques for maximizing data quality and quantity.
267 The New Edition of the Natural Language Software Registry (an Initiative of ACL hosted at DFKI) In this paper we present the new version (4th edition) of the Natural Language Software Registry (NLSR), an initiative of the Association for Computational Linguistics (ACL) hosted at DFKI in Saarbr¨ ucken. We give a brief overview of the history of this repository for Natural Language Processing (NLP) software, list some related works and go into the details of the design and the implementation of the new edition.
269 Design Methodology for Bilingual Pronunciation Dictionary This paper presents the design methodology for the bilingual pronunciation dictionary of sound reference usage, which reflects the cross-linguistic, dialectal, first language (L1) interfered, biological and allophonic variations. The design methodology features 1) the comprehensive coverage of allophonic variation, 2) concise data entry composed of a balanced distribution of dialects, genders, and ages of speakers, 3) bilingual data coverage including L1-interfered speech, and 4) eurhythmic arrangements of the recording material for temporal regularity. The recording consists of the triple way comparison of 1) English sounds spoken by native English speakers, 2) Korean sounds spoken by native Korean speakers, and 3) English sounds spoken by Korean speakers. This paper also presents 1) the quality controls and 2) the structure and format of the data. The intended usage of this “sound-based” bilingual dictionary aims at 1) cross-linguistic and acoustic research, 2) application to speech recognition, synthesis and translation, and 3) foreign language learning including exercises.
271 LEXIPLOIGISSI: An Educational Platform for the Teaching of Terminology in Greece This paper introduces a project, LEXIPLOIGISSI * , which involves use of language resources for educational purposes. More particularly, the aim of the project is to develop written corpora, electronic dictionaries and exercises to enhance students’ reading and writing abilities in six different school subjects. It is the product of a small-scale pilot program that will be part of the school curriculum in the three grades of Upper Secondary Education in Greece. The application seeks to create exploratory learning environments in which digital sound, image, text and video are fully integrated through the educational platform and placed under the direct control of users who are able to follow individual pathways through data stores.
272 An HPSG-Annotated Test Suite for Polish The paper presents both conceptual and technical issues related to the construction of an HPSG test-suite for Polish. The test-suite consists of sentences of written Polish — both grammatical and ungrammatical. Each sentence is annotated with a list of linguistic phenomena it illustrates. Additionally, grammatical sentences are encoded in HPSG-style AVM structures. We describe also a technical organization of the database, as well as possible operations on it.
274 The COST 249 SpeechDat Multilingual Reference Recogniser The COST 249 SpeechDat reference recogniser is a fully automatic, language-independent training procedure for building a phonetic recogniser. It relies on the HTK toolkit and a SpeechDat(II) compatible database. The recogniser is designed to serve as a reference system in multilingual recognition research. This paper documents version 0.95 of the reference recogniser and presents results on small and medium vocabulary recognition for five languages.
275 Terminology Encoding in View of Multifunctional NLP Resources Given the existing standards for organising terminology resources, the main question raised is how to create a DB or assimilated term list with properties allowing for an efficient NLP treatment of input texts. Here, we have dealt with the output of MT and have attempted to improve terminological annotation of the input text, in order to optimize reusability and efficiency of performance. By organizing terms in BD-like tables, which provide various cross-linked indications about head properties, morpho-syntax, derivational morphology and semantic-pragmatic relations between concepts of terms, we have managed to improve functionality of resources and enable better customisation. Moreover, we have tried to view the proposed term DB organisation as part of a global account of the problem of terminology resolution on-processing via grammar based or user-machine interaction techniques for term recognition and disambiguation, since term boundary definition is generally recognised to be a complex and costly enterprise, directly related to the fact that most problem causing terminology items are multi-word units either characterized as fixed or as ad hoc or not yet fixed terms.
276 Terminology in Korea: KORTERM Korterm (Korea Terminology Research Center for Language and Knowledge Engineering) had been set up in the late August of 1998 under the auspices of Ministry of Culture and Tourism in Korea. Its major mission is to construct terminology resources and their unification, harmonization and standardization. This mission is naturally linked to the general language engineering and knowledge engineering tasks including specific-domain corpus, ontology, wordnet, electronic dictionary construction as well as language engineering products like information extraction and machine translation. This organization is located under the KAIST (Korea Advanced Institute of Science and Technology) that is one of national university specifically under the Ministry of Science and Technology. KORTERM is only one representative for terminology standardization and research with relation to Infoterm.
277 Morphological Tagging to Resolve Morphological Ambiguities The issue of this paper is to present the advantages of a morphological tagging of English in order to resolve morphological ambiguities. Such a way of tagging seems to be more efficient because it allows an intention description of morphological forms compared with the extensive collection of usual dictionaries. This method has already been experimented on French and has given promising results. It is very relevant since it allows both to bring hidden morphological rules to light which are very useful especially for foreign learners and take lexical creativity into account. Moreover, this morphological tagging was conceived in relation to the subsequent disambiguation which is mainly based on local grammars. The purpose is to create a morphological analyser being easily adaptable and modifiable and avoiding the usual errors of the ordinary morphological taggers linked to dictionaries.
278 An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research In this paper we present a tool for the evaluation of translation quality. First, the typical requirements of such a tool in the framework of machine translation (MT) research are discussed. We define evaluation criteria which are more adequate than pure edit distance and we describe how the measurement along these quality criteria is performed semi-automatically in a fast, convenient and above all consistent way using our tool and the corresponding graphical user interface.
279 GéDériF: Automatic Generation and Analysis of Morphologically Constructed Lexical Resources One of the major frequent problems in text retrieval comes from large number of words encountered which are not listed in general language dictionaries. However, it is very often the case that these words are morphologically complex, and as such have a meaning which is predictable on the basis of their structure. Furthermore, such words typically belong to specialized language uses (e.g. scientific, philosophical or media technolects). Consequently, tools for listing and analysing such words can help enrich a terminological database. The purpose of this paper is to present a system that automatically generates morphologically complex lexical French items which are not listed in dictionaries, and that furthermore provides a structural and semantic analysis of these items. The output of this system is a morphological database (currently in progress) which forms a powerful lexical resource. It will be very useful in Natural Language Processing (NLP) and in IR (Information Retrieval) applications. Indeed the system generates a potentially infinite set of complex (derived) lexical units (henceforth CLUs) automatically associated with a rich array of morpho-semantic features, and is thus capable of dealing morphologically complex structures which are unlisted in dictionaries.
281 Le Programme Compalex (COMPAraison LEXicale)  
282 Many Uses, Many Annotations for Large Speech Corpora: Switchboard and TDT as Case Studies This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out as separate projects that were dispersed both geographically and chronologically. The TDT2 corpus has also received a variety of annotations, but all directly created or managed by a core group. In both cases, issues arise involving the propagation of repairs, consistency of references, and the ability to integrate annotations having different formats and levels of detail. We describe a general framework whereby these issues can be addressed successfully.
283 Accessibility of Multilingual Terminological Resources - Current Problems and Prospects for the Future In this paper we analyse the various problems in making multilingual terminological resources available to users. Different levels of diversity and incongruence among such resources are discussed. Previous standardization efforts are reviewed. As a solution to the lack of co-ordination and compatibility among an increasing number of ‘standard’ interchange formats, a higher level of integration is proposed for the purpose of terminology-enabled knowledge sharing. The family of formats currently being developed in the SALT project is presented as a contribution to this solution.
285 Using a Formal Approach to Evaluate Grammars In this paper, we present a methodological formal approach to evaluate grammars based on a unified representation. This approach uses two kinds of criteria. The first one considers a grammar as a resource enabling the representation of particular aspects of a given language. The second is interested in using grammars in the development of lingware. The evaluation criteria are defined in a formal way. In addition, we indicate for every criterion how it would be applied.
286 Design Issues in Text-Independent Speaker Recognition Evaluation We discuss various considerations that have been involved in designing the past five annual NIST speaker recognition evaluations. These text-independent evaluations using conversational telephone speech have attracted state-of-the- art automatic systems from research sites around the world. The availability of appropriate data for sufficiently large test sets has been one key design consideration. There have also been variations in the specific task efinitions, the amount and type of training data provided, and the durations of the test segments. The microphone types of the handsets used, as well as the match or mismatch of training and test handsets, have been found to be important considerations that greatly affect system performance.
287 Developing Guidelines and Ensuring Consistency for Chinese Text Annotation With growing interest in Chinese Language Processing, numerous NLP tools (e.g. word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on the corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a 100-thousand-word bracketed corpus since late 1998 and plan to release it to the public summer 2000. In this paper, we will address several challenges in building the corpus, namely, creating annotation guidelines, ensuring annotation accuracy and maintaining a high level of community involvement.
288 Corpora of Slovene Spoken Language for Multi-lingual Applications The domain of spoken language technologies ranges from speech input and output systems to complex understanding and generation systems, including multi- modal systems of widely differing complexity (such as automatic dictation machines) and multilingual systems (for example automatic dialogue and translation systems). The definition of standards and evaluation methodologies for such systems involves the specification and development of highly specific spoken language corpus and lexicon resources, and measurement and evaluation tools (EAGLES Handbook 1997). This paper presents the MobiLuz spoken resources of the Slovene language, which will be made freely available for research purposes in speech technology and linguistics.
289 GRUHD: A Greek database of Unconstrained Handwriting In this paper we present the GRUHD database of Greek characters, text, digits, and other symbols in unconstrained handwriting mode. The database consists of 1,760 forms that contain 667,583 handwritten symbols and 102,692 words in total, written by 1,000 writers, 500 men and equal number of women. Special attention was paid in gathering data from writers of different age and educational level. The GRUHD database is accompanied by the GRUHD software that facilitates its installation and use and enables the user to extract and process the data from the forms selectively, depending on the application. The various types of possible installations make it appropriate for the training and validation of character recognition, character segmentation and text-dependent writer identification systems.
292 Labeling of Prosodic Events in Slovenian Speech Database GOPOLIS The paper describes prosodic annotation procedures of the GOPOLIS Slovenian speech data database and methods for automatic classi-fication of different prosodic events. Several statistical parameters concerning duration and loudness of words, syllables and allophones were computed for the Slovenian language, for the first time on such a large amount of speech data. The evaluation of the annotated data showed a close match between automatically determined syntactic-prosodic boundary marker positions and those obtained by a rule-based approach. The obtained knowledge on Slovenian prosody can be used in Slovenian speech recognition and understanding for automatic prosodic event determination and in Slovenian speech synthesis for prosody prediction.
294 NL-Translex: Machine Translation for Dutch NL-Translex is an MLIS-project which is funded jointly by the European Commission, the Dutch Language Union, the Ducth Ministry of Education, Culture and Science, the Dutch Ministry of Economic Affairs and the Flemish Institute for the Promotion of Scientific and Technological Research in Industry. The aim of this project is to develop Machine Translation components that will handle unrestricted text and translate Dutch from and into English, French and German. In addition to this practical aim, the partners in this project all have objectives relating to strategy, language policy and culture. The modules to be developed are intended primarily for use by EU institutions and the translation services of official bodies in the Member States. In this paper we describe in detail the aims and structure of the project, the user population, the available resources and the activities carried out so far, in particular the procedure followed for the call for tenders aimed at selecting a technonolgy provider. Finally, we describe the acceptance procedure, the strategic impact of the project and the dissemination plan.
295 Rarity of Words in a Language and in a Corpus A simple method was presented last year (Hlavacova & Rychly, 1999) allowing to distinguish automatically between rare and common words having the same frequency in a language corpus. The method operates with two new terms: reduced frequency and rarity. The rarity was proposed as a measure of word rareness or commonness in a language. This article deals with the rarity a bit more deeply. Its value was calculated for several different corpora and compared. Two experiments were done on the real data taken from the Czech National Corpus. Results of the first one prove that reordering of texts in the corpus does not influence the rarity of words with a high frequency in the corpus. In the second experiment, rarity of the same words in two corpora of different sizes is compared.
297 Language Resources Development at the Spanish Royal Academy This paper explains some of the most relevant issues concerning the development of language resources at the Spanish Royal Academy. Two 125-M words corpus of Spanish language (synchronic and diachronic) and three specialized corpus has been developed. Around the corpus, RAE is also developing NLP tools and resources to morpho-syntactically annotate them. Some of the most relevant are: The Computational Lexicon, the Morphological analysis tools, the Disambiguation grammars and the Tokenizer generator. The last section describes the lexicographic use of corpus materials and includes a brief description of the Corpus-based lexicographical workbench and his related tools.
298 Reusability as Easy Adaptability: A Substantial Advance in NL Technology The design and implementation of new applications in NLP at low costs mostly depends upon the availability of technologies oriented to the solution of any specific problem. The success of this task, besides the use of widely agreed formats and standards, relies upon at least two families of tools, those for managing and updating, and those for projecting an ''application view-point'' onto the data in the repository. This approach has different realizations if applied to a dictionary, a corpus, or a grammar. Some examples, taken frrom European and other industrial projects, show that reusability: a) in the building of industrial prototypes consists in the easy reconfiguration of resources (dictionary and grammar), easy portability and easy recombination of tools, by means of simple APIs, as well as on different implementation platforms: b) in the building of advanced applications still consists in the same features, together with the possibility of opening different view-points on dictionaries and grammars.
299 Looking for Errors: A Declarative Formalism for Resource-adaptive Language Checking The paper describes a phenomenon-based approach to grammar checking, which draws on the integration of different shallowNLP technologies, including morphological and POS taggers, as well as probabilistic and rule-based partial parsers. We present a declarative specification formalism for grammar checking and controlled language applications which greatly facilitates the development of checking components.
300 The Bank of Swedish The Bank of Swedish is described: affiliation, organisation, linguistic resources and tools. A point is made of the close connection between lexical research and corpus data, the broad textual coverage from Modern Swedish to Old Swed-ish, the official status of the organisation and its connection to Göteborg University. The relation to the broader scope of the comprehensive Language Database of Swedish is discussed. A few current issues of the Bank of Swedish are presented: parallell corpora production, the construction of a Swedish morphology database and sense tagging of text corpora. Finally, the updating of the Bank of Swedish concordance system is mentioned.