LREC 2000 2nd International Conference on Language Resources & Evaluation



List of all papers and abstracts

Paper ID / Paper Title / Abstract
69 A Bilingual Electronic Dictionary for Frame Semantics Frame semantics is a linguistic theory which is currently gaining ground. The creation of lexical entries for a large number of words presupposes the development of complex lexical acquisition techniques in order to identify the vocabulary for describing the elements of a 'frame'. In this paper, we show how a lexical-semantic database compiled on the basis of a bilingual (English-French) dictionary can be used to identify some general frame elements which are relevant in a frame-semantic approach such as the one adopted in the FrameNet project (Fillmore & Atkins 1998, Gahl 1998). The database has been systematically enriched with explicit lexical-semantic relations holding between some elements of the microstructure of the dictionary entries. The manifold relationships have been labelled in terms of lexical functions, based on Mel'cuk's notion of co-occurrence and lexical-semantic relations in Meaning-Text Theory (Mel'cuk et al. 1984). We show how these lexical functions can be used and refined to extract potential realizations of frame elements such as typical instruments or typical locatives, which are believed to be recurrent elements in a large number of frames. We also show how the database organization of the computational lexicon makes it possible to readily access implicit and translationally-relevant combinatorial information.
14 A Comparison of Summarization Methods Based on Task-based Evaluation A task-based evaluation scheme has been adopted as a new method of evaluation for automatic text summarization systems. It evaluates the performance of a summarization system in a given task, such as information retrieval and text categorization. This paper compares ten different summarization methods based on information retrieval tasks. In order to evaluate the system performance, the subjects’ speed and accuracy are measured in judging the relevance of texts using summaries. We also analyze the similarity of summaries in order to investigate the similarity of the methods. Furthermore, we analyze what factors can affect evaluation results, and describe the problems that arose from our experimental design, in order to establish a better evaluation scheme.
175 A Computational Platform for Development of Morphologic and Phonetic Lexica Statistical approaches in speech technology, whether based on statistical language models, trees, hidden Markov models or neural networks, are the driving forces behind the creation of language resources (LR), e.g. text corpora, pronunciation lexica and speech databases. This paper presents the system architecture for rapid construction of morphologic and phonetic lexica for the Slovenian language. The integrated graphical user interface focuses on the morphologic and phonetic aspects of the Slovenian language and allows experts to perform analyses efficiently.
226 A Flexible Infrastructure for Large Monolingual Corpora In this paper we describe a flexible and portable infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed on the basis of a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as various query types and tools for information extraction. Among them, the extraction and usage of sentence-based word collocations is discussed in detail. Finally we give an overview of different applications for this language resource. A WWW interface allows for public access to most of the data and information extraction tools.
201 A Framework for Cross-Document Annotation We introduce a cross-document annotation toolset that serves as a corpus-wide knowledge base for linguistic annotations. This implemented system is designed to address the unique cognitive demands placed on human annotators who must relate information that is expressed across document boundaries.
133 A French Phonetic Lexicon with Variants for Speech and Language Processing This paper reports on a project aiming at the semi-automatic development of a large orthographic-phonetic lexicon for French, based on the Multext dictionary. It details the various stages of the project, with an emphasis on the methodological and design aspects. Information regarding the lexicon’s content is also given, together with a description of interface tools which should facilitate its exploitation.
314 A Graphical Parametric Language-Independent Tool for the Annotation of Speech Corpora Robust speech recognizers and synthesizers require well-annotated corpora in order to be trained and tested, thus making speech annotation tools crucial in speech technology. It is very important that these tools are parametric so that they can handle various directory and file structures and deal with different waveform and transcription formats. They should also be language-independent, provide a user-friendly interface or even interact with other kinds of speech processing software. In this paper we describe an efficient tool able to cope with the above requirements. It was first developed for the annotation of the SpeechDat-II recordings, and then it was extended to incorporate the additional features of the SpeechDat-Car project. Nevertheless, it has been parameterized so that it is not restricted to the SpeechDat format and Greek, and it can handle any other formalism and language.
135 A Methodology for Evaluating Spoken Language Dialogue Systems and Their Components As spoken language dialogue systems (SLDSs) proliferate in the marketplace, the issue of SLDS evaluation has come to attract wide interest from research and industry alike. Yet it is only recently that spoken dialogue engineering researchers have come to face SLDS evaluation in its full complexity. This paper presents results of the European DISC project concerning technical evaluation and usability evaluation of SLDSs and their components. The paper presents a methodology for complete and correct evaluation of SLDSs and their components, together with a generic evaluation template for describing the evaluation criteria needed.
243 A Multi-view Hyperlexicon Resource for Speech and Language System Development New generations of integrated multimodal speech and language systems with dictation, readback or talking face facilities require multiple sources of lexical information for development and evaluation. Recent developments in hyperlexicon development offer new perspectives for the development of such resources which are at the same time practically useful, computationally feasible, and theoretically well-founded. We describe the specification, three-level lexical document design principles, and implementation of a MARTIF document structure and several presentation structures for a terminological lexicon, including both on-demand access and full hypertext lexicon compilation. The underlying resource is a relational lexical database with SQL querying and access via a CGI internet interface. This resource is mapped on to the hypergraph structure which defines the macrostructure of the hyperlexicon.
235 A New Methodology for Speech Corpora Definition from Internet Documents In this paper, a new methodology for speech corpora definition from internet documents is described, with the aim of recording a large speech database dedicated to the training and testing of acoustic models for speech recognition. In the first section, the Web robot which is in charge of collecting Web pages from the Internet is presented, then the mechanism that filters Web text into French sentences is explained. Some information about the corpus organization (90% for training and 10% for testing) is given. In the third section, the phoneme distribution of the corpus is presented and compared with other studies of the French language. Finally, tools and planning for recording the speech database with more than one hundred speakers are described.
113 A Novelty-based Evaluation Method for Information Retrieval In information retrieval research, precision and recall have long been used to evaluate IR systems. However, given that a number of retrieval systems resembling one another are already available to the public, it is valuable to retrieve novel relevant documents, i.e., documents that cannot be retrieved by those existing systems. In view of this problem, we propose an evaluation method that favors systems retrieving as many novel documents as possible. We also used our method to evaluate systems that participated in the IREX workshop.
140 A Parallel Corpus of Italian/German Legal Texts This paper presents the creation of a parallel corpus of Italian and German legal documents which are translations of one another. The corpus, which contains approximately 5 million words, is primarily intended as a resource for (semi-)automatic terminology acquisition. The guidelines of the Corpus Encoding Standard have been applied for encoding structural information, segmentation information, and sentence alignment. Since the parallel texts have a one-to-one correspondence on the sentence level, building a perfect sentence alignment is rather straightforward. As a result, the corpus also constitutes a valuable testbed for the evaluation of alignment algorithms. The paper discusses the intended use of the corpus, the various phases of corpus compilation, and basic statistics.
248 A Parallel English-Japanese Query Collection for the Evaluation of On-Line Help Systems An experiment concerning the creation of parallel evaluation data for information retrieval is presented. A set of English queries was gathered for the domain of word processing using Lotus Ami Pro. A set of Japanese queries was then created from these. The answers to the queries were elicited from eight respondents comprising four native speakers of each language. We first describe how the queries were created and the answers elicited. We then present analyses of the responses in each language. The results show a lower level of agreement between respondents than was expected. We discuss a refinement of the elicitation process which is designed to address this problem as well as measuring the integrity of individual respondents.
348 A Platform for Dutch in Human Language Technologies As ICT increasingly forms a part of our daily life, it becomes more and more important that all citizens can make use of their native languages in all communicative situations. For the development of successful applications and products for Dutch, basic provisions are required. The development of the basic material that is lacking is an expensive undertaking which exceeds the capacity of the individuals involved. Collaboration between the various agents (policy, knowledge infrastructure and industry) in the Netherlands and Flanders is required, and the existence of the Dutch Language Union (Nederlandse Taalunie) facilitates this co-operation. The responsible ministers decided to set up a Dutch-Flemish platform for Dutch in Human Language Technologies. The purpose of the platform is the further construction of an adequate digital language infrastructure for Dutch, so that industry can develop the applications that will guarantee that citizens in the Netherlands and Flanders can use their own language in their communication within the information society, and that the Dutch language area remains a full player in a multilingual Europe. This paper will show some of the efforts that have been undertaken.
68 A Proposal for the Integration of NLP Tools using SGML-Tagged Documents In this paper we present the strategy used for an integration, in a common framework, of the NLP tools developed for Basque during the last ten years. The documents used as input and output of the different tools contain TEI-conformant feature structures (FS) coded in SGML. These FSs describe the linguistic information that is exchanged among the integrated analysis tools. The tools integrated until now are a lexical database, a tokenizer, a wide-coverage morphosyntactic analyzer, and a general purpose tagger/lemmatizer. In the future we plan to integrate a shallow syntactic parser. Due to the complexity of the information to be exchanged among the different tools, FSs are used to represent it. Feature structures are coded following the TEI's DTD for FSs, and Feature Structure Definition descriptions (FSD) have been thoroughly defined. The use of SGML for encoding the I/O streams flowing between programs forces us to formally describe the mark-up, and provides software to check that this mark-up holds invariantly in an annotated corpus. A library of Abstract Data Types representing the objects needed for the communication between the tools has been designed and implemented. It offers the necessary operations to get the information from an SGML document containing FSs, and to produce the corresponding output according to a well-defined FSD.
174 A Robust Parser for Unrestricted Greek Text In this paper we describe a method for the efficient parsing of real-life Greek texts at the surface syntactic level. A grammar consisting of non-recursive regular expressions describing Greek phrase structure has been compiled into a cascade of finite state transducers used to recognize syntactic constituents. The implemented parser lends itself to applications where large scale text processing is involved, and fast, robust, and relatively accurate syntactic analysis is necessary. The parser has been evaluated against a ca. 34,000-word corpus of financial and news texts and achieved promising precision and recall scores.
362 A Self-Expanding Corpus Based on Newspapers on the Web A Unix-based system is presented which automatically collects newspaper articles from the web, converts the texts, and includes them in a newspaper corpus. This corpus can be searched from a web browser. The corpus currently contains 70 million words and grows by 4 million words each month.
165 A Semi-automatic System for Conceptual Annotation, its Application to Resource Construction and Evaluation The CONCERTO project, primarily concerned with the annotation of texts for their conceptual content, combines automatic linguistic analysis with manual annotation to ensure the accuracy of fact extraction, and to encode content in a rich knowledge representation framework. The system provides annotation tools, automatic multi-level linguistic analysis modules, a partial parsing formalism with a more user-friendly language than standard regular expression languages, XML-based document management, and a powerful knowledge representation and query facility. We describe the architecture and functionality of the system, how it can be adapted for a range of resource construction tasks, and how it can be configured to compute statistics on the accuracy of its automatic analysis components.
337 A Software Toolkit for Sharing and Accessing Corpora Over the Internet This paper describes the Translational English Corpus (TEC) and the software tools developed in order to enable the use of the corpus remotely, over the internet. The model underlying these tools is based on an extensible client-server architecture implemented in Java. We discuss the data and processing constraints which motivated the TEC architecture design and its impact on the efficiency and scalability of the system. We also suggest that the kind of distributed processing model adopted in TEC could play a role in fostering the availability of corpus linguistic resources to the research community.
161 A Step toward Semantic Indexing of an Encyclopedic Corpus This paper investigates a method for extracting and acquiring knowledge from linguistic resources. In particular, we propose an NLP-based architecture for building a semantic network out of an XML online encyclopedic corpus. The general application underlying this work is a question-answering system on proper nouns within an encyclopedia.
111 A Strategy for the Syntactic Parsing of Corpora: from Constraint Grammar Output to Unification-based Processing This paper presents a strategy for syntactic analysis based on the combination of two different parsing techniques: lexical syntactic tagging and phrase structure syntactic parsing. The basic proposal is to take advantage of the good results of lexical syntactic tagging to improve the overall performance of unification-based parsing. The syntactic functions attached to every word by the lexical syntactic tagging are used as head features in the unification-based grammar, and form the basis for the grammar rules.
132 A Text->Meaning->Text Dictionary and Process In this article we deal with various applications of a multilingual semantic network named The Integral Dictionary. We review different commercial applications that use semantic networks and we show the results obtained with the Integral Dictionary. The details of the semantic calculations are not given here, but we show that, contrary to the WordNet semantic net, the Integral Dictionary provides most of the data and relations needed for these calculations. The article presents results and discussion on lexical expansion, lexical reduction, WSD, query expansion, lexical translation extraction, document summarization, e-mail sorting, catalogue access and information retrieval. We conclude that a resource like the Integral Dictionary can become a good new step for all those who have tried to compute semantics with WordNet, and that the complementarity between the two dictionaries could be seriously studied in a shared project.
66 A Treebank of Spanish and its Application to Parsing This paper presents joint research between a Spanish team and an American one on the development and exploitation of a Spanish treebank. Such treebanks for other languages have proven valuable for the development of high-quality parsers and for a wide variety of language studies. However, when the project started, at the end of 1997, there was no syntactically annotated corpus for Spanish. This paper describes the design of such a treebank and its initial application to parser construction.
181 A Unified POS Tagging Architecture and its Application to Greek This paper proposes a flexible and unified tagging architecture that could be incorporated into a number of applications like information extraction, cross-language information retrieval, term extraction, or summarization, while providing an essential component for subsequent syntactic processing or lexicographical work. A feature-based multi-tiered approach (FBT tagger) is introduced to part-of-speech tagging. FBT is a variant of the well-known transformation based learning paradigm aiming at improving the quality of tagging highly inflective languages such as Greek. Additionally, a large experiment concerning the Greek language is conducted and results are presented for a variety of text genres, including financial reports, newswires, press releases and technical manuals. Finally, the adopted evaluation methodology is discussed.
363 A Web-based Advanced and User Friendly System: The Oslo Corpus of Tagged Norwegian Texts A general purpose text corpus meant for linguists and lexicographers needs to satisfy quality criteria on at least four different levels. The first two criteria are fairly well established; the corpus should have a wide variety of texts and be tagged according to a fine-grained system. The last two criteria are much less widely appreciated, unfortunately. One has to do with the variety of search criteria: the user should be allowed to search for any information contained in the corpus, in any possible combination. In addition, the search results should be presented in a choice of ways. The fourth criterion has to do with accessibility. It is a rather surprising fact that while user interfaces tend to be simple and self-explanatory in most areas of life represented electronically, corpus interfaces are still extremely user-unfriendly. In this paper, we present a corpus whose interface and search options we have given a lot of thought: the Oslo Corpus of Tagged Norwegian Texts.
105 A Web-based Text Corpora Development System One of the most important starting points for any NLP endeavor is the construction of text corpora of appropriate size and quality. This paper presents a web-based text corpora development system which focuses both on the size and the quality of these corpora. The quantitative problem is solved by using the Internet as a practically limitless source of texts. To ensure a certain quality, we enrich the text with relevant information, to be fit for further use, by treating in an integrated manner the problems of morpho-syntactic annotation, lexical ambiguity resolution, and diacritic characters restoration. Although at this moment it is targeted at texts in Romanian, the system can be adapted to other languages, provided that some appropriate auxiliary resources are available.
15 A Word Sense Disambiguation Method Using Bilingual Corpus This paper proposes a word sense disambiguation (WSD) method using a bilingual corpus in an English-Chinese machine translation system. A mathematical model is constructed to disambiguate words in terms of contextual phrasal collocation. A rule learning algorithm is proposed, and an application algorithm for the learned rules is also provided, which can increase the recall ratio. Finally, an experimental analysis of the algorithm is given. Its application gives an increase of 10% in precision.
44 A Word-level Morphosyntactic Analyzer for Basque This work presents the development and implementation of a full morphological analyzer for Basque, an agglutinative language. Several problems (phrase structure inside word-forms, noun ellipsis, multiplicity of values for the same feature and the use of complex linguistic representations) have forced us to go beyond the morphological segmentation of words, and to include an extra module that performs a full morphosyntactic parsing of each word-form. A unification-based word-level grammar has been defined for that purpose. The system has been integrated into a general environment for the automatic processing of corpora, using TEI-conformant SGML feature structures.
75 Abstraction of the EDR Concept Classification and its Effectiveness in Word Sense Disambiguation The relation between the degree of abstraction of a concept and the explanation capability (validity and coverage) of the conceptual description, which is the constraint held between concepts, is clarified experimentally by performing an operation called concept abstraction. This is the procedure that chooses a certain set of lower-level concepts in a concept hierarchy and maps the set to one or more upper-level (abstract) concepts. We took three abstraction techniques, the flat depth, flat size, and flat probability methods, for the degree of abstraction. Taking these methods and degrees as parameters, we applied concept abstraction to the EDR Concept Classifications and performed a word sense disambiguation test. The test set and the disambiguation knowledge were extracted as co-occurrence expressions from the EDR Corpora. Through the test, we found that the flat probability method gives the best result. We also carried out an evaluation by comparing the abstracted hierarchy with one produced by human introspection and found that the flat size method gives the results most similar to the human ones. These results should help clarify the appropriate level of detail of a concept hierarchy for a given application purpose.
283 Accessibility of Multilingual Terminological Resources - Current Problems and Prospects for the Future In this paper we analyse the various problems in making multilingual terminological resources available to users. Different levels of diversity and incongruence among such resources are discussed. Previous standardization efforts are reviewed. As a solution to the lack of co-ordination and compatibility among an increasing number of ‘standard’ interchange formats, a higher level of integration is proposed for the purpose of terminology-enabled knowledge sharing. The family of formats currently being developed in the SALT project is presented as a contribution to this solution.
356 Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition This paper reports on a project for the collection of sound scene data. Sound scene data are necessary for studies such as sound source localization, sound retrieval, sound recognition and hands-free speech recognition in real acoustical environments. There are many kinds of sound scenes in real environments. A sound scene is characterized by its sound sources and room acoustics. The number of combinations of sound sources, source positions and rooms is huge in real acoustical environments. However, the sound in these environments can be simulated by convolution of isolated sound sources and impulse responses. As isolated sound sources, a hundred kinds of non-speech sounds as well as speech sounds have been collected. The impulse responses are collected in various acoustical environments. In this paper, the progress of our sound scene database project and its application to environmental sound recognition are described.
347 Acquisition of Linguistic Patterns for Knowledge-based Information Extraction In this paper we present a new method of automatic acquisition of linguistic patterns for Information Extraction, as implemented in the CICERO system. Our approach combines lexico-semantic information available from the WordNet database with collocation data extracted from training corpora. Due to the open-domain nature of the WordNet information and the immediate availability of large collections of texts, our method can be easily ported to open-domain Information Extraction.
374 Addizionario: an Interactive Hypermedia Tool for Language Learning In this paper we present the hypermedia linguistic laboratory ''Addizionario'', an open and flexible software tool aimed at studying Italian either as native or as foreign language. The product is directed to various categories of users: school children who can perform in a pleasant and appealing manner various tasks generally considered difficult and boring, such as dictionary look-up, word definition and vocabulary expansion; teachers who can use it to prepare didactic units specifically designed to meet the needs of their students; psychologists and therapists who can use it as an aid to detect impaired development and learning in the child; and editors of children’s dictionaries who can access large quantities of material for the creation of attractive, easy-to-use tools which take into account the capacities, tastes and interests of their users.
256 An Approach to Lexical Development for Inflectional Languages We describe a method for the semi-automatic development of morphological lexicons. The method aims at using minimal pre-existing resources and only relies upon the existence of a raw text corpus and a database of inflectional classes. No lexicon or list of base forms is assumed. The method is based on a contrastive approach, which generates hypothetical entries based on evidence drawn from a corpus, and selects the best candidates by heuristically comparing the candidate entries. The reliance upon inflectional information and the use of minimal resources make this approach particularly suitable for highly inflectional, lower-density languages. A prototype tool has been developed for Modern Greek.
91 An Architecture for Document Routing in Spanish: Two Language Components, Pre-processor and Parser This paper describes the language components of a system for Document Routing in Spanish. The system identifies relevant terms for classification within the documents involved by means of natural language processing techniques. These techniques are based on the isolation and normalization of syntactic units considered relevant for the classification, especially noun phrases, but also other constituents built around verbs, adverbs, pronouns or adjectives. After a general introduction to the research project, the second section compares our approach with other previous and current approaches, and the third describes the corpora used for evaluating the system. The linguistic analysis architecture, including pre-processing and two different levels of syntactic analysis, is described in the fourth and fifth sections, while the last section is dedicated to a comparative analysis of the results obtained from processing the corpora introduced in the third section. Certain future developments of the system are also included in this last section.
278 An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research In this paper we present a tool for the evaluation of translation quality. First, the typical requirements of such a tool in the framework of machine translation (MT) research are discussed. We define evaluation criteria which are more adequate than pure edit distance and we describe how the measurement along these quality criteria is performed semi-automatically in a fast, convenient and above all consistent way using our tool and the corresponding graphical user interface.
60 An Experiment of Lexical-Semantic Tagging of an Italian Corpus The availability of semantically tagged corpora is becoming a very important and urgent need for training and evaluation within a large number of applications. Such corpora are also the natural application and accompaniment of semantic lexicons, for which they constitute both a useful testbed for evaluating their adequacy and a repository of corpus examples for the attested senses. It is therefore essential that sound criteria are defined for their construction and that a specific methodology is set up for the treatment of the various semantic phenomena relevant to this level of description. In this paper we present some observations and results concerning an experiment of manual lexical-semantic tagging of a small Italian corpus performed within the framework of the ELSNET project. The ELSNET experimental project has to be considered a feasibility study. It is part of a preparatory and training phase, started with the Romanseval/Senseval experiment (Calzolari et al., 1998), and ending with the lexical-semantic annotation of larger quantities of semantically annotated text, such as the syntactic-semantic Treebank which is going to be annotated within an Italian National Project (SI-TAL). Indeed, the results of the ELSNET experiment have been of utmost importance for the definition of the technical guidelines for the lexical-semantic level of description of the Treebank.
272 An HPSG-Annotated Test Suite for Polish The paper presents both conceptual and technical issues related to the construction of an HPSG test suite for Polish. The test suite consists of sentences of written Polish, both grammatical and ungrammatical. Each sentence is annotated with a list of linguistic phenomena it illustrates. Additionally, grammatical sentences are encoded in HPSG-style AVM structures. We also describe the technical organization of the database, as well as possible operations on it.
176 An Open Architecture for the Construction and Administration of Corpora The use of language corpora for a variety of purposes has increased significantly in recent years. General corpora are now available for many languages, but research often requires more specialized corpora. The rapid development of the World Wide Web has greatly improved access to data in electronic form, but research has tended to focus on corpus annotation, rather than on corpus building tools. Therefore many researchers are building their own corpora, solving problems independently, and producing project-specific systems which cannot easily be re-used. This paper proposes an open client-server architecture which can service the basic operations needed in the construction and administration of corpora, but allows customisation by users in order to carry out project-specific tasks. The paper is based partly on recent practical experience of building a corpus of 10 million words of Written Business English from webpages, in a project which was co-funded by ELRA and the University of Wolverhampton.
371 An Open Source Grammar Development Environment and Broad-coverage English Grammar Using HPSG The LinGO (Linguistic Grammars Online) project's English Resource Grammar and the LKB grammar development environment are language resources which are freely available for download for any purpose, including commercial use. Executable programs and source code are both included. In this paper, we give an outline of the LinGO English grammar and LKB system, and discuss the ways in which they are currently being used. The grammar and processing system can be used independently or combined to give a central component which can be exploited in a variety of ways. Our intention in writing this paper is to encourage more people to use the technology, which supports collaborative development on many levels.
251 An Optimised FS Pronunciation Resource Generator for Highly Inflecting Languages We report on a new approach to grapheme-phoneme transduction for large-scale German spoken language corpus resources using explicit morphotactic and graphotactic models. Finite state optimisation techniques are introduced to reduce lexicon development and production time, with a speed increase factor of 10. The motivation for this tool is the problem of creating large pronunciation lexica for highly inflecting languages using morphological out-of-vocabulary (MOOV) word modelling, a subset of the general OOV problem of non-attested word forms. A given spoken language system which uses fully inflected word forms performs much worse with highly inflecting languages (e.g. French, German, Russian) for a given stem lexicon size than with less highly inflecting languages (e.g. English) because of the `morphological handicap' (ratio of stems to inflected word forms), which for German is about 1:5. However, the problem is worse for current speech recogniser development techniques, because a specific corpus never contains all the inflected forms of a given stem. Non-attested MOOV forms must therefore be `projected' using a morphotactic grammar, plus table lookup for irregular forms. Enhancement with statistical methods is possible for regular forms, but does not help much with large, heterogeneous technical vocabularies, where extensive manual lexicon construction is still used. The problem is magnified by the need for defining pronunciation variants for inflected word forms; we also propose an efficient solution to this problem.
59 An XML-based Representation Format for Syntactically Annotated Corpora This paper discusses a general approach to the description and encoding of linguistic corpora annotated with hierarchically structured syntactic information. A general format can be motivated by the variety and incompatibility of existing annotation formats. By using XML as a representation format the theoretical and technical problems encountered can be overcome.
193 Annotating a Corpus to Develop and Evaluate Discourse Entity Realization Algorithms: Issues and Preliminary Results We are annotating a corpus with information relevant to discourse entity realization, and especially the information needed to decide which type of NP to use. The corpus is being used to study correlations between NP type and certain semantic or discourse features, to evaluate hand-coded algorithms, and to train statistical models. We report on the development of our annotation scheme, the problems we have encountered, and the results obtained so far.
134 Annotating Communication Problems Using the MATE Workbench The increasing commercialisation and sophistication of language engineering products reinforces the need for tools and standards in support of a more cost-effective development and evaluation process than has been possible so far. This paper presents results of the MATE project, which was launched in response to the need for standards and tools in support of creating, annotating, evaluating and exploiting spoken language resources. Focusing on the MATE workbench, we illustrate its functionality and usability through its use for markup of communication problems.
321 Annotating Events and Temporal Information in Newswire Texts If one is concerned with natural language processing applications such as information extraction (IE), which typically involve extracting information about temporally situated scenarios, the ability to accurately position key events in time is of great importance. To date only minimal work has been done in the IE community concerning the extraction of temporal information from text, and the importance of the task, together with its difficulty, suggests that a concerted effort be made to analyse how temporal information is actually conveyed in real texts. To this end we have devised an annotation scheme for annotating those features and relations in texts which enable us to determine the relative order and, if possible, the absolute time, of the events reported in them. Such a scheme could be used to construct an annotated corpus which would yield the benefits normally associated with the construction of such resources: a better understanding of the phenomena of concern, and a resource for the training and evaluation of adaptive algorithms to automatically identify features and relations of interest. We also describe a framework for evaluating the annotation and compute precision and recall for different responses.
263 Annotating Resources for Information Extraction Trained systems for Named Entity (NE) extraction have shown significant promise because of their robustness to errorful input and rapid adaptability. However, these learning algorithms have transferred the cost of development from skilled computational-linguistic expertise to data annotation, putting a new premium on effective ways to produce high-quality annotated resources at minimal cost. The paper reflects on BBN's four years of experience in annotating training data for NE extraction systems, discussing useful techniques for maximizing data quality and quantity.
84 Annotating, Disambiguating & Automatically Extending the Coverage of the Swedish SIMPLE Lexicon During recent years, the development of high-quality lexical resources for real-world Natural Language Processing (NLP) applications has gained a lot of attention from many research groups around the world, as well as from the European Union through its promotion of language engineering projects dealing directly or indirectly with this topic. In this paper, we focus on ways to extend and enrich such a resource, namely the Swedish version of the SIMPLE lexicon, in an automatic manner. The SIMPLE project (Semantic Information for Multifunctional Plurilingual Lexica) aims at developing wide-coverage semantic lexicons for 12 European languages, though on a rather small scale for practical NLP, namely fewer than 10,000 entries. Consequently, our intention is to explore and exploit various (inexpensive) methods to progressively enrich the resources and, subsequently, to annotate texts with the semantic information encoded within the framework of SIMPLE, enhanced with the semantic data from the Gothenburg Lexical DataBase (GLDB) and from large corpora.
358 Annotation of a Multichannel Noisy Speech Corpus This paper describes the activity of annotation of an Italian corpus of in-car speech material, with specific reference to the JavaSgram tool, developed with the purpose of annotating multichannel speech corpora. Some pre/post processing tools used with JavaSgram are briefly described together with a synthetic description of the annotation criteria which were adopted. The final objective is that of using the resulting corpus for training and testing a hands-free speech recognizer under development.
223 Application of WordNet ILR in Czech Word-formation The aim of this paper is to describe some typical word formation procedures in Czech and to show how the internal language relations (ILR) as they are introduced in Czech WordNet can be related to the chosen derivational processes. In our exploration we have paid attention to the roles of agent, location, instrument and subevent which yield the most regular and rich ways of suffix derivation in Czech. We also deal with the issues of the translation equivalents and corresponding lexical gaps that had to be solved in the framework of EuroWordNet 2 (confronting Czech with English) since they are basically brought about by verb prefixation (single, double, verb aspect pairs) or noun suffixation (diminutives, move in gender). Finally, we try to demonstrate that the mentioned derivational processes can be employed to extend Czech lexical resources in a semiautomatic way.
247 ARC A3: A Method for Evaluating Term Extracting Tools and/or Semantic Relations between Terms from Corpora This paper describes an ongoing project evaluating Natural Language Processing (NLP) systems. The aim of this project is to test software capabilities in automatic or semi-automatic extraction of terminology from French corpora in order to build tools used in NLP applications. We are putting forward a strategy based on qualitative evaluation. The idea is to submit the results to specialists (i.e. field specialists, terminologists and/or knowledge engineers). The research we are conducting is sponsored by the ''Association des Universites Francophones'' (AUF), an international organisation whose mission is to promote the dissemination of French as a scientific medium. The software packages submitted to this evaluation were developed by French, Canadian and US research institutions (national scientific research centres and universities) and/or companies: CNRS (France), XEROX, and LOGOS Corporation, among others.
360 ARISTA Generative Lexicon for Compound Greek Medical Terms A generative lexicon for compound Greek medical terms, based on the ARISTA method, is proposed in this paper. The concept of a representation-independent, definition-generating lexicon for compound words is introduced, following the ARISTA method. This concept is used as a basis for developing a generative lexicon of Greek compound medical terminology using the senses of the component words expressed in natural language rather than in a formal language. A Prolog program implemented for this task is presented, which is capable of computing implicit relations between the component words in a sublanguage using linguistic and extra-linguistic knowledge. An extra-linguistic knowledge base, containing knowledge derived from the domain or microcosm of the sublanguage, supports the computation of the implicit relations. The performance of the system was evaluated by automatically generating possible senses of the compound words and judging the correctness of the results by comparing them with definitions given in a medical lexicon expressed in the language of the lexicographer.
184 ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. The abstract logical model provides for a range of storage formats and promotes the reuse of tools that interact through this API. We focus first on “Annotation Graphs,” a graph model for annotations on linear signals (such as text and speech) indexed by intervals, for which efficient database storage and querying techniques are applicable. We note how a wide range of existing annotated corpora can be mapped to this annotation graph model. This model is then generalized to encompass a wider variety of linguistic “signals,” including both naturally occurring phenomena (as recorded in images, video, multi-modal interactions, etc.), as well as the derived resources that are increasingly important to the engineering of natural language processing systems (such as word lists, dictionaries, aligned bilingual corpora, etc.). We conclude with a review of the current efforts towards implementing key pieces of this architecture.
218 Automatic Assignment of Grammatical Relations This paper presents a method for the assignment of grammatical relation labels in a sentence structure. The method has been implemented in the software tool AGRA (Automatic Grammatical Relation Assigner), which is part of a project for the development of a treebank of Italian sentences and a knowledge base of Italian subcategorization frames. The annotation schema implements a notion of underspecification that arranges grammatical relations from generic to specific in a hierarchy; the software tool works with hand-coded rules, which apply heuristic knowledge (syntactic and semantic cues) to distinguish between complements and modifiers.
208 Automatic Extraction of English-Chinese Term Lexicons from Noisy Bilingual Corpora This paper describes our system, which is designed to extract English-Chinese term lexicons from noisy, complex bilingual corpora and use them as a translation lexicon to check sentence alignment results. The noisy bilingual corpora are first aligned by our improved length-based statistical approach, which can partly detect sentence omissions and insertions. A term extraction system is used to obtain term translation lexicons from the roughly aligned corpora. Then the statistical approach is used to align the corpora again. Finally, we filter the noisy bilingual texts and obtain nearly perfect alignment corpora.
302 Automatic Extraction of Semantic Similarity of Words from Raw Technical Texts In this paper we address the problem of extracting semantic similarity relations between lexical entities based on context similarities as they appear in specialized text corpora. Only general-purpose linguistic tools are utilized in order to achieve portability across domains and languages. Lexical context is extended beyond immediate adjacency but is still confined by clause boundaries. Morphological and collocational information is employed in order to make the most of the contextual data. The extracted semantic similarity relations are transformed into semantic clusters, which constitute an initial form of a domain-specific term thesaurus.
306 Automatic Generation of Dictionary Definitions from a Computational Lexicon This paper presents an automatic generator of dictionary definitions for concrete entities, based on information extracted from a Computational Lexicon (CL) containing semantic information. The aim of the adopted approach, which combines NLG techniques with the exploitation of the formalised and systematic lexical information stored in the CL, is to produce well-formed dictionary definitions free from the shortcomings of traditional dictionaries. The architecture of the system is presented, focusing on the adaptation of the NLG techniques to the specific application requirements and on the interface between the CL and the generator. Emphasis is placed on the appropriateness of the CL for the application purposes.
80 Automatic Speech Segmentation in High Noise Condition The accurate segmentation of speech and detection of end points in adverse conditions is very important for building robust automatic speech recognition (ASR) systems. Segmentation of speech is not a trivial process: in high noise conditions it is very difficult to detect weak fricatives and nasals at the ends of words. An efficient speech segmentation algorithm that is independent of any a priori defined threshold and robust to the level of disturbance signals is developed. The results show a significant improvement in robustness of the proposed algorithm with respect to traditional algorithms.
301 Automatic Style Categorisation of Corpora in the Greek Language In this article, a system is proposed for the automatic style categorisation of text corpora in the Greek language. This categorisation is based to a large extent on the type of language used in the text, for example whether or not the language used is representative of formal Greek. To arrive at this categorisation, the highly inflectional nature of the Greek language is exploited. For each text, a vector of both structural and morphological characteristics is assembled. Categorisation is achieved by comparing this vector to given archetypes using a statistics-based method. Experimental results …
227 Automatic Transliteration and Back-transliteration by Decision Tree Learning Automatic transliteration and back-transliteration across languages with drastically different alphabets and phoneme inventories, such as English/Korean, English/Japanese, English/Arabic, and English/Chinese, have practical importance in machine translation, cross-lingual information retrieval, and automatic bilingual dictionary compilation. In this paper, a bi-directional and, to some extent, language-independent methodology for English/Korean transliteration and back-transliteration is described. Our method is composed of character alignment and decision tree learning. We induce transliteration rules for each English letter and back-transliteration rules for each Korean letter. Training the decision trees requires a large set of labeled examples of transliteration and back-transliteration; however, such resources are generally not available. Our character alignment algorithm is capable of aligning English words and their Korean transliterations highly accurately in the desired way.
320 Automatically Augmenting Terminological Lexicons from Untagged Text Lexical resources play a crucial role in language technology but lexical acquisition can often be a time-consuming, laborious and costly exercise. In this paper, we describe a method for the automatic acquisition of technical terminology from domain restricted texts without the need for sophisticated natural language processing tools, such as taggers or parsers, or text corpora annotated with labelled cases. The method is based on the idea of using prior or seed knowledge in order to discover co-occurrence patterns for the terms in the texts. A bootstrapping algorithm has been developed that identifies patterns and new terms in an iterative manner. Experiments with scientific journal abstracts in the biology domain indicate an accuracy rate for the extracted terms ranging from 58% to 71%. The new terms have been found useful for improving the coverage of a system used for terminology identification tasks in the biology domain.
142 Automatically Expansion of Thesaurus Entries with a Different Thesaurus We propose a method for expanding the entries in a thesaurus using a different thesaurus constructed with another concept. This method constructs a mapping table between the concept codes of these two different thesauri. Then, almost all of the entries of the latter thesaurus are assigned the concept codes of the former thesaurus with the mapping table between them. To confirm whether this method is effective or not, we construct a mapping table between the ''Kadokawa-shin-ruigo'' thesaurus (hereafter ''ShinRuigo'') and ''Nihongo-goitaikei'' (hereafter ''Goitaikei''), and assign about 350 thousand entries with the mapping table. About 10% of the entries cannot be assigned automatically. It is shown that this method can save cost in expanding a thesaurus.
312 Automotive Speech-Recognition - Success Conditions Beyond Recognition Rates From a car manufacturer's point of view it is very important to integrate evaluation procedures into the MMI development process. When focusing on the usability evaluation of speech-input and speech-output systems, aspects beyond recognition rates must be fulfilled. Two of these conditions are discussed on the basis of user studies conducted in 1999: mental workload and distraction, and learnability.