LREC 2000 2nd International Conference on Language Resources & Evaluation  

Conference Papers and Abstracts


Each entry below gives the paper ID, the paper title, and the abstract.
Paper 92: Target Suites for Evaluating the Coverage of Text Generators
Our goal is to evaluate the grammatical coverage of the surface realization component of a natural language generation system by means of target suites. We consider the utility of re-using for this purpose test suites designed to assess the coverage of natural language analysis/understanding systems. We find that they are of some interest, in helping inter-system comparisons and in providing an essential link to annotated corpora. But they have limitations. First, they contain a high proportion of ill-formed items which are inappropriate as targets for generation. Second, they omit phenomena such as discourse markers which are key issues in text production. We illustrate a partial remedy for this situation in the form of a text generator that annotates its own output to an externally specified standard, the TSNLP scheme.

Paper 106: Term-based Identification of Sentences for Text Summarisation
The present paper describes a methodology for automatic text summarisation of Greek texts which combines terminology extraction and sentence spotting. Since generating abstracts has proven a hard NLP task of questionable effectiveness, the paper focuses on the production of a special kind of abstract, called an extract: a set of sentences taken from the original text. These sentences are selected on the basis of the amount of information they carry about the subject content. The proposed corpus-based, statistical approach exploits several heuristics to determine the summary-worthiness of sentences. It uses statistical occurrences of terms (the TF·IDF formula) and several cue phrases to calculate sentence weights, and then extracts the top-scoring sentences, which form the extract.

Paper 275: Terminology Encoding in View of Multifunctional NLP Resources
Given the existing standards for organising terminology resources, the main question raised is how to create a DB or assimilated term list with properties allowing for efficient NLP treatment of input texts. Here, we have dealt with the output of MT and have attempted to improve the terminological annotation of the input text, in order to optimize reusability and efficiency of performance. By organizing terms in DB-like tables, which provide various cross-linked indications about head properties, morpho-syntax, derivational morphology and semantic-pragmatic relations between the concepts of terms, we have managed to improve the functionality of the resources and to enable better customisation. Moreover, we have tried to view the proposed term DB organisation as part of a global account of the problem of terminology resolution during processing, via grammar-based or user-machine interaction techniques for term recognition and disambiguation. Term boundary definition is generally recognised to be a complex and costly enterprise, directly related to the fact that most problem-causing terminology items are multi-word units, characterized either as fixed terms or as ad hoc, not yet fixed, terms.

Paper 276: Terminology in Korea: KORTERM
KORTERM (Korea Terminology Research Center for Language and Knowledge Engineering) was set up in late August 1998 under the auspices of the Ministry of Culture and Tourism in Korea. Its major mission is to construct terminology resources and to pursue their unification, harmonization and standardization. This mission is naturally linked to general language engineering and knowledge engineering tasks, including the construction of domain-specific corpora, ontologies, wordnets and electronic dictionaries, as well as language engineering products such as information extraction and machine translation. The organization is housed at KAIST (Korea Advanced Institute of Science and Technology), a national university under the Ministry of Science and Technology. KORTERM is the sole representative for terminology standardization and research in relation to Infoterm.

Paper 17: Terms Specification and Extraction within a Linguistic-based Intranet Service
This paper describes the adaptation and extension of an existing morphological system, Word Manager, and its integration into an intranet service of a large international bank. The system includes a tool for the analysis and extraction of simple and complex terms. As a side-effect, the procedure for the definition of new terms has been consolidated. The intranet service analyzes HTML pages on the fly, compares the results with the vocabulary of an in-house terminological database (CLS-TDB) and generates hyperlinks in case matches have been found. Currently, the service handles terms in both German and English. The implementation of the service for Italian, French and Spanish is under way.

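A minimal sketch of the hyperlink-generation step this abstract describes: scan page text for terms known to a terminology database and wrap matches in links. The term list, entry IDs and URL scheme are hypothetical, and the toy regex matcher below stands in for the morphological analysis the real service performs with Word Manager:

```python
# Minimal sketch: link known terms to terminology-database entries.
import re

TERMS = {
    # hypothetical in-house terms mapped to hypothetical database entry IDs
    "translation memory": "tdb-00123",
    "exchange rate": "tdb-04711",
}

def link_terms(text, base_url="https://intranet.example/tdb/"):
    # Longest terms first, so overlapping terms prefer the longer match.
    pattern = "|".join(re.escape(t)
                       for t in sorted(TERMS, key=len, reverse=True))

    def repl(match):
        entry = TERMS[match.group(0).lower()]
        return f'<a href="{base_url}{entry}">{match.group(0)}</a>'

    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(link_terms("Our translation memory stores the exchange rate data."))
```
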
Paper 109: Textual Information Retrieval Systems Test: The Point of View of an Organizer and Corpuses Provider
Amaryllis is an evaluation programme for text retrieval systems which has been carried out as two test campaigns. The second Amaryllis campaign took place in 1998/1999. Corpora of documents, topics, and the corresponding responses were first sent to each of the participating teams for system learning purposes. Corpora of new documents and a set of new topics were then supplied for evaluation purposes. Two optional tracks were added: an Internet track and an interlingual track. The first contained a test via the Internet: INIST sent topics to the system and collected responses directly, thus reducing the need for manipulations by the system designers. The second contained tests on different European Community language pairs; the corpora of documents consisted of records of questions and answers from the European Commission, in parallel official language versions, and participants could use any language pair for their tests. The aim of this paper is to give the point of view of an organizer and corpus provider (INIST) on the organization of an operation of this sort. In particular, it describes the difficulties encountered during the tests (corpus construction, translation of topics, and system evaluation), and suggests avenues to explore for future tests.

Paper 200: The (Un)Deterministic Nature of Morphological Context
The aim of this paper is to contribute to the study of context within natural language processing and to bring in aspects which, I believe, have a direct influence on the interpretation of success rates and on a more successful design of language models. This work tries to formalize the (ir)regularities and dynamic characteristics of context using techniques from the field of chaotic and non-linear systems. The observations are made on the problem of POS tagging.

Paper 196: The American National Corpus: A Standardized Resource for American English
At the first conference on Language Resources and Evaluation, Granada 1998, Charles Fillmore, Nancy Ide, Daniel Jurafsky, and Catherine Macleod proposed creating an American National Corpus (ANC) that would compare with the British National Corpus (BNC) both in balance and in size (one hundred million words). This paper reports on the progress made over the past two years in launching the project. At present, the ANC project is well underway, with commitments for support and contribution of texts from a number of publishers world-wide.

Paper 300: The Bank of Swedish
The Bank of Swedish is described: affiliation, organisation, linguistic resources and tools. A point is made of the close connection between lexical research and corpus data, the broad textual coverage from Modern Swedish to Old Swedish, the official status of the organisation and its connection to Göteborg University. The relation to the broader scope of the comprehensive Language Database of Swedish is discussed. A few current issues of the Bank of Swedish are presented: parallel corpora production, the construction of a Swedish morphology database and sense tagging of text corpora. Finally, the updating of the Bank of Swedish concordance system is mentioned.

Paper 335: The Concede Model for Lexical Databases
The value of language resources is greatly enhanced if they share a common markup with an explicit minimal semantics. Achieving this goal for lexical databases is difficult, as large-scale resources can realistically only be obtained by up-translation from pre-existing dictionaries, each with its own proprietary structure. This paper describes the approach we have taken in the Concede project, which aims to develop compatible lexical databases for six Central and Eastern European languages. Starting with sample entries from original presentation-oriented electronic representations of dictionaries, we transformed the data into an intermediate TEI-compatible representation to provide a common baseline for evaluating and comparing the dictionaries. We then developed a more restrictive encoding, formalised as an XML DTD with a clearly-defined semantic interpretation. We present this DTD and discuss a sample conversion from TEI, together with an application which hyperlinks an HTML representation of the dictionary to on-line concordancing over a corpus.

Paper 156: The Context (not only) for Humans
Our context considerations will be practically oriented; we will explore the specification of a context scope in Czech morphological tagging. By morphological tagging/annotation we mean the automatic/manual disambiguation of the output of morphological analysis. The Prague Dependency Treebank (PDT) serves as a source of annotated data. The main aim is to concentrate on the evaluation of the influence of the chosen context on the tagging accuracy.

Paper 274: The COST 249 SpeechDat Multilingual Reference Recogniser
The COST 249 SpeechDat reference recogniser is a fully automatic, language-independent training procedure for building a phonetic recogniser. It relies on the HTK toolkit and a SpeechDat(II) compatible database. The recogniser is designed to serve as a reference system in multilingual recognition research. This paper documents version 0.95 of the reference recogniser and presents results on small and medium vocabulary recognition for five languages.

Paper 1: The Cost258 Signal Generation Test Array
This paper describes a benchmark for Analysis-Modification-Synthesis Systems (AMSS), which constitute the back-ends of all concatenative speech synthesis systems. After introducing the motivations and principles underlying this initiative, we present a first anonymous objective evaluation comparing the performance of five such AMSS.

Paper 260: The Establishment of Motorola's Human Language Data Resource Center: Addressing the Criticality of Language Resources in the Industrial Setting
Within the human language technology (HLT) field it is widely understood that the availability (and effective utilization) of voluminous, high quality language resources is both a critical need and a critical bottleneck in the advancement and deployment of cutting edge HLT applications. Recently formed (inter-)national human language resource (HLR) consortia (e.g., LDC, ELRA, ...) have made great strides in addressing this challenge by distributing a rich array of pre-competitive HLRs. However, HLT application commercialization will continue to demand that HLRs specific to target products (and complementary to consortially available resources) be created. In recognition of the general criticality of HLRs, Motorola has recently formed the Human Language Data Resource Center (HLDRC) to streamline and leverage our HLR creation and utilization efforts. In this paper, we use the specific case of the Motorola HLDRC to help examine the goals and range of activities which fall into the purview of a company-internal HLR organization, look at ways in which such an organization differs from (and is similar to) HLR consortia, and explore some issues with respect to implementation of a wholly within-company HLR organization like the HLDRC.

Paper 45: The EUDICO Project, Multi Media Annotation over the Internet
In this paper we describe a software environment that facilitates media annotation and analysis of media related corpora over the internet. We describe the general architecture of this environment and introduce our Abstract Corpus Model, with which we isolate corpus-specific formats from the annotation and analysis tools. The main set of tools is described by giving examples of their usage. Finally, we discuss features regarding the distributed character of this environment.

Paper 70: The Evaluation of Systems for Cross-language Information Retrieval
We describe the creation of an infrastructure for the testing of cross-language text retrieval systems within the context of the Text REtrieval Conferences (TREC) organised by the US National Institute of Standards and Technology (NIST). The approach adopted and the issues that had to be taken into consideration when building a multilingual test suite and developing appropriate evaluation procedures to test cross-language systems are described. From 2000 on, a cross-language evaluation activity for European languages known as CLEF (Cross-Language Evaluation Forum) will be coordinated in Europe, while TREC will focus on Asian languages. The implications of the move to Europe and the intentions for the future are discussed.

Paper 217: The Influence of Scenario Constraints on the Spontaneity of Speech. A Comparison of Dialogue Corpora
In this article we compare two large scale dialogue corpora recorded in different settings. The main differences are unrestricted turn-taking vs. push-to-talk button, and a complex vs. a simple negotiation task. In our investigation we found that vocabulary, durations of turns, words and sounds, as well as prosodic features, are influenced by differences in the setting.

Paper 313: The ISLE Corpus of Non-Native Spoken English
For the purpose of developing pronunciation training tools for second language learning, a corpus of non-native speech data has been collected, which consists of almost 18 hours of annotated speech signals spoken by Italian and German learners of English. The corpus is based on 250 utterances selected from typical second language learning exercises. It has been annotated at the word and the phone level, to highlight pronunciation errors such as phone realisation problems and misplaced word stress assignments. The data has been used to develop and evaluate several diagnostic components, which can be used to produce corrective feedback of unprecedented detail to a language learner.

Paper 166: The MATE Workbench Annotation Tool, a Technical Description
The MATE workbench is a tool which aims to simplify the tasks of annotating, displaying and querying speech or text corpora. It is designed to help humans create language resources, and to make it easier for different groups to use one another's data, by providing one tool which can be used with many different annotation schemes. Any annotation scheme which can be converted to XML can be used with the workbench, and display formats optimised for particular annotation tasks are created using a transformation language similar to XSLT. The workbench is written entirely in Java, which means that it is platform-independent.

Paper 29: The Multi-layer Language Knowledge Base of Chinese NLP
This paper introduces the effort to build a multi-layer knowledge base for Chinese NLP which combines list-based, rule-based and corpus-based language information. Different kinds of information are designed to solve the different kinds of problems encountered in Chinese NLP. The whole knowledge base is designed for theoretical consistency and can easily be put into practice in application systems.

Paper 267: The New Edition of the Natural Language Software Registry (an Initiative of ACL hosted at DFKI)
In this paper we present the new version (4th edition) of the Natural Language Software Registry (NLSR), an initiative of the Association for Computational Linguistics (ACL) hosted at DFKI in Saarbrücken. We give a brief overview of the history of this repository for Natural Language Processing (NLP) software, list some related works and go into the details of the design and the implementation of the new edition.

Paper 315: The PAROLE Program
The PAROLE project (Contract LE2-4017) was launched in May 1996 by the Commission of the European Communities, at the initiative of DG XIII (Telecommunications, Information Market and Exploitation of Research). PAROLE is not just a project for gathering and creating a corpus. We are creating a true architectural model whose richness and quality will constitute strategic assets for European linguistic studies. This two-level architecture will link together the two major components, morphological and syntactic.

Paper 110: The Spoken Dutch Corpus. Overview and First Evaluation
In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overall description of the project, its aims, structure and organization. It then goes on to discuss the considerations - both methodological and practical - that have played a role in the design of the corpus as well as in its compilation and annotation. The paper concludes with an account of the data that are available in the first release of the first part of the corpus that came out on March 1st, 2000.

Paper 366: The Treatment of Adjectives in SIMPLE: Theoretical Observations
This paper discusses the issues that play a part in the characterization of adjectival meaning. It describes the SIMPLE ontology for adjectives and provides insight into the morphological, syntactic and semantic aspects that are included in the SIMPLE adjectival templates.

Paper 26: The TREC-8 Question Answering Track
The TREC-8 Question Answering track was the first large-scale evaluation of domain-independent question answering systems. This paper summarizes the results of the track, including both an overview of the approaches taken to the problem and an analysis of the evaluation methodology. Retrieval results for the more stringent condition in which system responses were limited to 50 bytes showed that explicit linguistic processing was more effective than the bag-of-words approaches that are effective for document retrieval. The use of multiple human assessors to judge the correctness of the systems' responses demonstrated that assessors have legitimate differences of opinion as to correctness even for fact-based, short-answer questions. Evaluations of question answering technology will need to accommodate these differences since eventual end-users of the technology will have similar differences.

Paper 253: The Universal XML Organizer: UXO
The integrated editor UXO is the result of ongoing research and development by the text-technology group at Bielefeld. Being a full-featured XML-based editing system, it also allows the structured annotated data to be combined with information imported from relational databases through an integrated JDBC interface. The mapping processes between different levels of annotation can be programmed either in the integrated Scheme interpreter, or by extending the functionality of UXO using the predefined Java API.

Paper 338: Tools for the Generation of Morphological Entries in Dictionaries
The lexicographer's tool introduced in the report represents a semiautomatic system to generate the section of morphological information for Estonian words in dictionary entries. Estonian is a language with a complicated morphology featuring (1) rich inflection and (2) marked and diverse morpheme variation, applying both to stems and formatives. The kernel of the system is a rule-based automatic morphology with separate program modules for every linguistic subsystem, such as syllabification, recognition of part of speech and type of inflection, stem variation, and morpheme and allomorph combinatorics. The modules function as rule interpreters applying formal grammars in an editable text format. The system enables generation of the following: (1) part of speech, (2) type of inflection, (3) inflected forms, (4) morphonological marking: degree of quantity, morpheme boundaries (stem+formative, component boundaries in compounds), (5) morphological references for inflected forms considerably different from the headword. The system is configurable, so that the inflected forms to be generated, the style of morphonological marking and the criteria for reference selection are all up to the user to choose. Full automation of the system application is restricted mainly by morphological homonymy.

Paper 194: Towards a Query Language for Annotation Graphs
The multidimensional, heterogeneous, and temporal nature of speech databases raises interesting challenges for representation and query. Recently, annotation graphs have been proposed as a general-purpose representational framework for speech databases. Typical queries on annotation graphs require path expressions similar to those used in semistructured query languages. However, the underlying model is rather different from the customary graph models for semistructured data: the graph is acyclic and unrooted, and both temporal and inclusion relationships are important. We develop a query language and describe optimization techniques for an underlying relational representation.

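To make the data model concrete, here is a minimal sketch of an annotation graph over a relational-style representation, with a temporal-inclusion query of the kind the abstract says such a language must support. The schema, the sample arcs and the query are illustrative assumptions; the paper's actual query language is not shown here:

```python
# Minimal sketch: arcs of an annotation graph with time-anchored endpoints.
from dataclasses import dataclass

@dataclass
class Arc:
    start: float   # time of the source node
    end: float     # time of the target node
    level: str     # annotation tier, e.g. "word" or "phone" (assumed names)
    label: str

arcs = [
    Arc(0.00, 0.42, "word", "hello"),
    Arc(0.00, 0.18, "phone", "HH"),
    Arc(0.18, 0.30, "phone", "AH"),
    Arc(0.30, 0.42, "phone", "L OW"),
]

def contained_in(inner, outer):
    # Temporal inclusion: the inner arc lies within the outer arc's span.
    return outer.start <= inner.start and inner.end <= outer.end

# "Find all phones inside the word 'hello'" -- the kind of inclusion query
# a semistructured path language would express declaratively.
word = next(a for a in arcs if a.level == "word" and a.label == "hello")
phones = [a.label for a in arcs if a.level == "phone" and contained_in(a, word)]
print(phones)  # ['HH', 'AH', 'L OW']
```
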
Paper 125: Towards a Standard for Meta-descriptions of Language Resources
The aim is to improve the availability of Language Resources (LRs) on intranets and the Internet. It is suggested that this can be achieved by creating a browsable and searchable universe of meta-descriptions. This calls for the development of a standard for tagging LRs with meta-data, and for several conventions agreed within the community.

Paper 47: Towards a Strategy for a Representation of Collocations - Extending the Danish PAROLE-lexicon
We describe our attempts to formulate a pragmatic definition and a partial typology of the lexical category of 'collocation', taking both lexicographical and computational aspects into consideration. This provides a suitable basis for encoding collocations in an NLP lexicon. Further, this paper explains the principles of an operational encoding strategy which is applied to a core section of the typology, namely to subtypes of verbal collocation. This strategy is adapted to a pre-defined lexicon model which has been developed in the PAROLE project. The work is carried out within the framework of the STO project, the aim of which is to extend the Danish PAROLE-lexicon. The encoding of collocations, in addition to single-word lemmas, greatly increases the lexical and linguistic coverage, and thereby also the usability of the lexicon as a whole. Decisions concerning the selection of the most frequent types of collocation to be encoded are based on empirical data, i.e. corpus-based recognition. We present linguistic descriptions with a focus on some characteristic syntactic features of collocations that are observed in a newspaper corpus. We then give a few prototypical examples provided with formalised descriptions in order to illustrate the restriction features. Finally, we discuss the perspectives of the work done so far.

Paper 28: Towards A Universal Tool For NLP Resource Acquisition
This paper describes an approach to developing a universal tool for eliciting, from a non-expert human user, knowledge about any language L. The purpose of this elicitation is rapid development of NLP systems. The approach is described on the example of the syntax module of the Boas knowledge elicitation system for a quick ramp-up of a standard transfer-based machine translation system from L into English. The preparation of knowledge for the MT system is carried out in two stages: the acquisition of descriptive knowledge about L, and the use of that descriptive knowledge to derive operational knowledge for the system. Boas guides the acquisition process using data-driven, expectation-driven and goal-driven methodologies.

Paper 115: Towards More Comprehensive Evaluation in Anaphora Resolution
The paper presents a package of evaluation tasks for anaphora resolution. We argue that these newly added tasks, which have been carried out on Mitkov's (1998) knowledge-poor, robust approach, provide a better picture of the performance of an anaphora resolution system. The paper also outlines future work on the development of a 'consistent' evaluation environment for anaphora resolution.

Paper 192: Transcribing with Annotation Graphs
Transcriber is a tool for manual annotation of large speech files. It was originally designed for the broadcast news transcription task. The annotation file format was derived from previous formats used for this task, and many related features were hard-coded. In this paper we present a generalization of the tool based on the annotation graph formalism, and on a more modular design. This will allow us to address new tasks, while retaining Transcriber's simple, crisp user-interface which is critical for user acceptance.

Paper 12: TransSearch: A Free Translation Memory on the World Wide Web
A translation memory is an archive of existing translations, structured in such a way as to promote translation re-use. Under this broad definition, an interactive bilingual concordancing tool like the RALI's TransSearch system certainly qualifies as a translation memory. This paper describes the Web-based version of TransSearch, which, for the last three years, has given Internet users access to a large English-French translation database made up of Canadian parliamentary debates. Despite the fact that the RALI has done very little to publicize the availability of TransSearch on the Web, the system has been attracting a growing and impressive number of users. We present some basic data on who is using TransSearch and how, data which was collected from the system's log file and by means of a questionnaire recently added to our Web site. We conclude with a call to the international community to help set up a network of bi-textual databases like TransSearch, which translators around the world could freely access over the Web.

Paper 330: Tuning Lexicons to New Operational Scenarios
In this paper the role of the lexicon within typical NLP-based application tasks is analysed. A large-scale semantic lexicon is studied within the framework of an NLP application. The coverage of the lexicon with respect to the target domain and a (semi-)automatic tuning approach have been evaluated. The impact of a corpus-driven inductive architecture aimed at compensating for gaps in lexical information is thus measured and discussed.

Paper 86: Turkish Electronic Living Lexicon (TELL): A Lexical Database
The purpose of the TELL project is to create a database of Turkish lexical items which reflects actual speaker knowledge, rather than the normative and phonologically incomplete dictionary representations on which most of the existing phonological literature on Turkish is based. The database, accessible over the internet, should greatly enhance phonological, morphological, and lexical research on the language. The current version of TELL consists of the following components:
• Some 15,000 headwords from the 2nd and 3rd editions of the Oxford Turkish-English dictionary, orthographically represented.
• Proper names, including 175 place names from a guide of Istanbul, and 5,000 place names from a telephone area code directory of Turkey.
• Phonemic transcriptions of the pronunciations of the same headwords and place names embedded in various morphological contexts. (Eliciting suffixed forms along with stems exposes any morphophonemic alternations that the headwords in question are subject to.)
• Etymological information, garnered from a variety of etymological sources.
• Roots for a number of morphologically complex headwords.
The paper describes the construction of the current structure of the TELL database, points out potential questions that could be addressed by putting the database into use, and specifies goals for the next phase of the project.

Paper 221: Typographical and Orthographical Spelling Error Correction
This paper focuses on selection techniques for best correction of misspelt words at the lexical level. Spelling errors are introduced by either cognitive or typographical mistakes. A robust spelling correction algorithm is needed to cover both cognitive and typographical errors. For the most effective spelling correction system, various strategies are considered in this paper: ranking heuristics, correction algorithms, and correction priority strategies for the best selection. The strategies also take account of error types, syntactic information, word frequency statistics, and character distance. The findings show that it is very hard to generalise the spelling correction strategy for various types of data sets such as typographical, orthographical, and scanning errors.

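A minimal sketch of ranked spelling correction in the spirit of the strategies this abstract surveys: lexicon candidates are scored by edit distance, with ties broken by word frequency. The lexicon, the frequencies, the distance cutoff and the tie-breaking rule are illustrative assumptions, not the paper's tuned strategy:

```python
# Minimal sketch: rank correction candidates by edit distance, then frequency.
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, lexicon_freq, max_dist=2):
    candidates = []
    for cand, freq in lexicon_freq.items():
        d = edit_distance(word, cand)
        if d <= max_dist:
            # Rank primarily by distance; break ties by corpus frequency.
            candidates.append((d, -freq, cand))
    return [c for _, _, c in sorted(candidates)]

lexicon = {"spelling": 120, "spieling": 10, "sapling": 5}
print(correct("speling", lexicon))  # ['spelling', 'spieling', 'sapling']
```
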
Paper 254: TyPTex: Inductive Typological Text Classification by Multivariate Statistical Analysis for NLP Systems Tuning/Evaluation
The increasing use of methods in natural language processing (NLP) which are based on huge corpora requires that the lexical, morpho-syntactic and syntactic homogeneity of texts be mastered. We have developed a methodology and associated tools for text calibration or "profiling" within the ELRA benchmark called "Contribution to the construction of contemporary French corpora", based on multivariate analysis of linguistic features. We have integrated these tools into a modular architecture based on a generic model, allowing us, on the one hand, to annotate the corpus flexibly with the output of NLP and statistical tools and, on the other hand, to retrace the results of these tools through the annotation layers back to the primary textual data. This allows us to justify our interpretations.