Organizers: Yorick Wilks, Hamish Cunningham, Wim Peters, Remi Zajac
The following papers will be presented in the order listed. Each 15-minute presentation will be followed by 5 minutes of discussion.
Distributed Thesaurus Storage and Access in a Cultural Domain
S. Boutsis, B. Georgantopoulos, S. Piperidis
Institute for Language and Speech Processing, Athens
A New Model for Language Resource Access and Distribution
W. Peters, H. Cunningham, Y. Wilks, C. McCauley
University of Sheffield
Reuse and Integration of NLP Components in the Calypso Architecture
New Mexico State University
Corpus-based Research using the Internet
H. Brugman, A. Russel, P. Wittenburg
Max Planck Institute for Psycholinguistics, Nijmegen
The CUE Corpus Access Tool
University of Birmingham
Linguistic Research Utilizing the EDR Electronic Dictionary as a
The following posters will be on display during the workshop, and presentations are planned during the breaks:
TRACTOR: TELRI Research Archive of Computational Tools and Resources
University of Birmingham
Web-Surfing the Lexicon
D. Cabrero, M. Vilares, L. Docampo, S. Sotelo
Ramon Pineiro Research Centre/Universities of Coruna and Santiago
Exploring Distributed MT
O. Streiter, A. Schmidt-Wigger, U. Reuther, C. Pease
A Proposal for an On-line Lexical Database
The final part of the workshop will consist of a panel discussion on:
The panel participants are:
Khalid Choukri, Eduard Hovy, Judith Klavans, Yorick Wilks, and Antonio Zampolli.
In general the reuse of NLP data resources (such as lexicons or corpora) has exceeded that of algorithmic resources (such as lemmatisers or parsers). However, there are still two barriers to data resource reuse:
The consequence of 2) is that there is no way to "try before you buy": no way to examine a data resource for its suitability for your needs before licensing it. Correspondingly, there is no way for a resource provider to expose limited access to their products for advertising purposes, or to gain revenue through piecemeal supply of sections of a resource.
This workshop will discuss ways to overcome these barriers. The proposers will discuss a new method for distributing and accessing language resources. It involves developing a common programmatic model of the various resource types, implemented in CORBA IDL and/or Java, along with a distributed server for non-local access. This model is being designed as part of the GATE project (General Architecture for Text Engineering) and goes under the provisional title of an Active CREOLE Server. (CREOLE: Collection of REusable Objects for Language Engineering. Currently CREOLE supports only algorithmic objects, but will be extended to data objects.)
A common model of language data resources would be a set of inheritance hierarchies making up a forest or set of graphs. At the top of the hierarchies would be very general abstractions from resources (e.g. lexicons are about words); at the leaves would be data items that were specific to individual resources. Programmatic access would be available at all levels, allowing the developer to select an appropriate level of commonality for each application.
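As a rough illustration of the hierarchy described above, the following minimal sketch shows one tree of such a forest: a very general abstraction at the top, an intermediate level, and a resource-specific leaf. It is written in Python for brevity and is not the CORBA IDL or Java model under design. All class and method names (LanguageResource, Lexicon, SenseLexicon, ToySenseLexicon, vocabulary_size) are hypothetical, and the toy data is invented for the example.

```python
from abc import ABC, abstractmethod


class LanguageResource(ABC):
    """Top of the hierarchy: any language data resource (hypothetical name)."""

    def __init__(self, name):
        self.name = name


class Lexicon(LanguageResource):
    """Very general abstraction: lexicons are about words."""

    @abstractmethod
    def words(self):
        """Return the word forms held by the lexicon."""

    def contains(self, word):
        return word in set(self.words())


class SenseLexicon(Lexicon):
    """Intermediate level: lexicons that record senses for each word."""

    @abstractmethod
    def senses(self, word):
        """Return the sense identifiers recorded for a word."""


class ToySenseLexicon(SenseLexicon):
    """Leaf: an invented resource with data items specific to itself."""

    def __init__(self, name, entries):
        super().__init__(name)
        # entries maps a word to a list of (sense_id, gloss) pairs
        self._entries = entries

    def words(self):
        return sorted(self._entries)

    def senses(self, word):
        return [sense_id for sense_id, _ in self._entries.get(word, [])]

    def gloss(self, word, sense_id):
        """Resource-specific access, available only at the leaf level."""
        for sid, text in self._entries.get(word, []):
            if sid == sense_id:
                return text
        return None


def vocabulary_size(lexicon):
    """Written against the general Lexicon level; works for any subclass."""
    return len(list(lexicon.words()))


toy = ToySenseLexicon("toy", {
    "bank": [("bank.1", "river side"), ("bank.2", "financial institution")],
    "tree": [("tree.1", "woody plant")],
})
print(vocabulary_size(toy), toy.senses("bank"), toy.gloss("bank", "bank.2"))
```

A developer who only needs word lists can program against Lexicon and ignore everything below it, while one who needs glosses works with the leaf class directly. This is one way to read "selecting an appropriate level of commonality".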
Note that although an exciting element of the work could be to provide algorithms to dynamically merge common resources (e.g. connect WordNet to Celex), what we're suggesting initially is not to develop anything substantively new, but simply to improve access to existing resources. This is NOT a new standards initiative, but a way to build on previous initiatives.
Of course, the production of a common model that fully expressed all the subtleties of all resources would be a large undertaking, but we believe that it can be done incrementally, with useful results at each stage. Early versions will stop decomposing the object structure of resources at a fairly high level, leaving the developer to handle the data structures native to the resources at the leaves of the forest. There should still be a substantial benefit in uniform access to higher-level structures.
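The incremental approach can be pictured with a small sketch, again in Python and under stated assumptions: an early version of the model offers uniform access only to a few high-level properties and hands the resource's native data structure back to the developer unchanged. The ResourceWrapper class, its methods, and the two toy resources are hypothetical and serve only to illustrate the idea.

```python
class ResourceWrapper:
    """Uniform high-level view of a resource (hypothetical sketch).

    Only the name, the kind and the list of entry keys are modelled.
    Everything below that level stays in the resource's native format.
    """

    def __init__(self, name, kind, native_data, list_keys):
        self.name = name
        self.kind = kind
        self._native = native_data
        # list_keys is a resource-specific function: native data -> entry keys
        self._list_keys = list_keys

    def keys(self):
        """Uniform access: the entry keys, whatever the native layout is."""
        return list(self._list_keys(self._native))

    def size(self):
        return len(self.keys())

    def native(self):
        """Decomposition stops here: return the native structure as-is."""
        return self._native


# Two invented resources with different native layouts.
dict_lexicon = ResourceWrapper(
    "toy-dict", "lexicon",
    {"cat": {"pos": "N"}, "run": {"pos": "V"}},
    lambda data: data.keys(),
)
tuple_lexicon = ResourceWrapper(
    "toy-tuples", "lexicon",
    [("dog", "N", 1), ("eat", "V", 2), ("red", "A", 3)],
    lambda rows: [row[0] for row in rows],
)

for resource in (dict_lexicon, tuple_lexicon):
    print(resource.name, resource.kind, resource.size(), resource.keys())
```

Code that only needs names, kinds and entry keys treats both resources identically. Code that needs part-of-speech tags still has to know each native layout, which is the cost accepted in early versions.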
Draft Program Committee
Maria Teresa Pazienza