LREC 2002 Workshop

Linguistic Knowledge Acquisition and Representation:
Bootstrapping Annotated Language Data

List of accepted papers for oral presentations

Motivation and Aims

Provision of large-scale labelled language resources, such as tagged corpora or repositories of pre-classified text documents, is crucial to steady progress across an extremely wide spectrum of research, technological and business areas in the HLT sector. However, the continuously changing demands for language-specific and application-dependent annotated data (e.g. at the syntactic or at the semantic level), indispensable for design validation and efficient software prototyping, run up daily against the labelled-data bottleneck. Hand-crafted resources are often too costly and time-consuming to be produced at a sustainable pace and, in some cases, producing them even exceeds the limits of human conscious awareness and descriptive capability.

Possible ways to circumvent, or at least minimise, this problem come from the literature on automatic knowledge acquisition and, more generally, from the machine-learning community. Annotated data are bootstrapped by training a machine-learning classifier with a small sample of pre-annotated data and by using the induced classifier to annotate more data. Co-learning provides an alternative methodology, which essentially consists in iterative cooperation of two or more independent learning systems. Another promising route consists in automatically tracking down recurrent knowledge patterns in unstructured or implicit information sources (such as free texts or machine readable dictionaries) for this information to be moulded into explicit representation structures (e.g. subcategorisation frames, syntactic-semantic templates, ontology hierarchies etc.).
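
As a rough illustration of the first strategy mentioned above (training on a small pre-annotated sample, then using the induced classifier to annotate more data), the following is a minimal self-training sketch. Everything in it is an assumption made for illustration only: the suffix-based feature set, the count-based classifier, the confidence threshold and the function names (features, train, predict, bootstrap) are hypothetical and are not taken from any particular system.

```python
from collections import Counter, defaultdict


def features(word):
    # Hypothetical feature set: the last two and last three characters.
    return [word[-2:], word[-3:]]


def train(labelled):
    """Count how often each feature co-occurs with each label."""
    counts = defaultdict(Counter)
    for word, label in labelled:
        for f in features(word):
            counts[f][label] += 1
    return counts


def predict(model, word):
    """Return (label, confidence), or (None, 0.0) if no feature is known."""
    votes = Counter()
    for f in features(word):
        votes.update(model.get(f, {}))
    total = sum(votes.values())
    if total == 0:
        return None, 0.0
    label, n = votes.most_common(1)[0]
    return label, n / total


def bootstrap(seed, unlabelled, threshold=0.8, max_rounds=10):
    """Self-training loop: retrain, label confident items, repeat."""
    labelled = list(seed)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        model = train(labelled)
        confident, rest = [], []
        for word in pool:
            label, conf = predict(model, word)
            if label is not None and conf >= threshold:
                confident.append((word, label))
            else:
                rest.append(word)
        if not confident:
            break  # nothing new was labelled, so stop iterating
        labelled.extend(confident)
        pool = rest
    return labelled, pool


# Toy usage: two seed annotations and three unlabelled words.
seed = [("walking", "VERB"), ("nation", "NOUN")]
labelled, pool = bootstrap(seed, ["talking", "station", "xyz"])
print(labelled)
print(pool)
```

In this toy run the two words that share suffixes with the seed items are added to the labelled set, while the word with unseen suffixes stays in the unlabelled pool. The threshold trades coverage against the risk of reinforcing early labelling errors, which is one of the assumptions such methods rest on.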

We believe that all these attempts at bootstrapping labelled data are not only of practical interest (for continuous updating, management and validation of dynamic resources), but also point to a number of related theoretical issues. In particular, the workshop intends to focus on the interaction between techniques for inducing structured knowledge from raw data and formal methods of linguistic knowledge representation. Gaining insights into this issue is an essential requirement for explaining the effective use of linguistic knowledge by cognitive agents. Although the cognitive and engineering views of the form and acquisition of linguistic knowledge need not be related, data from neuroscience and psychology are indeed relevant when evaluating different ways of representing information in artificial systems, and different models for linguistic knowledge acquisition.

We encourage in-depth analysis of the assumptions underlying the proposed bootstrapping methods and discussion of possible relevant connections with existing annotation and representation schemes. This investigation is likely to have significant repercussions on the way linguistic resources will be designed, developed and used for applications in the years to come. As the two aspects of knowledge representation and acquisition are profoundly interrelated, progress on both fronts can only be achieved, in our view, through a full appreciation of this deep interdependency.

Topics of Interest

Possible themes for participation are:

Development of 'data-driven' annotation/representation schemes;
Dynamic update, customisation and tuning of labelled resources through acquired data;
'Hybrid models' of linguistic knowledge extraction, whereby machine learning methods are integrated with formal structures of knowledge representation;
Incremental linguistic knowledge-bases;
Formal representation and structuring of information flow automatically acquired from texts;
Knowledge acquisition and linguistic resources lifecycle;
Linguistic knowledge acquisition and representation in cognitive tasks.

Important Dates

Deadline for workshop abstract submission	25th February 2002
Notification of acceptance	20th March 2002
Final version of paper for workshop proceedings	20th April 2002
Workshop	1st June 2002 (full day session)

Submissions

The organisers welcome contributions describing existing research related to the topics of the workshop. Each presentation will be 25 minutes long (20 minutes for presentation and 5 minutes for questions and discussion). Submissions should include: title; author(s); affiliation(s); and contact author's e-mail address, postal address, telephone and fax numbers.
Abstracts (maximum 500 words, plain-text format) must be sent to: simo@ilc.pi.cnr.it

The final version of the accepted papers should not be longer than 4,000 words or 10 A4 pages. Instructions for formatting and presentation of the final version will be sent to authors upon notification of acceptance.

Organising Committee

Alessandro Lenci	Università di Pisa (Italy)
Simonetta Montemagni	Istituto di Linguistica Computazionale, CNR (Italy)
Vito Pirrelli	Istituto di Linguistica Computazionale, CNR (Italy)

Provisional Programme Committee

Harald Baayen	Max Planck Institute for Psycholinguistics, Nijmegen (The Netherlands)
Rens Bod	University of Amsterdam (The Netherlands)
Michael R. Brent	Washington University (USA)
Nicoletta Calzolari	Istituto di Linguistica Computazionale, CNR (Italy)
Jean-Pierre Chanod	Xerox Research Centre Europe, Grenoble (France)
Walter Daelemans	University of Antwerp (Belgium)
Dekang Lin	University of Alberta, Edmonton (Canada)
Horacio Rodriguez	Universidad Politecnica de Catalunya (Spain)
Fabrizio Sebastiani	Istituto per l'Elaborazione dell'Informazione, CNR (Italy)
Lucy Vanderwende	Microsoft Research, Redmond (USA)
François Yvon	Ecole Nationale Superieure des Telecommunications, Paris (France)
Menno van Zaanen	University of Amsterdam (The Netherlands)

Contact Person

Simonetta Montemagni
Istituto di Linguistica Computazionale (ILC) - CNR
Area della Ricerca CNR
Via Alfieri 1 (San Cataldo)
I-56010 PISA (Italy)
Email: simo@ilc.pi.cnr.it

Workshop Registration Fees

The registration fees for the workshop are:

If you are not attending LREC: 140 Euro
If you are attending LREC: 90 Euro

Accepted Papers

N.	Authors	Title
	Pablo Gamallo, Alexandre Agustini, and Gabriel P. Lopes	A Corpus-Based Approach To Learn Syntactic And Semantic Subcategorisation
	Anja Belz	Learning Grammars For Noun Phrase Extraction By Partition Search
	Necip Fazil Ayan, Bonnie J. Dorr	Generating A Parsing Lexicon From Lexical-Conceptual Structure
	Lavelli, Magnini, Sebastiani	Building Thematic Lexical Resources By Bootstrapping And Machine Learning
	Fermin Moscoso del Prado Martin, Magnus Sahlgren	An Integration Of Vector-Based Semantic Analysis And Simple Recurrent Networks For The Automatic Acquisition Of Lexical Representations From Unlabeled Corpora
	Aoife Cahill, Mairead McCarthy, Josef van Genabith, Andy Way	Automatic Annotation Of The Penn-Treebank With LFG F-Structure Information
	Pavel Kveton, Karel Oliva	Detection Of Errors In Part-Of-Speech Tagged Corpora By Bootstrapping Generalized Negative N-Grams
	Laura Alonso i Alemany, Irene Castellón Masalles, Lluís Padró Cirera	X-Tractor: A Tool For Extracting Discourse Markers
	Maite Melero	Automatic Acquisition Of Selectional Properties Of Adjectives In Ser/Estar Constructions
	Marisa Jiménez	Using Decision Trees To Predict Human Nouns In Spanish Parsed Text
	Rebecca Hwa, Philip Resnik, and Amy Weinberg	Breaking The Resource Bottleneck For Multilingual Parsing
	Adam Lopez, Mike Nossal, Rebecca Hwa, Philip Resnik	Word-Level Alignment For Multilingual Resource Acquisition
	Rayid Ghani, Rosie Jones	A Comparison Of Efficacy And Assumptions Of Bootstrapping Algorithms For Training Information Extraction Systems
	Bernd Bohnet, Stefan Klatt, and Leo Wanner	A Bootstrapping Approach To Automatic Annotation Of Functional Information To Adjectives With An Application To German
	Kiril Simov, Milen Kouylekov, Alexander Simov	Incremental Specialization of an HPSG-Based Annotation Scheme
