LREC 2000 2nd International Conference on Language Resources & Evaluation
 


Title A Proposal for the Integration of NLP Tools using SGML-Tagged Documents
Authors Artola X. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country)
Díaz de Ilarraza A. (Faculty of Computer Science University of the Basque Country (UPV/EHU) 649 p.k., 20080 Donostia (The Basque Country))
Ezeiza N. (Faculty of Computer Science University of the Basque Country (UPV/EHU) 649 p.k., 20080 Donostia (The Basque Country))
Gojenola K. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country)
Maritxalar A. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country)
Soroa A. (Faculty of Computer Science University of the Basque Country (UPV/EHU) 649 p.k., 20080 Donostia (The Basque Country))
Keywords Feature Structures, Integration of NLP Tools, SGML, TEI-Conformant Feature Structures
Session Session WP6 - Tools in the Written Area
Full Paper 68.ps, 68.pdf
Abstract In this paper we present the strategy used to integrate, in a common framework, the NLP tools developed for Basque during the last ten years. The documents used as input and output of the different tools contain TEI-conformant feature structures (FSs) coded in SGML. These FSs describe the linguistic information that is exchanged among the integrated analysis tools. The tools integrated so far are a lexical database, a tokenizer, a wide-coverage morphosyntactic analyzer, and a general-purpose tagger/lemmatizer. In the future we plan to integrate a shallow syntactic parser. FSs are used because the information exchanged among the different tools is complex. They are coded following the TEI’s DTD for FSs, and Feature Structure Definition (FSD) descriptions have been thoroughly defined. Using SGML to encode the I/O streams flowing between programs forces us to describe the mark-up formally, and provides software to check that this mark-up holds invariantly in an annotated corpus. A library of Abstract Data Types representing the objects needed for communication between the tools has been designed and implemented. It offers the operations needed to get the information from an SGML document containing FSs and to produce the corresponding output according to a well-defined FSD.
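As an illustration of the kind of operation such a library might offer, the following is a minimal sketch and not the authors' implementation. It assumes a fragment that uses the TEI feature-structure element names (`fs`, `f`, `str`, `sym`) and happens to be well-formed XML, so that Python's standard `xml.etree.ElementTree` parser can read it. The feature names, the sample values, the `ALLOWED` set standing in for an FSD, and the helper functions `fs_to_dict` and `undeclared_features` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the element names follow the TEI feature-structure
# tag set; the feature names and values are invented for illustration.
SAMPLE = """
<fs type="morphosyntactic-analysis">
  <f name="form"><str>etxea</str></f>
  <f name="lemma"><str>etxe</str></f>
  <f name="pos"><sym value="noun"/></f>
  <f name="case"><sym value="absolutive"/></f>
</fs>
"""

# Hypothetical stand-in for a Feature Structure Definition: here it is
# reduced to the set of feature names an analysis is allowed to carry.
ALLOWED = {"form", "lemma", "pos", "case"}


def fs_to_dict(fs):
    """Convert an <fs> element into a dict, recursing into nested <fs>."""
    result = {}
    for f in fs.findall("f"):
        value = f[0]  # assumption: each <f> holds exactly one value element
        if value.tag == "fs":
            result[f.get("name")] = fs_to_dict(value)
        elif value.tag == "str":
            result[f.get("name")] = value.text
        else:
            # <sym value="..."/> and other atomic values with a value attribute
            result[f.get("name")] = value.get("value")
    return result


def undeclared_features(fs_dict, allowed):
    """Return the top-level feature names that the definition does not declare."""
    return sorted(set(fs_dict) - allowed)


analysis = fs_to_dict(ET.fromstring(SAMPLE.strip()))
print(analysis)
print(undeclared_features(analysis, ALLOWED))
```

A real SGML document would need an SGML-aware parser and validation against the actual DTD and FSD; the set-based check above only hints at how a tool could reject features that the definition does not declare.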