Title

A Unicode-based Environment for Creation and Use of Language Resources

Authors

Valentin Tablan (Dept. of Computer Science, University of Sheffield, Regent Court, 211 Portobello St, Sheffield, S1 4DP, UK)

Cristian Ursu (Dept. of Computer Science, University of Sheffield, Regent Court, 211 Portobello St, Sheffield, S1 4DP, UK)

Kalina Bontcheva (Dept. of Computer Science, University of Sheffield, Regent Court, 211 Portobello St, Sheffield, S1 4DP, UK)

Hamish Cunningham (Dept. of Computer Science, University of Sheffield, Regent Court, 211 Portobello St, Sheffield, S1 4DP, UK)

Diana Maynard (Dept. of Computer Science, University of Sheffield, Regent Court, 211 Portobello St, Sheffield, S1 4DP, UK)

Oana Hamza (Dept. of Computer Science, University of Sheffield, Regent Court, 211 Portobello St, Sheffield, S1 4DP, UK)

Tony McEnery (Department of Linguistics and Modern English Language, Bowland College, Lancaster University, Lancaster, LA1 4YT, UK)

Paul Baker (Department of Linguistics and Modern English Language, Bowland College, Lancaster University, Lancaster, LA1 4YT, UK)

Mark Leisher (Computing Research Laboratory, New Mexico State University, Box 30001/MSC 3CRL, Las Cruces, NM 88003-8001, USA)

Session

WO1: LRs Platforms & Standards

Abstract

GATE is a Unicode-aware architecture, development environment and framework for building systems that process human language. It is often thought that the character set problem has been solved by the arrival of the Unicode standard. This standard is an important advance, but in practice the ability to process text in many of the world's languages is still limited. This paper describes work done in the context of the GATE project that makes use of Unicode and plugs some of the gaps for language processing R&D. First, we look at the storing and decoding of Unicode-compliant linguistic resources. Next, we detail the new capabilities for processing textual data that take advantage of the Unicode standard. Finally, we describe the solutions used to add Unicode display and editing capabilities to the graphical interface.

Keywords

Software architecture, Language engineering, GATE, Language resources, Unicode

Full Paper

215.pdf