Migrating Language Resources from SGML to XML: the Text Encoding Initiative Recommendations


Syd Bauman (1), Alejandro Bia (2), Lou Burnard (3), Tomaž Erjavec (4), Christine Ruotolo (5), Susan Schreibman (6)

(1) Women Writers Project, Brown University, Providence, RI, USA; (2) Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain; (3) Oxford University Computing Services, Oxford University, Oxford, England; (4) Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia; (5) University of Virginia Library, University of Virginia, Charlottesville, VA, USA; (6) Maryland Institute for Technology in the Humanities, University of Maryland, College Park, MD, USA




The Text Encoding Initiative (TEI), established in 1987, has been the largest effort to standardise the computer encoding of language resources. The TEI chose SGML (Standard Generalized Markup Language) as its underlying standard, and in the years before the inception of XML a number of projects encoded their data according to an SGML DTD, TEI-compliant or otherwise. These projects could now benefit from migrating their data to XML. Apart from validation, the most compelling reason for migration is the scarcity of SGML-aware software and the abundance of XML-based tools and related recommendations. However, although XML is a subset of SGML, migration is not a trivial process, especially for large holdings of legacy language resources. For this reason, in 2002 the TEI Consortium established a Task Force (TF) on SGML to XML migration. The TF has now produced a number of reports that simplify and make explicit the conversion of SGML TEI (version P3) documents to XML TEI (version P4) documents. The reports are also relevant to a general audience of SGML users who are considering migrating their language resources to XML. This paper presents the recommendations made by the TF, concentrating on strategic considerations, the practical guide, and one case study: the conversion of the British National Corpus.


Text Encoding, Markup, TEI, SGML, XML

Language(s): The article is written in English. The techniques presented in the article apply to all text-encodable languages.
