Highlighting latent structure in documents


H. Folch (1), B. Habert (1), M. Jardino (1), N. Pernelle (2), M.C. Rousset (2), A. Termier (2)

(1) LIMSI (CNRS) - BP 133, Orsay Cedex, 91403, France
(2) LRI (CNRS-Univ. Paris XI), INRIA-Futurs (gemo team), Univ. Paris XI, Orsay Cedex, 91405, France




Extensible Markup Language (XML) is playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. It is a simple, very flexible text format used to annotate data by means of markup. XML documents can be checked for syntactic well-formedness and, through DTD and schema validation, for semantic coherence, which makes them easier to process. In particular, data with nested structure can easily be represented with embedded tags. This structured representation should be exploited by information retrieval models that take structure into account. As such, the markup is metadata and therefore a contribution to the Semantic Web. However, huge quantities of raw text exist today, and the issue is how to find an easy way to provide these texts with a sensible XML structure. Here we present an automatic method to extract tree structure from raw texts. This work has been supported by Paris XI University (BQR2002 project).


Clustering, Latent Structure, Tree Extraction, XML, Metadata, Information Retrieval

Language(s): French in the paper, but the method is language-independent
Full Paper