The Lácio-Web: Corpora and Tools to advance Brazilian Portuguese Language Investigations and Computational Linguistic Tools


Sandra Aluisio (1), Gisele Montilha Pinheiro (1), Aline M. P. Manfrin (1), Leandro H. M. de Oliveira (1), Luiz C. Genoves Jr. (1), Stella E. O. Tagnin (2)

(1) NILC/ICMC-USP: Núcleo Interinstitucional de Lingüística Computacional (NILC), ICMC-University of São Paulo, CP 668, 13560-970 São Carlos, SP, Brazil; (2) FFLCH-USP: FFLCH – DLM, University of São Paulo, Av. Prof. Luciano Gualberto, 403, 05508-900 - São Paulo – SP, Brazil




In this paper we discuss the five requirements for building large, publicly available corpora that guided the construction of the Lácio-Web corpora and their environments: 1) a comprehensive text typology; 2) text copyright clearance, compilation and annotation scheme; 3) a user-friendly and didactic interface; 4) support for several types of research; 5) an array of associated tools. We also present the features that make the Lácio-Web corpora interesting and novel, as well as the limitations of this project, such as corpora size and balance and the non-inclusion of spoken texts in the project’s reference corpus.


Written corpora, Brazilian Portuguese, POS-annotated corpus, interface issues, text typology, corpus-associated tools



