LREC 2000 2nd International Conference on Language Resources & Evaluation  

Title: A Web-based Text Corpora Development System
Authors: Bohus Dan (Politehnica University of Timisoara, Vasile Parvan 2, 1900 Timisoara, Romania, bd1206@cs.utt.ro)
Boldea Marian (Politehnica University of Timisoara, Vasile Parvan 2, 1900 Timisoara, Romania, boldea@cs.utt.ro)
Keywords: Diacritic Characters Restoration, HTML-to-Text Conversion, Morpho-Syntactic Annotation, Part-of-Speech Tagging, Text Corpora
Session: WP7 - Corpus Projects
Abstract: One of the most important starting points for any NLP endeavor is the construction of text corpora of appropriate size and quality. This paper presents a web-based text corpora development system that addresses both the size and the quality of these corpora. The quantitative problem is solved by using the Internet as a practically limitless source of texts. To ensure quality, we enrich the text with the information needed for further use, treating in an integrated manner the problems of morpho-syntactic annotation, lexical ambiguity resolution, and diacritic character restoration. Although it currently targets texts in Romanian, the system can be adapted to other languages, provided that appropriate auxiliary resources are available.

 
