Summary of the paper

Title Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus
Authors SoHyun Park, Afsaneh Fazly, Annie Lee, Brandon Seibel, Wenjie Zi and Paul Cook
Abstract In this paper we consider the problem of out-of-vocabulary term classification in web forum text from the automotive domain. We develop a set of nine domain- and application-specific categories for out-of-vocabulary terms. We then propose a supervised approach to classify out-of-vocabulary terms according to these categories, drawing on features based on word embeddings, and linguistic knowledge of common properties of out-of-vocabulary terms. We show that the features based on word embeddings are particularly informative for this task. The categories that we predict could serve as a preliminary, automatically-generated source of lexical knowledge about out-of-vocabulary terms. Furthermore, we show that this approach can be adapted to give a semi-automated method for identifying out-of-vocabulary terms of a particular category, automotive named entities, that is of particular interest to us.
Topics Social Media Processing, Lexicon, Lexical Database, Acquisition
Full paper Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus
Bibtex @InProceedings{PARK16.342,
  author = {SoHyun Park and Afsaneh Fazly and Annie Lee and Brandon Seibel and Wenjie Zi and Paul Cook},
  title = {Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Helene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
