Resources

From QA@L²F
Revision as of 00:46, 19 March 2012 by Pfialho (talk | contribs) (JWNLSimple: abstraction and adaptation layer for WordNet)

Which resources do we make available to the community?

Named Entity Recognition in Questions: Towards a Golden Collection

  • A set of nearly 5,500 manually annotated questions to be used as a training corpus in machine-learning-based NER systems (and 500 annotated questions for testing). The named entities in these questions were identified and classified according to the categories Person, Location and Organization. We extended and specialized the guidelines of the Conference on Computational Natural Language Learning (CoNLL) 2003 shared task on NER to meet the demands presented by questions.

These corpora are freely available for research purposes. You can download the training corpus here, and the testing corpus here.

Further details on building these question corpora can be found in [5]. We kindly ask you to cite this publication whenever you use the resource.


An English-Portuguese parallel corpus of questions

A corpus of nearly 6,000 questions manually translated into Portuguese, split into training and test sets. Two applications of this corpus are statistical machine translation (SMT) of questions between English and Portuguese, in either direction, and question classification in Portuguese.

The corpus and translation guidelines will be available soon.


Rule-based Question Classifier

Our rule-based question classifier couples two strategies to obtain its results:

  1. a direct (pattern) match is performed for specific questions: Who is Mozart? is directly mapped into Human:Description;
  2. headwords are identified and mapped into the question classification (by using WordNet): in the question What is Australia's national flower? the headword flower is identified and mapped into the category Entity:Plant.
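The two strategies can be sketched as follows. This is a minimal illustration, not the released classifier: the single regular expression, the toy hypernym table (standing in for WordNet) and the cluster-to-category map are all hypothetical, and the headword is assumed to have been extracted already.

```python
import re

# Strategy 1: hypothetical direct pattern; the released pattern files are richer.
DIRECT_PATTERNS = [
    (re.compile(r"^who (is|was) \S+( \S+)?\?$", re.IGNORECASE), "Human:Description"),
]

# Strategy 2: toy stand-ins for WordNet hypernym links and for the
# cluster-to-category mapping (both invented for this example).
HYPERNYMS = {"flower": "plant", "plant": "organism", "dollar": "currency"}
CLUSTER_TO_CATEGORY = {"plant": "Entity:Plant", "currency": "Entity:Currency"}


def category_from_headword(headword):
    """Walk up the toy hypernym chain until a mapped cluster is reached."""
    node = headword
    seen = set()
    while node is not None and node not in seen:
        if node in CLUSTER_TO_CATEGORY:
            return CLUSTER_TO_CATEGORY[node]
        seen.add(node)
        node = HYPERNYMS.get(node)
    return None


def classify(question, headword=None):
    """Try a direct pattern match first, then fall back to the headword."""
    text = question.strip()
    for pattern, category in DIRECT_PATTERNS:
        if pattern.match(text):
            return category
    if headword is not None:
        return category_from_headword(headword.lower())
    return None
```

With these toy tables, classify("Who is Mozart?") takes the direct-match route, while classify("What is Australia's national flower?", headword="flower") is resolved through the headword.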

Here we make available the resources we used in the rule-based question classifier, namely:

  1. Patterns: Question Patterns and Question Tree Patterns
  2. Headrules: the headrules used in this work are a heavily modified version of those given in Michael Collins' 1999 dissertation (pp. 236-238), specifically tailored to extract headwords from questions: Headword Extraction Rules and Headword Extraction Rules, specific to questions.
  3. The mapping between the question categories and several clusters that aggregate similar synsets: WordNetMap.xml
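How head rules of this kind select a headword from a parse tree can be sketched as follows. The rule table below is hypothetical and far smaller than the released rule files; the tuple-based tree encoding and the example parse are also assumptions made for this illustration.

```python
# A phrase is (label, [children]); a leaf is (POS tag, word).
# Hypothetical rules: parent label -> (search direction, child labels by priority).
HEAD_RULES = {
    "SBARQ": ("left", ["SQ", "WHNP"]),
    "SQ": ("left", ["NP", "VP"]),
    "WHNP": ("right", ["NN", "NNS", "NP", "WP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
}


def is_leaf(node):
    return isinstance(node[1], str)


def head_child(label, children):
    """Pick the head child: first priority label found, scanning in the rule's direction."""
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    ordered = children if direction == "left" else list(reversed(children))
    for wanted in priorities:
        for child in ordered:
            if child[0] == wanted:
                return child
    return ordered[0]  # fallback: first child in the search direction


def headword(node):
    """Follow head children down to a leaf and return its word."""
    while not is_leaf(node):
        node = head_child(node[0], node[1])
    return node[1]


# Assumed parse of "What is Australia's national flower?"
TREE = ("SBARQ", [
    ("WHNP", [("WP", "What")]),
    ("SQ", [
        ("VBZ", "is"),
        ("NP", [
            ("NP", [("NNP", "Australia"), ("POS", "'s")]),
            ("JJ", "national"),
            ("NN", "flower"),
        ]),
    ]),
    (".", "?"),
])
```

Calling headword(TREE) descends SBARQ, SQ, NP and stops at the noun flower. Preferring SQ over WHNP at the top is one way to tailor the rules to questions, so that the head is the content noun rather than the wh-word.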


Test Collections for Question Answering systems

For the evaluation of Just.Ask, we built a gold corpus with 200 questions and possible correct answers: the gold-QA. We also built a corpus with snippets, representing a snapshot of the Web: the web corpus.

We will soon make both resources available to the research community.


Question/Answer pairs

To evaluate The-MENTOR, our system that generates multiple-choice tests, we collected a set of 139 natural language questions with their respective correct answers. Some questions were taken from online trivia; others were created manually. All questions are factoids pertaining to the following 10 fine-grained categories:


Coarse-grained Category   Fine-grained Category   #Q/A pairs
ENTITY                    ENTITY:CURRENCY         5
ENTITY                    ENTITY:SPORT            3
ENTITY                    ENTITY:LANGUAGE         4
HUMAN                     HUMAN:INDIVIDUAL        28
LOCATION                  LOCATION:CITY           11
LOCATION                  LOCATION:COUNTRY        24
LOCATION                  LOCATION:MOUNTAIN       2
LOCATION                  LOCATION:OTHER          27
LOCATION                  LOCATION:STATE          12
NUMERIC                   NUMERIC:DATE            23

Here you can find the set of 139 questions. Further details on this corpus can be found in [7]. We kindly ask you to cite this publication whenever you use the resource.


JWNLSimple: abstraction and adaptation layer for WordNet

JWNLSimple is an abstraction layer for JWNL, a WordNet framework with database support. It also includes an example SQL script describing a possible conversion of SQL lexicons to the formalism accepted by JWNL, allowing procedural queries on WordNet-like lexicons. JWNLSimple can be downloaded here and the example script here. A demo is also available, for the official English WordNet, here.