Resources

Which corpora do we make available to the community?

Named Entity Recognition in Questions: Towards a Golden Collection

A set of nearly 5,500 manually annotated questions to be used as a training corpus for machine-learning-based NER systems, plus 500 annotated questions for testing. The named entities in these questions were identified and classified into three categories: Person, Location and Organization. We extended and refined the guidelines of the NER shared task of the Conference on Computational Natural Language Learning (CoNLL) 2003 to meet the demands presented by questions.

These corpora are freely available for research purposes. You can download the training corpus here, and the testing corpus here.

Further details on building these question corpora can be found in [5]. We kindly ask you to cite this publication whenever you use the resource.

An English-Portuguese parallel corpus of questions

A corpus of nearly 6,000 questions manually translated into Portuguese, split into training and test sets. Two applications of this corpus are statistical machine translation (SMT) of questions between English and Portuguese, and question classification in Portuguese.

These corpora are freely available for research purposes. You can download the training corpus here, and the testing corpus here.

Further details on building these question corpora can be found in [14]. We kindly ask you to cite this publication whenever you use the resource.

Rule-based Question Classifier

Our rule-based question classifier couples two strategies to obtain its results:

a direct (pattern) match is performed for specific questions: "Who is Mozart?" is directly mapped into Human:Description;
headwords are identified and mapped into the question classification (by using WordNet): in the question "What is Australia's national flower?" the headword "flower" is identified and mapped into the category Entity:Plant.
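The two strategies above can be sketched as a small cascade. This is only an illustration, not the classifier distributed here: the regular expression, the `HEADWORD_TO_CATEGORY` dictionary (a toy stand-in for the WordNet mapping), and the `naive_headword` heuristic (which takes the last non-possessive word instead of applying head rules to a parse tree) are all assumptions made for the example.

```python
import re

# Hypothetical direct patterns (regex -> category). Only the
# "Who is Mozart?" -> Human:Description pairing comes from the text above.
DIRECT_PATTERNS = [
    (re.compile(r"^[Ww]ho (?:is|was) [A-Z][\w.'-]*(?: [A-Z][\w.'-]*)*\s*\?$"),
     "Human:Description"),
]

# Toy stand-in for the WordNet-based mapping (assumption, one entry only).
HEADWORD_TO_CATEGORY = {"flower": "Entity:Plant"}


def naive_headword(question):
    """Very rough headword guess: the last word that is not a possessive.

    The real system extracts headwords from a parse tree with head rules;
    this heuristic is only a placeholder for the example.
    """
    tokens = re.findall(r"[A-Za-z]+(?:'s)?", question)
    content = [t for t in tokens if not t.endswith("'s")]
    return content[-1].lower() if content else None


def classify(question):
    """Try the direct patterns first, then fall back to the headword map."""
    q = question.strip()
    for pattern, category in DIRECT_PATTERNS:
        if pattern.match(q):
            return category
    return HEADWORD_TO_CATEGORY.get(naive_headword(q))


print(classify("Who is Mozart?"))
print(classify("What is Australia's national flower?"))
```

Trying the direct patterns first keeps questions such as "Who is Mozart?" from being classified by their headword, which would point at an individual rather than a description.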

Here we make available the resources we used in the rule-based question classifier, namely:

Patterns: Question Patterns and Question Tree Patterns
Head rules: Headword Extraction Rules and Headword Extraction Rules, specific to questions. The head rules used in this work are a heavily modified version of those given in Michael Collins' 1999 dissertation (pp. 236-238), specifically tailored to extract headwords from questions.
WordNet mapping: WordNetMap.xml, the mapping between the question categories and several clusters that aggregate similar synsets.

Test Collections for Question Answering systems

For the evaluation of Just.Ask, we built a gold corpus, named GoldWebQA, composed of: a set of snippets retrieved from the Web that contain possible answers to the questions; all the correct answers occurring in the snippets, regardless of the format in which they are stated; and the category of each question, according to Li and Roth's question type taxonomy. All the questions were plausible and valid when the snippets were retrieved (the corpus represents a snapshot of the Web).

GoldWebQA questions

GoldWebQA Google snippets

GoldWebQA Yahoo snippets

The file containing the GoldWebQA questions is formatted according to the following:

<file> ::= <questions>

<questions> ::= <question> <questions> | ""

<question> ::= "question " <question-number> ". Q:" <question-content> " - " <answers> " - " <question-category>

<answers> ::= "{" <answer-content> "}" <answers> | ""
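A minimal sketch of a reader for this format follows. The grammar does not say how records are separated or what the terminals look like, so the code assumes one question per line, a numeric question number, and answers that contain no braces. The sample line is invented for illustration and is not taken from the corpus.

```python
import re
from typing import List, NamedTuple, Optional


class GoldQuestion(NamedTuple):
    number: int
    content: str
    answers: List[str]
    category: str


# "question " <number> ". Q:" <content> " - " <answers> " - " <category>
QUESTION_RE = re.compile(
    r"^question (\d+)\. Q:(.*?) - ((?:\{[^{}]*\})*) - (\S.*)$"
)


def parse_question(line: str) -> Optional[GoldQuestion]:
    """Parse one question record; return None if the line does not match."""
    match = QUESTION_RE.match(line.strip())
    if match is None:
        return None
    number, content, answers, category = match.groups()
    return GoldQuestion(
        number=int(number),
        content=content.strip(),
        # Each answer is wrapped in its own pair of braces.
        answers=re.findall(r"\{([^{}]*)\}", answers),
        category=category.strip(),
    )


# Invented sample line (assumption), shaped after the grammar above.
sample = "question 1. Q:Who is Mozart? - {composer}{Austrian composer} - HUMAN:DESCRIPTION"
print(parse_question(sample))
```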

The files containing the GoldWebQA snippets are formatted according to the following:

<file> ::= <lines>

<lines> ::= <line> <lines> | ""

<line> ::= <snippet> | "\n"

<snippet> ::= <snippet-number> " Question=" <question> "{" <snippet-description> "}"

<snippet-description> ::= "URL=" <url-page-snippet> " Title=" <title-page-snippet> " Content=" <content-snippet> " Rank=" <rank-snippet> "Structured=" <structured>

<structured> ::= "true" | "false"
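The snippet files can be read along the same lines. The sketch below makes several assumptions that the grammar leaves open: snippet numbers and ranks are integers, URLs contain no spaces, each snippet sits on a single line, and the `<question>` field is kept as an opaque string. Since the grammar shows no space before "Structured=", the pattern accepts that field with or without one. Both sample lines are invented.

```python
import re
from typing import Dict, List

SNIPPET_RE = re.compile(
    r"^(?P<number>\d+) Question=(?P<question>.*?)"
    r"\{URL=(?P<url>\S*) Title=(?P<title>.*?) Content=(?P<content>.*)"
    r" Rank=(?P<rank>\d+)\s*Structured=(?P<structured>true|false)\}$"
)


def parse_snippets(text: str) -> List[Dict[str, object]]:
    """Parse every snippet line in `text`, skipping blank lines."""
    snippets = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # the grammar allows empty lines between snippets
        match = SNIPPET_RE.match(line)
        if match is None:
            continue  # lines that do not fit the assumed shape are ignored
        fields = match.groupdict()
        snippets.append({
            "number": int(fields["number"]),
            "question": fields["question"],
            "url": fields["url"],
            "title": fields["title"],
            "content": fields["content"],
            "rank": int(fields["rank"]),
            "structured": fields["structured"] == "true",
        })
    return snippets


# Invented sample data (assumption), shaped after the grammar above.
sample = (
    "1 Question=Who is Mozart?{URL=http://example.org/a Title=Mozart "
    "Content=Mozart was a composer. Rank=1Structured=false}\n"
    "\n"
    "2 Question=Who is Mozart?{URL=http://example.org/b Title=A title "
    "Content=Some text Rank=2 Structured=true}\n"
)
for snippet in parse_snippets(sample):
    print(snippet["number"], snippet["rank"], snippet["structured"])
```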

Further details on building these question corpora can be found in [13]. We kindly ask you to cite this publication whenever you use the resource.

Question/Answer pairs

To evaluate The-MENTOR, our system that generates multiple-choice tests, we collected a set of 139 natural language questions with their respective correct answers. Some questions were taken from an online trivia source; others were manually created. All questions are factoids pertaining to 10 fine-grained categories, distributed as follows:

Coarse-grained Category	Fine-grained Category	#Q/A pairs
`ENTITY`	`ENTITY:CURRENCY`	5
	`ENTITY:SPORT`	3
	`ENTITY:LANGUAGE`	4
`HUMAN`	`HUMAN:INDIVIDUAL`	28
`LOCATION`	`LOCATION:CITY`	11
	`LOCATION:COUNTRY`	24
	`LOCATION:MOUNTAIN`	2
	`LOCATION:OTHER`	27
	`LOCATION:STATE`	12
`NUMERIC`	`NUMERIC:DATE`	23

Here you can find the set of 139 questions. Further details on this corpus can be found in [7]. We kindly ask you to cite this publication whenever you use the resource.

What software do we make available to the community?

JWNLSimple

JWNLSimple is an abstraction layer for JWNL, a WordNet framework with database support. It also includes an example SQL script describing a possible conversion of SQL lexicons to the formalism accepted by JWNL, allowing procedural queries on WordNet-like lexicons. JWNLSimple can be downloaded here as a jar file that also contains a copy of JWNL with some changes described in the technical report. An example SQL lexicon adaptation script can be downloaded here. A demo is also available, for the official English WordNet, here.

Further details on JWNLSimple can be found in [13]. We kindly ask you to cite this publication whenever you use the resource.

Question Classification

Question classification, a question answering subtask, aims to associate a category to each question, typically representing the semantic class of its answer. Question classification is of major importance to QA, since it can help:

to narrow down the number of possible answer candidates;
to choose an appropriate answer extraction strategy.

Moreover, a misclassified question can hinder the ability to reach a correct answer, because it can lead to wrong assumptions about the question.
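As a rough illustration of how a category can both narrow the candidates and select an extraction strategy, the sketch below maps two categories to regular expressions. The `EXTRACTORS` table and both patterns are hypothetical and much cruder than what a real QA system would use; they are not taken from the software distributed here.

```python
import re

# Hypothetical category -> extraction pattern table (assumption).
EXTRACTORS = {
    # Four-digit numbers as a crude stand-in for dates.
    "NUMERIC:DATE": re.compile(r"\b\d{4}\b"),
    # Sequences of two or more capitalized words as a crude stand-in for names.
    "HUMAN:INDIVIDUAL": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b"),
}


def extract_candidates(category, snippet):
    """Return the answer candidates that fit the question category."""
    pattern = EXTRACTORS.get(category)
    if pattern is None:
        return []  # no strategy known for this category
    return pattern.findall(snippet)


snippet = "Wolfgang Amadeus Mozart was born in Salzburg in 1756."
print(extract_candidates("NUMERIC:DATE", snippet))
print(extract_candidates("HUMAN:INDIVIDUAL", snippet))
```

The same snippet yields different candidates depending on the category, which is why a misclassified question can send answer extraction in the wrong direction.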

The software used to classify questions can be downloaded here.

Just.Ask - A multi-pronged approach to question answering

Please refer to Systems.
