Which resources do we make available to the community?

==== '''''Named Entity Recognition in Questions: Towards a Golden Collection''''' ====

* A set of nearly 5,500 manually annotated questions to be used as a training corpus in machine-learning-based NER systems (plus 500 annotated questions for testing). The named entities in these questions were identified and classified into the categories <code>Person</code>, <code>Location</code>, and <code>Organization</code>. We extended and particularized the guidelines of the CoNLL-2003 (Conference on Computational Natural Language Learning) shared task on NER to address the demands posed by questions.

These corpora are freely available for research purposes. You can download the training corpus [[Media:Train_5500questions_NEannotated.txt|here]] and the testing corpus [[Media:Test_500questions_NEannotated.txt|here]].
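
The exact file layout of the corpora is not detailed on this page; purely as an illustration, the following Python sketch assumes a CoNLL-2003-style format (one token per line, the named-entity tag in the last whitespace-separated column, blank lines separating questions) and loads the training file under that assumption.

<pre>
# Minimal sketch: read a CoNLL-style NE-annotated question corpus.
# Assumption: one token per line, the NE tag (e.g. B-PER / I-LOC / O) in the
# last whitespace-separated column, and blank lines separating questions.

def read_questions(path):
    questions, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                    # blank line ends a question
                if tokens:
                    questions.append((tokens, tags))
                    tokens, tags = [], []
                continue
            columns = line.split()
            tokens.append(columns[0])       # surface token
            tags.append(columns[-1])        # NE tag (last column, assumed)
        if tokens:                          # flush the final question
            questions.append((tokens, tags))
    return questions

if __name__ == "__main__":
    data = read_questions("Train_5500questions_NEannotated.txt")
    print(len(data), "questions loaded")
</pre>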

Further details on building these question corpora can be found in [[Publications#in 2010|[5]]]. We kindly ask you to cite this publication whenever you use the resource.

==== '''''Rule-based Question Classifier''''' ====

Our rule-based question classifier couples two strategies to obtain its results (a simplified sketch of this flow follows the list):

# a direct (pattern) match is performed for specific questions: ''Who is Mozart?'' is directly mapped into <code>Human:Description</code>;
# headwords are identified and mapped into a question category (using WordNet): in the question ''What is Australia's national flower?'' the headword ''flower'' is identified and mapped into the category <code>Entity:Plant</code>.
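
Purely as an illustration of how the two strategies can be coupled, here is a minimal Python sketch. The regular expression and the headword table are invented for the example, the headword is assumed to be supplied by a separate extraction step, and the classifier's actual patterns and head rules are the resources listed in the next section.

<pre>
import re

# Strategy 1 resources: made-up pattern for this example only.
PATTERNS = [
    (re.compile(r"^who (is|was) [A-Z]\w+\?$", re.IGNORECASE), "Human:Description"),
]

# Strategy 2 resources: hypothetical headword -> category table.
HEADWORD_CATEGORIES = {
    "flower": "Entity:Plant",
    "city": "Location:City",
}

def classify(question, headword):
    # Strategy 1: direct pattern match for specific question shapes.
    for pattern, category in PATTERNS:
        if pattern.match(question):
            return category
    # Strategy 2: fall back to the externally extracted headword.
    return HEADWORD_CATEGORIES.get(headword, "Unknown")

print(classify("Who is Mozart?", headword=None))                    # Human:Description
print(classify("What is Australia's national flower?", "flower"))   # Entity:Plant
</pre>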

Here we make available the resources we used in the rule-based question classifier, namely:

# Patterns: Question Patterns and Question Tree Patterns;
# The head rules used in this work, a heavily modified version of those given in Michael Collins' 1999 dissertation (pp. 236-238) and specifically tailored to extract headwords from questions: Headword Extraction Rules and Headword Extraction Rules specific to questions;
# The mapping between the question categories and several clusters that aggregate similar WordNet synsets: WordNetMap.xml (a small WordNet lookup sketch follows this list).
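
The exact format of WordNetMap.xml is not described here; as an illustration of the underlying idea, the following sketch uses NLTK's WordNet interface to tie a headword to a coarse category by walking up the hypernym hierarchy. The anchor synsets and category names are assumptions made for the example.

<pre>
from nltk.corpus import wordnet as wn   # requires nltk and its 'wordnet' data

# Illustrative only: the classifier's real synset clusters live in WordNetMap.xml;
# the anchors below are assumptions chosen for this example.
CATEGORY_ANCHORS = {
    "Entity:Plant": wn.synset("plant.n.02"),      # botanical sense of "plant"
    "Entity:Animal": wn.synset("animal.n.01"),
}

def category_for_headword(headword):
    for synset in wn.synsets(headword, pos=wn.NOUN):
        hypernyms = set(synset.closure(lambda s: s.hypernyms()))
        for category, anchor in CATEGORY_ANCHORS.items():
            if anchor in hypernyms:
                return category
    return "Unknown"

print(category_for_headword("flower"))   # Entity:Plant
</pre>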


==== '''''Test Collections for Question Answering Systems''''' ====

For the evaluation of Just.Ask, we built a gold corpus of 200 questions and their possible correct answers: the gold-QA. We also built a corpus of snippets representing a snapshot of the Web: the web corpus.

We will soon make both resources available to the research community.