
Answering List Questions using Co-occurrence and Clustering

Majid Razmara and Leila Kosseim

Concordia University

[email protected]


Introduction

  • Question Answering

  • TREC QA track

    • Question Series

    • Corpora

Target: American Girl dolls

  • FACTOID: In what year were American Girl dolls first introduced?

  • LIST: Name the historical dolls.

  • LIST: Which American Girl dolls have had TV movies made about them?

  • FACTOID: How much does an American Girl doll cost?

  • FACTOID: How many American Girl dolls have been sold?

  • FACTOID: What is the name of the American Girl store in New York?

  • FACTOID: What corporation owns the American Girl company?

  • OTHER: Other


Hypothesis

  • Answer Instances

    • Have the same semantic entity class

    • Co-occur within sentences, or

    • Occur in different sentences sharing similar context

      • Based on Distributional Hypothesis: “Words occurring in the same contexts tend to have similar meanings” [Harris, 1954].


Target 232: “Dulles Airport”

Question 232.6: “Which airlines use Dulles?”

Ltw_Eng_20050712.0032 (AQUAINT-2)

United, which operates a hub at Dulles, has six luggage screening machines in its basement and several upstairs in the ticket counter area.

Delta, Northwest, American, British Airways and KLM share four screening machines in the basement.

Ltw_Eng_20060102.0106 (AQUAINT-2)

Independence said its last flight Thursday will leave White Plains, N.Y., bound for Dulles Airport.

Flyi suffered from rising jet fuel costs and the aggressive response of competitors, led by United and US Airways.

New York Times (Web)

Continental Airlines sued United Airlines and the committee that oversees operations at Washington Dulles International Airport yesterday, contending that recently installed baggage-sizing templates inhibited competition.

Wikipedia (Web)

At its peak of 600 flights daily, Independence, combined with service from JetBlue and AirTran, briefly made Dulles the largest low-cost hub in the United States.



Our Approach

  • Create an initial candidate list

    • Answer Type Recognition

    • Document Retrieval

    • Candidate Answer Extraction

    • It may also be imported from an external source (e.g. Factoid QA)

  • Extract co-occurrence information

  • Cluster candidates based on their co-occurrence


Answer Type Recognition

  • 9 Types:

    • Person, Country, Organization, Job, Movie, Nationality, City, State, and Other

  • Lexical Patterns

    • ^ (Name | List | What | Which) (persons | people | men | women | players | contestants | artists | opponents | students) → PERSON

    • ^ (Name | List | What | Which) (countries | nations) → COUNTRY

  • Syntagmatic Patterns for Other types

    • ^ (WDT | WP | VB | NN) (DT | JJ)* (NNS | NNP | NN | JJ | )* (NNS | NNP | NN | NNPS)(VBN | VBD | VBZ | WP | $)

    • ^ (WDT | WP | VB | NN) (VBD | VBP) (DT | JJ | JJR | PRP$ | IN)* (NNS | NNP | NN | )* (NNS | NNP | NN)

  • Type Resolution
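
The two lexical patterns above can be sketched as regular expressions. This is a minimal illustration only: the function name `answer_type`, the case-insensitive matching, and the fallback to OTHER when nothing matches are assumptions, and the syntagmatic part-of-speech patterns are not covered.

```python
import re

# The noun lists are copied from the slide; everything else is illustrative.
LEXICAL_PATTERNS = [
    (re.compile(r"^(name|list|what|which)\s+(persons|people|men|women|players|"
                r"contestants|artists|opponents|students)\b", re.IGNORECASE),
     "PERSON"),
    (re.compile(r"^(name|list|what|which)\s+(countries|nations)\b", re.IGNORECASE),
     "COUNTRY"),
]


def answer_type(question: str) -> str:
    """Return the label of the first matching pattern, or OTHER if none fires."""
    q = question.strip()
    for pattern, label in LEXICAL_PATTERNS:
        if pattern.search(q):
            return label
    return "OTHER"


print(answer_type("What countries were affected by this earthquake?"))
print(answer_type("Name the historical dolls."))
```

A question such as "Name the historical dolls." matches neither pattern, so in this sketch it falls through to OTHER, where the syntagmatic patterns and type resolution would take over.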


Type Resolution

  • Resolves the answer subtype to one of the main types

    • List previous conductors of the Boston Pops.

      • Type: OTHER, Sub Type: Conductor → PERSON

  • WordNet's Hypernym Hierarchy


Document Retrieval

  • Document Collection

    • Source Document Collection

      • Few documents

      • To extract candidates

    • Domain Document Collection

      • Many documents

      • To extract co-occurrence information

  • Query Generation

    • Google Query on Web

    • Lucene Query on Corpora


Candidate Answer Extraction

  • Term Extraction

    • Extract all terms that conform to the expected answer type

    • Person, Organization, Job

      • Intersection of several NE taggers: LingPipe, Stanford tagger & Gate NE

      • To get a better precision

    • Country, State, City, Nationality

      • Gazetteer

      • To get a better precision

    • Movie, Other

      • Capitalized and quoted terms

      • Verification of Movie: numHits(GoogleQuery intitle:Term site:www.imdb.com)

      • Verification of Other: numHits(“SubType Term” OR “Term SubType”) / numHits(“Term”)
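
The numHits expressions on this slide suggest verifying an Other-type candidate by comparing web hit counts. Below is a minimal sketch, assuming the two quoted-phrase counts form a ratio that is compared with a cut-off. The function `verify_other`, the `threshold` value, and the hit counts in `FAKE_COUNTS` are all hypothetical; a real system would query a search engine instead of a dictionary.

```python
from typing import Callable


def verify_other(term: str, subtype: str,
                 num_hits: Callable[[str], int],
                 threshold: float = 0.01) -> bool:
    """Accept `term` if it appears next to `subtype` often enough,
    relative to how often the term appears at all."""
    total = num_hits(f'"{term}"')
    if total == 0:
        return False
    together = num_hits(f'"{subtype} {term}" OR "{term} {subtype}"')
    return together / total >= threshold


# Invented hit counts, used only to exercise the function.
FAKE_COUNTS = {
    '"Keith Lockhart"': 1000,
    '"conductor Keith Lockhart" OR "Keith Lockhart conductor"': 200,
    '"Boston"': 100000,
    '"conductor Boston" OR "Boston conductor"': 50,
}


def fake_num_hits(query: str) -> int:
    return FAKE_COUNTS.get(query, 0)


print(verify_other("Keith Lockhart", "conductor", fake_num_hits))
print(verify_other("Boston", "conductor", fake_num_hits))
```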


Co-occurrence Information Extraction

  • Domain Collection Documents are split into sentences

  • Each sentence is checked as to whether it contains candidate answers
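
The two steps above can be sketched as follows. The regex sentence splitter and the case-insensitive substring matching are simplifying assumptions (the slides do not say how sentences are split or how candidates are matched), and the function names are illustrative.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurrence_counts(documents, candidates):
    """Count, per sentence, which candidates occur and which pairs co-occur."""
    term_freq = Counter()   # sentences containing each candidate
    pair_freq = Counter()   # sentences containing both members of a pair
    n_sentences = 0
    for doc in documents:
        for sentence in split_sentences(doc):
            n_sentences += 1
            lowered = sentence.lower()
            present = sorted(c for c in candidates if c.lower() in lowered)
            term_freq.update(present)
            pair_freq.update(combinations(present, 2))
    return term_freq, pair_freq, n_sentences


doc = ("United, which operates a hub at Dulles, has six luggage screening "
       "machines in its basement and several upstairs in the ticket counter area. "
       "Delta, Northwest, American, British Airways and KLM share four screening "
       "machines in the basement.")
term_freq, pair_freq, n = cooccurrence_counts(
    [doc], ["United", "Delta", "Northwest", "KLM"])
print(n, dict(term_freq), dict(pair_freq))
```

Pairs are stored with their members in sorted order, so `("Delta", "KLM")` is the only key used for that pair.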


Hierarchical Agglomerative Clustering

  • Steps:

    1. Put each candidate term t_i in a separate cluster C_i

    2. Compute the similarity between each pair of clusters

      • Average Linkage

    3. Merge the two clusters with the highest inter-cluster similarity

    4. Update all relations between this new cluster and the other clusters

    5. Go to step 3 until

      • There are only N clusters, or

      • The similarity is less than a threshold
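
A minimal sketch of this loop with average linkage is shown below. For brevity it recomputes every inter-cluster similarity on each pass instead of updating them incrementally, and the similarity scores in `SCORES` are made up for illustration; `hac`, `average_linkage` and `toy_sim` are hypothetical names.

```python
from itertools import combinations


def average_linkage(c1, c2, sim):
    """Mean pairwise similarity between the members of two clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def hac(terms, sim, n_clusters=1, threshold=0.0):
    clusters = [[t] for t in terms]              # one cluster per term
    while len(clusters) > max(n_clusters, 1):    # stop at N clusters
        pairs = combinations(range(len(clusters)), 2)
        i, j = max(pairs, key=lambda p: average_linkage(
            clusters[p[0]], clusters[p[1]], sim))
        if average_linkage(clusters[i], clusters[j], sim) < threshold:
            break                                # best pair is too dissimilar
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Invented similarity scores, for illustration only.
SCORES = {
    frozenset(("pakistan", "india")): 0.9,
    frozenset(("spain", "japan")): 0.8,
}


def toy_sim(a, b):
    return SCORES.get(frozenset((a, b)), 0.1)


TERMS = ["pakistan", "india", "spain", "japan"]
print(hac(TERMS, toy_sim, n_clusters=1, threshold=0.5))
```

With these toy scores the loop stops at two clusters, because the remaining inter-cluster similarity (0.1) is below the threshold.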


The Similarity Measure

  • Similarity between each pair of candidates

  • Based on co-occurrence within sentences

  • Using chi-square (χ²)

  • Shortcoming
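
One common way to turn sentence-level co-occurrence counts into a chi-square score is the 2×2 contingency-table form sketched below. The slides do not give the exact formulation used, so the function name, its parameters and this particular form are assumptions.

```python
def chi_square(n_both, n_a, n_b, n_total):
    """Pearson chi-square for two terms from sentence counts.

    n_both  : sentences containing both terms
    n_a     : sentences containing term a
    n_b     : sentences containing term b
    n_total : all sentences in the domain collection
    """
    o11 = n_both                        # a and b
    o12 = n_a - n_both                  # a without b
    o21 = n_b - n_both                  # b without a
    o22 = n_total - n_a - n_b + n_both  # neither
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0                      # a margin is empty: no evidence
    return n_total * (o11 * o22 - o12 * o21) ** 2 / denom


print(chi_square(10, 10, 10, 100))  # terms always appear together
print(chi_square(1, 10, 10, 100))   # co-occurrence at chance level
```

In this form the score equals the number of sentences when two terms always appear together and drops to zero when they co-occur exactly as often as chance predicts.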


Pinpointing the Right Cluster

  • Question and target keywords are used as “spies”

  • Spies are:

    • Inserted into the list of candidate answers

    • Treated as candidate answers, hence

      • their similarity to one another and to the candidate answers is computed

      • they are clustered along with the candidate answers

  • The cluster with the largest number of spies is returned

    • The spies are removed

  • Other approaches
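
The spy-based selection can be sketched as below. The cluster contents are taken from the Pakistan earthquake example in this presentation; the spy set is an assumed reading of that example's stemmed target and question keywords, and ties between clusters are broken arbitrarily (the first such cluster wins), which the slides do not specify.

```python
def pinpoint_cluster(clusters, spies):
    """Return the cluster containing the most spies, with the spies removed."""
    spy_set = set(spies)
    best = max(clusters, key=lambda c: len(spy_set.intersection(c)))
    return [term for term in best if term not in spy_set]


clusters = [
    ["spain", "bangladesh", "japan", "germany", "haiti", "nepal", "china",
     "sweden", "iran", "mexico", "vietnam", "belgium", "lebanon", "iraq",
     "russia", "turkey"],
    ["oman"],
    ["pakistan", "2005", "afghanistan", "octob", "u.s", "india", "affect",
     "earthquak"],
]
# Assumed spies: stemmed keywords from the target and the question.
spies = ["pakistan", "2005", "octob", "affect", "earthquak"]

print(pinpoint_cluster(clusters, spies))
```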


Target 269: Pakistan earthquakes of October 2005

Question 269.2: What countries were affected by this earthquake?

[Figure: example clusters, labelled Cluster-2, Cluster-9 and Cluster-31. Cluster contents shown:]

  • spain, bangladesh, japan, germany, haiti, nepal, china, sweden, iran, mexico, vietnam, belgium, lebanon, iraq, russia, turkey

  • oman

  • pakistan, 2005, afghanistan, octob, u.s, india, affect, earthquak

Recall = 2/3

Precision = 2/3

F-score = 2/3



Evaluation of Clustering

  • Baseline

    • List of candidate answers prior to clustering

  • Our Approach

    • List of candidate answers filtered by the clustering

  • Theoretical Maximum

    • The best possible output of clustering based on the initial list
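
Each of the three lists above can be scored against a gold answer set with precision, recall and F-score. The sketch below assumes a balanced F-score (equal weight on precision and recall). The gold set is an assumption chosen for illustration, not taken from the official answer key.

```python
def precision_recall_f(returned, gold):
    """Precision, recall and balanced F-score of a returned answer list."""
    returned, gold = set(returned), set(gold)
    correct = len(returned & gold)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


returned = ["afghanistan", "u.s", "india"]     # a filtered candidate list
gold = ["pakistan", "afghanistan", "india"]    # assumed gold answers
p, r, f = precision_recall_f(returned, gold)
print(p, r, f)
```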


Evaluation of each Question Type


Future Work

  • Developing a module that verifies whether each candidate is a member of the answer type

    • In case of Movie and Other types

  • Using co-occurrence at the paragraph level rather than the sentence level

    • Anaphora Resolution can be used

    • Another method for similarity measure

      • χ² does not work well with sparse data

      • for example, using Yates' correction for continuity (Yates' χ²)

  • Using different clustering approaches

  • Using different similarity measures

    • Mutual Information


