Learning surface text patterns for a question answering system
1 / 54

- PowerPoint PPT Presentation

  • Updated On :

Learning Surface Text Patterns for a Question Answering System Deepak Ravichandran Eduard Hovy Information Sciences Institute University of Southern California From Proceedings of the ACL Conference, 2002 Goal Explore power of surface text patterns for open-domain QA systems Why This Paper

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about '' - jaden

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Learning surface text patterns for a question answering system l.jpg

Learning Surface Text Patterns for a Question Answering System

Deepak Ravichandran

Eduard Hovy

Information Sciences Institute

University of Southern California

Slide3 l.jpg
Goal System

  • Explore power of surface text patterns for open-domain QA systems

Why this paper l.jpg
Why This Paper System

  • Fall 2001 NLP project - QA system

Winning team l.jpg
Winning Team System

  • Matt Myers & Henry Longmore

    • "If we were asked to design another question answering system, we would keep the same basic system as a foundation. We would then use more patterns and variations of patterns in the NE recognizer. We would use Machine Learning techniques, particularly for learning patterns for the NE recognizer."

Meanwhile back at the batcave l.jpg
Meanwhile, back at the batcave... System

  • Automatic learning of surface text patterns for open-domain question answering

Recent open domain systems l.jpg
Recent Open Domain Systems System

  • External knowledge, tools

    • Named Entity taggers

    • WordNet

    • parsers

    • hand-tagged corpora

    • ontology lists

Recent o d systems cont l.jpg
Recent O-D Systems (cont.) System

  • Recent TREC-10 evaluation

    • winning system used just 1 resource

    • extensive list of surface patterns

    • surprised many

Basic idea l.jpg
Basic Idea System

  • Investigate potential of surface patterns

    • Learn patterns

    • Measure accuracy

Characteristic phrases l.jpg
Characteristic Phrases System

  • "When was <person> born”

    • Typical answers

      • "Mozart was born in 1756.”

      • "Gandhi (1869-1948)...”

    • Suggests phrases like

      • "<NAME> was born in <BIRTHDATE>”

      • "<NAME> ( <BIRTHDATE>-”

    • as Regular Expressions can help locate correct answer

Auto learn patterns from web l.jpg
Auto-learn Patterns from Web System

  • Tagged corpus using AltaVista

  • Hand-crafted examples of each question type

  • Bootstrapping to build large tagged corpus as in Information Extraction (Riloff, 96)

  • Abundance of data on web - reliable statistical estimates

The system l.jpg
The System System

  • Assume sentence is a simple sequence of words

  • Search for repeated word orderings

  • Evidence for useful answer phrases

System cont l.jpg
System (cont.) System

  • Suffix trees to extract substrings of optimal length

  • Suffix trees from Computational Biology (Gusfield, 97)

  • Used to detect DNA sequences

  • Linear time on size of corpus

  • Don't restrict length of substrings

Pattern learning algorithm l.jpg
Pattern Learning Algorithm System

  • Select example for question type

    • BIRTHYEAR questions select "Mozart 1756”

      • "Mozart" is question term

      • "1756" is answer term

  • Submit Q & A terms to AltaVista

  • Require both terms to be present

Pattern learning cont l.jpg
Pattern Learning (cont.) System

  • Download top 1000 documents returned

  • Apply sentence breaker to documents

  • Keep only those sentences with both terms present

Pattern learning cont16 l.jpg
Pattern Learning (cont.) System

  • Terms can be present in various forms

    • e.g. Mozart as:

      • Wolfgang Amadeus Mozart

      • Mozart, Wolfgang Amadeus

      • Amadeus Mozart

      • Mozart

Pattern learning cont17 l.jpg
Pattern Learning (cont.) System

  • Specify ways in which Q term and A term can be specified in text

  • Easy to do for BIRTHDATE

  • Not so for Q types like DEFINITION

    • Many acceptable answers, all answers need to be used to ensure high confidence in precision

Pattern learning cont18 l.jpg
Pattern Learning (cont.) System

  • Process (tokenize, smooth whitespace, remove tags, etc.)

    • simplify input for egrep (or other regular expression tool)

  • Pass sentence through suffix tree constructor

    • finds substrings (and counts) of all lengths

Pattern learning cont19 l.jpg
Pattern Learning (cont.) System

  • Example:

    • “The great composer Mozart (1756-1791) achieved fame at a young age”

    • “Mozart (1756-1791) was a genius”

    • “The whole world would always be indebted to the great music of Mozart (1756-1791)”

  • Longest matching substring for all 3 sentences is "Mozart (1756-1791)”

  • Suffix tree would extract "Mozart (1756-1791)" as an output, with score of 3

Pattern learning cont20 l.jpg
Pattern Learning (cont.) System

  • Filter phrases in suffix tree

  • Keep phrases containing Q & A terms

  • Replace question term with <NAME>

  • Replace answer term with <ANSWER>

Pattern learning cont21 l.jpg
Pattern Learning (cont.) System

  • Repeat with different examples of same question type

    • “Gandhi 1869”, “Newton 1642”, etc.

  • Some patterns learned for BIRTHDATE

    • a. born in <ANSWER>, <NAME>

    • b. <NAME> was born on <ANSWER> ,

    • c. <NAME> ( <ANSWER> -

    • d. <NAME> ( <ANSWER> - )

Pattern learning last one l.jpg
Pattern Learning (last one!) System

  • Strings partly overlapping (c & d) saved separately

    • Separate counts of occurrence frequencies

    • Can distinguish (in this case) between pattern for person still living (d) and more general pattern (c)

Calculate precision l.jpg
Calculate Precision System

  • Submit query to AltaVista using only Q term ("Mozart")

  • Download top 1000 returned documents

  • Segment into sentences as in pattern learning algorithm

  • Keep sentences containing Q term

Calculate precision cont l.jpg
Calculate Precision (cont.) System

  • For each pattern learned, check presence of pattern in sentence

    • pattern with <ANSWER> tag matched by any word

    • pattern with <ANSWER> tag matched by correct A term

      • Mozart was born in <ANY_WORD>

      • Mozart was born in 1756

Calculate precision cont25 l.jpg
Calculate Precision (cont.) System

  • Calculate precision of each pattern

  • P = Ca/Co

    • Ca = total # of patterns w/answer term present

    • Co = total # of patterns w/answer term replaced by any word

  • Keep only patterns matching sufficient # of examples (e.g. >5)

Calculate precision cont26 l.jpg
Calculate Precision (cont.) System

  • Obtain table of Regular Expression patterns

  • 1 table per question type

    • Precision of pattern

    • precision is probability pattern containing answer

    • principle of maximum likelihood estimation

Calculate precision cont27 l.jpg
Calculate Precision (cont.) System

  • BIRTHDATE table:

    • 1.0 <NAME> ( <ANSWER> - )

    • 0.85 <NAME> was born on <ANSWER>,

    • 0.6 <NAME> was born in <ANSWER>

    • 0.59 <NAME> was born <ANSWER>

    • 0.53 <ANSWER> <NAME> was born

    • 0.50 - <NAME> ( <ANSWER>

    • 0.36 <NAME> ( <ANSWER> -

Calculate precision cont28 l.jpg
Calculate Precision (cont.) System

  • Good range of patterns obtained with as few as 10 examples

  • Rather long list difficult to come up with manually

  • Largest number of examples the system required to get a good range of patterns?

Calculate precision cont29 l.jpg
Calculate Precision (cont.) System

  • Precision of patterns learned from one QA-pair calculated for other examples of same question type

  • Helps eliminate dubious patterns

    • Contents of two or more sites are the same

    • Same document appears in search engine output for learning & precision stages

Finding answers l.jpg
Finding Answers System

  • To new questions!

  • Use existing QA system (Hovy et al., 2002b;2001)

  • Determine type of new question

  • Identify Question term

Finding answers cont l.jpg
Finding Answers (cont.) System

  • Create query from Q term & do IR

    • use answer document corpus such as TREC-10 or web search

  • Segment returned documents into sentences & process as before

  • Replace Q term by Q tag

    • e.g. <NAME> in case of BIRTHYEAR type

Finding answers cont32 l.jpg
Finding Answers (cont.) System

  • Using pattern table developed for Q type, search for presence of each pattern

  • Select words matching <ANSWER> as potential answer

  • Sort answers by pattern's precision scores

  • Discard duplicate answers (string compare)

  • Return top 5

Experiments l.jpg
Experiments System

  • 6 different Q types

    • from Webclopedia QA Typology (Hovy et al., 2002a)


      • LOCATION

      • INVENTOR



      • WHY-FAMOUS

Experiments cont l.jpg
Experiments (cont.) System

  • (BIRTHYEAR - previously shown)


    • 1.0 <ANSWER> invents <NAME>

    • 1.0 the <NAME> was invented by <ANSWER>

    • 1.0 <ANSWER> invented the <NAME> in

  • all have precision of 1.0

Experiments cont35 l.jpg
Experiments (cont.) System


    • 1.0 when <ANSWER> discovered <NAME>

    • 1.0 <ANSWER>'s discovery of <NAME>

    • 0.9 <NAME> was discovered by <ANSWER> in


    • 1.0 <NAME> and related <ANSWER>

    • 1.0 form of <ANSWER>, <NAME>

    • 0.94 as <NAME>, <ANSWER> and

  • Experiments cont36 l.jpg
    Experiments (cont.) System


      • 1.0 <ANSWER> <NAME> called

      • 1.0 laureate <ANSWER> <NAME>

      • 0.71 <NAME> is the <ANSWER> of


    • 1.0 <ANSWER>'s <NAME>

    • 1.0 regional : <ANSWER> : <NAME>

    • 0.92 near <NAME> in <ANSWER>

  • Experiments cont37 l.jpg
    Experiments (cont.) System

    • For each Q type, extract questions from TREC-10 set

    • Run through testing phase (precision)

    • Two sets of experiments

    Experiments cont38 l.jpg
    Experiments (cont.) System

    • Set one

      • TREC corpus is input

      • IR done by IR component of their QA system (Lin, 2002)

    • Set two

      • Web is input

      • IR performed by AltaVista

    Results l.jpg
    Results System

    • Measured by Mean Reciprocal Rank (?)

    • TREC

      • Question type # of Q's MRR

      • BIRTHYEAR 8 0.48

      • INVENTOR 6 0.17

      • DISCOVERER 4 0.13

      • DEFINITION 102 0.34

      • WHY-FAMOUS 3 0.33

      • LOCATION 16 0.75

    Results cont l.jpg
    Results (cont.) System

    • Web

      • Q type # of Q’s MRR

      • BIRTHYEAR 8 0.69

      • INVENTOR 6 0.58

      • DISCOVERER 4 0.88

      • DEFINITION 102 0.39

      • WHY-FAMOUS 3 0.00

      • LOCATION 16 0.86

    Results cont41 l.jpg
    Results (cont.) System

    • System performs better on web data than on TREC corpus

    • Abundant web data makes it easier for system to locate answers with high precision scores

    • TREC corpus does not have enough candidate answers with high precision score

      • must settle for answers from low precision patterns

    • WHY-FAMOUS exception - may be due to small # of test Q's

    Shortcomings extensions l.jpg
    Shortcomings & Extensions System

    • Need for POS &/or semantic types

      • "Where are the Rocky Mountains?”

      • "Denver's new airport, topped with white fiberglass cones in imitation of the Rocky Mountains in the background , continues to lie empty”

      • <NAME> in <ANSWER>

  • NE tagger &/or ontology could enable system to determine "background" is not a location

  • Shortcomings cont l.jpg
    Shortcomings... (cont.) System

    • DEFINITION Q's - match term too general, though correct technically

      • "What is nepotism?”

      • <ANSWER>, <NAME>

      • "...in the form of widespread bureaucratic abuses: graft, nepotism...”

      • "What is sonar?”

      • <NAME> and related <ANSWER>

      • "...while its sonar and related underseas systems are built...”

    Shortcomings cont44 l.jpg
    Shortcomings... (cont.) System

    • Long distance dependencies

      • "Where is London?”

      • "London, which has one of the most busiest airports in the world, lies on the banks of the river Thames”

      • would require pattern like:<QUESTION>, (<any_word>)*, lies on <ANSWER>

    • Abundance & variety of Web data helps system to find an instance of patterns w/o losing answers to long distance dependencies

    Shortcomings cont45 l.jpg
    Shortcomings... (cont.) System

    • More info in patterns regarding length of expected answer phrase

      • Searches in range of 50 bytes of answer phrase to capture pattern

      • fails under some conditions

        • "When was Lyndon B. Johnson born?”

        • "...lost to democratic Sen. Lyndon B. Johnson, who ran for both re-election and the vice presidency”

        • <NAME> <ANSWER> -

    Shortcomings cont46 l.jpg
    Shortcomings... (cont.) System

    • Lacks info that <ANSWER> in this case should be exactly replaced by 1 word

    • Could extend system to search for answer in range of 1-2 chunks

      • basic English phrases, NP, VP, PP, etc.

    Shortcomings cont47 l.jpg
    Shortcomings... (cont.) System

    • System doesn't work for Q types requiring multiple words from question to be in answer

      • "In which county does the city of Long Beach lie?”

      • "Long Beach is situated in Los Angeles County”

      • required pattern:<QUESTION_TERM_1> situated in <ANSWER> <QUESTION_TERM_2>

    Shortcomings cont48 l.jpg
    Shortcomings... (cont.) System

    • Performance of system depends greatly on having only 1 anchor word

    • Multiple anchor points

      • would help eliminate candidate answers

      • require all anchor words be present in candidate answer sentence

    Shortcomings cont49 l.jpg
    Shortcomings... (cont.) System

    • Does not use case

      • "What is micron?”

      • "...a spokesman for Micron, a maker of semiconductors, said SIMMs are..."

  • If Micron had been capitalized in question, would be a perfect answer

  • Shortcomings cont50 l.jpg
    Shortcomings... (cont.) System

    • Canonicalization of words

      • BIRTHDATE for Gandhi: 1869; Oct. 2, 1869; 2nd October 1869; October 2 1869; 02 October 1869; etc.

    • Use date tagger to cluster all variations and tag with same term

    • Extend idea to smooth out variations in Q term for names: Gandhi, Mahatma Gandhi, Mohandas Karamchand Gandhi, etc.

    Conclusion l.jpg
    Conclusion System

    • Web results easily outperform TREC results

    • Suggests need to integrate outputs from Web & TREC

    • Word count to help eliminate unlikely answers


      • ? DEFINITION

    Conclusion cont l.jpg
    Conclusion (cont.) System

    • But what about DEFINITION?

      • 102 Q's in TREC Corpus and in Web

      • Most Q's of all types

      • MRR-TREC == 0.34

      • MRR-Web == 0.39

      • All other Q's have # < 20, most < 10

  • If enough Q's are asked, will difference in performance on Web data vs. TREC data diminish?

  • Conclusion cont53 l.jpg
    Conclusion (cont.) System

    • Simplicity - "perfect" for multilingual system QA

      • Low resource requirement - no NE taggers, no parsers, no ontologies, etc.

      • No adaptation of these to new language required

      • Need to create manual training terms & use appropriate web search engine

    Regular expressions from ask iggy l.jpg
    Regular Expressions from ask_iggy System

    • “place called\\s+($cap_pattern+)” “home called\\s+($cap_pattern+)”

    • “at\\s((the)?\\s+($cap_pattern+))” “to\\s+($cap_pattern+)”

    • “place\\s+in\\s+($cap_pattern+)called\\s+($cap_pattern+)”

    • “in\\s+($cap_pattern+)” “up\\s+($cap_pattern+)”

    • “left\\s+($cap_pattern+)” “(($cap_pattern+)[Ii]slands)”

    • “(northern|southern|eastern|western)\\s+($cap_pattern+)”

    • “from\\s+($cap_pattern+)” “far\\s+as\\s+($cap_pattern+)”

    • “place\\s+in\\s+($cap_pattern+)” “home\\s+town”

    • “city\\s+of\\s+($cap_pattern+)”

    • “middle\\s+of\\s+((the)?\\s+($cap_pattern+))”

    • “(($cap_pattern+)[Ii]slands\\s+of\\s+($cap_pattern+))”

    • “place{1,1}d?\\s+near\\s+($cap_pattern+)”

    • “((above|over)\\s+($cap_pattern+))”