An Analysis of the AskMSR Question-Answering System
From Proceedings of the EMNLP Conference, 2002.
Eric Brill, Susan Dumais, and Michele Banko

Microsoft Research

Goals
  • Evaluate contributions of components
  • Explore strategies for predicting when answers are incorrect
AskMSR – What Sets It Apart
  • Dependency on data redundancy
  • No sophisticated linguistic analyses
    • Of questions
    • Of answers
TREC Question Answering Track
  • Fact-based, short-answer questions
    • How many calories are there in a Big Mac?
    • Who killed Abraham Lincoln?
    • How tall is Mount Everest?
  • 562 – In case you’re wondering
  • Motivation for much of recent work in QA
Other Approaches
  • POS tagging
  • Parsing
  • Named Entity extraction
  • Semantic relations
  • Dictionaries
  • WordNet
AskMSR Approach
  • Web – “gigantic data repository”
  • Different from other systems using web
    • Simplicity & efficiency
      • No complex parsing
      • No entity extraction
      • For queries or best matching web pages
      • No local caching
  • Claim: techniques used in approach to short-answer tasks are more broadly applicable
Some QA Difficulties
  • Single, small information source
    • Likely only 1 answer exists
  • Source with small # of answer formulations
    • Complex relations between Q & A
      • Lexical, syntactic, semantic relations
      • Anaphora, synonymy, alternate syntactic formulations, indirect answers make this difficult
Answer Redundancy
  • Greater answer redundancy in source
    • More likely: simple relation between Q & A exists
    • Less likely: need to deal with difficulties facing NLP systems
Query Reformulation
  • Rewrite question
    • Substring of declarative answer
    • Weighted
    • “When was the paper clip invented?” → “the paper clip was invented”
  • Produce less precise rewrites
    • Greater chance of matching
    • Backoff to simple ANDing of non-stop words
Query Reformulation (cont.)
  • String based manipulations
  • No parser
  • No POS tagging
  • Small lexicon for possible POS and morphological variants
  • Created rewrite rules by hand
  • Chose associated weights by hand
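The rewrite step can be sketched in a few lines of string manipulation. This is only an illustration: the single "when was X <verb>" rule, the stop-word list, and the two weights are assumptions made up for the example, since the slides do not list the actual hand-written rules or weights.

```python
import re

# Hypothetical stop-word list and weights; the slides do not give the real ones.
STOP_WORDS = {"when", "was", "the", "is", "a", "an", "of", "in", "who", "what", "how"}
EXACT_WEIGHT = 5    # assumed weight for a precise phrase rewrite
BACKOFF_WEIGHT = 1  # assumed weight for the AND backoff


def rewrite_question(question):
    """Return (query, weight) pairs, most precise rewrite first."""
    text = question.strip().rstrip("?").lower()
    rewrites = []

    # Illustrative rule: "when was X <verb>" -> quoted phrase "X was <verb>".
    match = re.fullmatch(r"when was (.+) (\w+)", text)
    if match:
        subject, verb = match.groups()
        rewrites.append((f'"{subject} was {verb}"', EXACT_WEIGHT))

    # Backoff: AND together the non-stop words of the question.
    content = [w for w in text.split() if w not in STOP_WORDS]
    rewrites.append((" AND ".join(content), BACKOFF_WEIGHT))
    return rewrites


for query, weight in rewrite_question("When was the paper clip invented?"):
    print(weight, query)
```

The precise phrase query carries the higher weight, and the AND query is always produced so that some rewrite is likely to match.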
N-gram Mining
  • Formulate rewrite for search engine
  • Collect and analyze page summaries
  • Why use summaries?
    • Efficiency
    • Contain search terms, plus some context
  • N-grams collected from summaries
N-gram Mining (Cont.)
  • Extract 1-, 2-, 3-grams from summary
    • Score by weight of rewrite that retrieved it
  • Sum scores across all summaries with n-gram
  • No frequency within summary
  • Final score for n-gram
    • Weights associated with rewrite rules
    • # of unique summaries it is in
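A minimal sketch of the scoring described above, assuming whitespace tokenization and that each summary arrives paired with the weight of the rewrite that retrieved it (the function names are hypothetical):

```python
from collections import defaultdict


def ngrams(tokens, max_n=3):
    """Yield every 1-, 2-, and 3-gram in a token list as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def mine_ngrams(summaries):
    """summaries: list of (summary_text, rewrite_weight) pairs.

    Each n-gram earns the rewrite weight once per summary it appears in,
    no matter how often it repeats inside that summary.
    """
    scores = defaultdict(float)
    for text, weight in summaries:
        tokens = text.lower().split()
        for gram in set(ngrams(tokens)):  # set() drops within-summary frequency
            scores[gram] += weight
    return dict(scores)
```

Taking the set of n-grams per summary is what implements "no frequency within summary": an n-gram's final score depends only on how many unique summaries contain it and on the weights of the rewrites that retrieved them.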
N-gram Filtering
  • Use handwritten filter rules
  • Question type assignment
    • e.g. who, what, how
  • Choose set of filters based on q-type
  • Rescore n-grams based on presence of features relevant to filters
N-gram Filtering (Cont.)
  • 15 simple filters
    • Based on human knowledge
      • Question types
      • Answer domain
    • Surface string features
      • Capitalization
      • Digits
      • Handcrafted regular expression patterns
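As a rough illustration of type-driven rescoring, the sketch below assumes two made-up filters (digits for "how many" questions, capitalized words for "who" questions) and an arbitrary multiplier of 2. The 15 real filters and their effect on scores are not given in the slides.

```python
import re

# Hypothetical filters: question type -> list of (pattern, score multiplier).
FILTERS = {
    "how many": [(re.compile(r"\d"), 2.0)],        # prefer answers with digits
    "who": [(re.compile(r"\b[A-Z][a-z]+"), 2.0)],  # prefer capitalized words
}


def question_type(question):
    """Assign a question type from the leading words, or None if unknown."""
    q = question.lower()
    for qtype in FILTERS:
        if q.startswith(qtype):
            return qtype
    return None


def rescore(question, candidates):
    """candidates: dict of answer string -> score. Returns a rescored copy."""
    filters = FILTERS.get(question_type(question), [])
    rescored = {}
    for answer, score in candidates.items():
        for pattern, multiplier in filters:
            if pattern.search(answer):
                score *= multiplier
        rescored[answer] = score
    return rescored


print(rescore("How many dogs pull a sled in the Iditarod?",
              {"Alaskan": 3.0, "pool of 16 dogs": 2.0}))
```

With these assumed numbers, the candidate containing a digit overtakes the higher-scoring candidate that has none.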
N-gram Tiling
  • Merge similar answers
  • Create longer answers from overlapping smaller answer fragments
    • “A B C”, “B C D” → “A B C D”
  • Greedy algorithm
    • Start w/ top-scoring n-gram, check lower scoring n-grams for tiling potential
      • If can be tiled, replace higher-scoring n-gram with tiled n-gram, remove lower-scoring n-gram
    • Stop when can no longer tile
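The greedy procedure above can be sketched as follows. Two points are assumptions rather than anything stated in the slides: a tiled n-gram gets the sum of the two scores, and an n-gram fully contained in another counts as tileable.

```python
def contains(x, y):
    """True if token tuple y occurs contiguously inside token tuple x."""
    return any(x[i:i + len(y)] == y for i in range(len(x) - len(y) + 1))


def tile(a, b):
    """Merge two token tuples if one contains the other or if the end of one
    overlaps the start of the other; return None when they cannot be tiled."""
    if contains(a, b):
        return a
    if contains(b, a):
        return b
    for x, y in ((a, b), (b, a)):
        for k in range(min(len(x), len(y)) - 1, 0, -1):
            if x[-k:] == y[:k]:
                return x + y[k:]
    return None


def greedy_tile(scored):
    """scored: dict of token tuple -> score. Returns the tiled candidates."""
    candidates = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(candidates)):
            for j in range(i + 1, len(candidates)):
                merged = tile(candidates[i][0], candidates[j][0])
                if merged is not None:
                    # Assumption: the tiled n-gram takes the summed score.
                    total = candidates[i][1] + candidates[j][1]
                    candidates[i] = (merged, total)
                    del candidates[j]
                    merged_any = True
                    break
            if merged_any:
                break
        candidates.sort(key=lambda kv: kv[1], reverse=True)
    return dict(candidates)


print(greedy_tile({("A", "B", "C"): 3.0, ("B", "C", "D"): 2.0, ("X",): 1.0}))
```

Each pass restarts from the top-scoring candidate after a merge, and the loop ends once no pair of remaining candidates can be tiled.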
Experiments
  • First 500 TREC-9 queries
  • Use scoring patterns provided by NIST
    • Modified some patterns to accommodate web answers not in TREC
    • More specific answers allowed
      • Edward J. Smith vs. Edward Smith
    • More general answers not allowed
      • Smith vs. Edward Smith
    • Simple substitutions allowed
      • 9 months vs. nine months
Experiments (cont.)
  • Time differences between Web & TREC
    • “Who is the president of Bolivia?”
    • Did NOT modify answer key
    • Would make comparison w/earlier TREC results impossible (instead of difficult?)
  • Changes influence absolute scores, not relative performance
Experiments (cont.)
  • Automatic runs
    • Start w/queries
    • Generate ranked list of 5 answers
  • Use Google as search engine
    • Query-relevant summaries for n-gram mining efficiency
  • Answers are max. of 50 bytes long
    • Typically shorter
“Basic” System Performance
  • Backwards notion of basic
    • Current system, all modules implemented
    • Default settings
  • Mean Reciprocal Rank (MRR) – 0.507
  • 61% of questions answered correctly
  • Average answer length – 12 bytes
  • Impossible to compare precisely with TREC-9 groups, but still very good performance
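For reference, MRR averages the reciprocal of the rank of the first correct answer over all questions, counting a question as 0 when no correct answer appears in the ranked list. A minimal sketch, assuming a cutoff of 5 to match the 5-answer runs and a caller-supplied (hypothetical) `is_correct` judge:

```python
def mean_reciprocal_rank(ranked_answers, is_correct, cutoff=5):
    """ranked_answers: one ranked answer list per question.
    is_correct(question_index, answer) -> bool judges a single answer.
    A question with no correct answer in the top `cutoff` contributes 0."""
    total = 0.0
    for q_index, answers in enumerate(ranked_answers):
        for rank, answer in enumerate(answers[:cutoff], start=1):
            if is_correct(q_index, answer):
                total += 1.0 / rank
                break
    return total / len(ranked_answers)


# Toy data: correct at rank 1, correct at rank 3, never correct.
runs = [["a", "b"], ["x", "y", "z"], ["p"]]
gold = ["a", "z", "q"]
print(mean_reciprocal_rank(runs, lambda i, ans: ans == gold[i]))
```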
Query Rewrite Contribution
  • More precise queries – higher weights
  • All rewrites equal – MRR drops 3.6%
  • Only backoff AND – MRR drops 11.2%
  • Rewrites capitalize on web redundancy
  • Could use more specific regular expression matching
N-gram Filtering Contribution
  • 1-, 2-, 3-grams from 100 best-matching summaries
  • Filter by question type
      • “How many dogs pull a sled in the Iditarod?”
      • Question prefers a number
      • “Run”, “Alaskan”, “dog racing”, “many mush” ranked lower than “pool of 16 dogs” (correct answer)
  • No filtering – MRR drops 17.9%
N-gram Tiling Contribution
  • Benefits of tiling
    • Substrings take up only 1 answer slot
      • e.g. San, Francisco, San Francisco
    • Longer answers can never be found with only tri-grams
      • e.g. “light amplification by [stimulated] emission of radiation”
  • No tiling – MRR drops 14.2%
Component Combinations
  • Only weighted sum of occurrences of 1-, 2-, 3-grams – MRR drops 47.5%
  • Simple statistical system
    • No linguistic knowledge or processing
    • Only AND queries
    • Filtering – no, (statistical) tiling – yes
    • MRR drops 33% to 0.338
Component Combinations
  • Statistical system – good performance?
    • Reasonable on absolute scale?
    • One TREC-9 50 byte run performed better
  • All components contribute to accuracy
    • Precise weights of rewrites unimportant
    • N-gram tiling – a “poor man’s named-entity recognizer”
    • Biggest contribution from filters/selection
Component Combinations
  • Claim: “Because of the effectiveness of our tiling algorithm…we do not need to use any named entity recognition components.”
    • By having filters with capitalization info (section 2.3, 2nd paragraph), aren’t they doing some NE recognition?
Component Problems (cont.)
  • No correct answer in top 5 hypotheses
  • 23% of errors – not knowing units
    • How fast can Bill’s Corvette go? mph or km/h
  • 34% (Time, Correct) – time problems or answer not in TREC-9 answer key
  • 16% from shortcomings in n-gram tiling
  • Number retrieval (5%) – query limitation
Component Problems (cont.)
  • 12% - beyond current system paradigm
    • Can’t be fixed with minor enhancements
    • Is this really so? Or have they been easy on themselves in error attribution?
  • 9% - no discussion
Knowing When…
  • Some cost for answering incorrectly
  • System can choose to not answer instead of giving incorrect answer
    • How likely hypothesis is correct?
  • TREC – no distinction between wrong answer and no answer
  • Deploy real system – trade-off between precision & recall
Knowing When…(cont.)
  • Answer is ad-hoc combination of hand tuned weights
  • Is it possible to induce useful precision-recall (ROC) curve when answers don’t have meaningful probabilities?
  • What is an ROC (Receiver Operating Characteristic) curve?
  • From (Hinrich Schütze, co-author of Foundations of Statistical Natural Language Processing)
Determining Likelihood
  • Ideal – determine likelihood of correct answer based only on question
  • If possible, can skip such questions
  • Use decision tree based on set of features from question string
      • 1-, 2-grams, type
      • sentence length, longest word length
      • # capitalized words, # stop words
      • Ratio of stop words to non-stop words
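The surface features listed above could be computed along these lines. The stop-word list is a small placeholder, "sentence length" is assumed to mean the number of words, the question type is assumed to be the first word, and the 1- and 2-gram features are left out for brevity.

```python
# Placeholder stop-word list; the one used by the authors is not given.
STOP_WORDS = {"who", "what", "how", "when", "is", "the", "a", "an", "of", "in", "are", "there"}


def question_features(question):
    """Return surface features of a question string as a dict."""
    words = question.rstrip("?").split()
    stop = [w for w in words if w.lower() in STOP_WORDS]
    non_stop = len(words) - len(stop)
    return {
        "type": words[0].lower() if words else "",
        "sentence_length": len(words),  # assumed to be counted in words
        "longest_word_length": max((len(w) for w in words), default=0),
        "num_capitalized": sum(1 for w in words if w[0].isupper()),
        "num_stop_words": len(stop),
        "stop_ratio": len(stop) / non_stop if non_stop else 0.0,
    }


print(question_features("Who killed Abraham Lincoln?"))
```

A dict of this shape can be passed to any decision-tree learner after encoding the categorical `type` field.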
Decision Tree/Diagnostic Tool
  • Performs worst on how questions
  • Performs best on short who questions w/many stop words
  • Induce ROC curve from decision tree
    • Sort leaf nodes from highest probability of being correct to lowest
    • Gain precision by not answering questions with highest probability of error
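One way to read the leaf-sorting idea: each leaf has an observed fraction of correctly answered training questions, so answering only the questions that fall in the most reliable leaves trades coverage for precision. The sketch below assumes each leaf is summarized as a (correct, total) pair; "coverage" is a term introduced here for the fraction of questions the system chooses to answer.

```python
def precision_coverage_curve(leaves):
    """leaves: list of (num_correct, num_total) per decision-tree leaf.

    Returns (coverage, precision) points obtained by admitting leaves one at
    a time, from the highest observed probability of correctness to the lowest.
    """
    ordered = sorted(leaves, key=lambda leaf: leaf[0] / leaf[1], reverse=True)
    total_questions = sum(total for _, total in ordered)
    points = []
    correct = answered = 0
    for leaf_correct, leaf_total in ordered:
        correct += leaf_correct
        answered += leaf_total
        points.append((answered / total_questions, correct / answered))
    return points


for coverage, precision in precision_coverage_curve([(2, 10), (9, 10), (5, 10)]):
    print(f"coverage={coverage:.2f} precision={precision:.2f}")
```

Dropping the last points on the curve corresponds to not answering questions in the leaves with the highest probability of error.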
Decision Tree–Query Results
  • Decision Tree trained on TREC-9
  • Tested on TREC-10
  • Overfits training data – insufficient generalization
Answer Correctness/Score
  • Ad-hoc score based on
    • # of retrieved passages n-gram occurs in
    • weight of rewrite used to retrieve passage
    • what filters apply
    • effects of n-gram tiling
  • Correlation between whether answer appears in top 5 output and…
Correct Answer In Top 5
  • …and score of system’s first ranked answer
    • Correlation coefficient: 0.363
    • No time-sensitive q’s: 0.401
  • …and score of first ranked answer minus second
    • Correlation coefficient: 0.270
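The slides do not say which correlation coefficient was used. Assuming the usual Pearson coefficient, with the outcome coded as 1 when a correct answer appears in the top 5 and 0 otherwise, the computation looks like this (the data below is invented purely to show the call):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy data: score of the first-ranked answer, and whether a correct
# answer appeared in the top 5 (1) or not (0).
top_scores = [0.9, 0.2, 0.7, 0.4]
in_top5 = [1, 0, 1, 0]
print(pearson(top_scores, in_top5))
```

For the second indicator, the same function would be applied to the difference between the first- and second-ranked scores.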
Other Likelihood Indicators
  • Snippets gathered for each question
    • AND queries
    • More refined exact string match rewrites
  • MRR and snippets
    • All snippets from AND: 0.238
    • 11 to 100 from non-AND: 0.612
    • 100 to 400 from non-AND: 0.628
      • But wasn’t MRR for “base” system 0.507?
Another Decision Tree
  • Features of first DT, plus
    • Score of #1 answer
    • State of system in processing
      • Total # of matching passages
      • # of non-AND matching passages
      • Filters applied
      • Weight of best rewrite rule yielding matching passages
      • Others
Decision Tree–All
  • Gives useful ROC curve on test data
  • Outperformed by Answer #1 Score
  • Though outperformed by simpler ad-hoc technique, still useful as diagnostic tool
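Using the score of the first-ranked answer as the confidence signal amounts to a simple threshold: answer only when that score clears a cutoff. A minimal sketch, assuming precision is measured over answered questions and recall over all questions (definitions chosen here for illustration, not taken from the slides):

```python
def precision_recall(results, threshold):
    """results: list of (top_answer_score, is_correct) pairs, one per question.

    The system answers only when the top score is at least `threshold`.
    Returns (precision, recall); precision is None if nothing is answered.
    """
    answered = [ok for score, ok in results if score >= threshold]
    correct = sum(1 for ok in answered if ok)
    precision = correct / len(answered) if answered else None
    recall = correct / len(results)
    return precision, recall


# Invented toy data to show the trade-off as the threshold rises.
toy = [(0.9, True), (0.8, False), (0.3, True), (0.1, False)]
for t in (0.0, 0.5, 0.85):
    print(t, precision_recall(toy, t))
```

Sweeping the threshold over the observed scores yields the precision-recall trade-off that a system designer would choose from.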
Conclusions
  • Novel approach to QA
  • Careful analysis of contributions of major system components
  • Analysis of factors behind errors
  • Approach for learning when system is likely to answer incorrectly
    • Allowing system designers to decide when to trade recall for precision
My Conclusions
  • Claim: techniques used in approach to short-answer tasks are more broadly applicable
  • Reality: “We are currently exploring whether these techniques can be extended beyond short answer QA to more complex cases of information access.”
My Conclusions (cont.)
  • “…we do not need to use any named entity recognition components.”
    • Filters w/capitalization info = NE recognition
  • 12% of errors beyond system paradigm
    • Still wonder–is this really so?
  • 9% of errors–no discussion
  • Ad hoc method outperforms Decision Tree
    • Did they merely do a good job of designing system, of assigning weights, etc.?
    • Did they get lucky?