The Integration of Lexical Knowledge and External Resources for QA

Presentation Transcript

  1. The Integration of Lexical Knowledge and External Resources for QA Hui YANG, Tat-Seng Chua {yangh,chuats}@comp.nus.edu.sg Pris, School of Computing National University of Singapore

  2. Presentation Outline • Introduction • Pris QA System Design • Result and Analysis • Conclusion • Future Work

  3. Open Domain QA • Find answers to open-domain NLP questions by searching a large collection of documents • Question Processing • May involve question re-formulation • To find answer type • Query Expansion • To overcome concept mis-match between query & info base • Search for Candidate Answers • Documents, paragraphs, or sentences • Disambiguation • Ranking (or re-ranking) of answers • Location of exact answers

  4. Current Research Trends • Web-based QA • the Web redundancy • Probabilistic algorithm • Linguistic-based QA • part-of-speech tagging • syntactic parsing • semantic relations • named entity extraction • dictionaries • WordNet, etc

  5. System Overview • Question Classification • Question Parsing • Query Formulation • Document Retrieval • Candidate Sentence Retrieval • Answer Extraction

  6. [Figure: system architecture diagram. Components: Question Analysis (Question Classification, Question Parsing); Query Formulation by External Knowledge (Web, WordNet); Document Retrieval; Sentence Ranking; Answer Extraction. Data labels: question Q, original content words, expanded content words, relevant TREC documents, candidate sentences, answer A. Feedback label: reduce # of expanded content words.]

  7. Question Classification • Based on question focus and answer type • 7 main classes • HUM, LOC, TME, NUM, OBJ, DES, UNKNOWN • E.g. “Which city is the capital of Canada?” (Q-class: LOC) • E.g. “Which state is the capital of Canada in?” (Q-class: LOC) • 54 sub-classes • E.g. under LOC (location), we have 14 sub-classes: • LOC_PLANET: 1 • LOC_CITY: 18 • LOC_CONTINENT: 3 • LOC_COUNTRY: 18 • LOC_COUNTY: 3 • LOC_STATE: 3 • LOC_PROVINCE: 2 • LOC_TOWN: 2 • LOC_RIVER: 3 • LOC_LAKE: 2 • LOC_MOUNTAIN: 1 • LOC_OCEAN: 2 • LOC_ISLAND: 3 • LOC_BASIC: 3

  8. Question Parsing • Content Words: $q^{(0)}$ • Nouns, adjectives, numbers, some verbs • E.g. “What mythical Scottish town appears for one day every 100 years?” • Q-class: LOC_TOWN • $q^{(0)}$: (mythical, Scottish, town, appears, one, day, 100, years) • Base Noun Phrases: $n$ • $n$: (“mythical Scottish town”) • Head of the 1st Noun Phrase: $h$ • $h$: (town) • Quotation Words: $u$ • E.g. “What was the original name before "The Star Spangled Banner"?” • $u$: (“The Star Spangled Banner”)
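
The parsing step on this slide can be illustrated with a minimal sketch. It is not the Pris parser: the slide implies part-of-speech information (nouns, adjectives, numbers, some verbs), whereas this sketch assumes a small hand-picked stopword list as a stand-in, and the function names `content_words` and `quotation_words` are hypothetical.

```python
import re

# Assumption: a tiny stopword list stands in for real part-of-speech tagging.
STOPWORDS = {"what", "which", "who", "is", "was", "the", "a", "an", "of",
             "for", "every", "in", "to", "before"}

def content_words(question):
    """Return the content words q(0): tokens left after dropping stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", question)
    return [t for t in tokens if t.lower() not in STOPWORDS]

def quotation_words(question):
    """Return the quotation words u: phrases enclosed in straight double quotes."""
    return [m.strip() for m in re.findall(r'"([^"]+)"', question)]

q1 = "What mythical Scottish town appears for one day every 100 years ?"
q2 = 'What was the original name before "The Star Spangled Banner" ?'
print(content_words(q1))
# ['mythical', 'Scottish', 'town', 'appears', 'one', 'day', '100', 'years']
print(quotation_words(q2))
# ['The Star Spangled Banner']
```

On the slide's first example the stopword filter happens to reproduce the listed content words; a real system would need a tagger to decide which verbs to keep.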

  9. Query Formulation I • Use original Content Words as query to search the Web (e.g. Google) • Find new terms which have high correlation with the original query • Use WordNet to find the synsets and glosses of the original query terms • Rank new query terms based on both Web and WordNet • Form new Boolean query

  10. Query Formulation II • Original query $q^{(0)} = (q_1^{(0)}, q_2^{(0)}, \ldots, q_k^{(0)})$ • Use Web as generalized resource • From $q^{(0)}$, retrieve top N documents • $\forall q_i^{(0)} \in q^{(0)}$, extract nearby non-trivial words in one sentence or n words away to get $w_i$ • Rank $w_{ik} \in w_i$ by computing its probability of correlation with $q_i^{(0)}$: $\mathrm{Prob}(w_{ik}) = \frac{\#\ \text{instances of}\ (w_{ik} \wedge q_i^{(0)})}{\#\ \text{instances of}\ (w_{ik} \vee q_i^{(0)})}$ • Merge all $w_i$ to form $C_q$ for $q^{(0)}$
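
The correlation ratio on this slide (instances containing both the candidate word and the query term, divided by instances containing either) can be sketched as follows. This is a minimal illustration under stated assumptions: an "instance" is taken to be one sentence, tokenization is plain whitespace splitting, and the function name `correlation_scores` is hypothetical.

```python
from collections import Counter

def correlation_scores(query_term, sentences, stopwords=frozenset()):
    """Score each co-occurring word w as
    (# sentences with both w and query_term) / (# sentences with w or query_term)."""
    q = query_term.lower()
    tokenized = [set(s.lower().split()) for s in sentences]
    q_count = sum(1 for toks in tokenized if q in toks)
    word_count = Counter()  # sentences containing w
    both_count = Counter()  # sentences containing both w and the query term
    for toks in tokenized:
        for w in toks:
            if w == q or w in stopwords:
                continue
            word_count[w] += 1
            if q in toks:
                both_count[w] += 1
    scores = {}
    for w, both in both_count.items():
        either = word_count[w] + q_count - both  # inclusion-exclusion
        scores[w] = both / either
    return scores

# Toy sentences standing in for text retrieved from the Web.
sentences = [
    "brigadoon is a mythical town",
    "brigadoon town appears once",
    "ohio has a small town",
    "ohio borders a lake",
]
scores = correlation_scores("town", sentences)
print(scores["brigadoon"], scores["ohio"])
```

Here "brigadoon" always appears together with "town" and scores 2/3, while "ohio" appears with it in only one of four relevant sentences and scores 1/4.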

  11. Query Formulation III • Use WordNet as generalized resource • $\forall q_i^{(0)} \in q^{(0)}$, extract terms that are lexically related to $q_i^{(0)}$ by locating them in • Gloss $G_i$ • Synset $S_i$ • For $q^{(0)}$, we get $G_q$ and $S_q$ • Re-rank $w_{ik} \in w_i$ by considering lexical relations • $\forall w_{ik} \in C_q$: if $w_{ik} \in G_i$, $w_{ik}$ increases by $\alpha$; if $w_{ik} \in S_i$, $w_{ik}$ increases by $\beta$ $(0 < \alpha < \beta < 1)$ • Get $q^{(1)} = q^{(0)} + \{\text{top } m \text{ terms from } C_q\}$
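
A minimal sketch of the re-ranking and expansion step, under several assumptions not stated on the slide: the boosts are additive, the gloss boost is smaller than the synset boost, the numeric values 0.1 and 0.2 are placeholders, and the gloss and synset term sets are passed in as plain Python sets rather than looked up in WordNet. The names `rerank` and `expand_query` are hypothetical.

```python
def rerank(web_scores, gloss_terms, synset_terms,
           gloss_boost=0.1, synset_boost=0.2):
    """Boost Web-derived scores for terms found in WordNet glosses or synsets,
    then return the terms sorted from highest to lowest score."""
    boosted = {}
    for term, score in web_scores.items():
        if term in gloss_terms:
            score += gloss_boost    # assumed additive increment
        if term in synset_terms:
            score += synset_boost   # assumed larger additive increment
        boosted[term] = score
    return sorted(boosted, key=lambda t: boosted[t], reverse=True)

def expand_query(q0, ranked_terms, m):
    """Form q(1) by appending the top m new terms to q(0)."""
    new_terms = [t for t in ranked_terms if t not in q0]
    return list(q0) + new_terms[:m]

web_scores = {"brigadoon": 0.5, "village": 0.4, "musical": 0.45}
ranked = rerank(web_scores, gloss_terms={"settlement"}, synset_terms={"village"})
print(ranked)                                        # ['village', 'brigadoon', 'musical']
print(expand_query(["mythical", "town"], ranked, 2)) # ['mythical', 'town', 'village', 'brigadoon']
```

The synonym "village" starts below "brigadoon" on Web evidence alone and overtakes it once the synset boost is applied, which is the effect the slide describes.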

  12. Document Retrieval • 1,033,461 documents from • AP newswire, 1998-2000 • New York Times newswire, 1998-2000 • Xinhua News Agency, 1996-2000 • MG Tool • Boolean search to retrieve the top N documents (N = 50) • $\forall t_k \in q^{(1)}$, $Q = (t_1 \wedge t_2 \wedge \ldots \wedge t_n)$

  13. Candidate Sentence Retrieval • $\forall\, \mathrm{Sent}_j$ in the top N documents, match with: • quotation words: • $W_{uj}$ = % of term overlap between $u$ and $\mathrm{Sent}_j$ • noun phrases: • $W_{nj}$ = % of phrase overlap between $n$ and $\mathrm{Sent}_j$ • head of first noun phrase: • $W_{hj}$ = 1 if there is a match and 0 otherwise • original content words: • $W_{cj}$ = % of term overlap between $q^{(0)}$ and $\mathrm{Sent}_j$ • expanded content words: • $W_{ej}$ = % of term overlap between $q^{(1-0)}$ and $\mathrm{Sent}_j$, where $q^{(1-0)} = q^{(1)} - q^{(0)}$ • Final score: $\mathrm{Score}_j = \sum_i \alpha_i W_{ij}$, • where $\sum_i \alpha_i = 1$, $W_{ij} \in \{W_{uj}, W_{nj}, W_{hj}, W_{cj}, W_{ej}\}$.
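
The sentence score can be read as a weighted sum of the five overlap features, with weights that sum to 1. The sketch below illustrates that reading; the weight values are made up for the example, overlaps are expressed as fractions between 0 and 1, tokenization is whitespace splitting on punctuation-free text, and all function names are hypothetical.

```python
def term_overlap(terms, sentence_tokens):
    """Fraction of `terms` present in the sentence (0.0 when `terms` is empty)."""
    if not terms:
        return 0.0
    hits = sum(1 for t in terms if t.lower() in sentence_tokens)
    return hits / len(terms)

def phrase_overlap(phrases, sentence_text):
    """Fraction of `phrases` occurring verbatim in the lowercased sentence."""
    if not phrases:
        return 0.0
    hits = sum(1 for p in phrases if p.lower() in sentence_text)
    return hits / len(phrases)

def score_sentence(sentence, quoted, noun_phrases, head, original, expanded, weights):
    """Weighted sum of the five match features; weights are assumed to sum to 1."""
    text = sentence.lower()
    tokens = set(text.split())
    features = {
        "u": term_overlap(quoted, tokens),          # quotation words
        "n": phrase_overlap(noun_phrases, text),    # base noun phrases
        "h": 1.0 if head and head.lower() in tokens else 0.0,  # head noun
        "c": term_overlap(original, tokens),        # original content words
        "e": term_overlap(expanded, tokens),        # expanded content words
    }
    return sum(weights[name] * value for name, value in features.items())

# Illustrative weights only; the slide does not give the actual values.
weights = {"u": 0.2, "n": 0.2, "h": 0.1, "c": 0.3, "e": 0.2}
score = score_sentence(
    "brigadoon is a mythical scottish town in a musical",
    quoted=[],
    noun_phrases=["mythical Scottish town"],
    head="town",
    original=["mythical", "Scottish", "town", "appears", "one", "day", "100", "years"],
    expanded=["brigadoon", "musical"],
    weights=weights,
)
print(round(score, 4))  # 0.6125
```

In this example the noun phrase, head noun and both expanded terms match fully, while only 3 of the 8 original content words appear, giving 0.2 + 0.1 + 0.3 × 0.375 + 0.2.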

  14. Answer Extraction I • Fine-grained NE tagging for the top K sentences • For each sentence, extract the string which matches the Question Class • E.g. “Who is Tom Cruise married to?” • Q-class: HUM_BASIC • Top ranked Candidate Sentence: • Actor <HUM_PERSON Tom Cruise> and his wife <HUM_PERSON Nicole Kidman> accepted `` substantial '' libel damages on <TME_DATE Thursday> from a <LOC_COUNTRY British> newspaper that reported he was gay and that their marriage was a sham to cover it up . • Answer string: Nicole Kidman
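
The extraction step can be sketched as a scan over the tagged entities in a candidate sentence. The matching rule here is an assumption, not something the slide specifies: an entity qualifies if its main class (the part before the underscore) equals that of the question class and its text does not already appear in the question. The tag syntax `<CLASS_SUBCLASS text>` follows the slide's example; `extract_answer` is a hypothetical name.

```python
import re

# Matches tags of the form <HUM_PERSON Tom Cruise>, as in the slide's example.
TAG_PATTERN = re.compile(r"<([A-Z]+_[A-Z]+)\s+([^<>]+?)\s*>")

def extract_answer(tagged_sentence, question, q_class):
    """Return the first tagged entity whose main class matches the question
    class and whose text is not already mentioned in the question."""
    main_class = q_class.split("_")[0]
    question_lower = question.lower()
    for tag, text in TAG_PATTERN.findall(tagged_sentence):
        same_class = tag.split("_")[0] == main_class
        if same_class and text.lower() not in question_lower:
            return text
    return None

sentence = ("Actor <HUM_PERSON Tom Cruise> and his wife <HUM_PERSON Nicole Kidman> "
            "accepted libel damages on <TME_DATE Thursday> "
            "from a <LOC_COUNTRY British> newspaper .")
print(extract_answer(sentence, "Who is Tom Cruise married to ?", "HUM_BASIC"))
# Nicole Kidman
```

Skipping entities that occur in the question is what prevents "Tom Cruise" from being returned as the answer to a question about Tom Cruise.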

  15. Answer Extraction II • For some questions, we cannot find any answer • reduce the # of expanded query terms and repeat the Document Retrieval, Candidate Sentence Retrieval and Answer Extraction • The whole process lasts for N iterations (N=5) • If we still cannot find an exact answer, NIL is considered as the answer • increase recall step by step while preserving precision
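
The retry loop described on this slide can be sketched as below. Two details are assumptions: one expanded term is dropped per iteration, and the lowest-ranked term is dropped first. The callable `find_answer` is a hypothetical stand-in for the whole Document Retrieval, Candidate Sentence Retrieval and Answer Extraction chain.

```python
def answer_with_relaxation(q0, expanded_terms, find_answer, max_iterations=5):
    """Retry with progressively fewer expanded terms; return "NIL" if no
    answer is found within max_iterations attempts."""
    terms = list(expanded_terms)  # assumed sorted from best to worst
    for _ in range(max_iterations):
        answer = find_answer(list(q0) + terms)
        if answer is not None:
            return answer
        if not terms:
            break                 # nothing left to relax
        terms = terms[:-1]        # assumption: drop the lowest-ranked term
    return "NIL"

def toy_find_answer(query_terms):
    """Toy strict Boolean AND search over a single 'document'."""
    doc = {"mythical", "scottish", "town", "brigadoon"}
    return "Brigadoon" if set(query_terms) <= doc else None

print(answer_with_relaxation(["mythical", "town"],
                             ["brigadoon", "musical", "film"],
                             toy_find_answer))   # Brigadoon
print(answer_with_relaxation(["unicorn"],
                             ["brigadoon", "musical", "film"],
                             toy_find_answer))   # NIL
```

In the first call the strict query fails twice, then succeeds on the third attempt once "film" and "musical" have been dropped, which shows how relaxing the query raises recall without loosening it from the start.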

  16. Evaluation in TREC 2002 • Uninterpolated average precision: $\frac{1}{500} \sum_{i=1}^{500} \frac{\#\ \text{correct up to question}\ i}{i}$ • We answered 290 questions correctly • Score: 0.61
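
The evaluation measure on this slide can be computed with a short function. The sketch assumes the judgments are listed in the order the system ranked its answers (most confident first, as in the TREC 2002 main task); the name `uninterpolated_average_precision` and the four-question example are illustrative only.

```python
def uninterpolated_average_precision(judgments):
    """Average over positions i of (# correct among the first i answers) / i."""
    if not judgments:
        raise ValueError("judgments must not be empty")
    correct_so_far = 0
    total = 0.0
    for i, is_correct in enumerate(judgments, start=1):
        if is_correct:
            correct_so_far += 1
        total += correct_so_far / i
    return total / len(judgments)

# Four questions: answers 1, 3 and 4 judged correct.
# (1/1 + 1/2 + 2/3 + 3/4) / 4
print(uninterpolated_average_precision([True, False, True, True]))
```

Because early positions contribute to every later term, placing correct answers first yields a higher score than the same answers in a worse order.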

  17. Result Analysis I • [Figure: chart of the number of questions with a correct answer (y-axis, 0 to 45) against the number of runs with correct answers (x-axis, 1 to 51); series: total number of questions, our number of questions]

  18. Result Analysis II • Recognizing questions with no answer (NIL) • Precision: 41/170 = 0.241 • Recall: 41/46 = 0.891 • Non-NIL answers • Precision: 249/330 = 0.755 • Recall: 249/444 = 0.561 • Overall recall is low compared to precision because Boolean search is strict.
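
The figures on this slide follow from the usual definitions: precision is correct answers over answers returned, and recall is correct answers over answers that should have been found. A quick check, with `precision_recall` as a hypothetical helper name:

```python
def precision_recall(correct, returned, relevant):
    """Return (precision, recall), each rounded to three decimals."""
    return round(correct / returned, 3), round(correct / relevant, 3)

print(precision_recall(41, 170, 46))    # NIL answers: (0.241, 0.891)
print(precision_recall(249, 330, 444))  # non-NIL answers: (0.755, 0.561)
```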

  19. Result Analysis III

  20. Conclusion • Integration of both Lexical Knowledge and External Resources • Detailed Question Classification • Use of Fine-grained Named Entities for Question Answering • Successive Constraint Relaxation

  21. Future Work • Refining our term correlation by considering a combination of local context, global context and lexical correlations • Exploring the structured use of external knowledge using the semantic perceptron net • Developing template-based answer selection • Longer-term research plan: interactive QA, analysis and opinion questions

  22. Thank You !