The Integration of Lexical Knowledge and External Resources for QA

Hui YANG, Tat-Seng Chua {yangh,chuats}@comp.nus.edu.sg Pris, School of Computing, National University of Singapore

Presentation Outline • Introduction • Pris QA System Design • Result and Analysis • Conclusion • Future Work

Open Domain QA • Find answers to open-domain natural-language questions by searching a large collection of documents • Question Processing • May involve question re-formulation • To find answer type • Query Expansion • To overcome concept mismatch between query & info base • Search for Candidate Answers • Documents, paragraphs, or sentences • Disambiguation • Ranking (or re-ranking) of answers • Location of exact answers

Current Research Trends • Web-based QA • the Web redundancy • Probabilistic algorithm • Linguistic-based QA • part-of-speech tagging • syntactic parsing • semantic relations • named entity extraction • dictionaries • WordNet, etc

System Overview • Question Classification • Question Parsing • Query Formulation • Document Retrieval • Candidate Sentence Retrieval • Answer Extraction

[Figure: system architecture. Components shown: question (Q); Question Analysis (Question Classification, Question Parsing); original content words; Query Formulation by External Knowledge (Web, WordNet); expanded content words; Document Retrieval; relevant TREC documents; Sentence Ranking; candidate sentences; Answer Extraction; answer (A); feedback step: reduce # of expanded content words]

Question Classification • Based on question focus and answer type • 7 main classes • HUM, LOC, TME, NUM, OBJ, DES, UNKNOWN • E.g. “Which city is the capital of Canada?” (Q-class: LOC) • E.g. “Which state is the capital of Canada in?” (Q-class: LOC) • 54 sub-classes • E.g. under LOC (location), we have 14 sub-classes: • LOC_PLANET: 1 • LOC_CITY: 18 • LOC_CONTINENT: 3 • LOC_COUNTRY: 18 • LOC_COUNTY: 3 • LOC_STATE: 3 • LOC_PROVINCE: 2 • LOC_TOWN: 2 • LOC_RIVER: 3 • LOC_LAKE: 2 • LOC_MOUNTAIN: 1 • LOC_OCEAN: 2 • LOC_ISLAND: 3 • LOC_BASIC: 3

Question Parsing • Content Words: q(0) • Nouns, adjectives, numbers, some verbs • E.g. “What mythical Scottish town appears for one day every 100 years?” • Q-class: LOC_TOWN • q(0): (mythical, Scottish, town, appears, one, day, 100, years) • Base Noun Phrases: n • n: (“mythical Scottish town”) • Head of the 1st Noun Phrase: h • h: (town) • Quotation Words: u • E.g. “What was the original name before “The Star Spangled Banner”?” • u: (“The Star Spangled Banner”)

Query Formulation I • Use original Content Words as query to search the Web (e.g. Google) • Find new terms which have high correlation with the original query • Use WordNet to find the Synsets and Glosses of original query terms • Rank new query terms based on both Web and WordNet • Form new boolean query

Query Formulation II • Original query q(0) = (q1(0), q2(0), …, qk(0)) • Use Web as generalized resource • From q(0), retrieve top N documents • ∀ qi(0) ∈ q(0), extract nearby non-trivial words (in the same sentence or within n words) to get wi • Rank each wik ∈ wi by computing its probability of correlation with qi(0): $\mathrm{Prob}(w_{ik}) = \dfrac{\#\,\text{instances of } (w_{ik} \wedge q_i^{(0)})}{\#\,\text{instances of } (w_{ik} \vee q_i^{(0)})}$ • Merge all wi to form Cq for q(0)
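A minimal sketch of the correlation ranking on this slide, assuming that an "instance" is a sentence from the retrieved Web documents and that co-occurrence is counted at sentence level. The stopword list, the tokenizer and the function names are illustrative assumptions, not taken from the Pris system.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption; the slides do not specify one).
STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to", "for", "was"}


def tokenize(sentence):
    """Lowercase a sentence and return its non-trivial word tokens as a set."""
    return {t for t in re.findall(r"[a-z0-9]+", sentence.lower())
            if t not in STOPWORDS}


def correlated_terms(sentences, query_term):
    """Return {w: Prob(w)} where Prob(w) = #(w AND q) / #(w OR q).

    Counts are taken over sentences (a hypothetical choice of counting unit).
    Only words that co-occur with the query term at least once are returned.
    """
    with_w = Counter()     # sentences containing w
    with_both = Counter()  # sentences containing both w and the query term
    with_q = 0             # sentences containing the query term
    for sentence in sentences:
        words = tokenize(sentence)
        has_q = query_term in words
        with_q += has_q
        for w in words - {query_term}:
            with_w[w] += 1
            if has_q:
                with_both[w] += 1
    probs = {}
    for w, both in with_both.items():
        either = with_w[w] + with_q - both  # inclusion-exclusion for OR
        probs[w] = both / either
    return probs
```

Calling `correlated_terms(sentences, "capital")` on a handful of retrieved sentences yields one score per nearby word; merging the per-term dictionaries for all query terms would give the candidate set Cq.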

Query Formulation III • Use WordNet as generalized resource • ∀ qi(0) ∈ q(0), extract terms that are lexically related to qi(0) by locating them in • Gloss Gi • Synset Si • For q(0), we get Gq and Sq • Re-rank wik ∈ wi by considering lexical relations • ∀ wik ∈ Cq, if wik ∈ Gi, the weight of wik increases by α; if wik ∈ Si, the weight of wik increases by β (0 < α < β < 1) • Get q(1) = q(0) + {top m terms from Cq}
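The re-ranking step can be sketched as follows. This is a hedged illustration: the two increments are called `alpha` and `beta` here with 0 < alpha < beta < 1, their default values are made up, and applying both increments when a term is in both the gloss and the synset is an assumption. The gloss and synset term sets are passed in as plain Python sets rather than looked up in WordNet.

```python
def rerank(candidates, gloss_terms, synset_terms, alpha=0.2, beta=0.4, top_m=3):
    """Boost Web-derived candidate scores with WordNet evidence.

    candidates: {term: correlation score} (the merged set Cq)
    gloss_terms / synset_terms: sets of terms found in glosses / synsets
    alpha, beta: hypothetical increments, constrained to 0 < alpha < beta < 1
    Returns the top_m terms, highest boosted score first.
    """
    if not 0 < alpha < beta < 1:
        raise ValueError("expected 0 < alpha < beta < 1")
    boosted = {}
    for term, score in candidates.items():
        if term in gloss_terms:
            score += alpha
        if term in synset_terms:
            score += beta
        boosted[term] = score
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(boosted, key=lambda t: (-boosted[t], t))
    return ranked[:top_m]


def expand_query(q0, candidates, gloss_terms, synset_terms, top_m=3):
    """Form q(1) = q(0) + top m new terms from the re-ranked candidates."""
    new_terms = [t for t in rerank(candidates, gloss_terms, synset_terms,
                                   top_m=top_m) if t not in q0]
    return list(q0) + new_terms
```

Because beta is larger than alpha, a synonym outranks a gloss word when their Web correlation scores are close, which matches the ordering constraint on the slide.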

Document Retrieval • 1,033,461 documents from • AP newswire, 1998-2000 • New York Times newswire, 1998-2000 • Xinhua News Agency, 1996-2000 • MG Tool • Boolean search to retrieve the top N documents (N = 50) • ∀ tk ∈ q(1), Q = (t1 ∧ t2 ∧ … ∧ tn)
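A small sketch of the Boolean retrieval step, assuming the query is a conjunction of all terms in q(1) (consistent with the later remark that Boolean search is strict). It does not use the MG tool; the in-memory matching, the function names and the "first N matches" ordering are all assumptions made for illustration.

```python
import re


def boolean_and_query(terms):
    """Render the expanded query q(1) as a conjunctive Boolean query string."""
    return "(" + " AND ".join(terms) + ")"


def retrieve(documents, terms, top_n=50):
    """Return up to top_n documents that contain every query term.

    Documents are kept in collection order (an assumption; a real engine
    would apply its own ranking).
    """
    wanted = [t.lower() for t in terms]
    hits = []
    for doc in documents:
        words = set(re.findall(r"[a-z0-9]+", doc.lower()))
        if all(t in words for t in wanted):
            hits.append(doc)
    return hits[:top_n]
```

A single missing term excludes a document, which is why adding more expanded terms tightens the result set.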

Candidate Sentence Retrieval • ∀ Sentj in the top N documents, match with: • quotation words: • Wuj = % of term overlap between u and Sentj • noun phrases: • Wnj = % of phrase overlap between n and Sentj • head of first noun phrase: • Whj = 1 if there is a match and 0 otherwise • original content words: • Wcj = % of term overlap between q(0) and Sentj • expanded content words: • Wej = % of term overlap between q(1-0) and Sentj, where q(1-0) = q(1) - q(0) • Final score: $\mathrm{Score}_j = \sum_i \alpha_i W_{ij}$, where $\sum_i \alpha_i = 1$ and $W_{ij} \in \{W_{uj}, W_{nj}, W_{hj}, W_{cj}, W_{ej}\}$.
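The sentence score can be sketched as a weighted sum of the five overlap features, with weights that sum to 1. The weight values, the tokenizer and the substring test for phrase overlap below are assumptions chosen for illustration; the slides do not give the actual settings.

```python
import re


def term_overlap(terms, tokens):
    """Fraction of terms that occur among the sentence tokens (0.0 if empty)."""
    if not terms:
        return 0.0
    return sum(t.lower() in tokens for t in terms) / len(terms)


def phrase_overlap(phrases, text):
    """Fraction of phrases that occur verbatim in the lowercased sentence."""
    if not phrases:
        return 0.0
    return sum(p.lower() in text for p in phrases) / len(phrases)


def score_sentence(sentence, u, n, h, q0, q_exp, weights):
    """Weighted sum of the features Wu, Wn, Wh, Wc, We for one sentence.

    u: quotation words, n: base noun phrases, h: head of first noun phrase,
    q0: original content words, q_exp: expanded-only words q(1) - q(0),
    weights: {"u","n","h","c","e"} -> weight, required to sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    text = sentence.lower()
    tokens = set(re.findall(r"[a-z0-9]+", text))
    features = {
        "u": term_overlap(u, tokens),
        "n": phrase_overlap(n, text),
        "h": 1.0 if h.lower() in tokens else 0.0,
        "c": term_overlap(q0, tokens),
        "e": term_overlap(q_exp, tokens),
    }
    return sum(weights[k] * features[k] for k in features)
```

Sentences would then be sorted by this score and the top K passed on to answer extraction.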

Answer Extraction I • Fine-grained NE tagging for the top K sentences • For each sentence, extract the string which matches the Question Class • E.g. “Who is Tom Cruise married to?” • Q-class: HUM_BASIC • Top ranked Candidate Sentence: • Actor <HUM_PERSON Tom Cruise> and his wife <HUM_PERSON Nicole Kidman> accepted “substantial” libel damages on <TME_DATE Thursday> from a <LOC_COUNTRY British> newspaper that reported he was gay and that their marriage was a sham to cover it up. • Answer string: Nicole Kidman
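A hedged sketch of the extraction step over an NE-tagged sentence. It assumes tags are written as `<CLASS_SUBCLASS text>`, that a tag matches the question when the main class (the part before the underscore) agrees, and that entities already mentioned in the question are skipped. These matching rules and the function name are illustrative assumptions, not the documented Pris rules.

```python
import re

# Matches tags of the form <HUM_PERSON Nicole Kidman>.
TAG_PATTERN = re.compile(r"<([A-Z]+_[A-Z]+)\s+([^<>]+?)\s*>")


def extract_answer(tagged_sentence, q_class, question):
    """Return the first tagged entity whose main class matches q_class
    and which does not already appear in the question, or None."""
    main_class = q_class.split("_")[0]
    question_lower = question.lower()
    for tag, text in TAG_PATTERN.findall(tagged_sentence):
        if tag.split("_")[0] != main_class:
            continue
        if text.lower() in question_lower:
            continue  # the question already names this entity
        return text
    return None
```

For the Tom Cruise example, the first HUM entity is rejected because it appears in the question, so the second HUM entity becomes the answer string.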

Answer Extraction II • For some questions, we cannot find any answer • reduce the # of expanded query terms and repeat the Document Retrieval, Candidate Sentence Retrieval and Answer Extraction • The whole process lasts for N iterations (N=5) • If we still cannot find an exact answer, NIL is considered as the answer • increase recall step by step while preserving precision
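The relaxation loop on this slide can be sketched as below. The `pipeline` callable stands in for Document Retrieval, Candidate Sentence Retrieval and Answer Extraction combined; dropping exactly one expanded term per iteration (the lowest-ranked one) is an assumption, since the slide only says the number of expanded terms is reduced.

```python
def answer_with_relaxation(q0, expansion_terms, pipeline, max_iters=5):
    """Retry the QA pipeline with progressively fewer expanded terms.

    q0: original content words (always kept)
    expansion_terms: expanded terms, best first
    pipeline: hypothetical callable taking a list of query terms and
              returning an answer string or None
    Returns the first answer found, or "NIL" after max_iters attempts.
    """
    terms = list(expansion_terms)
    for _ in range(max_iters):
        answer = pipeline(list(q0) + terms)
        if answer is not None:
            return answer
        if not terms:
            break  # nothing left to relax
        terms = terms[:-1]  # drop the lowest-ranked expanded term
    return "NIL"
```

Starting with the full expanded query favours precision; each relaxation trades a little precision for recall, and NIL is returned only when every attempt fails.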

Evaluation in TREC 2002 • Uninterpolated average precision: $\dfrac{1}{500} \sum_{i=1}^{500} \dfrac{\#\,\text{correct up to question } i}{i}$ • We answered 290 questions correctly • Score: 0.61
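The measure can be computed directly from a list of per-question judgements. This sketch assumes the list is already in the order in which the system submitted its answers (in TREC 2002 this was the system's own confidence ordering, most confident first); the function name is illustrative.

```python
def uninterpolated_average_precision(judgements):
    """Average over i of (#correct among the first i questions) / i.

    judgements: list of booleans, one per question, in submission order.
    """
    if not judgements:
        raise ValueError("need at least one judged question")
    correct_so_far = 0
    total = 0.0
    for i, is_correct in enumerate(judgements, start=1):
        if is_correct:
            correct_so_far += 1
        total += correct_so_far / i
    return total / len(judgements)
```

The measure rewards placing correct answers early: the same number of correct answers gives a higher score when they appear at the front of the ordering, which is how 290 correct out of 500 (0.58) can yield a score of 0.61.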

Result Analysis I [Figure: number of questions with a correct answer (y-axis, 0 to 45) plotted against the number of runs with correct answers (x-axis, 1 to 51); series: Total Num Q and Our Num of Q]

Result Analysis II • Recognize no answers (NIL) • Precision: 41 / 170 = 0.241 • Recall: 41 / 46 = 0.891 • Non-NIL answers • Precision: 249 / 330 = 0.755 • Recall: 249 / 444 = 0.561 • Overall recall is low compared to precision, because Boolean search is strict.
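The figures on this slide can be reproduced with a two-line helper; the function and parameter names are illustrative.

```python
def precision_recall(correct, returned, expected):
    """Precision = correct / returned; recall = correct / expected."""
    return correct / returned, correct / expected


# NIL answers: 41 correct, 170 returned, recall denominator 46.
nil_p, nil_r = precision_recall(41, 170, 46)
# Non-NIL answers: 249 correct, 330 returned, recall denominator 444.
ans_p, ans_r = precision_recall(249, 330, 444)
```

Note that 41 + 249 = 290, which matches the total number of correctly answered questions reported for the evaluation.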

Result Analysis III

Conclusion • Integration of both Lexical Knowledge and External Resources • Detailed Question Classification • Use of Fine-grained Named Entities for Question Answering • Successive Constraint Relaxation

Future Work • Refining our term correlation by considering a combination of local context, global context and lexical correlations • Exploring the structured use of external knowledge using the semantic perceptron net • Developing template-based answer selection • Longer-term research plan: interactive QA, analysis and opinion questions

Thank You !
