BBN AQUA
Scott Miller & Ralph Weischedel
13 June 2002
BBN’s Approach
• Theme: Use statistical learning algorithms for document retrieval, entity recognition, and proposition recognition
• Mechanism
  • Analyze the question
    • Reduce the question to propositions, entities, and a bag of words
    • Predict the type of the answer
  • Retrieve passages using document retrieval based on the propositional components and the bag of words
  • Find the most relevant (in the document retrieval sense) passage(s) that contain an answer of the right type and satisfy the propositions
  • Return the answer and the document
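The mechanism above can be sketched as a toy pipeline. This is only an illustration under loud assumptions: the slides describe statistically trained components, whereas here the answer type comes from a hand-written lookup on the first word, retrieval is plain word overlap, entity types are supplied by hand in place of a name tagger, and the proposition-matching step is omitted. All names (`Passage`, `analyze`, `answer`, `PASSAGES`) are hypothetical.

```python
from dataclasses import dataclass

STOPWORDS = {"who", "what", "where", "when", "is", "was", "the", "a", "an", "of", "in", "did"}

@dataclass
class Passage:
    doc_id: str
    text: str
    entities: dict  # entity string -> type; stands in for a name tagger's output

def tokenize(text):
    return [w.strip("?.,").lower() for w in text.split() if w.strip("?.,")]

def analyze(question):
    """Reduce the question to a bag of words and a predicted answer type."""
    words = tokenize(question)
    bag = {w for w in words if w not in STOPWORDS}
    first = words[0] if words else ""
    # Assumption: a fixed lookup replaces the trained question classifier.
    answer_type = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}.get(first, "OTHER")
    return bag, answer_type

def score(bag, passage):
    """Word-overlap relevance; a stand-in for a real retrieval model."""
    return len(bag & set(tokenize(passage.text)))

def answer(question, passages):
    bag, answer_type = analyze(question)
    ranked = sorted(passages, key=lambda p: score(bag, p), reverse=True)
    for p in ranked:
        if score(bag, p) == 0:
            break  # remaining passages share no words with the question
        for entity, etype in p.entities.items():
            # Keep entities of the right type that are not already in the question.
            if etype == answer_type and entity.lower() not in question.lower():
                return entity, p.doc_id
    return None

PASSAGES = [
    Passage("d1", "Marie Curie discovered radium in 1898.",
            {"Marie Curie": "PERSON", "1898": "DATE"}),
    Passage("d2", "Radium is a radioactive element.", {}),
]

print(answer("Who discovered radium?", PASSAGES))
print(answer("When was radium discovered?", PASSAGES))
```

The function returns the answer together with its document, mirroring the last step of the mechanism; swapping in trained components would change `analyze` and `score` but not the overall flow.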
Frequency of Q Types
[Figure not recovered; surviving label: Defined Types]
Question Classification
• Developed initial statistically trainable classifier
  • Offers language independence
• Collecting and annotating training data by type
  • More than 2,000 questions annotated to date
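As an illustration of what "statistically trainable" can mean here, the sketch below trains a small multinomial naive Bayes classifier on questions annotated by type. The slides do not say which model BBN used, so naive Bayes is an assumption, and the six training questions and type labels are invented for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesQuestionClassifier:
    """Multinomial naive Bayes with add-one smoothing (an assumed model)."""

    def __init__(self):
        self.type_counts = Counter()             # questions seen per type
        self.word_counts = defaultdict(Counter)  # word counts per type
        self.vocab = set()

    @staticmethod
    def tokenize(question):
        return [w.strip("?.,").lower() for w in question.split() if w.strip("?.,")]

    def train(self, examples):
        for question, qtype in examples:
            self.type_counts[qtype] += 1
            for w in self.tokenize(question):
                self.word_counts[qtype][w] += 1
                self.vocab.add(w)

    def classify(self, question):
        total = sum(self.type_counts.values())
        best_type, best_logp = None, -math.inf
        for qtype, n in self.type_counts.items():
            logp = math.log(n / total)  # log prior of the type
            denom = sum(self.word_counts[qtype].values()) + len(self.vocab)
            for w in self.tokenize(question):
                logp += math.log((self.word_counts[qtype][w] + 1) / denom)
            if logp > best_logp:
                best_type, best_logp = qtype, logp
        return best_type

# Hypothetical annotated questions, standing in for the real training set.
TRAIN = [
    ("Who wrote Hamlet?", "PERSON"),
    ("Who is the president of France?", "PERSON"),
    ("Where is the Eiffel Tower?", "LOCATION"),
    ("Where was Mozart born?", "LOCATION"),
    ("When did the war end?", "DATE"),
    ("When was the treaty signed?", "DATE"),
]

clf = NaiveBayesQuestionClassifier()
clf.train(TRAIN)
print(clf.classify("Who invented the telephone?"))
print(clf.classify("Where is Berlin?"))
```

Nothing in the classifier is specific to English beyond whitespace tokenization, which is one way to read the language-independence claim: retraining on questions annotated in another language requires only a suitable tokenizer.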
Spotting Answers via IdentiFinder
• Same categories as the question-answer types, except that there are no categories for
  • Reason
  • Definition
  • Use
  • Biography
  • Cause-Effect-Influence
  • Other
Status
• Singly annotated 1M word Treebank by types and subtypes for names and descriptions
• Current IdentiFinder performance
• IdentiFinder easily trainable for other languages, e.g., Arabic and Chinese
Spotting Answers via SIFT Parser
• Parse a ‘paragraph’ that is relevant
• Find the following structures that involve the person/thing to be described
  • Appositive
    • Lt. Gen. Colin L. Powell
    • Colin Powell, outgoing chairman of the US Joint Chiefs of Staff,
  • Copula
    • Colin L. Powell became the first black to serve as White House national security adviser
• Select the appropriate description
• SIFT parser easily trainable for other languages where a treebank is under development, e.g., Arabic and Chinese
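To make the two structures concrete, the sketch below pulls appositive and copula descriptions of a named person out of raw text. It is a surface approximation only: the slides obtain these structures from SIFT parse trees, while this example substitutes regular expressions, which is an assumption made for brevity. The function name `find_descriptions` is hypothetical, and the sample text adapts the slide's examples to use a single form of the name.

```python
import re

def find_descriptions(name, text):
    """Return (kind, description) pairs for appositive and copula patterns.

    Regex stand-in for parser output: an appositive is taken to be the
    comma-delimited phrase right after the name, and a copula description
    is whatever follows 'is', 'was', or 'became' up to the next comma or period.
    """
    n = re.escape(name)
    patterns = [
        ("appositive", rf"{n},\s+([^,.]+),"),
        ("copula", rf"{n}\s+(?:is|was|became)\s+([^,.]+)"),
    ]
    found = []
    for kind, pattern in patterns:
        for m in re.finditer(pattern, text):
            found.append((kind, m.group(1).strip()))
    return found

TEXT = (
    "Colin Powell, outgoing chairman of the US Joint Chiefs of Staff, "
    "spoke on Monday. Colin Powell became the first black to serve as "
    "White House national security adviser."
)

for kind, description in find_descriptions("Colin Powell", TEXT):
    print(kind, "->", description)
```

A parser-based version would not depend on exact name strings or punctuation, and it would handle name variants such as "Colin L. Powell" once references are resolved; choosing among the candidate descriptions (the "select" step) is left out here.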
What is Proposition Indexing
• A shallow semantic representation
  • Deeper than bags of words
  • But broad enough to cover all the text
• Characterizes documents by
  • The entities they contain
  • Propositions involving those entities
• Resolves all references to entities
  • Whether named, described, or pronominal
• Represents all propositions that are directly stated in the text
Why Proposition Indexing
• Question
  • Who is Ayman al-Zawahri?
• Source text
  Osama bin Laden, his top deputy and a man identified as a Sept. 11 attacker were shown in a brief video aired Monday by the pan-Arab satellite station Al-Jazeera. It wasn't clear when the tapes were made. The deputy, Ayman al-Zawahri, was shown claiming the Sept. 11 attacks as a “great victory.”
• Answer
  • Osama bin Laden’s deputy.
Proposition Example
• Source text
  Osama bin Laden, his top deputy and a man identified as a Sept. 11 attacker were shown in a brief video aired Monday by the pan-Arab satellite station Al-Jazeera. It wasn't clear when the tapes were made.
  The deputy, Ayman al-Zawahri, was shown claiming the Sept. 11 attacks as a “great victory.”
• Propositions
  (shown lobj:e2 in:e1) …
  (video e1) (brief e1)
  (*person* e2) (*name* e2 “Osama bin Laden”)
  (*person* e3) (deputy e3 of:e2)
  (*person* e4) (man e4) …
  (*name* e3 “Ayman al-Zawahri”)
[Slide callout: ANSWER]
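The example above can be turned into a small lookup to show why resolved propositions answer "Who is Ayman al-Zawahri?" even though the name and the word "deputy" first appear in different sentences. This is a minimal sketch: the `Proposition` record and the `describe` function are hypothetical, only the unary and role-bearing propositions shown on the slide are encoded, and the elided ones (marked "…") and the `shown` proposition are left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposition:
    predicate: str
    entity: str       # entity id the proposition is about, e.g. "e3"
    args: tuple = ()  # (role, value) pairs; value is an entity id or a string

PROPS = [
    Proposition("video", "e1"),
    Proposition("brief", "e1"),
    Proposition("*person*", "e2"),
    Proposition("*name*", "e2", (("value", "Osama bin Laden"),)),
    Proposition("*person*", "e3"),
    Proposition("deputy", "e3", (("of", "e2"),)),
    Proposition("*person*", "e4"),
    Proposition("man", "e4"),
    Proposition("*name*", "e3", (("value", "Ayman al-Zawahri"),)),
]

def names(props):
    """Map each entity id to its name, taken from *name* propositions."""
    return {p.entity: dict(p.args)["value"] for p in props if p.predicate == "*name*"}

def describe(name, props):
    """Render the descriptive propositions about the entity with this name."""
    id_to_name = names(props)
    ids = {e for e, n in id_to_name.items() if n == name}
    descriptions = []
    for p in props:
        # Skip bookkeeping predicates such as *person* and *name*.
        if p.entity in ids and not p.predicate.startswith("*"):
            parts = [p.predicate]
            for role, value in p.args:
                # Replace entity ids by names where a name is known.
                parts.append(f"{role} {id_to_name.get(value, value)}")
            descriptions.append(" ".join(parts))
    return descriptions

print(describe("Ayman al-Zawahri", PROPS))
```

The lookup works only because both mentions ("his top deputy" and "The deputy, Ayman al-Zawahri") were resolved to the same entity `e3`; a bag-of-words index has no such link.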
Status
• Annotated full UPenn 1M word Treebank by semantic categories
  • Names
  • Descriptions
  • Quantity/numerical expressions
• Received initial 100k word PropBank additions to Treebank
• Hypothesized an initial generative model
Proposition Recognition Strategy
• Start with a lexicalized, probabilistic (LPCFG) parsing model
• Distinguish names by replacing NP labels with NPP
• Extend the model to
  • Predict argument labels for clauses
  • Resolve references to entities
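The NP-to-NPP relabeling step can be illustrated on a toy tree. The slides do not say how name constituents are identified (presumably from the name annotation on the treebank), so the rule used here is an assumption made purely for illustration: an NP whose children are all NNP-tagged words is treated as a name. Trees are nested tuples of the form `(label, child, ...)` with words as plain strings; `relabel_names` is a hypothetical helper.

```python
def relabel_names(tree):
    """Return a copy of the tree with name NPs relabeled as NPP.

    Assumed rule: an NP counts as a name if every child is an
    NNP preterminal. Leaves (plain strings) are returned unchanged.
    """
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [relabel_names(c) for c in children]
    if label == "NP" and children and all(
        not isinstance(c, str) and c[0] == "NNP" for c in children
    ):
        label = "NPP"
    return (label, *children)

TREE = (
    "S",
    ("NP", ("NNP", "Nawaz"), ("NNP", "Sharif")),
    ("VP",
     ("VBD", "led"),
     ("NP", ("NNP", "Pakistan"))),
)

print(relabel_names(TREE))
```

Giving names their own label lets the parsing model learn separate statistics for name and non-name noun phrases, which is the distinction the strategy relies on; an NP such as `("NP", ("DT", "the"), ("NN", "army"))` is left unchanged by this rule.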
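Extended Parse
[Figure: parse tree extended with argument labels and entity references. Recoverable labels: constituents S, SBAR, VP, NP, NPP, PP, WHNP; argument labels lsbj, lobj; entities E1 PER, E2 GPE, E3 PER. Leaf tokens, in extraction order: , was ousted led Sharif who , Muscharraf 12 by , * Pakistan Pervez General October Nawaz Army Pakistani]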
Conclusions
• Basis for long-term infrastructure underway
  • Training data for rich set of categories & subcategories on million word treebank
  • IdentiFinder retrained on rich set of categories
  • Developed trainable question classifier
  • Collected and annotated questions by types
  • Parser able to process collection
• Remaining steps in CY2002
  • Improve system performance
  • Participate in end-to-end evaluation in TREC QA 2002
  • Develop(ing) proposition recognition while awaiting PropBank
  • Prepare for pilot AQUAINT evaluation
  • Develop model of important, non-redundant information
  • Integrate proposition recognition