
Relevance Models for QA: Project Update. University of Massachusetts, Amherst


Presentation Transcript


  1. Relevance Models for QA: Project Update. University of Massachusetts, Amherst. AQUAINT meeting, December 2002. Bruce Croft and James Allan, PIs

  2. UMass AQUAINT Project Status
  • Question answering using language models
    • Carried out more experiments using the basic LM approach
    • Developed new model(s) and started more experiments
    • Moved experiments to the LEMUR toolkit
  • Query triage
    • Studied the Clarity measure for questions
  • Question answering with semi-structured data
    • Developed HMM- and CRF-based table extractors
    • More experiments on question answering with table structure
  • Answer updating
    • Experiments with time-based questions

  3. QA using LM
  • P(Answer|Question) can be estimated many ways
    • Could be done directly, but usually will involve intermediate steps such as documents or question classes (sketched below)
  • Initially focused on answer passages, but “extracted” answers can be modeled
    • Can model “templates” as well as n-gram answer models
  • Can also introduce cross-lingual QA through P(A_lang1|Q_lang2)
  • Every approach requires training data
    • “answer mining” for answer models/templates
    • incorporating user feedback
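To make the intermediate-step estimation above concrete, here is a minimal sketch, assuming a simple query-likelihood setup: P(A|Q) is approximated by summing P(A|D)·P(D|Q) over retrieved documents D. The smoothing scheme, function names, and toy data handling are illustrative assumptions, not the project's actual estimators.

```python
# Minimal sketch: estimating P(Answer | Question) through documents as an
# intermediate step, i.e. P(A|Q) ~= sum_D P(A|D) * P(D|Q).
# The component scores below are hypothetical placeholders.

from collections import Counter
import math

def lm_score(text_tokens, query_tokens, mu=2000, collection=None):
    """Dirichlet-smoothed query-likelihood score log P(query | text model)."""
    collection = collection or Counter()
    coll_size = max(sum(collection.values()), 1)
    counts = Counter(text_tokens)
    length = len(text_tokens)
    logp = 0.0
    for w in query_tokens:
        p_coll = (collection[w] + 0.5) / coll_size      # background probability
        p = (counts[w] + mu * p_coll) / (length + mu)   # smoothed estimate
        logp += math.log(p)
    return logp

def answer_likelihood(question, answers, documents, collection):
    """Rank candidate answers by sum_D P(A|D) * P(D|Q), with P(D|Q) re-normalized."""
    q = question.split()
    d_scores = [math.exp(lm_score(d.split(), q, collection=collection)) for d in documents]
    z = sum(d_scores) or 1.0
    p_d_given_q = [s / z for s in d_scores]
    ranked = []
    for a in answers:
        a_tokens = a.split()
        p_a = sum(math.exp(lm_score(d.split(), a_tokens, collection=collection)) * pd
                  for d, pd in zip(documents, p_d_given_q))
        ranked.append((p_a, a))
    return sorted(ranked, reverse=True)
```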

  4. Query Triage
  • Given a question, what can we infer from it?
    • Query vs. question
    • Quality (does it need to be made more precise?)
    • Type (likely form of answers and granularity)
    • Human intermediation (should it be directed to a human expert?)
  • Previous work developed the “Clarity” measure for queries and tested it on TREC ad-hoc data
    • Demonstrated high correlation with performance
    • Threshold can be set automatically
  • Current research focuses on TREC QA data

  5. Predicting Question Performance
  • Basic result: we can predict question performance (with some qualifications)
  • Did not work for some TREC question classes
  • For example:
    • What is the date of Bastille Day? (TREC-9P Clarity score 2.49)
    • What time of year do most people fly? (TREC-9P Clarity score 0.76)

  6. Clarity Score Computation
  [Figure: given a question Q, text passages A are retrieved and ranked by P(A|Q); a question-related language model is estimated from the top passages and its divergence from the collection language model is computed. The accompanying plot shows log P for terms such as “do”, “day”, “what”, “bastille”, “the”, “paris”, “celebrate”, and “assmann”, which contribute to the Clarity score.]
  (This computation is sketched in code below.)
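A minimal sketch of the computation in the figure, assuming a simple unigram question-related model estimated from the top retrieved passages: the Clarity score is the KL divergence (base 2, matching the per-term log2 weights on the next slide) of that model from the collection language model. The smoothing constant and helper functions are illustrative assumptions, not the exact TREC-9P configuration.

```python
# Sketch of the Clarity score: KL divergence between a question-related
# language model (built from top-ranked passages) and the collection model.
# Smoothing and the number of passages used are illustrative assumptions.

from collections import Counter
import math

def unigram_model(token_lists):
    counts = Counter(t for tokens in token_lists for t in tokens)
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}, counts, total

def clarity_score(top_passages, collection_passages, lam=0.6):
    """Clarity = sum_w P(w|Q) * log2( P(w|Q) / P(w|C) )."""
    q_model, _, _ = unigram_model(top_passages)
    c_model, c_counts, c_total = unigram_model(collection_passages)
    score = 0.0
    for w, p_q in q_model.items():
        p_c = c_counts.get(w, 0.5) / c_total          # crude background estimate
        p_q_smoothed = lam * p_q + (1 - lam) * p_c    # mix with the collection model
        score += p_q_smoothed * math.log2(p_q_smoothed / p_c)
    return score
```

Higher scores indicate a question-related model that is far from the background collection model; very low scores indicate a vague, collection-like question.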

  7. Clarity Example (for queries)
  • 56.12 “Show me predictions for changes in the prime lending rate and any changes made in the prime lending rates” (clarity 2.85). Top 6 terms in query model: 1. “bank” 2. “hong” 3. “kong” 4. “rate” 5. “lend” 6. “prime”
  • 56.08 “What adjustments should be made when federal action occurs?” (clarity 0.37). Top 6 terms in query model: 1. “adjust” 2. “federal” 3. “action” 4. “land” 5. “occur” 6. “hyundai”
  [Figure: per-term contribution p_q log2(p_q / p_c) plotted against term rank for each query.]

  8. Test System
  • Passages:
    • two sentences, overlapping
    • from top retrieved docs for all questions
  • Measuring performance:
    • Question Likelihood used to rank passages
    • average precision (rather than MRR); a small reference implementation follows below
  • Top 8 documents used to estimate Clarity scores
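Since average precision (rather than MRR) is the measure later correlated with Clarity, a small reference implementation may be useful; the passage identifiers and relevance judgments in the example are invented.

```python
# Average precision over a ranked list of passages: the mean of the precision
# values at the ranks where relevant passages appear.

def average_precision(ranked_ids, relevant_ids):
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical example: passages p2 and p5 answer the question.
print(average_precision(["p1", "p2", "p3", "p4", "p5"], ["p2", "p5"]))  # 0.45
```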

  9. Precision vs. Clarity (Time Qs)
  [Figure: scatter plot of average precision vs. Clarity score for time questions, with “What is the date of Bastille Day?”, “What is Martin Luther King Jr.'s real birthday?”, and “What time of year do most people fly?” called out.]

  10. Correlation by Question Type

  11. Correlation Analysis
  • Strong on average
    • Allows prediction of question performance
  • Variation with question type
  • Two bad (R < 0) cases: Amount and Person
    • Amount: only 33 questions, only a few bad Qs
    • Person: 93 questions, plenty of bad Qs to analyze
  • What’s going on?

  12. Predictive Mistakes
  [Figure: average precision (0 to 1) plotted against Clarity score (0 to 3).]
  Two kinds of mistakes:
  • High clarity, low average precision
    • E.g., What is Martin Luther King Jr.'s real birthday?
    • Answerless, coherent, very likely context in the collection
    • Rare (a good thing for the method)
  • Low clarity, high average precision
    • Various kinds of bad luck
    • Often coupled with few relevant passages
    • Many examples in the Person case…

  13. Precision vs. Clarity (Person Qs)
  [Figure: average precision (0 to 1) vs. Clarity score (0.8 to 2.8) for Person questions.]
  • 15 “really bad” mistakes
    • “Really bad” ≡ clarity score below the 30th percentile and average precision above the 70th percentile
  • 8 with many relevant answer passages (> 50)
    • 5 (one-third) are slight variants of Who created “The Muppets”?
    • 2 are variants of What king signed the Magna Carta?
    • 1 other question with plenty of relevant passages
  • 7 with few relevant answer passages
    • E.g., Silly Putty was invented by whom? (2 relevant passages)

  14. QA using Tables
  • Developed and tested the QuASM demonstration system using non-LM techniques
    • extraction of tabular structure
    • answer passages constructed from extracted data and metadata
    • extension of question types for “statistical” data
    • failure analysis
  • Major focus now is to develop a probabilistic framework for the whole process
    • tabular structure extraction
    • answer passage representation
    • P(Answer|Question)

  15. QuASM – Lessons Learned
  • Much harder to find answers in tables than in text
  • Table extraction is the key issue
  • Representation of answer passages is also very important
    • what is an answer passage for tables?
    • e.g., too much metadata can cause poor retrieval

  16. Table Extraction
  • Heuristics do a good job of identifying tables: 97.8% of lines are labeled correctly as in or out of a table (a toy heuristic is sketched below)
  • Small labeling errors, however, can lead to poor retrieval
  • The current algorithm for extracting header information is too permissive
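A toy version of the kind of line-level heuristic referred to above: a line is flagged as part of a table if it splits into several whitespace-separated cells or looks like a separator row. The thresholds and regular expressions are invented for illustration; they are not the QuASM rules.

```python
import re

def looks_like_table_line(line, min_cells=3, min_gap=2):
    """Heuristic: a line is 'in a table' if it splits into several cells
    separated by runs of whitespace, or is a dashed/underscore separator."""
    if re.fullmatch(r"[\s_\-|=]{5,}", line):             # separator rows
        return True
    cells = re.split(r"\s{%d,}" % min_gap, line.strip())
    cells = [c for c in cells if c]
    return len(cells) >= min_cells

print(looks_like_table_line("Alabama.......  1,114 |   499  45.8  44.6  3.1"))  # True
print(looks_like_table_line("Number and Percent of Children under 19 Years"))   # False
```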

  17. Text Table Transformation
<h4><pre><font color=maroon>
Number and Percent of Children under 19 Years of Age, at or below 200 Percent of Poverty,
by State: Three-Year Averages for 1997, 1998, and 1999. (Numbers in Thousands)</font>
_________________________________________________________________________________
                      |       AT OR BELOW          | AT OR BELOW 200% OF POVERTY |
Total children        |     200% OF POVERTY        |  WITHOUT HEALTH INSURANCE   |
under 19 years,       |____________________________|_____________________________|
all income levels     |        Standard  Standard  |        Standard   Standard  |
                      |Number  error     Pct. error|Number  error      Pct. error|
______________________|____________________________|_____________________________|
Alabama.......  1,114 |   499   45.8     44.6  3.1 |   106   21.3       9.6  1.8 |
Alaska........    215 |    63    6.4     29.4  2.5 |    18    3.4       8.3  1.5 |
Arizona.......  1,430 |   730   54.7     51.1  2.7 |   272   33.6      19.0  2.1 |
Arkansas......    740 |   377   30.5     50.5  2.9 |   111   16.5      14.7  2.0 |

  18. Text Table Transformation - Problems
  • Missed part of the title due to lack of indentation (only “(Numbers in Thousands)” ended up in the TITLE element)
  • Extraneous text (e.g., the stray “</font>” and the trailing “| 499” fragment)
  <QA_SECTION>
  <TITLE> (Numbers in Thousands)</font> </TITLE>
  <CAPTIONS> | AT OR BELOW | AT OR BELOW 200% OF POVERTY | Total children | 200% OF POVERTY | WITHOUT HEALTH INSURANCE | under 19 years, |____________________________|_____________________________| all income levels | Standard Standard| Standard Standard | |Number error Pct. error |Number error Pct. error | </CAPTIONS>
  <ROW> Alabama....... </ROW>
  <COLUMN> AT OR BELOW 200% OF POVERTY ____________________________ Standard Number </COLUMN>. | 499
  </QA_SECTION>

  19. Features and Line Tags
  • Features (per line): 3 Cells, 2 Gaps, Mostly Letters, Mostly Digits, Header Like, Dashes, Starts with Spaces, Consecutive Spaces, All White Space (a feature-extraction sketch follows below)
  • New labeling (line tags): NONTABLE, BLANKLINE, TITLE, SUPERHEADER, TABLEHEADER, SUBHEADER, DATAROW, SEPARATOR, SECTIONHEADER, SECTIONDATAROW, TABLEFOOTNOTE, TABLECAPTION
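As referenced above, a sketch of turning each input line into the binary feature vector on this slide (3 Cells, 2 Gaps, Mostly Letters, and so on). The thresholds and the header-word list are illustrative guesses, not the extractor's actual definitions.

```python
import re

FEATURES = ["3_cells", "2_gaps", "mostly_letters", "mostly_digits",
            "header_like", "dashes", "starts_with_spaces",
            "consecutive_spaces", "all_white_space"]

def line_features(line):
    """Binary feature dict for one line; thresholds are illustrative."""
    stripped = line.strip()
    cells = [c for c in re.split(r"\s{2,}", stripped) if c]
    gaps = len(re.findall(r"\s{2,}", stripped))
    letters = sum(ch.isalpha() for ch in stripped)
    digits = sum(ch.isdigit() for ch in stripped)
    visible = max(len(stripped), 1)
    return {
        "3_cells": len(cells) >= 3,
        "2_gaps": gaps >= 2,
        "mostly_letters": letters / visible > 0.5,
        "mostly_digits": digits / visible > 0.3,
        "header_like": bool(re.search(r"(Number|Pct\.|error|Total)", line)),
        "dashes": bool(re.search(r"[-_]{4,}", line)),
        "starts_with_spaces": line.startswith("  "),
        "consecutive_spaces": "  " in stripped,
        "all_white_space": stripped == "",
    }
```

Each dict could then be binarized into a vector in the feature order above, giving the <100001000>-style observations shown on the next slide.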

  20. Text Table Extraction Model
  [Figure: a finite state machine (hidden Markov process) with states such as Non-Table, Title, Super Header, Subheader, Table Header, and Data Row. An example state sequence Non-Table, Title, Super Header, Table Header, Data Row, Data Row emits the visible feature vectors <100001000>, <111001000>, <110101000>, <111101000>, <010100100>, <110100100>.]
  • Visible feature vectors probabilistically infer the state sequence.

  21. Features for Table Extraction
  • Per-line features: 3 Cells, 2 Gaps, Mostly Letters, Mostly Digits, Header Like, Dashes, Starts with Spaces, Consecutive Spaces, All White Space
  • These features are not independent
    • Many correlations
    • Overlapping and long-distance dependencies
    • Observations from the past and future

  22. Hidden Markov Models
  [Figure: the state sequence Non-Table, Title, Super Header, Table Header, Data Row, Data Row generating the observed feature vectors <100001000>, <111001000>, <110101000>, <111101000>, <010100100>, <110100100>; observations are conditioned on the state.]
  • HMMs are the standard sequence model
  • They are a generative model of the sequence
  • Generative models do not easily handle non-independent features (a decoding sketch follows below)
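To make the generative-model view concrete, a compact Viterbi decoder over a reduced tag set: each hidden line tag emits an observation symbol standing in for the feature vector, and the most likely tag sequence is recovered. The transition and emission tables are tiny made-up numbers, not trained parameters from this work.

```python
import math

# Tiny illustrative HMM over line tags; all probabilities are made up.
STATES = ["NONTABLE", "TITLE", "TABLEHEADER", "DATAROW"]
TRANS = {   # P(next_state | state)
    "NONTABLE":    {"NONTABLE": 0.7,  "TITLE": 0.2,  "TABLEHEADER": 0.05, "DATAROW": 0.05},
    "TITLE":       {"NONTABLE": 0.1,  "TITLE": 0.1,  "TABLEHEADER": 0.7,  "DATAROW": 0.1},
    "TABLEHEADER": {"NONTABLE": 0.05, "TITLE": 0.05, "TABLEHEADER": 0.2,  "DATAROW": 0.7},
    "DATAROW":     {"NONTABLE": 0.2,  "TITLE": 0.05, "TABLEHEADER": 0.05, "DATAROW": 0.7},
}
EMIT = {    # P(observation symbol | state); symbols stand in for feature vectors
    "NONTABLE":    {"prose": 0.8, "header_like": 0.1, "numeric_row": 0.1},
    "TITLE":       {"prose": 0.6, "header_like": 0.3, "numeric_row": 0.1},
    "TABLEHEADER": {"prose": 0.1, "header_like": 0.8, "numeric_row": 0.1},
    "DATAROW":     {"prose": 0.1, "header_like": 0.1, "numeric_row": 0.8},
}

def viterbi(observations):
    """Most likely tag sequence for a sequence of observation symbols."""
    trellis = [{s: (math.log(1 / len(STATES)) + math.log(EMIT[s][observations[0]]), [s])
                for s in STATES}]
    for obs in observations[1:]:
        column = {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: trellis[-1][p][0] + math.log(TRANS[p][s]))
            score = (trellis[-1][best_prev][0] + math.log(TRANS[best_prev][s])
                     + math.log(EMIT[s][obs]))
            column[s] = (score, trellis[-1][best_prev][1] + [s])
        trellis.append(column)
    return max(trellis[-1].values())[1]

print(viterbi(["prose", "header_like", "numeric_row", "numeric_row"]))
```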

  23. Conditional Random Fields
  [Figure: the same line-tag states (Non-Table, Title, Super Header, Table Header, Data Row, Data Row) and feature vectors, but with the state sequence conditioned on the entire observation sequence.]
  • A conditional model:
    • can examine features, but is not responsible for generating them
    • doesn’t have to explicitly model their dependencies
    • has the ability to handle many arbitrary features with the full power of finite state automata (see the sketch below)
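For comparison, the same kind of per-line feature dictionaries can be fed to an off-the-shelf linear-chain CRF. The sketch below uses the sklearn-crfsuite package as a modern stand-in for the project's own CRF implementation, and the three-line training sequence is invented.

```python
# Sketch: linear-chain CRF over line features, using sklearn-crfsuite as a
# stand-in for the project's CRF implementation (pip install sklearn-crfsuite).
import sklearn_crfsuite

# Each training "document" is a sequence of per-line feature dicts plus a
# parallel sequence of line tags; the data here is invented for illustration.
X_train = [
    [{"3_cells": False, "mostly_letters": True,  "dashes": False},   # a title line
     {"3_cells": True,  "mostly_letters": True,  "dashes": False},   # a header line
     {"3_cells": True,  "mostly_letters": False, "dashes": False}],  # a data row
]
y_train = [["TITLE", "TABLEHEADER", "DATAROW"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)

X_test = [[{"3_cells": True, "mostly_letters": False, "dashes": False}]]
print(crf.predict(X_test))   # e.g. [['DATAROW']]
```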

  24. Results
  • Evaluation: labeled six test documents, a total of 5,817 lines.

  25. Summary of Plans
  • Testing a probabilistic model for QA
  • Refining the Clarity measure for questions
  • Finer-grain table extraction and QA tests
  • Time-dependent language models
