Question Answering over Implicitly Structured Web Content

Eugene Agichtein* Emory University

Chris Burges Microsoft Research

Eric Brill Microsoft Research

* Research done while at Microsoft Research


Questions are Problematic for Web Search

  • What was the name of president Fillmore’s cat?

  • Who invented crocs?

Web search: What was the name of president Fillmore’s cat?

Web Question Answering

Why are questions problematic for web search engines?

  • Search engines treat questions as keyword queries, ignoring the semantic relationships between words and the explicitly stated information need

  • Poor performance for long (> 5 terms) queries

  • Problem exacerbated when common keywords are included

… and millions more such tables and lists …

Implicitly Structured Web Content

  • HTML Tables, Lists

    • Product descriptions

    • Example: Lists of favorite things, “top 10” lists, etc.

  • HTML Syntax (sometimes) reflects semantics

    • Authors imply semantic relationships and entity types by grouping

    • Can infer information about ambiguous entities from others in the same column

  • Millions of HTML tables, lists on the “surface” web alone

    • No common schema

    • Keyword queries: primary access method.

    • How to exploit this structured content for good (e.g., for Question Answering) at web scale?

Related Work

  • Web Question Answering

    • AskMSR (TREC 2001), Aranea (TREC 2003)

    • Mulder (WWW 2001)

    • A No-Frills Architecture for Lightweight Answer Retrieval (WWW 2007)

  • Web-scale Information Extraction

    • QXtract (ICDE 2003): learn keyword queries to retrieve content

    • KnowItAll (WWW 2004): minimal supervision, larger scale

    • TextRunner (IJCAI 2007): single pass scan, disambiguate at query time

    • Towards Domain-Independent Information Extraction from Web Tables (WWW 2007)

Our System TQA: Overview

  • Index all promising HTML tables

  • Translate a question into a select/project query

  • Select table rows, project candidate answers

  • Rank candidate answers

  • Return top K answers
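
To make the flow above concrete, here is a minimal, self-contained sketch of the select/project/rank loop over an in-memory table corpus. The tokenization and scoring are deliberately crude illustrations; they are not the system's actual query translation or ranking.

```python
# Minimal sketch of the TQA query flow: select matching table rows, project
# the non-matching cells as candidate answers, rank them, return the top K.
# Tokenization and scoring here are illustrative simplifications only.

from collections import Counter

def tokenize(text: str) -> set[str]:
    return {t.strip("?.,'\"").lower() for t in text.split()} - {""}

def answer_question(question: str, tables: list[list[list[str]]], k: int = 5) -> list[str]:
    q_terms = tokenize(question)
    candidates = Counter()
    for table in tables:                      # a table is a list of rows of cells
        for row in table:
            row_terms = {t for cell in row for t in tokenize(cell)}
            overlap = len(q_terms & row_terms)
            if overlap == 0:
                continue                      # "select": keep rows matching the query
            for cell in row:                  # "project": non-matching cells are candidates
                if not (tokenize(cell) & q_terms):
                    candidates[cell] += overlap
    return [answer for answer, _ in candidates.most_common(k)]

# Toy usage:
tables = [[["Crocs", "invented by", "Scott Seamans, Lyndon Hanson, George Boedecker"]]]
print(answer_question("Who invented crocs?", tables))
```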

TableQA: Indexing

  • Crawl the Web

  • Identify “promising” tables (heuristic, could be improved)

  • Extract metadata for each table

    • Context

    • Document content

    • Document metadata

  • Index extracted metadata
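
A rough sketch of what the index-time steps above could look like for a single crawled page, using BeautifulSoup for HTML parsing. The "promising table" filter shown (minimum size, consistent column count) is an assumed stand-in for the paper's heuristic, which the slide does not spell out.

```python
# Hedged sketch of index-time table harvesting for one page: find HTML tables,
# keep the "promising" ones, and collect metadata for indexing. The filtering
# thresholds below are assumptions, not the heuristic used in the paper.
# Requires: pip install beautifulsoup4

from bs4 import BeautifulSoup

def extract_promising_tables(html: str, url: str, title: str = "") -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for table in soup.find_all("table"):
        rows = [[cell.get_text(" ", strip=True) for cell in tr.find_all(["td", "th"])]
                for tr in table.find_all("tr")]
        rows = [r for r in rows if r]
        # Assumed heuristic: enough rows, at least two columns, and a consistent
        # column count suggest a data table rather than a page-layout table.
        if len(rows) < 3 or len(rows[0]) < 2 or len({len(r) for r in rows}) != 1:
            continue
        heading = table.find_previous(["h1", "h2", "h3", "h4"])
        records.append({
            "rows": rows,                                    # table content
            "context": heading.get_text(strip=True) if heading else "",
            "doc_url": url,                                  # document metadata
            "doc_title": title,
        })
    return records
```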

Table Metadata

Combines information about the source document and the table context
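
The transcript drops the slide's table of fields, but based on the three categories on the indexing slide, a per-table record might look roughly like the following; the field names are assumptions, not the system's actual schema.

```python
# Illustrative per-table metadata record, covering the three categories from
# the indexing slide: table context, document content, and document metadata.
# Field names are assumptions, not the system's actual schema.

from dataclasses import dataclass

@dataclass
class TableMetadata:
    rows: list[list[str]]     # the table cells themselves
    context: str              # nearby text, e.g. the nearest preceding heading
    doc_text: str             # (possibly truncated) text of the hosting page
    doc_url: str              # document metadata: source URL ...
    doc_title: str            # ... and page title
```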

TQA Question Processing

Table QA: Querying Overview

Features for Ranking Candidate Answers

Ranking Answer Candidates

  • Frequency-based (AskMSR):

  • Heuristic weight assignment (AskMSR improved)

  • Neither is robust or general
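
A small sketch of the two baselines named above; the specific boosts in the heuristic variant are made-up illustrations, not the weights actually used by AskMSR or TQA.

```python
# Sketch of the two baseline rankers named above. The boost values in the
# heuristic variant are illustrative, not the weights used by AskMSR or TQA.

from collections import Counter

def rank_by_frequency(candidates: list[str]) -> list[str]:
    # Frequency-based: a candidate seen in more matching rows ranks higher.
    return [answer for answer, _ in Counter(candidates).most_common()]

def rank_by_heuristic(candidates: list[tuple[str, dict]]) -> list[str]:
    # Heuristic weighting: hand-tuned boosts for signals attached to each
    # candidate. The feature names below are hypothetical examples.
    scores = Counter()
    for answer, feats in candidates:
        weight = 1.0
        if feats.get("header_match"):      # hypothetical: column header matched question focus
            weight += 2.0
        if feats.get("row_hits", 0) > 1:   # hypothetical: several query terms hit the row
            weight += 1.0
        scores[answer] += weight
    return [answer for answer, _ in scores.most_common()]
```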

Ranking Answer Candidates (cont)

  • Solution: machine learning-based ranking

    • Naïve Bayes:

      Score(answer) =

    • RankNet (Burges et al., 2005): scalable neural net implementation:

      • Optimized for ranking – predicting an ordering of items, not a score for each

      • Trains on pairs (where the first point should be ranked higher than or equal to the second)

      • Uses a cross-entropy cost and gradient descent to set the weights
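
A minimal sketch of the pairwise cross-entropy idea behind RankNet, with a linear scorer standing in for the neural network of Burges et al.; the learning rate, epoch count, and feature vectors are placeholders.

```python
# Minimal sketch of RankNet-style pairwise training: for a pair where item i
# should outrank item j, P(i > j) is modeled as sigmoid(s_i - s_j) and the
# cross-entropy loss is minimized by gradient descent. A linear scorer stands
# in here for the neural network used in the actual system.

import math
import random

def train_pairwise(pairs: list[tuple[list[float], list[float]]],
                   num_features: int, lr: float = 0.1, epochs: int = 20) -> list[float]:
    # pairs: (features_of_better_answer, features_of_worse_answer)
    w = [0.0] * num_features
    for _ in range(epochs):
        random.shuffle(pairs)
        for x_hi, x_lo in pairs:
            s_diff = sum(wk * (a - b) for wk, a, b in zip(w, x_hi, x_lo))
            p = 1.0 / (1.0 + math.exp(-s_diff))   # modeled P(correct ordering)
            grad = p - 1.0                        # d(cross-entropy)/d(s_diff) for target 1
            for k in range(num_features):
                w[k] -= lr * grad * (x_hi[k] - x_lo[k])
    return w

def score(w: list[float], features: list[float]) -> float:
    return sum(wk * xk for wk, xk in zip(w, features))
```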

Some Implementation Details

  • Lucene, distributed indices (20M tables per index)

  • NLP Tools:

    • MS internal Named Entity tagger (many free ones exist)

    • Porter Stemmer

  • Relatively light-weight architecture:

    • Client (question processing): desktop machine

    • Table index server: dual-processor, 8 GB RAM, WinNT
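
To illustrate the "distributed indices" point above, here is a hedged sketch of fanning a query out to several index shards (each holding on the order of 20M tables) and merging the hits by score; the Shard interface is a placeholder, not Lucene's API or the system's.

```python
# Hedged sketch of querying distributed table indices: fan the query out to
# every shard and merge the per-shard hits by score. The Shard interface is a
# placeholder, not Lucene's or the system's actual API.

import heapq
from typing import Protocol

class Shard(Protocol):
    def search(self, query: str, k: int) -> list[tuple[float, str]]:
        """Return up to k (score, table_id) hits for the query."""
        ...

def search_all_shards(shards: list[Shard], query: str, k: int = 100) -> list[tuple[float, str]]:
    hits: list[tuple[float, str]] = []
    for shard in shards:                 # in practice this fan-out can run in parallel
        hits.extend(shard.search(query, k))
    return heapq.nlargest(k, hits)       # merge per-shard hits by score
```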

Experimental Setup

  • Queries: TREC QA 2002, 2003 questions

  • Corpus: 100M web pages (a “random” subset of an MSN Search crawl, from 2005)

  • Evaluation: TREC QA factoid patterns

    • “Minimal” regular expressions to match only right answers

    • Not comprehensive (based on judgement pool)

Evaluation Metrics

  • MRR (mean reciprocal rank):

    • MRR@K: the reciprocal rank (1/rank) of the first correct answer within the top K results (0 if there is none), averaged over all questions

  • Recall @ K:

    • The fraction of questions for which the system returned a correct answer at rank K or better.
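
A small sketch computing both metrics from ranked answer lists; the per-question correctness test (e.g., a TREC answer-pattern matcher) is passed in as a callable.

```python
# Sketch of MRR@K and Recall@K as defined above. Each question supplies its
# ranked answers and a correctness test (e.g., a TREC answer-pattern matcher).

from typing import Callable

def first_correct_rank(ranked: list[str], is_correct: Callable[[str], bool], k: int):
    for rank, answer in enumerate(ranked[:k], start=1):
        if is_correct(answer):
            return rank
    return None

def mrr_at_k(results: list[tuple[list[str], Callable[[str], bool]]], k: int) -> float:
    # MRR@K: mean over questions of 1/rank of the first correct answer (0 if none in top K).
    total = 0.0
    for ranked, is_correct in results:
        rank = first_correct_rank(ranked, is_correct, k)
        if rank is not None:
            total += 1.0 / rank
    return total / len(results)

def recall_at_k(results: list[tuple[list[str], Callable[[str], bool]]], k: int) -> float:
    # Recall@K: fraction of questions with a correct answer at rank K or better.
    hits = sum(1 for ranked, ok in results if first_correct_rank(ranked, ok, k) is not None)
    return hits / len(results)
```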

Results (1): Accuracy vs. Corpus Size

Results (2): Comparing Ranking Methods

If the output is consumed by another system, a large K is acceptable

Results (3): Accuracy on Hard Questions

  • TQA can retrieve the answer within the top 100 results even when the best QA systems are not able to return any answer

Result Summary

  • Requires indexing more than 150M tables before respectable accuracy is achieved

  • Performance was around the median on the TREC 2002 and 2003 benchmarks

  • Can be helpful for questions that are difficult for traditional QA systems

Promising Directions for Future Work

  • Crawl-time: aggressive pruning/classification

  • Index-time: integration of related tables

  • Query-time: taxonomy integration / hypernymy

  • User behavior modeling

    • Use past clickthrough to rerank candidate tables and answers

    • Query reformulation

Conclusions

  • Implicitly structured web content can be useful for web question answering

  • We demonstrated scalability of a lightweight table-based web QA approach

  • Much room for improvement, future research

Thank you!

Questions?

E-mail: [email protected]

Plug: User Interactions for Web Question Answering: http://www.mathcs.emory.edu/~eugene/uqa/

  • E. Agichtein, E. Brill, S. Dumais, Mining user behavior to improve web search ranking, SIGIR 2006

  • E. Agichtein, User Behavior Mining and Information Extraction: Towards closing the gap, IEEE Data Engineering Bulletin, Dec. 2006

  • E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, Finding High Quality Content in Social Media with applications to Community-based Question Answering, to appear WSDM 2008
