Methods for Domain-Independent Information Extraction from the Web

Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison [Etzioni et al., 2004]. Outline: Introduction, Paper structure, KnowItAll System, Rule Learning (RL), Subclass Extraction (SE), List Extraction (LE), Experiments, Conclusion.

Presentation Transcript


  1. Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison [Etzioni et al., 2004]

  2. Outline • Introduction • Paper structure • KnowItAll System • Rule Learning (RL) • Subclass Extraction (SE) • List Extraction (LE) • Experiments • Conclusion

  4. Introduction • Information extraction from the Web (~ Web mining) • A useful prerequisite for this talk: information granularity, ranging from fine to coarse • Fine: 1 piece of information in 1 location (job posting) • Intermediate: 10 pieces of information in 100 locations (HP digital camera) • Coarse: 1,000 pieces of information in 100,000 locations (cities of the world)

  7. Paper structure • Presentation of an existing Web mining system • The authors' intuition of a « recall problem » • Proposal of three possible improvements • Definition of a metric for the « quantification of success » • Evaluation of the proposed improvements

  10. KnowItAll System • Autonomous, domain-independent system that extracts facts, concepts, and relationships from the Web. • Step 1, focus (e.g., city) • Step 2, pattern instantiation: « NP1 such as NP2 » = « city such as »; « Plural(NP1) such as NP2-List » = « cities such as » • Step 3, search + passage retrieval: « … a city such as Sudbury, at north of the Great Lakes … », « … cities such as Chicago, New York, Atlanta and Orlando … » • Step 4, assessor: PMI-IR, computed as Hits(Atlanta AND city) / Hits(Atlanta)
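
The pattern instantiation and PMI-IR steps of this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FAKE_HITS` table stands in for a real search engine, its counts are invented, and the function names are hypothetical rather than taken from KnowItAll.

```python
# Toy stand-in for a search engine: query string -> hit count.
# These numbers are invented for illustration only.
FAKE_HITS = {
    "Atlanta": 1000,
    "Atlanta AND city": 600,
    "Rice University": 500,
    "Rice University AND city": 120,
}


def hits(query: str) -> int:
    """Return the (hypothetical) number of pages matching a query."""
    return FAKE_HITS.get(query, 0)


def instantiate_patterns(focus: str, plural: str) -> list[str]:
    """Fill the two generic 'such as' patterns with a focus word."""
    return [f"{focus} such as", f"{plural} such as"]


def pmi_ir(instance: str, focus: str) -> float:
    """PMI-IR score: Hits(instance AND focus) / Hits(instance)."""
    denominator = hits(instance)
    if denominator == 0:
        return 0.0  # unseen instance: no evidence either way
    return hits(f"{instance} AND {focus}") / denominator


print(instantiate_patterns("city", "cities"))
print(pmi_ir("Atlanta", "city"))
print(pmi_ir("Rice University", "city"))
```

With a real search API in place of `FAKE_HITS`, a higher ratio would suggest that the candidate co-occurs with the focus word more often, which is the intuition behind the assessor.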

  13. Rule Learning (RL) • Goal: increase the recall of KnowItAll • Patterns: “city, such as Boston”, “mega-city such as Mexico”, “within a city, such as Rice University” • Facts (with likelihood): PMI(Boston, city) = 0.60, PMI(Mexico, city) = 0.56, PMI(Rice University, city) = 0.24

  14. Rule Learning (RL) • The most probable facts are used as seeds, and the contexts in which they occur are collected: « of Boston College », « the Boston Globe », « a Boston Parking Space », « headquartered in Boston », « Crime in Mexico continues », « Mexico City Hotels », « headquartered in Mexico » • New patterns: « headquartered in NP »
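
A minimal sketch of how a new pattern could be induced from the contexts on this slide. It is a simplification, not the system's actual learner: it generalizes whole context strings (rather than all substrings) by replacing the seed with `NP`, then keeps the patterns shared by at least two seeds. The function names are hypothetical.

```python
from collections import defaultdict

SEEDS = ["Boston", "Mexico"]

# Context strings taken from the slide.
CONTEXTS = [
    "of Boston College",
    "the Boston Globe",
    "a Boston Parking Space",
    "headquartered in Boston",
    "Crime in Mexico continues",
    "Mexico City Hotels",
    "headquartered in Mexico",
]


def candidate_patterns(contexts, seeds):
    """Map each generalized context (seed replaced by NP) to the seeds it matched."""
    seeds_by_pattern = defaultdict(set)
    for context in contexts:
        for seed in seeds:
            if seed in context:
                seeds_by_pattern[context.replace(seed, "NP")].add(seed)
    return seeds_by_pattern


def shared_patterns(contexts, seeds):
    """Keep only the patterns supported by at least two different seeds."""
    return sorted(
        pattern
        for pattern, matched in candidate_patterns(contexts, seeds).items()
        if len(matched) >= 2
    )


print(shared_patterns(CONTEXTS, SEEDS))
```

On the slide's data, only « headquartered in NP » is supported by both seeds, which matches the new pattern shown.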

  15. Rule Learning (RL) • Estimating rule quality • Heuristic 1: remove all substrings that appear for only a single seed. • Heuristic 2: rule precision = (c + k) / (c + n + m) • c is the number of times the rule matches a seed • n is the number of times the rule matches a known negative example • k / m is the prior estimate of the rule precision (PMI tests)
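
Heuristic 2 can be illustrated with a short ranking sketch. It assumes the smoothed form (c + k) / (c + n + m), in which k / m plays the role of the prior; the counts and the values of `K` and `M` below are invented for the example.

```python
def estimated_precision(c: int, n: int, k: float, m: float) -> float:
    """Smoothed precision (c + k) / (c + n + m); k / m acts as the prior."""
    return (c + k) / (c + n + m)


# Hypothetical counts per rule: (matches on seeds, matches on known negatives).
RULE_COUNTS = {
    "headquartered in NP": (8, 1),
    "NP City Hotels": (3, 4),
}

K, M = 1.0, 2.0  # assumed prior precision k / m = 0.5

# Rank rules from most to least precise according to the estimate.
ranked = sorted(
    RULE_COUNTS,
    key=lambda rule: estimated_precision(*RULE_COUNTS[rule], K, M),
    reverse=True,
)
print(ranked)
```

A rule with no evidence at all (c = n = 0) falls back to the prior k / m, which is why the smoothing term is useful for rarely matched rules.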

  18. Subclass Extraction (SE) • Goal: increase the recall of KnowItAll • Focus: scientist • Pattern: « scientist such as NP » • … scientist such as Arthur Noyes … • … scientist such as Isaac Newton … • … scientist such as Sandra Steingraber …

  19. Subclass Extraction (SE) • Using the facts already found, apply the reverse pattern « N such as Arthur Noyes »: « chemist such as Arthur Noyes », « biologist such as Sandra Steingraber » • Assess candidate subclasses with the PMI test and a morphology test (e.g., the suffix « ist »)
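
The reverse pattern and the morphology test can be sketched with a regular expression. The passages below are invented (the third one is a deliberate false candidate), the function names are hypothetical, and the PMI test is left out to keep the example short.

```python
import re

# Invented passages; a real system would retrieve them from a search engine.
PASSAGES = [
    "a chemist such as Arthur Noyes founded ...",
    "every biologist such as Sandra Steingraber knows ...",
    "a city such as Arthur Noyes ...",
]


def candidate_subclasses(instance: str, passages: list[str]) -> list[str]:
    """Return every noun N found in '<N> such as <instance>'."""
    pattern = re.compile(r"(\w+) such as " + re.escape(instance))
    found = []
    for passage in passages:
        found.extend(pattern.findall(passage))
    return found


def passes_morphology_test(noun: str, suffix: str = "ist") -> bool:
    """Crude morphology check: subclasses of 'scientist' tend to end in 'ist'."""
    return noun.lower().endswith(suffix)


def subclasses(instances: list[str], passages: list[str]) -> list[str]:
    """Collect candidate subclasses that survive the morphology test."""
    result = set()
    for instance in instances:
        for noun in candidate_subclasses(instance, passages):
            if passes_morphology_test(noun):
                result.add(noun)
    return sorted(result)


print(subclasses(["Arthur Noyes", "Sandra Steingraber"], PASSAGES))
```

The spurious candidate « city » is matched by the reverse pattern but rejected by the suffix check, which shows why an assessment step is needed after extraction.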

  22. List Extraction (LE) • Goal: increase the recall of KnowItAll • Find Web pages that contain a set (k = 4) of randomly chosen facts, e.g., « chicago AND boston AND mexico AND buenos aires » • Repeat 5,000-10,000 times • In each document, try to find « a list »
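
Query generation for this step might look like the following sketch. The fact list, the function name, and the fixed random seed are assumptions made for reproducibility; the slide only specifies k = 4 and the number of repetitions.

```python
import random


def build_queries(facts, k=4, repetitions=5, seed=0):
    """Build conjunctive queries, each from k distinct randomly chosen facts."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    queries = []
    for _ in range(repetitions):
        sample = rng.sample(facts, k)  # k distinct facts, no replacement
        queries.append(" AND ".join(sample))
    return queries


KNOWN_FACTS = ["chicago", "boston", "mexico", "buenos aires", "atlanta", "orlando"]

for query in build_queries(KNOWN_FACTS):
    print(query)
```

In the setting described on the slide, `repetitions` would be in the 5,000-10,000 range and each query would be sent to a search engine to retrieve candidate pages.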

  23. List Extraction (LE) • Use a Web page « wrapper », i.e., a classifier that identifies positive nodes (elements of the list) and negative nodes (all the remaining HTML markup)
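
As a rough illustration of the wrapper idea, the sketch below uses a very simple rule in place of a learned classifier: a text node is labeled positive when its enclosing tag path is the same as the tag path of a known seed. The class name, function name, and sample page are hypothetical; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect every text node together with the path of tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []  # list of (tag path, text)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def extract_list(html: str, seeds: list[str]) -> list[str]:
    """Return all text nodes that share a tag path with at least one seed."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    seed_set = {seed.lower() for seed in seeds}
    positive_paths = {
        path for path, text in collector.nodes if text.lower() in seed_set
    }
    return [text for path, text in collector.nodes if path in positive_paths]


PAGE = (
    "<html><body><h1>Hotels</h1>"
    "<ul><li>Chicago</li><li>Boston</li><li>Lima</li><li>Mexico</li></ul>"
    "<p>Contact us</p></body></html>"
)

print(extract_list(PAGE, ["chicago", "boston", "mexico"]))
```

Here « Lima » is a new candidate fact: it was not among the seeds but sits in the same list structure, while the heading and the footer paragraph are treated as negative nodes.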

  24. List Extraction (LE) • Quality of a new fact = the number of lists in which it appears • PMI can also be used to assess the quality (LE+A)
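
The list-count quality measure can be sketched as a simple tally. The extracted lists below are invented, and `list_support` is a hypothetical name; each fact is counted at most once per list so that repeats inside one page do not inflate its score.

```python
from collections import Counter


def list_support(extracted_lists):
    """Count, for each candidate fact, the number of lists it appears in."""
    counts = Counter()
    for items in extracted_lists:
        counts.update(set(items))  # a fact counts once per list
    return counts


# Invented output of the wrapper on three different pages.
EXTRACTED = [
    ["Chicago", "Boston", "Lima"],
    ["Boston", "Lima", "Boston"],
    ["Boston", "Home"],
]

support = list_support(EXTRACTED)
for fact, count in support.most_common():
    print(fact, count)
```

A noisy item such as « Home » appears in a single list and therefore ranks low, whereas facts confirmed by several independent lists rise to the top.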

  27. Experiments • How to calculate the recall improvement? • The true recall cannot be calculated (the full set of correct facts is unknown) • The size of the set of extracted facts can be used instead • But how to make sure the set is pure? • Sort facts by probability • Use only high-quality facts (e.g., prob > 0.9) • Manually assess a sample
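
The evaluation recipe above can be sketched as follows. The facts, their probabilities, and the manual labels are invented for the example; the 0.9 threshold is the one mentioned on the slide, and the function names are hypothetical.

```python
def high_quality_facts(fact_probs, threshold=0.9):
    """Sort facts by probability and keep those strictly above the threshold."""
    ordered = sorted(fact_probs.items(), key=lambda item: item[1], reverse=True)
    return [fact for fact, prob in ordered if prob > threshold]


def sample_precision(manual_labels):
    """Fraction of manually checked facts that were judged correct."""
    return sum(manual_labels) / len(manual_labels)


# Invented extraction output: fact -> estimated probability of being correct.
FACT_PROBS = {
    "Boston": 0.98,
    "Mexico": 0.95,
    "Rice University": 0.40,
    "Sudbury": 0.92,
}

kept = high_quality_facts(FACT_PROBS)
print(len(kept), kept)

# Invented manual judgments for a sample of the kept facts (True = correct).
print(sample_precision([True, True, True, False]))
```

The size of `kept` serves as the proxy for recall, and the sampled precision is the check that this set is pure enough for the comparison to be meaningful.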

  28. Experiments

  29. Experiments

  32. Conclusion • KnowItAll is an information extraction system (coarse IE) • The only input is a one-word « focus » (city, scientist, movie, …) • Pipeline: pattern instantiation, passage retrieval, PMI-IR test • RL, SE and LE improve extraction recall • Overall, LE gives the greatest improvement • SE was notably good on the « scientist » task

  33. Conclusion http://knowitall-1.cs.washington.edu/dbinterface/knowitall2/default.asp
