
Deriving a Web-Scale Commonsense Fact Database



Presentation Transcript


  1. Deriving a Web-Scale Commonsense Fact Database. Niket Tandon, Gerard de Melo, Gerhard Weikum. Max Planck Institute for Informatics. Aug 11, 2011

  2. Some trivial facts... Apples are green, red, juicy, and sweet, but not fast or funny. Parks and meadows are green or lively, but not black or slow. Keys are kept in a pocket, but not in the air. Question: How do computers know? Solution: Build a commonsense knowledge base.

  3. Introduction. What is the problem? • Harvest commonsense facts from text: • A flower is soft: hasProperty(flower, soft) • A room is part of a house: partOf(room, house) Why is it hard? • Commonsense facts are rarely mentioned explicitly in text • Natural language text is noisy What is required to tackle the problem? A Web-scale corpus, but a Web-scale corpus is hard to get! • Use Web-scale N-grams => poses interesting research challenges

  4. Message of the talk • N-grams simulate a larger corpus • Existing information extraction models must be carefully adapted for harvesting facts.

  5. Agenda 1 Introduction 2 Pattern-based information extraction model 3 Web N-grams 4 Pattern ranking 5 Extraction and ranking of facts

  6. Agenda 1 Introduction 2 Pattern-based information extraction model 3 Web N-grams 4 Pattern ranking 5 Extraction and ranking of facts. Running example: pattern "X is very Y" with example tuples (ice, cold), (flower, beautiful), (fire, hot)
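
To make the running example concrete, here is a minimal sketch (not from the slides) of how a surface pattern such as "X is very Y" can be matched against short token sequences to yield candidate tuples; the regular-expression template and helper name are illustrative only.

```python
import re

# Surface pattern with two slots, as in the running example "X is very Y".
PATTERN = re.compile(r"^(\w+) is very (\w+)$")

ngrams = [
    "ice is very cold",
    "flower is very beautiful",
    "fire is very hot",
    "run is very quickly",   # noise that later ranking steps must filter out
]

def match_pattern(ngram):
    """Return a candidate (X, Y) pair if the n-gram matches the pattern."""
    m = PATTERN.match(ngram)
    return (m.group(1), m.group(2)) if m else None

candidates = [t for t in map(match_pattern, ngrams) if t]
print(candidates)
# [('ice', 'cold'), ('flower', 'beautiful'), ('fire', 'hot'), ('run', 'quickly')]
```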

  7. Good seeds => (Good) Patterns

  8. Good seeds => (Good) Patterns

  9. Good seeds => (Good) Patterns

  10. Good seeds => (Good) Patterns

  11. Good patterns => (Good) tuples

  12. Model

  13. State of the art - pattern-based IE • DIPRE - Brin ‘98 • Snowball - Agichtein et al. ‘00 • KnowItAll - Etzioni et al. ’04 • Observations: • Low recall on easily available corpora (a large corpus is difficult to get) • Low precision when applied to our corpus

  14. Agenda 1 Introduction 2 Pattern-based information extraction model 3 Web N-grams 4 Pattern ranking 5 Extraction and ranking of facts. The corpus we use to extract facts is Web-scale N-grams

  15. Web-scale N-grams • N-gram: sequence of N consecutive word tokens • e.g. the apples are very red • Web-scale N-gram statistics derived from a trillion words • e.g. the apples are very red 12000 • Google N-grams, Microsoft N-grams, Yahoo N-grams • N-gram dataset limitations • Length <= 5, so longer contexts cannot be captured, e.g. the apple that he was eating was very red • But... • Most commonsense relations fit this small context • Sheer volume of data
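
A minimal sketch of what consuming such an N-gram dataset might look like, assuming a simple tab-separated layout of token sequence and frequency; the in-memory lines and field layout are placeholders, not the actual distribution format of any of the datasets named above.

```python
from collections import defaultdict

# Hypothetical tab-separated records: "<n-gram>\t<frequency>"
raw_lines = [
    "the apples are very red\t12000",
    "apples are very sweet\t8500",
    "keys are kept in\t4300",
]

ngram_counts = defaultdict(int)
for line in raw_lines:
    tokens, freq = line.rsplit("\t", 1)
    ngram_counts[tokens] += int(freq)

# The frequency is a stand-in for how often the phrase occurs on the Web.
print(ngram_counts["the apples are very red"])  # 12000
```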

  16. Example of Commonsense Relations

  17. Our approach • Use ConceptNet data as seeds to harvest commonsense facts from the Google N-grams corpus • ConceptNet: MIT’s commonsense knowledge base constructed by crowd-sourcing and further processed • We take a very large number of seeds • Avoids drift over iterations • We consider variations of seeds for nouns (plural forms) • [key, pocket], [keys, pocket], [keys, pockets] • This gives a very large number of potential patterns, but most are noise • Constrain patterns by part-of-speech (POS) tags • X<noun> is very Y<adjective> • Need to carefully rank potential patterns
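
A rough sketch of the seed-variation and POS-constraint ideas above; the pluralization rule and the POS check are deliberately simplistic placeholders for whatever the real system uses.

```python
from itertools import product

def noun_variants(noun):
    """Naive pluralization; a real system would use proper morphology."""
    return [noun, noun + "s"]

def seed_variations(x, y):
    """All singular/plural combinations for a noun-noun seed pair."""
    return list(product(noun_variants(x), noun_variants(y)))

print(seed_variations("key", "pocket"))
# [('key', 'pocket'), ('key', 'pockets'), ('keys', 'pocket'), ('keys', 'pockets')]

def keep_pattern(slot_x_pos, slot_y_pos, expected=("NN", "JJ")):
    """Keep a pattern candidate only if its slots carry the expected POS tags,
    e.g. X<noun> is very Y<adjective> for hasProperty."""
    return (slot_x_pos, slot_y_pos) == expected

print(keep_pattern("NN", "JJ"))  # True
print(keep_pattern("NN", "RB"))  # False, e.g. a pattern whose Y slot is an adverb
```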

  18. Agenda 1 Introduction 2 Pattern-based information extraction model 3 Web N-grams 4 Pattern ranking 5 Extraction and ranking of facts. One dirty fish spoils the whole pond!

  19. Existing pattern ranking approach: PMI. PMI score for a pattern (p) with matching seeds (x, y)
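
The PMI formula itself appears only as an image on the slide. For reference, one common formulation for scoring a pattern p against a seed pair (x, y), in the spirit of Espresso-style pattern ranking and not necessarily the authors' exact definition, is sketched below.

```python
import math

def pmi_pattern(count_x_p_y, count_x_y, count_p):
    """
    PMI-style score of pattern p for seed pair (x, y):
        pmi = log( |x, p, y| / (|x, *, y| * |*, p, *|) )
    where |x, p, y| counts how often p connects x and y,
    |x, *, y| how often x and y co-occur with any pattern,
    and |*, p, *| how often p occurs with any pair.
    """
    return math.log(count_x_p_y / (count_x_y * count_p))

# Example with made-up counts.
print(pmi_pattern(count_x_p_y=120, count_x_y=400, count_p=10_000))
```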

  20. Pattern ranking – Observation 1 • Pattern match statistics follow a power-law curve s(x) ~ a·x^k • The score based on Observation 1 employs the gradient of this curve rather than a fixed threshold
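
The slide gives only the power-law form and the hint that the score uses the curve's gradient rather than a hard threshold. The following is a speculative sketch of that idea, fitting the exponent by least squares in log-log space; it is my own reading, not the authors' formula.

```python
import math

def fit_power_law(xs, ys):
    """
    Fit s(x) = a * x**k by least squares in log-log space:
    log s = log a + k * log x.  Returns (a, k); k is the log-log gradient.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    k = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - k * mx)
    return a, k

# Toy data: pattern rank vs. number of matched seed pairs.
ranks = [1, 2, 3, 4, 5, 6]
matches = [900, 420, 260, 190, 150, 120]
a, k = fit_power_law(ranks, matches)
print(round(a), round(k, 2))  # a more negative k means a steeper drop-off
```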

  21. Pattern ranking – Observation 2 • Score of a pattern:

  22. Pattern ranking – Our approach • Combine the scores based on Observations 1 and 2 using a logistic function • Combined pattern score:
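
A minimal sketch of combining the two observation-based scores with a logistic function; the weights and bias are arbitrary placeholders, since the combined score itself is shown only as an image on the slide.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def combined_pattern_score(score_obs1, score_obs2, w1=1.0, w2=1.0, b=0.0):
    """Squash a weighted combination of the two scores into (0, 1)."""
    return logistic(w1 * score_obs1 + w2 * score_obs2 + b)

print(round(combined_pattern_score(0.8, 1.5), 3))
```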

  23. Improvement over PMI in pattern ranking (isA relation): "San Francisco and other cities"

  24. Agenda 1 Introduction 2 Pattern-based information extraction model 3 Web N-grams 4 Pattern ranking 5 Extraction and ranking of facts

  25. Estimate fact confidence: simple approach. Good tuples match many patterns. For each candidate tuple, the slide shows the IDs and frequencies of the patterns it matches: good tuples match several patterns, while poor tuples match only a few. This simple count-based filter gives low recall but high precision.
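
A sketch of this simple approach: score each candidate tuple by the number of distinct patterns it matches and keep only tuples above a small cutoff; the data and the threshold are illustrative.

```python
# For each candidate tuple, the set of pattern IDs it matched.
tuple_patterns = {
    ("ice", "cold"):    {1, 2, 5, 7, 9},  # matches several patterns
    ("flower", "soft"): {1, 4, 6},
    ("apple", "fast"):  {3},              # matches few patterns, likely noise
}

MIN_PATTERNS = 2  # illustrative cutoff

accepted = {t: p for t, p in tuple_patterns.items() if len(p) >= MIN_PATTERNS}
print(list(accepted))
# High precision, but tuples seen with only one pattern are lost (low recall).
```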

  26. Our fact ranking approach • These pattern count feature vectors are used to learn a decision tree • This gives facts with an estimated confidence
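
A minimal sketch of this step, assuming per-tuple feature vectors of pattern match counts and binary labels; the toy data and the use of scikit-learn are my own stand-ins, not details from the paper.

```python
from sklearn.tree import DecisionTreeClassifier

# Rows: candidate tuples; columns: match counts for patterns 1..4 (toy data).
X = [
    [12, 7, 3, 0],   # (ice, cold)     -> correct
    [ 9, 4, 2, 1],   # (flower, soft)  -> correct
    [ 0, 1, 0, 0],   # (apple, fast)   -> incorrect
    [ 1, 0, 0, 0],   # (park, slow)    -> incorrect
]
y = [1, 1, 0, 0]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The class probability serves as an estimated confidence for a new fact.
print(clf.predict_proba([[8, 3, 1, 0]])[0][1])
```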

  27. Recap: Model

  28. Experimental setup • Test data (true and false labels): randomly chosen high-confidence facts from ConceptNet • Precision and recall computed using 10-fold cross-validation over the test data • Classifier used: decision trees with adaptive boosting
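
A sketch of such an evaluation with decision trees plus adaptive boosting and 10-fold cross-validation; the scikit-learn calls and the synthetic data are stand-ins for the actual setup.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_validate

rng = np.random.RandomState(0)
X = rng.poisson(3, size=(200, 4))        # toy pattern-count features
y = (X.sum(axis=1) > 12).astype(int)     # toy true/false labels

# AdaBoost over shallow decision trees (the default base learner).
clf = AdaBoostClassifier(n_estimators=50, random_state=0)

scores = cross_validate(clf, X, y, cv=10, scoring=("precision", "recall"))
print(scores["test_precision"].mean(), scores["test_recall"].mean())
```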

  29. Results – more than 200 million facts extracted Extension of ConceptNet by orders of magnitude

  30. Further directions • Tune the system towards higher precision to release a high-quality knowledge base • Applications enabled by a commonsense knowledge base

  31. Take-home message • N-grams simulate a larger corpus • N-grams embed patterns and frequencies • Novel pattern ranking adapted to the N-gram corpus • PMI is not the best choice in our case • The extracted fact matrix extends ConceptNet by more than 200x!

  32. Thank you! ntandon@mpi-inf.mpg.de hasProperty(flower, *)

  33. Additional slides follow

  34. Inaccuracies in ConceptNet • Properties are wrongly compounded • HasProperty(apple, green yellow red)[usually] 1 • Score of zero for correct tuples • HasProperty(wallet, black)[] 0 • Negated tuples are in fact commonsense facts • HasProperty(jeans, blue)[not] 1 • Confusing polarity for machine consumption • HasProperty(jeans, blue)[not] 1 • HasProperty(jeans, blue)[often] 1 • HasProperty(jeans, blue)[usually] 1 • Wrongly labeled as hasProperty • HasProperty(literature, book)[] 1 • Some are just facts but not commonsense • HasProperty(high point gibraltar, rock gibraltar 426 m)[] 1

  35. Related Work

  36. Synthetic training data generation • Seed overlap matrix • JaccardSim(a, b) • If Sim ~ 0, the relations are unrelated • Combine seeds from unrelated relations to generate incorrect (negative) tuples
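
A sketch of this negative-example generation: measure seed overlap between two relations with Jaccard similarity, and if they are (nearly) disjoint, cross-combine their seed arguments to form tuples that are almost certainly wrong; the relation names, seeds, and the 0.05 cutoff are illustrative.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

seeds = {
    "hasProperty": {("ice", "cold"), ("flower", "soft"), ("apple", "sweet")},
    "partOf":      {("room", "house"), ("wheel", "car"), ("leaf", "tree")},
}

sim = jaccard(seeds["hasProperty"], seeds["partOf"])
print(sim)  # ~0, so the relations are unrelated

if sim < 0.05:
    # Pair X arguments of one relation with Y arguments of the other.
    negatives = [(x, b) for (x, _) in seeds["hasProperty"]
                        for (_, b) in seeds["partOf"]]
    print(negatives[:3])  # e.g. a negative example like hasProperty(ice, house)
```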

  37. All results
