
Subbalakshmi Iyer

YAGO: A Large Ontology from Wikipedia and WordNet. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Subbalakshmi Iyer. Motivation for an ontology: natural language communication, automated text translation, finding information on the internet, a computer-processable collection of knowledge.


Presentation Transcript

  1. YAGO: A Large Ontology from Wikipedia and WordNet • Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum • Subbalakshmi Iyer

  2. Motivation for an Ontology • Natural language communication • Automated text translation • Finding information on the internet • Computer-processable collection of knowledge

  3. What is an Ontology? • An ontology is the description of a domain, its classes and properties, and the relationships between those classes, by means of a formal language. • It is a collection of knowledge about the world, i.e. a knowledge base • Example ontologies: • large taxonomies categorizing Web sites (such as on Yahoo!) • categorizations of products for sale and their features (such as on Amazon.com)

  4. Uses of Ontologies • Machine Translation • Word Sense Disambiguation • Document Classification • Question Answering • Entity and fact-oriented Web Search

  5. What is YAGO? • Yet Another Great Ontology • Part of the YAGO-NAGA project • Goal: build a knowledge base that is • large-scale • domain-independent • automatically constructed • highly accurate • Uses Wikipedia and WordNet

  6. More about YAGO • 2 million entities • 20 million facts • Facts represented as RDF triples • Accuracy of 95% • Examples: • Elvis Presley isA singer • singer subClassOf person • Elvis Presley bornOnDate 1935-01-08 • Elvis Presley bornIn Tupelo • Tupelo locatedIn Mississippi(state) • Mississippi(state) locatedIn USA
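
A minimal sketch of the example facts on this slide, stored as plain (subject, relation, object) triples in Python. The helper names `objects` and `reachable` are assumptions made for illustration; they are not part of YAGO.

```python
# The example facts from the slide as (subject, relation, object) triples.
FACTS = {
    ("Elvis Presley", "isA", "singer"),
    ("singer", "subClassOf", "person"),
    ("Elvis Presley", "bornOnDate", "1935-01-08"),
    ("Elvis Presley", "bornIn", "Tupelo"),
    ("Tupelo", "locatedIn", "Mississippi(state)"),
    ("Mississippi(state)", "locatedIn", "USA"),
}


def objects(subject, relation, facts=FACTS):
    """Return every object o such that (subject, relation, o) is a stored fact."""
    return {o for s, r, o in facts if s == subject and r == relation}


def reachable(start, relation, facts=FACTS):
    """Follow `relation` transitively from `start` (hypothetical helper)."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for nxt in objects(node, relation, facts):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


print(objects("Elvis Presley", "bornIn"))
print(reachable("Tupelo", "locatedIn"))
```

Following `locatedIn` transitively is what lets a query place Tupelo in the USA even though no single fact says so.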

  7. The YAGO model • Slight extension of RDFS • Represents knowledge as • Entities • Classes • Relations • Facts • Properties of relations like transitivity • Simple and decidable model

  8. Knowledge Representation in YAGO • All objects are entities • e.g. Elvis Presley, Grammy Award • 2 entities can stand in a relationship • e.g. hasWonAward • Elvis Presley hasWonAward Grammy Award • The triple of entity, relationship, entity is a fact • e.g. Elvis Presley hasWonAward Grammy Award is a fact

  9. Knowledge Representation in YAGO -2 • Numbers, dates and strings are also entities. • Elvis Presley BornInYear 1935 • Words are entities • “Elvis” means Elvis Presley • Entity is instance of class • Elvis Presley Type Singer • Classes are also entities • Singer Type class

  10. Knowledge Representation in YAGO - 3 • Classes have hierarchies • Singer SubClassOf Person • Relations are also entities • subClassOf Type atr (acyclic transitive relation) • Each fact has a fact identifier • #1 FoundIn Wikipedia
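
The three slides above can be sketched as a toy fact store in which every fact receives an identifier that can itself appear as the subject of another fact. The class `FactStore` and its `add` method are hypothetical names for this illustration, not YAGO's actual storage format.

```python
from itertools import count


class FactStore:
    """Toy store: each fact gets an identifier that is itself usable as an entity."""

    def __init__(self):
        self._ids = count(1)
        self.facts = {}  # fact identifier -> (subject, relation, object)

    def add(self, subject, relation, obj):
        fact_id = f"#{next(self._ids)}"
        self.facts[fact_id] = (subject, relation, obj)
        return fact_id


store = FactStore()
f1 = store.add("Elvis Presley", "hasWonAward", "Grammy Award")
f2 = store.add(f1, "FoundIn", "Wikipedia")    # a fact about fact #1
store.add("Elvis Presley", "Type", "Singer")  # entity is an instance of a class
store.add("Singer", "SubClassOf", "Person")   # classes form a hierarchy

for fact_id, triple in store.facts.items():
    print(fact_id, *triple)
```

Because identifiers are ordinary strings here, statements such as provenance need no extra machinery: they are just more triples.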

  11. Key Contributions of YAGO • Information Extraction from Wikipedia • Infoboxes • Category Pages • Combination with WordNet • Taxonomy • Quality Control • Canonicalization • Type Checking

  12. Information Extraction -1 • Entities from Wikipedia • Each page title is candidate entity • Wiki Markup Language • Wikipedia dump as of September, 2008

  13. Information Extraction - WML

  14. Information Extraction Techniques • Infobox Harvesting • Wikipedia Infoboxes • Word-Level Techniques • Wikipedia Redirects • Category Harvesting • Wikipedia Categories • Type Extraction • Wikipedia Categories, WordNet Classes

  15. 1. Information Extraction from Wikipedia – Infobox Harvesting • Fig.: Wikipedia Infobox

  16. Infobox: Born: January 8, 1935 • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • … • Born → bornOnDate • … • Relation Map (columns: Relation, Domain, Range) • … • bornOnDate: domain person, range yagoDate • … • Resulting fact: Elvis Presley bornOnDate January 8, 1935

  17. Infobox: Died: August 16, 1977 • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • … • Died → diedOnDate • … • Relation Map (columns: Relation, Domain, Range) • … • diedOnDate: domain person, range yagoDate • … • Resulting fact: Elvis Presley diedOnDate August 16, 1977

  18. Infobox: Genre: Rock and Roll • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • … • Genre → isOfGenre • … • Relation Map (columns: Relation, Domain, Range) • … • isOfGenre: domain entity, range yagoClass • … • Resulting fact: Elvis Presley isOfGenre Rock and Roll

  19. Infobox: Birth Name: Elvis Aaron Presley • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • … • birth name → means • … • Relation Map (columns: Relation, Domain, Range) • … • means: domain yagoWord, range entity • … • Resulting fact: Elvis Aaron Presley means Elvis Presley
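
The four infobox examples above can be sketched as two lookup tables and one loop. The table excerpts contain only the rows shown on the slides. This sketch assumes that the Inverse column is what swaps the argument order for `means`; the slides do not say so explicitly. The function name `harvest_infobox` is hypothetical.

```python
# Attribute -> (relation, inverse flag). Excerpt; the inverse flags are an assumption.
ATTRIBUTE_MAP = {
    "born": ("bornOnDate", False),
    "died": ("diedOnDate", False),
    "genre": ("isOfGenre", False),
    "birth name": ("means", True),
}

# Relation -> (domain, range), as in the Relation Map on the slides.
RELATION_MAP = {
    "bornOnDate": ("person", "yagoDate"),
    "diedOnDate": ("person", "yagoDate"),
    "isOfGenre": ("entity", "yagoClass"),
    "means": ("yagoWord", "entity"),
}


def harvest_infobox(article_entity, infobox):
    """Turn infobox attribute/value pairs into (subject, relation, object) facts."""
    facts = []
    for attribute, value in infobox.items():
        entry = ATTRIBUTE_MAP.get(attribute.strip().lower())
        if entry is None:
            continue  # attribute is not covered by the map
        relation, inverse = entry
        if relation not in RELATION_MAP:
            continue  # no domain/range known, so the fact could not be type checked
        if inverse:
            facts.append((value, relation, article_entity))
        else:
            facts.append((article_entity, relation, value))
    return facts


elvis_infobox = {
    "Born": "January 8, 1935",
    "Died": "August 16, 1977",
    "Genre": "Rock and Roll",
    "Birth Name": "Elvis Aaron Presley",
    "Unmapped": "ignored",
}
for fact in harvest_infobox("Elvis Presley", elvis_infobox):
    print(fact)
```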

  20. Manifold Attributes • Some attributes may have multiple values • e.g. a person may have multiple children • Multiple facts are generated • e.g. one hasChild fact for each child

  21. Indirect Attributes - 1 • Some attributes do not concern the article entity, but another fact • e.g. in the infobox of the Republic of Singapore, the attribute 'gdp year' does not describe Singapore itself but the GDP fact, which holds for the year 2008 • Therefore, two facts are generated: • #14: Singapore hasGDP 238.755 billion • #14 during 2008 • Read together: Singapore hasGDP 238.755 billion during 2008 • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • … • gdp ppp → hasGDP • gdp year → during
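
A minimal sketch of manifold and indirect attributes together. It assumes that an indirect attribute names the attribute whose fact it describes; the `children` attribute name, the `DIRECT` and `INDIRECT` tables, and the function `harvest_attributes` are hypothetical. Fact identifiers start at #1 here rather than at the #14 shown on the slide.

```python
from itertools import count

# Direct attribute -> relation (excerpt; "children" is an assumed attribute name).
DIRECT = {"gdp ppp": "hasGDP", "children": "hasChild"}
# Indirect attribute -> (attribute whose fact it describes, relation).
INDIRECT = {"gdp year": ("gdp ppp", "during")}


def harvest_attributes(entity, infobox):
    ids = count(1)
    facts = {}     # fact identifier -> (subject, relation, object)
    fact_for = {}  # direct attribute -> identifier of the last fact it produced
    for attribute, relation in DIRECT.items():
        values = infobox.get(attribute, [])
        if not isinstance(values, list):
            values = [values]
        for value in values:  # manifold attribute: one fact per value
            fact_id = f"#{next(ids)}"
            facts[fact_id] = (entity, relation, value)
            fact_for[attribute] = fact_id
    for attribute, (target, relation) in INDIRECT.items():
        if attribute in infobox and target in fact_for:
            # The subject is the identifier of another fact, not the article entity.
            facts[f"#{next(ids)}"] = (fact_for[target], relation, infobox[attribute])
    return facts


singapore = harvest_attributes(
    "Singapore", {"gdp ppp": "238.755 billion", "gdp year": "2008"}
)
family = harvest_attributes("Some Parent", {"children": ["Child A", "Child B"]})
print(singapore)
print(family)
```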

  22. Indirect Attributes - 2 Singapore Infobox

  23. Type of Infobox • Song Infobox: American Pie • Released: October, 1971 • Format: vinyl record • Genre: Folk Rock • Length: 8:33 mins • Label: United Artists • Writer: Don McLean • Car Infobox: Tesla Roadster • Manufacturer: Tesla Motors • Production: 2008-present • Class: Roadster • Length: 3,946 mm • Width: 1,873 mm • Height: 1,127 mm

  24. Type of Infobox: Attribute Map • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • … • car #length → hasLength • … • song #length → hasDuration • … • Car Infobox: Tesla Roadster hasLength 3946 • Song Infobox: American Pie hasDuration 8:33
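
One way to read the typed attribute map is as a lookup keyed by both infobox type and attribute name. This is a sketch under that assumption; `TYPED_ATTRIBUTE_MAP` and `typed_fact` are hypothetical names.

```python
# (infobox type, attribute) -> relation; only the two rows shown on the slide.
TYPED_ATTRIBUTE_MAP = {
    ("car", "length"): "hasLength",
    ("song", "length"): "hasDuration",
}


def typed_fact(entity, infobox_type, attribute, value):
    """Pick the relation for an attribute depending on the infobox type."""
    key = (infobox_type.lower(), attribute.lower())
    relation = TYPED_ATTRIBUTE_MAP.get(key)
    return None if relation is None else (entity, relation, value)


print(typed_fact("Tesla Roadster", "car", "Length", "3946"))
print(typed_fact("American Pie", "song", "Length", "8:33"))
```

The same attribute name, Length, therefore yields a physical length for a car and a duration for a song.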

  25. Information Extraction - Word Level Techniques • Wikipedia Redirects • virtual redirect page for “Presley, Elvis” links to “Elvis Presley” • Each redirect gives a ‘means’ fact • e.g. “Presley, Elvis” means Elvis Presley • Parsing Person Names • extract the name components • establish relations givenNameOf and familyNameOf • e.g. Presley familyNameOf Elvis Presley • Elvis givenNameOf Elvis Presley
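
A rough sketch of both word-level techniques. The name parser simply takes the first and last whitespace-separated token, which is an assumption that holds for "Elvis Presley" but not for names in general. Both function names are hypothetical.

```python
def means_fact(redirect_title, target_title):
    """Each redirect page yields one 'means' fact: word -> entity."""
    return (redirect_title, "means", target_title)


def name_facts(person):
    """Naive name parsing: first token is the given name, last token the family name."""
    parts = person.split()
    if len(parts) < 2:
        return []  # a single token cannot be split into name components
    return [
        (parts[-1], "familyNameOf", person),
        (parts[0], "givenNameOf", person),
    ]


print(means_fact("Presley, Elvis", "Elvis Presley"))
print(name_facts("Elvis Presley"))
```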

  26. Wikipedia Categories • Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents • Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands • Categories: Canadian Singers | Canadian male singers | 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers

  27. Facts created from Wikipedia Categories • Rhine locatedIn Germany • Bryan Adams bornOnDate 1959 • Bryan Adams hasWonAward Grammy Award • Abraham Lincoln politicianOf United States

  28. Information Extraction - Category Harvesting • Relational Categories • Table: Some Category Heuristics (Regular Expression → Relation) • ([0-9]{3,4}) births → bornOnDate • ([0-9]{3,4}) deaths → diedOnDate • ([0-9]{3,4}) establishments → establishedOnDate • ([0-9]{3,4}) books|novels → writtenOnDate • Mountains|Rivers in (.*) → locatedIn • Presidents|Governors of (.*) → politicianOf • (.*) winners → hasWonPrize • [A-Za-z]+ (.*) winners → hasWonPrize
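
The category heuristics can be sketched with Python's `re` module. Three things here are assumptions rather than something the slide states: the alternations are wrapped in non-capturing groups, matching is case-insensitive, and the first matching pattern wins. The last table row is left out because the more general `(.*) winners` pattern already covers it in a first-match scheme.

```python
import re

# (pattern, relation) pairs; grouping and case-insensitivity are assumptions.
CATEGORY_HEURISTICS = [
    (re.compile(r"([0-9]{3,4}) births", re.IGNORECASE), "bornOnDate"),
    (re.compile(r"([0-9]{3,4}) deaths", re.IGNORECASE), "diedOnDate"),
    (re.compile(r"([0-9]{3,4}) establishments", re.IGNORECASE), "establishedOnDate"),
    (re.compile(r"([0-9]{3,4}) (?:books|novels)", re.IGNORECASE), "writtenOnDate"),
    (re.compile(r"(?:Mountains|Rivers) in (.*)", re.IGNORECASE), "locatedIn"),
    (re.compile(r"(?:Presidents|Governors) of (.*)", re.IGNORECASE), "politicianOf"),
    (re.compile(r"(.*) winners", re.IGNORECASE), "hasWonPrize"),
]


def category_fact(entity, category):
    """Return the fact implied by a relational category, or None if none matches."""
    for pattern, relation in CATEGORY_HEURISTICS:
        match = pattern.fullmatch(category)
        if match:
            return (entity, relation, match.group(1))
    return None


for cat in ["1959 births", "Grammy Award Winners", "Living people"]:
    print(cat, "->", category_fact("Bryan Adams", cat))
```

Categories that match no pattern, such as "Living people", simply produce no fact from this heuristic.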

  29. 2. Connecting Wikipedia and WordNet – What is WordNet • Lexical database for the English language • Created at the Cognitive Science Laboratory of Princeton University • Groups English words into sets of synonyms called synsets • Provides short, general definitions • Provides hypernym/hyponym relations • e.g. canine is hypernym, dog is hyponym

  30. Connecting Wikipedia and WordNet – Type Extraction • Goal: create class hierarchy • e.g. singer subClassOf performer performer subClassOf artist • hyponymy relation from WordNet • Wikipedia class ‘American people in Japan’ is subclass of WordNet class ‘person’

  31. Classifications of Categories • Conceptual Categories • e.g. Albert Einstein is in ‘Naturalized citizens of the United States’ • Administrative Categories • e.g. Albert Einstein is in ‘Articles with unsourced statements’ • Relational Information • 1879 births • Thematic Vicinity • Physics

  32. Identification of Conceptual Categories • Only conceptual categories are used • Shallow linguistic parsing of category names • e.g. category ‘American people in Japan’ • Break category into • pre-modifier - ‘American’ • head - ‘people’ • post-modifier - ‘in Japan’ • If head is plural, then category is conceptual category • Extract class from Wikipedia category • Connect to class from WordNet • e.g. the Wikipedia class ‘American people in Japan’ has to be made a subclass of the WordNet class ‘person’

  33. Algorithm • Function wiki2wordnet(c) • Input: Wikipedia category name c • Output: WordNet synset • head = headCompound(c) • pre = preModifier(c) • post = postModifier(c) • head = stem(head) • If there is a WordNet synset s for pre + head: return s • If there are WordNet synsets s1, …, sn for head (ordered by their frequency for head): return s1 • fail

  34. Explanation of Algorithm • Input: American people in Japan • pre-modifier : American • Head : people • Post-modifier : in Japan • Stem(head) : person • If there is a WordNet synset for ‘American person’ • return that synset • If there are s1, …, sn synsets for ‘person’ • (Ordered by frequency for ‘person’) • Return s1 • Fail • Output: person • Result: American People in Japan subClassOf person
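
The walk-through above can be turned into a small runnable sketch. Everything that stands in for real linguistic tooling is an assumption: `TOY_WORDNET` is a hand-made dictionary with hypothetical synset identifiers rather than the real WordNet, `split_category` cuts at the first preposition instead of doing proper shallow parsing, and `stem` only knows one irregular plural.

```python
PREPOSITIONS = {"in", "of", "from", "by", "at", "on"}
IRREGULAR_PLURALS = {"people": "person"}

# Toy stand-in for WordNet: lemma -> synsets ordered by frequency (hypothetical ids).
TOY_WORDNET = {
    "person": ["person.n.01", "person.n.02"],
    "singer": ["singer.n.01"],
}


def split_category(c):
    """Return (pre-modifier, head, post-modifier) using a crude preposition cut."""
    words = c.split()
    cut = next(
        (i for i, w in enumerate(words) if w.lower() in PREPOSITIONS), len(words)
    )
    before, post = words[:cut], words[cut:]
    if not before:
        return "", "", " ".join(post)
    return " ".join(before[:-1]), before[-1], " ".join(post)


def stem(word):
    """Very rough singularization, enough for the slide's example."""
    w = word.lower()
    if w in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[w]
    return w[:-1] if w.endswith("s") and not w.endswith("ss") else w


def wiki2wordnet(c, wordnet=TOY_WORDNET):
    pre, head, _post = split_category(c)
    head = stem(head)
    compound = f"{pre} {head}".strip().lower()
    if pre and wordnet.get(compound):
        return wordnet[compound][0]  # synset for pre + head
    if wordnet.get(head):
        return wordnet[head][0]      # most frequent synset for head
    return None                      # the slide's "fail"


print(wiki2wordnet("American people in Japan"))
```

With the toy dictionary there is no entry for "american person", so the function falls through to the most frequent synset for "person", mirroring the slide's result that American people in Japan becomes a subclass of person.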

  35. Fig.: WordNet search for ‘person’ • Fig.: WordNet search for ‘American Person’

  36. Exceptions • Complete hierarchy of classes • Upper classes from WordNet • Leaves from Wikipedia • 2 dozen cases failed • Categories with head compound “capital” • In Wikipedia, it means “capital city” • In WordNet, it means “financial asset” • These cases were corrected manually

  37. 3. Quality Control • Canonicalization • Each fact and each entity reference unique • an entity is always referred to by the same identifier in all facts in YAGO • Type Checking • eliminates individuals that do not have class • eliminates facts that do not respect domain and range constraints • an argument of a fact in YAGO is always an instance of the class required by the relation

  38. Canonicalization - 1 • Redirect Resolution • infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments • These links may not be correct Wikipedia page identifiers • Check if each argument is correct Wikipedia identifier • Replace by correct, redirected identifier • E.g. Hermitage Museum locatedIn St. Petersburg • Hermitage Museum locatedIn Saint Petersburg

  39. Canonicalization - 2 • Removal of Duplicate facts • Sometimes, 2 heuristics deliver the same fact. • canonicalization eliminates one of them • e.g., category ‘1935 births’ yields the fact: • Elvis Presley bornOnDate 1935 • Infobox attribute ‘Born: January 8, 1935’ yields the fact: • Elvis Presley bornOnDate January 8, 1935
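
A minimal sketch of both canonicalization steps, assuming redirects are available as a simple title-to-title dictionary (the `REDIRECTS` table and function names are hypothetical). Note that this sketch only removes exact duplicates; it does not decide between two facts of different precision, such as the two bornOnDate values in the example above.

```python
# Redirect title -> canonical page title (single entry from the slide's example).
REDIRECTS = {"St. Petersburg": "Saint Petersburg"}


def resolve(entity, redirects=REDIRECTS):
    """Follow redirects until a canonical identifier is reached."""
    seen = set()
    while entity in redirects and entity not in seen:  # guard against redirect loops
        seen.add(entity)
        entity = redirects[entity]
    return entity


def canonicalize(facts, redirects=REDIRECTS):
    """Resolve redirects in both arguments, then drop exact duplicate facts."""
    unique, seen = [], set()
    for s, r, o in facts:
        fact = (resolve(s, redirects), r, resolve(o, redirects))
        if fact not in seen:
            seen.add(fact)
            unique.append(fact)
    return unique


raw = [
    ("Hermitage Museum", "locatedIn", "St. Petersburg"),
    ("Hermitage Museum", "locatedIn", "Saint Petersburg"),
]
print(canonicalize(raw))
```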

  40. Type Checking - 1 • Reductive Type Checking • Sometimes class of entity cannot be determined • Such facts are discarded • e.g. Wikipedia entities that have been proposed for an article, but that do not have a page yet • Inductive Type Checking • Type constraints can be used to generate facts • e.g. Elvis Presley bornOnDate January 8, 1935 • So, Elvis Presley is a person • Regular expression check to ensure entity name pattern of given name and family name
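
Reductive and inductive type checking can be sketched in one pass over the facts. The regular expression for person names and the `RELATION_DOMAIN` excerpt are assumptions for illustration; the slide only says that such a name check exists.

```python
import re

# Relation -> required class of the first argument (excerpt).
RELATION_DOMAIN = {"bornOnDate": "person", "diedOnDate": "person"}
# Assumed pattern: given name followed by one or more further capitalized names.
PERSON_NAME = re.compile(r"[A-Z][a-z]+( [A-Z][a-z]+)+")


def type_check(facts, classes_of):
    """Return (kept facts, inferred type facts)."""
    kept, inferred = [], []
    for s, r, o in facts:
        if classes_of.get(s):
            kept.append((s, r, o))  # the class is already known
        elif RELATION_DOMAIN.get(r) == "person" and PERSON_NAME.fullmatch(s):
            # Inductive: the relation's domain tells us the subject is a person.
            inferred.append((s, "type", "person"))
            kept.append((s, r, o))
        # Reductive: otherwise the class is unknown and the fact is discarded.
    return kept, inferred


facts = [
    ("Elvis Presley", "bornOnDate", "January 8, 1935"),
    ("proposed article", "locatedIn", "somewhere"),
]
kept, inferred = type_check(facts, classes_of={})
print(kept)
print(inferred)
```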

  41. Type Checking - 2 • Type Coherence Checking • Sometimes, classification yields wrong results • e.g. Abraham Lincoln is an instance of 13 classes • 12 are subclasses of class ‘person’, e.g. lawyer, president • The 13th class is class ‘cabinet’ • Class hierarchy of YAGO is partitioned into branches • e.g. locations, artifacts, people, other physical entities, and abstract entities • The branch that most types lead to is determined • Other types are purged
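
Type coherence checking amounts to a majority vote over branches. In this sketch the mapping from class to branch is passed in as a dictionary, and placing ‘cabinet’ in the artifacts branch is an assumption; the slide does not say which branch it belongs to.

```python
from collections import Counter


def coherent_types(classes, branch_of):
    """Keep only the classes that lie in the branch most classes lead to."""
    if not classes:
        return []
    counts = Counter(branch_of[c] for c in classes)
    majority_branch, _ = counts.most_common(1)[0]
    return [c for c in classes if branch_of[c] == majority_branch]


# Shortened version of the Abraham Lincoln example (3 classes instead of 13).
branch_of = {"lawyer": "people", "president": "people", "cabinet": "artifacts"}
print(coherent_types(["lawyer", "president", "cabinet"], branch_of))
```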

  42. References • Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum: YAGO: A Large Ontology from Wikipedia and WordNet. Max-Planck-Institute for Computer Science, Saarbruecken, Germany • Fabian M. Suchanek: Automated Construction and Growth of a Large Ontology. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University • Wikipedia: http://en.wikipedia.org/wiki/Main_Page • WordNet: http://wordnet.princeton.edu/

  43. Thank You, Any Questions?
