YAGO: A Large Ontology from Wikipedia and WordNet • Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum • Subbalakshmi Iyer
Motivation for an Ontology • Natural-language communication • Automated text translation • Finding information on the Internet • Computer-processable collection of knowledge
What is an Ontology? • An ontology is the description of a domain, its classes and properties, and the relationships between those classes, by means of a formal language • A collection of knowledge about the world; a knowledge base • Example ontologies: • large taxonomies categorizing Web sites (such as on Yahoo!) • categorizations of products for sale and their features (such as on Amazon.com)
Uses of Ontologies • Machine Translation • Word Sense Disambiguation • Document Classification • Question Answering • Entity and fact-oriented Web Search
What is YAGO? • Yet Another Great Ontology • Part of the YAGO-NAGA project • Goal: build a knowledge base that is • Large-scale • Domain-independent • Automatically constructed • Highly accurate • Uses Wikipedia and WordNet
More about YAGO • 2 million entities • 20 million facts • Facts represented as RDF triples • Accuracy of 95% • Examples: • Elvis Presley isA singer • singer subClassOf person • Elvis Presley bornOnDate 1935-01-08 • Elvis Presley bornIn Tupelo • Tupelo locatedIn Mississippi(state) • Mississippi(state) locatedIn USA
The YAGO model • Slight extension of RDFS • Represents knowledge as • Entities • Classes • Relations • Facts • Properties of relations like transitivity • Simple and decidable model
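To make "properties of relations like transitivity" concrete, here is a minimal Python sketch that derives the facts implied by a transitive relation. It assumes locatedIn is declared transitive and reuses the example facts from the previous slide; the function name transitive_closure is hypothetical and not part of YAGO.

```python
def transitive_closure(pairs):
    """Return every (x, z) pair implied by chaining the given (x, y) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                # a -> b and b -> d together imply a -> d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Example facts from the slides; transitivity of locatedIn is an assumption here.
located_in = {
    ("Tupelo", "Mississippi(state)"),
    ("Mississippi(state)", "USA"),
}
inferred = transitive_closure(located_in) - located_in
print(inferred)  # {('Tupelo', 'USA')}
```

The quadratic loop is fine for a handful of facts; a real knowledge base would need an indexed graph traversal.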
Knowledge Representation in YAGO • All objects are entities • e.g. Elvis Presley, Grammy Award • 2 entities can stand in a relationship • e.g. hasWonAward • Elvis Presley hasWonAward Grammy Award • The triple of entity, relationship, entity is a fact • e.g. Elvis Presley hasWonAward Grammy Award is a fact
Knowledge Representation in YAGO - 2 • Numbers, dates and strings are also entities • Elvis Presley bornInYear 1935 • Words are entities • “Elvis” means Elvis Presley • An entity is an instance of a class • Elvis Presley type singer • Classes are also entities • singer type class
Knowledge Representation in YAGO - 3 • Classes are arranged in a hierarchy • singer subClassOf person • Relations are also entities • subClassOf type atr (acyclic transitive relation) • Each fact has a fact identifier, which can itself be the subject of another fact • #1 foundIn Wikipedia
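The idea that every fact has an identifier, and that identifiers can appear in other facts, can be sketched as a small in-memory store. This is only an illustration under assumptions: the class FactStore and its method names are hypothetical, and identifiers are simply numbered in insertion order.

```python
from itertools import count


class FactStore:
    """Toy store mapping fact identifiers to (subject, relation, object) triples."""

    def __init__(self):
        self._ids = count(1)
        self.facts = {}  # fact id -> (subject, relation, object)

    def add(self, subject, relation, obj):
        """Store a triple and return its fact identifier, e.g. '#1'."""
        fact_id = f"#{next(self._ids)}"
        self.facts[fact_id] = (subject, relation, obj)
        return fact_id


store = FactStore()
f1 = store.add("Elvis Presley", "hasWonAward", "Grammy Award")
# A fact about a fact: the identifier of the first fact is the subject.
f2 = store.add(f1, "foundIn", "Wikipedia")
```

Because identifiers are ordinary strings, the second triple needs no special handling; it just uses `#1` in the subject position.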
Key Contributions of YAGO • Information Extraction from Wikipedia • Infoboxes • Category Pages • Combination with WordNet • Taxonomy • Quality Control • Canonicalization • Type Checking
Information Extraction -1 • Entities from Wikipedia • Each page title is candidate entity • Wiki Markup Language • Wikipedia dump as of September, 2008
Information Extraction Techniques • Infobox Harvesting • Wikipedia Infoboxes • Word-Level Techniques • Wikipedia Redirects • Category Harvesting • Wikipedia Categories • Type Extraction • Wikipedia Categories, WordNet Classes
1. Information Extraction from Wikipedia – Infobox Harvesting • [Figure: Wikipedia infobox]
Attribute Map: Born • Infobox entry: Born: January 8, 1935 • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • Born → bornOnDate • Relation Map (columns: Relation, Domain, Range) • bornOnDate: domain person, range yagoDate • Resulting fact: Elvis Presley bornOnDate January 8, 1935
Attribute Map: Died • Infobox entry: Died: August 16, 1977 • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • Died → diedOnDate • Relation Map (columns: Relation, Domain, Range) • diedOnDate: domain person, range yagoDate • Resulting fact: Elvis Presley diedOnDate August 16, 1977
Attribute Map: Genre • Infobox entry: Genre: Rock and Roll • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • Genre → isOfGenre • Relation Map (columns: Relation, Domain, Range) • isOfGenre: domain entity, range yagoClass • Resulting fact: Elvis Presley isOfGenre Rock and Roll
Attribute Map: Birth Name • Infobox entry: Birth Name: Elvis Aaron Presley • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • birth name → means • Relation Map (columns: Relation, Domain, Range) • means: domain yagoWord, range entity • Resulting fact: Elvis Aaron Presley means Elvis Presley
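The four infobox slides above all follow the same lookup: an infobox attribute is mapped to a relation, and a fact is emitted for the article entity. A minimal Python sketch of that lookup follows. The dictionary layout and the function name harvest are hypothetical, and treating "birth name" as an inverse attribute (value as subject) is an assumption inferred from the resulting fact on the slide.

```python
# attribute -> (relation, inverse); the inverse flag for "birth name" is an assumption
ATTRIBUTE_MAP = {
    "born": ("bornOnDate", False),
    "died": ("diedOnDate", False),
    "genre": ("isOfGenre", False),
    "birth name": ("means", True),
}


def harvest(entity, infobox):
    """Turn infobox attribute/value pairs into (subject, relation, object) facts."""
    facts = []
    for attribute, value in infobox.items():
        entry = ATTRIBUTE_MAP.get(attribute.lower())
        if entry is None:
            continue  # attribute not covered by the map: no fact is generated
        relation, inverse = entry
        if inverse:
            facts.append((value, relation, entity))
        else:
            facts.append((entity, relation, value))
    return facts


elvis_infobox = {
    "Born": "January 8, 1935",
    "Died": "August 16, 1977",
    "Genre": "Rock and Roll",
    "Birth Name": "Elvis Aaron Presley",
}
elvis_facts = harvest("Elvis Presley", elvis_infobox)
```

The domain and range columns of the relation map are not used here; they come into play during type checking.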
Manifold Attributes • Some attributes may have multiple values • e.g. a person may have multiple children • Multiple facts are generated • e.g. one hasChild fact for each child
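A manifold attribute can be sketched as splitting one infobox value into several facts. The sketch below assumes the values are comma-separated, which the slides do not state; the function name manifold_facts and the placeholder names are hypothetical.

```python
def manifold_facts(entity, relation, raw_value, separator=","):
    """Emit one fact per value of a multi-valued attribute (separator is an assumption)."""
    values = [part.strip() for part in raw_value.split(separator)]
    return [(entity, relation, value) for value in values if value]


# Placeholder names, used only for illustration.
child_facts = manifold_facts("Some Person", "hasChild", "Child A, Child B")
```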
Indirect Attributes - 1 • Some attributes do not concern the article entity, but another fact • e.g. in the infobox of the Republic of Singapore, the attribute ‘gdp year’ (2008) does not describe Singapore itself, but the fact that states its GDP • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • gdp ppp → hasGDP • gdp year → during (indirect: qualifies the hasGDP fact) • Facts generated: • #14: Singapore hasGDP 238.755 billion • #14 during 2008 • Read together: Singapore hasGDP 238.755 billion during 2008
Indirect Attributes - 2 • [Figure: Singapore infobox]
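The handling of indirect attributes can be sketched in two passes: direct attributes first, then indirect ones that attach to the identifier of an earlier fact. This is a sketch under assumptions: the slides do not say how the Indirect column names its target, so here it simply holds the attribute whose fact is being qualified, and harvest_indirect is a hypothetical name.

```python
# "indirect" names the attribute whose fact is qualified (an assumed encoding)
ATTRIBUTE_MAP = {
    "gdp ppp": {"relation": "hasGDP", "indirect": None},
    "gdp year": {"relation": "during", "indirect": "gdp ppp"},
}


def harvest_indirect(entity, infobox, first_id=1):
    """Return {fact id: triple}; indirect attributes use a fact id as their subject."""
    facts = {}
    id_by_attribute = {}  # attribute -> id of the fact it produced
    next_id = first_id
    # Pass 1: direct attributes describe the article entity.
    for attribute, value in infobox.items():
        entry = ATTRIBUTE_MAP.get(attribute)
        if entry and entry["indirect"] is None:
            fact_id = f"#{next_id}"
            next_id += 1
            facts[fact_id] = (entity, entry["relation"], value)
            id_by_attribute[attribute] = fact_id
    # Pass 2: indirect attributes describe a fact produced in pass 1.
    for attribute, value in infobox.items():
        entry = ATTRIBUTE_MAP.get(attribute)
        if entry and entry["indirect"] in id_by_attribute:
            fact_id = f"#{next_id}"
            next_id += 1
            target = id_by_attribute[entry["indirect"]]
            facts[fact_id] = (target, entry["relation"], value)
    return facts


singapore = harvest_indirect(
    "Singapore", {"gdp ppp": "238.755 billion", "gdp year": "2008"}, first_id=14
)
```

Starting the numbering at 14 only mirrors the identifier used on the slide.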
Type of Infobox • Song Infobox (American Pie) • Released: October, 1971 • Format: vinyl record • Genre: Folk Rock • Length: 8:33 mins • Label: United Artists • Writer: Don McLean • Car Infobox (Tesla Roadster) • Manufacturer: Tesla Motors • Production: 2008-present • Class: Roadster • Length: 3,946 mm • Width: 1,873 mm • Height: 1,127 mm
Type of Infobox: Attribute Map • Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect) • car#length → hasLength • song#length → hasDuration • Car Infobox: Tesla Roadster hasLength 3946 • Song Infobox: American Pie hasDuration 8:33
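The type-qualified attribute map can be pictured as a lookup keyed on both the infobox type and the attribute name. The following is a minimal sketch; the tuple-keyed dictionary and the function relation_for are hypothetical choices, not YAGO's actual data structures.

```python
# (infobox type, attribute) -> relation, mirroring car#length and song#length
TYPED_ATTRIBUTE_MAP = {
    ("car", "length"): "hasLength",
    ("song", "length"): "hasDuration",
}


def relation_for(infobox_type, attribute):
    """Look up the relation for an attribute, qualified by the infobox type."""
    return TYPED_ATTRIBUTE_MAP.get((infobox_type.lower(), attribute.lower()))


tesla_fact = ("Tesla Roadster", relation_for("Car", "Length"), "3946")
song_fact = ("American Pie", relation_for("Song", "Length"), "8:33")
```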
Information Extraction - Word-Level Techniques • Wikipedia Redirects • A virtual redirect page for “Presley, Elvis” links to “Elvis Presley” • Each redirect gives a ‘means’ fact • e.g. “Presley, Elvis” means Elvis Presley • Parsing Person Names • Extract the name components • Establish the relations givenNameOf and familyNameOf • e.g. Presley familyNameOf Elvis Presley • Elvis givenNameOf Elvis Presley
Wikipedia Categories • Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents • Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands • Categories: Canadian Singers | Canadian male singers | 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers
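Both word-level techniques are simple enough to sketch in a few lines of Python. The sketch assumes that the first token of a name is the given name and the last token is the family name, which is a simplification; the function names redirect_facts and name_facts are hypothetical.

```python
def redirect_facts(redirects):
    """Each redirect (source title -> target entity) yields one 'means' fact."""
    return [(source, "means", target) for source, target in redirects.items()]


def name_facts(entity):
    """Naive name parsing: first token is the given name, last is the family name."""
    parts = entity.split()
    if len(parts) < 2:
        return []  # a single token cannot be split into name components
    return [
        (parts[0], "givenNameOf", entity),
        (parts[-1], "familyNameOf", entity),
    ]


means = redirect_facts({"Presley, Elvis": "Elvis Presley"})
names = name_facts("Elvis Presley")
```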
Facts created from Wikipedia Categories • Rhine locatedIn Germany • Bryan Adams bornOnDate 1959 • Bryan Adams hasWonAward Grammy Award • Abraham Lincoln politicianOf United States
Information Extraction - Category Harvesting • Relational Categories • Table: Some Category Heuristics (regular expression → relation) • ([0-9]{3,4}) births → bornOnDate • ([0-9]{3,4}) deaths → diedOnDate • ([0-9]{3,4}) establishments → establishedOnDate • ([0-9]{3,4}) books|novels → writtenOnDate • Mountains|Rivers in (.*) → locatedIn • Presidents|Governors of (.*) → politicianOf • (.*) winners → hasWonPrize • [A-Za-z]+ (.*) winners → hasWonPrize
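The category heuristics translate almost directly into Python regular expressions. In this sketch the alternations are grouped explicitly with non-capturing groups, which is one reading of the table rather than the authors' exact patterns; the first matching pattern wins, and category_facts is a hypothetical name.

```python
import re

# (pattern, relation) pairs in table order; the grouping of alternations is assumed
CATEGORY_HEURISTICS = [
    (r"([0-9]{3,4}) births", "bornOnDate"),
    (r"([0-9]{3,4}) deaths", "diedOnDate"),
    (r"([0-9]{3,4}) establishments", "establishedOnDate"),
    (r"([0-9]{3,4}) (?:books|novels)", "writtenOnDate"),
    (r"(?:Mountains|Rivers) in (.*)", "locatedIn"),
    (r"(?:Presidents|Governors) of (.*)", "politicianOf"),
    (r"(.*) winners", "hasWonPrize"),
    (r"[A-Za-z]+ (.*) winners", "hasWonPrize"),
]


def category_facts(entity, categories):
    """Apply the first matching heuristic to each category name."""
    facts = []
    for category in categories:
        for pattern, relation in CATEGORY_HEURISTICS:
            match = re.fullmatch(pattern, category)
            if match:
                facts.append((entity, relation, match.group(1)))
                break
    return facts


adams = category_facts(
    "Bryan Adams", ["1959 births", "Grammy Award winners", "Living people"]
)
```

Categories that match no pattern, such as "Living people", produce no relational fact and are left for type extraction.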
2. Connecting Wikipedia and WordNet – What is WordNet • Lexical database for the English language • Created at the Cognitive Science Laboratory of Princeton University • Groups English words into sets of synonyms called synsets • Provides short, general definitions • Provides hypernym/hyponym relations • e.g. canine is hypernym, dog is hyponym
Connecting Wikipedia and WordNet – Type Extraction • Goal: create a class hierarchy • e.g. singer subClassOf performer • performer subClassOf artist • The hyponymy relation comes from WordNet • e.g. the Wikipedia class ‘American people in Japan’ is a subclass of the WordNet class ‘person’
Classification of Categories • Conceptual Categories • e.g. Albert Einstein is in ‘Naturalized citizens of the United States’ • Administrative Categories • e.g. Albert Einstein is in ‘Articles with unsourced statements’ • Relational Information • e.g. ‘1879 births’ • Thematic Vicinity • e.g. ‘Physics’
Identification of Conceptual Categories • Only conceptual categories are used • Shallow linguistic parsing of category names • e.g. category ‘American people in Japan’ • Break category into • pre-modifier - ‘American’ • head - ‘people’ • post-modifier - ‘in Japan’ • If head is plural, then category is conceptual category • Extract class from Wikipedia category • Connect to class from WordNet • e.g. the Wikipedia class ‘American people in Japan’ has to be made a subclass of the WordNet class ‘person’
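The split into pre-modifier, head and post-modifier, followed by the plural check, can be approximated very roughly in Python. This is not the shallow linguistic parser the slides refer to: it assumes the post-modifier starts at the first preposition, the head is the last word before it, and plurality is decided by a small irregular-plural table plus an "ends in s" test. All names and word lists here are hypothetical.

```python
PREPOSITIONS = {"in", "of", "from", "by", "with", "at", "for"}
IRREGULAR_PLURALS = {"people", "men", "women", "children"}


def parse_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    split_at = next(
        (i for i, word in enumerate(words) if word.lower() in PREPOSITIONS),
        len(words),
    )
    before, post = words[:split_at], words[split_at:]
    if not before:
        return "", "", " ".join(post)
    return " ".join(before[:-1]), before[-1], " ".join(post)


def is_plural(word):
    """Crude plural test; a real system would use a morphological analyzer."""
    lowered = word.lower()
    return lowered in IRREGULAR_PLURALS or (
        lowered.endswith("s") and not lowered.endswith("ss")
    )


def is_conceptual(name):
    """A category counts as conceptual here if its head word is plural."""
    _, head, _ = parse_category(name)
    return bool(head) and is_plural(head)


parts = parse_category("American people in Japan")
```

The suffix test misfires on words such as "Physics", and administrative or relational categories (for example ‘1879 births’) would have to be filtered out before this check.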
Algorithm
Function wiki2wordnet(c)
Input: Wikipedia category name c
Output: WordNet synset
    head = headCompound(c)
    pre = preModifier(c)
    post = postModifier(c)
    head = stem(head)
    If there is a WordNet synset s for pre + head
        return s
    If there are WordNet synsets s1, …, sn for head
        (ordered by their frequency for head)
        return s1
    fail
Explanation of Algorithm • Input: American people in Japan • pre-modifier : American • Head : people • Post-modifier : in Japan • Stem(head) : person • If there is a WordNet synset for ‘American person’ • return that synset • If there are s1, …, sn synsets for ‘person’ • (Ordered by frequency for ‘person’) • Return s1 • Fail • Output: person • Result: American People in Japan subClassOf person
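The walk-through above can be turned into a small runnable sketch. To stay self-contained it does not query WordNet: TOY_WORDNET is a hypothetical stand-in whose synset labels and ordering are made up for illustration, TOY_STEMS is a toy stemmer table, and the function takes the already-separated pre-modifier and head instead of the raw category name.

```python
# Hypothetical stand-in for WordNet: word -> synsets, most frequent first
TOY_WORDNET = {
    "person": ["person#1", "person#2", "person#3"],
    "singer": ["singer#1"],
    "folk singer": ["folk singer#1"],
}
TOY_STEMS = {"people": "person"}


def stem(word):
    """Toy stemmer: irregular table first, then strip a trailing 's'."""
    lowered = word.lower()
    if lowered in TOY_STEMS:
        return TOY_STEMS[lowered]
    return lowered[:-1] if lowered.endswith("s") else lowered


def wiki2wordnet(pre, head, wordnet=TOY_WORDNET):
    """Map a parsed category to a synset, or None when the lookup fails."""
    head = stem(head)
    compound = f"{pre} {head}".strip().lower()
    # Prefer a synset for pre-modifier + head, e.g. "folk singer".
    if pre and compound in wordnet:
        return wordnet[compound][0]
    # Otherwise fall back to the most frequent synset of the head alone.
    if head in wordnet:
        return wordnet[head][0]
    return None


result = wiki2wordnet("American", "people")
```

Returning None plays the role of "fail" in the pseudocode; the caller decides whether to skip the category or flag it for manual correction.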
Fig.: WordNet search for ‘person’ • Fig.: WordNet search for ‘American person’
Exceptions • Complete hierarchy of classes • Upper classes from WordNet • Leaves from Wikipedia • 2 dozen cases failed • Categories with head compound “capital” • In Wikipedia, it means “capital city” • In WordNet, it means “financial asset” • These cases were corrected manually
3. Quality Control • Canonicalization • Each fact and each entity reference is unique • An entity is always referred to by the same identifier in all facts in YAGO • Type Checking • Eliminates individuals that do not have a class • Eliminates facts that do not respect the domain and range constraints • An argument of a fact in YAGO is always an instance of the class required by the relation
Canonicalization - 1 • Redirect Resolution • Infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments • These links may not be the correct Wikipedia page identifiers • Check whether each argument is a correct Wikipedia identifier • If not, replace it by the correct, redirected identifier • e.g. Hermitage Museum locatedIn St. Petersburg • becomes Hermitage Museum locatedIn Saint Petersburg
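Redirect resolution amounts to replacing every argument of a fact by the page it redirects to. A minimal sketch follows, assuming the redirects are available as a dictionary; REDIRECTS, resolve and canonicalize are hypothetical names.

```python
# Redirect table: link text -> canonical page identifier (toy example)
REDIRECTS = {"St. Petersburg": "Saint Petersburg"}


def resolve(name, redirects=REDIRECTS):
    """Follow redirects until a canonical identifier is reached."""
    seen = set()
    while name in redirects and name not in seen:
        seen.add(name)  # guards against redirect loops
        name = redirects[name]
    return name


def canonicalize(fact, redirects=REDIRECTS):
    """Replace both arguments of a fact by their canonical identifiers."""
    subject, relation, obj = fact
    return (resolve(subject, redirects), relation, resolve(obj, redirects))


fixed = canonicalize(("Hermitage Museum", "locatedIn", "St. Petersburg"))
```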
Canonicalization - 2 • Removal of Duplicate facts • Sometimes, 2 heuristics deliver the same fact. • canonicalization eliminates one of them • e.g., category ‘1935 births’ yields the fact: • Elvis Presley bornOnDate 1935 • Infobox attribute ‘Born: January 8, 1935’ yields the fact: • Elvis Presley bornOnDate January 8, 1935
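Duplicate removal for the example above can be sketched as follows. The slide only says that one of the two facts is eliminated; this sketch assumes the more precise value is the one kept, and it approximates "same fact, different precision" by checking whether one value is contained in the other. The function name remove_duplicates is hypothetical.

```python
def remove_duplicates(facts):
    """Keep one fact per (subject, relation) when values overlap; prefer the longer value."""
    kept = []
    for fact in facts:
        subject, relation, value = fact
        merged = False
        for i, (k_subject, k_relation, k_value) in enumerate(kept):
            same_slot = (subject, relation) == (k_subject, k_relation)
            if same_slot and (value in k_value or k_value in value):
                # Assumption: the longer string is the more precise statement.
                if len(value) > len(k_value):
                    kept[i] = fact
                merged = True
                break
        if not merged:
            kept.append(fact)
    return kept


deduplicated = remove_duplicates([
    ("Elvis Presley", "bornOnDate", "1935"),
    ("Elvis Presley", "bornOnDate", "January 8, 1935"),
    ("Elvis Presley", "bornIn", "Tupelo"),
])
```

String containment is only a stand-in; comparing parsed dates would be more robust.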
Type Checking - 1 • Reductive Type Checking • Sometimes the class of an entity cannot be determined • Facts about such entities are discarded • e.g. Wikipedia entities that have been proposed for an article, but that do not have a page yet • Inductive Type Checking • Type constraints can be used to generate facts • e.g. Elvis Presley bornOnDate January 8, 1935 • So, Elvis Presley is a person • A regular-expression check ensures that the entity name matches the pattern of a given name followed by a family name
Type Checking - 2 • Type Coherence Checking • Sometimes, classification yields wrong results • e.g. Abraham Lincoln is an instance of 13 classes • 12 are subclasses of the class ‘person’, e.g. lawyer, president • The 13th class is the class ‘cabinet’ • The class hierarchy of YAGO is partitioned into branches • e.g. locations, artifacts, people, other physical entities, and abstract entities • The branch that most types lead to is determined • Types in other branches are purged
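Reductive and inductive type checking can be combined in one pass over the facts. The sketch below is an illustration under assumptions: RELATION_DOMAIN holds only the one domain constraint from the slides, the regular expression for person names is a guess at what a "given name and family name" pattern might look like, and type_check is a hypothetical name.

```python
import re

RELATION_DOMAIN = {"bornOnDate": "person"}
# Assumed pattern: two or more capitalized words
PERSON_NAME = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")


def type_check(facts, known_types):
    """Return (kept facts, types) after reductive and inductive type checking."""
    types = {entity: set(classes) for entity, classes in known_types.items()}
    kept = []
    for subject, relation, obj in facts:
        if subject not in types:
            domain = RELATION_DOMAIN.get(relation)
            if domain == "person" and PERSON_NAME.fullmatch(subject):
                # Inductive step: the domain constraint supplies the missing type.
                types[subject] = {domain}
            else:
                # Reductive step: no class can be determined, so drop the fact.
                continue
        kept.append((subject, relation, obj))
    return kept, types


kept_facts, inferred_types = type_check(
    [
        ("Elvis Presley", "bornOnDate", "January 8, 1935"),
        ("Proposed Article", "locatedIn", "Somewhere"),
    ],
    known_types={},
)
```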
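Type coherence checking is essentially a majority vote over branches. In the sketch below, the mapping from class to branch is passed in explicitly; placing ‘cabinet’ in the artifacts branch is an assumption made for the example, and purge_incoherent_types is a hypothetical name.

```python
from collections import Counter


def purge_incoherent_types(entity_classes, branch_of):
    """Keep only the classes that lie in the branch most of the classes lead to."""
    if not entity_classes:
        return []
    counts = Counter(branch_of[cls] for cls in entity_classes)
    majority_branch, _ = counts.most_common(1)[0]
    return [cls for cls in entity_classes if branch_of[cls] == majority_branch]


# Abbreviated version of the Abraham Lincoln example; branch labels are assumed.
branch_of = {"lawyer": "people", "president": "people", "cabinet": "artifacts"}
lincoln_types = purge_incoherent_types(["lawyer", "president", "cabinet"], branch_of)
```

Ties between branches are broken by whichever branch Counter reports first, which a real system would want to handle more deliberately.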
References • YAGO: A Large Ontology from Wikipedia and WordNet. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Max-Planck-Institute for Computer Science, Saarbruecken, Germany • Automated Construction and Growth of a Large Ontology. Fabian M. Suchanek. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University • Wikipedia: http://en.wikipedia.org/wiki/Main_Page • WordNet: http://wordnet.princeton.edu/