Joint work with Georgiana Ifrim, Gjergji Kasneci, Thomas Neumann, Maya Ramanath, Fabian Suchanek

Vision
Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base
Approach:
• 1) harvest and combine
  • hand-crafted knowledge sources (Semantic Web, ontologies)
  • automatic knowledge extraction (Statistical Web, text mining)
  • social communities and human computing (Social Web, Web 2.0)
• 2) express knowledge queries, search, and rank
• 3) everything efficient and scalable

Why Google and Wikipedia Are Not Enough
Answer "knowledge queries" such as:
• proteins that inhibit proteases and other human enzymes
• connection between Thomas Mann and Goethe
• German Nobel prize winner who survived both world wars and all of his four children
• German universities with world-class computer scientists
• politicians who are also scientists

Why Google and Wikipedia Are Not Enough
Which politicians are also scientists?
What is lacking?
• Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth. Truth is not Beauty. Beauty is not Music. Music is the best. (Frank Zappa)
• extract facts from Web pages
• capture user intention by concepts, entities, relations

NAGA Example
Query: $x isa politician, $x isa scientist
Results: Benjamin Franklin, Paul Wolfowitz, Angela Merkel, …

Related Work
• Research areas: information extraction & ontology building; Web entity search & QA; semistructured IR & graph search
• Systems and projects shown: Cimple DBlife, Libra, TextRunner, START, Answers, Avatar, UIMA, Hakia, Powerset, Freebase, EntityRank, Cyc, DBpedia, TopX, XQ-FT, Yago, Naga, Tijah, SPARQL, DBexplorer, Banks, SWSE

Outline
✓ Motivation
• Information Extraction & Knowledge Harvesting (YAGO)
• Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion

Information Extraction (IE): Text to Records
Person              BirthDate     BirthPlace   ...
Max Planck          4/23, 1858    Kiel
Albert Einstein     3/14, 1879    Ulm
Mahatma Gandhi      10/2, 1869    Porbandar
Person              ScientificResult
Max Planck          Quantum Theory
Person              Collaborator
Max Planck          Albert Einstein
Max Planck          Niels Bohr
Constant            Value         Dimension
Planck's constant   6.226×10^23   Js
• extracted facts often have confidence < 1: DB with uncertainty (probabilistic DB)
• expensive and error-prone
• combine NLP, pattern matching, lexicons, statistical learning

High-Quality Knowledge Sources
General-purpose ontologies and thesauri: WordNet family
• 200 000 concepts and relations
• can be cast into description logics or a graph, with weights for relation strengths (derived from co-occurrence statistics)
scientist, man of science (a person with advanced knowledge)
  => cosmographer, cosmographist
  => biologist, life scientist
  => chemist
  => cognitive scientist
  => computer scientist
  ...
  => principal investigator, PI
  …
  HAS INSTANCE => Bacon, Roger Bacon
  …

Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students = [[Gustav Ludwig Hertz]]</br> …
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…

YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, G. Weikum: WWW'07]
• Turn Wikipedia into an explicit knowledge base (semantic DB); keep source pages as witnesses
• Exploit hand-crafted categories and infobox templates
• Represent facts as explicit knowledge triples: relation (entity1, entity2) (in FOL, compatible with RDF, OWL-lite, XML, etc.)
• Map (and disambiguate) relations into WordNet concept DAG
Examples: bornIn (Max_Planck, Kiel), isInstanceOf (Kiel, City)
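The triple representation relation (entity1, entity2) can be sketched in a few lines of Python. This is only an illustration of the data model, not YAGO's actual storage format; the names Fact, objects_of and to_rdf_style are hypothetical helpers introduced here.

```python
from collections import namedtuple

# One fact = relation(entity1, entity2), as on the slide.
Fact = namedtuple("Fact", ["relation", "entity1", "entity2"])

# The two example facts from the slide.
facts = {
    Fact("bornIn", "Max_Planck", "Kiel"),
    Fact("isInstanceOf", "Kiel", "City"),
}


def objects_of(fact_set, relation, entity1):
    """Return all entity2 values for a given relation and entity1."""
    return sorted(
        f.entity2
        for f in fact_set
        if f.relation == relation and f.entity1 == entity1
    )


def to_rdf_style(fact):
    """Reorder a fact into (subject, predicate, object) order."""
    return (fact.entity1, fact.relation, fact.entity2)
```

For example, objects_of(facts, "bornIn", "Max_Planck") looks up where Max Planck was born, and to_rdf_style shows why the representation is compatible with RDF-style subject-predicate-object triples.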

YAGO Knowledge Base [F. Suchanek et al.: WWW'07]
Entities / Facts per knowledge base:
• KnowItAll: 30 000
• SUMO: 20 000 / 60 000
• WordNet: 120 000 / 80 000
• Cyc: 300 000 / 5 Mio.
• TextRunner: n/a / 8 Mio.
• YAGO: 1.7 Mio. / 15 Mio.
• DBpedia: 1.9 Mio. / 103 Mio.
• Freebase: ??? / ???
Accuracy ≈ 95%
[Figure: excerpt of the YAGO graph in three layers. Concepts: Entity, Person, Location, Scientist, City, Country, Biologist, Physicist, linked by subclass edges. Individuals: Max_Planck, Erwin_Planck, Kiel, Nobel Prize, April 23, 1858, October 4, 1947, linked by instanceOf, bornIn, bornOn, diedOn, hasWon, FatherOf. Words: "Max Planck", "Dr. Planck", "Max Karl Ernst Ludwig Planck", linked to Max_Planck by means edges.]
Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/

Wikipedia Harvesting: Difficulties & Solutions
• instanceOf relation: misleading and difficult category names ("disputed articles", "particle physics", "American Music of the 20th Century", "Nobel laureates in physics", "naturalized citizens of the United States", …)
  → noun group parser: ignore when head word in singular
• isA relation: mapping categories onto WordNet classes: "Nobel laureates in physics" → Nobel_laureates, "people from Kiel" → person
  → map to (singular of) head; exploit synsets and statistics
• Entity name ambiguities: "St. Petersburg", "Saint Petersburg", "M31", "NGC224" → means ...
  → exploit Wikipedia redirects & disambiguations, WN synsets
• type checking for scrutinizing candidates: accept fact candidate only if arguments have proper classes
  marriedTo (Max Planck, quantum physics) ∉ Person × Person
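The type-checking step can be illustrated with a minimal Python sketch. The class hierarchy, the instance assignments and the relation signatures below are small hypothetical stand-ins (in particular the class "Theory" and the signature of bornIn are assumptions), not YAGO's real schema.

```python
# Hypothetical toy schema: subclass edges, instance edges, relation signatures.
SUBCLASS_OF = {"Physicist": "Scientist", "Scientist": "Person", "City": "Location"}
INSTANCE_OF = {
    "Max_Planck": "Physicist",
    "Kiel": "City",
    "quantum_physics": "Theory",  # assumed class, for illustration only
}
SIGNATURES = {
    "marriedTo": ("Person", "Person"),
    "bornIn": ("Person", "City"),  # assumed signature
}


def is_a(entity, target_class):
    """Walk up the subclass chain from the entity's class."""
    cls = INSTANCE_OF.get(entity)
    while cls is not None:
        if cls == target_class:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def type_check(relation, arg1, arg2):
    """Accept a fact candidate only if both arguments have proper classes."""
    domain, range_ = SIGNATURES[relation]
    return is_a(arg1, domain) and is_a(arg2, range_)
```

With this toy schema, type_check("bornIn", "Max_Planck", "Kiel") accepts the candidate, while type_check("marriedTo", "Max_Planck", "quantum_physics") rejects it because the second argument is not a Person.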

Higher-Order Facts in YAGO
facts about facts, represented by reification as first-order facts
• e314159: CapitalOf (Berlin, Germany)
• validIn (e314159, 1990-2008)
[Figure: reified facts. CapitalOf (Bonn, Germany) validIn 1949-1989; CapitalOf (Berlin, Germany) validIn 1990-2008; Arnold Schwarzenegger instanceOf Actor validIn 1987-2008; Arnold Schwarzenegger instanceOf Politician validIn 2003-2008.]
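Reification can be sketched as follows: every fact gets an identifier, and a fact about a fact simply uses that identifier as its subject. This is a minimal Python illustration; the identifier scheme, the helper names add_fact and holder_in, and the pairing of Bonn with 1949-1989 (read off the slide's figure labels) are assumptions.

```python
import itertools

_next_id = itertools.count(1)
facts = {}  # fact id -> (subject, relation, object)


def add_fact(subject, relation, obj):
    """Store a fact and return its identifier, usable as a subject later."""
    fact_id = f"e{next(_next_id)}"
    facts[fact_id] = (subject, relation, obj)
    return fact_id


# First-order facts plus validIn facts that refer to them by id.
berlin = add_fact("Berlin", "CapitalOf", "Germany")
add_fact(berlin, "validIn", "1990-2008")
bonn = add_fact("Bonn", "CapitalOf", "Germany")
add_fact(bonn, "validIn", "1949-1989")


def holder_in(relation, obj, year):
    """Find the subject s such that relation(s, obj) was valid in the given year."""
    for subject, rel, interval in facts.values():
        if rel != "validIn" or subject not in facts:
            continue
        start, end = (int(part) for part in interval.split("-"))
        base_subject, base_rel, base_obj = facts[subject]
        if start <= year <= end and base_rel == relation and base_obj == obj:
            return base_subject
    return None
```

A query over reified facts such as "capital of Germany in 1988" then becomes holder_in("CapitalOf", "Germany", 1988).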

Ongoing Work: YAGO for Easier IE
• YAGO knows (almost) all (interesting) entities: leverage for discovering & extracting new facts in NL texts
• IE with dependency parser is expensive!
Example sentences:
• Cologne lies on the banks of the Rhine
• People in Cairo like wine from the Rhine valley
• The city of Paris was founded on an island in the Seine in 300 BC
[Figure: parse trees (NP, VP, PP) and link-grammar labels for the example sentences, next to a YAGO graph excerpt with Paris (isa city), Seine (isa river), a runsThrough edge, and locatedIn edges involving France and Europe.]
With YAGO, IE
• can filter out many uninteresting sentences
• can quickly identify relation arguments
• can eliminate many fact candidates by type checking
• can focus on specific properties like time

Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting (YAGO)
• Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion

NAGA: Graph Search [G. Kasneci et al.: ICDE'08]
Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness
• discovery queries: $x isa politician, $x isa scientist
• connectedness queries: query graph over Thomas Mann, Goethe and German novelist, with isa edges and the path wildcard *
• complex queries (with regular expressions): $x isa scientist, $x inField computer science, $x wonPrize $p, $x (worksAt | graduatedFrom) $u, $u isa university, $u locatedIn* Germany
• queries over reified facts: $c isa city, ($c capitalOf Germany) validIn 1988

Search Results Without Ranking
q: Fisher isa scientist, Fisher isa $x
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = alumnus_109165182
$@Fisher = Irving_Fisher, $@scientist = scientist_109871938, $X = social_scientist_109927304
$@Fisher = James_Fisher, $@scientist = scientist_109871938, $X = ornithologist_109711173
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = theorist_110008610
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = colleague_109301221
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = organism_100003226
…
Supporting facts:
mathematician_109635652 --subClassOf--> scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge --subClassOf--> alumnus_109165182
"Fisher" --familyNameOf--> Ronald_Fisher
Ronald_Fisher --type--> Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher --type--> 20th_century_mathematicians
"scientist" --means--> scientist_109871938

Ranking with Statistical Language Model
q: Fisher isa scientist, Fisher isa $x
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = mathematician_109635652
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = statistician_109958989
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = president_109787431
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = geneticist_109475749
$@Fisher = Ronald_Fisher, $@scientist = scientist_109871938, $X = scientist_109871938
…
Score: 7.184462521168058E-13
mathematician_109635652 --subClassOf--> scientist_109871938
"Fisher" --familyNameOf--> Ronald_Fisher
Ronald_Fisher --type--> 20th_century_mathematicians
"scientist" --means--> scientist_109871938
20th_century_mathematicians --subClassOf--> mathematician_109635652
→ statistical language model for result graphs
Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/

Ranking Factors
• Confidence: prefer results that are likely to be correct
  • Certainty of IE
  • Authenticity and Authority of Sources
  Ex.: bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
       livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
• Informativeness: prefer results that are likely important; may prefer results that are likely new to user
  • Frequency in answer
  • Frequency in corpus (e.g. Web)
  • Frequency in query log
  Ex.: q: isa (Einstein, $y): isa (Einstein, scientist) vs. isa (Einstein, vegetarian)
       q: isa ($x, vegetarian): isa (Einstein, vegetarian) vs. isa (Al Nobody, vegetarian)
• Compactness: prefer results that are tightly connected
  • Size of answer graph
  [Figure: two answer graphs connecting Einstein and Bohr, one through Nobel Prize (won edges), the other through vegetarian, Tom Cruise and 1962 (isa, bornIn, diedIn edges).]

NAGA Ranking Model
Following the paradigm of statistical language models (used in speech recognition and modern IR), applied to graphs
• For query q with fact templates q1 … qn, e.g. bornIn ($x, Frankfurt),
  rank result graphs g with facts g1 … gn, e.g. bornIn (Goethe, Frankfurt),
  by decreasing likelihoods, using a generative mixture model
• Annotations on the likelihood formula: background model; reflects informativeness; weights subqueries
  Ex.: bornIn ($x, Germany) & wonAward ($x, Nobel)
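The likelihood formula itself is not preserved in the text above. As a rough illustration of how a generative mixture with a background model can rank result graphs, the Python sketch below assumes the generic smoothed form used in language-model retrieval, i.e. a product over subqueries of (1 - alpha) * P[qi | gi] + alpha * P[qi]. The exact form, the parameter alpha and all probabilities are assumptions for illustration, not values from NAGA.

```python
import math


def log_likelihood(fact_probs, background_probs, alpha=0.5):
    """Log of prod_i ((1 - alpha) * P[q_i | g_i] + alpha * P[q_i]).

    fact_probs[i]       ~ P[q_i | g_i], how well fact g_i answers template q_i
    background_probs[i] ~ P[q_i], background weight of subquery q_i
    alpha               ~ mixture weight (hypothetical default)
    """
    if len(fact_probs) != len(background_probs):
        raise ValueError("need one background probability per fact template")
    return sum(
        math.log((1 - alpha) * p + alpha * b)
        for p, b in zip(fact_probs, background_probs)
    )


def rank(candidates, background_probs, alpha=0.5):
    """Order candidate result graphs by decreasing likelihood."""
    return sorted(
        candidates,
        key=lambda name: log_likelihood(candidates[name], background_probs, alpha),
        reverse=True,
    )
```

Working in log space avoids underflow for scores as small as the 7.18E-13 seen in ranked NAGA output; the ordering is the same as for the raw product.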

NAGA Ranking Model: Informativeness
Estimate P[qi | gi] for qi = (x*, r, z) with var x* (analogously for other cases)
• Ex.: bornIn ($x, Frankfurt): bornIn (GW, Frankfurt) vs. bornIn (Goethe, Frankfurt)
• Ex.: isa (Einstein, $z): isa (Einstein, physicist) vs. isa (Einstein, vegetarian)
Estimate on knowledge graph
Estimate on Web (exploit redundancy)
freq (Einstein, isa, physicist) vs. freq (Einstein, isa, vegetarian)
[Figure: Albert Einstein with isa edges to physicist and vegetarian.]
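A frequency-based estimate of this kind can be sketched in Python as a relative frequency: the count of the fully bound fact divided by the total count of all facts that agree with it everywhere except at the variable's position. The normalization and the witness counts below are made-up assumptions for illustration; the slide does not preserve the actual estimator.

```python
from collections import Counter

# Hypothetical witness counts (e.g. number of pages stating each fact).
witness_counts = Counter({
    ("Einstein", "isa", "physicist"): 90,
    ("Einstein", "isa", "vegetarian"): 10,
    ("Al_Nobody", "isa", "vegetarian"): 1,
})


def informativeness(counts, fact, variable_position):
    """Relative frequency of `fact` among facts sharing its bound positions.

    variable_position is 0, 1 or 2: the slot that was a variable in the query.
    """
    total = sum(
        count
        for other, count in counts.items()
        if all(other[i] == fact[i] for i in range(3) if i != variable_position)
    )
    return counts[fact] / total if total else 0.0
```

For the query isa (Einstein, $z), informativeness(witness_counts, ("Einstein", "isa", "physicist"), 2) exceeds the value for vegetarian, whereas for isa ($x, vegetarian) the fact about Einstein outranks the one about Al_Nobody.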

User Study for Quality Assessment (1)
Benchmark:
• 55 queries from TREC QA 2005/2006
  Examples: 1) In what country is Luxor? 2) Discoveries of the 20th Century?
• 12 queries from work on SphereSearch
  Examples: 1) In which movies did a governor act? 2) First name of politician Rice?
• 18 regular expression queries by us
  Example: What do Albert Einstein and Niels Bohr have in common?
Competitors: NAGA vs. Google, Yahoo! Answers, BANKS (IIT Bombay), START (MIT)

User Study for Quality Assessment (2)
Quality Measures:
• Precision@1
• NDCG: normalized discounted cumulative gain, based on ratings highly relevant (2), somewhat relevant (1), irrelevant (0)
• with Wilson confidence intervals at α = 0.95
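Both measures are easy to compute; the Python sketch below shows one standard formulation of each. It assumes the linear-gain variant of DCG (rating divided by log2 of rank + 1) and a normal quantile z = 1.96 for a 95% Wilson interval; which exact variants the study used is not stated here.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(ratings, start=1))


def ndcg(ratings):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 ~ 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)
```

Here ratings is the list of 2/1/0 judgments in the order the system returned its answers, and wilson_interval can be applied to, say, the number of queries with a relevant top answer out of all queries.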

Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting (YAGO)
✓ Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion

Why RDF? Why a New Engine?
[Figure: RDF graph around Marie Curie: bornAs Maria Sklodowska, bornOn 1867, diedOn 1934, bornIn Warsaw (inCountry Poland), Alma Mater U Paris, advisor Henri Becquerel (bornOn 1852, diedOn 1908), marriedTo Pierre Curie, wonAward edges to Nobel Prize Chemistry and Nobel Prize Physics.]
• RDF triples (subject, property/predicate, value/object):
  (id1, Name, "Marie Curie"), (id1, bornAs, "Maria Sklodowska"), (id1, bornOn, 1867),
  (id1, bornIn, id2), (id2, Name, "Warsaw"), (id2, locatedIn, id3), (id3, Name, "Poland"),
  (id1, marriedTo, id4), (id4, Name, "Pierre Curie"), (id1, wonAward, id5), (id4, wonAward, id5), …
• pay-as-you-go: schema-agnostic or schema later
• RDF triples form fine-grained (ER) graph
• queries bound to need many star-joins and long chain-joins
• physical design critical, but hardly predictable workload

SPARQL Query Language
SPJ combinations of triple patterns
Ex.: Select ?c Where { ?p isa scientist . ?p bornIn ?t . ?p hasWon ?a . ?t inCountry ?c . ?a Name NobelPrize }
options for filter predicates, duplicate handling, wildcard join, etc.
Ex.: Select Distinct ?c Where { ?p ?r1 ?t . ?t ?r2 ?c . ?c isa <country> . ?p bornOn ?b . Filter (?b > 1945) }
support for RDFS: types
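What "SPJ combinations of triple patterns" means operationally can be shown with a naive Python evaluator: each pattern is matched against every triple, and variable bindings are extended pattern by pattern, which amounts to a nested-loop join. This is a didactic sketch only, not how a real SPARQL engine evaluates queries; the toy triples (including the entity names Al_Nobody and award1) are hypothetical.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check consistency with an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # Constant: must match exactly.
            return None
    return extended


def evaluate(patterns, triples):
    """Join all triple patterns; returns a list of variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


# Toy data (hypothetical), shaped like the first example query.
TRIPLES = [
    ("Marie_Curie", "isa", "scientist"),
    ("Marie_Curie", "bornIn", "Warsaw"),
    ("Marie_Curie", "hasWon", "award1"),
    ("Warsaw", "inCountry", "Poland"),
    ("award1", "Name", "NobelPrize"),
    ("Al_Nobody", "isa", "scientist"),
    ("Al_Nobody", "bornIn", "Warsaw"),
]

QUERY = [
    ("?p", "isa", "scientist"),
    ("?p", "bornIn", "?t"),
    ("?p", "hasWon", "?a"),
    ("?t", "inCountry", "?c"),
    ("?a", "Name", "NobelPrize"),
]
```

Projecting the results of evaluate(QUERY, TRIPLES) onto ?c corresponds to the Select ?c clause. The quadratic cost of this naive join is exactly what index-based engines are designed to avoid.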

RDF & SPARQL Engines
choice of physical design is crucial
• giant triples table (S, P, O):
  id1   Name      Marie Curie
  id1   bornOn    1867
  id1   bornIn    id2
  id2   Name      Warsaw
  id2   Country   id11
  id1   Advisor   id5
  …
• clustered property tables (+ leftover table):
  Person (S, Name, bornOn, bornIn, …): (id1, Marie C, 1867, id2), (id2, Henri B, 1852, id9), …
  Town (S, Name, Country): (id2, Warsaw, id11), …
• (vert. partitioned) property tables:
  bornOn (S, O): (id1, 1867), (id5, 1852), …
  Advisor (S, O): (id1, id5), …
Systems: SESAME / OpenRDF, YARS2 (DERI), Jena (HP Labs), Oracle RDF_MATCH, C-Store (MIT), MonetDB (CWI)
column stores + physical design wizard! + materialized views

RDF-3X: a RISC-style Engine [T. Neumann, G. Weikum: VLDB 2008]
Design rationale:
• RDF-specific engine (not an RXORDBMS)
• Simplify operations
• Reduce implementation choices
• Optimize for common case
• Eliminate tuning knobs
Key principles:
• Mapping dictionary for encoding all literals into ids
• Exhaustive indexing of id triples
• Index-only store, high compression
• QP mostly merge joins with order-preservation
• Very fast DP-based query optimizer
• Frequent-paths synopses, property-value histograms

RDF-3X Indexing
• index all collation orders of subject-property-object id triples: SPO, SOP, OSP, OPS, PSO, POS
  • directly stored in clustered B+ trees
  • high compression → indexes < original data
  • can choose any order for scan & join
• additionally index count-aggregated projections in all orders: SP, SO, OS, OP, PS, PO, with counter for each entry
  • enables efficient bookkeeping for duplicates
• also index projections S, P, O with count-aggregation
• also need two mapping indexes: literal → id, id → literal
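The indexing scheme can be mimicked in a few lines of Python. In this sketch, sorted in-memory lists stand in for the compressed clustered B+ trees, a dict/list pair stands in for the two mapping indexes, and Counters stand in for the count-aggregated two-column projections. The class TripleStore and its method names are hypothetical and only illustrate the idea.

```python
from bisect import bisect_left
from collections import Counter
from itertools import permutations


class TripleStore:
    def __init__(self, triples):
        self.to_id = {}        # mapping index: literal -> id
        self.to_literal = []   # mapping index: id -> literal
        encoded = [tuple(self._encode(term) for term in t) for t in triples]
        # One sorted index per collation order: SPO, SOP, PSO, POS, OSP, OPS.
        self.indexes = {}
        for order in permutations("SPO"):
            positions = ["SPO".index(column) for column in order]
            rows = {tuple(t[p] for p in positions) for t in encoded}
            self.indexes["".join(order)] = sorted(rows)
        # Count-aggregated projections SP, SO, PS, PO, OS, OP.
        self.pair_counts = {
            name[:2]: Counter(row[:2] for row in rows)
            for name, rows in self.indexes.items()
        }

    def _encode(self, term):
        if term not in self.to_id:
            self.to_id[term] = len(self.to_literal)
            self.to_literal.append(term)
        return self.to_id[term]

    def scan(self, order, prefix):
        """Range scan: all triples in `order` that start with `prefix`."""
        ids = tuple(self.to_id[term] for term in prefix)
        index = self.indexes[order]
        start = bisect_left(index, ids)  # first row >= prefix
        result = []
        for row in index[start:]:
            if row[:len(ids)] != ids:
                break
            result.append(tuple(self.to_literal[i] for i in row))
        return result
```

Because every collation order is available, any triple pattern with bound constants can be answered by a prefix range scan on a suitable index, and the scan output is already sorted on the remaining columns. Looking up a prefix term that was never loaded raises KeyError in this sketch.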

RDF-3X Query Optimization
Principles:
• optimizing join orders is key (star joins, long join chains)
• should exploit exhaustive indexes and order-preservation
• support merge-joins and hash-joins
Bottom-up dynamic programming for exhaustive plan enumeration (< 100ms for 20 joins)
Cost model based on selectivity estimation from
• histograms for each of the 6 SPO orderings (approx. equi-depth)
• frequent join paths (property sequences) for stars and chains
[Figure: example query graph over variables ?x1 … ?x6 with properties p1 … p5 and constants v1, v4, v6 and a1, a4, a6.]
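Why sorted indexes and order-preservation matter becomes clear from a merge join, which needs both inputs sorted on the join key and produces output in the same order. The Python sketch below is a textbook sort-merge join over (key, value) pairs, written for illustration; it is not RDF-3X code, and the sample ids and values are hypothetical.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs, each sorted by key.

    Returns (key, left_value, right_value) tuples, still sorted by key.
    """
    output = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on both sides, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    output.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return output
```

For a star join on a subject variable, the left input could be a scan of one property's (subject, object) pairs and the right input a scan of another's, both already sorted by subject id; the sorted output can feed the next merge join without re-sorting.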

Experimental Evaluation: Setup
Setup and competitors:
• 2GHz dual core, 2 GB RAM, 30MB/s disk, Linux
• column-store property tables by Abadi et al., using MonetDB
• triples store with SPO, POS, PSO indexes, using PostgreSQL
Datasets:
1) Barton library catalog: 51 Mio. triples (4.1 GB)
2) YAGO knowledge base: 40 Mio. triples (3.1 GB)
3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB)
Benchmark queries (7 or 8 per dataset) in the spirit of:
1) counts of French library items (books, music, etc.), with creator, publisher, language, etc.
2) scientist from Poland with French advisor who both won awards
3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...
Example: Select ?t Where { ?b hasTitle ?t . ?u romance ?b . ?u love ?b . ?u mystery ?b . ?u suspense ?b . ?u crimeNovel ?c . ?u hasFriend ?f . ?f ... }

Experimental Evaluation: Results
DB sizes [GB]:
              Barton          Yago           LibThing
RDF-3X        2.8             2.7            1.6
MonetDB       1.6-2.0         1.1-2.4        0.7-6.9
PostgreSQL    8.7             7.5            5.7
DB load times [min]:
              Barton          Yago           LibThing
RDF-3X        13              25             20
MonetDB       11              21             4
PostgreSQL    30              25             20
Geometric means for query run-times [sec] for warm (cold) cache:
              Barton          Yago           LibThing
RDF-3X        0.4 (5.9)       0.04 (0.7)     0.13 (0.89)
MonetDB       3.8 (26.4)      54.6 (78.2)    4.39 (8.16)
PostgreSQL    64.3 (167.8)    0.56 (10.6)    30.4 (93.9)

Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting (YAGO)
✓ Ranking for Search over Entity-Relation Graphs (NAGA)
✓ Efficient Query Processing (RDF-3X)
• Conclusion

Summary & Outlook
lift world's best information sources (Wikipedia, Web, Web 2.0) to the level of explicit knowledge (ER-oriented facts)
1) building knowledge graphs: combine semantic & statistical & social IE sources (for scholarly Web, digital libraries, enterprise know-how); challenges in consistency vs. uncertainty, long-term evolution
2) heterogeneity & uncertain IE necessitate ranking: new ranking models (e.g. statistical LM for graphs)
3) efficiency and scalability: challenges for search & ranking (top-k queries) and updates

Thank You !
