
Joint work with Georgiana Ifrim, Gjergji Kasneci, Thomas Neumann, Maya Ramanath, Fabian Suchanek

Vision

Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base

  • Approach:
  • 1) harvest and combine
    • hand-crafted knowledge sources (Semantic Web, ontologies)
    • automatic knowledge extraction (Statistical Web, text mining)
    • social communities and human computing (Social Web, Web 2.0)
  • 2) express knowledge queries, search, and rank
  • 3) everything efficient and scalable
Why Google and Wikipedia Are Not Enough

Answer "knowledge queries" such as:

  • proteins that inhibit proteases and other human enzymes
  • connection between Thomas Mann and Goethe
  • German Nobel prize winner who survived both world wars and all of his four children
  • German universities with world-class computer scientists
  • politicians who are also scientists

Why Google and Wikipedia Are Not Enough

Which politicians are also scientists?

  • What is lacking?
  • Information is not Knowledge.
  • Knowledge is not Wisdom.
  • Wisdom is not Truth.
  • Truth is not Beauty.
  • Beauty is not Music.
  • Music is the best.
  • (Frank Zappa)
  • extract facts from Web pages
  • capture user intention by concepts, entities, relations


NAGA Example

Query:

$x isa politician

$x isa scientist

Results:

Benjamin Franklin

Paul Wolfowitz

Angela Merkel

Related Work

Areas: information extraction & ontology building; Web entity search & QA; semistructured IR & graph search

Systems named on the slide: Cimple, DBlife, Libra, TextRunner, START, Answers, Avatar, UIMA, Hakia, Powerset, Freebase, EntityRank, Cyc, DBpedia, TopX, XQ-FT, Yago, Naga, Tijah, SPARQL, DBexplorer, Banks, SWSE

Outline

Motivation

Information Extraction & Knowledge Harvesting (YAGO)

Ranking for Search over Entity-Relation Graphs (NAGA)

Efficient Query Processing (RDF-3X)

Conclusion

Information Extraction (IE): Text to Records

Person            BirthDate     BirthPlace   ...
Max Planck        4/23, 1858    Kiel
Albert Einstein   3/14, 1879    Ulm
Mahatma Gandhi    10/2, 1869    Porbandar

Person            ScientificResult
Max Planck        Quantum Theory

Person            Collaborator
Max Planck        Albert Einstein
Max Planck        Niels Bohr

Constant            Value             Dimension
Planck's constant   6.626 × 10^-34    Js

  • extracted facts often have confidence < 1
    • DB with uncertainty (probabilistic DB)
  • expensive and error-prone
  • combine NLP, pattern matching, lexicons, statistical learning
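As a rough illustration of the pattern-matching ingredient, the sketch below turns sentences into fact records that carry a confidence below 1. The two surface patterns, their confidence values, and the names `Fact`, `PATTERNS`, and `extract_facts` are assumptions made for this example; a real extractor would combine many more signals, as the slide notes.

```python
import re
from typing import NamedTuple


class Fact(NamedTuple):
    relation: str
    subject: str
    obj: str
    confidence: float


# A capitalized name of one or more words, e.g. "Max Planck".
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"

# Hypothetical surface patterns with hand-assigned confidences (illustrative values).
PATTERNS = [
    ("bornIn", re.compile(rf"(?P<s>{NAME}) was born in (?P<o>[A-Z]\w+)"), 0.9),
    ("bornIn", re.compile(rf"(?P<s>{NAME}), a native of (?P<o>[A-Z]\w+)"), 0.6),
]


def extract_facts(text: str) -> list[Fact]:
    """Apply every pattern to the text and emit one uncertain fact per match."""
    facts = []
    for relation, pattern, confidence in PATTERNS:
        for m in pattern.finditer(text):
            facts.append(Fact(relation, m.group("s"), m.group("o"), confidence))
    return facts


facts = extract_facts(
    "Max Planck was born in Kiel. Albert Einstein, a native of Ulm, moved to Bern."
)
for fact in facts:
    print(fact)
```

Because each record keeps its confidence, downstream storage and ranking can treat the result as a database with uncertainty rather than as ground truth.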

High-Quality Knowledge Sources

General-purpose ontologies and thesauri: WordNet family

  • 200 000 concepts and relations;
  • can be cast into description logics or a graph, with weights for relation strengths (derived from co-occurrence statistics)

scientist, man of science

(a person with advanced knowledge)

=> cosmographer, cosmographist

=> biologist, life scientist

=> chemist

=> cognitive scientist

=> computer scientist

...

=> principal investigator, PI

HAS INSTANCE => Bacon, Roger Bacon

Exploit Hand-Crafted Knowledge

Wikipedia, WordNet, and other lexical sources

{{Infobox_Scientist

| name = Max Planck

| birth_date = [[April 23]], [[1858]]

| birth_place = [[Kiel]], [[Germany]]

| death_date = [[October 4]], [[1947]]

| death_place = [[Göttingen]], [[Germany]]

| residence = [[Germany]]

| nationality = [[Germany|German]]

| field = [[Physicist]]

| work_institution = [[University of Kiel]]</br>

[[Humboldt-Universität zu Berlin]]</br>

[[Georg-August-Universität Göttingen]]

| alma_mater = [[Ludwig-Maximilians-Universität München]]

| doctoral_advisor = [[Philipp von Jolly]]

| doctoral_students =

[[Gustav Ludwig Hertz]]</br>

| known_for = [[Planck's constant]],

[[Quantum mechanics|quantum theory]]

| prizes = [[Nobel Prize in Physics]] (1918)
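To make the idea of exploiting infobox templates concrete, here is a minimal sketch that reads attribute lines like the ones above and emits one triple per wiki link. It is not YAGO's extractor: the function name `infobox_to_triples`, the regular expression, and the choice to keep raw attribute names as relations are assumptions for illustration, and the sample text is a shortened copy of the infobox.

```python
import re

INFOBOX = """\
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| nationality = [[Germany|German]]
| prizes = [[Nobel Prize in Physics]] (1918)
"""

# Matches [[Target]] and [[Target|label]]; captures only the link target.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def infobox_to_triples(subject, text):
    """Turn '| key = value' lines into (subject, key, target) triples."""
    triples = []
    for line in text.splitlines():
        if not line.startswith("|") or "=" not in line:
            continue
        key, value = line[1:].split("=", 1)
        # Prefer link targets; fall back to the plain text value.
        targets = LINK.findall(value) or [value.strip()]
        for target in targets:
            if target:
                triples.append((subject, key.strip(), target))
    return triples


triples = infobox_to_triples("Max_Planck", INFOBOX)
for triple in triples:
    print(triple)
```

A real pipeline would still have to map attribute names such as `birth_place` onto relations and resolve the link targets to entities.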


YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, G. Weikum: WWW'07]
  • Turn Wikipedia into explicit knowledge base (semantic DB); keep source pages as witnesses
  • Exploit hand-crafted categories and infobox templates
  • Represent facts as explicit knowledge triples:
    • relation (entity1, entity2)
    • (in FOL, compatible with RDF, OWL-lite, XML, etc.)
  • Map (and disambiguate) relations into WordNet concept DAG

Examples:

bornIn (Max_Planck, Kiel)

isInstanceOf (Kiel, City)

YAGO Knowledge Base [F. Suchanek et al.: WWW'07]

              Entities    Facts
KnowItAll                 30 000
SUMO          20 000      60 000
WordNet       120 000     80 000
Cyc           300 000     5 Mio.
TextRunner    n/a         8 Mio.
YAGO          1.7 Mio.    15 Mio.
DBpedia       1.9 Mio.    103 Mio.
Freebase      ???         ???

Accuracy ≈ 95%

[Figure: excerpt of the YAGO graph in three layers. Concepts: Entity with subclasses Person and Location; further subclass edges lead to Scientist, Biologist, Physicist, City, and Country. Individuals: Max_Planck, Erwin_Planck, Kiel, and Nobel Prize, connected by instanceOf, bornIn, hasWon, FatherOf, bornOn (April 23, 1858), and diedOn (October 4, 1947). Words: "Max Karl Ernst Ludwig Planck", "Dr. Planck", and "Max Planck", each connected by means.]

Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/

Wikipedia Harvesting: Difficulties & Solutions
  • instanceOf relation: misleading and difficult category names
    • ("disputed articles", "particle physics", "American Music of the 20th Century", "Nobel laureates in physics", "naturalized citizens of the United States", … )
    • → noun group parser: ignore when head word in singular
  • isA relation: mapping categories onto WordNet classes:
    • "Nobel laureates in physics" → Nobel_laureates, "people from Kiel" → person
    • → map to (singular of) head; exploit synsets and statistics
  • Entity name ambiguities:
    • "St. Petersburg", "Saint Petersburg", "M31", "NGC224" → means ...
    • → exploit Wikipedia redirects & disambiguations, WN synsets
  • type checking for scrutinizing candidates:
    • accept fact candidate only if arguments have proper classes
    • marriedTo (Max Planck, quantum physics) ∉ Person × Person
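The type-checking step can be sketched as a signature lookup plus a walk up the class hierarchy. The hierarchy, the type assignments, the `Discipline` class, and the relation signatures below are toy data invented for this example, not YAGO content.

```python
# Toy class hierarchy (child -> parent) and entity types; illustrative only.
SUBCLASS_OF = {"Physicist": "Scientist", "Scientist": "Person", "City": "Location"}
TYPE_OF = {"Max_Planck": "Physicist", "Kiel": "City", "quantum_physics": "Discipline"}

# Hypothetical relation signatures: relation -> (domain class, range class).
SIGNATURE = {"marriedTo": ("Person", "Person"), "bornIn": ("Person", "Location")}


def is_a(entity, cls):
    """True if the entity's class equals cls or is a transitive subclass of it."""
    current = TYPE_OF.get(entity)
    while current is not None:
        if current == cls:
            return True
        current = SUBCLASS_OF.get(current)
    return False


def accept(relation, arg1, arg2):
    """Accept a fact candidate only if both arguments have the proper classes."""
    domain, range_ = SIGNATURE[relation]
    return is_a(arg1, domain) and is_a(arg2, range_)


print(accept("bornIn", "Max_Planck", "Kiel"))
print(accept("marriedTo", "Max_Planck", "quantum_physics"))
```

The second candidate is rejected because `quantum_physics` never reaches `Person` in the hierarchy, which mirrors the marriedTo example on the slide.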
Higher-Order Facts in YAGO

facts about facts represented by reification as first-order facts

Example:

e314159: CapitalOf (Berlin, Germany)

validIn (e314159, 1990-2008)

[Figure: CapitalOf (Bonn, Germany) validIn 1949-1989; CapitalOf (Berlin, Germany) validIn 1990-2008; instanceOf (Arnold Schwarzenegger, Actor) validIn 1987-2008; instanceOf (Arnold Schwarzenegger, Politician) validIn 2003-2008]
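Reification can be sketched by giving every fact an identifier and letting other facts use that identifier as an argument. The `FactStore` class, its generated ids, and the helper `capitals_in` are hypothetical names for this example; the intervals follow the CapitalOf example above, with the Bonn interval assigned from the figure labels.

```python
from itertools import count


class FactStore:
    """Stores facts as id -> (relation, arg1, arg2); ids can be arguments of other facts."""

    def __init__(self):
        self._ids = count(1)
        self.facts = {}

    def add(self, relation, arg1, arg2):
        fact_id = f"e{next(self._ids)}"
        self.facts[fact_id] = (relation, arg1, arg2)
        return fact_id


store = FactStore()
berlin = store.add("CapitalOf", "Berlin", "Germany")
bonn = store.add("CapitalOf", "Bonn", "Germany")
# Facts about facts: the first argument is the id of another fact.
store.add("validIn", berlin, (1990, 2008))
store.add("validIn", bonn, (1949, 1989))


def capitals_in(store, country, year):
    """Cities c such that CapitalOf(c, country) has a validIn interval covering year."""
    result = []
    for relation, fact_id, interval in store.facts.values():
        if relation == "validIn" and interval[0] <= year <= interval[1]:
            base_relation, city, base_country = store.facts[fact_id]
            if base_relation == "CapitalOf" and base_country == country:
                result.append(city)
    return result


print(capitals_in(store, "Germany", 1988))
print(capitals_in(store, "Germany", 1995))
```

Since validIn is stored like any other fact, the store needs no special machinery for higher-order statements.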

Ongoing Work: YAGO for Easier IE

[Figure: constituent parses (NP, VP, PP) and link-grammar linkages (Mp, Js, Os, AN, Ss, MVp, DMc, Dg, Jp, Sp, Ds) for the two sentences below]

Cologne lies on the banks of the Rhine

People in Cairo like wine from the Rhine valley

YAGO knows (almost) all (interesting) entities

→ leverage for discovering & extracting new facts in NL texts

IE with dependency parser is expensive!

Example: The city of Paris was founded on an island in the Seine in 300 BC

[Figure: Paris isa city; Seine isa river; runsThrough (Seine, Paris); locatedIn edges among Paris, Seine, France, and Europe]

  • can filter out many uninteresting sentences
  • can quickly identify relation arguments
  • can eliminate many fact candidates by type checking
  • can focus on specific properties like time
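A small sketch of how known entities can act as a cheap pre-filter before expensive parsing: only sentences that mention a known city and a known river survive as candidates for a city/river relation. The lexicon `ENTITY_CLASS` and the function `candidate_pairs` are stand-ins invented for this example, not part of YAGO.

```python
import re

# Toy lexicon mapping surface names to classes; a stand-in for an entity dictionary.
ENTITY_CLASS = {
    "Paris": "city", "Cologne": "city", "Cairo": "city",
    "Seine": "river", "Rhine": "river",
}


def candidate_pairs(sentence, class1, class2):
    """Return (e1, e2) pairs of known entities in the sentence with the requested classes."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    found = [t for t in tokens if t in ENTITY_CLASS]
    return [
        (a, b)
        for a in found
        for b in found
        if ENTITY_CLASS[a] == class1 and ENTITY_CLASS[b] == class2
    ]


sentences = [
    "Cologne lies on the banks of the Rhine",
    "People in Cairo like wine from the Rhine valley",
    "IE with dependency parser is expensive",
]
for s in sentences:
    print(candidate_pairs(s, "city", "river"), "<-", s)
```

The third sentence is dropped without any parsing. The second still yields a candidate pair, so deeper analysis is needed only for the sentences that pass this filter.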


NAGA: Graph Search [G. Kasneci et al.: ICDE'08]

Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness

  • discovery queries: $x isa politician, $x isa scientist
  • connectedness queries: Thomas Mann * Goethe, with an isa edge to German novelist
  • complex queries (with regular expressions): $x isa scientist, $x inField computer science, $x wonPrize $p, $x worksAt | graduatedFrom $u, $u isa university, $u locatedIn* Germany
  • queries over reified facts: $c isa city, ($c capitalOf Germany) validIn 1988

Search Results Without Ranking

q: Fisher isa scientist
   Fisher isa $x

$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = alumnus_109165182

$@Fisher = Irving_Fisher
$@scientist = scientist_109871938
$X = social_scientist_109927304

$@Fisher = James_Fisher
$@scientist = scientist_109871938
$X = ornithologist_109711173

$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = theorist_110008610

$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = colleague_109301221

$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = organism_100003226

mathematician_109635652 --subClassOf--> scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge --subClassOf--> alumnus_109165182
"Fisher" --familyNameOf--> Ronald_Fisher
Ronald_Fisher --type--> Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher --type--> 20th_century_mathematicians
"scientist" --means--> scientist_109871938

Ranking with Statistical Language Model

q: Fisher isa scientist
   Fisher isa $x

$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = mathematician_109635652

$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = statistician_109958989

$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = president_109787431

$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = geneticist_109475749

$@Fisher = Ronald_Fisher
$@scientist = scientist_109871938
$X = scientist_109871938

Score: 7.184462521168058E-13
mathematician_109635652 --subClassOf--> scientist_109871938
"Fisher" --familyNameOf--> Ronald_Fisher
Ronald_Fisher --type--> 20th_century_mathematicians
"scientist" --means--> scientist_109871938
20th_century_mathematicians --subClassOf--> mathematician_109635652

→ statistical language model for result graphs

Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/

Ranking Factors
  • Confidence:
  • Prefer results that are likely to be correct
    • Certainty of IE
    • Authenticity and Authority of Sources
  • bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
  • livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
  • Informativeness:
  • Prefer results that are likely important
  • May prefer results that are likely new to user
    • Frequency in answer
    • Frequency in corpus (e.g. Web)
    • Frequency in query log
  • q: isa (Einstein, $y): isa (Einstein, scientist), isa (Einstein, vegetarian)
  • q: isa ($x, vegetarian): isa (Einstein, vegetarian), isa (Al Nobody, vegetarian)
  • Compactness:
  • Prefer results that are tightly connected
    • Size of answer graph
  • [Figure: answer graph over Einstein, Bohr, Tom Cruise, vegetarian, Nobel Prize, and 1962, with edges isa, bornIn, won, and diedIn]

NAGA Ranking Model

Following the paradigm of statistical language models (used in speech recognition and modern IR), applied to graphs

For query q with fact templates q1 … qn, e.g. bornIn ($x, Frankfurt),

rank result graphs g with facts g1 … gn, e.g. bornIn (Goethe, Frankfurt),

by decreasing likelihoods, using a generative mixture model

[Formula annotations: background model; reflect informativeness; weights subqueries]

Ex.: bornIn ($x, Germany) &

wonAward ($x, Nobel)
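The likelihood formula itself did not survive in this transcript. The sketch below therefore assumes the linear-interpolation form that is common in language-model retrieval, where each fact template contributes (1 - alpha) * P(q_i | g_i) + alpha * P(q_i) and the contributions are multiplied. The function names, the value of `alpha`, and all probabilities are made up for illustration and should not be read as NAGA's actual parameters.

```python
import math


def graph_log_likelihood(fact_probs, background_probs, alpha=0.5):
    """log of prod_i [(1 - alpha) * P(q_i | g_i) + alpha * P(q_i)]."""
    return sum(
        math.log((1 - alpha) * p + alpha * b)
        for p, b in zip(fact_probs, background_probs)
    )


def rank(results, background_probs, alpha=0.5):
    """Order result graphs by decreasing (log-)likelihood."""
    return sorted(
        results,
        key=lambda name: graph_log_likelihood(results[name], background_probs, alpha),
        reverse=True,
    )


# Two candidate result graphs for a query with two fact templates,
# each described by assumed per-template probabilities P(q_i | g_i).
results = {"g1": [0.8, 0.6], "g2": [0.3, 0.6]}
background = [0.1, 0.1]  # assumed background probabilities P(q_i)

print(rank(results, background))
```

Working in log space avoids underflow when many small probabilities are multiplied; scores of the magnitude shown on the result slide (around 1E-13) are typical for such products.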

NAGA Ranking Model: Informativeness

Estimate P[qi | gi] for qi = (x*, r, z) with var x* (analogously for other cases)

Ex.: bornIn ($x, Frankfurt): bornIn (GW, Frankfurt), bornIn (Goethe, Frankfurt)

Ex.: isa (Einstein, $z): isa (Einstein, physicist), isa (Einstein, vegetarian)

Estimate on knowledge graph:

[Figure: Albert Einstein with isa edges to physicist and vegetarian]

Estimate on Web (exploit redundancy):

freq (Einstein, isa, physicist) vs. freq (Einstein, isa, vegetarian)
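One plausible way to turn such frequencies into an informativeness estimate is a relative frequency: for a template with a variable object, divide the count of the concrete fact by the count of all facts sharing its subject and relation. This estimator, the function name, and the counts are assumptions for illustration; the slide only states that frequencies are compared.

```python
from collections import Counter

# Made-up occurrence counts standing in for Web or knowledge-graph statistics.
freq = Counter({
    ("Einstein", "isa", "physicist"): 900,
    ("Einstein", "isa", "vegetarian"): 30,
    ("Bohr", "isa", "physicist"): 400,
})


def informativeness_of_object(x, r, z):
    """Estimate P(z | x, r) as freq(x, r, z) / freq(x, r, *)."""
    total = sum(n for (s, p, _), n in freq.items() if s == x and p == r)
    return freq[(x, r, z)] / total if total else 0.0


# For the query isa (Einstein, $z), the physicist answer scores far higher.
print(informativeness_of_object("Einstein", "isa", "physicist"))
print(informativeness_of_object("Einstein", "isa", "vegetarian"))
```

The case with a variable subject, as in bornIn ($x, Frankfurt), works analogously by normalizing over all subjects for the given relation and object.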


NAGA Example

Query:

$x isa politician

$x isa scientist

Results:

Benjamin Franklin

Paul Wolfowitz

Angela Merkel

User Study for Quality Assessment (1)

Benchmark:

  • 55 queries from TREC QA 2005/2006
    • Examples: 1) In what country is Luxor? 2) Discoveries of the 20th Century?
  • 12 queries from work on SphereSearch
    • Examples: 1) In which movies did a governor act? 2) First name of politician Rice?
  • 18 regular expression queries by us
    • Example: What do Albert Einstein and Niels Bohr have in common?

Competitors: NAGA vs. Google, Yahoo! Answers, BANKS (IIT Bombay), START (MIT)

User Study for Quality Assessment (2)
  • Quality Measures:
  • Precision@1
  • NDCG: normalized discounted cumulative gain
  • based on ratings highly relevant (2), somewhat relevant (1), irrelevant (0)
  • with Wilson confidence intervals at α = 0.95
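The three measures can be computed as in the sketch below. Several details are assumptions rather than the study's exact setup: any rating above 0 counts as relevant for Precision@1, DCG uses the rating divided by log2(rank + 1), and the Wilson interval uses z = 1.96 for a 95% level.

```python
import math


def precision_at_1(ratings):
    """1.0 if the top-ranked result has a rating above 0, else 0.0."""
    return 1.0 if ratings and ratings[0] > 0 else 0.0


def dcg(ratings):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(ratings, start=1))


def ndcg(ratings):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Ratings of a ranked result list: 2 = highly relevant, 1 = somewhat, 0 = irrelevant.
print(precision_at_1([2, 1, 0]), ndcg([2, 1, 0]), ndcg([0, 1, 2]))
print(wilson_interval(8, 10))
```

The Wilson interval behaves better than the plain normal approximation for the small per-system sample sizes that such a user study produces.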

Why RDF? Why a New Engine?

[Figure: entity-relationship graph around Marie Curie: bornAs Maria Sklodowska, bornOn 1867, diedOn 1934, bornIn Warsaw (inCountry Poland), Alma Mater U Paris, advisor Henri Becquerel (bornOn 1852, diedOn 1908), marriedTo Pierre Curie; wonAward edges link Marie Curie, Pierre Curie, and Henri Becquerel to Nobel Prize Physics and Nobel Prize Chemistry]

  • RDF triples (subject – property/predicate – value/object):
    • (id1, Name, "Marie Curie"), (id1, bornAs, "Maria Sklodowska"), (id1, bornOn, 1867),
    • (id1, bornIn, id2), (id2, Name, "Warsaw"), (id2, locatedIn, id3), (id3, Name, "Poland"),
    • (id1, marriedTo, id4), (id4, Name, "Pierre Curie"), (id1, wonAward, id5), (id4, wonAward, id5), …
  • pay-as-you-go: schema-agnostic or schema later
  • RDF triples form fine-grained (ER) graph
  • queries bound to need many star-joins and long chain-joins
  • physical design critical, but hardly predictable workload
SPARQL Query Language

SPJ combinations of triple patterns

Ex.: Select ?c Where {
  ?p isa scientist . ?p bornIn ?t . ?p hasWon ?a .
  ?t inCountry ?c . ?a Name NobelPrize }

options for filter predicates, duplicate handling, wildcard join, etc.

Ex.: Select Distinct ?c Where {
  ?p ?r1 ?t . ?t ?r2 ?c . ?c isa <country> .
  ?p bornOn ?b . Filter (?b > 1945) }

support for RDFS: types
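To show what evaluating a conjunction of triple patterns involves, here is a naive matcher that extends variable bindings one pattern at a time. It is a sketch of the semantics only, not of how a real engine executes queries; the triples are toy data and `match` and `_unify` are names chosen for this example.

```python
# Toy triples; the ids only loosely follow the slides.
TRIPLES = [
    ("id1", "isa", "scientist"), ("id1", "bornIn", "id2"), ("id1", "hasWon", "id5"),
    ("id2", "inCountry", "Poland"), ("id5", "Name", "NobelPrize"),
    ("id4", "isa", "scientist"), ("id4", "bornIn", "id6"), ("id6", "inCountry", "France"),
]


def _unify(term, value, binding):
    """Bind a variable (a term starting with '?') or compare a constant."""
    if term.startswith("?"):
        return binding.setdefault(term, value) == value
    return term == value


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        if all(_unify(term, value, candidate) for term, value in zip(first, triple)):
            yield from match(rest, triples, candidate)


query = [
    ("?p", "isa", "scientist"), ("?p", "bornIn", "?t"), ("?p", "hasWon", "?a"),
    ("?t", "inCountry", "?c"), ("?a", "Name", "NobelPrize"),
]
countries = [b["?c"] for b in match(query, TRIPLES)]
print(countries)
```

Every shared variable (`?p`, `?t`, `?a`) is effectively a join, which is why such queries turn into star joins and join chains on the triples table.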

RDF & SPARQL Engines

choice of physical design is crucial

giant triples table:

S      P         O
id1    Name      Marie Curie
id1    bornOn    1867
id1    bornIn    id2
id2    Name      Warsaw
id2    Country   id11
id1    Advisor   id5
…      …         …

clustered property tables (+ leftover table):

Person
S      Name      bornOn   bornIn   …
id1    Marie C   1867     id2
id5    Henri B   1852     id9
…      …         …

Town
S      Name      Country
id2    Warsaw    id11
…      …         …

(vert. partitioned) property tables:

bornOn
S      O
id1    1867
id5    1852
…      …

Advisor
S      O
id1    id5
…      …

Engines: SESAME / OpenRDF, YARS2 (DERI), Jena (HP Labs), Oracle RDF_MATCH

column stores: C-Store (MIT), MonetDB (CWI)

+ physical design wizard !

+ materialized views
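The difference between the giant triples table and vertically partitioned property tables can be sketched in a few lines: group the triples by property and keep one (S, O) table per property, sorted by subject. The function name is an assumption, and plain Python lists stand in for database tables.

```python
from collections import defaultdict

# The giant triples table from the slide, as (S, P, O) rows.
triples = [
    ("id1", "Name", "Marie Curie"),
    ("id1", "bornOn", "1867"),
    ("id1", "bornIn", "id2"),
    ("id2", "Name", "Warsaw"),
    ("id2", "Country", "id11"),
    ("id1", "Advisor", "id5"),
]


def vertical_partition(triples):
    """Split a triples table into one two-column (S, O) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    for rows in tables.values():
        rows.sort(key=lambda row: row[0])  # sorted by subject, as column stores favor
    return dict(tables)


tables = vertical_partition(triples)
for prop, rows in tables.items():
    print(prop, rows)
```

A query that touches only `bornOn` and `Advisor` then reads two small tables instead of scanning the whole triples table, at the price of more joins for queries with unbound properties.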

RDF-3X: a RISC-style Engine [T. Neumann, G. Weikum: VLDB 2008]
  • Design rationale:
    • RDF-specific engine (not an RXORDBMS)
    • Simplify operations
    • Reduce implementation choices
    • Optimize for common case
    • Eliminate tuning knobs
  • Key principles:
    • Mapping dictionary for encoding all literals into ids
    • Exhaustive indexing of id triples
    • Index-only store, high compression
    • QP mostly merge joins with order-preservation
    • Very fast DP-based query optimizer
    • Frequent-paths synopses, property-value histograms
RDF-3X Indexing
  • index all collation orders of subject-property-object id triples:
  • SPO, SOP, OSP, OPS, PSO, POS
    • directly stored in clustered B+ trees
    • high compression: indexes < original data
    • can choose any order for scan & join
  • additionally index count-aggregated projections in all orders:
  • SP, SO, OS, OP, PS, PO – with counter for each entry
    • enables efficient bookkeeping for duplicates
    • also index projections S, P, O with count-aggregation
  • also need two mapping indexes: literal → id, id → literal
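The indexing scheme can be mimicked in memory: encode all literals as integer ids, keep one sorted list per collation order, and count-aggregate the two-column projections. Sorted Python lists stand in for the clustered, compressed B+ trees, and the names `build_indexes` and `scan` are assumptions for this sketch.

```python
from bisect import bisect_left
from collections import Counter
from itertools import permutations


def build_indexes(triples):
    """Dictionary-encode triples and build all six orderings plus pair projections."""
    to_id, to_literal = {}, []  # the two mapping indexes

    def encode(term):
        if term not in to_id:
            to_id[term] = len(to_literal)
            to_literal.append(term)
        return to_id[term]

    encoded = [tuple(encode(t) for t in triple) for triple in triples]

    indexes, projections = {}, {}
    for order in permutations("SPO"):  # SPO, SOP, PSO, POS, OSP, OPS
        pos = ["SPO".index(c) for c in order]
        indexes["".join(order)] = sorted(tuple(t[i] for i in pos) for t in encoded)
    for order in permutations("SPO", 2):  # SP, SO, PS, PO, OS, OP with counters
        pos = ["SPO".index(c) for c in order]
        projections["".join(order)] = Counter(tuple(t[i] for i in pos) for t in encoded)
    return to_id, to_literal, indexes, projections


def scan(index, first):
    """All entries whose leading id equals `first`, located by binary search."""
    return index[bisect_left(index, (first,)):bisect_left(index, (first + 1,))]


triples = [
    ("id1", "Name", "Marie Curie"), ("id1", "bornOn", "1867"), ("id1", "bornIn", "id2"),
    ("id2", "Name", "Warsaw"), ("id2", "Country", "id11"), ("id1", "Advisor", "id5"),
]
to_id, to_literal, indexes, projections = build_indexes(triples)

# Range scan on the PSO ordering: all triples with property Name, sorted by subject.
name_rows = scan(indexes["PSO"], to_id["Name"])
print([(to_literal[s], to_literal[o]) for _, s, o in name_rows])
```

Because every ordering exists, any triple pattern with constants can be answered by a single range scan whose output is already sorted on the next component, which is what makes merge joins attractive.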

RDF-3X Query Optimization

  • Principles:
  • optimizing join orders is key (star joins, long join chains)
  • should exploit exhaustive indexes and order-preservation
  • support merge-joins and hash-joins

Bottom-up dynamic programming for exhaustive plan enumeration (< 100ms for 20 joins)

  • Cost model based on selectivity estimation from
  • histograms for each of the 6 SPO orderings (approx. equi-depth)
  • frequent join paths (property sequences) for stars and chains

[Figure: example query graph over variables ?x1 … ?x6 and properties p1 … p5; node labels v1, v4, v6, a1, a4, a6]
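The reason order-preservation matters is that two inputs sorted on the join key can be joined in one linear pass. The sketch below is a textbook merge join over lists of (key, value) pairs, not RDF-3X code; the sample rows reuse the bornOn and Advisor tables with integer subject ids.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs sorted by key; output stays sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on both sides and emit their cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


born_on = [(1, 1867), (5, 1852)]  # (subject id, year), sorted by subject
advisor = [(1, 5)]                # (subject id, advisor id), sorted by subject
print(merge_join(born_on, advisor))
```

Since the output is again sorted on the key, a further merge join on the same variable needs no re-sorting, which is the case a star join around one subject produces.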

Experimental Evaluation: Setup
  • Setup and competitors:
  • 2GHz dual core, 2 GB RAM, 30MB/s disk, Linux
  • column-store property tables by Abadi et al., using MonetDB
  • triples store with SPO, POS, PSO indexes, using PostgreSQL

Datasets:

1) Barton library catalog: 51 Mio. triples (4.1 GB)

2) YAGO knowledge base: 40 Mio. triples (3.1 GB)

3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB)

Benchmark queries (7 or 8 per dataset) in the spirit of:

1) counts of French library items (books, music, etc.), with creator, publisher, language, etc.

2) scientist from Poland with French advisor who both won awards

3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...

Select ?t Where {
  ?b hasTitle ?t .
  ?u romance ?b .
  ?u love ?b .
  ?u mystery ?b .
  ?u suspense ?b .
  ?u crimeNovel ?c .
  ?u hasFriend ?f .
  ?f ... }

Experimental Evaluation: Results

DB sizes [GB]:

             Barton     Yago       LibThing
RDF-3X       2.8        2.7        1.6
MonetDB      1.6-2.0    1.1-2.4    0.7-6.9
PostgreSQL   8.7        7.5        5.7

DB load times [min]:

             Barton     Yago       LibThing
RDF-3X       13         25         20
MonetDB      11         21         4
PostgreSQL   30         25         20

Geometric means for query run-times [sec] for warm (cold) cache:

             Barton          Yago           LibThing
RDF-3X       0.4 (5.9)       0.04 (0.7)     0.13 (0.89)
MonetDB      3.8 (26.4)      54.6 (78.2)    4.39 (8.16)
PostgreSQL   64.3 (167.8)    0.56 (10.6)    30.4 (93.9)
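For reading the run-time table, the short sketch below derives warm-cache slowdown factors relative to RDF-3X and shows how a geometric mean is computed. The run-times are copied from the warm-cache figures above; the helper names are assumptions, and the per-query times behind the reported means are not available here.

```python
import math

# Warm-cache geometric-mean run-times in seconds, copied from the results table.
warm = {
    "RDF-3X":     {"Barton": 0.4,  "Yago": 0.04, "LibThing": 0.13},
    "MonetDB":    {"Barton": 3.8,  "Yago": 54.6, "LibThing": 4.39},
    "PostgreSQL": {"Barton": 64.3, "Yago": 0.56, "LibThing": 30.4},
}


def slowdown(system, dataset, baseline="RDF-3X"):
    """How many times slower `system` is than the baseline on a dataset."""
    return warm[system][dataset] / warm[baseline][dataset]


def geometric_mean(values):
    """n-th root of the product, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


for system in ("MonetDB", "PostgreSQL"):
    factors = {d: round(slowdown(system, d), 1) for d in warm[system]}
    print(system, factors)
```

A geometric mean is the usual choice for such summaries because a single very slow query cannot dominate it the way it would dominate an arithmetic mean.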


Summary & Outlook

lift world's best information sources (Wikipedia, Web, Web 2.0) to the level of explicit knowledge (ER-oriented facts)

1) building knowledge graphs:
combine semantic & statistical & social IE sources
(for scholarly Web, digital libraries, enterprise know-how)
challenges in consistency vs. uncertainty, long-term evolution

2) heterogeneity & uncertain IE necessitate ranking
new ranking models (e.g. statistical LM for graphs)

3) efficiency and scalability challenges
for search & ranking (top-k queries) and updates