

Knowledge Harvesting from Text and Web Sources

Part 2: Search and Ranking of Knowledge

Gerhard Weikum

Max Planck Institute for Informatics

http://www.mpi-inf.mpg.de/~weikum/



The Web Speaks to Us

Source: DB & IR methods for knowledge discovery.
Communications of the ACM 52(4), 2009

  • Web 2012 contains more DB-style data than ever
  • getting better at making structured content explicit:
  • entities, classes (types), relationships
  • but no (hope for) schema here!



Structure Now!

[Figure: IE-enriched Web pages with embedded entities and facts; knowledge bases with facts from Web and IE witnesses; example entities: Paris, Nicolas Sarkozy, Carla Bruni, Bob Dylan, Joan Baez, France, Grammy, Champs Elysees]

Actor / Award:
Christoph Waltz / Oscar
Sandra Bullock / Oscar
Sandra Bullock / Golden Raspberry

Company / CEO:
Google / Eric Schmidt
Yahoo / Overture
Facebook / FriendFeed
Software AG / IDS Scheer

Movie / Reported Revenue:
Avatar / $ 2,718,444,933
The Reader / $ 108,709,522



Distributed Structure: Linking Open Data

30 billion triples, 500 million links

http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png



Distributed Structure: Linking Open Data

[Figure: linked-data graph connecting imdb.com/name/nm0910607/, dbpedia.org/resource/Ennio_Morricone, imdb.com/title/tt0361748/, dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome, data.nytimes.com/51688803696189142301, and geonames.org/3169070/roma (Coord: N 41° 54' 10'' E 12° 29' 2'') via rdf:type, rdfs:subclassOf (yago/wordnet:Artist109812338, yago/wordnet:Actor109765278, yago/wikicategory:ItalianComposer), prop:actedIn, prop:composedMusicFor, dbpprop:citizenOf, and owl:sameAs links]



Entity Search

http://entitycube.research.microsoft.com/



Semantic Search with Entities, Classes, Relationships

properties of an entity:
US president when Barack Obama was born?
Nobel laureate who outlived two world wars and all his children?

instances of classes:
Politicians who are also scientists?
European composers who won the Oscar?
Chinese female astronauts?

relationships:
FIFA 2010 finalists who played in a Champions League final?
German football clubs that won against Real Madrid?

multiple entities:
Commonalities & relationships among:
Li Lianjie, Steve Jobs, Carla Bruni, Plato, Andres Iniesta?

applications:
Enzymes that inhibit HIV?
Antidepressants that interfere with blood-pressure drugs?
German philosophers influenced by William of Ockham?



Quiz Time

Rank the Following Numbers:

25·10^9: # RDF triples in LOD
600·10^6: # Wikipedia links
12·10^6: # followers of Lady Gaga
10^15: # ants on the Earth
80·10^9: # Yuan by Mark Zuckerberg
3·10^9: # Google queries per day



Outline

Motivation

Searching for Entities & Relations

Informative Ranking

Efficient Query Processing

User Interface

Wrap-up

...



RDF: Structure, Diversity, No Schema

SPO triples (statements, facts):

(EnnioMorricone, bornIn, Rome)

(Rome, locatedIn, Italy)

(JavierNavarrete, birthPlace, Teruel)

(Teruel, locatedIn, Spain)

(EnnioMorricone, composed, l‘Arena)

(JavierNavarrete, composerOf, aTale)

(uri1, hasName, EnnioMorricone)

(uri1, bornIn, uri2)

(uri2, hasName, Rome)

(uri2, locatedIn, uri3)

bornIn (EnnioMorricone, Rome)
locatedIn (Rome, Italy)

[Figure: graph view: EnnioMorricone -bornIn-> Rome -locatedIn-> Italy; Rome -instanceOf-> City]

  • SPO triples: Subject – Property/Predicate – Object/Value
  • pay-as-you-go: schema-agnostic or schema later
  • RDF triples form a fine-grained ER graph
  • popular for Linked Data, computational biology (UniProt, KEGG, etc.)
  • open-source engines: Jena, Sesame, RDF-3X, etc.



Facts about Facts

facts:

1: (EnnioMorricone, composed, l‘Arena)
2: (JavierNavarrete, composerOf, aTale)
3: (Berlin, capitalOf, Germany)
4: (Madonna, marriedTo, GuyRitchie)
5: (NicolasSarkozy, marriedTo, CarlaBruni)

temporal facts:

6: (1, inYear, 1968)

7: (2, inYear, 2006)

8: (3, validFrom, 1990)

9: (4, validFrom, 22-Dec-2000)

10: (4, validUntil, Nov-2008)

11: (5, validFrom, 2-Feb-2008)

provenance:

12: (1, witness, http://www.last.fm/music/Ennio+Morricone/)

13: (1, confidence, 0.9)

14: (4, witness, http://en.wikipedia.org/wiki/Guy_Ritchie)

15: (4, witness, http://en.wikipedia.org/wiki/Madonna_(entertainer))

16: (10, witness, http://www.intouchweekly.com/2007/12/post_1.php)

17: (10, confidence, 0.1)

  • temporal annotations, witnesses/sources, confidence, etc.

  • can refer to reified facts via fact identifiers

  • (approx. equivalent to RDF quadruples: Context, Subject, Property, Object)
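Reified facts of this kind fit in an ordinary dictionary; the sketch below (plain Python, data copied from the example above) shows how temporal and provenance annotations reference base facts via fact identifiers:

```python
# Sketch of reification with fact identifiers: base facts get ids, and
# temporal/provenance annotations are themselves triples whose subject
# is a fact id (data follows the example facts above).
facts = {
    1: ("EnnioMorricone", "composed", "l'Arena"),
    4: ("Madonna", "marriedTo", "GuyRitchie"),
    6: (1, "inYear", 1968),
    9: (4, "validFrom", "22-Dec-2000"),
    10: (4, "validUntil", "Nov-2008"),
    16: (10, "witness", "http://www.intouchweekly.com/2007/12/post_1.php"),
}

def annotations(fact_id):
    """All facts whose subject refers to the given fact id."""
    return {i: f for i, f in facts.items() if f[0] == fact_id}

print(annotations(4))  # the temporal scope of fact 4 (the marriage fact)
```

Because annotations are plain facts, they can be annotated in turn (fact 16 is a witness for the temporal fact 10), which is exactly what "facts about facts" means here.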




SPARQL Query Language

SPJ combinations of triple patterns
(triples with S, P, O replaced by variable(s))

Select ?p, ?c Where {
?p instanceOf Composer .
?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe .
?p hasWon ?a . ?a Name AcademyAward . }

Semantics:

return all bindings to variables that match all triple patterns

(subgraphs in RDF graph that are isomorphic to query graph)

+ filter predicates, duplicate handling, RDFS types, etc.

Select Distinct ?c Where {
?p instanceOf Composer .
?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe .
?p hasWon ?a . ?a Name ?n .
?p bornOn ?b . Filter (?b > 1945) . Filter(regex(?n, "Academy")) . }
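The matching semantics (all variable bindings under which every triple pattern occurs in the data graph) can be sketched in a few lines of plain Python; the mini-KB below is hypothetical, and the conjunction is evaluated by a naive nested loop rather than an optimized join:

```python
# Minimal sketch of SPARQL triple-pattern semantics: find all bindings
# such that every pattern, with variables substituted, is a triple of
# the data graph. The tiny KB is made up for illustration.
KB = {
    ("EnnioMorricone", "instanceOf", "Composer"),
    ("EnnioMorricone", "bornIn", "Rome"),
    ("Rome", "inCountry", "Italy"),
    ("Italy", "locatedIn", "Europe"),
}

def is_var(term):
    return term.startswith("?")

def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` equals `triple`."""
    b = dict(binding)
    for p, v in zip(pattern, triple):
        if is_var(p):
            if b.get(p, v) != v:
                return None
            b[p] = v
        elif p != v:
            return None
    return b

def query(patterns):
    bindings = [{}]
    for pat in patterns:  # conjunction = extend bindings pattern by pattern
        bindings = [b2 for b in bindings for t in KB
                    if (b2 := match(pat, t, b)) is not None]
    return bindings

result = query([("?p", "instanceOf", "Composer"),
                ("?p", "bornIn", "?t"),
                ("?t", "inCountry", "?c")])
print(result)  # [{'?p': 'EnnioMorricone', '?t': 'Rome', '?c': 'Italy'}]
```

The subgraph-isomorphism reading of the semantics corresponds to the fact that every returned binding maps the query graph onto a subgraph of the RDF graph.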



Querying the Structured Web

flexible subgraph matching

Structure but no schema: SPARQL well suited

wildcards for properties (relaxed joins):

Select ?p, ?c Where {
?p instanceOf Composer .
?p ?r1 ?t . ?t ?r2 ?c . ?c isa Country . ?c locatedIn Europe . }

Extension: transitive paths [K. Anyanwu et al.: WWW‘07]

Select ?p, ?c Where {
?p instanceOf Composer .
?p ??r ?c . ?c isa Country . ?c locatedIn Europe .
PathFilter(cost(??r) < 5) .
PathFilter(containsAny(??r, ?t)) . ?t isa City . }

Extension: regular expressions [G. Kasneci et al.: ICDE‘08]

Select ?p, ?c Where {

?p instanceOf Composer .

?p (bornIn | livesIn | citizenOf) locatedIn* Europe . }



Querying Facts & Text

Problem: not everything is triplified

  • Consider witnesses/sources (provenance meta-facts)
  • Allow text predicates with each triple pattern (à la XQ-FT)

Semantics:
triples match structural predicates,
witnesses match text predicates

European composers who have won the Oscar,

whose music appeared in dramatic western scenes,

and who also wrote classical pieces ?

Select ?p Where {

?p instanceOf Composer .

?p bornIn ?t . ?t inCountry ?c . ?c locatedIn Europe .

?p hasWon ?a . ?a Name AcademyAward .

?p contributedTo ?movie [western, gunfight, duel, sunset] .

?p composed ?music [classical, orchestra, cantata, opera] . }



Querying Facts & Text

Problem: not everything is triplified

  • Consider witnesses/sources (provenance meta-facts)
  • Allow text predicates with each triple pattern (à la XQ-FT)

Grouping of keywords or phrases boosts expressiveness

French politicians married to Italian singers?

Select ?p1, ?p2 Where {

?p1 instanceOf politician [France] .

?p2 instanceOf singer [Italy] .

?p1 marriedTo ?p2 . }

Select ?p1, ?p2 Where {

?p1 instanceOf ?c1 [France, politics] .

?p2 instanceOf ?c2 [Italy, singer] .

?p1 marriedTo ?p2 . }

CS researchers whose advisors worked on the Manhattan project?

Select ?r, ?a Where {

?r instOf researcher [computer science] .

?a workedOn ?x [Manhattan project] .

?r hasAdvisor ?a . }

Select ?r, ?a Where {

?r ?p1 ?o1 [computer science] .

?a ?p2 ?o2 [Manhattan project] .

?r ?p3 ?a . }



Relatedness Queries

Schema-agnostic keyword search
(on RDF, ER graph, relational DB)
becomes a special case

Relationship between Jet Li, Steve Jobs, Carla Bruni?

Select ??p1, ??p2, ??p3 Where {
LiLianjie ??p1 SteveJobs .
SteveJobs ??p2 CarlaBruni .
CarlaBruni ??p3 LiLianjie . }

Select ??p1, ??p2, ??p3 Where {

?e1 ?r1 ?c1 [“Li Lianjie“] .

?e2 ?r2 ?c2 [“Steve Jobs“] .

?e3 ?r3 ?c3 [“Carla Bruni“] .

?e1 ??p1 ?e2 .

?e2 ??p2 ?e3 .

?e3 ??p3 ?e1 . }



Querying Temporal Facts

[Y. Wang et al.: EDBT’10]

[O. Udrea et al.: TOCL‘10]

Problem: not all facts hold forever (e.g. CEOs, spouses, …)

  • Consider temporal scopes of reified facts

  • Extend Sparql with temporal predicates

1: (BayernMunich, hasWon, ChampionsLeague)
2: (BorussiaDortmund, hasWon, ChampionsLeague)
3: (1, validOn, 23May2001)
4: (1, validOn, 15May1974)
5: (2, validOn, 28May1997)
6: (OttmarHitzfeld, manages, BayernMunich)
7: (6, validSince, 1Jul1998)
8: (6, validUntil, 30Jun2004)

When did a German soccer club win the Champions League?

Select ?c, ?t Where { ?c isa soccerClub . ?c inCountry Germany .
?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t . }

Managers of German clubs who won the Champions League?

Select ?m Where { ?c isa soccerClub . ?c inCountry Germany .
?id1: ?c hasWon ChampionsLeague . ?id1 validOn ?t .
?id2: ?m manages ?c . ?id2 validSince ?s . ?id2 validUntil ?u .
[?s,?u] overlaps [?t,?t] . }



Querying with Vague Temporal Scope

[Figure: timeline 1998–2010]

Problem: user‘s temporal interest is often imprecise

  • Consider temporal phrases as text conditions
  • Allow approximate matching
  • and rank results wisely

German Champion League winners in the nineties?

Select ?c Where {

?c isa soccerClub . ?c inCountry Germany .

?c hasWon ChampionsLeague [nineties] . }

Soccer final winners in summer 2001?

Select ?c Where {

?c isa soccerClub . ?id: ?c matchAgainst ?o [final] .

?id winner ?c [“summer 2001“] . }



KeywordSearch on Graphs

[BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS, NAGA, …]

Schema-agnostic keyword search over multiple tables:
graph of tuples with foreign-key relationships as edges

Example:

Conferences (CId, Title, Location, Year)
Journals (JId, Title)
CPublications (PId, Title, CId)
JPublications (PId, Title, Vol, No, Year)
Authors (PId, Person)
Editors (CId, Person)

Select * From * Where * Contains “Gray, DeWitt, XML, Performance“ And Year > 95

Result is a connected tree with nodes that contain
as many query keywords as possible

Ranking:
with nodeScore based on tf*idf or probabilistic IR
and edgeScore reflecting importance of relationships (or confidence, authority, etc.)

Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)



Best Result: Group Steiner Tree

[Figure: graph with keyword-matching terminal nodes labeled x, w, y, z]

Result is a connected tree with nodes that contain
as many query keywords as possible

  • Group Steiner tree:
  • match individual keywords to terminal nodes, grouped by keyword
  • compute the tree that connects at least one terminal node per keyword
  • and has the best total edge weight

[Figure: example group Steiner trees for query: x w y z]
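Since the group Steiner problem is NP-hard, practical systems use heuristics. One simple (non-exact) heuristic is to try each node as root and connect it to the cheapest terminal of every keyword group via shortest paths; the sketch below uses a made-up graph and plain Dijkstra:

```python
# Sketch of a simple group-Steiner-tree heuristic (not exact): for every
# candidate root, connect the root to the cheapest terminal of each
# keyword group via shortest paths, keeping the root with the lowest
# total path cost. Graph and keyword groups are made up.
import heapq

def dijkstra(graph, src):
    dist, prev = {src: 0}, {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    return dist, prev

def group_steiner(graph, groups):
    best = None
    for root in graph:
        dist, prev = dijkstra(graph, root)
        cost, edges, ok = 0, set(), True
        for group in groups:  # cheapest terminal per keyword group
            reachable = [t for t in group if t in dist]
            if not reachable:
                ok = False
                break
            t = min(reachable, key=dist.get)
            cost += dist[t]
            while t != root:  # collect the shortest path's edges
                edges.add((prev[t], t))
                t = prev[t]
        if ok and (best is None or cost < best[0]):
            best = (cost, root, edges)
    return best

graph = {  # undirected graph, stored with both edge directions
    "a": [("b", 1), ("c", 1)], "b": [("a", 1), ("d", 2)],
    "c": [("a", 1), ("e", 1)], "d": [("b", 2)], "e": [("c", 1)],
}
cost, root, tree = group_steiner(graph, [{"d"}, {"e"}])
print(cost, root)
```

Summing per-group path costs may double-count shared edges, so this only approximates the optimal tree weight; exact top-k Steiner-tree computation is what makes the problem hard.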



Outline

Motivation

Searching for Entities & Relations

Informative Ranking

Efficient Query Processing

User Interface

Wrap-up

...



Ranking Criteria

  • Confidence:

  • Prefer results that are likely correct

    • accuracy of info extraction

    • trust in sources

    • (authenticity, authority)

bornIn (Jim Gray, San Francisco) from

„Jim Gray was born in San Francisco“

(en.wikipedia.org)

livesIn (Michael Jackson, Tibet) from

„Fans believe Jacko hides in Tibet“

(www.michaeljacksonsightings.com)

  • Informativeness:
  • Prefer results with salient facts
  • Statistical estimation from:
    • frequency in answer
    • frequency on Web
    • frequency in query log

q: Einstein isa ?

Einstein isa scientist

Einstein isa vegetarian

q: ?x isa vegetarian

Einstein isa vegetarian

Whocares isa vegetarian

Diversity:

Prefer variety of facts

E won … E discovered … E played…

E won … E won … E won … E won …

  • Conciseness:
  • Prefer results that are tightly connected
    • size of answer graph
    • cost of Steiner tree

Einstein won NobelPrize

Bohr won NobelPrize

Einstein isa vegetarian

Cruise isa vegetarian

Cruise born 1962 Bohr died 1962



Ranking Approaches

  • Confidence:

  • Prefer results that are likely correct

    • accuracy of info extraction

    • trust in sources

    • (authenticity, authority)

empirical accuracy of IE
PR/HITS-style estimate of trust
combine into:
max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) }

  • Informativeness:
  • Prefer results with salient facts
  • Statistical LM with estimations from:
    • frequency in answer
    • frequency in corpus (e.g. Web)
    • frequency in query log

PR/HITS-style entity/fact ranking
[V. Hristidis et al., S. Chakrabarti, …]
or
IR models: tf*idf … [K. Chang et al., …]

Statistical Language Models

Diversity:

Prefer variety of facts

Statistical Language Models

  • Conciseness:
  • Prefer results that are tightly connected
    • size of answer graph
    • cost of Steiner tree

graph algorithms (BANKS, STAR, …)
[J.X. Yu et al., S. Chakrabarti et al.,
B. Kimelfeld et al., G. Kasneci et al., …]



Example: RDF-Express

[S. Elbassuoni: SIGIR‘12, ESWC‘11, CIKM‘09]

?a1 isMarriedTo ?a2 .

?a1 actedIn ?m .

?a2 actedIn ?m .

http://www.mpi-inf.mpg.de/yago-naga/rdf-express/



Example: RDF-Express

?a1 isMarriedTo ?a2 {Bollywood} .

?a1 actedIn ?m .

?a2 actedIn ?m .



Example: RDF-Express

?a1 isMarriedTo ?a2 .

?a1 actedIn ?m {thriller} .

?a2 actedIn ?m {love} .



Example: RDF-Express

[S. Elbassuoni: SIGIR‘12, ESWC‘11, CIKM‘09]

?a1 directed ?m .

?a2 actedIn ?m .

?a1 hasWonPrize Academy_Award .

?a2 type wordnet_musician .



Information Retrieval Basics


server farm with 10,000s of computers,
distributed/replicated data in a high-performance file system,
massive parallelism for query processing

[Textbooks by R. Baeza-Yates, B. Croft, Manning/Raghavan/Schuetze, …]

Pipeline stages:
crawl: strategies for crawl schedule and priority queue for crawl frontier
extract & clean: handle dynamic pages, detect duplicates, detect spam
index: build and analyze Web graph, index all tokens or word stems
rank: scoring function over many data and context criteria
search: fast top-k queries, query logging, auto-completion
present: GUI, user guidance, personalization



Vector Space Model for Content Relevance Ranking

Ranking by descending relevance

Similarity metric:

Search engine

Query (set of weighted features)

Documents are feature vectors (bags of words)



Vector Space Model for Content Relevance Ranking

Ranking by descending relevance

Similarity metric:

Search engine

Query (set of weighted features)

Documents are feature vectors (bags of words)

tf*idf formula:
term frequency * inverse document frequency
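A minimal sketch of the scheme, with made-up toy documents: tf*idf weighting plus cosine similarity as the similarity metric:

```python
# Minimal sketch of vector-space ranking with tf*idf weights and cosine
# similarity. The toy documents are made up for illustration.
import math
from collections import Counter

docs = {
    "d1": "god does not play dice".split(),
    "d2": "god rolls dice where you cannot see them".split(),
}

def tfidf_vector(terms, docs):
    n = len(docs)
    tf = Counter(terms)
    return {
        t: (tf[t] / len(terms)) *
           math.log(n / sum(1 for d in docs.values() if t in d))
        for t in tf
    }

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda x: math.sqrt(sum(w * w for w in x.values())) or 1.0
    return dot / (norm(u) * norm(v))

vectors = {d: tfidf_vector(terms, docs) for d, terms in docs.items()}
query_vec = tfidf_vector("play dice".split(), docs)
ranking = sorted(docs, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)
print(ranking)  # "play" occurs only in d1, so d1 ranks first
```

Note how the idf factor zeroes out "dice" (it occurs in every document), so the ranking is driven entirely by the discriminative term.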



Statistical Language Models (LM‘s)

[Maron/Kuhns 1960, Ponte/Croft 1998, Hiemstra 1998, Lafferty/Zhai 2001]

d1: „God does not play dice“ (Albert Einstein); „IR does“ (anonymous)
d2: „God rolls dice in places where you can‘t see them“ (Stephen Hawking)

[Figure: query q matched against LM(θ1) of d1 and LM(θ2) of d2]

  • each doc di has an LM: a generative prob. distribution with parameters θi
  • query q viewed as a sample from LM(θ1), LM(θ2), …
  • estimate the likelihood P[ q | LM(θi) ]
  • that q is a sample of the LM of doc di (q is „generated by“ di)
  • rank by descending likelihoods (best „explanation“ of q)



LM: Doc as Model, Query as Sample

estimate likelihood of observing the query:  P [ query | M ]

model M: multinomial prob. distribution over terms (here A, B, C, D, E)

document d: a sample of M, used for parameter estimation



LM: Need for Smoothing

estimate likelihood of observing the query:  P [ query | M ]

query terms may be unseen in the document (here F),
so parameter estimation uses the document
plus a background corpus and/or smoothing:
Laplace smoothing, Jelinek-Mercer, Dirichlet smoothing



Some LM Basics

independence assumption:  P[q | d] = ∏j P[wj | d]
simple MLE:  P[wj | d] = tf(wj, d) / Σk tf(wk, d)  (prone to overfitting)
mixture model for smoothing, with background P[wj] estimated from query log or corpus
→ tf*idf family
KL divergence (Kullback-Leibler divergence), aka relative entropy

  • Precompute per-keyword scores
  • Store in postings of inverted index
  • Score aggregation for (top-k) multi-keyword query
→ efficient implementation



IR as LM Estimation

user likes doc (R),
given that it has features d
and the user poses query q:  P[R | d, q]

prob. IR vs. statistical LM

query likelihood / top-k query result

MLE would be tf


Multi-Bernoulli vs. Multinomial LM

Multi-Bernoulli:
with Xj(q) = 1 if j ∈ q, 0 otherwise

Multinomial:
with fj(q) = f(j) = frequency of term j in q

multinomial LM more expressive and usually preferred
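The contrast can be made concrete with a toy unigram model (probabilities made up): the Bernoulli model only asks which vocabulary terms occur in the query, while the multinomial model also counts how often.

```python
# Sketch contrasting multi-Bernoulli and multinomial query likelihoods
# under a toy unigram model (probabilities are made up).
import math

model = {"a": 0.5, "b": 0.3, "c": 0.2}  # P[term | document LM]

def multi_bernoulli(query, model):
    q = set(query)  # only presence/absence of each vocabulary term matters
    return math.prod(model[t] if t in q else 1 - model[t] for t in model)

def multinomial(query, model):
    # one factor per query occurrence; the constant multinomial
    # coefficient is irrelevant for ranking and omitted here
    return math.prod(model[t] for t in query)

print(multi_bernoulli(["a", "a", "b"], model))  # same as for ["a", "b"]
print(multinomial(["a", "a", "b"], model))      # 0.5 * 0.5 * 0.3 = 0.075
```

The repeated "a" changes the multinomial likelihood but not the Bernoulli one, which is why the multinomial LM is called more expressive above.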


LM Scoring by Kullback-Leibler Divergence

neg. cross-entropy / neg. KL divergence of q and d


Jelinek-Mercer Smoothing

Idea:
use a linear combination of the doc LM with a
background LM (corpus, common language);
could also consider the query log as background LM for the query

  • parameter tuning of λ by cross-validation with held-out data:
  • divide the set of relevant (d,q) pairs into n partitions
  • build LM on the pairs from n-1 partitions
  • choose λ to maximize precision (or recall or F1) on the nth partition
  • iterate with different choice of the nth partition and average
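A minimal sketch of Jelinek-Mercer scoring over made-up documents, using MLE document and corpus models mixed with weight λ:

```python
# Sketch of Jelinek-Mercer smoothing: each per-term probability is the
# linear mixture lam * P[t|d] + (1 - lam) * P[t|C] of document and
# corpus language models (both MLE-estimated; documents are made up).
import math
from collections import Counter

corpus = {
    "d1": "god does not play dice".split(),
    "d2": "god rolls dice you cannot see".split(),
}
all_terms = [t for d in corpus.values() for t in d]

def p_corpus(t):
    return all_terms.count(t) / len(all_terms)

def jm_log_likelihood(query, doc, lam=0.5):
    tf, dlen = Counter(corpus[doc]), len(corpus[doc])
    return sum(
        math.log(lam * tf[t] / dlen + (1 - lam) * p_corpus(t))
        for t in query
    )

query = "play dice".split()
ranking = sorted(corpus, key=lambda d: jm_log_likelihood(query, d), reverse=True)
print(ranking)  # d1 wins: it is the only document containing "play"
```

Without the background mixture, d2 would get probability zero for "play"; smoothing turns that hard zero into a small but nonzero score.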


Jelinek-Mercer Smoothing: Relationship to tf*idf

with absolute frequencies tf, df:
relative tf ~ relative idf


Dirichlet-Prior Smoothing

Posterior distribution with a Dirichlet distribution as prior,
with term frequencies f in document d

MAP (Maximum A-Posteriori) estimate,
with αi set to μ·P[wi|C] + 1 for the Dirichlet hypergenerator
and μ > 1 set to a multiple of the average document length

(Dirichlet is the conjugate prior for the parameters of a multinomial distribution:
a Dirichlet prior implies a Dirichlet posterior, only with different parameters)


Dirichlet-Prior Smoothing: Relationship to Jelinek-Mercer Smoothing

with MLEs P[wj|d], P[wj|C] (term frequencies tf of term j from document d and corpus C),

where α1 = μ·P[w1|C], ..., αm = μ·P[wm|C] are the parameters
of the underlying Dirichlet distribution, with constant μ > 1
typically set to a multiple of the average document length
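The relationship can be checked numerically: Dirichlet smoothing equals Jelinek-Mercer with a document-dependent weight λ = |d| / (|d| + μ). A sketch with made-up documents and μ:

```python
# Sketch of Dirichlet-prior smoothing:
#   P[t|d] = (tf(t,d) + mu * P[t|C]) / (|d| + mu),
# which equals Jelinek-Mercer smoothing with the document-dependent
# weight lam = |d| / (|d| + mu). Documents and mu are made up.
from collections import Counter

corpus = {
    "d1": "god does not play dice".split(),
    "d2": "god rolls dice you cannot see".split(),
}
all_terms = [t for d in corpus.values() for t in d]

def p_corpus(t):
    return all_terms.count(t) / len(all_terms)

def dirichlet(t, doc, mu=10):
    tf, dlen = Counter(corpus[doc]), len(corpus[doc])
    return (tf[t] + mu * p_corpus(t)) / (dlen + mu)

def jelinek_mercer(t, doc, lam):
    tf, dlen = Counter(corpus[doc]), len(corpus[doc])
    return lam * tf[t] / dlen + (1 - lam) * p_corpus(t)

dlen = len(corpus["d1"])
lam = dlen / (dlen + 10)  # the equivalent JM weight for mu = 10
print(dirichlet("dice", "d1"), jelinek_mercer("dice", "d1", lam))
```

Longer documents get a larger λ, i.e. Dirichlet smoothing automatically trusts the document model more when more evidence is available.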


Entity Search with LM Ranking
[Z. Nie et al.: WWW’07, H. Fang et al.: ECIR‘07, P. Serdyukov et al.: ECIR‘08, …]

played for ManU, Real, LA Galaxy

David Beckham champions league

England lost match against France

married to spice girl …

Zizou champions league 2002

Real Madrid won final ...

Zinedine Zidane best player

France world cup 1998 ...

query: keywords → answer: entities

LM (entity e) = prob. distr. of words seen in context of e

query q: „French player who

won world championship“

candidate entities:

e1: David Beckham

e2: Ruud van Nistelroy

e3: Ronaldinho

e4: Zinedine Zidane

e5: FC Barcelona

weighted

by conf.


LM‘s: from Entities to Facts

Document / Entity LM‘s

LM for doc/entity: prob. distr. of words
LM for query: (prob. distr. of) words
LM‘s: rich for docs/entities, super-sparse for queries
→ richer query LM with query expansion, etc.

Triple LM‘s

LM for facts: (degenerate prob. distr. of) triple
LM for queries: (degenerate prob. distr. of) triple pattern
LM‘s: apples and oranges

  • expand query variables by S,P,O values from DB/KB
  • enhance with witness statistics
  • query LM then is a prob. distr. of triples!


LM‘s for Triples and Triple Patterns

[G. Kasneci et al.: ICDE’08; S. Elbassuoni et al.: CIKM’09, ESWC‘11]

triples (facts f), with witness statistics:

f1: Beckham p ManchesterU 200
f2: Beckham p RealMadrid 300
f3: Beckham p LAGalaxy 20
f4: Beckham p ACMilan 30
f5: Kaka p ACMilan 300
f6: Kaka p RealMadrid 150
f7: Zidane p ASCannes 20
f8: Zidane p Juventus 200
f9: Zidane p RealMadrid 350
f10: Tidjani p ASCannes 10
f11: Messi p FCBarcelona 400
f12: Henry p Arsenal 200
f13: Henry p FCBarcelona 150
f14: Ribery p BayernMunich 100
f15: Drogba p Chelsea 150
f16: Casillas p RealMadrid 20
Σ: 2600

triple patterns (queries q): LM(q) + smoothing

q: Beckham p ?y
Beckham p ManU 200/550
Beckham p Real 300/550
Beckham p Galaxy 20/550
Beckham p Milan 30/550

q: ?x p ASCannes
Zidane p ASCannes 20/30
Tidjani p ASCannes 10/30

q: ?x p ?y
Messi p FCBarcelona 400/2600
Zidane p RealMadrid 350/2600
Kaka p ACMilan 300/2600
…

q: Cruyff ?r FCBarcelona
Cruyff playedFor FCBarca 200/500
Cruyff playedAgainst FCBarca 50/500
Cruyff coached FCBarca 250/500

LM(q): { t → P[t | t matches q] ~ #witnesses(t) }
LM(answer f): { t → P[t | t matches f] ~ 1 for f }
smooth all LM‘s
rank results by ascending KL(LM(q) | LM(f))
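The witness-count LM for a triple pattern can be sketched directly from the table above (counts copied from the toy example): the LM of a pattern assigns each matching triple a probability proportional to its number of witnesses.

```python
# Sketch of the witness-count LM for triple patterns: the LM of a
# pattern q assigns each matching triple a probability proportional to
# its witness count. Counts follow the slide's toy example.
witnesses = {
    ("Beckham", "p", "ManchesterU"): 200,
    ("Beckham", "p", "RealMadrid"): 300,
    ("Beckham", "p", "LAGalaxy"): 20,
    ("Beckham", "p", "ACMilan"): 30,
    ("Zidane", "p", "ASCannes"): 20,
    ("Tidjani", "p", "ASCannes"): 10,
}

def matches(triple, pattern):
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))

def pattern_lm(pattern):
    hits = {t: w for t, w in witnesses.items() if matches(t, pattern)}
    total = sum(hits.values())
    return {t: w / total for t, w in hits.items()}

lm = pattern_lm(("Beckham", "p", "?y"))
print(lm[("Beckham", "p", "RealMadrid")])  # 300/550, as on the slide
```

Smoothing (omitted here) would then mix this distribution with a background LM before computing KL(LM(q) | LM(f)) for ranking.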


LM‘s for Composite Queries

q: Select ?x, ?c Where { France ml ?x . ?x p ?c . ?c in UK . }

queries q with subqueries q1 … qn

P [ F ml Henry, Henry p Arsenal, Arsenal in UK ]
P [ F ml Drogba, Drogba p Chelsea, Chelsea in UK ]

results are n-tuples of triples t1 … tn

LM(q): P[q1…qn] = ∏i P[qi]
LM(answer): P[t1…tn] = ∏i P[ti]
KL(LM(q) | LM(answer)) = Σi KL(LM(qi) | LM(ti))

f1: Beckham p ManU 200

f7: Zidane p ASCannes 20

f8: Zidane p Juventus 200

f9: Zidane p RealMadrid 300

f10: Tidjani p ASCannes 10

f12: Henry p Arsenal 200

f13: Henry p FCBarca 150

f14: Ribery p Bayern 100

f15: Drogba p Chelsea 150

f21: F ml Zidane 200

f22: F ml Tidjani 20

f23: F ml Henry 200

f24: F ml Ribery 200

f25: F ml Drogba 30

f26: IC ml Drogba 100

f27 ALG ml Zidane 50

f31: ManU in UK 200

f32: Arsenal in UK 160

f33: Chelsea in UK 140


LM‘s for Keyword-Augmented Queries

q: Select ?x, ?c Where {
France ml ?x [goalgetter, “top scorer“] .
?x p ?c .
?c in UK [champion, “cupwinner“, double] . }

subqueries qi with keywords w1 … wm
results are still n-tuples of triples ti

LM(qi): P[triple ti | w1 … wm] = λ Σk P[ti | wk] + (1−λ) P[ti]
LM(answer fi) analogous
KL(LM(q) | LM(answer)) = Σi KL(LM(qi) | LM(fi))

result ranking prefers (n-tuples of) triples
whose witnesses score high on the subquery keywords


LM‘s for Temporal Phrases

[K. Berberich et al.: ECIR‘10]

Problem: user‘s temporal interest is imprecise
(e.g. “mid 90s“, “July 1994“, “summer 1990“, “last century“ on a 1998–2010 timeline)

q: Select ?c Where {
?c instOf nationalTeam .
?c hasWon WorldCup [nineties] . }

extract temporal expressions xj from witnesses & normalize;
normalize temporal expression v in the query:
“nineties“ → v = [1Jan90, 31Dec99]
“mid 90s“ → xj = [1Jan93, 31Dec97]

P[q | t] ~ … ∏j P[v | xj]
~ … ∏j Σ { P[ [b,e] | xj ] : interval [b,e] ⊆ v }
~ … ∏j overlap(v, xj) / union(v, xj)

P[v = “nineties“ | xj = “mid 90s“] = 1/2
P[“nineties“ | “summer 1990“] = 1/30
P[“nineties“ | “last century“] = 1/10

  • enhanced ranking
  • efficiently computable
  • plugs into doc/entity/triple LM‘s
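The overlap/union score is straightforward to compute once phrases are normalized to intervals; the sketch below works at year granularity and reproduces the 1/2 and 1/10 values above:

```python
# Sketch of the overlap/union score for vague temporal phrases: query
# phrase v and witness phrase x are normalized to (start_year, end_year)
# intervals (inclusive), and P[v | x] is approximated by
# |overlap(v, x)| / |union(v, x)|.
def overlap_union(v, x):
    overlap = max(0, min(v[1], x[1]) - max(v[0], x[0]) + 1)
    union = (v[1] - v[0] + 1) + (x[1] - x[0] + 1) - overlap
    return overlap / union

nineties = (1990, 1999)
mid_90s = (1993, 1997)
last_century = (1900, 1999)

print(overlap_union(nineties, mid_90s))       # 5/10  = 0.5
print(overlap_union(nineties, last_century))  # 10/100 = 0.1
```

Finer granularities (months or days, as in the "summer 1990" example) use the same formula with smaller interval units.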


Query Relaxation

q: … Where { France ml ?x . ?x p ?c . ?c in UK . }
q(4): … Where { IC ml ?x . ?x p ?c . ?c in UK . }
q(3): … Where { France ?r ?x . ?x p ?c . ?c in UK . }
q(2): … Where { ?xml ?x . ?x p ?c . ?c in UK . }
q(1): … Where { France ml ?x . ?x p ?c . ?c in ?y . }

[ F resOf Drogba,
Drogba p Chelsea,
Chelsea in UK ]

[ IC ml Drogba,
Drogba p Chelsea,
Chelsea in UK ]

[ IC ml Drogba,
Drogba p Chelsea,
Chelsea in UK ]

[ F ml Zidane,
Zidane p Real,
Real in ESP ]

LM(q*) = λ0 LM(q) + λ1 LM(q(1)) + λ2 LM(q(2)) + …

replace e in q by e(i) in q(i):
precompute P := LM(e ?p ?o)
and Q := LM(e(i) ?p ?o)
set λi ~ 1/2 (KL(P|Q) + KL(Q|P))

replace r in q by r(i) in q(i) → LM(?s r(i) ?o)
replace e in q by ?x in q(i) → LM(?x r ?o)

LM‘s of e, r, ... are prob. distr.‘s of triples!

f21: F ml Zidane 200
f22: F ml Tidjani 20
f23: F ml Henry 200
f24: F ml Ribery 200
f26: IC ml Drogba 100
f27: ALG ml Zidane 50

f1: Beckham p ManU 200

f7: Zidane p ASCannes 20

f9: Zidane p Real 300

f10: Tidjani p ASCannes 10

f12: Henry p Arsenal 200

f15: Drogba p Chelsea 150

f31: ManU in UK 200

f32: Arsenal in UK 160

f33: Chelsea in UK 140


Result Personalization

[S. Elbassuoni et al.: PersDB‘08]

[Figure: personal histories as a graph of queries q1…q6 and clicked facts f1…f6]

same answer for everyone?

Personal histories of queries & clicked facts
→ LM(user u): prob. distr. of triples!

LM(q|u) = λ LM(q) + (1−λ) LM(u)
then business as usual

u1 [classical music] → q: ?p from Europe . ?p hasWon AcademyAward
u2 [romantic comedy] → q: ?p from Europe . ?p hasWon AcademyAward
u3 [from Africa] → q: ?p isa SoccerPlayer . ?p hasWon ?a

Open issue: „insightful“ results (new to the user)


Result Diversification

[J. Carbonell, J. Goldstein: SIGIR‘98]

q: Select ?p, ?c Where { ?p isaSoccerPlayer . ?p playedFor ?c . }

1 Beckham, ManchesterU

2 Beckham, RealMadrid

3 Beckham, LAGalaxy

4 Beckham, ACMilan

5 Zidane, RealMadrid

6 Kaka, RealMadrid

7 Cristiano Ronaldo, RealMadrid

8 Raul, RealMadrid

9 van Nistelrooy, RealMadrid

10 Casillas, RealMadrid

1 Beckham, ManchesterU

2 Beckham, RealMadrid

3 Zidane, RealMadrid

4 Kaka, ACMilan

5 Cristiano Ronaldo, ManchesterU

6 Messi, FCBarcelona

7 Henry, Arsenal

8 Ribery, BayernMunich

9 Drogba, Chelsea

10 Luis Figo, Sporting Lissabon

rank results f1 ... fk by ascending
λ KL(LM(q) | LM(fi)) − (1−λ) KL(LM(fi) | LM({f1..fk}\{fi}))

implemented by greedy re-ranking of the fi‘s in a candidate pool
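The greedy re-ranking can be sketched as classic MMR in the style of Carbonell & Goldstein (the relevance scores and the similarity function below are made up):

```python
# Sketch of greedy MMR-style re-ranking: at each step, pick the
# candidate with the best mix of relevance to the query and
# dissimilarity to the already-selected results. The relevance scores
# and similarity function are made up for illustration.
def mmr_rerank(candidates, relevance, similarity, lam=0.7, k=3):
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(
            pool,
            key=lambda c: lam * relevance[c]
            - (1 - lam) * max((similarity(c, s) for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected

relevance = {"beckham_manu": 0.9, "beckham_real": 0.85, "zidane_real": 0.8}
# crude redundancy signal: two results are "similar" if they share a player
same_player = lambda a, b: 1.0 if a.split("_")[0] == b.split("_")[0] else 0.0

print(mmr_rerank(relevance, relevance, same_player, lam=0.5, k=2))
```

With lam=0.5 the second Beckham result is penalized for redundancy, so a slightly less relevant but novel result (Zidane) is promoted, mirroring the diversified top-10 on the slide.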


Entity-Search Ranking by Link Analysis
[A. Balmin et al. 2004, Nie et al. 2005, Chakrabarti 2007, J. Stoyanovich 2007]

  • Entity Authority (ObjectRank, PopRank, HubRank, EVA, etc.):
  • define authority transfer graph
  • among entities and pages with edges:
    • entity → page if entity appears in page
    • page → entity if entity is extracted from page
    • page1 → page2 if there is a hyperlink or implicit link between pages
    • entity1 → entity2 if there is a semantic relation between entities
  • edges can be typed and (degree- or weight-) normalized and
  • are weighted by confidence and type-importance
  • also applicable to graph of DB records with foreign-key relations
  • (e.g. bibliography with different weights of publisher vs. location for conference record)
  • compared to standard Web graph, ER graphs of this kind
  • have higher variation of edge weights



Outline

Motivation

Searching for Entities & Relations

Informative Ranking

Efficient Query Processing

User Interface

Wrap-up

...


Scalable Semantic Web: Pattern Queries on Large RDF Graphs

[Figure: SPO index tables, e.g. hasWon(S,O) with rows (Einstein, Nobel), (Ronaldo, FIFA); bornIn(S,O); partOf(S,O) with rows (Spain, Europe), (France, Europe); class tables Person, Country; AllTriples(S,P,O) with rows (Einstein, hasWon, Nobel), (Einstein, bornIn, Ulm), (Ronaldo, hasWon, FIFA), (Spain, partOf, Europe), (France, partOf, Europe), …]

schema-free RDF triples: subject-property-object (SPO)

example: Einstein hasWon NobelPrize

SPARQL triple patterns: Select ?p,?c Where {

?p isa scientist . ?p hasWon NobelPrize .

?p bornIn ?t .?t inCountry ?c . ?c partOf Europe}

large join queries, unpredictable workload,

difficult physical design, difficult query optimization


Semantic-Web engines (Sesame, Jena, etc.)

did not provide scalable query performance


Scalable Semantic Web: RDF-3X Engine
[T. Neumann et al.: VLDB 2008, SIGMOD 2009, VLDBJ’10]

  • RISC-style, tuning-free system architecture

  • map literals into ids (dictionary) and precompute

  • exhaustive indexing for SPO triples:

  • SPO, SOP, PSO, POS, OSP, OPS,

  • SP*, PS*, SO*, OS*, PO*, OP*, S*, P*, O*

  • very high compression

  • efficient merge joins with order-preservation

  • join-order optimization

  • by dynamic programming over subplan  result-order

  • statistical synopses for accurate result-size estimation

http://code.google.com/p/rdf3x/

http://www.mpi-inf.mpg.de/~neumann/rdf3x/



Scalable Indexing & Merge Joins in RDF-3X

poor

join order

?p isa scientist . ?p hasWon NobelPrize .

?p bornIn ?t .?t inCountry ?c . ?c partOf Europe

...

||

[S=S]

||

[O=S]

||

[S=S]

scan OPS

[O=scientist

P=isa]

scan OPS

[O=NobelPrize,

P=hasWon]

scan PSO

[P=inCountry]

scan PSO

[P=bornIn]

[Figure: sorted index scans feeding order-preserving merge joins; each scan yields id-encoded triples in index order, e.g. OPS rows (3007, 2015, 1011), (3007, 2015, 1055), … and PSO rows (2100, 1003, 5648), (2077, 1005, 8311), …]
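The order-preserving merge join at the heart of such plans can be sketched over two id-sorted lists (the ids below are made up): both inputs are consumed in one parallel pass, and equal-key groups are combined.

```python
# Sketch of an order-preserving merge join over sorted index scans, as
# used by RDF-3X: two lists of (key, payload) pairs sorted by key are
# joined in a single parallel pass. Ids are made up.
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = left[i][0], right[j][0]
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:  # equal keys: emit the cross product of the two key groups
            i2 = i
            while i2 < len(left) and left[i2][0] == kl:
                j2 = j
                while j2 < len(right) and right[j2][0] == kl:
                    out.append((kl, left[i2][1], right[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

scientists = [(1011, "isa"), (1055, "isa"), (1119, "isa")]  # sorted by subject id
laureates = [(1055, "hasWon"), (1119, "hasWon"), (1127, "hasWon")]
print(merge_join(scientists, laureates))
```

Because both inputs arrive sorted from the exhaustive indexes, the join needs no hashing or sorting, and its output is again sorted on the join key, which is what makes chains of merge joins cheap.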



Scalable Indexing & Merge Joins in RDF-3X

good

join order

?p isa scientist . ?p hasWon NobelPrize .

?p bornIn ?t .?t inCountry ?c . ?c partOf Europe

...

||

[O=S]

||

[S=S]

||

[S=S]

scan OPS

[O=scientist

P=isa]

scan OPS

[O=NobelPrize,

P=hasWon]

scan PSO

[P=inCountry]

scan PSO

[P=bornIn]

[Figure: the same sorted index scans as above, now combined in a good join order]


Scalable Indexing & Merge Joins in RDF-3X

?p isa scientist . ?p hasWon NobelPrize .
?p bornIn ?t . ?t inCountry ?c . ?c partOf Europe
...

sideways information passing: run-time filters

[Figure: the same merge-join plan ([O=S], [S=S], [S=S] over scans of OPS and
PSO); bindings are passed sideways between the index scans as run-time filters]


Join-Order Optimization for SPARQL

[T. Neumann: VLDBJ‘10]

[Figure: join graph of a social-tagging query — variables ?f, ?s, ?p, ?a, ?x, ?y
connected to constants me, Berlin, BobDylan, 2009 via edges such as gender,
hasFriend, livesIn, likes, composedBy, performedIn, inCity, inYear]

Fine-grained RDF (e.g. over social-tagging data)
often entails complex queries with 10, 20 or more joins

?f gender female .
?f hasFriend me .
?f livesIn Berlin .
?f likes ?s .
?s taggedProtestBy ?x .
?s taggedAntiwarBy ?y .
?s composedBy BobDylan .
?s performedIn ?p .
?p inCity Berlin .
?s inYear 2009 .
?p singer ?a .

  • Join-order optimization operates on the join graph:
  • nodes are triple patterns, edges denote shared variables
  • need a selectivity (result cardinality) estimator
  • join-order optimization is O(n^3) for chains, O(2^n) for stars
  • exact optimization uses dynamic programming
  • over sub-graphs and the sort order of intermediate results
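The dynamic program over pattern subsets can be sketched in a few lines. This is an illustrative toy, not RDF-3X's actual cost model: the pattern names, cardinalities, and the single selectivity constant `SEL` are invented placeholders for a real statistics-based estimator, and sort orders of intermediate results are ignored:

```python
# Toy DP over subsets of triple patterns for join-order selection
# (made-up cardinalities; SEL stands in for a real selectivity estimator).
from itertools import combinations

patterns = {   # pattern -> (variables, estimated cardinality)
    "p_isa":  ({"?p"}, 1000),
    "p_won":  ({"?p"}, 50),
    "p_born": ({"?p", "?t"}, 5000),
    "t_ctry": ({"?t", "?c"}, 3000),
}
SEL = 0.001    # crude per-shared-variable join selectivity (assumption)

def join_card(c1, v1, c2, v2):
    shared = v1 & v2
    # cross product if the inputs share no variable
    return c1 * c2 * (SEL ** len(shared) if shared else 1.0)

# best[S] = (cost, cardinality, variables, plan) for each subset S
best = {frozenset([n]): (0.0, c, v, n) for n, (v, c) in patterns.items()}
names = list(patterns)
for size in range(2, len(names) + 1):
    for subset in combinations(names, size):
        S = frozenset(subset)
        for split in range(1, size):
            for left in combinations(subset, split):
                L, R = frozenset(left), S - frozenset(left)
                cl, kl, vl, pl = best[L]
                cr, kr, vr, pr = best[R]
                card = join_card(kl, vl, kr, vr)
                cost = cl + cr + card   # charge each intermediate result
                if S not in best or cost < best[S][0]:
                    best[S] = (cost, card, vl | vr, f"({pl} join {pr})")

print(best[frozenset(names)][3])   # cheapest plan found by the DP
```

Enumerating all subsets is exactly the O(2^n) worst case mentioned above; real optimizers prune subsets that correspond to cross products.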



Experimental Evaluation: Setup

[T. Neumann: SIGMOD‘09]

  • Setup and competitors:

  • 2GHz dual core, 2 GB RAM, 30MB/s disk, Linux

  • column-store property tables by MIT folks, using MonetDB

  • triples store with SPO, POS, PSO indexes, using PostgreSQL

Datasets:

1) Barton library catalog: 51 Mio. triples (4.1 GB)

2) YAGO knowledge base: 40 Mio. triples (3.1 GB)

3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB)

Select ?t Where {

?b hasTitle ?t .

?u romance ?b .

?u love ?b .

?u mystery ?b .

?u suspense ?b .

?u crimeNovel ?c .

?u hasFriend ?f .

?f ... }

Benchmark queries such as:

1) counts of French library items (books, music, etc.),

with creator, publisher, language, etc.

2) scientist from Poland with French advisor who both won awards

3) books tagged with romance, love, mystery, suspense

by users who like crime novels and have friends who ...



Experimental Evaluation: Results

[T. Neumann: SIGMOD‘09]

[Figure: execution times in seconds per benchmark query]

YAGO dataset

(40 Mio. triples, 3.1 GB):

2.7 GB in RDF-3X, loaded in 25 min

Librarything dataset

(36 Mio. triples, 1.8 GB):

1.6 GB in RDF-3X, loaded in 20 min

measurements on Uniprot (845 Mio. triples) and Billion-Triples (562 Mio. triples)

confirm these experimental results


Distributed RDF Stores & SPARQL at Scale

[D. Abadi et al.: VLDB‘11]

[Figure: soccer data graph partitioned across machines — players (Özil,
Casillas, Iniesta, Messi, Xavi) with playsFor, playsPos, bornIn edges to teams
(Real, FC Barca) and cities (Madrid, Gelsenkirchen, Barcelona, Rosario), plus
locatedIn and hasPop edges; query pattern:
?player playsFor ?team . ?player playsPos ?pos (midfielder) .
?team locatedIn ?city . ?city hasPop ?pop (> 1 000 000)]

  • Partition the data graph by hashing triples on subject
  • Enhance locality by replicating N-hop neighbors with each triple
  • Can automatically determine when a query is
  • parallelizable without communication
  • For remaining cases: distributed joins, semi-joins, etc.



Top-k Query Processing for Ranked Results

fetch lists, then join & sort:
simple & DB-style; needs only O(k) memory; for monotone score aggregation

Data items: d1, …, dn with scores such as s(t1,d1) = 0.7, …, s(tm,d1) = 0.2
Query: q = (t1, t2, t3); aggr: summation

[Figure: index lists sorted by descending score —
 t1: (d78, 0.9) (d23, 0.8) (d10, 0.8) (d1, 0.7) (d88, 0.2)
 t2: (d64, 0.8) (d23, 0.6) (d10, 0.6) (d12, 0.2) (d78, 0.1)
 t3: (d10, 0.7) (d78, 0.5) (d64, 0.4) (d99, 0.2) (d34, 0.1)]


Threshold Algorithm (TA)

[Fagin 01, Güntzer 00, Nepal 99, Buckley 85]

simple & DB-style; needs only O(k) memory; for monotone score aggregation

Threshold algorithm (TA):
  scan index lists; consider d at position i in Li;
  high_i := s(ti, d);
  if d ∉ top-k then {
    look up s_ν(d) in all lists L_ν with ν ≠ i;
    score(d) := aggr {s_ν(d) | ν = 1..m};
    if score(d) > min-k then {
      add d to top-k and remove the min-score d’;
      min-k := min {score(d’) | d’ ∈ top-k}; } };
  threshold := aggr {high_ν | ν = 1..m};
  if threshold ≤ min-k then exit;

Example: data items d1, …, dn; query q = (t1, t2, t3); aggr: summation; k = 2

[Figure: the three index lists are scanned round-robin —
 t1: (d78, 0.9) (d23, 0.8) (d10, 0.8) (d1, 0.7) (d88, 0.2)
 t2: (d64, 0.9) (d23, 0.6) (d10, 0.6) (d12, 0.2) (d78, 0.1)
 t3: (d10, 0.7) (d78, 0.5) (d64, 0.3) (d99, 0.2) (d34, 0.1)
 the top-2 result {d10: 2.1, d78: 1.5} is final after scan depth 4,
 when the threshold drops to 1.1 ≤ min-k = 1.5 → STOP!]
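The TA pseudocode can be run directly on the slide's example lists (k = 2, summation). A compact Python sketch, doing one sorted access per list at each depth plus random-access lookups:

```python
# Runnable sketch of the Threshold Algorithm on the slide's example (k = 2).
lists = {  # term -> index list sorted by descending score
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    "t2": [("d64", 0.9), ("d23", 0.6), ("d10", 0.6), ("d12", 0.2), ("d78", 0.1)],
    "t3": [("d10", 0.7), ("d78", 0.5), ("d64", 0.3), ("d99", 0.2), ("d34", 0.1)],
}
score = {t: dict(l) for t, l in lists.items()}  # for random-access lookups

def ta(lists, k):
    top = {}       # doc -> aggregated score of current top-k candidates
    depth = 0
    while True:
        high = {}
        for t, l in lists.items():          # one sorted access per list
            if depth < len(l):
                d, s = l[depth]
                high[t] = s
                if d not in top:
                    # random accesses: aggregate d's score over all lists
                    top[d] = sum(score[u].get(d, 0.0) for u in lists)
                    if len(top) > k:        # keep only the k best
                        top.pop(min(top, key=top.get))
        depth += 1
        min_k = min(top.values())
        threshold = sum(high.values())      # bound on any unseen doc
        if threshold <= min_k:
            return sorted(top.items(), key=lambda x: -x[1])

print([d for d, _ in ta(lists, 2)])   # ['d10', 'd78']
```

As on the slide, the scan stops at depth 4 with d10 (2.1) and d78 (1.5).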


TA with Sorted Access Only (NRA) [Fagin 01, Güntzer et al. 01]

sequential access (SA) faster than random access (RA) by a factor of 20-1000

No-random-access algorithm (NRA):
  scan index lists; consider d at position i in Li;
  E(d) := E(d) ∪ {i}; high_i := s(ti, d);
  worstscore(d) := aggr {s(tν, d) | ν ∈ E(d)};
  bestscore(d) := aggr {worstscore(d), aggr {high_ν | ν ∉ E(d)}};
  if worstscore(d) > min-k then {
    add d to top-k;
    min-k := min {worstscore(d’) | d’ ∈ top-k}; }
  else if bestscore(d) > min-k then
    cand := cand ∪ {d};
  threshold := max {bestscore(d’) | d’ ∈ cand};
  if threshold ≤ min-k then exit;

Example: data items d1, …, dn; query q = (t1, t2, t3); aggr: summation; k = 1

[Figure: the three index lists are scanned round-robin, sorted access only —
 t1: (d78, 0.9) (d23, 0.8) (d10, 0.8) (d1, 0.7) (d88, 0.2)
 t2: (d64, 0.8) (d23, 0.6) (d10, 0.6) (d12, 0.2) (d78, 0.1)
 t3: (d10, 0.7) (d78, 0.5) (d64, 0.4) (d99, 0.2) (d34, 0.1)
 after scan depth 3, worstscore(d10) = 2.1 already exceeds every other
 candidate's bestscore (≤ 2.0) → STOP!]
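Analogously, a Python sketch of NRA on the slide's example (k = 1), tracking worstscore/bestscore bounds from sorted accesses only:

```python
# Runnable sketch of NRA on the slide's example (k = 1, sum aggregation).
lists = {  # term -> index list sorted by descending score
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)],
    "t2": [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d12", 0.2), ("d78", 0.1)],
    "t3": [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)],
}

def nra(lists, k):
    seen = {}   # doc -> {term: score}, i.e. the E(d) bookkeeping
    high = {t: l[0][1] for t, l in lists.items()}
    for depth in range(max(len(l) for l in lists.values())):
        for t, l in lists.items():          # sorted accesses only
            if depth < len(l):
                d, s = l[depth]
                seen.setdefault(d, {})[t] = s
                high[t] = s
        worst = {d: sum(ts.values()) for d, ts in seen.items()}
        best = {d: worst[d] + sum(high[t] for t in lists if t not in ts)
                for d, ts in seen.items()}
        topk = sorted(worst, key=worst.get, reverse=True)[:k]
        min_k = worst[topk[-1]]
        # can any non-top candidate still overtake the current top-k?
        threshold = max((best[d] for d in seen if d not in topk), default=0.0)
        if threshold <= min_k:
            break
    return [(d, worst[d]) for d in topk]

print(nra(lists, 1))   # top-1: d10 with worstscore 2.1
```

No random accesses are needed; the price is the candidate bookkeeping, which the bestscore bound keeps in check.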


  • Problems:
  • efficiency
  • tuning the threshold θ
  → efficient top-k search with dynamic expansion

Query Expansion (Synonyms, Related Terms, …)

User query: ~t1 ... ~tm
Example: ~professor Germany ~course ~IR

Thesaurus/Ontology:
concepts, relationships, glosses from
WordNet, Wikipedia, KB, Web, etc.

Term2Concept with WSD: exp(ti) = {w | sim(ti, w) ≥ θ}

[Figure: concept graph — professor with HYPONYM (0.749) edges to lecturer,
scholar, teacher, mentor, educator, academic/academician/faculty member,
scientist, researcher; more distant concepts: intellectual, investigator,
artist, director, magician, wizard, alchemist, primadonna]

Weighted expanded query, example:
(professor ∨ lecturer (0.749) ∨ scholar (0.71) ∨ ...)
and Germany
and ( (course ∨ class (1.0) ∨ seminar (0.84) ∨ ... )
and ( „IR“ ∨ „Web search“ (0.653) ∨ ... ) )

relationships quantified by statistical correlation measures,
e.g. Jaccard coefficients: |X∩Y| / |X∪Y|

better recall, better mean precision for hard queries


Query Expansion with Incremental Merging

[M. Theobald et al.: SIGIR 2005]

relaxable query q: ~professor research
with expansions exp(i) = {w | sim(i, w) ≥ θ}
based on ontology relatedness modulating monotonic score aggregation

naive approach: TA scans of the index lists for ∪i∈q exp(i)

Better: dynamic query expansion with
incremental merging of additional index lists

[Figure: concept graph around professor — Hyponym (0.749) and Related (0.48)
edges to lecturer (0.7), scholar (0.6), teacher, mentor, scientist, researcher,
academic/academician/faculty member, … — next to a B+ tree index on terms and
an ontology/meta-index; the index list for professor (12: 0.9, 14: 0.8,
28: 0.6, 17: 0.55, 61: 0.5, …) is merged on demand with the lists for
lecturer (0.7), scholar (0.6), academic (0.53), scientist (0.5), …]

efficient, robust, self-tuning


Query Expansion with Incremental Merging

[M. Theobald et al.: SIGIR 2005]

  • Principles of the algorithm:
  • organize possible expansions Exp(t) = {w | sim(t, w) ≥ θ, t ∈ q}
  • in priority queues based on sim(t, w) for each t
  • for given q = {t1, …, tm}:
  • keep track of active expansions for each ti:
  • ActExp(ti) = {w1(ti), w2(ti), ..., wj(i)(ti)}
  • ~ position j(i) in the priority queue for Exp(ti)
  • scan index lists for t1, w1(t1), ..., wj(1)(t1), ..., tm, w1(tm), ..., wj(m)(tm)
  • maintaining cursors pos(·) and score bounds high(·) for each list
  • in each scan step do for each ti:
  • advance the cursor of the list for the w(ti) ∈ ActExp(ti)
  • for which best := high(·)·sim(w, ti) is largest among all lists of ti
  • or add the next largest w(ti) to ActExp(ti) if sim(w, ti) > best
  • and proceed in this new list
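A simplified sketch of the bookkeeping for one query term: open expansion lists are kept in a priority queue by their weighted bound high(w)·sim(t, w), and the best list is advanced at each step. The data loosely follows the slide's professor example; the full SIGIR'05 algorithm additionally interleaves this per query term and opens lists lazily:

```python
# Sketch: greedy incremental merging of expansion index lists for one term.
import heapq

index = {   # expansion term -> postings sorted by descending score
    "professor": [(12, 0.9), (14, 0.8), (28, 0.6), (17, 0.55), (61, 0.5)],
    "lecturer":  [(37, 0.9), (92, 0.9), (67, 0.9), (44, 0.8), (52, 0.8)],
    "scholar":   [(52, 0.9), (22, 0.7), (23, 0.6), (51, 0.6)],
}
sim = {"professor": 1.0, "lecturer": 0.7, "scholar": 0.6}   # sim(t, w)

def expansion_scan(index, sim, steps):
    """Yield (doc, weighted score) in greedy order of score bounds."""
    # priority queue over open lists: (-high(w)*sim(t, w), term, position)
    pq = [(-l[0][1] * sim[w], w, 0) for w, l in index.items()]
    heapq.heapify(pq)
    for _ in range(steps):
        if not pq:
            return
        bound, w, pos = heapq.heappop(pq)   # list with the largest bound
        doc, s = index[w][pos]
        yield doc, s * sim[w]
        if pos + 1 < len(index[w]):
            heapq.heappush(pq, (-index[w][pos + 1][1] * sim[w], w, pos + 1))

for doc, ws in expansion_scan(index, sim, 5):
    print(doc, round(ws, 3))
```

The scan drains the professor list first (bound 0.9, 0.8), then switches to the lecturer list once its bound 0.9·0.7 = 0.63 dominates; the scholar list is never touched in these five steps.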


Nested Top-k [Theobald et al.: SIGIR’05]

q: research professor XML „information retrieval“ „query result ranking“

[Figure: operator tree — a top-level top-k operator over the index lists for
research, prof, XML, and nested top-k operators over the phrase terms
(information, retrieval) and (query, result, ranking)]

multithreaded coroutine-like evaluation with synchronization of PQs

Nested Top-k [Theobald et al.: SIGIR’05]

q: ~research ~professor XML „information retrieval“ „query result ranking“

[Figure: as before, but with additional nested top-k operators for the
expansions of ~research and ~professor (science, scholar, lecturer, …)]

multithreaded coroutine-style evaluation with synchronization of PQs


Top-k Rank Joins on Structured Data

[Ilyas et al. 2008]

extend TA/NRA/etc. to ranked query results from structured data
(improve over the baseline: evaluate the query, then sort)

Select R.Name, C.Theater, C.Movie
From RestaurantsGuide R, CinemasProgram C
Where R.City = C.City
Order By R.Quality/R.Price + C.Rating Desc

RestaurantsGuide (Name, Type, Quality, Price, City):
  BlueDragon  Chinese   …  €15  SB
  Haiku       Japanese  …  €30  SB
  Mahatma     Indian    …  €20  IGB
  Mescal      Mexican   …  €10  IGB
  BigSchwenk  German    …  €25  SLS
  ...

CinemasProgram (Theater, Movie, Rating, City):
  BlueSmoke  Tombstone  7.5  SB
  Oscar‘s    Hero       8.2  SB
  Holly‘s    Die Hard   6.6  SB
  GoodNight  Seven      7.7  IGB
  BigHits    Godfather  9.1  IGB
  ...

process index lists on quality desc, price asc, rating desc
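The idea can be sketched as an HRJN-style rank join over two score-sorted inputs. The per-tuple scores below are invented stand-ins for quality/price and rating (the slide shows quality only as star icons), so this is an illustration of the bounding principle, not the paper's exact algorithm:

```python
# Sketch: HRJN-style rank join over two inputs sorted by score contribution.
restaurants = [   # (name, city, quality/price score) sorted by score desc
    ("Mescal", "IGB", 0.9), ("BlueDragon", "SB", 0.8),
    ("Mahatma", "IGB", 0.6), ("Haiku", "SB", 0.5),
]
cinemas = [       # (movie, city, rating) sorted by rating desc
    ("Godfather", "IGB", 0.95), ("Hero", "SB", 0.82),
    ("Seven", "IGB", 0.77), ("Tombstone", "SB", 0.75),
]

def rank_join(left, right, k):
    results, seen_l, seen_r = [], [], []
    li = ri = 0
    while li < len(left) or ri < len(right):
        # alternate sorted accesses on both inputs
        if li < len(left):
            l = left[li]; li += 1; seen_l.append(l)
            results += [(l[0], r[0], l[2] + r[2]) for r in seen_r if r[1] == l[1]]
        if ri < len(right):
            r = right[ri]; ri += 1; seen_r.append(r)
            results += [(l[0], r[0], l[2] + r[2]) for l in seen_l if l[1] == r[1]]
        results.sort(key=lambda x: -x[2])
        # upper bound on any join pair not yet produced
        hl = left[li][2] if li < len(left) else 0.0
        hr = right[ri][2] if ri < len(right) else 0.0
        bound = max(left[0][2] + hr, hl + right[0][2])
        if len(results) >= k and results[k - 1][2] >= bound:
            return results[:k]
    return results[:k]

print(rank_join(restaurants, cinemas, 1))   # best pair: Mescal + Godfather
```

Here the top pair is confirmed after a single access to each input, because the bound on all unseen combinations already falls below its score.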


Top-k Sparql Queries over RDF

[S. Elbassuoni 2012]

  • Precompute score-sorted index lists
  • for each S value, P, O, SP value pair, PO, SO
  • Perform rank-join over index lists

?w wrote ?b . ?b hasTag Memoir .

Mark_Twain       wrote  Roughing_it   0.999
J_K_Rowling      wrote  Harry_Potter  0.982
Stephanie_Meyer  wrote  Twilight      0.966
Cormac_McCarthy  wrote  The_Road      0.864
Jimmy_Carter     wrote  Palestine     0.732
………………………………………

An_Hour_before_Daylight  hasTag  Memoir  0.880
The_Grand_Alliance       hasTag  Memoir  0.873
The_Night_Trilogy        hasTag  Memoir  0.800
Roughing_it              hasTag  Memoir  0.796
………………………………………


Top-k Sparql Queries over RDF + Text

[S. Elbassuoni 2012]

  • Precompute score-sorted index lists
  • for each S+keyword, P+keyword, O+…, SP+keyword, PO+…, SO+…
  • Perform rank-join over index lists

?w wrote ?b [nobel prize] . ?b hasTag Memoir

keyword "nobel":
Jimmy_Carter   wrote  Palestine         0.712
Albert_Camus   wrote  The_Rebel         0.667
Doris_Lessing  wrote  Alfred_and_Emily  0.660
………………………………………

keyword "prize":
Ian_McEwan       wrote  On_Chesil_Beach  0.882
Jimmy_Carter     wrote  Palestine        0.774
Cormac_McCarthy  wrote  The_Road         0.623
Jhumpa_Lahiri    wrote  The_Namesake     0.600
………………………………………

An_Hour_before_Daylight  hasTag  Memoir  0.880
The_Grand_Alliance       hasTag  Memoir  0.873
The_Night_Trilogy        hasTag  Memoir  0.800
Roughing_it              hasTag  Memoir  0.796
………………………………………


Outline

Motivation

Searching for Entities & Relations

Informative Ranking

Efficient Query Processing

User Interface

Wrap-up


UI: Table Search by Example & Description

[S. Sarawagi et al.: VLDB‘09]

Expect result table → specify query in tabular form
(QBE, Query By Description)

q1 (QBE): match against table rows
  David Beckham    Real Madrid
  Zinedine Zidane  Juventus
  Lionel Messi     FC Barcelona
q1 (QBD): infer query from table headers ??? (player | club)

q2 (QBE):
  Bayern Munich        Real Madrid  2:1
  1.FC Kaiserslautern  Real Madrid  5:0
  Borussia Dortmund    Real Madrid  5:1
q2 (QBD): output class description:
  soccer club from Germany and winner against Real Madrid


UI: Structured Keyword Search

[Ilyas et al.: SIGMOD‘10]

Need to map (groups of) keywords onto entities & relationships
based on name-entity similarities/probabilities

q: Champions League finals with Real Madrid
  candidate mappings: UEFA Champions League vs. League of Wrestling Champions;
  final match vs. final exam; Real Madrid C.F.

q: German football clubs that won (a match) against Real
  candidate mappings: soccer club vs. social club vs. nightclub;
  soccer vs. American football; sports team;
  Real Madrid C.F. vs. Real Zaragoza vs. Brazilian Real vs. Familia Real
  vs. Real Number

more ambiguity, more sophisticated relation
→ more candidates → combinatorial complexity


UI: Natural Language Questions

  • translate question into SPARQL query:
    • dependency parsing to decompose question
    • mapping of question units onto entities, classes, relations

Which German soccer clubs have won against Real Madrid?

map results into tabular or visual presentation, or speech

UI: Natural Language Questions

  • translate question into SPARQL query:
    • dependency parsing to decompose question:
      Which ?c | ?c German soccer clubs ?r | ?r have won against ?o | ?o Real Madrid ?
    • mapping of question units onto entities, classes, relations:
      Select ?c Where { ?c isa GermanSoccerClub . ?c wonAgainst RealMadrid . }
      Select ?c Where { ?c isa SoccerClub . ?c in Germany .
                        ?c wonAgainst RealMadrid . }
      Select ?c Where { ?c isa SoccerClub . ?c in Germany .
                        ?id: ?c playedAgainst RealMadrid . ?c isWinnerOf ?id . }

map results into tabular or visual presentation, or speech


SPARQL or Keywords or … ?

Who composed scores for westerns and is from Rome?

SPARQL query: Select ?p Where {
  ?p created ?s . ?s type music .
  ?s contributesTo ?m . ?m type movie .
  ?p bornIn Rome . }

Keywords query:
  composer score western Rome
  or: {composer Rome} score western
  or: …

Faceted search with interactive navigation:
  composer | Rome | score | music | movie | western


Natural Language Question Answering

[IBM Journal of R&D 56(3-4), 2012]

IBM Watson in the Jeopardy show:
Which city has two airports named after
a war hero and a battlefield?
used Yago & Dbpedia for type checking

Overall architecture of Watson (simplified):
question
→ Question Analysis: Classification, Decomposition
→ Hypotheses Generation (Search): Answer Candidates
→ Hypotheses & Evidence Scoring
→ Candidate Filtering & Ranking
→ answer


Semantic Technologies in IBM Watson

[A. Kalyanpur et al.: ISWC 2011]

IBM Watson in the Jeopardy show:
Which city has two airports named after
a war hero and a battlefield?
used Yago & Dbpedia for type checking

Semantic checking of answer candidates:
candidate string → Entity Disambiguation & Matching → KB instances
lexical answer type → Predicate Disambiguation & Matching → semantic types
a KB-backed Type Checker, Relation Detection (spatial & temporal relations),
and a Constraint Checker combine these into a candidate score

Scoring of Semantic Answer Types

[A. Kalyanpur et al.: ISWC 2011]; see also: j-archive.com

Check for 1) Yago classes, 2) Dbpedia classes, 3) Wikipedia lists

Match the lexical answer type against class candidates
based on string similarity and class sizes (popularity)

Examples: Scottish inventor → inventor, star → movie star

Compute scores for semantic types, considering:
class match, subclass match, superclass match,
sibling class match, lowest common ancestor, class disjointness, …

                      no types  Yago   Dbpedia  Wikipedia  all 3
Standard QA accuracy   50.1%    54.4%  54.7%    53.8%      56.5%
Watson accuracy        65.6%    68.6%  67.1%    67.4%      69.0%
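The class-match step can be caricatured in a few lines — a hypothetical scorer that combines string similarity with a popularity prior. The `classes` dictionary, the 0.8/0.2 weights, and the scoring formula are all invented for illustration; this is not Watson's actual model:

```python
# Hypothetical sketch: match a lexical answer type against KB class
# candidates by string similarity plus a class-size (popularity) prior.
from difflib import SequenceMatcher

classes = {"inventor": 12000, "movie star": 5000, "star (astronomy)": 9000}

def match_type(lexical_type, classes):
    def score(cls, size):
        sim = SequenceMatcher(None, lexical_type.lower(), cls.lower()).ratio()
        pop = size / max(classes.values())      # popularity prior
        return 0.8 * sim + 0.2 * pop            # invented weighting
    return max(classes, key=lambda c: score(c, classes[c]))

print(match_type("Scottish inventor", classes))  # 'inventor'
print(match_type("star", classes))               # 'movie star'
```

This reproduces the slide's two examples: the popularity prior is what tips "star" toward the frequent movie-star sense rather than the astronomical one.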


From Questions to Queries

Dependency parsing exposes the structure of the question
→ „triploids“ (sub-cues)

NL question:
Who composed scores for westerns and is from Rome?

  • Who composed scores
  • is from Rome
  • scores for westerns


From Triploids to Triples

Who composed scores for westerns and is from Rome?

  • Who composed scores
    ?x composed ?s          →  ?x type composer
    ?x composed scores         ?s type music
  • scores for westerns
    scores contributesTo ?y →  ?s contributesTo ?y
                               ?y type westernMovie
  • Who is from Rome
    ?x bornIn Rome


Disambiguation Mapping for Triploids

Who composed scores for westerns and is from Rome?

q1: Who        → person | musician | The Who
q2: composed   → created | wroteComposition | composed | wroteSoftware
    scores     → soundtrack
q3: scores for → soundtrackFor | shootsGoalFor
q4: westerns   → western movie | Western Digital
    is from    → actedIn | bornIn
    Rome       → Rome (Italy) | Lazio Roma

weighted edges (coherence, similarity, etc.)

Combinatorial Optimization by ILP (with type constraints etc.)


Relaxing Overconstrained Queries

Select ?p Where {
  ?p composed ?s . ?s type music .
  ?s for ?m . ?m type movie .
  ?p bornIn Rome . }

Select ?p Where {
  ?p composed ?s . ?s type music .
  ?s for ?m . ?m type movie[western] .
  ?p bornIn Rome . }

Select ?p Where {
  ?p ?rel1 ?s [composed] . ?s type music .
  ?s ?rel2 ?m . ?m type movie[western] .
  ?p bornIn Rome . }

with extended SPARQL-FullText: SPOX quad patterns
(S. Elbassuoni et al.: CIKM‘10, ESWC’11)



Preliminary Results

(M. Yahya et al.: WWW’12)

http://www.mpi-inf.mpg.de/yago-naga/deanna/



Outline

Motivation

Searching for Entities & Relations

Informative Ranking

Efficient Query Processing

User Interface

Wrap-up


Take-Home Lessons

Semantic Search over Entities and Relationships:
Sparql is the natural choice, but the text extension is crucial

For ranking, LMs are effective, elegant, versatile:
statistically principled & compositional,
applicable to entities, RDF, triples+text, temporal expressions, …

Efficient implementation techniques:
triple indexing, join-order optimization, …
top-k query processing

APIs becoming clear, UIs widely open:
keywords++, tables by example, natural language, speech, …


Open Problems and Grand Challenges

Extend Sparql with Text, Time, Space, …
Influence the W3C standard …

Extended LM-based Ranking for Entities
in Temporal & Spatial (& Social) Context

Efficient and effective (LM-based) Top-k Ranking
for the entire Web of Linked Data (& Text)

Distributed / federated query processing
for the entire Web of Linked Data

Reconcile the expressiveness of Sparql + extensions
with the ease-of-use of natural-language QA


Recommended Readings: Search & Ranking

  • G. Kasneci, F. Suchanek, G. Ifrim: NAGA: Searching and Ranking Knowledge. ICDE 2008
  • S. Elbassuoni, M. Ramanath, G. Weikum: Query Relaxation for Entity-Relationship Search. ESWC 2011
  • S. Elbassuoni, M. Ramanath, et al.: Language-model-based ranking for queries on RDF-graphs. CIKM 2009
  • Z. Nie, Y. Ma, S. Shi, J.-R. Wen, W.-Y. Ma: Web Object Retrieval. WWW 2007
  • H. Bast, A. Chitea, F. Suchanek, I. Weber: ESTER: efficient search on text, entities, and relations. SIGIR 2007
  • J. Pound, P. Mika, H. Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010
  • O. Udrea, D. Recupero, V.S. Subrahmanian: Annotated RDF. ACM Trans. Comput. Log. 11(2), 2010
  • V. Hristidis, H. Hwang, Y. Papakonstantinou: Authority-based keyword search in databases. TODS 33(1), 2008
  • T. Cheng, X. Yan, K. Chang: EntityRank: Searching Entities Directly and Holistically. VLDB 2007
  • D. Vallet, H. Zaragoza: Inferring the Most Important Types of a Query: a Semantic Approach. SIGIR 2008
  • R. Blanco, H. Zaragoza: Finding support sentences for entities. SIGIR 2010
  • P. Serdyukov, D. Hiemstra: Modeling Documents as Mixtures of Persons for Expert Finding. ECIR 2008
  • R. Kaptein, P. Serdyukov, A.P. de Vries, J. Kamps: Entity ranking using Wikipedia as a pivot. CIKM 2010
  • H. Fang, C. Zhai: Probabilistic Models for Expert Finding. ECIR 2007
  • D. Petkova, W.B. Croft: Hierarchical Language Models for Expert Finding in Enterprise Corpora. ICTAI 2006
  • M. Pasca: Towards Temporal Web Search. SAC 2008
  • K. Berberich, S.J. Bedathur, O. Alonso, G. Weikum: A Language Modeling Approach for Temporal Information Needs. ECIR 2010: 13-25
  • J. Tappolet et al.: Applied temporal RDF: efficient temporal querying of RDF data with SPARQL. ESWC 2009
  • C. Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool, 2008
  • B.C. Ooi (Ed.): Special Issue on Keyword Search, IEEE Data Eng. Bull. 33(1), 2010
  • J.X. Yu, L. Qin, L. Chang: Keyword Search in Databases. Morgan & Claypool, 2010
  • S. Ceri, M. Brambilla (Eds.): Search Computing: Challenges and Directions. Springer, 2010
  • M. Arenas, S. Conca, J. Perez: Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. WWW 2012


Recommended readings efficiency ui

Recommended Readings: Efficiency & UI

  • T. Neumann, G. Weikum: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 2010

  • J. Huang, D.J. Abadi, K. Ren: Scalable SPARQL Queryingof Large RDF Graphs. PVLDB 4(11), 2011

  • L. Sidirourgos, R. Goncalves, M.L. Kersten, N. Nes, S. Manegold: Column-store supportfor

  • RDF datamanagement: not all swansarewhite. PVLDB 1(2), 2008

  • R. Delbru, S. Campinas, G. Tummarello: Searching web data: An entityretrievaland high-performance

  • indexingmodel. J. Web Sem. 10, 2012

  • G. Tummarelloet al.: Sig.ma: live views on the web ofdata. WWW 2010

  • M. Schmidt, T. Hornung, G. Lausen, C. Pinkel: SP^2Bench: A SPARQL Performance Benchmark. ICDE 2009

  • M. Morsey, J. Lehmann, S. Auer, A.N. Ngomo: DBpedia SPARQL Benchmark - Performance Assessment

  • withReal Queries on Real Data. ISWC 2011

  • K. Hose, R. Schenkel, M. Theobald, G. Weikum: Database FoundationsforScalable RDF Processing.

  • in: A. Polleres et al. (Eds.): ReasoningWeb - Semantic Technologies for the Web of Data, Springer 2011

  • A. Schwarte, P. Haase, K. Hose, R. Schenkel, M. Schmidt: FedX: OptimizationTechniquesfor

  • Federated Query Processing on Linked Data. ISWC 2011

  • J. Pound, I.F. Ilyas, G.E. Weddell: Expressive and flexible access to web-extracted data:

  • a keyword-based structured query language. SIGMOD 2010

  • R. Pimplikar, S. Sarawagi: Answering Table Queries on the Web using Column Keywords. PVLDB 5(10), 2012

  • A. Kalyanpur, J.W. Murdock, J. Fan, C.A. Welty: Leveraging Community-Built Knowledge for Type Coercion

  • in Question Answering. ISWC 2011

  • C. Unger, L. Bühmann, J. Lehmann, A.N. Ngomo, D. Gerber, P. Cimiano: Template-based question answering

  • over RDF data. WWW 2012

  • M. Yahya. K. Berberich, S. Elbassuoni, M. Ramanath, V. Tresp, G. Weikum:

  • Natural Language Questionsforthe Web ofData. EMNLP 2012

  • D.A. Ferrucci et al.: BuildingWatson: An OverviewoftheDeepQA Project. AI Magazine 31(3), 2010

  • IBM Journal of Research and Development 56(3-4), Special Issue on “This is Watson”, 2012


Thank You!