Shady Elbassuoni, Luis Galarraga, Peter Haase, Katja Hose, Hassan Issa, Steffen Metzger, Maya Ramanath, Michael Schmidt, Andreas Schwarte, Michael Stoll, Marcin Sydow, Gerhard Weikum


Semantic Information

Resource Description Framework:

  • Represent knowledge about resources (things) in a machine-readable way.

  • Resources and their relations identified by URIs

  • Statements (triples) with prefixes represent facts

PREFIX yago: <http://www.mpii.de/yago/resource/>

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

yago:John_Doe foaf:name “John Doe”

Subject: yago:John_Doe = <http://www.mpii.de/yago/resource/John_Doe>

Predicate: foaf:name = <http://xmlns.com/foaf/0.1/name>

Object: “John Doe”
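As a minimal sketch of how the prefixes shorten the URIs in a triple, the following Python snippet expands the prefixed names of the example statement. The function name expand and the tuple layout are assumptions made for illustration; they are not part of RDF or of the original slides.

```python
# Prefix table taken from the slide.
PREFIXES = {
    "yago": "http://www.mpii.de/yago/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(term: str) -> str:
    """Expand a prefixed name such as 'foaf:name' into a full URI.

    Literals (terms starting with a double quote) are returned unchanged.
    """
    if term.startswith('"'):
        return term
    prefix, _, local = term.partition(":")
    return f"<{PREFIXES[prefix]}{local}>"


# The example statement: (subject, predicate, object).
triple = ("yago:John_Doe", "foaf:name", '"John Doe"')
expanded = tuple(expand(term) for term in triple)
print(expanded)
```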

Ralf Schenkel


RDF & SPARQL

RDF data can be seen as data graph

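[Figure: data graph with the nodes yago:John_Doe, yago:Max_Mustermann, “John Doe”, and “Max Mustermann”; foaf:name edges link each person to its name, and a foaf:knows edge links the two persons]

SPARQL: Query language for RDF from the W3C, for graph pattern queries on the knowledge base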

Ontologies for Representing Knowledge

[Figure: example ontology. Classes: resource, person, location, scientists, politician, city, connected by subclassOf edges. Relations: bornIn (with domain and range) and bornOn. Instances/entities (URIs): an entity with the labels “Barack Obama” and “44th president”, linked to classes by isA edges, with bornIn Honolulu and bornOn 04-08-1961]

Single fact: Triple (subject, predicate, object)

Example: (Barack_Obama, bornIn, Honolulu)

SPARQL – Example

[Figure: example knowledge graph. Classes: scientist, actor, vegetarian, physicist, chemist (isA edges). Persons: Mike_Myers, Jim_Carrey, Albert_Einstein, Otto_Hahn. bornIn edges lead to Scarborough, Newmarket, Ulm, and Frankfurt; locatedIn edges lead to Ontario and Germany, and from there to Canada and Europe]

Example query: Find all actors from Ontario (that are in the knowledge base)

SPARQL – Example

Example query: Find all actors from Ontario (that are in the knowledge base)

SELECT ?person WHERE { ?person isA actor . ?person bornIn ?loc . ?loc locatedIn Ontario . }

Find subgraphs of this form:

[Figure: query graph pattern ?person isA actor, ?person bornIn ?loc, ?loc locatedIn Ontario; actor and Ontario are constants, ?person and ?loc are variables. The pattern is matched against the knowledge graph of the previous slide]
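Finding subgraphs of the given form can be sketched as a small backtracking matcher in Python. The triple list below is reconstructed from the node and edge labels of the slide, so the exact edges are an assumption, and the function name match is hypothetical; real SPARQL engines use indexes and join algorithms rather than this exhaustive scan.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Knowledge graph reconstructed from the figure labels (assumed edges).
KB: List[Triple] = [
    ("Mike_Myers", "isA", "actor"),
    ("Jim_Carrey", "isA", "actor"),
    ("Albert_Einstein", "isA", "physicist"),
    ("Otto_Hahn", "isA", "chemist"),
    ("Mike_Myers", "bornIn", "Scarborough"),
    ("Jim_Carrey", "bornIn", "Newmarket"),
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Otto_Hahn", "bornIn", "Frankfurt"),
    ("Scarborough", "locatedIn", "Ontario"),
    ("Newmarket", "locatedIn", "Ontario"),
    ("Ulm", "locatedIn", "Germany"),
    ("Frankfurt", "locatedIn", "Germany"),
    ("Ontario", "locatedIn", "Canada"),
    ("Germany", "locatedIn", "Europe"),
]


def match(patterns: List[Triple], kb: List[Triple],
          binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield every variable binding under which all patterns occur in kb."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in kb:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Variable: bind it, or check it against the earlier binding.
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                # Constant: must match exactly.
                consistent = False
                break
        if consistent:
            yield from match(rest, kb, candidate)


query = [
    ("?person", "isA", "actor"),
    ("?person", "bornIn", "?loc"),
    ("?loc", "locatedIn", "Ontario"),
]
actors = sorted({b["?person"] for b in match(query, KB)})
print(actors)
```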

Examples for Semantic Data

  • General Knowledge Bases: DBPedia, Freebase, YAGO

  • Domain-specific knowledge: Biology, Geo, Government, Publications, Movies, Songs, …

  • Linked Open Data as large integrated knowledge base

Semantic Data Grows Rapidly

Biggest reported application (telecommunication data): >1 trillion triples

More than 31 billion triples in the LOD cloud

DBpedia: 3.6 million entities, 1.2 billion triples

Queries can be complex, too

SELECT DISTINCT ?a ?b ?lat ?long WHERE

{ ?a dbpedia:spouse ?b.

?a dbpedia:wikilink dbpediares:actor.

?b dbpedia:wikilink dbpediares:actor.

?a dbpedia:placeOfBirth ?c.

?b dbpedia:placeOfBirth ?c.

?c owl:sameAs ?c2.

?c2 pos:lat ?lat.

?c2 pos:long ?long.

}

Find actors that are married to each other and were born in the same place, together with coordinates of that place

Q7 on BTC2008 in [Neumann & Weikum, 2009]

Outline of the Talk

  • Introduction

  • Querying Federations of Knowledge Bases

  • Building and Querying Distributed RDF Stores

  • Information Extraction and SPARQL extensions

  • Cooperative Knowledge Services

Motivation: Federated Execution

  • Linked Open Data

    • includes cross-collection links

    • supports cross-collection queries on large virtual collection

    • stored in different servers

  • Naive query execution:

    • Copy all data to central server

    • Execute query at central server

Many problems: volume of data (>31 billion triples), changes of base data, sources may not provide RDF dump (only SPARQL access)

Better: purely virtual integration by federation of sources

Federated Query Processing

Federation layer at central server

  • Computes (distributed) execution plan

  • Fetches subresults from local sources (SPARQL)

  • Combines subresults

  • Advantages:

    • Access to live data

    • No local storage and maintenance

    • On-demand access to sources

  • But:

    • Sources provide only limited level of cooperation

    • Only limited information about data in each source

    • User must select sources to include in federation

[Figure: a query is sent to the federation layer, which accesses three data sources, each through its own SPARQL endpoint]

Naive Federated Processing

  • Iteratively evaluate triple patterns at all sources

  • For each resulting binding, fill value in next triple pattern and submit to all sources (nested loop join)

  • Continue until all patterns are evaluated

Example: 3 triple patterns, 4 sources

?Country ns:capital ?Capital.
?Country ns:population ?CountryPop.
?Capital ns:population ?CapitalPop.

  • Evaluate the first pattern (?Country ns:capital ?Capital) at all sources:

    • 200 results from source 1: (?Country, ?Capital) bindings

    • no results from other sources

    • overall 4 requests

  • For each of the 200 ?Country bindings:

    • replace ?Country by value (e.g., „Austria ns:population ?CP“)

    • submit to all sources

    • overall 200*4 requests, 100 results (the same from sources 2 and 3)

  • For each of the 100 ?Capital bindings with matching ?Country binding:

    • replace ?Capital by value (e.g., „Wien ns:population ?CP“)

    • submit to all sources

    • overall 100*4 requests, 100 results

Many unnecessary requests:

Sources do not have results or overlap in results; inefficient nested-loop (NL) join

Our approach:

Apply techniques from logical, physical, and cost-based query optimization
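The request count of the naive strategy in the example can be added up with a few lines of Python. The function name naive_request_count is an assumption for illustration; the numbers are the ones given in the example (4 sources, 200 and then 100 input bindings).

```python
from typing import List


def naive_request_count(num_sources: int, bindings_per_step: List[int]) -> int:
    """Count remote requests of the naive nested-loop evaluation.

    The first pattern is sent once to every source. For every later pattern,
    each input binding is substituted and sent to every source again.
    """
    total = num_sources
    for num_bindings in bindings_per_step:
        total += num_bindings * num_sources
    return total


# 4 + 200*4 + 100*4 requests for the three patterns of the example.
requests = naive_request_count(4, [200, 100])
print(requests)
```

Under these assumptions the three triple patterns cost 1204 requests, although only a handful of sources contribute results.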

Query Optimization in FedX

Specific optimization techniques in FedX:

  • Source Selection

  • Exclusive Groups

  • Join Order

  • Bound Joins

A. Schwarte, P. Haase, K. Hose, R. Schenkel, M. Schmidt: ESWC 2011 (demo), ISWC 2011

Technique 1: Source Selection

Which sources contribute results for a pattern?

  • One SPARQL ASK request per source

  • Local cache to reduce remote communication (with time-based invalidation) → save on subsequent queries with this pattern

  • Annotate triple patterns with relevant sources (for constructing the query)

Example: Federation (DBpedia, NYTimes, LinkedMDB)

?Country ns:capital ?Capital.

DBpedia: ASK { ?Country ns:capital ?Capital. } → TRUE

NYTimes: ASK { ?Country ns:capital ?Capital. } → FALSE

LinkedMDB: ASK { ?Country ns:capital ?Capital. } → FALSE

→ only DBpedia relevant for this triple pattern
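A minimal Python sketch of ASK-based source selection with a time-invalidated cache could look as follows. The class SourceSelector, its parameters, and the fake_ask stand-in for remote ASK requests are all hypothetical; this is not the FedX implementation, only an illustration of the idea on the slide.

```python
import time
from typing import Callable, Dict, List, Tuple

# An "ask" function takes (source, triple pattern) and returns True/False.
AskFn = Callable[[str, str], bool]


class SourceSelector:
    """Finds relevant sources per triple pattern, caching ASK results."""

    def __init__(self, sources: List[str], ask: AskFn,
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._sources = sources
        self._ask = ask
        self._ttl = ttl_seconds
        self._clock = clock
        # (source, pattern) -> (ASK result, time of the request)
        self._cache: Dict[Tuple[str, str], Tuple[bool, float]] = {}

    def relevant_sources(self, pattern: str) -> List[str]:
        now = self._clock()
        relevant = []
        for source in self._sources:
            cached = self._cache.get((source, pattern))
            if cached is None or now - cached[1] > self._ttl:
                # Cache miss or expired entry: issue a new ASK request.
                cached = (self._ask(source, pattern), now)
                self._cache[(source, pattern)] = cached
            if cached[0]:
                relevant.append(source)
        return relevant


# Stand-in for remote ASK requests; records which sources were contacted.
calls: List[str] = []


def fake_ask(source: str, pattern: str) -> bool:
    calls.append(source)
    return source == "DBpedia"


selector = SourceSelector(["DBpedia", "NYTimes", "LinkedMDB"], fake_ask)
first = selector.relevant_sources("?Country ns:capital ?Capital")
second = selector.relevant_sources("?Country ns:capital ?Capital")
print(first, second, len(calls))
```

The second lookup is answered from the cache, so only three ASK requests are sent in total.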

Technique 2: Exclusive Groups

Group joining triple patterns with the same single relevant source

  • Needs only a single request

  • Evaluate join at the source, no communication needed

Example: Federation (DBpedia, NYTimes, LinkedMDB)

SELECT ?President ?Party ?Title WHERE {
?President rdf:type dbpedia:President .
?President dbpedia:Party ?Party .
?President dc:title ?Title .
}

Source Selection:

  • ?President rdf:type dbpedia:President @ DBpedia

  • ?President dbpedia:Party ?Party @ DBpedia

  • ?President dc:title ?Title @ DBpedia, NYTimes

Exclusive Group: the first two triple patterns (only DBpedia is relevant)

→ Execute multiple triple patterns in a single request
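The grouping step can be sketched in Python as follows. The function name exclusive_groups is an assumption, and the sketch simplifies the technique: it groups all patterns with the same single relevant source and does not check that the grouped patterns actually join on a shared variable.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Annotated = Tuple[str, List[str]]  # (triple pattern, relevant sources)


def exclusive_groups(annotated: List[Annotated]
                     ) -> Tuple[Dict[str, List[str]], List[Annotated]]:
    """Split patterns into per-source exclusive groups and the rest."""
    groups: Dict[str, List[str]] = defaultdict(list)
    remaining: List[Annotated] = []
    for pattern, sources in annotated:
        if len(sources) == 1:
            # Exactly one relevant source: the pattern can join its group.
            groups[sources[0]].append(pattern)
        else:
            remaining.append((pattern, sources))
    return dict(groups), remaining


# Source annotations from the example on the slide.
annotated = [
    ("?President rdf:type dbpedia:President", ["DBpedia"]),
    ("?President dbpedia:Party ?Party", ["DBpedia"]),
    ("?President dc:title ?Title", ["DBpedia", "NYTimes"]),
]
groups, remaining = exclusive_groups(annotated)
print(groups)
print(remaining)
```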

Technique 3: Join Order

Determine optimal execution order of

  • triple patterns

  • joins

in order to minimize intermediate results

Example: Federation (DBpedia, LinkedMDB), 100 results

SELECT ?actor WHERE {
?actor rdf:type imdb:actor .
?actor bornIn Salzburg .
}

  • First triple pattern: >1 million results in LinkedMDB

  • Second triple pattern: 1000 results in DBpedia

→ Execute second triple pattern first

Need for selectivity and join statistics at federated level
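As a rough illustration, ordering triple patterns by estimated result size can be written in a few lines of Python. The function name order_by_cardinality is hypothetical, the value 1,000,000 stands in for the ">1 million" estimate of the example, and sorting by cardinality alone is a simplification of real join-order optimization.

```python
from typing import List, Tuple


def order_by_cardinality(estimates: List[Tuple[str, int]]) -> List[str]:
    """Return the patterns ordered by ascending estimated result count."""
    return [pattern for pattern, _ in sorted(estimates, key=lambda e: e[1])]


plan = order_by_cardinality([
    ("?actor rdf:type imdb:actor", 1_000_000),  # lower bound from the example
    ("?actor bornIn Salzburg", 1000),
])
print(plan)
```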

Technique 4: Bound Joins

Perform joins in a block nested loop fashion

  • Connect bound triple patterns with SPARQL UNIONs

  • Apply local post-processing to retain correctness

  • Rename variables to represent original bindings

Example: Process join for patterns (?S type U) and (?S p ?O), where results for left argument (?S type U) are already computed

Block Input: ?S=s1, ?S=s2, ?S=s3, ?S=s4, ?S=s5

Before (NLJ)

SELECT ?O WHERE { s1 p ?O }
SELECT ?O WHERE { s2 p ?O }
SELECT ?O WHERE { s3 p ?O }
SELECT ?O WHERE { s4 p ?O }
SELECT ?O WHERE { s5 p ?O }

Now (bound joins)

SELECT ?O_1 ?O_2 .. ?O_5 WHERE {
{ s1 p ?O_1 } UNION
{ s2 p ?O_2 } UNION
..
{ s5 p ?O_5 } }

→ Execute in a single remote request
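The rewriting and the local post-processing can be sketched in Python as below. The function names bound_join_query and restore_bindings are assumptions, and result rows are modeled as dictionaries containing only the variable bound by the matching UNION branch; this is a sketch of the idea, not the FedX code.

```python
from typing import Dict, List, Tuple


def bound_join_query(block: List[str], predicate: str) -> str:
    """Build one UNION query for a block of subject bindings."""
    variables = " ".join(f"?O_{i}" for i in range(1, len(block) + 1))
    branches = " UNION ".join(
        f"{{ {subject} {predicate} ?O_{i} }}"
        for i, subject in enumerate(block, start=1)
    )
    return f"SELECT {variables} WHERE {{ {branches} }}"


def restore_bindings(block: List[str],
                     rows: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Map each renamed variable ?O_i back to the input binding s_i."""
    results = []
    for row in rows:
        for name, value in row.items():
            index = int(name.rsplit("_", 1)[1])
            results.append((block[index - 1], value))
    return results


query = bound_join_query(["s1", "s2"], "p")
print(query)

# Hypothetical result rows as they might come back from the endpoint.
joined = restore_bindings(["s1", "s2"], [{"?O_1": "a"}, {"?O_2": "b"}])
print(joined)
```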

Evaluation

Benchmarks using FedBench: SPARQL Federation

Often large improvements over state-of-the-art systems

Revisiting the Source Selection Problem

SPARQL example (simplified – no prefixes etc.):

SELECT ?a WHERE
{ ?a dc:authorOf ?p .
?p dc:publishedAt SIGMOD2012 . }

Source selection problem: Which of the 325 sources to query?

Many sources contain the same facts

Many duplicate results

Many unnecessary requests

Obvious problem: overlapping sources

Example for Overlapping Sources

  • 6 results overall

  • 2 sources enough to retrieve all results

  • Source 1 alone is „optimal“ if

    • only one access possible

    • or 5 results are enough

[Figure: result sets of Source 1, Source 2, and Source 3 and their overlap]

Our contribution: Determine „optimal“ set of sources without seeing the results

[[email protected] 2012]

Problem Definition

Given a SPARQL query with triple patterns P and possible sources S, compute a query plan qp ⊆ P × S (which pattern is executed at which source) such that one of the following goals is met:

  • all results are retrieved with a minimal number of requests to sources (minimal exact plan)

  • as many results as possible are retrieved with |qp| ≤ max (maximize recall)

  • as few requests as possible are performed to retrieve at least r results (minimal approximate plan)

BBQ: High-Level Solution Overview

  • Extend ASK operation to provide concise yet expressive summary of result bindings of each variable (instead of boolean yes/no)

  • Estimate source overlap with summaries

  • Select sources incrementally based on benefit

    Functional properties of summaries for sets:

    • Size of set (number of distinct elements)

    • Size of union of two sets

    • Size of intersection of two sets

    • Summary smaller than the data

    • Data must not be reproducible from the summary

      Examples: Bloom filters, KMV synopsis, …

Significant reduction of query cost compared to standard solutions
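The summary properties listed above can be illustrated with a small Bloom filter in Python. The class BloomSummary, its parameters, and the hashing scheme are assumptions for illustration and not the summaries used in BBQ; the size estimate uses the standard formula n ≈ -(m/k)·ln(1 - X/m) for X set bits out of m with k hash functions, and the intersection is estimated by inclusion-exclusion.

```python
import hashlib
import math
from typing import Iterator


class BloomSummary:
    """Bloom filter used as a set summary with size estimation."""

    def __init__(self, num_bits: int = 4096, num_hashes: int = 3,
                 bits: int = 0) -> None:
        self.m = num_bits
        self.k = num_hashes
        self.bits = bits  # bit vector stored as a Python integer

    def _positions(self, item: str) -> Iterator[int]:
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def size(self) -> float:
        """Estimate the number of distinct elements from the set bits."""
        set_bits = bin(self.bits).count("1")
        if set_bits >= self.m:
            return float("inf")
        return -(self.m / self.k) * math.log(1 - set_bits / self.m)

    def union(self, other: "BloomSummary") -> "BloomSummary":
        """Summary of the union: bitwise OR of both bit vectors."""
        return BloomSummary(self.m, self.k, self.bits | other.bits)

    def intersection_size(self, other: "BloomSummary") -> float:
        """Estimate |A ∩ B| as |A| + |B| - |A ∪ B|."""
        return self.size() + other.size() - self.union(other).size()


a, b = BloomSummary(), BloomSummary()
for i in range(50):
    a.add(f"x{i}")          # A = x0 .. x49
for i in range(25, 75):
    b.add(f"x{i}")          # B = x25 .. x74
print(a.size(), a.union(b).size(), a.intersection_size(b))
```

The estimates are approximate by construction; the summary reveals set sizes and overlap but not the elements themselves.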

Source Selection for Single Triple Pattern

  • Benefit of a source: number of new results it can contribute

  • Incremental selection algorithm:

    • Maintain summary for union of results from sources already selected

    • Estimate source benefit from summary

    • Select source with highest benefit

  • Stop when target (# results or # requests) reached

  • Finally: Evaluate triple pattern at all selected sources; select more sources if too few results
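The incremental selection can be sketched in Python as follows. As a loudly stated simplification, exact Python sets stand in for the summaries (real summaries would only give estimates), and the function name select_sources as well as the example result sets are hypothetical; the sets are chosen to be consistent with the counts of the overlapping-sources example (5, 2, and 3 results, 6 overall).

```python
from typing import Dict, List, Optional, Set, Tuple


def select_sources(summaries: Dict[str, Set[str]],
                   max_requests: Optional[int] = None,
                   target_results: Optional[int] = None
                   ) -> Tuple[List[str], int]:
    """Greedily select sources by benefit (number of new results)."""
    selected: List[str] = []
    covered: Set[str] = set()   # summary of the union of selected sources
    remaining = dict(summaries)
    while remaining:
        if max_requests is not None and len(selected) >= max_requests:
            break
        if target_results is not None and len(covered) >= target_results:
            break
        # Benefit of a source: results not yet covered by selected sources.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining source contributes anything new
        covered |= remaining.pop(best)
        selected.append(best)
    return selected, len(covered)


# Hypothetical result sets, consistent with the counts of the example.
summaries = {
    "Source 1": {"r1", "r2", "r3", "r4", "r5"},
    "Source 2": {"r5", "r6"},
    "Source 3": {"r3", "r4", "r6"},
}
exact_plan = select_sources(summaries)
one_request = select_sources(summaries, max_requests=1)
five_results = select_sources(summaries, target_results=5)
print(exact_plan, one_request, five_results)
```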

Example (Single Triple Pattern)

[Figure: bit-vector summaries of the results of Source 1, Source 2, and Source 3, together with the current result summary; the individual bit values are not recoverable from the transcript]

1. ASK each source

2. Select source with highest number of results

3. Stop if stopping condition is met (recall or number of results)

4. Compute benefit for each remaining source

   Source 2: 2 - (overlap with current result summary) = 1

   Source 3: 3 - (overlap with current result summary) = 1

5. Select source with highest benefit

6. Continue with step 3

Star-Shaped Queries

Multiple triple patterns with a single identical variable

  • Not enough to consider each triple pattern separately

  • Need to focus on the intersection of the result sets

  • Extended incremental algorithm:

    • Init: Pick one source for each triple pattern with most results

    • Benefit of evaluating a triple pattern at a source: number of new results in the intersection

    • Estimated by intersection of per-pattern summaries (union of summaries from each selected source)

?x imdb:gender „female“. ?x imdb:bornIn dbpedia:Germany. ?x imdb:actedIn imdb:Titanic.
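The benefit computation for star-shaped queries can be sketched in Python. Exact sets again stand in for the per-pattern summaries, and the function name star_benefit and the binding values are hypothetical; the point is only that benefit is measured on the intersection across patterns, not per pattern.

```python
from typing import Dict, Set


def intersection(per_pattern: Dict[str, Set[str]]) -> Set[str]:
    """Bindings of the shared variable that satisfy all triple patterns."""
    sets = list(per_pattern.values())
    return set.intersection(*sets) if sets else set()


def star_benefit(per_pattern: Dict[str, Set[str]], pattern: str,
                 source_summary: Set[str]) -> int:
    """Number of new results in the intersection if `pattern` is also
    evaluated at a source with the given summary."""
    before = intersection(per_pattern)
    after = dict(per_pattern)
    after[pattern] = per_pattern[pattern] | source_summary
    return len(intersection(after) - before)


# Per-pattern summaries (union over the sources selected so far).
per_pattern = {
    "gender": {"a", "b", "c"},
    "bornIn": {"a", "b"},
    "actedIn": {"a"},
}
useful = star_benefit(per_pattern, "actedIn", {"b", "d"})
useless = star_benefit(per_pattern, "actedIn", {"d"})
print(useful, useless)
```

A source that adds many bindings for one pattern may still have zero benefit if none of them satisfies the other patterns.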

Complex Queries

Queries with >1 variable and >1 triple patterns

  • Summaries not applicable for whole query:

    • no connection of summaries for variables ?m and ?p

    • Do new bindings for ?p join with existing bindings for ?m ?

  • But: separate source selection for each pattern possible

  • Plus: exclude join candidates at execution time → reduces effort for nested-loop joins; run full query at sources if no cross-joins possible

imdb:Tom_Cruise imdb:actedIn ?m. ?m imdb:producedBy ?p.

naive: 3x3 joins

improved: 6 joins

best: 3 local joins

Experimental Evaluation: Setup

  • RDF dataset from first 100,000 IMDB movies and their actors and directors

  • Generate overlapping partitions

    • For movies based on genre (28 partitions)

    • For persons based on birthplace and birthdate (22 partitions)

  • Queries:

    • 20 single triple patterns

    • 20 star-shaped queries

  • Consider minimal exact plan

  • Bloom filters of different sizes, kmv synopsis

Triple Pattern Queries

Far fewer requests while retrieving (almost) all results

Extensions of Federated Processing

  • Increase sources' level of cooperation:

    • Export extensive selectivity and join statistics (improves federated join order)

    • Interfaces beyond SPARQL (enables more efficient joins)

  • Caching of data at federated level (reduces latency and risk of unavailable sources)

  • Best-effort execution for given cost budget (time, messages, money), considering

    • overlap of sources

    • fraction of results retrieved

    • quality of a source (correctness, trust, recency)

Motivation: Distributed RDF

Improve storing and querying of RDF in one system by distributing it over multiple machines

  • Improve storage capacity (rule of thumb: 50GB per 1 billion triples)

  • Improve query processing performance by

    • Keeping data in memory

    • Exploiting parallelism

  • General approach:

    • Build small fragments of the data

    • Allocate fragments to nodes

    • Rewrite SPARQL queries to consider distributed data

Partout: High-Level Architecture

Partout: Workload-Based Fragmentation

  • Consider “typical” query workload

  • Use triple patterns in queries for fragmentation

SELECT ?s ?o WHERE { ?s foaf:name ?o. }

  • For two triple patterns P1, P2:

    • Consider all combinations of Pi and their negation: P1 ∧ P2, P1 ∧ ¬P2, ¬P1 ∧ P2, ¬P1 ∧ ¬P2

    • Each combination defines a fragment

  • Number of fragments exponential in number of triple patterns, but usually ok (many must be empty)
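A minimal Python sketch of this fragmentation assigns each triple to the combination of workload patterns it matches. The function names matches and fragment, the second pattern, and the example triples are hypothetical; the sketch ignores repeated variables within a pattern and is not the Partout implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def matches(pattern: Triple, triple: Triple) -> bool:
    """True if every constant of the pattern equals the triple's value."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))


def fragment(triples: List[Triple],
             patterns: List[Triple]) -> Dict[Tuple[bool, ...], List[Triple]]:
    """Group triples by which patterns they match.

    The key (True, False) is the fragment P1 AND NOT P2, and so on.
    Only non-empty fragments appear in the result.
    """
    fragments: Dict[Tuple[bool, ...], List[Triple]] = defaultdict(list)
    for triple in triples:
        key = tuple(matches(p, triple) for p in patterns)
        fragments[key].append(triple)
    return dict(fragments)


patterns = [
    ("?s", "foaf:name", "?o"),           # P1, from the workload query above
    ("?s", "rdf:type", "foaf:Person"),   # P2, hypothetical second pattern
]
triples = [
    ("yago:John_Doe", "foaf:name", '"John Doe"'),
    ("yago:John_Doe", "rdf:type", "foaf:Person"),
    ("yago:John_Doe", "foaf:knows", "yago:Max_Mustermann"),
]
fragments = fragment(triples, patterns)
print(fragments)
```

In this toy input the fragment for P1 ∧ P2 stays empty, which illustrates why the exponential number of combinations is usually not a problem in practice.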

Fragment Allocation and Querying

  • Allocate fragments to hosts such that

    • Execution of each workload query is cheap by allocating its fragments at same host

    • Hosts receive balanced load and limited number of triples

  • Formulate as Integer Linear Program

Use greedy heuristics for large optimization problem

Details for Fragment Allocation

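[Figure: formulas of the Integer Linear Program, not recoverable from the transcript. Annotated parts: balanced load, local queries. Quantities used:]

  • number of queries where fragments m and m' appear together

  • size of fragment

  • frequency of fragment

  • aggregated load of all fragments allocated to host h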

Query Processing

  • Query processing similar to federated case, but

    • Full information about triple location

    • Full information about local statistics

    • More complex operations (semi joins etc.)

  • Two-stage query optimization:

    • Start with RDF-3X query plan using aggregated statistics from complete dataset

    • Transform + optimize for distributed setup

  • Extension of RDF-3X cost model to consider communication costs.

Example: Query optimization

Initial RDF-3X query plan

Step 1: Source identification

PREFIX db: <http://dbpedia.org/resource/>

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?name

WHERE{

?z db:name ?name .

?z rdf:type db:city .

?z db:located db:USA .

}

Example: Query optimization

Step 2: Merge-Union Operators

Example: Query optimization

Step 3: optimizations and host allocation for inner operators

Evaluation: Billion Triple Challenge 2008

  • 500 million triples, 3 hosts

  • competitors:

    • centralized RDF-3X

    • property-based distribution

    • graph partitioning (HAR+, VLDB 2011)

Significant advantage for Partout in response time (and throughput)

Question Answering with the Web

Limits in Entities and Facts

List Questions on the Web

Limits in Query Complexity

Use case „Goethe tour“

  • Problem: Build interesting tour that combines places Goethe visited at least once

  • Combines (historic and encyclopedic) text from libraries and TextGrid, information about historic names, (routable) maps, hotel portals, …

  • Workflow:

    • Search texts about Goethe

    • Extract locations

    • Map to current locations

    • Assess interestingness …

Searching is difficult

  • How is Goethe mentioned in the text? „Johann Wolfgang von Goethe“, „Goethe“, „Goete“, „the author of Faust“

  • Difficult to restrict results to Goethe's travels

    • Extend query by „travel“, „trip“, „stay“?

    • Could miss important results!

  • Documents need to be read completely to extract important knowledge

    • Places that Goethe visited

    • Additional information on these places, e.g.,

      • Is the place in Germany?

      • Are there any interesting sights there?

Named Entity Recognition

Automated Fact Extraction

Structured Queries

Background Knowledge

Step 1: Named Entity Recognition

  • Goal: Map entity occurrences in texts to

    • predefined categories (persons, locations, …)

    • predefined lists of entities (Goethe, Schiller, …)

  • Input: Background knowledge base (YAGO, …)

    • Entities with their textual representations (Goethe: „Goethe“, „Goete“, „Herr Geheimer Rath“, …)

    • Mapping of entities to categories (Goethe is an author, is a person, …)

    • Relationships to other entities (Goethe was born in Frankfurt, died in Weimar, …)

This talk

Example: Named Entity Recognition

Goethe was born in Frankfurt in August 1749.

Identify the „correct“ Frankfurt based on context in the document

Goal: coherent map of all entity occurrences

Knowledge base:

  • Goethe label „Goethe“

  • Frankfurt(Main) label „Frankfurt“

  • Frankfurt(Oder) label „Frankfurt“

Step 2: From text to facts

Text passages (translated from German):

  • During a visit to Dresden…

  • Goethe left Dresden…

  • …on the 25th Goethe was back in Dresden…

  • Goethe visited Dresden several times…

  • Afterwards Goethe travelled to Dresden

Abstract fact: Goethe located_in Dresden

Approach:

Determine typical text patterns for representation of facts, based on a few training patterns and already known facts

[Figure: text passages → patterns and their meaning → facts]

Text passage: Goethe was born in Frankfurt

Pattern p1: X was born in Y (p1 represents born_in)

Fact: Goethe born_in Frankfurt(Main)

Background: Pattern-Based Extraction

Automatically extract (RDF) facts through textual patterns (learned from training facts)

Text: Goethe’s birthplace Frankfurt, …

Knowledge base: Goethe bornIn Frankfurt

Pattern p: X’s birthplace Y

p expresses bornIn (with a confidence in the pattern)

Background: Pattern-Based Extraction

Automatically extract (RDF) facts through textual patterns (learned from training facts)

Text: Johann Wolfgang von Goethe was born in Frankfurt, Germany, as son of Johann Caspar Goethe and Catharina Elisabeth Goethe. He …

Pattern p1: X was born in Y

p1 expresses bornIn (with a confidence in the pattern)

Knowledge base: Goethe bornIn Frankfurt
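Applying already learned patterns to text can be sketched with regular expressions in Python. The regular expressions, the ENTITY helper, and the function name extract_facts are assumptions for illustration; the sketch returns surface strings and skips both pattern learning and the mapping of names to knowledge-base entities (named entity recognition), which the slides treat as separate steps.

```python
import re
from typing import List, Tuple

# Very rough stand-in for an entity mention: capitalized words,
# optionally joined by the particle "von".
ENTITY = r"[A-Z]\w+(?:\s(?:von\s)?[A-Z]\w+)*"

# Textual patterns and the relation each one expresses.
PATTERNS = [
    (re.compile(rf"(?P<x>{ENTITY}) was born in (?P<y>{ENTITY})"), "bornIn"),
    (re.compile(rf"(?P<x>{ENTITY})[’'`]s birthplace (?P<y>{ENTITY})"), "bornIn"),
]


def extract_facts(text: str) -> List[Tuple[str, str, str]]:
    """Return (subject, relation, object) triples found by the patterns."""
    facts = []
    for regex, relation in PATTERNS:
        for found in regex.finditer(text):
            facts.append((found.group("x"), relation, found.group("y")))
    return facts


sentence = ("Johann Wolfgang von Goethe was born in Frankfurt, Germany, "
            "as son of Johann Caspar Goethe and Catharina Elisabeth Goethe.")
print(extract_facts(sentence))
print(extract_facts("Goethe’s birthplace Frankfurt, …"))
```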

Knowledge Workbench (TextGrid)

Check (and correct) fact, including provenance

source selection

Overview of facts and entities in this document

Export to ontology (with consistency check)

Confidence-based filter

[CIKM 2012]

Extensions of the core annotations

  • Expressiveness of the representation:

    time, purpose, belief, …

    • Store as meta facts (facts about facts, reification)

  • Facts about text passages:

    time & location, style, links to other text passages

  • Several layers of annotation (fictional vs. real): real locations, but fictional characters

fact#1: Goethe located_in Dresden

fact#1 hasStartDate 1794-Aug-03

fact#1 hasEndDate 1794-Aug-11

fact#1 hasPurpose Visit

Did Goethe and Schiller ever meet in Dresden?
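A question like this can be answered from meta facts by checking whether two stays overlap in time. The Python sketch below uses fact#1 from the slide; fact#2 and its dates are invented purely for illustration and are not historical data, and the function names stays and overlapped are hypothetical.

```python
from datetime import date
from typing import Iterator, Tuple

# Base facts, identified by a fact id so that meta facts can refer to them.
facts = {
    "fact#1": ("Goethe", "located_in", "Dresden"),
    "fact#2": ("Schiller", "located_in", "Dresden"),  # hypothetical fact
}

# Meta facts (facts about facts).
meta_facts = [
    ("fact#1", "hasStartDate", date(1794, 8, 3)),
    ("fact#1", "hasEndDate", date(1794, 8, 11)),
    ("fact#1", "hasPurpose", "Visit"),
    ("fact#2", "hasStartDate", date(1794, 8, 10)),  # invented date
    ("fact#2", "hasEndDate", date(1794, 8, 20)),    # invented date
]


def stays(person: str, place: str) -> Iterator[Tuple[date, date]]:
    """Yield (start, end) of every recorded stay of person at place."""
    for fact_id, fact in facts.items():
        if fact == (person, "located_in", place):
            props = {p: v for f, p, v in meta_facts if f == fact_id}
            yield props["hasStartDate"], props["hasEndDate"]


def overlapped(a: str, b: str, place: str) -> bool:
    """True if some stay of a and some stay of b at place overlap in time."""
    return any(
        start_a <= end_b and start_b <= end_a
        for start_a, end_a in stays(a, place)
        for start_b, end_b in stays(b, place)
    )


print(overlapped("Goethe", "Schiller", "Dresden"))
```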

Extensions towards other languages

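  • Current work in progress:

    • Use part-of-speech tagger (ParZu)

    • Derive patterns from parse trees

  • Structure of non-English expressions different: Einstein wurde in Ulm geboren. („Einstein was born in Ulm.“)

[Figure: parse tree of the sentence. Lemmas: Einstein werden in Ulm gebären. POS tags: NE VAFIN APPR NE VVPP. Dependency labels: subj, pp. X = Einstein, Y = Ulm]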

Semantic Search for Facts

Find locations where Goethe was

SPARQL pattern:

„Goethe located_in ?x“

Result list:

„Goethe located_in Frankfurt“

„Goethe located_in Dresden“

„Goethe located_in Straßburg“

Easy to add (some) constraints:

Only locations in Germany

Locations where also Schiller was

Locations with a hotel (from our list)

Locations with at least 1000 tourists per year (from OpenGov)

This is a more likely result list…

SPARQL pattern:

„Goethe located_in ?x“

5 out of 5175 results:

„Goethe located_in Sachsenhausen“

„Goethe located_in Offenbach“

„Goethe located_in Kleinmachnow“

„Goethe located_in Holzhausen“

„Goethe located_in Straßburg“

Solution: ranking of results (just as in document search)

based on number & importance of source documents
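
A minimal sketch of such a ranking, assuming each fact is linked to the documents it was extracted from and each document has an importance weight (for example an authority score). The documents, weights and the additive scoring are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical (fact, source document) pairs.
witnesses = [
    (("Goethe", "located_in", "Frankfurt"), "docA"),
    (("Goethe", "located_in", "Frankfurt"), "docB"),
    (("Goethe", "located_in", "Frankfurt"), "docC"),
    (("Goethe", "located_in", "Straßburg"), "docA"),
    (("Goethe", "located_in", "Straßburg"), "docD"),
    (("Goethe", "located_in", "Sachsenhausen"), "docD"),
]

# Hypothetical importance weight per document.
importance = {"docA": 0.9, "docB": 0.8, "docC": 0.5, "docD": 0.1}


def rank_facts(witnesses, importance, default=0.1):
    """Score each fact by the summed importance of its source documents."""
    scores = defaultdict(float)
    for fact, doc in witnesses:
        scores[fact] += importance.get(doc, default)
    # Highest score first; ties are broken by the fact itself.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


for fact, score in rank_facts(witnesses, importance):
    print(f"{score:.2f}", *fact)
```

Summing weights rewards both many witnesses and important witnesses, so a fact found once in an obscure document ends up at the bottom of the list.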


Why ranking is essential

Queries often have a huge number of results:

scientists from Germany

conferences in Berlin

publications in databases

actors from the U.S.

Ranking as integral part of search

Huge number of app-specific ranking methods: paper/citation count, impact, salary, …

Need for generic ranking


LMs: From Documents to Facts

[CIKM 2009]

Document LMs

LM for document: probability distribution of words

LM for query: (probability distribution of) words

LMs: rich for documents, super-sparse for queries

Triple LMs

LM for facts: (degenerate probability distribution of) triple

LM for queries: (degenerate probability distribution of) triple pattern

LMs: apples and oranges

  • expand query variables by S, P, O values from ontology

  • enhance with witness statistics

  • query LM is then a probability distribution of triples!


LMs for Triples and Triple Patterns

triples (facts f) with witness statistics (number of witnesses per fact, Σ: 2600; p = playedFor, c = coached):

f1: Beckham p ManchesterU 200

f2: Beckham p RealMadrid 300

f3: Beckham p LAGalaxy 20

f4: Beckham p ACMilan 30

f5: Kaka p ACMilan 300

f6: Kaka p RealMadrid 150

f7: Zidane p ASCannes 20

f8: Cruyff p FCBarca 200

f9: Zidane p RealMadrid 350

f10: Tidjani p ASCannes 10

f11: Messi p FCBarcelona 400

f12: Henry p Arsenal 200

f13: Henry p FCBarcelona 150

f14: Ribery p BayernMunich 100

f15: Drogba p Chelsea 150

f16: Cruyff c FCBarca 20

triple patterns (queries q), expanded by matching facts and weighted by witnesses:

q: Beckham p ?y

q: Beckham p Real 300/550

q: Beckham p ManU 200/550

q: Beckham p Milan 30/550

q: Beckham p Galaxy 20/550

q: ?x p ASCannes

Zidane p ASCannes 20/30

Tidjani p ASCannes 10/30

q: ?x p ?y

Messi p FCBarcelona 400/2580

Zidane p RealMadrid 350/2580

Kaka p ACMilan 300/2580

q: Cruyff ?r FCBarca

Cruyff playedFor FCBarca 200/220

Cruyff coached FCBarca 20/220

LM(q): {t → P[t | t matches q] ~ #witnesses(t)}

LM(answer f): {t → P[t | t matches f] ~ 1 for f}

smooth & rank results by ascending KL(LM(q)|LM(f))
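
The witness-based query LM and the KL ranking can be sketched as follows. The witness counts are the ones from the slide; the smoothing is an assumption, since the slide only says "smooth": here the degenerate answer LM is mixed with a uniform distribution over the matching triples using a made-up weight `lam`. Function names are hypothetical.

```python
import math

# Witness counts per fact (p = playedFor, c = coached).
WITNESSES = {
    ("Beckham", "p", "ManchesterU"): 200,
    ("Beckham", "p", "RealMadrid"): 300,
    ("Beckham", "p", "LAGalaxy"): 20,
    ("Beckham", "p", "ACMilan"): 30,
    ("Kaka", "p", "ACMilan"): 300,
    ("Kaka", "p", "RealMadrid"): 150,
    ("Zidane", "p", "ASCannes"): 20,
    ("Cruyff", "p", "FCBarca"): 200,
    ("Zidane", "p", "RealMadrid"): 350,
    ("Tidjani", "p", "ASCannes"): 10,
    ("Messi", "p", "FCBarcelona"): 400,
    ("Henry", "p", "Arsenal"): 200,
    ("Henry", "p", "FCBarcelona"): 150,
    ("Ribery", "p", "BayernMunich"): 100,
    ("Drogba", "p", "Chelsea"): 150,
    ("Cruyff", "c", "FCBarca"): 20,
}


def matches(pattern, triple):
    """A term matches if it is a variable (?name) or equals the triple's term."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))


def query_lm(pattern):
    """LM(q): P[t | t matches q] proportional to #witnesses(t)."""
    counts = {t: w for t, w in WITNESSES.items() if matches(pattern, t)}
    total = sum(counts.values())
    return {t: w / total for t, w in counts.items()}


def answer_lm(fact, candidates, lam=0.1):
    """LM(answer f): all mass on f, smoothed with a uniform background."""
    n = len(candidates)
    return {t: (1 - lam) * (t == fact) + lam / n for t in candidates}


def kl(p, q):
    """Kullback-Leibler divergence KL(p | q) over the support of p."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


def rank(pattern, lam=0.1):
    """Return matching facts by ascending KL(LM(q) | LM(f))."""
    lm_q = query_lm(pattern)
    candidates = list(lm_q)
    scored = [(kl(lm_q, answer_lm(f, candidates, lam)), f) for f in candidates]
    return [f for _, f in sorted(scored)]


for fact in rank(("Beckham", "p", "?y")):
    print(*fact)
```

With this particular smoothing, ascending KL divergence orders the answers by descending witness probability, so the best-witnessed fact comes first.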


LMs for Composite Queries

q: Select ?x,?c Where {?x bornIn France . ?x playsFor ?c . ?c in UK . }

queries q with patterns q1 … qn

results are n-tuples of triples t1 … tn

LM(q): P[q1…qn] = ∏i P[qi]

LM(answer): P[t1…tn] = ∏i P[ti]

KL(LM(q)|LM(answer)) = Σi KL(LM(qi)|LM(ti))

Example results:

P [ Henry bI F, Henry p Arsenal, Arsenal in UK ]

P [ Drogba bI F, Drogba p Chelsea, Chelsea in UK ]

f1: Beckham p ManU 200

f7: Zidane p ASCannes 20

f8: Zidane p Juventus 200

f9: Zidane p RealMadrid 300

f10: Tidjani p ASCannes 10

f12: Henry p Arsenal 200

f13: Henry p FCBarca 150

f14: Ribery p Bayern 100

f15: Drogba p Chelsea 150

f21: Zidane bI F 200

f22: Tidjani bI F 20

f23: Henry bI F 200

f24: Ribery bI F 200

f25: Drogba bI F 30

f26: Drogba bI IC 100

f27: Zidane bI ALG 50

f31: ManU in UK 200

f32: Arsenal in UK 160

f33: Chelsea in UK 140
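
A sketch of how the independence assumption P[t1…tn] = ∏i P[ti] could be used to rank join results, using the witness counts listed above. It ranks by the unsmoothed product of per-pattern witness probabilities rather than by the KL divergence, which is a simplification. The expansions bI = bornIn, p = playsFor, F = France, IC = IvoryCoast and ALG = Algeria are my reading of the abbreviations; helper names are hypothetical.

```python
# Witness counts per fact, with abbreviations expanded (an assumption).
FACTS = {
    ("Beckham", "playsFor", "ManU"): 200,
    ("Zidane", "playsFor", "ASCannes"): 20,
    ("Zidane", "playsFor", "Juventus"): 200,
    ("Zidane", "playsFor", "RealMadrid"): 300,
    ("Tidjani", "playsFor", "ASCannes"): 10,
    ("Henry", "playsFor", "Arsenal"): 200,
    ("Henry", "playsFor", "FCBarca"): 150,
    ("Ribery", "playsFor", "Bayern"): 100,
    ("Drogba", "playsFor", "Chelsea"): 150,
    ("Zidane", "bornIn", "France"): 200,
    ("Tidjani", "bornIn", "France"): 20,
    ("Henry", "bornIn", "France"): 200,
    ("Ribery", "bornIn", "France"): 200,
    ("Drogba", "bornIn", "France"): 30,
    ("Drogba", "bornIn", "IvoryCoast"): 100,
    ("Zidane", "bornIn", "Algeria"): 50,
    ("ManU", "in", "UK"): 200,
    ("Arsenal", "in", "UK"): 160,
    ("Chelsea", "in", "UK"): 140,
}


def bind(pattern, triple, binding):
    """Return `binding` extended so that pattern equals triple, else None."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def pattern_lm(pattern):
    """P[t | t matches pattern], proportional to witness counts."""
    counts = {t: w for t, w in FACTS.items() if bind(pattern, t, {}) is not None}
    total = sum(counts.values())
    return {t: w / total for t, w in counts.items()}


def answers(patterns):
    """Return (binding, triples, probability) per join result, best first."""
    partial = [({}, (), 1.0)]
    for pattern in patterns:
        lm = pattern_lm(pattern)
        extended = []
        for binding, triples, prob in partial:
            for triple, p_triple in lm.items():
                new_binding = bind(pattern, triple, binding)
                if new_binding is not None:
                    extended.append((new_binding, triples + (triple,), prob * p_triple))
        partial = extended
    return sorted(partial, key=lambda result: -result[2])


QUERY = [("?x", "bornIn", "France"), ("?x", "playsFor", "?c"), ("?c", "in", "UK")]
for binding, triples, prob in answers(QUERY):
    print(binding["?x"], binding["?c"], f"{prob:.5f}")
```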

Semantic Knowledge Bases from Web Sources

Harvesting Knowledge from Web Data


But: Not everything can be extracted

SELECT ?a ?m WHERE { ?a actedIn ?m [western duel sunset] . }

  • triple pattern „?a actedIn ?m“: evaluated on RDF

  • keyword condition [western duel sunset]: evaluated on source documents of triples

  • Training facts are naturally limited

  • Predicates in a knowledge base are limited

    How can we query then?

Solution: SPARQL FullText

Combine structured SPARQL with keyword search

  • Assign meta document to each fact (content of all documents from which it was extracted)

  • Rank matches of triple pattern by IR score of keywords on meta documents
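
The two steps above can be sketched as follows. The facts and meta documents are hypothetical, and the plain term-frequency score is only a stand-in for whatever IR model the actual system uses; `ir_score` and `fulltext_query` are made-up names.

```python
from collections import Counter

# Hypothetical facts, each with a meta document: the concatenated text of all
# documents from which the fact was extracted.
META_DOCS = {
    ("ActorA", "actedIn", "MovieA"):
        "western town duel at sunset the duel ends the western",
    ("ActorB", "actedIn", "MovieB"):
        "romantic comedy set in paris",
    ("ActorC", "actedIn", "MovieC"):
        "a western about cattle drives",
}


def ir_score(keywords, text):
    """Length-normalized term frequency of the keywords (a toy IR score)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[k] for k in keywords) / len(tokens)


def fulltext_query(predicate, keywords):
    """Match the pattern '?a predicate ?m', then rank by keyword score."""
    matches = [fact for fact in META_DOCS if fact[1] == predicate]
    scored = [(ir_score(keywords, META_DOCS[fact]), fact) for fact in matches]
    return [fact for score, fact in sorted(scored, reverse=True) if score > 0]


for actor, _, movie in fulltext_query("actedIn", ["western", "duel", "sunset"]):
    print(actor, movie)
```

The structured part decides which facts qualify at all; the keywords only reorder (and here filter) those facts, so the result is still a list of bindings for ?a and ?m.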

[CIKM 2009]


Facts do not Come with Background

  • Provenance / Explanation

    Why should that fact be true? Who stated that?

  • Heterogeneous interpretation

    Is an in-transit stay an instance of located_in?

    Can comic figures be subjects of located_in?

  • Level of detail modeled in the ontology:

    How did he get there? Did he like the trip?

    Were there any special incidents during the trip?

  • Information not easy to extract/represent as RDF

    Time, sequences, reasons, opinions, …

Only source documents contain all important information


Back to the Roots, then…

Query: „Goethe located_in ?x“

3 out of 5175 results, each with a ranked list of source documents:

„Goethe located_in Frankfurt“ (50 documents)

„Goethe located_in Dresden“ (46 documents)

„Goethe located_in Straßburg“ (42 documents)

[Figure: result facts linked to their source documents Doc A, Doc B, Doc C]

How should we rank?


Criteria for ranking

  • Coverage – how many of the facts are covered in the document?

  • Persuasiveness – how convincing are the sources?

    • Authority of the document

    • Confidence in the extraction pattern

  • On-Topicness – is the focus of the document on the topic expressed by the facts?

[VLDB 2010, CIKM 2011]
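
To make the three criteria concrete, here is a deliberately simple sketch. It is not the scoring model of the cited papers: the formulas for each criterion, the equal weights and all data are assumptions chosen only to show how the criteria could be computed and combined.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    authority: float  # assumed to lie in [0, 1]
    # fact -> confidence of the extraction pattern that produced it, in [0, 1]
    extracted: dict = field(default_factory=dict)


def score(doc, query_facts, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine coverage, persuasiveness and on-topicness (illustrative only)."""
    covered = [f for f in query_facts if f in doc.extracted]
    if not covered:
        return 0.0
    # Coverage: share of the query's facts found in the document.
    coverage = len(covered) / len(query_facts)
    # Persuasiveness: document authority times mean pattern confidence.
    confidence = sum(doc.extracted[f] for f in covered) / len(covered)
    persuasiveness = doc.authority * confidence
    # On-topicness: share of the document's facts that belong to the query.
    on_topicness = len(covered) / len(doc.extracted)
    w_cov, w_per, w_top = weights
    return w_cov * coverage + w_per * persuasiveness + w_top * on_topicness


frankfurt = ("Goethe", "located_in", "Frankfurt")
dresden = ("Goethe", "located_in", "Dresden")

docs = [
    Document("Doc A", 0.8, {frankfurt: 0.9, dresden: 0.7}),
    Document("Doc B", 0.9, {dresden: 0.9,
                            ("Schiller", "located_in", "Jena"): 0.8,
                            ("Schiller", "located_in", "Weimar"): 0.8}),
]

for doc in sorted(docs, key=lambda d: -score(d, [frankfurt, dresden])):
    print(doc.name, round(score(doc, [frankfurt, dresden]), 3))
```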


S3K: Searching Statement Witnesses

HeathLedger actedIn ?Movie

1) HeathLedger actedIn TheDarkKnight

2) HeathLedger actedIn ThePatriot


Additional Use: Supporting Statements

[CIKM 2011]

Find „proof“ in corpus for factual statement

  • „Obama was born in Kenya“

  • Use extractor to convert to fact: Obama born_in Kenya

  • Use patterns to generate textual representations:

    Obama’s birthplace Kenya, Barack’s Kenyan origins, …

  • Search phrases, rank with S3K

Refuting statements is a lot more difficult
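
The pattern step can be sketched as below: a fact is turned into candidate phrases, and documents containing any phrase become witnesses. The templates and the corpus are hypothetical; a real system would use the patterns learned by its extractor and would rank the witnesses with S3K, which is not shown here.

```python
# Hypothetical textual patterns per relation.
PATTERNS = {
    "born_in": [
        "{s} was born in {o}",
        "{s}'s birthplace {o}",
        "{s}, born in {o}",
    ],
}


def phrases(fact):
    """Generate textual representations of a fact from the patterns."""
    subject, predicate, obj = fact
    return [t.format(s=subject, o=obj) for t in PATTERNS.get(predicate, [])]


def witnesses(fact, corpus):
    """Return names of documents that contain at least one phrase."""
    candidates = [p.lower() for p in phrases(fact)]
    return sorted(
        name for name, text in corpus.items()
        if any(phrase in text.lower() for phrase in candidates)
    )


# Hypothetical corpus for illustration.
corpus = {
    "doc1": "Some blogs claim that Obama was born in Kenya.",
    "doc2": "Obama was born in Hawaii.",
}

print(phrases(("Obama", "born_in", "Kenya")))
print(witnesses(("Obama", "born_in", "Kenya"), corpus))
```

The example also hints at why refutation is harder: doc2 contradicts the statement, but phrase search for the statement itself never finds it.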


Semantic Similarity of Documents

  • Finding similar texts important (plagiarism, related news, but also similar structure, style, …)

  • Content-based methods limited, clever plagiarists rephrase

  • Use fact-based similarity

    • Extract facts from both documents

    • Check fact overlap and order

    • Maximal matching in bipartite graph

  • Extensions towards similar facts / fact groups, classification of text types, …

Initial results: better than content-based similarity (with ideal facts)
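
A sketch of the matching step, assuming facts have already been extracted from both documents. It computes a maximum-cardinality bipartite matching with augmenting paths, which is one reading of "maximal matching"; the notion of similar facts (at least two equal components) and the normalization are assumptions, and fact order is ignored.

```python
def similar(f, g):
    """Assumed similarity: two facts agree in at least two of S, P, O."""
    return sum(a == b for a, b in zip(f, g)) >= 2


def max_matching(left, right):
    """Size of a maximum matching between similar facts of two documents."""
    match_of_right = {}  # index in `right` -> index in `left`

    def try_assign(i, seen):
        # Try to match left[i], re-routing earlier matches if necessary.
        for j, g in enumerate(right):
            if j not in seen and similar(left[i], g):
                seen.add(j)
                if j not in match_of_right or try_assign(match_of_right[j], seen):
                    match_of_right[j] = i
                    return True
        return False

    return sum(try_assign(i, set()) for i in range(len(left)))


def fact_similarity(facts_a, facts_b):
    """Matched facts relative to the larger fact set, in [0, 1]."""
    if not facts_a or not facts_b:
        return 0.0
    return max_matching(facts_a, facts_b) / max(len(facts_a), len(facts_b))


doc_a = [
    ("Goethe", "located_in", "Dresden"),
    ("Schiller", "located_in", "Dresden"),
]
doc_b = [
    ("Goethe", "located_in", "Dresden"),
    ("Goethe", "visited", "Dresden"),      # rephrased variant
    ("Goethe", "born_in", "Frankfurt"),
]

print(max_matching(doc_a, doc_b), fact_similarity(doc_a, doc_b))
```

In this example a greedy assignment would pair the first fact of doc_a with the identical fact in doc_b and leave the Schiller fact unmatched; the augmenting path moves the first fact to the rephrased variant so that both facts are matched.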


Outline of the Talk

  • Introduction

  • Querying Federations of Knowledge Bases

  • Building and Querying Distributed RDF Stores

  • Information Extraction and SPARQL extensions

  • Cooperative Knowledge Services


Most of Today’s Ontologies are Black Boxes

[Diagram of a typical ontology pipeline with the components: Source documents, NLP Tools, NER, WSD, Information Extraction, Fact Candidates, Base Facts, Rules, Schema, Type System, Reasoner, SPARQL, results]

Often highly customized and tuned

Impossible to replace components

Impossible to replace sources

No provenance, quality, freshness info


Collaborative Knowledge Services

  • From monolithic knowledge bases to collaboration of specialized services

  • Different classes of services:

    • Storage services (RDF stores, relational wrappers, Web services)

    • Creation services (Information extraction, crowd sourcing)

    • Reasoning services

    • Query and aggregation services

[[email protected]]


On-the-fly Combination of Services

Query: q={?movie rdf:type imdb:horror. HL_Cinema hl:shows ?movie}

rule: ?show hasMovie ?movie ∧ ?show happensAt ?cinema ⇒ ?cinema hl:shows ?movie

[Diagram: the query is split into sub-queries q1 and q2; components shown: Reasoner, Aggregation (twice), IMDB, MovieDB, www.hl-cinema.de, Extractor, MyCines]
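
A minimal sketch of what the reasoning step could do, assuming the rule derives hl:shows facts from extracted show data before the query is answered. All facts are hypothetical stand-ins for what an extractor and a movie database might deliver; the function names are made up.

```python
# Hypothetical facts: show data as an extractor might deliver it, plus
# genre facts as a movie database might deliver them.
facts = {
    ("show1", "hasMovie", "MovieX"),
    ("show1", "happensAt", "HL_Cinema"),
    ("show2", "hasMovie", "MovieY"),
    ("show2", "happensAt", "HL_Cinema"),
    ("MovieX", "rdf:type", "imdb:horror"),
    ("MovieY", "rdf:type", "imdb:comedy"),
}


def apply_shows_rule(facts):
    """?show hasMovie ?movie, ?show happensAt ?cinema => ?cinema hl:shows ?movie."""
    derived = {
        (cinema, "hl:shows", movie)
        for show, p1, movie in facts if p1 == "hasMovie"
        for show2, p2, cinema in facts if p2 == "happensAt" and show2 == show
    }
    return facts | derived


def horror_movies_at(cinema, facts):
    """Answer q: movies of type imdb:horror that the cinema shows."""
    return sorted(
        movie for subject, predicate, movie in facts
        if subject == cinema and predicate == "hl:shows"
        and (movie, "rdf:type", "imdb:horror") in facts
    )


print(horror_movies_at("HL_Cinema", apply_shows_rule(facts)))
```

Without the rule the query has no answer, because no source states hl:shows directly.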


Core Challenges of this Framework

  • Per-query on-the-fly combination of services

  • Description of service properties (functional and domain)

  • Service quality and cost

  • Uncertain and contradicting facts

  • Trust and authority of services

  • Provenance of results

  • Feedback and exchange of information

  • Query routing, optimization, processing


Summary

  • The fast-growing volume of semantic data is a current big challenge for data management

  • Distributed solutions required to cope with size and query performance requirements

  • Federated solutions required for querying Linked Open Data

  • Interfaces beyond pure SPARQL required

  • Next Big Thing: Collaborative Knowledge Services
