Interoperation between Data Sources on the Web

Interoperation between Data Sources on the Web • Andreas Harth, DERI Galway • Joint work with Stefan Decker, Axel Polleres, Hannes Gassert, Aidan Hogan, Matteo Magni, Cristina Feier • andreas.harth@deri.org • Modena, December 2005

Introduction "If HTML and the Web made all the online documents look like one huge book, RDF, schema, and inference languages will make all the data in the world look like one huge database." Tim Berners-Lee, Weaving the Web, 1999

Data and the Web • The current Web is used as the medium to exchange documents • The Semantic Web aims to enable people and machines to exchange data

Research Question • How to link data sources on the Web to arrive at a distributed, self-organized system comprising a large number of autonomous, interoperable data sources? • How to utilize these links to efficiently interoperate (share, exchange, and integrate data) between the related data sources?

Scenario – DB/Web Research Community Photos of Authors Publications of People Abstracts and Citation Information for Publications

Approach • Basis is an efficient RDF indexing and query answering infrastructure (the RDF database YARS) • Adding inference mechanisms known from Logic Programming and Deductive Databases (recursive datalog with negation, datalog¬) • Adapting and extending these technologies and methods to the distributed Web environment • Of concern: scalability, scalability, scalability

Scope – What I do not try to do • No syntax conversion: assume RDF as the data model; legacy sources can be wrapped • Only recursive datalog with (scoped) negation: a “safe” subset with nice complexity and known efficient algorithms (RDBMS) • No equality reasoning: unique name assumption (UNA) • No typical P2P file-sharing scenario where peers join and leave randomly: assume a relatively stable, self-organizing network where people link resources, similar to the HTML Web

Outline • Introduction • Background • RDF/Notation3 • Context • Semantics for N3 • Query Processing • Implementation • Ontology Languages • Conclusion

Revised Semantic Web Layer Cake This talk http://www.w3.org/2005/Talks/0511-keynote-tbl/#[17]

Resource Description Framework (RDF) Basics • Let U be the set of all URIs • Let B be the set of all blank nodes • Let L be the set of all literals • The RDF data model consists of triples (subject, predicate, object), which are members of the set (U ∪ B ∪ L) × U × (U ∪ B ∪ L) • RDF example: @prefix ex: <http://example.org/> . ex:pat ex:knows ex:jo . • LP-style notation (assuming namespace ex): triple(ex:pat, ex:knows, ex:jo) .

Blank Nodes • Blank nodes are unnamed • Example: there exists something with the first name "Max". • RDF: _:a foaf:firstName "Max" . • Blank nodes are treated as existentially quantified variables: ∃X triple(X, foaf:firstName, "Max") . • A Skolem function is used to replace each existentially quantified variable with a unique constant
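The Skolemization step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are plain tuples of strings, blank nodes are recognized by the `_:` prefix, and the `skolem:` prefix and the `skolemize` helper are hypothetical names, not part of YARS.

```python
from itertools import count


def skolemize(triples, prefix="skolem:"):
    """Replace every blank node (a term starting with '_:') by a fresh constant.

    The same blank node label always maps to the same constant, so
    co-references inside the graph are preserved.
    """
    fresh = count(1)
    mapping = {}  # blank node label -> Skolem constant

    def term(t):
        if t.startswith("_:"):
            if t not in mapping:
                mapping[t] = f"{prefix}{next(fresh)}"
            return mapping[t]
        return t

    return [tuple(term(t) for t in triple) for triple in triples]


# Hypothetical input graph with two blank nodes.
graph = [
    ("_:a", "foaf:firstName", '"Max"'),
    ("_:a", "foaf:knows", "_:b"),
]
skolemized = skolemize(graph)
```

After the call, both occurrences of `_:a` carry the same constant, while `_:b` receives a different one.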

RDF Vocabulary Description: RDFS • RDFS extends the RDF vocabulary with object-oriented modelling features • Classes: rdfs:Resource, rdfs:Class, rdfs:Literal, rdfs:Datatype, rdfs:Container, rdfs:ContainerMembershipProperty • Properties: rdfs:domain, rdfs:range, rdfs:subPropertyOf, rdfs:subClassOf, rdfs:member, rdfs:seeAlso, rdfs:isDefinedBy, rdfs:comment, rdfs:label • This slide is adapted from Jos de Bruijn

Sample Vocabulary: FOAF • “The Friend of a Friend (FOAF) project is about creating a Web of machine-readable homepages describing people, the links between them and the things they create and do.” [1] • Example: @prefix foaf: <http://xmlns.com/foaf/0.1/> . @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix : <http://www.harth.org/andreas/#> . :me rdf:type foaf:Person . :me foaf:name "Andreas Harth" . :me foaf:homepage <http://sw.deri.org/~aharth/> . :me foaf:workplaceHomepage <http://www.deri.org/> . :me foaf:knows <http://decker.cn/stefan/> . • [1] http://www.foaf-project.org/

Notation3 • N3 is an extension to RDF which allows variables and grouping of graphs • Variables (set of variables: V) are denoted using a question mark ? • A triple pattern is a member of the set (U ∪ B ∪ L ∪ V) × (U ∪ V) × (U ∪ B ∪ L ∪ V) • Subgraphs (sets of statements) can become the subject or object of another statement using braces {}

N3 Query Language • N3QL introduces ql:select and ql:where predicates to express queries • Query consists of ql:where clause with triple patterns • Optional ql:select clause containing triple patterns determines the format of the result set • N3QL has closure: ability to compose multiple queries • Closure is achieved by imposing a syntactical restriction: query results are always triples

N3QL Example • Get all triples whose subject is :me (<> denotes the current document): @prefix : <http://www.harth.org/andreas/#> . <> ql:select { :me ?p ?o . }; ql:where { :me ?p ?o . } . • LP-style: triple(:me, P, O) :- triple(:me, P, O).

N3 Rules • The log:implies predicate can be used to encode implications (⇒) in N3 syntax • An N3 implication is a formula of the form {B1…Bn} log:implies { H } . where B1…Bn and H are triple patterns • An implication is safe if all variables occurring in the head also occur in the body • An N3 program is a finite set of N3 facts and implications

N3 Implication Example • Example: for every triple with a foaf:name predicate infer a triple with rdfs:label predicate with the same subject and object. • N3: { ?s foaf:name ?o . } log:implies { ?s rdfs:label ?o . } . • LP: triple(S, rdfs:label, O) :- triple(S, foaf:name, O) .
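As an illustration of what applying this single implication means, here is a minimal Python sketch. It assumes triples are tuples of strings and hard-codes the one rule from the slide; `apply_name_to_label` is a hypothetical helper, not a YARS function.

```python
def apply_name_to_label(triples):
    """Apply: triple(S, rdfs:label, O) :- triple(S, foaf:name, O).

    Returns the input facts together with every inferred rdfs:label triple.
    """
    facts = set(triples)
    inferred = {(s, "rdfs:label", o) for (s, p, o) in facts if p == "foaf:name"}
    return facts | inferred


# Facts taken from the FOAF example earlier in the talk.
facts = {
    (":me", "foaf:name", '"Andreas Harth"'),
    (":me", "rdf:type", "foaf:Person"),
}
closure = apply_name_to_label(facts)
```

The rule is not recursive, so a single pass is enough here; recursive programs would need to repeat the step until no new triples appear.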

Why Context? • The Web is distributed; there is not only one knowledge base, but many • Need to track the provenance of statements in a data integration scenario • A context is the set of rules and facts in a finite set of files on the Web

Defining Context… • A context c ∈ C is a member of the set U ∪ B • A context denotes a set of RDF statements and N3 implications • Each rule or fact belongs to at least one context • Local context: relative to the base URI of the database (e.g. /context1) • Remote context: absolute URI (http://…) • Basic operations on a context: • Tell • Ask • Retract

Example RDF with Context

Scoped Triple Pattern • The yars:context predicate denotes that the subgraph quoted as the subject occurs in the context given as the object • Informally, a triple pattern that references an external context represents a link to another program accessible on the Web • E.g. triple pattern for all triples in a given context: { ?s ?p ?o . } yars:context ex:foaf . • LP: triple(S,P,O)@ex:foaf . • If no context is specified, the triple pattern is evaluated over all currently known contexts

Semantics of Rule with Context • A variable assignment (or valuation) is a function val from V to ⋃ i=1..n (U_ci ∪ B_ci ∪ L_ci) • Let q be a rule of the form {B1…Bn} log:implies { H } . and C1…Cn the contexts associated with B1…Bn • The image of C1…Cn under q is q(C1…Cn) = { val(H) | val is a valuation over var(q) and each triple in val(Bi) ∈ Ci for each i ∈ [1,n] }.
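The definition of the image of a rule can be made concrete with a naive evaluator. The sketch below is an assumption-laden illustration, not the YARS implementation: contexts are a dictionary from context name to a set of triples, variables are strings starting with `?`, and the sample data (the `en:harth` and `eo:me` resources) are invented for the example. It enumerates all valuations that map every body pattern into its context and instantiates the head with them.

```python
def is_var(term):
    return term.startswith("?")


def match(pattern, triple, val):
    """Extend valuation `val` so that `pattern` maps onto `triple`, else None."""
    val = dict(val)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if val.setdefault(p, t) != t:  # variable already bound differently
                return None
        elif p != t:                        # constant mismatch
            return None
    return val


def image(body, head, contexts):
    """Compute q(C1...Cn) for a rule with scoped body patterns.

    body:     list of (triple_pattern, context_name)
    head:     triple pattern (assumed safe: its variables occur in the body)
    contexts: dict mapping context_name -> set of triples
    """
    valuations = [{}]
    for pattern, ctx in body:
        valuations = [
            extended
            for val in valuations
            for triple in contexts[ctx]
            if (extended := match(pattern, triple, val)) is not None
        ]
    return {tuple(val.get(t, t) for t in head) for val in valuations}


# Hypothetical data for a "photos of authors" style rule.
contexts = {
    "en:dblp": {("en:harth", "foaf:name", '"Andreas Harth"')},
    "eo:foaf": {
        ("eo:me", "foaf:name", '"Andreas Harth"'),
        ("eo:me", "foaf:depiction", "eo:photo.jpg"),
    },
}
body = [
    (("?x", "foaf:name", "?n"), "en:dblp"),
    (("?y", "foaf:name", "?n"), "eo:foaf"),
    (("?y", "foaf:depiction", "?p"), "eo:foaf"),
]
head = ("?x", "foaf:depiction", "?p")
result = image(body, head, contexts)
```

This nested-loop enumeration is exponential in the worst case; it only mirrors the set-builder definition and ignores indexes and join ordering.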

Intensional vs. Extensional Database • A query is basically a rule (select = head, where = body) • From now on, we only consider rules • Rules are themselves triples (useful for e.g. querying rules) • Stored triples belong to the edb • Inferred triples (those in the head of a rule) belong to the idb • With only one rule, the idb is the query answer (and we look at only one rule for a while…)

Monotonicity • Assume rule q, contexts I ⊆ J • Then q(I) ⊆ q(J) • Monotonicity is a desirable feature: the knowledge base can only grow; when adding new facts, no known facts have to be retracted • Negation adds non-monotonicity (tricky)

Recap: Negation as Failure • Traditional negation as failure (naf): assume a statement to be false if the statement is not in the database (closed-world assumption) • Not suitable for the “open” Web with notoriously incomplete information, where we can never assume our database is complete

Scoped Negation • Proposal: naf with respect to a context • Desired feature: monotonicity with respect to the addition of contexts • Assume rule q, contexts c1…cn • Context cn+1 is added: {c1…cn} ⊆ {c1…cn+1} • Then q over c1 ∪ c2 ∪ … ∪ cn ⊆ q over c1 ∪ c2 ∪ … ∪ cn+1

Queries with Scoped Negation • A triple pattern that is negatively referenced in a rule has to be scoped (vs. open triple pattern) • E.g. give me all movies which are not rated as crap by imdb: • LP: triple(X, rdf:type, ex:Movie) :- triple(X, rdf:type, ex:Movie), not triple(X, ex:rating, ex:crap)@imdb:rate . • N3: { ?x rdf:type ex:Movie . {?x ex:rating ex:crap .} :notInContext imdb:rate . } log:implies { ?x rdf:type ex:Movie . } .
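A small Python sketch can show why scoping the negated pattern keeps the query monotone when further contexts are added. The data and the `unrated_movies` helper are hypothetical; the function returns only the matching subjects rather than full head triples, and the positive pattern is evaluated over all known contexts, as the talk describes for unscoped patterns.

```python
def unrated_movies(contexts, negation_scope="imdb:rate"):
    """Movies (from any context) not rated ex:crap inside `negation_scope`."""
    all_triples = set().union(*contexts.values())
    # The negated pattern only looks at the one named context.
    rated_crap = {
        s
        for (s, p, o) in contexts.get(negation_scope, set())
        if p == "ex:rating" and o == "ex:crap"
    }
    return {
        s
        for (s, p, o) in all_triples
        if p == "rdf:type" and o == "ex:Movie" and s not in rated_crap
    }


contexts = {
    "ex:movies": {
        ("ex:m1", "rdf:type", "ex:Movie"),
        ("ex:m2", "rdf:type", "ex:Movie"),
    },
    "imdb:rate": {("ex:m2", "ex:rating", "ex:crap")},
}
before = unrated_movies(contexts)

# A new context is added; it even rates ex:m1 as crap, but it is outside
# the scope of the negation, so no earlier answer has to be retracted.
contexts["ex:blog"] = {
    ("ex:m1", "ex:rating", "ex:crap"),
    ("ex:m3", "rdf:type", "ex:Movie"),
}
after = unrated_movies(contexts)
```

Note that this monotonicity holds for adding new contexts; adding triples to the scoped context itself (here `imdb:rate`) can still invalidate answers.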

Outline • Introduction • Background • Query Processing • Relational Operators • Indexes • Distribution of Indexes • Query Plans • Implementation • Ontology Languages • Conclusion

Relational Operators • Conjunctive queries (where clause): select, project, join (SPJ) • Remote select for remote contexts • For result templates (select clause): construction operators • Union • For negation: set difference • Bells and whistles: sorting, aggregation, datatypes not considered here

Indexes • We introduce indexes to allow for fast lookup operations (selection) • Two sets of indexes: Lexicon and Quad Index • Lexicon: stores mappings from literal values and resources to object IDs (OIDs) and vice versa • oid → node • node → oid • Quad Index: stores quads (s, p, o, c)

Object Identifiers • OIDs help to save space (only need to store 64 bit OID instead of entire resource/literal all the time)
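The two lexicon mappings can be sketched as follows. This is an in-memory illustration under assumptions: OIDs are assigned sequentially, Python integers stand in for the 64-bit OIDs, and the `Lexicon` class and the sample quad are hypothetical.

```python
class Lexicon:
    """Bidirectional mapping between nodes (resources/literals) and OIDs."""

    def __init__(self):
        self._node_to_oid = {}   # node -> oid
        self._oid_to_node = []   # oid -> node (list index is the OID)

    def oid(self, node):
        """Return the OID for `node`, assigning a new one on first use."""
        if node not in self._node_to_oid:
            self._node_to_oid[node] = len(self._oid_to_node)
            self._oid_to_node.append(node)
        return self._node_to_oid[node]

    def node(self, oid):
        """Translate an OID back to the original resource or literal."""
        return self._oid_to_node[oid]


lexicon = Lexicon()
quad = ("ex:pat", "ex:knows", "ex:jo", "ex:foaf")
encoded = tuple(lexicon.oid(term) for term in quad)   # what the quad index stores
decoded = tuple(lexicon.node(o) for o in encoded)     # what the user sees
```

Each distinct term is stored once in the lexicon, and the quad index then only holds fixed-size integers, however long the URIs and literals are.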

Quad Access Patterns • We want to be able to answer every access pattern over (s, p, o, c), i.e. every combination of bound and unbound positions, with a single index lookup • In total, 2*2*2*2 = 16 combinations

Recap: B+-Trees • Underlying storage technique based on B+-trees • One property of B+-trees: range lookups/prefix lookups • (key, value) pairs with fast retrieval on given (partial) key

Complete Index on Quads • Given prefix lookup capabilities, only 6 indexes are needed to cover all access patterns
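The claim can be checked mechanically: an index with a given key ordering answers, via prefix lookup, exactly those access patterns whose bound positions form a prefix of the ordering. The slide does not list the six orderings, so the set below is an assumption chosen for illustration (it includes the POCS ordering mentioned later in the talk). Six is also a lower bound, because each ordering has only one two-element prefix and there are six ways to bind two of four positions.

```python
from itertools import combinations

POSITIONS = "spoc"

# Assumed key orderings; one possible choice of six that covers everything.
INDEXES = ["spoc", "pocs", "ocsp", "cspo", "cpso", "ospc"]


def covering_index(bound):
    """Return an index whose key prefix is exactly the set of bound positions."""
    bound = frozenset(bound)
    for idx in INDEXES:
        if frozenset(idx[:len(bound)]) == bound:
            return idx
    return None


def all_patterns():
    """All 16 subsets of {s, p, o, c}: the positions that are bound."""
    for k in range(len(POSITIONS) + 1):
        for combo in combinations(POSITIONS, k):
            yield frozenset(combo)


coverage = {pattern: covering_index(pattern) for pattern in all_patterns()}
uncovered = [pattern for pattern, idx in coverage.items() if idx is None]
```

With these orderings `uncovered` is empty, i.e. every one of the 16 access patterns maps to a single prefix lookup on one of the six indexes.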

Index Creation Performance (Lehigh Benchmark, ~3 million triples)

Evaluating Triple Patterns • Triple pattern (?s, foaf:name, “Stefan Decker”) • Translate string values to OIDs (2, 11) • Determine index (POCS) • Construct key (2:11:*:*) • Perform prefix lookup on POCS with key 2:11:*:* • Translate result back to string values
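The lookup steps above can be sketched with a sorted list standing in for the B+-tree. Everything except the OIDs 2 and 11 and the POCS ordering is an assumption made for the example: the remaining OIDs, the resource names, and the helper functions are hypothetical.

```python
from bisect import bisect_left

# Hypothetical lexicon; only the OIDs 2 and 11 come from the slide.
OID = {
    "foaf:name": 2,
    "ex:foaf": 5,
    "ex:stefan": 7,
    "ex:andreas": 8,
    '"Stefan Decker"': 11,
    '"Andreas Harth"': 12,
}
NODE = {oid: node for node, oid in OID.items()}

# POCS index: keys are (predicate, object, context, subject) OID tuples,
# kept sorted so that all keys sharing a prefix are adjacent.
POCS = sorted([
    (2, 11, 5, 7),
    (2, 12, 5, 8),
])


def prefix_lookup(index, prefix):
    """Return all keys in the sorted index that start with `prefix`."""
    start = bisect_left(index, prefix)   # first key >= prefix
    results = []
    for key in index[start:]:
        if key[:len(prefix)] != prefix:
            break
        results.append(key)
    return results


def subjects_with(predicate, obj):
    """Evaluate the pattern (?s, predicate, obj) over all contexts."""
    key = (OID[predicate], OID[obj])                 # e.g. 2:11:*:*
    return [NODE[s] for (_p, _o, _c, s) in prefix_lookup(POCS, key)]


matches = subjects_with("foaf:name", '"Stefan Decker"')
```

A real B+-tree performs the same seek-then-scan on disk pages; the binary search here plays the role of descending the tree to the first matching leaf entry.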

Query Performance Over 3 minutes, really Queries: 1: ?x rdf:type univ:UndergradStudent 2: ?x ?p "UndergraduateStudent0" 3: <http://www.Univ965.edu> ?p ?o 4: ?x univ:worksFor ?y

Centralized vs. distributed approach • Centralized system: no network communication (faster response); no synchronization required; no fault tolerance (if the server goes down, the system is unusable); not scalable • Distributed system: network communication (slower response due to message exchange on the network, delays…); synchronization required (depending on the system, data may have to be moved from one node to another as the number of nodes increases); fault tolerance (if a node goes down, the system can adjust itself and keep working using replicas); scalability (as the amount of data increases, so does the number of servers)

What we have • One machine with one YARS instance running, storing all the data and answering users' requests • Diagram labels: user, query, access interface, YARS, data

What we want (shared-nothing architecture) • Several machines, each running a database instance, to (uniformly) distribute data • Reliable and easy to scale • Ability to dynamically add machines to the system as the amount of data increases or the system becomes slow • Diagram labels: user, query, network, data, and three YARS instances, each with its own access interface

Index Distribution • Three possibilities to distribute the database over a computing cluster; options 2) and 3) employ a shared-nothing architecture • 1) Virtualization software • 2) Distributed hash tables • 3) Range-based distribution

Recap: Hashing • A hash function H is a transformation that takes a variable-size input m and returns a fixed-size string, which is called the hash value h (h=H(m)). • As an example, we may consider the strings “Galway” and “Dublin”, and a generic hash function could return the values • H(“Galway”)=6 • H(“Dublin”)=2

How to determine the location of a key • To answer queries quickly, we have to know where the data is • DHTs let us select the machine on which to store or fetch a datum on the basis of the datum itself (we get the datum's ID by hashing it) • location = hash(key) • P2P network based on distributed hash tables (Chord)
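A minimal sketch of location = hash(key) in the style of Chord follows: machines and keys are hashed onto the same identifier ring, and a key is stored on the first machine whose identifier follows the key's identifier. The machine names, the 32-bit ring size, and the helper names are assumptions; SHA-1 is used because Python's built-in `hash()` for strings differs between processes.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 32  # assumed identifier length, kept small for readability


def ring_id(value):
    """Map a string (machine name or key) to a position on the ring."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:RING_BITS // 8], "big")


def build_ring(machines):
    """Sorted list of (ring position, machine)."""
    return sorted((ring_id(m), m) for m in machines)


def locate(ring, key):
    """Return the machine responsible for `key`: its successor on the ring."""
    ids = [node_id for node_id, _ in ring]
    pos = bisect_left(ids, ring_id(key)) % len(ring)  # wrap around at the end
    return ring[pos][1]


machines = ["yars-1", "yars-2", "yars-3"]  # hypothetical node names
ring = build_ring(machines)
home = locate(ring, "http://example.org/pat")
```

Every node that knows the ring computes the same `home` for a key without consulting a central directory; the sketch leaves out Chord's finger tables and node joins.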

Load balancing Load balancing problems might occur if the nodes are not uniformly distributed in the identifier space. This is a common issue of DHTs, because they don't evenly partition the address space into which keys get mapped.

Virtual nodes A solution for this may be the use of “virtual nodes”: every physical node is mapped to several locations in the address space, which is thereby partitioned into small intervals assigned to different nodes. The overall load of a physical node is the sum of the loads of all its virtual nodes.
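The effect of virtual nodes can be explored with a short experiment. The sketch below is illustrative only: the machine names, the key set, the `machine#i` naming of virtual nodes, and the choice of 100 virtual nodes per machine are all assumptions. It counts how many keys each physical machine receives with and without virtual nodes.

```python
import hashlib
from bisect import bisect_left
from collections import Counter


def ring_id(value):
    """Map a string to a 32-bit position on the identifier ring."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_ring(machines, virtual_nodes=1):
    """Place each physical machine at `virtual_nodes` positions on the ring."""
    return sorted(
        (ring_id(f"{machine}#{i}"), machine)
        for machine in machines
        for i in range(virtual_nodes)
    )


def load_per_machine(ring, keys):
    """Load of a physical machine = sum of the loads of its virtual nodes."""
    ids = [node_id for node_id, _ in ring]
    load = Counter()
    for key in keys:
        pos = bisect_left(ids, ring_id(key)) % len(ring)
        load[ring[pos][1]] += 1
    return load


machines = ["yars-1", "yars-2", "yars-3", "yars-4"]
keys = [f"http://example.org/resource/{i}" for i in range(10_000)]

plain = load_per_machine(build_ring(machines), keys)
virtual = load_per_machine(build_ring(machines, virtual_nodes=100), keys)
```

Comparing `plain` and `virtual` shows how the key counts per machine change; with many virtual nodes one would expect the counts to lie closer together, since each machine owns many small intervals instead of one large one.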

Query Evaluation Procedure • Parse N3 into a set of quads • Transform to implication (⇒) • Transform to relational algebra • Apply optimizations (mainly: join reordering, System R-style dynamic programming) • Transform to a physical access plan based on iterators (with init(), next(), and close() methods)
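The last step, a physical plan built from iterators, can be sketched as follows. Only the init(), next(), and close() method names come from the slide; the operator classes, the nested-loop join strategy, and the sample triples are assumptions made for illustration, and `None` is used as the end-of-stream marker.

```python
class Scan:
    """Leaf operator: iterates over stored rows (here, an in-memory list)."""

    def __init__(self, rows):
        self.rows = rows

    def init(self):
        self._it = iter(self.rows)

    def next(self):
        return next(self._it, None)   # None signals end of stream

    def close(self):
        self._it = None


class Select:
    """Passes on only the rows of its child that satisfy the predicate."""

    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def init(self):
        self.child.init()

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self):
        self.child.close()


class NestedLoopJoin:
    """For each left row, rescans the right input and emits matching pairs."""

    def __init__(self, left, right, on):
        self.left, self.right, self.on = left, right, on

    def init(self):
        self.left.init()
        self.right.init()
        self._left_row = self.left.next()

    def next(self):
        while self._left_row is not None:
            while (right_row := self.right.next()) is not None:
                if self.on(self._left_row, right_row):
                    return self._left_row + right_row
            # Right input exhausted: restart it and advance the left input.
            self.right.close()
            self.right.init()
            self._left_row = self.left.next()
        return None

    def close(self):
        self.left.close()
        self.right.close()


def run(plan):
    """Drive a plan to completion and collect its output rows."""
    plan.init()
    rows = []
    while (row := plan.next()) is not None:
        rows.append(row)
    plan.close()
    return rows


triples = [
    ("ex:a", "foaf:name", "Ann"),
    ("ex:b", "foaf:name", "Bob"),
    ("ex:a", "foaf:depiction", "ex:a.jpg"),
]
names = Select(Scan(triples), lambda t: t[1] == "foaf:name")
pics = Select(Scan(triples), lambda t: t[1] == "foaf:depiction")
plan = NestedLoopJoin(names, pics, on=lambda l, r: l[0] == r[0])
rows = run(plan)
```

Because every operator exposes the same three methods, operators compose freely and rows are produced one at a time, so intermediate results need not be materialized.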

Rule with Context Example Photos of Authors @prefix eo: <http://example.org/> . @prefix en: <http://example.net/> . { ?x foaf:name ?n . {?y foaf:name ?n . ?y foaf:depiction ?p .} yars:context eo:foaf } log:implies { ?x foaf:depiction ?p . } . Rule stored at en:dblp

Parser Representation

Implication Tree
