Interoperation among data sources on the Web

Interoperation among data sources on the Web Andreas Harth

Research Question • How to link data sources on the Web? • How to use these links to interoperate among data sources?

Example • @@@ here the picture with the data sources (DBLP, FOAF, Citeseer)

Significant Problems in the Field • Knowledge Representation • How to incorporate negation into a query and rule language? • How complex are the algorithms for query evaluation? • Information Integration • How to come up with the global schema? • How to incorporate, relate, and combine partial schema and instance data from multiple sources? • How to deal with trust issues? I.e. tracking provenance of data, providing justifications (“why?”) for query answers. • Network/Query Routing • How to coordinate autonomous data sources that are dispersed across the Web? • How to handle/maintain a complex system on Web-scale?

State of the Art: Knowledge Representation • KR language: relational model, OEM, XML, RDF • Query language: conjunctive queries, Datalog, SQL, XQuery, RDQL, DLP, DL, FOL • Different complexity classes: ??? (e.g. DLP view definitions with conjunctive queries is decidable [Abiteboul and Duschka])

State of the Art: Information Integration • Mediated/virtual integration systems: Infomaster [Genesereth], ? [Knoblock], TSIMMIS [?], Garlic [?], … • Data sources are usually wrapped (using e.g. Java) to a common syntax • Information integration systems use a single global schema, which is either manually constructed by domain experts, or the union of all local schemas • Systems typically employ only one mediator and multiple data sources (client – server, tree-like model) • The source schemas are related to the global schema using rules: global-as-view (GAV), local-as-view (LAV), or global-local-as-view (GLAV) notation • Evaluation algorithm for GAV: simple rule-unfolding [Ulmann], for LAV: bucket algorithm, MinCon [Halevy], and for GLAV: ??? [Halevy]? • Tracking provenance to provide justifications is discussed in [McGuiness and Pinhero de Silviva] • Object consolidation (fusing of object identifiers) is done e.g. using decision trees (martin)

State of the Art: Network/Query Routing • Structured networks: distributed hashtables ([Stoica]), range-based partition of key space ([Garcia-Molina]) • P2P: mostly keyword search, queries are flooded (Gnutella), with optimizations such as finger tables, skip lists for query forwarding • Relational model: coordination rules [Bernstein], co-DB [Fausto] • Source Indexes: keep track of which relations are stored in which database [Stuckenschmidt] • Ontologies: C-OWL [Franconi, Fausto Trento]

Approach • Our scenario involves multiple data sources (query processors: mediators, databases, reasoners, …) that (intend to) exchange related data • The syntax for facts, rules, and queries is RDF/Notation3 with context • The system allows rules in the DLP fragment that can be used to axiomatize RDFS/OWL DLP semantics

Ideas • Queries and rules can contain subgoals pertaining to local and/or remote contexts • Rules spanning multiple contexts act as coordination links between contexts/data sources • The thesis provides definitions of model-theoretic, operational (forward-chaining) and proof-theoretic (backward-chaining) meaning of rules • The framework uses an extension of database query processing algorithms for the distributed case, possibly utilizing optimized approaches (SIP strategies/magic set/QSQ) • The algorithms keep track of proofs/processing steps for debugging and providing justifications to the user

Example • @@@ rules for example here

Out of Scope • Data warehouse approach • LAV/GLAV (contrast LAV/GAV processing algorithms – how does LAV relate to reasoning???) • DL, function symbols, FOL, sublanguage definitions • Active databases (Triggers, ECA rules…) • Automatic construction/detection of coordination links • Wrapper construction for legacy data sources • Object consolidation

Preliminary Results • Defined and implemented a complete index on RDF with context • Implemented query processing for conjunctive N3 queries • Conducted performance test and comparisons with existing systems • Formalized the notion of local/remote context • Formalized model theoretic, proof theoretic, and operational semantics for the distributed case • Implemented initial rules processing algorithms (naïve forward chaining) • Formalized the notion of local/remote context • Designed network API • Extended query processing in prototype to handle remote queries

Sketch of Research Methodology • Go from abstract to concrete and back to the abstract again – oscillating process • Implement prototype • Carry out experiments and take measurements • Literature review: compare with other approaches • Start: literature survey – coding – writing –coding – goto Start

Contribution to Problem Solution • Defined the notion of links to relate data in decentralized, autonomous data sources on the Web • Adapted efficient algorithms for RDBMS query evaluation to RDF and our distributed case • Links specify which part of the data set should be imported (for forward-chaining) or “visible” during query processing (for backward-chaining) • Query component uses coordination links for query processing and optimization • Links can be queried, exchanged, collected, analyzed, and mined • From only local interactions a global pattern emerges (-> local-global/scoped ontology), self-organization

What is Better than Existing Approaches • Rules engines are monolithic, local systems • Data integration approaches are mostly XML-based (but having URIs for constants in universe helps  RDF) • Our query/rule language is more expressive than relational algebra in RDBMS because of recursion and network • The systems provides answers to queries rather than keyword searches (P2P) • The peer network model is self-organizing rather than highly structured ( more flexible) • There is no single global schema; but multiple, possibly overlapping schemas: each for every data sources, depending on the links • Better than C-OWL: we allow rules • Better than co-DB: we use RDF as data interchange format (object oriented, instances and classes model)

Why is This Important for Mankind? • Enable to relate, exchange, and interlink data sources rather than just documents or data, which results in a more powerful use of the Web • Possible to automate previously labor-intensive tasks for knowledge workers in the knowledge-based economy

Conclusion • I presented a method to interlink data sources on the Web • The framework allows to operate with only local information • I adapt database query processing algorithms to the distributed case • A partially implemented prototype (ongoing work) is available

Interoperation among data sources on the Web

Interoperation among data sources on the Web

Presentation Transcript

Data Sources

Data Sources

Not getting caught in the web: Credible sources on the web

Data Sources

Data on the (Semantic) Web

Data Sources

Not Getting Caught in the Web: Credible Sources on the Web

Interoperation of Information Sources via Articulation of Ontologies

Data Sources

Information Sources on the Web

Data Sources

7 Top-k Queries on Web Sources and Structured Data

Composing Mappings among Data Sources

Raster Data Sources on the Web

Grid Interoperation on Data Movement between NAREGI and EGEE gLite

Differential Analysis on Deep Web Data Sources

Data centres (on the Web)

Integrating data sources on the World-Wide Web

Genome Data and Tool Interoperation over the “Semantic” Web

Interoperation between Data Sources on the Web

PI Data on the Web