Integration
Download
1 / 31

web data integration - PowerPoint PPT Presentation


  • 289 Views
  • Updated On :

Integration . what it takes to put data together. Ir. Richard Vdovjak, MTD. [email protected] The Grand Challenge. The Integration System should: Provide flexible, homogeneous and transparent access to several (possibly) distributed, autonomous, and heterogeneous sources.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'web data integration' - lotus


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Integration

Integration

what it takes to put data together

Ir. Richard Vdovjak, MTD.

[email protected]


The grand challenge
The Grand Challenge

The Integration System should:

  • Provide flexible, homogeneous and transparent access to several (possibly) distributed, autonomous, and heterogeneous sources.

  • Answer queries expressed in terms of the global schema by means of the source’s data.

  • Preserve the source’s data semantics.

  • Automate the integration process as much as possible.


Why is it so difficult those sources
Why is it so difficult? Those Sources!

  • Number of sources / size of the problem

    • the Web is large! New optimization techniques are needed

      (Pipelining, Distributed processing)

  • Location of the sources / source discovery

    • does a source that supposedly fulfills my info needs exist?

    • where is it located? (The Semantic Web helps – “google” for SW agents” )

  • Heterogeneity of the sources

    • syntactic heterogeneity (HTML, XML, RDF, RDBS, ORDBS, etc..)

    • semantic heterogeneity (class Painter =? Creator)

    • designation heterogeneity (“John Smith” =? http://foo-ns/Smith)


Those sources 2
Those Sources! 2

  • Autonomy of the sources

    • sources may change their data, schemas especially on the Internet and they will not notify the integration system about these changes.

  • Volatility of the sources

    • sources may come and go, the system should be flexible to recover from these changes with minimal effort (use flexible ways to map sources to the system, use of adaptive query answering techniques)

  • Different source capabilities

    • some source are more intelligent than others, e.g. they allow to perform e.g. join like queries on them)

    • some sources are more restrictive than others, e.g. they require certain input (filling out some forms) before they provide results


Related fields
Related Fields

  • Distributed Databases

    • Sources are homogeneous

    • Sources are not (so) autonomous

    • Similarities at the optimization and execution level

  • Information Retrieval

    • Keywords only, no semantics

    • IR techniques help for addressing the designation problem

  • Data mining

    • discovering “hidden” properties of data


Approaches to data integration
Approaches to Data Integration

  • Materialized views

    • global (materialized) schema

    • all data pre-fetched into a central repository,

      • should be automated

      • similar to querying in VV

    • “easy” querying

    • after the data is “in house” the system is independent from the sources

    • difficult to keep up-to-date

    • intranet applications (e.g. enterprise data integration)

  • Virtual views

    • global (mediated) schema

    • data retrieved on demand (no central repository)

    • a query against the global schema is translated into sub-queries and shipped to the sources

    • always up-to-date

    • most WWW applications e.g. shopbots, different internet portals


Approaches to data integration1
Approaches to Data Integration

  • Materialized views

  • Virtual views


Some db theory

Schema

Conjunctive queries

For example:

Give me students together with their advisors who took courses given by their advisors after the 1st trimester 2003.

Some DB Theory


Db theory

SQL

SELECT Advises.prof, Advises.student

FROM Registered, Teaches, Advises

WHERE Registered.c-number = Teaches.c-number and Registered.trimester=Teaches.trimester and Advises.prof=Teaches.prof and Advises.student=Registered.student and Registered.trimester > “2003\1”

DATALOG

Q(prof, student) :-

Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\1”

DB Theory

head

body


Db theory query containment

A query Q’ is contained in Q if for all databases D, the set of tuples computed for Q’ is a subset of those computed for Q,

i.e., Q’ Q iff  D Q’(D)  Q(D)

A query Q’ is equivalent to Q iff

Q’ Q and Q Q’

Example:

Q’(prof, student) :-

Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\2”

is contained in

Q(prof, student) :-

Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\1”

DB Theory, query containment


Query containment check

Create a canonical database D that is the “frozen” body of Q’

Compute Q(D)

If Q(D) contains the frozen head of Q’ then Q’  Q otherwise not

Example:

Q’: p3(x,y) :- arc(x,z),arc(z,w),arc(w,y)

Q: path(x,y) :- arc(x,y)

path(x,y) :- path(x,z), path(z,y)

1. Freeze Q’ with say 0,1,2,3 for x,z,w,y respectively:

Q’(0,3) :- arc(0,1), arc(1,2), arc(2,3)

D={arc(0,1), arc(1,2), arc(2,3)}

2. Compute Q(D):

Q(D)={(0,1), (1,2), (2,3), (0,2), (1,3), (0,3)}

3. The frozen head of Q’ = (0,3) is in the Q(D)  Q’Q

Query containment check


Query reformulation using views

Problem: Given a user query Q and view definitions V={V1..Vn}

Q’ is an Equivalent Rewriting of Q using V iff

Q’ refers only to views in V,

Q’ is equivalent to Q

Q’ is a Maximally-contained Rewriting of Q using V iff

Q’ refers only to views in V,

Q’  Q,

there is no such rewriting Q1 that Q’  Q1  Q, Q1  Q

Query reformulation using views


Bored
Bored? V={V1..Vn}

  • How about some motivation?

  • Why do we bother?

  • Is it really worth it?

  • Is anybody out there interested in integration?



How big is big

Estimated Annual Integration costs + data quality costs worldwide:

$1 Trilion/year!1,2

...and this is just a beginning

“The Grand Challenge is now becoming mission critical”.

How Big is Big?

1 Dr. M. Brodie, The Grand Challenge of Information Technology, Invited talk, CAiSE 2002

2 Trillion is 12 or 18 zeros, depending of which side of Atlantic you are :)


Approaches to map describe the sources

G worldwide:lobal As View (GAV)

The mediated schema is defined as a view over the source schemas

Local As View (LAV)

Sources are described as views over the mediated schema

Ontological Approach

An explicit integration model instance that adheres to an integration model ontology is created.

Approaches to map/describe the Sources


Global as view gav

X is defined as a view over Y and Z worldwide:

Mapping Example:

Faculty(name,area) -> Person(id, name), Expert(id, area)

Teaches(name, c-number, trim)

-> Person(id, name), Gives(id, c-number, trimester)

Query answering means unfolding the mappings/rules

Global As View (GAV)


Query answering query unfolding
Query Answering = Query Unfolding worldwide:

  • Query Example:

    Who from the area of “IS” was teaching in the trimester “2003/2”?

    Q(name):-

    Faculty(name, “IS”), Teaches(name, c,2003/2”)

    Person(id, name), Expert(id,”IS”), Gives(id, c, “2003/2”)


Local as view lav

Each data source is described by one or more views wrt the mediated schema.

The views might not be complete, i.e., they may only contain a subset of tuples satisfying their definition.

Local As View (LAV)


Local as view lav1

Mapping Example mediated schema.

V1(student, c-number, trimester,title) :- Registered(student, c-number, trimester), Course(c-number, title), c-number> “500”, trimester> “2002/1”

V2(student, prof, c-number, trimester) :- Registered(student, c-number, trimester), Teaches(prof, c-number, quarter)

V3(student, c-number) :- Registered(student, c-number, trimester), trimester < “1998/1”

V4(prof, c-number, title, trimester) :- Registered(student, c-number, trimester), Course(c-number, title), Teaches(prof, c-number, trimester), trimester < “2000/1”

User Query Example

Q(s,c,p) :-

Teaches(p,c,q), Registered(s,c,q), Course(c,t), c > “300”, q > “1999/2”

Local As View (LAV)


Query answering query rewriting
Query Answering = Query Rewriting mediated schema.

  • Goal: reformulate a user query (against the mediated schema) into a query that refers directly to the available sources/views.

  • = Find a maximally-contained rewriting of the user query in the views describing the sources.

  • How?


The bucket algorithm 1

The main idea: the number of rewritings that need to be considered can be reduced by considering each sub-goal of the query in isolation.

input = set of views

output = maximally-contained rewriting composed from (some of the) views

informally, a view can be useful for a query if the set of relations it mentions overlaps with that of the query, and it selects some of the attributes selected by the query + the constrains of the view must be equivalent or weaker than those from the query

The Bucket Algorithm 1


The bucket algorithm 2

Step 1: considered can be reduced by considering each sub-goal of the query in isolation.

Create a bucket for each sub-goal of Q and fill it in with views/sources that are relevant for that particular sub-goal.

A view V is put in the bucket of the relation/sub-goal g if it contains this relation AND the constraints in V are compatible (weaker than) those in Q.

Step 2:

Combine source relations from each bucket into conjunctive queries (each consisting of one conjunct from every bucket).

Check containment wrt to Q, if contained add the rewriting to the answer.

The result is a union of conjunctive rewritings.

The Bucket Algorithm 2


The bucket algorithm example step 1

V1(student, c-number, trimester,title) :- Registered(student, c-number, trimester), Course(c-number, title), c-number> “500”, trimester> 2002/1”

V2(student, prof, c-number, trimester) :- Registered(student, c-number, trimester), Teaches(prof, c-number, quarter)

V3(student, c-number) :- Registered(student, c-number, trimester), trimester < “1998/1”

V4(prof, c-number, title, trimester) :- Registered(student, c-number, trimester), Course(c-number, title), Teaches(prof, c-number, trimester), trimester < “2000/1”

Query Constraints: c > “300”, q > “1999/2”

Sub-goals: Teaches(p,c,q), Registered(s,c,q), Course(c,t) ,

Buckets: V2(s’,p,c,q) V1(s,c,q,t’) V1(s’,c,q’,t)

V4(p,c,t,q) V2(s,p’,c,q) V4(p’,c,t,q’)

variable mappings:

{p-> prof, c->c-number, q-> trimester}

{s->student, c->c-number, q-> trimester}

{c->c-number, t->title}

The Bucket Algorithm, Example: Step 1


The bucket algorithm example step 2

V1(student, c-number, trimester,title) :- Registered(student, c-number, trimester), Course(c-number, title), c-number> “500”, trimester> 2002/1”

V2(student, prof, c-number, trimester) :- Registered(student, c-number, trimester), Teaches(prof, c-number, quarter)

V3(student, c-number) :- Registered(student, c-number, trimester), trimester < “1998/1”

V4(prof, c-number, title, trimester) :- Registered(student, c-number, trimester), Course(c-number, title), Teaches(prof, c-number, trimester), trimester < “2000/1”

Query Constraints: c > “300”, q > “1999/2”

Sub-goals: Teaches(p,c,q), Registered(s,c,q), Course(c,t) ,

Buckets:V2(s’,p,c,q)V1(s,c,q,t’)V1(s’,c,q’,t)

V4(p,c,t,q) V2(s,p’,c,q) V4(p’,c,t,q’)

Q’(s,c,p) :- V2(s’,p,c,q), V1(s,c,q,t’), V1(s’,c,q’,t)

= V2(s,p,c,q), V1(s,c,q,t’),

Q’’(s,c,p) :- V2(s,p,c,q), V4(p,c,t’,q)

Q’  Q, Q’’  Q  Q’Q’’ Q

The Bucket Algorithm, Example: Step 2


Gav versus lav

GAV Registered(student, c-number, trimester), Course(c-number, title), c-number> “500”, trimester> 2002/1”

adding/deleting a new source means changing the global schema (that is defined by all it’s rewriting rules)

“easy” query reformulation

LAV

adding/deleting the source means just adding/deleting views from, i.e. the schema stays intact

“difficult” query reformulation

GAV versus LAV


The road doesn t end here there are still problems to tackle
The road doesn’t end here, there are still problems to tackle

  • Performance issues

    • the size of the problem, remember?

    • network behavior (delay, bursting data)

  • Source Discovery

  • Schema Discovery

  • Approximate answers

    • a less precise answer is often better than no (precise) answer at all

  • Choosing the right sources wrt to user demands

    • not everybody has the same evaluation criterion for an answer especially when multimedia are involved, e.g. (e.g. most “vivid”, with highest resolution, fastest, most reliable, etc. )


A light at the end of the tunnel
A Light at the End of the Tunnel? tackle

  • Pipelineing, Parallelism, Adaptivity

  • Ontologies, Semantic Web, Agents

  • Emerging Standards e.g. RDF(S)

  • Combination of IR techniques and data integration, use of BBN

  • Programmable Mediators


Programmable mediators

The main idea: tackle

Sources are clustered (based on the information they offer) and the order in which they are consulted within a cluster is not “hard-wired” but flexible based on their annotations.

RDFS-based application independent integration meta ontology (IMO) and its application dependent specialization

Sources are mapped and annotated as RDF instantiation of the IMO

Programmable Mediators


Programmable mediators 2

Sources are mapped by articulations to the mediated schema. tackle

Each such mapping can be annotated with a decoration(s) which rates that mapping for a concrete ordering criterion (criteria).

When the mediator encounters “equivalent” sources it chooses those that are having better “score” in the query dependent criterion

Programmable Mediators 2


The end
The End tackle

  • Thank you for your attention!

  • Those who fell asleep: WAKE UP! It’s time to go home :)


ad