Data Integration - PowerPoint PPT Presentation

Data integration l.jpg
Download
1 / 57

  • 198 Views
  • Updated On :
  • Presentation posted in: Home / Garden

Data Integration . Data Integration Challenge. Find houses with 4 bedrooms priced under 300K. New faculty member. realestate.com. homeseekers.com. homes.com. wrapper. wrapper. wrapper. Architecture of Data Integration Systems. Find houses with 4 bedrooms priced under 300K.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

Data Integration

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Data integration l.jpg

Data Integration


Data integration challenge l.jpg

Data Integration Challenge

Find houses with

4 bedrooms

priced under

300K

New faculty

member

realestate.com

homeseekers.com

homes.com


Architecture of data integration systems l.jpg

wrapper

wrapper

wrapper

Architecture of Data Integration Systems

Find houses with 4 bedroomspriced under 300K

mediated schema

source schema 1

source schema 2

source schema 3

homes.com

realestate.com

houses.com

Provide a uniform query interface


Architecture of data integration systems4 l.jpg

cost | city | numbeds | numbaths

price location beds baths

$185,000 Urbana, IL 2 2

$270,000 Seattle, WA 3 -

Architecture of Data Integration Systems

Find houses with 4 bedroomspriced under 300K

price location beds

$185K Urbana, IL 2

$299K Kent, WA 3

wrapper

wrapper

$185,000 <em>Urbana, IL</em>

2 beds / 2 baths Century 21

$270,000 <em>Seattle, WA</em>

3 beds REMAX real estate

homeseekers.com

homeseekers.com

Involve many tasks


Another example l.jpg

Another Example

Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet


More motivating examples l.jpg

More Motivating Examples

  • An organization has on average 49 databases

    • can talk about the same topic, but use different vocabularies, different schemas

    • how can we access them as if accessing a single db?

  • Hundreds of online bookstores

    • amazon.com, barnes&noble.com, etc.

    • how can we query them as if querying a single source?

  • Hundreds of CS websites in US, in text format

    • can we consolidate information about all of them and query them as if querying a giant relational database?


The general problem l.jpg

The General Problem

  • How can we access a set of heterogeneous, distributed, autonomous databases as if accessing a single database?

  • Arises in numerous contexts

    • on the Web, at enterprises, military, scientific cooperation, bio-informatics domains, e-commerce, etc.

  • Currently very hot

    • in both database research and industry


Current state of affairs l.jpg

Current State of Affairs

  • Mostly ad-hoc programming: create a special solution for every case; pay consultants a lot of money.

  • Long-standing challenge in the DB community

  • AI/WWW communities are on board

  • Annual workshops, vision papers, ...

  • Companies

    • Informatica, many others, ...


A brief research history l.jpg

A Brief Research History

  • Many early ad-hoc solutions

  • Converged into two approaches

    • data warehousing vs. virtual DI systems

  • Semi-structured data, XML

  • Wrappers, info extraction

  • Other issues: query optimization, schema matching, ...

  • Current directions

    • DI for specialized domains (e.g., bioinformatics)

    • on-the-fly DI, entity-centric DI

    • simplify integration tasks

  • New types of data sharing systems

    • P2P systems, Semantic Web


Data warehousing vs virtual di systems l.jpg

Data warehousing vs.Virtual DI systems


Data warehouse architecture l.jpg

Data Warehouse Architecture

OLAP / Decision support/

Data cubes/ data mining

User queries

Relational database (warehouse)

Data extraction

programs

Data cleaning/

scrubbing

Data

source

Data

source

Data

source


Data warehousing l.jpg

Data warehousing

  • Data warehousing: load all the data periodically into a warehouse.

    • 6-18 months lead time

    • Separates operational DBMS from decision support DBMS. (not only a solution to data integration).

    • Performance is good; data may not be fresh.

    • Need to clean, scrub you data.


The virtual integration architecture l.jpg

The Virtual Integration Architecture

  • Leave the data in the sources.

  • When a query comes in:

    • Determine the relevant sources to the query

    • Break down the query into sub-queries for the sources.

    • Get the answers from the sources, and combine them appropriately.

  • Data is fresh.

  • Challenge: many


Virtual integration architecture l.jpg

Virtual Integration Architecture

User queries

Mediated schema

Mediator:

Reformulation engine

optimizer

Which data

model?

Data source

catalog

Execution engine

wrapper

wrapper

wrapper

Data

source

Data

source

Data

source

Sources can be: relational, hierarchical (IMS), structure files, web sites.


Architecture of virtual data integration system l.jpg

Architecture of (Virtual) Data Integration System

Find books written by Isaac Asimov & priced under $15

global query interface

query interface 1

query interface 2

query interface 3

amazon.com

bn.com

powell.com


A brief history l.jpg

A Brief History

  • Many early ad-hoc solutions

  • Converged into two approaches

    • data warehousing vs. virtual DI systems

  • Semi-structured data, XML

  • Wrappers

  • Other issues: query optimization, schema matching, ...

  • Current directions

    • DI for specialized domains (e.g., bioinformatics)

    • on-the-fly DI, entity-centric DI

  • New types of data sharing systems

    • P2P systems, Semantic Web


Semi structured data l.jpg

Semi-structured Data

  • What should be the underlying data model for DI contexts?

    • relational model is not an ideal choice

  • Developed semi-structured data model

    • started with the OEM (object exchange model)

  • Then XML came along

  • It is now the most well-known semi-structured data model

  • Generating much research in the DB community


Slide18 l.jpg

HTML

<h1> Bibliography </h1>

<p> <i> Foundations of Databases </i>

Abiteboul, Hull, Vianu

<br> Addison Wesley, 1995

<p> <i> Data on the Web </i>

Abiteboul, Buneman, Suciu

<br> Morgan Kaufmann, 1999

HTML is hard for applications


Slide19 l.jpg

XML

<bibliography>

<book> <title> Foundations… </title>

<author> Abiteboul </author>

<author> Hull </author>

<author> Vianu </author>

<publisher> Addison Wesley </publisher>

<year> 1995 </year>

</book>

</bibliography>

XML describes the content: easy for applications


Dtds as grammars l.jpg

DTDs as Grammars

Same thing as:

  • A DTD is a EBNF (Extended BNF) grammar

  • An XML tree is precisely a derivation tree

db ::= (book|publisher)*

book ::= (title,author*,year?)

title ::= string

author ::= string

year ::= string

publisher ::= string

XML Documents that have a DTD and conform to it are called valid


More on dtds as grammars l.jpg

More on DTDs as Grammars

<!DOCTYPE paper [

<!ELEMENT paper (section*)>

<!ELEMENT section ((title,section*) | text)>

<!ELEMENT title (#PCDATA)>

<!ELEMENT text (#PCDATA)>

]>

<paper> <section> <text> </text> </section>

<section> <title> </title> <section> … </section>

<section> … </section>

</section>

</paper>

XML documents can be nested arbitrarily deep


Xml for representing data l.jpg

<persons>

<row> <name>John</name>

<phone> 3634</phone></row>

<row> <name>Sue</name>

<phone> 6343</phone>

<row> <name>Dick</name>

<phone> 6363</phone></row>

</persons>

XML for Representing Data

XML:

persons

persons

row

row

row

phone

name

phone

name

phone

name

“John”

3634

“Sue”

6343

“Dick”

6363


Xml vs data models l.jpg

XML vs Data Models

  • XML is self-describing

  • Schema elements become part of the data

    • Relational schema: persons(name,phone)

    • In XML <persons>, <name>, <phone> are part of the data, and are repeated many times

  • Consequence: XML is much more flexible

  • XML = semistructured data


Semi structured data explained l.jpg

Semi-structured Data Explained

  • Missing attributes:

  • Repeated attributes

<person> <name> John</name>

<phone>1234</phone>

</person>

<person> <name>Joe</name>

</person>

 no phone !

<person> <name> Mary</name>

<phone>2345</phone>

<phone>3456</phone>

</person>

 two phones !


Semistructured data explained l.jpg

Semistructured Data Explained

  • Attributes with different types in different objects

  • Nested collections (no 1NF)

  • Heterogeneous collections:

    • <db> contains both <book>s and <publisher>s

<person> <name> <first> John </first>

<last> Smith </last>

</name>

<phone>1234</phone>

</person>

 structured name !


Xml data v s e r odl relational l.jpg

XML Data v.s. E/R, ODL, Relational

  • Q: is XML better or worse ?

  • A: serves different purposes

    • E/R, ODL, Relational models:

      • For centralized processing, when we control the data

    • XML:

      • Data sharing between different systems

      • we do not have control over the entire data

      • E.g. on the Web

  • Do NOT use XML to model your data ! Use E/R, ODL, or relational instead.


Exporting relational data to xml l.jpg

Exporting Relational Data to XML

  • Product(pid, name, weight)

  • Company(cid, name, address)

  • Makes(pid, cid, price)

makes

product

company


Export data grouped by companies l.jpg

Export data grouped by companies

<db><company> <name> GizmoWorks </name>

<address> Tacoma </address>

<product> <name> gizmo </name>

<price> 19.99 </price>

</product>

<product> …</product>

</company>

<company> <name> Bang </name>

<address> Kirkland </address>

<product> <name> gizmo </name>

<price> 22.99 </price>

</product>

</company>

</db>

Redundant

representation

of products


The dtd l.jpg

The DTD

<!ELEMENT db (company*)>

<!ELEMENT company (name, address, product*)>

<!ELEMENT product (name,price)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT address (#PCDATA)>

<!ELEMENT price (#PCDATA)>


Export data by products l.jpg

Export Data by Products

<db> <product> <name> Gizmo </name>

<manufacturer>

<name> GizmoWorks </name>

<price> 19.99 </price>

<address> Tacoma </address>

</manufacturer>

<manufacturer>

<name> Bang </name>

<price> 22.99 </price>

<address> Kirkland </address>

</manufacturer>

</product>

<product> <name> OneClick </name> …

</db>

Redundant

Representation

of companies


Which one do we choose l.jpg

Which One Do We Choose ?

  • The structure of the XML data is determined by agreement, with our partners, or dictated by committees

    • Many XML dialects (called applications)

  • XML Data is often nested, irregular, etc

  • No normal forms for XML 


Xml query languages l.jpg

XML Query Languages

  • Xpath

  • XML-QL

  • Xquery


A brief history33 l.jpg

A Brief History

  • Many early ad-hoc solutions

  • Converged into two approaches

    • data warehousing vs. virtual DI systems

  • Semi-structured data, XML

  • Wrappers

  • Other issues: query optimization, schema matching, ...

  • Current directions

    • DI for specialized domains (e.g., bioinformatics)

    • on-the-fly DI, entity-centric DI

  • New types of data sharing systems

    • P2P systems, Semantic Web


Virtual integration architecture34 l.jpg

Virtual Integration Architecture

User queries

Mediated schema

Mediator:

Reformulation engine

optimizer

Which data

model?

Data source

catalog

Execution engine

wrapper

wrapper

wrapper

Data

source

Data

source

Data

source

Sources can be: relational, hierarchical (IMS), structure files, web sites.


Wrapper programs l.jpg

Wrapper Programs

  • Task: to communicate with the data sources and do format translations.

  • They are built w.r.t. a specific source.

  • They can sit either at the source or at the mediator.

  • Often hard to build (very little science).

  • Can be “intelligent”: perform source-specific optimizations.


Example l.jpg

Example

Transform:

<b> Introduction to DB </b>

<i> Phil Bernstein </i>

<i> Eric Newcomer </i>

Addison Wesley, 1999

<book>

<title> Introduction to DB </title>

<author> Phil Bernstein </author>

<author> Eric Newcomer </author>

<publisher> Addison Wesley </publisher>

<year> 1999 </year>

</book>

into:


Wrapper construction l.jpg

Wrapper Construction

  • Huge amount of research in the past decade

  • Two major approaches

    • machine learning: typically requires some hand-labeled data

    • data-intensive, completely automatic

  • Different focuses

    • pull out each record (i.e., segment page into records)

    • pull out fields in each record

    • remove junk portions (ads, etc.)

  • Current solutions are still brittle

  • Unclear whether “standards” such as XML & Web services will eliminate the problem

    • the need likely will still remain


Information extraction l.jpg

Information Extraction

  • If the source cannot be wrapped with a grammar or some easy-to-parse rules

    • must do information extraction

  • Huge research in the AI community


A brief history39 l.jpg

A Brief History

  • Many early ad-hoc solutions

  • Converged into two approaches

    • data warehousing vs. virtual DI systems

  • Semi-structured data, XML

  • Wrappers

  • Other issues: query optimization, schema matching, ...

  • Current directions

    • DI for specialized domains (e.g., bioinformatics)

    • on-the-fly DI, entity-centric DI

  • New types of data sharing systems

    • P2P systems, Semantic Web


Data source catalog l.jpg

Data Source Catalog

  • Contains all meta-information about the sources:

    • Logical source contents (books, new cars).

    • Source capabilities (can answer SQL queries)

    • Source completeness (has all books).

    • Physical properties of source and network.

    • Statistics about the data (like in an RDBMS)

    • Source reliability

    • Mirror sources

    • Update frequency.


Content descriptions l.jpg

Content Descriptions

  • User queries refer to the mediated schema.

  • Data is stored in the sources in a local schema.

  • Content descriptions provide the semantic mappings between the different schemas.

  • Data integration system uses the descriptions to translate user queries into queries on the sources.


Desiderata from source descriptions l.jpg

Desiderata from Source Descriptions

  • Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources.

  • Easy addition: make it easy to add new data sources.

  • Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.


Reformulation problem l.jpg

Reformulation Problem

  • Given:

    • A query Q posed over the mediated schema

    • Descriptions of the data sources

  • Find:

    • A query Q’ over the data source relations, such that:

      • Q’ provides only correct answers to Q, and

      • Q’ provides all possible answers from to Q given the sources.


Approaches to specifying source descriptions l.jpg

Approaches to Specifying Source Descriptions

  • Global-as-view: express the mediated schema relations as a set of views over the data source relations

  • Local-as-view: express the source relations as views over the mediated schema.

  • Can be combined with no additional cost.


Global as view l.jpg

Global-as-View

Mediated schema:

Movie(title, dir, year, genre),

Schedule(cinema, title, time).

Create View Movie AS

select * from S1 [S1(title,dir,year,genre)]

union

select * from S2 [S2(title, dir,year,genre)]

union[S3(title,dir), S4(title,year,genre)]

select S3.title, S3.dir, S4.year, S4.genre

from S3, S4

where S3.title=S4.title


Global as view example 2 l.jpg

Global-as-View: Example 2

Mediated schema:

Movie(title, dir, year, genre),

Schedule(cinema, title, time).

Create View Movie AS [S1(title,dir,year)]

select title, dir, year, NULL

from S1

union [S2(title, dir,genre)]

select title, dir, NULL, genre

from S2


Global as view example 3 l.jpg

Global-as-View: Example 3

Mediated schema:

Movie(title, dir, year, genre),

Schedule(cinema, title, time).

Source S4: S4(cinema, genre)

Create View Movie AS

select NULL, NULL, NULL, genre

from S4

Create View Schedule AS

select cinema, NULL, NULL

from S4.

But what if we want to find which cinemas are playing comedies?


Global as view summary l.jpg

Global-as-View Summary

  • Query reformulation boils down to view unfolding.

  • Very easy conceptually.

  • Can build hierarchies of mediated schemas.

  • You sometimes loose information. Not always natural.

  • Adding sources is hard. Need to consider all other sources that are available.


Local as view example 1 l.jpg

Local-as-View: example 1

Mediated schema:

Movie(title, dir, year, genre),

Schedule(cinema, title, time).

Create Source S1 AS

select * from Movie

Create Source S3 AS [S3(title, dir)]

select title, dir from Movie

Create Source S5 AS

select title, dir, year

from Movie

where year > 1960 AND genre=“Comedy”


Local as view example 2 l.jpg

Local-as-View: Example 2

Mediated schema:

Movie(title, dir, year, genre),

Schedule(cinema, title, time).

Source S4: S4(cinema, genre)

Create Source S4

select cinema, genre

from Movie m, Schedule s

where m.title=s.title

.

Now if we want to find which cinemas are playing comedies, there is hope!


Local as view summary l.jpg

Local-as-View Summary

  • Very flexible. You have the power of the entire query language to define the contents of the source.

  • Hence, can easily distinguish between contents of closely related sources.

  • Adding sources is easy: they’re independent of each other.

  • Query reformulation: answering queries using views!


The general problem52 l.jpg

The General Problem

  • Given a set of views V1,…,Vn, and a query Q, can we answer Q using only the answers to V1,…,Vn?

  • Many, many papers on this problem.

  • The best performing algorithm: The MiniCon Algorithm, (Pottinger & Levy, 2000).

  • Great survey on the topic: (Halevy, 2001).


Query optimization l.jpg

Query Optimization

  • Very related to query reformulation!

  • Goal of the optimizer: find a physical plan with minimal cost.

  • Key components in optimization:

    • Search space of plans

    • Search strategy

    • Cost model


Optimization in distributed dbms l.jpg

Optimization in Distributed DBMS

  • A distributed database (2-minute tutorial):

    • Data is distributed over multiple nodes, but is uniform.

    • Query execution can be distributed to sites.

    • Communication costs are significant.

  • Consequences for optimization:

    • Optimizer needs to decide locality

    • Need to exploit independent parallelism.

    • Need operators that reduce communication costs (semi-joins).


Ddbms vs data integration l.jpg

DDBMS vs. Data Integration

  • In a DDBMS, data is distributed over a set of uniform sites with precise rules.

  • In a data integration context:

    • Data sources may provide only limited access patterns to the data.

    • Data sources may have additional query capabilities.

    • Cost of answering queries at sources unknown.

    • Statistics about data unknown.

    • Transfer rates unpredictable.


A brief research history56 l.jpg

A Brief Research History

  • Many early ad-hoc solutions

  • Converged into two approaches

    • data warehousing vs. virtual DI systems

  • Semi-structured data, XML

  • Wrappers, info extraction

  • Other issues: query optimization, schema matching, ...

  • Current directions

    • DI for specialized domains (e.g., bioinformatics)

    • on-the-fly DI, entity-centric DI

    • simplify integration tasks

  • New types of data sharing systems

    • P2P systems, Semantic Web


Architecture of data integration systems57 l.jpg

wrapper

wrapper

wrapper

Architecture of Data Integration Systems

Find houses with 4 bedroomspriced under 300K

mediated schema

source schema 1

source schema 2

source schema 3

homes.com

realestate.com

houses.com

Provide a uniform query interface


  • Login