data integration the teenage years l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Data Integration: The Teenage Years PowerPoint Presentation
Download Presentation
Data Integration: The Teenage Years

Loading in 2 Seconds...

play fullscreen
1 / 54

Data Integration: The Teenage Years - PowerPoint PPT Presentation


  • 125 Views
  • Uploaded on

Data Integration: The Teenage Years. Alon Halevy (Google) Anand Rajaraman (Kosmix) Joann Ordille (Avaya) VLDB 2006. Agenda. A few perspectives on the last 10 years Technical, commercial Perspectives from our personal paths Wild speculations about the future

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Data Integration: The Teenage Years' - sequoia


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
data integration the teenage years

Data Integration: The Teenage Years

Alon Halevy (Google)

Anand Rajaraman (Kosmix)

Joann Ordille (Avaya)

VLDB 2006

agenda
Agenda
  • A few perspectives on the last 10 years
    • Technical, commercial
  • Perspectives from our personal paths
  • Wild speculations about the future
  • This is not a survey on data integration

(See the paper in the proceedings for another non-survey)

acknowledgements
Acknowledgements

Other members of the Information Manifold Project:

  • Jaewoo Kang (NCSU, Korea Univ.)
  • Divesh Srivastava (AT&T Labs)
  • Shuky Sagiv (Hebrew U.)
  • Tom Kirk
acknowledgements4
Acknowledgements

To the SIGMOD 1996 Program committee

For rejecting the earlier version of the paper.

timeline
Timeline

95

96

97

98

99

00

01

02

03

04

05

06

data integration

Enterprise Databases

Sequenceable Entity

Structured Vocabulary

Experiment

Phenotype

Gene

Nucleotide Sequence

Microarray Experiment

Protein

Legacy Databases

Services and Applications

Data Integration
the information manifold
The Information Manifold
  • Goal: integrate data from multiple sources on the web:

Find the Woody Allen movies playing in my area, and their reviews

  • Need to describe the data sources:
    • Contents, constraints, access patterns
slide8

wrapper

wrapper

wrapper

wrapper

wrapper

Design time

Run time

Mediated Schema

query

reformulation

Semantic mappings

optimization &

execution

semantic mappings a k a source descriptions
Semantic Mappings[a.k.a. Source Descriptions]

Mediated Schema

CD: ASIN, Title, Genre,…

Artist: ASIN, name, …

logic

CDs

Album

ASIN

Price

DiscountPrice

Studio

Books

Title

ISBN

Price

DiscountPrice

Edition

Authors

ISBN

FirstName

LastName

Artists

ASIN

ArtistName

GroupName

CDCategories

ASIN

Category

BookCategories

ISBN

Category

global as view gav

Source

Source

Source

Source

Source

Global-as-View (GAV)

Mapping:

CD(A,T,G) :- R1(A,T,G)

CD(A,T,G) :- R2(A,T), R3(T,G)

Mediated Schema

CD: ASIN, Title, Genre,…

Artist: ASIN, name, …

R1

R2

R3

R4

R5

local as view lav

Source

Source

Source

Source

Source

Local-as-View (LAV)

Mapping:

R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y< 1970

R2(A,T) :- CD(A,T,”French”,Y)

Mediated Schema

CD: ASIN, Title, Genre, Year

Artist: ASIN, Name, …

R1

R2

R3

R4

R5

query answering in lav answering queries using views
Query Answering in LAV =Answering queries using views

Given a set of views V1,…,Vn,

And a query Q,

Can we answer Q using only the answers to V1,…,Vn?

aquv i
AQUV (I)
  • [Larson et al., 85 & 87], [Tsatalos et al., 94], [Chaudhuri et al., 95],
  • Focus on AQUV for:
    • Query optimization
    • Supporting physical data independence
  • Every commercial DBMS supports AQUV.
aquv ii
AQUV (II)
  • AQUV for data integration:
    • Find maximally contained rewriting
    • Not necessarily equivalent rewriting
  • Algorithms:
    • Bucket algorithm [LRO, 96]
    • Inverse rules [Duschka, 97]
    • Minicon [Pottinger and Halevy, 2000]
  • Views and security: [Miklau and Suciu, 04]

Survey: Halevy, VLDB Journal, 2001

some subsequent results
Some Subsequent Results
  • Semantics of data integration:
    • Abiteboul & Duschka, 1998: certain answers
    • Open vs. closed world assumption
      • CWA is bad complexity news!

Survey: Lenzerini, PODS 2002

certain answers
Certain Answers

Mediated schema: Route (Origin, Destination)

Source 1: Origins

SF

NY

Source 2: Destinations

Seattle

Seoul

Query: Route (SF, Seattle)?

Possible databases:

some subsequent results17
Some Subsequent Results
  • Limitations due to binding patterns
    • Input title, get book info [Rajaraman et al., 95]
  • Additional query processing capabilities
    • Form applies multiple predicates
  • Disjunction, negation in sources.
  • Ordering sources, probabilistic mappings
    • [Florescu et al., 97, Doan et al., Dong et al.]
  • GLAV [Millstein et al., 99]

Survey: Lenzerini, PODS 2002

a word on description logics
A word on Description Logics
  • Selecting relevant sources = reasoning.
  • Description logics to the rescue:
    • [Catarci and Lenzerini, 93]
  • Information Manifold
    • Combined the Classic DL with Datalog (CARIN)
    • See AAAI-96 (not sigmod)
  • Brought DL and DB closer together.
    • A very active area of research today.
slide19

XML

95

96

97

98

99

00

01

02

03

04

05

06

xml and semi structured data
XML and Semi-structured Data
  • Tsimmis: semi-structured data for integration.
  • XML: whetted the integration appetites
    • We have the syntax
    • Now just solve the silly semantics problems
    • Don’t bother: we’ll all standardize on DTDs.
  • XML will have a significant role on the data integration industry and research.
slide21

XML

95

96

97

98

99

00

01

02

03

04

05

06

back in the lab
Back in the Lab…
  • Two observations:
    • Who’s going to write all these LAV/GAV formulas?
    • This was the bottleneck.
  • Once we have mappings, how can we execute queries?
    • Traditional plan-then-execute doesn’t work.
semantic mappings
Semantic Mappings

Books

Title

ISBN

Price

DiscountPrice

Edition

Authors

ISBN

FirstName

LastName

BooksAndMusic

Title

Author

Publisher

ItemID

ItemType

SuggestedPrice

Categories

Keywords

BookCategories

ISBN

Category

CDCategories

ASIN

Category

CDs

Album

ASIN

Price

DiscountPrice

Studio

Artists

ASIN

ArtistName

GroupName

Inventory

Database A

Inventory Database B

“Standards are great, but there are too many of them.”

techniques for schema mapping survey by rahm and bernstein vldbj 2001
Techniques for Schema Mapping[Survey by Rahm and Bernstein, VLDBJ 2001]
  • Compare schema elements based on:
    • Names (or n-grams)
    • Data types and instances
    • Text descriptions, integrity constraints
  • Combine multiple techniques:
    • [Momis, Cupid, LSD, Coma]
  • Create mappings from matches
    • [Clio @ IBM + Miller]
a machine learning approach doan et al 2001 acm distinguished dissertation 2003
A Machine Learning Approach[Doan et al., 2001, ACM Distinguished Dissertation 2003]

Mediated schema

  • Many mapping tasks are repetitive
  • Learn from previous experience:
    • Build a classifier for every element of the mediated schema.
    • Many kinds of cues  meta-strategy learning

Given matches

Predict new ones

matching real estate sources
Matching Real-Estate Sources

Mediated schema

address price agent-phone description

locationlisted-pricephonecomments

Learned hypotheses

If “phone” occurs in the name => agent-phone

Schema of realestate.com

location

Miami, FL

Boston, MA

...

listed-price

$250,000

$110,000

...

phone

(305) 729 0831

(617) 253 1429

...

comments

Fantastic house

Great location

...

realestate.com

If “fantastic” & “great” occur frequently in data values =>

description

homes.com

price

$550,000

$320,000

...

contact-phone

(278) 345 7215

(617) 335 2315

...

extra-info

Beautiful yard

Great beach

...

reference reconciliation to join or not to join
Reference ReconciliationTo Join or not to Join?
  • Many ways to refer to the same object in the world:
    • “IBM”, “International Business Machines”
    • Alon Levy, Alon Halevy
  • Automated methods are necessity
    • Can’t go through all the data manually
  • Very active area in ML, KDD, DB, UAI, …
query processing to plan or to execute
Query ProcessingTo Plan or to Execute?
  • In addition to distributed query processing issues:
    • Few statistics, if any.
    • Network behavior issues: latency, burstiness,…
    • Garlic @IBM
  • “Adaptive query processing”:
    • Stonebraker saw it coming in Ingres.
    • Revivals by Graefe (1993) and DeWitt (1998).
    • Query scrambling [Urhan & Franklin]
    • Eddies [Avnur & Hellerstein]
    • Convergent query processing [Ives et al.]
slide29

XML

95

96

97

98

99

00

01

02

03

04

05

06

commercialization
Commercialization
  • Late 90’s – anything goes.
  • Want money from VC’s?
    • Say “XML” 3 times loud and clear.
  • Academia at the forefront:
    • Nimble (UW), Cohera (Berkeley), Enosys (UCSD),…
  • Big companies took notice
    • Some faster than others
commercialization retrospective see panel of experts sigmod 05
Commercialization Retrospective[See Panel-of-Experts, SIGMOD 05]
  • Uphill battle vs. the warehousing folks
    • Virtual integration was more “pay-as-you-go”
  • Another battle with the EAI folks
    • Should really be a symbiosis there.
  • Go vertical or horizontal?
    • Obvious: go vertical if you can find the right one.
  • The technology worked
    • But it’s all in the timing…
after 30m
After $30M…

XML Query

XML

Relational

Data Warehouse/

Mart

Legacy

Flat File

Web Pages

Front-End

Lens Builder™

User

Applications

Lens™ File

InfoBrowser™

Software

Developers Kit

NIMBLE™ APIs

Management

Tools

Integration

Layer

Nimble Integration Engine™

Metadata

Server

Cache

Compiler

Executor

Security Tools

Common XML View

Integration

Builder

Concordance Developer

Data

Administrator

slide33

NASDAQ

XML

95

96

97

98

99

00

01

02

03

04

05

06

so back in the lab
So… Back in the Lab
  • Model management
  • Peer data management systems
  • Data exchange
model management bernstein et al
Model Management[Bernstein et al.]
  • Generic infrastructure for managing schemas and mappings:
    • Manipulate models and mappings as bulk objects
    • Operators to create & compose mappings, merge & diff models
    • Short operator scripts can solve schema integration, schema evolution, reverse engineering, etc.
  • First challenge: semantics of operators.
peer data management systems

Q3

Q1

Q4

Q5

Q6

Q

Q2

Peer Data Management Systems

UW (Wisconsin)

Stanford

Berkeley

LAV, GLAV

DBLP

CiteSeer

UW (Washington)

UW (Waterloo)

pdms related projects
PDMS-Related Projects
  • Piazza (Washington)
  • Hyperion (Toronto)
  • PeerDB (Singapore)
  • Local relational models (Trento, Toronto)
  • Active XML (INRIA)
  • Edutella (Hannover, Germany)
  • Semantic Gossiping (EPFL Lausanne)
  • Raccoon (UC Irvine)
  • Orchestra (U. Penn)
pdms challenges
PDMS Challenges
  • Semantics:
    • careful about cycles
  • Optimization:
    • Compose mappings
    • Prune paths

UW (Wisconsin)

Stanford

Berkeley

  • Manage networks:
    • Consistency
    • Quality
    • Caching

DBLP

UW (Washington)

UW (Waterloo)

CiteSeer

data exchange
Data Exchange
  • Key question: given an instance of S and a mapping, create an instance for T.
  • [Fagin, Kolaitis, Popa & Tan]

S

T

M

slide40

XML

95

96

97

98

99

00

01

02

03

04

05

06

slide41

XML

?

95

96

97

98

99

00

01

02

03

04

05

06

2006 status report the people angle
2006 Status Report[The People Angle]
  • Joann @ Avaya
    • Integrating communications into business processes
  • Anand @ Kosmix
    • Creating a new kind of search company
  • Alon @ Google
    • Working for Joann’s old boss
    • Deep web evangelist
    • Pondering data management for the masses
2006 status report enterprise angle
2006 Status Report[Enterprise Angle]
  • Enterprise Information Integration is established:
    • IBM, BEA, Oracle, MetaMatrix, Composite, Actuate, …
  • Impact on design tools:
    • IBM Rational Data Architect
    • ADO .NET v. 3
forrester says
Forrester Says…

"Enterprises are facing the growing challenges of using disparate sources of datamanaged by different applications, including problems with data integration, security, performance, availability and quality.... New technology is emerging that Forrester has coined "information fabric,"a term defined as a virtualized data layer that integrates heterogeneous data and content repositories in real time.... The potential benefits of this technology are so great that enterprises should develop a strategy to leverage information fabric technology as it becomes more widely available."

2006 status report web angle
2006 Status Report[Web Angle]
  • Vertical search engines: one domain
  • At scale: need even better source descriptions
    • deep web can be surfaced
  • Terminology: Data integration = mashups!
slide46
Wikipedia:

A mashup is a website or Web 2.0 application that uses content from more than one source to create a completely new service. This is akin to transclusion.

looking ahead
Looking Ahead
  • Data management: from the enterprise to the masses
  • Challenges:
    • Databases of everything
    • Need support for collaboration
    • Help people structure their data
    • Pay-as-you go data management
pay as you go data management
Pay-as-you-go Data Management

Dataspaces: Franklin, Halevy, Maier [see PODS 2006]

Benefit

Dataspaces

Data integration solutions

Investment (time, cost)

Artist: Mike Franklin

reusing human attention
Reusing Human Attention
  • Principle:
    • User action = statement of semantic relationship
    • Leverage actions to infer other semantic relationships
  • Examples
    • Providing a semantic mapping
      • Infer other mappings
    • Writing a query
      • Infer content of sources, relationships between sources
    • Creating a “digital workspace”
      • Infer “relatedness” of documents/sources
      • Infer co-reference between objects in the dataspace
    • Annotating, cutting & pasting, browsing among docs
conclusion
Conclusion
  • We’ve done extremely well as a community!
  • Next challenge: data management and integration tools for the masses