Cse 636 data integration
This presentation is the property of its rightful owner.
Sponsored Links
1 / 27

CSE 636 Data Integration PowerPoint PPT Presentation


  • 66 Views
  • Uploaded on
  • Presentation posted in: General

CSE 636 Data Integration. Schema Matching Cupid. Fall 2006. Virtual Integration Architecture. Wrapper. Wrapper. Design-Time. Run-Time. . Schema Matching. Query Reformulation. Query. Result. End User. Mediation Language. Optimization & Execution. Mediator. Global Schema.

Download Presentation

CSE 636 Data Integration

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Cse 636 data integration

CSE 636Data Integration

Schema Matching

Cupid

Fall 2006


Virtual integration architecture

Virtual Integration Architecture

Wrapper

Wrapper

Design-Time

Run-Time

Schema

Matching

Query

Reformulation

Query

Result

End User

Mediation

Language

Optimization

& Execution

Mediator

Global

Schema

Web Services

XML

Data

Source

Data

Source

Local

Schema

Local

Schema


Schema heterogeneity

Schema Heterogeneity

Independently created schemas…

… might be modeling similar information…

… in slightly different ways

DB1

DB2

DB3

ugrad *

course *

student *

ugradID

courseID

studentID

title

name

name

type

enrollment *

student *

courseID

studentID

course *

ugradID

name

courseID

grade

type

title ?

evaluation

type

letter


Schema heterogeneity1

Schema Heterogeneity

  • Similar entities represented

  • Dissimilar structures (inverted nesting)

  • Different element names for similar data values

  • Similar element names for different data values

DB1

DB2

DB3

ugrad *

course *

student *

ugradID

courseID

studentID

title

name

name

type

enrollment *

student *

courseID

studentID

course *

ugradID

name

courseID

grade

type

title ?

evaluation

type

letter


Schema matching vs schema mapping

Schema Matching vs. Schema Mapping

  • GAV and LAV are schema mapping languages

  • Mappings:

    • set of queries

    • associations + semantics

  • Match:

    • set of associations only

  • Schema Matching:

    • Identifying associations

    • First step towards constructing mappings


Schema matching vs schema mapping1

Schema Matching vs. Schema Mapping

DB1

DB3

ugrad *

student *

ugradID

studentID

name

name

type

enrollment *

courseID

course *

ugradID

courseID

grade

title ?

type

letter

Semantics

for $s1 in DB3/student

where $s1/type = ‘UGRAD’

return<DB1>

<ugrad>

<ugradID>{$s1/studentID}</ugradID>

<name>{$s1/name}</name>

</ugrad>

</DB1>

LAV Mapping: DB1  Q(DB3)

Associations


The problem of schema matching

The Problem of Schema Matching

Input

  • Schemas S1 and S2

  • Possibly data instances for S1 and S2

  • Background knowledge

    • thesauri

    • validated matches

    • standard schemas

    • reference instances

    • ontologies

    • constraints (keys, data types etc)

      Output

  • Associations between S1 and S2

    Goal

  • Schema matching tools with significant automated support


Schema matching

Schema Matching

How is the match result expressed?

  • Pairs of paths

  • Lists of paths

  • Schema names

DB2

DB3

course *

student *

courseID

studentID

title

name

type

student *

studentID

course *

name

courseID

type

title ?

evaluation

type

letter


Schema matching1

Schema Matching

What do we match?

  • Depends on the queries we want to ask

  • Elements in isolation (leaves in particular)

  • Substructures

  • Whole schemas


Motivation

Motivation

  • Important component in many applications

    • Data Integration

    • Data Migration

    • E-Commerce

  • Model Management[Bernstein, Halevy, Pottinger ’00]

    • Algebra for manipulating models and mappings

    • Match, Merge, Compose …


Problems

Problems

  • Minimize user involvement (semi-automatic)

  • Data model independent matching (generic)

  • Schema matching is a hard problem

    • Naming and structural differences in schemas

    • Similar, but non-identical concepts modeled

    • Multiple data models – SQL DDL, XML, ODMG…


Schema matching approaches

Schema Matching Approaches

Combined matchers

Individual matchers

Schema-based

Content-based

Hybrid

Composite

manual composition

automatic composition

Per-Element

Structural

Per-Element

Constraint-based

Constraint-based

Constraint-based

Linguistic

Linguistic

  • Names

  • Descriptions

  • Types

  • Keys

  • Graph matching

  • IR (word frequencies, key terms)

  • Value pattern and ranges

How to match?

Taxonomy based survey: Rahm and Bernstein, VLDB J, 2001


Cupid

Cupid

Individual matchers

Schema-based

Content-based

Per-Element

Structural

Per-Element

Constraint-based

Constraint-based

Constraint-based

Linguistic

Linguistic

  • Names

  • Descriptions

  • Types

  • Keys

  • Graph matching

  • IR (word frequencies, key terms)

  • Value pattern and ranges

Madhavan, Bernstein and Rahm, VLDB, 2001

Combined matchers

Hybrid

Composite

manual composition

automatic composition


Cupid example

Cupid Example

Address

City

Street

PO

PurchaseOrder

POLines

Items

POShipTo

DeliverTo

Item

Item

Name

Name

City

Street

Line

ItemNumber

UnitofMeasure

UoM

Qty

Quantity


Cupid architecture

Cupid Architecture

Linguistic

Matching

Schema 1

Schema 2

Thesaurus

Structure

Matching

Generate

Mapping

Output Mapping

LSIM

SSIM

WSIM


Linguistic matching

Linguistic Matching

  • Heuristic name matching

    • Tokenization of names

      POOrderNum  PO, Order, Num

    • Expansion of short-forms, acronyms

      PO  Purchase, Order; Num Number

    • Clustering of schema elements based on keywords and data-types

      Street, City, POAddress  Address

    • Thesaurus of synonyms, hypernyms, acronyms

    • Linguistic Similarity coefficient (LSIM)  [0,1]


Structure matching

Structure Matching

DeliverTo

POShipTo

PO

PurchaseOrder

POLines

Items

Item

Item

Name

Address

Name

City

Street

Line

ItemNumber

City

Street

UoM

UnitofMeasure

Qty

Quantity


Structure matching mutually reinforcing similarity

Structure MatchingMutually Reinforcing Similarity

SSIM++

SSIM++

SSIM++

PO

PurchaseOrder

WSIM > thhigh

POLines

Items

WSIM > thhigh

Item

Item

Line

ItemNum

UoM

UnitofMeasure

Qty

Quantity


Structure matching context dependent disambiguation

Structure MatchingContext Dependent Disambiguation

Address

SSIM--

City

Street

SSIM++

PO

PurchaseOrder

POBillTo

InvoiceTo

DeliverTo

POShipTo

Address

City

Street

City

Street

SSIM++

City

Street


Intuition

Intuition

  • Atomic elements are similar

    • Linguistically and data-type similar

    • Their ancestors are similar

  • Compound elements (non-leaf) are similar if

    • Linguistically similar

    • Subtrees rooted at the elements are similar

  • Mutually recursive

    • Leaves determine internal node similarity

    • Similarity of internal nodes leads to increase in leaf similarity


Structure match details

Structure Match Details

  • Subtrees are similar if

    • Immediate children are similar

    • Leaf sets are similar

  • Subtree Similarity (nodes s and t)

    • Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa

    • Less sensitive to variation in intermediate structure

  • Pruning the number of comparisons

    • Elements must have comparable number of leaves


Referential integrity

Referential Integrity

Order-Customer-fk

Customer-Purchase-Order

Order-Customer-fk

  • Join nodes added to the schema tree for each referential integrity constraint

  • Views can be similarly used

Customer

Purchase Order

Customer ID

Order ID

Address

Customer ID

Name

Product Name

Schema A

Schema B


Cupid architecture1

Cupid Architecture

Linguistic

Matching

Schema 1

Schema 2

Thesaurus

Structure

Matching

Generate

Mapping

Output Mapping

LSIM

SSIM

WSIM

Linguistic Similarity (LSIM)

Structural (SSIM), Weighted (WSIM) Similarity


Mapping generation

Mapping Generation

  • Individual mapping elements computed from WSIM values:

    • Consider only mapping pairs that have WSIM greater than threshold

    • For each element of target find most similar source element

    • Not accepted mappings with high similarity are returned in order to help user modify map


Cupid architecture2

Cupid Architecture

Linguistic

Matching

Schema 1

Schema 2

Thesaurus

Structure

Matching

Input hint

LSIM

SSIM

WSIM

Generate

Mapping

Output Mapping


Work needed

Work Needed

  • A more robust solution

    • Auto-tuning parameters

    • Thesaurus Generation and Evolution

  • Schema matching component architecture

    • Easily extensible by adding multiple techniques

    • Data Instances for matching

    • Look at COMA & ProtoPlasm systems


References

References

  • J. Madhavan, P. A. Bernstein, E. RahmGeneric Schema Matching with CupidVLDB, 2001

  • H. H. Do, E. Rahm:COMA - A System for Flexible Combination of Schema Matching ApproachesVLDB, 2002

  • P. A. Bernstein, S. Melnik, M. Petropoulos, C. QuixIndustrial-Strength Schema MatchingSIGMOD Record 33(4), 2004


  • Login