A metadata integration assistant generator for heterogeneous databases
This presentation is the property of its rightful owner.
Sponsored Links
1 / 34

A Metadata Integration Assistant Generator for Heterogeneous Databases PowerPoint PPT Presentation


  • 83 Views
  • Uploaded on
  • Presentation posted in: General

A Metadata Integration Assistant Generator for Heterogeneous Databases. Young-Kwang Nam Joseph Goguen Guilian Wang. Data Integration in Synthetic Scientific Applications. Applications. Integrated result without inconsistency, etc. Query. global unified schema/ontology.

Download Presentation

A Metadata Integration Assistant Generator for Heterogeneous Databases

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


A metadata integration assistant generator for heterogeneous databases

A Metadata Integration Assistant Generator for Heterogeneous Databases

Young-Kwang Nam

Joseph Goguen

Guilian Wang


Data integration in synthetic scientific applications

Data Integration in Synthetic Scientific Applications

Applications

Integrated result without inconsistency, etc.

Query

global unified

schema/ontology

Integration System

local

schema/ontology

local

schema/ontology

local

schema/ontology

data

source 1

data

source 2

data

source n


Why difficult data heterogeneity

Platform & System Heterogeneity

OS, Hardware

DBMSs, Concurrency control and recovery capabilities

Syntactic & Structural Heterogeneity

Machine readable aspects of representation

Data models, Schemas,

Semantic Heterogeneity

Naming conflicts: synonyms, homonyms

Scaling & precision conflicts

Sampling rates, error distribution, etc.

Why Difficult: Data Heterogeneity


More difficult flexible integration

No all-encompassing system satisfies everyone:

frequent update of sources

frequent change of user requirements

non-published data from one’s own lab

Simplicity and readability are more desirable than completeness or exhaustiveness to domain scientists

Domain knowledge is crucial for

solving heterogeneities

query optimization

Desirable to support domain scientists to do data integration on their own

More Difficult: Flexible Integration


A common data integration architecture

A Common Data Integration Architecture

Query

Result

An Integrated View Materialized or Virtual

Mediator

Wrapper

Wrapper

Wrapper

data

source 1

data

source 2

data

source n


Structural vs semantic wrt mediation level

Structural approach (Mediated schema approach)

integration by generating mediated schema that characterize a set of data sources

Semantic approach (Ontology-based approach)

difficult to integrate structural aspects of sources from semantic perspective due to inherent embedded semantics within local schemas & implicit assumptions

integration by sharing a common ontology among the differentdata sources

Structural vs. Semanticwrt Mediation Level


Global as view vs local as view wrt mapping direction

Global-as-view approach

each item in Global schema/ontology as a view (query) over source schemas/ontologies

query(G) = query(f(S1, S2, …, Sn))

straightforward query rewriting

Local-as-view approach

Each source as a view/query over global schema/ontology

query(G) = query(f1-1 (S1), f2-1(S2), …, fn-1 (Sn))

easy adding or removing sources

Global-as-view vs. Local-as-viewwrt Mapping Direction


Representative systems

TSIMMIS (Stanford & IBM, 1995)

MedMaker (Stanford, 1996)

MIX (SDSC&UCSD, 2000)

IM (AT&T, 1996)

Clio+Garlic (IBM, 2000)

DIXSE (UT, 2001)

XYLEME (2001)

Representative Systems

  • HERMES (UMD, 1994)

  • SIMS (USC, 1996)

  • Observer (UG, 1996)

  • Infosleuth (MCC, 1997)

  • COIN (MIT, 1999)

  • Ontobroker (Ger., 2000)

  • KIND (SDSC&UCSD, 2001)


Our approach

Virtual Integration: retrieve data and resolve conflicts at query time, easy maintenance

Structural Approach: take users’ knowledge on data semantics hidden in structural information as input to achieve semantic mediation

Local-as-view: easily adds or removes sources, convenient to fit applications

GUI for specifying semantic mappings through assigning same index to same meaning nodes (paths)

Automatically generate DDXMI for query decomposition

Semantic functions

Our Approach


Current prototype architecture

User query (XML query)

Column or Path

DDXMI

query

Generator/

collector

Column or Path

For each DB

query2

queryn

resultn

query1

result2

result1

XML/DB

engine2

XML/DB

enginen

XML/DB

engine1

XML/DB1

XML/DB2

XML/DBn

Current Prototype Architecture


Distributed database xml metadata interface ddxmi

Include Database or XML document name or location information

Contain table columns or XML path information

Function or operation name for resolving semantic issues about table columns or XML elements and attributes

Distributed Database XML Metadata Interface (DDXMI)


Ddxmi dtd

DDXMI DTD

<!ELEMENT DDXMIA (DDXMI.header, DDXMI.isequivalent, documentspec)>

<!ELEMENT DDXMI.header (documentation,version,date,authorization)>

<!ELEMENT documentation (#PCDATA)>

<!ELEMENT version (#PCDATA)>

<!ELEMENT date (#PCDATA)>

<!ELEMENT authorization (#PCDATA)>

<!ELEMENT DDXMI.isequivalent (source,destination*)*>

<!ELEMENT source (#PCDATA)>

<!ELEMENT destination (#PCDATA)>

<!ELEMENT documentspec (document, (elementname,operation*)*)>

<!ELEMENT document (#PCDATA)>

<!ELEMENT elementname (#PCDATA)>

<!ELEMENT operation (#PCDATA)>


How to generate ddxmi

Define a Master DTD (global schema) based on application requirements for choosing elements or tables from the distributed systems

Parse the master DTD and generate a path for each element from root to current element

Assign the master index number to the site element node which has the same meaning of the master DTD node

May include a function name for some nodes

Generate DDXMI file automatically by collecting over same index numbers

How to generate DDXMI


Generate master index

Generate Master Index


Site1 book1 dtd tree

Site1 : Book1 DTD Tree

Index number

function

name


Book1 path information

0

book.xml

1

/book

11

/book/price

12

/book/author

121

/book/author/full_name

1211

/book/author/full_name/first_name

1212

/book/author/full_name/last_name

13

/book/title

14

/book/year

15

/book/publisher

16

/book/editor

161

/book/editor/affiliation

162

/book/editor/full_name

Book1 Path Information

Site1 Index

Master Index

  • 0 book1.xml

  • 1 /bib/book

  • 11 /bib/book/price

  • /bib/book/author

  • 1211 /bib/book/author/first

  • 1212 /bib/book/author/last

  • /bib/book/title

  • 15 /bib/book/publisher

  • 16 /bib/book/editor

  • 161 /bib/book/editor/affiliation

  • 162 /bib/book/editor/last

  • 162 /bib/book/editor/first


Site 2 book2 dtd tree

Site 2 : Book2 DTD Tree


Book2 path information

0

book.xml

1

/book

11

/book/price

12

/book/author

121

/book/author/full_name

1211

/book/author/full_name/first_name

1212

/book/author/full_name/last_name

13

/book/title

14

/book/year

15

/book/publisher

16

/book/editor

161

/book/editor/affiliation

162

/book/editor/full_name

Book2 Path Information

Site2 Index

Master Index

0book2.xml

1 /arts/book

12 /arts/book/author

1211 /arts/book/author/firstname

1212 /arts/book/author/lastname

13 /arts/book/title

15 /arts/book/publisher


Site 3 book3 dtd tree

Site 3 : Book3 DTD Tree


Book3 path information

0

book.xml

1

/book

11

/book/price

12

/book/author

121

/book/author/full_name

1211

/book/author/full_name/first_name

1212

/book/author/full_name/last_name

13

/book/title

14

/book/year

15

/book/publisher

16

/book/editor

161

/book/editor/affiliation

162

/book/editor/full_name

Book3 Path Information

Site3 Index

Master Index

0book3.xml

  • /bookstore/book

    11 /bookstore/book/price

    12 /bookstore/book/author

    1211 /bookstore/book/author/name

    1212 /bookstore/book/author/name

    13 /bookstore/book/title


Xml query languages

XQL : takes a document point of view

XML-QL : takes a database point of view

Quilt : draws from both areas

proposed by Don Chamberlin, Jonathan Robie, and Daniela Florescu

Kweelt (University of Washington), a XML query engine based on Quilt, used in our prototype

XQuery proposal follows Quilt closely

XML Query Languages


How to generate site queries

Parse the master query, a query over the global schema

If encounter a path, depending on its kind, get corresponding path name from DDXMI file and substitute it

If there is no corresponding path in the DDXMI, then put it as a null value

 no queries generated for that site

How to generate site queries


How to get site element names

Master index

Site Index

bookstore

book

book

editor

year

author

price

title

publisher

price_info

affiliation

full_name

price

full_name

last_name

first_name

DDXMI

How to get site element names

[In Quilt Query]

1.book

 bookstore/book

2. price

 bookstore/book/price_info/price

 price_info/price

cut!!

<source>book</source>

<destination>booksore/book</destination>

<source>book/price</source>

<destination>bookstore/book/price_info/price<destination>


1 1 mapping example

1:1 Mapping Example

FOR $book IN document("book.xml")//book

[publisher = "Addison-Wesley"]

RETURN <book>$book/title</book>

Book1

Master index

bib

Book2

book

arts

book

book

price

editor

publisher

title

year

author

title

publisher

publisher

title

Book3

full_name

affiliation

full_name

bookstore

book

last_name

first_name

title


Query execution result

Query Execution Result


1 n mapping example

Book1

Master index

bib

Book2

arts

book

book

book

price

editor

editor

year

author

title

publisher

first

last

DDXMI

Book3

bookstore

full_name

affiliation

full_name

<source>/book/editor/full_name</source>

<destination>/bib/book/editor/last,/bib/book/editor/first</destination>

book

last_name

first_name

1:N Mapping Example

FOR $edi IN document("book.xml")//book/editor

RETURN <editor>$edi/full_name</editor>


Query execution result1

Query Execution Result


N 1 mapping example

Book1

Book2

bib

arts

Master index

book

book

book

author

author

price

editor

year

author

title

publisher

first

Book3

last

firstname

lastname

bookstore

full_name

full_name

book

affiliation

author

first_name

last_name

name

N:1 Mapping Example

FOR $a IN document("book.xml")//book//author

RETURN <author> $a/last_name,$a/first_name </author>

<operation>fstring</operation>

<operation>lstring</operation>


Query generation result

Query Generation Result

import split as UDF_split;

FUNCTION fstring($str)

{ split(" ",$str)[1]}

FUNCTION lstring($str)

{ split(" ",$str)[2]}

FOR $a IN document("book3.xml")

//book//author

RETURN

<author>

fstring($a/name),lstring($a/name)

</author>


Query execution result2

Query Execution Result


Semantic function involved example

Book1

Master index

bib

Book2

arts

book

book

book

price

editor

price

year

author

title

publisher

Book3

bookstore

full_name

affiliation

full_name

book

last_name

first_name

price

Semantic Function Involved Example

FOR $book IN document("book.xml")//book

RETURN <book>

$book/title,$book/author,$book/price

</book>

<operation>div(100)</operation>


Query execution result3

Query Execution Result


Remaining issues

Handle attributes: one DTD has an attribute but others don’t, or an attribute in one DTD as an element in others

More efficient way for generating DDXMI file automatically when there are many paths in the master DTD

e.g., tree:tree mapping: if two paths are indicated as the same and have the same children, then the index numbers should be generated automatically

Migrate to XML schemas, instead of DTDs

Support JOIN, PRODUCT generated by queries

Move to XQuery and a query engine with distributed query support

Integrate the individual site query results as one return as a single data source ready for further analysis

Provide mechanisms for removing redundancy

Justify the semantics of the query generated

Remaining Issues


Conclusion

Our prototype uses distributed metadata to generate a GUI tool to describe mappings between master and local databases by assigning index numbers and specifying conversion function names

Uses Quilt as its XML query language.

A DDXMI file is generated based on the mappings, and is used to translate queries over the virtual master database into sub-queries to local databases

An experiment testing feasibility is reported in which 3 different bibliography databases are integrated.

Implemented with Java Webserver and JavaCC

Move to real applications, e.g. in the context of NSF project SEEK (Science Environment for Ecological Knowledge)

Conclusion


  • Login