a metadata integration assistant generator for heterogeneous databases
Download
Skip this Video
Download Presentation
A Metadata Integration Assistant Generator for Heterogeneous Databases

Loading in 2 Seconds...

play fullscreen
1 / 34

A Metadata Integration Assistant Generator for Heterogeneous Databases - PowerPoint PPT Presentation


  • 111 Views
  • Uploaded on

A Metadata Integration Assistant Generator for Heterogeneous Databases. Young-Kwang Nam Joseph Goguen Guilian Wang. Data Integration in Synthetic Scientific Applications. Applications. Integrated result without inconsistency, etc. Query. global unified schema/ontology.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' A Metadata Integration Assistant Generator for Heterogeneous Databases ' - adele


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
a metadata integration assistant generator for heterogeneous databases

A Metadata Integration Assistant Generator for Heterogeneous Databases

Young-Kwang Nam

Joseph Goguen

Guilian Wang

data integration in synthetic scientific applications
Data Integration in Synthetic Scientific Applications

Applications

Integrated result without inconsistency, etc.

Query

global unified

schema/ontology

Integration System

local

schema/ontology

local

schema/ontology

local

schema/ontology

data

source 1

data

source 2

data

source n

why difficult data heterogeneity
Platform & System Heterogeneity

OS, Hardware

DBMSs, Concurrency control and recovery capabilities

Syntactic & Structural Heterogeneity

Machine readable aspects of representation

Data models, Schemas,

Semantic Heterogeneity

Naming conflicts: synonyms, homonyms

Scaling & precision conflicts

Sampling rates, error distribution, etc.

Why Difficult: Data Heterogeneity
more difficult flexible integration
No all-encompassing system satisfies everyone:

frequent update of sources

frequent change of user requirements

non-published data from one’s own lab

Simplicity and readability are more desirable than completeness or exhaustiveness to domain scientists

Domain knowledge is crucial for

solving heterogeneities

query optimization

Desirable to support domain scientists to do data integration on their own

More Difficult: Flexible Integration
a common data integration architecture
A Common Data Integration Architecture

Query

Result

An Integrated View Materialized or Virtual

Mediator

Wrapper

Wrapper

Wrapper

data

source 1

data

source 2

data

source n

structural vs semantic wrt mediation level
Structural approach (Mediated schema approach)

integration by generating mediated schema that characterize a set of data sources

Semantic approach (Ontology-based approach)

difficult to integrate structural aspects of sources from semantic perspective due to inherent embedded semantics within local schemas & implicit assumptions

integration by sharing a common ontology among the differentdata sources

Structural vs. Semanticwrt Mediation Level
global as view vs local as view wrt mapping direction
Global-as-view approach

each item in Global schema/ontology as a view (query) over source schemas/ontologies

query(G) = query(f(S1, S2, …, Sn))

straightforward query rewriting

Local-as-view approach

Each source as a view/query over global schema/ontology

query(G) = query(f1-1 (S1), f2-1(S2), …, fn-1 (Sn))

easy adding or removing sources

Global-as-view vs. Local-as-viewwrt Mapping Direction
representative systems
TSIMMIS (Stanford & IBM, 1995)

MedMaker (Stanford, 1996)

MIX (SDSC&UCSD, 2000)

IM (AT&T, 1996)

Clio+Garlic (IBM, 2000)

DIXSE (UT, 2001)

XYLEME (2001)

Representative Systems
  • HERMES (UMD, 1994)
  • SIMS (USC, 1996)
  • Observer (UG, 1996)
  • Infosleuth (MCC, 1997)
  • COIN (MIT, 1999)
  • Ontobroker (Ger., 2000)
  • KIND (SDSC&UCSD, 2001)
our approach
Virtual Integration: retrieve data and resolve conflicts at query time, easy maintenance

Structural Approach: take users’ knowledge on data semantics hidden in structural information as input to achieve semantic mediation

Local-as-view: easily adds or removes sources, convenient to fit applications

GUI for specifying semantic mappings through assigning same index to same meaning nodes (paths)

Automatically generate DDXMI for query decomposition

Semantic functions

Our Approach
current prototype architecture

User query (XML query)

Column or Path

DDXMI

query

Generator/

collector

Column or Path

For each DB

query2

queryn

resultn

query1

result2

result1

XML/DB

engine2

XML/DB

enginen

XML/DB

engine1

XML/DB1

XML/DB2

XML/DBn

Current Prototype Architecture
distributed database xml metadata interface ddxmi
Include Database or XML document name or location information

Contain table columns or XML path information

Function or operation name for resolving semantic issues about table columns or XML elements and attributes

Distributed Database XML Metadata Interface (DDXMI)
ddxmi dtd
DDXMI DTD

<!ELEMENT DDXMIA (DDXMI.header, DDXMI.isequivalent, documentspec)>

<!ELEMENT DDXMI.header (documentation,version,date,authorization)>

<!ELEMENT documentation (#PCDATA)>

<!ELEMENT version (#PCDATA)>

<!ELEMENT date (#PCDATA)>

<!ELEMENT authorization (#PCDATA)>

<!ELEMENT DDXMI.isequivalent (source,destination*)*>

<!ELEMENT source (#PCDATA)>

<!ELEMENT destination (#PCDATA)>

<!ELEMENT documentspec (document, (elementname,operation*)*)>

<!ELEMENT document (#PCDATA)>

<!ELEMENT elementname (#PCDATA)>

<!ELEMENT operation (#PCDATA)>

how to generate ddxmi
Define a Master DTD (global schema) based on application requirements for choosing elements or tables from the distributed systems

Parse the master DTD and generate a path for each element from root to current element

Assign the master index number to the site element node which has the same meaning of the master DTD node

May include a function name for some nodes

Generate DDXMI file automatically by collecting over same index numbers

How to generate DDXMI
site1 book1 dtd tree
Site1 : Book1 DTD Tree

Index number

function

name

book1 path information

0

book.xml

1

/book

11

/book/price

12

/book/author

121

/book/author/full_name

1211

/book/author/full_name/first_name

1212

/book/author/full_name/last_name

13

/book/title

14

/book/year

15

/book/publisher

16

/book/editor

161

/book/editor/affiliation

162

/book/editor/full_name

Book1 Path Information

Site1 Index

Master Index

  • 0 book1.xml
  • 1 /bib/book
  • 11 /bib/book/price
  • /bib/book/author
  • 1211 /bib/book/author/first
  • 1212 /bib/book/author/last
  • /bib/book/title
  • 15 /bib/book/publisher
  • 16 /bib/book/editor
  • 161 /bib/book/editor/affiliation
  • 162 /bib/book/editor/last
  • 162 /bib/book/editor/first
book2 path information

0

book.xml

1

/book

11

/book/price

12

/book/author

121

/book/author/full_name

1211

/book/author/full_name/first_name

1212

/book/author/full_name/last_name

13

/book/title

14

/book/year

15

/book/publisher

16

/book/editor

161

/book/editor/affiliation

162

/book/editor/full_name

Book2 Path Information

Site2 Index

Master Index

0 book2.xml

1 /arts/book

12 /arts/book/author

1211 /arts/book/author/firstname

1212 /arts/book/author/lastname

13 /arts/book/title

15 /arts/book/publisher

book3 path information

0

book.xml

1

/book

11

/book/price

12

/book/author

121

/book/author/full_name

1211

/book/author/full_name/first_name

1212

/book/author/full_name/last_name

13

/book/title

14

/book/year

15

/book/publisher

16

/book/editor

161

/book/editor/affiliation

162

/book/editor/full_name

Book3 Path Information

Site3 Index

Master Index

0 book3.xml

  • /bookstore/book

11 /bookstore/book/price

12 /bookstore/book/author

1211 /bookstore/book/author/name

1212 /bookstore/book/author/name

13 /bookstore/book/title

xml query languages
XQL : takes a document point of view

XML-QL : takes a database point of view

Quilt : draws from both areas

proposed by Don Chamberlin, Jonathan Robie, and Daniela Florescu

Kweelt (University of Washington), a XML query engine based on Quilt, used in our prototype

XQuery proposal follows Quilt closely

XML Query Languages
how to generate site queries
Parse the master query, a query over the global schema

If encounter a path, depending on its kind, get corresponding path name from DDXMI file and substitute it

If there is no corresponding path in the DDXMI, then put it as a null value

 no queries generated for that site

How to generate site queries
how to get site element names

Master index

Site Index

bookstore

book

book

editor

year

author

price

title

publisher

price_info

affiliation

full_name

price

full_name

last_name

first_name

DDXMI

How to get site element names

[In Quilt Query]

1.book

 bookstore/book

2. price

 bookstore/book/price_info/price

 price_info/price

cut!!

<source>book</source>

<destination>booksore/book</destination>

<source>book/price</source>

<destination>bookstore/book/price_info/price<destination>

1 1 mapping example
1:1 Mapping Example

FOR $book IN document("book.xml")//book

[publisher = "Addison-Wesley"]

RETURN <book>$book/title</book>

Book1

Master index

bib

Book2

book

arts

book

book

price

editor

publisher

title

year

author

title

publisher

publisher

title

Book3

full_name

affiliation

full_name

bookstore

book

last_name

first_name

title

1 n mapping example

Book1

Master index

bib

Book2

arts

book

book

book

price

editor

editor

year

author

title

publisher

first

last

DDXMI

Book3

bookstore

full_name

affiliation

full_name

<source>/book/editor/full_name</source>

<destination>/bib/book/editor/last,/bib/book/editor/first</destination>

book

last_name

first_name

1:N Mapping Example

FOR $edi IN document("book.xml")//book/editor

RETURN <editor>$edi/full_name</editor>

n 1 mapping example

Book1

Book2

bib

arts

Master index

book

book

book

author

author

price

editor

year

author

title

publisher

first

Book3

last

firstname

lastname

bookstore

full_name

full_name

book

affiliation

author

first_name

last_name

name

N:1 Mapping Example

FOR $a IN document("book.xml")//book//author

RETURN <author> $a/last_name,$a/first_name </author>

<operation>fstring</operation>

<operation>lstring</operation>

query generation result
Query Generation Result

import split as UDF_split;

FUNCTION fstring($str)

{ split(" ",$str)[1]}

FUNCTION lstring($str)

{ split(" ",$str)[2]}

FOR $a IN document("book3.xml")

//book//author

RETURN

<author>

fstring($a/name),lstring($a/name)

</author>

semantic function involved example

Book1

Master index

bib

Book2

arts

book

book

book

price

editor

price

year

author

title

publisher

Book3

bookstore

full_name

affiliation

full_name

book

last_name

first_name

price

Semantic Function Involved Example

FOR $book IN document("book.xml")//book

RETURN <book>

$book/title,$book/author,$book/price

</book>

<operation>div(100)</operation>

remaining issues
Handle attributes: one DTD has an attribute but others don’t, or an attribute in one DTD as an element in others

More efficient way for generating DDXMI file automatically when there are many paths in the master DTD

e.g., tree:tree mapping: if two paths are indicated as the same and have the same children, then the index numbers should be generated automatically

Migrate to XML schemas, instead of DTDs

Support JOIN, PRODUCT generated by queries

Move to XQuery and a query engine with distributed query support

Integrate the individual site query results as one return as a single data source ready for further analysis

Provide mechanisms for removing redundancy

Justify the semantics of the query generated

Remaining Issues
conclusion
Our prototype uses distributed metadata to generate a GUI tool to describe mappings between master and local databases by assigning index numbers and specifying conversion function names

Uses Quilt as its XML query language.

A DDXMI file is generated based on the mappings, and is used to translate queries over the virtual master database into sub-queries to local databases

An experiment testing feasibility is reported in which 3 different bibliography databases are integrated.

Implemented with Java Webserver and JavaCC

Move to real applications, e.g. in the context of NSF project SEEK (Science Environment for Ecological Knowledge)

Conclusion
ad