data integration l.
Skip this Video
Loading SlideShow in 5 Seconds..
Data integration PowerPoint Presentation
Download Presentation
Data integration

Loading in 2 Seconds...

play fullscreen
1 / 7

Data integration - PowerPoint PPT Presentation

  • Uploaded on

Data integration. Chitta Baral Arizona State University. Example 1. Data Source 1 List of course#s with the title `Database Systems’ taught anywhere, their instructors, and university names. View : R1(prof, course#, university) Data Source 2

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Data integration' - dewitt

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
data integration

Data integration

Chitta Baral

Arizona State University

example 1
Example 1
  • Data Source 1
    • List of course#s with the title `Database Systems’ taught anywhere, their instructors, and university names.
  • View: R1(prof, course#, university)
  • Data Source 2
    • List of Ph.D level courses taught at ASU, professors name and course#
  • View: R2(title,prof,course#)
  • Query: List the course#s of courses taught at ASU, and the professor names who teach the course.
  • Partial answer: obtained by using the following query

SELECT course#, prof

FROM R1 WHERE university = ASU


SELECT course#, prof


general question
General question
  • Given: Several sources and a query
  • Problem: How do we best answer this query using the several sources that are available?
  • First Step: Need to model the sources; Need to have a global picture.
  • Two basic approaches: Global as view (GaV) and Local as view (LaV).
  • LaV for the last example.
    • Global schema: Teaches(prof,course#,title,semester,university)
    • Create view R1 as

SELECT prof, course#, university FROM Teaches

WHERE title = `Database systems’ .

    • Create view R2 as

SELECT title, prof, course# FROMTeaches

WHERE univesrity = ASU and course# >= 500

    • Now given a query (in English), we need to express it in terms of the global schema, and then reformulate it (to the extent possible) in terms of the sources (R1 and R2 here).
      • To do that use the relation between R1 (and R2) and the global schema.
a global as view gav example
A Global as View (GaV) example
  • 3 movie sources
    • S1(title,dir,year,genre) from until 1980.
    • S2(title,dir) since 1970
    • S3(title, year, genre) all movies
  • A global view: S1 union (S2 join S3)
    • SELECT * FROM S1


SELECT S2.title, S2.dir, S3.year, S3.genre

FROM S2, S3 WHERE S2.title = S3.title

  • Another global view: union of S1, (S2 join S3) and 4-tuples made up of tuples in S2 and S3 (where the title does not appear in the other) with added null values.
    • If we have S2(xyz, uvw) and xyz is not a title that appears in S3 then we assume (xyz,uvw,null,null) is part of the global view.
lav vs gav
LaV vs GaV
  • Given a query reformulating it in terms of the sources
    • Is easier in GAV (just needs unfolding of the query)
    • Is harder in LaV
  • Adding a new source
    • Supposedly easier in LaV (just need to express the new source as a view of the global schema)
    • Harder in GaV (as the global schema needs to be revised)
steps for projects of type 2
Steps for Projects of type 2
  • Given: Some NIH/NCBI/Others data sources
  • Goal: Virtual integration of these sources
  • First Step: Explore each data source to figure out the `view’ of each source.
  • Second Step: Come up with a global schema (don’t think too much about the sources; or keep the global schema general enough that if GaV is used then adding new sources does not change the global schema)
  • Third Step (GaV based approach):
    • Define the global schema in terms of the source views.
    • Now any global query can be unfolded to a query (can be done in real time) in terms of the source views.
  • Alternative third step (LaV approach):
    • Define each source in terms of the global schema.
    • Now any global query needs to be `reformulated’ in terms of the source views.
    • Several `reformulation’ techniques are available.
    • For LaV approach a particular set of queries can be considered a priori and their reformulation could be made before hand (rather than in real time)
    • Havasu and Biohavasu follow this approach.
hw 2 review sample answers
HW 2 Review: Sample answers
  • Correct understanding about the queries
      • List all human genes, name of their discoverer and the project through which it was discovered
        • In this I tried to find out the human genes that are completely discovered.
        • Querying the genome database for human genome did this.
        • But the links field did not give links to the Pubmed articles relevant to that specific gene.
        • Had to type in another query with the specific gene information for Pubmed.
  • Misunderstanding about what queries are
    • How are the DNA probes identified in a DNA chip?
    • What are the four nucleotide bases in a DNA ?