1 / 32

Principles of Database Management Systems 10: Information Integration

Principles of Database Management Systems 10: Information Integration. Pekka Kilpeläinen University of Kuopio. Information Integration. Large family of applications concerned with combining data from heterogeneous information sources Our goal: Get an overview of the

alia
Download Presentation

Principles of Database Management Systems 10: Information Integration

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Principles of Database Management Systems10: Information Integration Pekka Kilpeläinen University of Kuopio Notes 10: Information Integration

  2. Information Integration • Large family of applications concerned with combining data from heterogeneous information sources • Our goal: Get an overview of the • modes (or approaches, or architectures) • problems, and • some applications of information integration Notes 10: Information Integration

  3. Modes of information integration • Federated databases • collection of independent, co-operating sources • Warehousing • single DB containing physical copies of data from several sources • Mediation • system providing an integrated view to the sources (non-materialized, virtual database) Notes 10: Information Integration

  4. Example An integrated DB for dealer DBs of Aardvark Automobile Co, to support locating requested models and market analysis DB of Dealer1:Cars(serialNo,model, color, autoTrans,cdPlayer, …) ---------------------------------------------------------- | 123 | BMV 1 | blue| yes | yes | … | 456 | MB 50 | red | yes | no | … . . . Notes 10: Information Integration

  5. Example (cont.) DB of Dealer2:Autos(serial, model, color) ------------------------------ | 323 | MZ-6 | BL | | 555 | MB 50| Rd | . . . Options(serial, option) ---------------------------------- | 323 | manualTrans | | 323 | cdPlayer | . . . Notes 10: Information Integration

  6. Problems of information integration • Different names • Cars/Autos, serialNo/serial, … • renaming straightforward • Different data types • variable vs. fixed-length strings vs. integers vs. … • Conceptual differences • Are trucks included in ’Cars’? • Are station wagons classified as minivans? • Treatment of missing values • represent with NULLs or some default values? Notes 10: Information Integration

  7. DB1 DB2 DB3 DB4 Information integration architectures 1. Federated database • one-to-one connections between independent databases • Requires n(n-1) interfaces to fully connect n systems together Notes 10: Information Integration

  8. yes/no Example (of federation) Dealer1 wants to search requested cars also from the DB of Dealer2, for requests stored in relation NeededCars(model,color,autoTrans) • Needs to query relations Autos(serial, model, color) and Options(serial, option)of Dealer2 (with the following embedded SQL): Notes 10: Information Integration

  9. for (each tuple (:m, :c, :a) in NeededCars) { if (:a=’yes’) { // autoTrans wanted SELECT serial FROM Autos, Options WHERE Autos.serial = Options.serial AND option=’autoTrans’ AND model = :m AND color = :c; } else { // manual transmission wanted SELECT serial FROM Autos WHERE model = :m AND color = :c AND NOT EXISTS ( SELECT * FROM Options WHERE serial = Autos.serial AND option = ’autoTrans’ ); } } Notes 10: Information Integration

  10. Warehouse Combiner Extractor Extractor Extractor Source Source Source 2. Data Warehouse • Database formed by extracting and combining data from several sources • E.g., sales data for a supermarket chain • generally read-only Notes 10: Information Integration

  11. Example(Data warehouse of Aadvark Co.) Create a warehouse of Dealer1 and Dealer2 databases with combined schema CarWH(serNo,model,color,autoTrans,dealer) • Extract Dealer1’s data:INSERT INTO CarWH(serNo,model,color, autoTrans,dealer) SELECT serialNo,model,color, autoTrans,’dealer1’ FROM Cars; • Extract Dealer2’s data: Notes 10: Information Integration

  12. INSERT INTO CarWH(serNo,model,color, autoTrans,dealer) SELECT serial,model,color, ’yes’,’dealer2’ FROM Autos, Options WHERE Autos.serial = Options.serial AND option=’autoTrans’; INSERT INTO CarWH(serNo,model,color, autoTrans,dealer) SELECT serial,model,color, ’no’,’dealer2’ FROM Autos WHERE NOT EXISTS ( SELECT * FROM Options WHERE serial = Autos.serial AND option = ’autoTrans’ ); Notes 10: Information Integration

  13. Client Client Mediator Wrapper Wrapper Wrapper Source Source Source 3. Mediators • Support views combining data from several sources Notes 10: Information Integration

  14. Example(mediation of dealer sources) Integrate Dealer1 and Dealer2 databases as a virtual database with schema CarMed(serNo,model,color,autoTrans,dealer) • Suppose user of the mediator gives query Q:SELECT serNo, model FROM CarMed WHERE color=‘red’ ; • Translation by wrapper of Dealer1 easy:Q1:SELECT serialNo AS serNo, model FROM Cars WHERE color=‘red’ ; • Query Q2 on relation Autos of Dealer2 similarly • Return UNION of results to Q1 and Q2 Notes 10: Information Integration

  15. Extractors and wrappers • An extactor for a warehouse needs • one or more queries to get data from source • SQL, or other supported language • communication mechanisms for • passing queries to source • receiving responces from the source • passing data to the warehouse • A wrapper for a mediator more complex • translates arbitrary queries for the source Notes 10: Information Integration

  16. An aside: XML wrappers • What if the source does not provide a query language? • Need to translate a piece of extracted data (file/report/message/document/…) • developing and maintaining translation programs is tedious => a wrapper specification language XW (XML Wrapper; Univ. of Kuopio, 2001-2002) Notes 10: Information Integration

  17. XML in data exchange • XML an industry standard for textual representation of structured data (W3C Recommendation 1998) • XML-based protocols developed for e-business, medical messages, … • old message formats need to be converted to XML • this translation addressed by XW Notes 10: Information Integration

  18. AA x1 x2 BB y1y2 z1 z2 XW wrapping wrapper description result document source data <xw:wrapper … > … </xw:wrapper> <part-a> <e1>x1</e1> <e2>x2</e2> </part-a> <part-b> <line-1> <d1>y1</d1> <d2>y2</d2> </line-1> <d3>z2</d3> </part-b> XW-engine Notes 10: Information Integration

  19. Back to DB wrappers: Query templates • Translating queries of mediator to queries on the source possible through templates • with parameters (place-holders for constants of the original query) • Notation T => S: template T is translated to source query S • Example: Translating color-based selections on mediated view to queries on Dealer1 DB Notes 10: Information Integration

  20. Example(of query templates) SELECT * FROM CarMed WHERE color = ‘$c’ ; => SELECT serialNo AS serNo, model, color, autoTrans, ‘dealer1’ FROM Cars WHERE color = ‘$c’ ; • Similar templates for model and autoTrans • supporting all combinations of n attributes requires 2n templates! • Can do with fewer by applying filtering (see later) Notes 10: Information Integration

  21. Wrapper generators • Templates defining a wrapper translated into code of a wrapper • by a wrapper generator • How? • Two parts of a wrapper • table of query patterns specified by templates • a fixed driver Notes 10: Information Integration

  22. Source Wrapper-generator architecture Queries from mediator Templates Results Driver table Wrapper generator Wrapper queries results Notes 10: Information Integration

  23. Filtering • Often impossible to write templates for each possible form of mediator query • Possible to wrap more complex queries • if some template returns a superset of the original query (by fixing only a part of conditions), then • filter the result of the template by the additional query conditions Notes 10: Information Integration

  24. Example(of filtering by wrapper) How to evaluate query SELECT * FROM CarMed WHERE color=‘blue’ AND model=‘Gobi’ ;if we have only the below template? SELECT * FROM CarMed WHERE color = ‘$c’ ;=> SELECT serialNo AS serNo, model, color, autoTrans, ‘dealer1’ FROM Cars WHERE color = ‘$c’ ; Notes 10: Information Integration

  25. Example(of filtering cont.) 1. Apply template with $c = ‘blue’ and store the result:INSERT INTO Temp(serNo,model,color, autoTrans,dealer) SELECT serialNo AS serNo, model, color, autoTrans, ‘dealer1’ FROM Cars WHERE color = ‘blue’ ; 2. Filter out Gobis from relation Temp: SELECT * FROM Temp WHERE model = ‘Gobi’ ; Notes 10: Information Integration

  26. Warehouses and their applications • Warehouses • aimed at executive, decision maker, analyst • often a copy of operational data • Advantages • High query performance • Local processing at sources unaffected • Can operate when sources unavailable Notes 10: Information Integration

  27. Applications of warehouses (1) • OLAP (On-line Analytic Processing) • finding important patterns or trends from data stored at the warehouse • involves complex queries on large parts of the database • Contrast: OLTP (On Line Transaction Processing) • Describes processing at operational sites, not at the warehouse Notes 10: Information Integration

  28. Reads & updates Many small transactions MB-TB of data Raw, up-to-date data Clerical users Consistency, recoverability critical Mostly reads Queries long, complex GB-TB of data Summarized data, updated periodically Decision-makers, analysts as users OLTP vs. OLAP OLTP OLAP Notes 10: Information Integration

  29. Applications of warehouses (2) • Data mining • active area of R&D since 1990’s • analysis of large amounts of data for finding interesting regularities or exceptions • Why? • To get valuable information out of huge data collections of research commerce and industry Notes 10: Information Integration

  30. support: how many market-baskets contain bread, cheese and butter confidence: how many buyers of bread&cheese also bought butter? Example (data mining) • “Association rules” (IBM 1993) • way to analyse co-occurrence of values • E.g. the market-basket analysis: bread, cheese  butter (53%, 74%) Notes 10: Information Integration

  31. What’s new in Data Mining? • Statistics: • methods to test hypotheses on data • OLAP: • user-driven examination of data • Data mining • automatic generation of hypotheses among hundreds or thousands of explaining variables Notes 10: Information Integration

  32. Summary • We’ve had an overview of information integration • Different approaches • federation, warehousing, mediation • Applications of data warehouses • OLAP • data mining Notes 10: Information Integration

More Related