1 / 22

Data Warehousing Issues

Data Warehousing Issues. Lecture Notes. Outline of Lecture. Extraction Wrappers, monitors (change detectors) Integration Cleansing & merging Maintenance Warehousing Specification Optimizations. Warehouse Architecture. Client. Client. Query & Analysis. Loading. Evolution. Warehouse.

ilar
Download Presentation

Data Warehousing Issues

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Warehousing Issues Lecture Notes

  2. Outline of Lecture • Extraction • Wrappers, monitors (change detectors) • Integration • Cleansing & merging • Maintenance • Warehousing Specification • Optimizations

  3. Warehouse Architecture Client Client Query & Analysis Loading Evolution Warehouse Optimization Maintenance Integrator Metadata Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor ...

  4. Data Extraction • Source types • Relational, flat file, WWW, etc. • How to get data out? • Replication tool • Dump file • Create report • ODBC or third-party “wrappers”

  5. Wrapper Data Model B Wrapper Source • Converts data and queries from one data model to another Queries Data Model A Data • Extends query capabilities for sources with limited capabilities Queries

  6. Wrapper Generation • Solution 1: Hard code for each source • Solution 2: Automatic wrapper generation Wrapper Generator Definition Wrapper

  7. Data Transformations • Convert data to uniform format • Byte ordering, string termination • Internal layout • Remove, add & reorder attributes • Add key • Add data to get history • Sort tuples

  8. Monitors • Goal: Detect changes of interest and propagate to integrator • How? • Triggers • Replication server • Log sniffer • Compare query results • Compare snapshots/dumps • Different classes of information sources yield different problems

  9. Data Integration • Receive data (changes) from multiple wrappers/monitors and integrate into warehouse • Rule-based • Actions • Resolve inconsistencies • Eliminate duplicates • Integrate into warehouse (may not be empty) • Summarize data • Fetch more data from sources (wh updates) • etc.

  10. Data Cleansing • Find (& remove) duplicate tuples • e.g., Jane Doe vs. Jane Q. Doe • Detect inconsistent, wrong data • Attribute values that don’t match • Patch missing, unreadable data • Insert default values • Notify sources of errors found

  11. Warehouse Maintenance • Warehouse data  materialized view • Initial loading • View maintenance • View maintenance

  12. Differs from Conventional View Maintenance... • Warehouses may be highly aggregated and summarized • Warehouse views may be over history of base data • Process large batch updates • Schema may evolve

  13. Differs from Conventional View Maintenance... • Base data doesn’t participate in view maintenance • Simply reports changes • Loosely coupled • Absence of locking, global transactions • May not be queriable

  14. Warehouse Maintenance Anomalies • Materialized view maintenance in loosely coupled, non-transactional environment • Simple example Data Warehouse Sold (item,clerk,age) Sold = Sale Emp Integrator Sales Comp. Sale(item,clerk) Emp(clerk,age)

  15. Warehouse Maintenance Anomalies Data Warehouse Sold (item,clerk,age) Integrator Sales Comp. Sale(item,clerk) Emp(clerk,age) 1. Insert into Emp(Mary,25), notify integrator 2. Insert into Sale (Computer,Mary), notify integrator 3. (1)  integrator adds Sale (Mary,25) 4. (2)  integrator adds (Computer,Mary) Emp 5. View incorrect (duplicate tuple)

  16. Maintenance Anomaly - Solutions • Incremental update algorithms (ECA, Strobe, etc.) • Research issues: Self-maintainable views • What views are self-maintainable • Store auxiliary views so original + auxiliary views are self-maintainable

  17. Self-Maintainability: Examples Sold(item,clerk,age) = Sale(item,clerk) Emp(clerk,age) • Inserts into Emp If Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect • Inserts into Sale Maintain auxiliary view: Emp-clerk,age(Sold) • Deletes from Emp Delete from Sold based on clerk

  18. Self-Maintainability: Examples • Deletes from Sale Delete from Sold based on {item,clerk} Unless age at time of sale is relevant • Auxiliary views for self-maintainability • Must themselves be self-maintainable • One solution: all source data • But want minimal set

  19. Partial Self-Maintainability • Avoid (but don’t prohibit) going to sources Sold=Sale(item,clerk) Emp(clerk,age) • Inserts into Sale • Check if clerk already in Sold, go to source if not • Or replicate all clerks over age 30 • Or ...

  20. Warehouse Specification (ideally) View Definitions Warehouse Configuration Module Warehouse Integration rules Change Detection Requirements Integrator Metadata Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor ...

  21. Optimization • Update filtering at extractor • Similar to irrelevant updates in constraint and view maintenance • Multiple view maintenance • If warehouse contains several views • Exploit shared sub-views

  22. Additional Research Issues • Historical views of non-historical data • Expiring outdated information • Crash recovery • Addition and removal of information sources • Schema evolution

More Related