the data warehouse environment n.
Skip this Video
Loading SlideShow in 5 Seconds..
The Data Warehouse Environment PowerPoint Presentation
Download Presentation
The Data Warehouse Environment

The Data Warehouse Environment

173 Views Download Presentation
Download Presentation

The Data Warehouse Environment

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. The Data Warehouse Environment

  2. Data Warehouse Usage • Three kinds of data warehouse applications • Information processing • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs • Analytical processing and Interactive Analysis • multidimensional analysis of data warehouse data • supports basic OLAP operations, slice-dice, drilling, pivoting • Data mining • knowledge discovery from hidden patterns • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

  3. Why Separate Data Warehouse? • Performance • Op dbs designed & tuned for known OLTP uses & workloads. • Complex OLAP queries would degrade performance. • Special data organization, access & implementation methods needed for multidimensional views & queries. • Function • Missing data: Decision support requires historical data, which op dbs do not typically maintain. • Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources. • Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

  4. What are Operational Systems? • They are OLTP systems • Run mission critical applications • Need to work with stringent performance requirements for routine tasks • Run the business in real time • Based on up-to-the-second data • Optimized to handle large numbers of simple read/write transactions • Optimized for fast response to predefined transactions • Used by people who deal with customers, products -- clerks, salespeople etc. • They are increasingly used by customers

  5. RDBMS used for OLTP • Database Systems have been used traditionally for OLTP • clerical data processing tasks • detailed, up to date data • structured repetitive tasks • read/update a few records

  6. Data Industry Usage Technology Volumes Customer All Track Legacy application, flat Small-medium File Customer files, main frames Details Account Finance Control Legacy applications, Large Balance account hierarchical databases, activities mainframe Point-of- Retail Generate ERP, Client/Server, Very Large Sale data bills, manage relational databases stock Call Telecommu- Billing Legacy application, Very Large Record nications hierarchical database, mainframe Production Manufactu- Control ERP, Medium Record ring Production relational databases, AS/400 Examples of Operational Data

  7. So, what’s different?

  8. OLTP vs. Data Warehouse • OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse • Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries) • e.g., average amount spent on phone calls between 9AM-5PM in Charlotte during the month of December

  9. OLTP Application Oriented Used to run business Detailed data Current up to date Isolated Data Repetitive access Clerical User Warehouse (DSS) Subject Oriented Used to analyze business Summarized and refined Snapshot data Integrated Data Ad-hoc access Knowledge User (Manager) OLTP vs Data Warehouse

  10. OLTP Performance Sensitive Few Records accessed at a time (tens) Read/Update Access No data redundancy Database Size 100MB -100 GB Data Warehouse Performance relaxed Large volumes accessed at a time(millions) Mostly Read (Batch Update) Redundancy present Database Size 100 GB - few terabytes OLTP vs Data Warehouse

  11. OLTP Transaction throughput is the performance metric Thousands of users Managed in entirety Data Warehouse Query throughput is the performance metric Hundreds of users Managed by subsets OLTP vs Data Warehouse

  12. OLTP Systems are used to “run” a business The Data Warehouse helps to “optimize” the business To summarize ...

  13. Why Now? • Data is being produced • ERP provides clean data • The computing power is available • The computing power is affordable • The competitive pressures are strong • Commercial products are available

  14. Subject Orientation • DW is organized by major subject areas and entities of the business organization • Data warehouse model aligns with the corporate logical data model • Example of major subject areas for Insurance • Customer • Product • Transaction activity • Claim • Policy • Account

  15. Application-Orientation Subject-Orientation Operational Database Data Warehouse Credit Card Customer Loans Vendor Product Trust Savings Activity Application-Orientation vs. Subject-Orientation

  16. Integrated data • There is not application consistency in the operational data • As data from different systems is entered into the DW, entities and attributes are encoded using a consistent key or measurement

  17. Time • Data warehouse is nothing more than a sophisticated series of snapshots, taken at one moment in time • The key structure of the DW always contains some element of time

  18. RelationalDatabases ExtractionCleansing Optimized Loader ERP Systems Data Warehouse Engine AnalyzeQuery Purchased Data LegacyData Metadata Repository Data Warehouse Architecture

  19. Components of the Warehouse • Data Extraction and Loading • The Warehouse • Analyze and Query -- OLAP Tools • Metadata • Data Mining tools

  20. Loading the Warehouse Extracting, cleaning/transforming the data before it is loaded

  21. Data Extraction and Cleansing • Extract data from existing operational and legacy data • Issues: • Sources of data for the warehouse • Data quality at the sources • Merging different data sources • Data transformation • How to propagate updates (on the sources) to the warehouse • Terabytes of data to be loaded

  22. Source Data Operational/ Source Data Sequential • Typically host based, legacy applications • Customized applications, COBOL, 3GL, 4GL • Point of Contact Devices • POS, ATM, Call switches • External Sources • Need to go through ETL – Extract, Transform, Load Legacy Relational External

  23. Data Quality - The Reality • Legacy systems not well documented – if at all • Outside sources with questionable quality procedures • Production systems with no built-in integrity checks and no integration • Operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan • “And get it done quickly, we do not have time to worry about corporate standards...”

  24. Data Integration Across Sources Trust Credit card Savings Loans Same data different name Different data Same name Data found here nowhere else Different keys same data

  25. Data Integrity Problems • Same person, different spellings • Different account numbers generated by different applications for the same customer • Required fields left blank • Invalid product codes collected at point of sale • manual entry leads to mistakes • “in case of a problem use 9999999”

  26. Data Quality - The Reality • Tempting to think that creating a data warehouse is simply extracting operational data and entering into a data warehouse • Nothing could be farther from the truth • Warehouse data comes from disparate questionable sources

  27. Scrubbing Data • Sophisticated transformation tools. • Used for cleaning the quality of data • Clean data is vital for the success of the warehouse

  28. Loads • After extracting, scrubbing, cleaning, validating etc. need to load the data into the warehouse • Issues • huge volumes of data to be loaded • small time window available when warehouse can be taken off line (usually nights) • when to build index and summary tables • allow system administrators to monitor, cancel, resume, change load rates • recover gracefully -- restart after failure from where you were and without loss of data integrity

  29. Structuring/Modeling Issues

  30. Data -- Heart of the Data Warehouse • Heart of the data warehouse is the data itself! • Single version of the truth • Corporate memory • Data is organized in a way that represents business -- subject orientation

  31. Data Warehouse Structure • Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables

  32. Data Warehouse Design Issues • Major DW design issues: • Granularity and Partitioning • Determining the level of Granularity is a major design issue in the DW environment • Granularity profoundly affects the volume of data that resides in the DW and the type of query that can be answered

  33. Granularity • Granularity refers to the level of detail or summarization held in the units of data in the DW • Lower level of Granularity ---> more detail - transaction data • High level of Granularity ----> less detail - summarized data

  34. Data Granularity in Warehouse • High Level of Granularity • reduces storage costs • reduces CPU usage • increases performance since smaller number of records to be processed • design around traditional high level reporting needs • tradeoff with volume of data to be stored and detailed usage of data

  35. Granularity in Warehouse • But … • Can not answer some questions with summarized data • Did John call Jane last month? • Not possible to answer if only the total duration of calls by John over a month is maintained and individual call details are not. • Detailed data too voluminous

  36. Granularity and Data Analysis LOW LEVEL OF GRANULARITY HIGH LEVEL OF GRANULARITY LOW LEVEL OF DETAIL HIGH LEVEL OF DETAIL The details of every phone call made by a customer for a month The summary of phone calls made by a customer for a month 40,000 bytes per month 200 records per month 200 bytes 1 record per month The level of granularity is determined by what questions/queries can be answered and what resources are required to answer a question

  37. Granularity and Data Analysis LOW LEVEL OF GRANULARITY HIGH LEVEL OF GRANULARITY HIGH LEVEL OF DETAIL LOW LEVEL OF DETAIL The details of every phone call made by a customer for a month The summary of phone calls made by a customer for a month 40,000 bytes per month 200 records per month 200 bytes 1 record per month Did John call Mary in Boston last week? 01 activityrecord. 02 month 02 cumcalls 02 avglength …... 01 activityrecord. 02 date 02 time 02 to whom 02 city …... YES UNKNOWN TRADEOFF

  38. Granularity in Warehouse • How about having a dual level of granularity? • Store summary data on disks • 95% of DSS processing done against this data • Store detail on tapes • 5% of DSS processing against this data

  39. Dual Levels of Granularity • The dual level of granularity design option should be the default for almost every organization that is building a warehouse. Low level of detail High level of detail Flexibility- small volumes of data Easy to manipulated Answer to any question Large volumes of data More DASD and resources

  40. Partitioning • All currenty detail data in the DW is partitioned. • Data is partitioned when data of a like structure is divided into more than one physical unit data. • Separate small physical untis can be handled independently. • Partition of the DW data can be done at the system or at the application level.

  41. Partitioning • Small units of data can be: • restructured • indexed • sequentially scanned, if needed • reorganized • recovered • monitored

  42. Partitioning • All current detail DW data will be partitioned in order to provide flexible access to data • Choices for partitioning data are strictly up to the developer. However, TIME is always a mandatory criterion for partitioning • by date • by line of business • by geography • etc.

  43. Structuring Data in the DW • Simple Cumulative Structure • Rolling Summary Data • Simple Direct • Continuous

  44. Purging Warehouse Data • There are several ways in which data is purged or the detail of data is transformed: • Data is added to a rolling summary file where detail is lost. • Data is transferred to a bulk medium from a high-performance medium such as DASD. • Data is actually purged

  45. Data Warehouse vs. Data Marts What comes first

  46. Information Less IndividuallyStructured History Normalized Detailed DepartmentallyStructured Data Warehouse OrganizationallyStructured More Data From the Data Warehouse to Data Marts

  47. Data Warehouse and Data Marts OLAP Data Mart Lightly summarized Departmentally structured Organizationally structured Atomic Detailed Data Warehouse Data

  48. Data Mart Centric Data Sources Data Marts Data Warehouse

  49. Problems with Data Mart Centric Solution If you end up creating multiple warehouses, integrating them is a problem

  50. True Warehouse Data Sources Data Warehouse Data Marts