1 / 33

What is a Data Warehouse

What is a Data Warehouse. by W. H. Inmon http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/. What is a Data Warehouse?. A data warehouse is a: subject-oriented , integrated , time-variant , nonvolatile , collection of data in support of management's decision making process

kalin
Download Presentation

What is a Data Warehouse

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. What is a Data Warehouse by W. H. Inmon http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/

  2. What is a Data Warehouse? • A data warehouse is a: subject-oriented, integrated, time-variant, nonvolatile, collection of data in support of management's decision making process • The data comes from the operational environment • The data warehouse is always a physically separate store

  3. Difference Between Operational Systems and Data and the Data Warehouse (DW) • DW is oriented around the major subjects of the enterprise • The data-driven, subject orientation is in contrast to the more classical process/functional orientation of applications • The DW world focuses on data modeling and database design exclusively • DW data excludes data that will not be used for DSS processing • DW data spans a spectrum of time and the relationships found in the data warehouse are many

  4. The data warehouse has a strong subject orientation Operational Data warehouse Customer Loans Vendor Savings Product Bank card Activity Trust An application orientation A subject orientation

  5. Integration • Data found within the DW is integrated • ALWAYS • WITH NO EXCEPTIONS • consistent naming conventions • consistent measurement of variables • consistent encoding structures • consistent physical attributes of data • data needs to be stored in the DW in a singular, globally-acceptable fashion

  6. ? When data is moved to the DW from the application-oriented operational environment, the data is integrated before entering the DW Operational Data warehouse appl A - m, f appl B - 1, 0 appl C - x, y appl D - male, female m, f pipeline cm appl A - pipeline cm appl B - pipeline inches appl C - pipeline mcf appl D - pipeline yds balance dec fixed (13,2) appl A - balance dec fixed (13,2) appl B - balance pic 9(9)v99 appl C - balance dec fixed (11,0) appl D - balance pic s9(7)v99 comp 3 appl A - description appl B - description appl C - description appl D - description description appl A - bal-on-hand appl B - current-balance appl C - cash-in-house appl D - balance balance appl A - date (Julian) appl B - date (yymmdd) appl C - date (mmddyy) appl D - date (absolute) date (Julian)

  7. Integration • The collective ability of many application designers to create inconsistent applications is legendary • The integration affects almost every aspect of design - the physical characteristics of data, the dilemma of having more than one source of data, the issue of inconsistent naming standards, inconsistent date formats, and so forth

  8. Time Variancy • All data in the data warehouse is accurate as of some moment in time (i.e., not "right now") • In the operational environment data is accurate as of the moment of access • Data found in the warehouse is said to be "time variant

  9. Time Variancy Operational Data warehouse • Current value data: • time horizon -- 60 - 90 days • key may or may not have an • element of time • data can be updated • Snapshot data: • time horizon -- 5 - 10 years • key contains an element of • time • once snapshot is made, • record cannot be updated

  10. Nonvolatile Change Replace Insert Insert Load Replace Replace Access Change Operational Data warehouse Data is updated on a record-by-record basis regularly Data is loaded into the warehouse and is accessed there, but once the snapshot of data is made, the data in the warehouse does not change

  11. Nonvolatile • The basic manipulation of data that occurs in the data warehouse is simple • There are only two kinds of operations • the initial loading of data • the access of data • There is no update of data • The need to be cautious of the update anomaly is no factor • Liberties can be taken to optimize the access of data

  12. Nonvolatile • Another consequence is in the technology • Technologies to support: • record-by-record update in an on-line mode • backup and recovery • transaction and data integrity • detection and remedy of deadlock are quite complex and unnecessary for data warehouse processing • DW environment is VERY, VERY different from the classical operational environment

  13. Nonvolatile • The source of nearly all data warehouse data is the operational environment • It is a temptation to think that there is massive redundancy of data between the two environments • In fact there is a MINIMUM of data redundancy • data is filtered; much data never passes out of the operational environment • the time horizon of data is very different • the data warehouse contains summary data • data undergoes a fundamental transformation as it passes into the data warehouse

  14. The Structure of the Warehouse • Data warehouses have a distinct structure • Different components of the data warehouse are: • meta data • current detail data • older detail data • lightly summarized data • highly summarized data • The major concern is the current detail data • the most recent happenings are always of great interest • voluminous, stored at the lowest level of granularity • disk storage is fast to access but expensive and complex to manage

  15. There are different levels of summarization and detail that demark the data warehouse Highly summarized Lightly summarized META DATA Current data Older detail data

  16. The Structure of the Warehouse • Older detail data is stored on some form of mass storage • it is infrequently accessed • it is stored at a level of detail consistent with current detailed data • Lightly summarized data is distilled from the low level of detail found at the current detailed level • it is almost always stored on disk storage • the design issues are: • what unit of time is the summarization done over • what attributes will the lightly summarized data contain

  17. The Structure of the Warehouse • Highly summarized data is compact and easily accessible • Meta data plays a special and very important role in the data warehouse • It is used as: • a directory to help locate the contents • a guide to the mapping of data as the data is transformed from the operational to the DW environment • a guide to the algorithms used for summarization

  18. An example of the levels of summarization that might be found in the data warehouse national sales by month 1988-1996 monthly sales by product line 1993-1996 national sales by week 1986-1996 weekly sales by subproduct 1988-1996 META DATA sales detail 1995-1996 sales detail 1985-1994

  19. An Example of the Data Warehouse • Old sales detail is that detail about sales that is older than 1995 • The current value detail contains data from 1995 to 1996 • The sales detail is summarized weekly by subproduct line and by region to produce the lightly summarized stores of data • The weekly sales detail is further summarized monthly along even broader lines to produce the highly summarized data • Meta data contains (at the least!): • the structure of the data • the algorithms used for summarization • the mapping from the operational environment to the data warehouse

  20. Old Detail Storage Medium • A wide variety of storage media that should be considered for storing older detail data • photo optical storage • CD-ROM • micro fiche • magnetic tape • mass storage • It is entirely likely that other storage media will serve the needs

  21. The Flow of Data Inside the Data Warehouse Summarization process Operational environment Aging process

  22. Flow of Data • As data enters the data warehouse from the operational environment, it is transformed • Upon entering the data warehouse, data goes into the current detail level of detail • It resides there and is used there until one of three events occurs: • it is purged • it is summarized, and/or • it is archived

  23. The higher the levels of summarization, the more the usage of the data

  24. Summarized Data • The more summarized the data, the quicker and more efficient it is to get to the data • The DSS analyst in a pre-data warehouse environment has used data at the detailed level • One of the tasks of the data architect is to wean the DSS user from constantly using data at the lowest level of detail • installing a chargeback system • pointing out very good response time when dealing with data at a high level of summarization

  25. Other Considerations • Data at the higher levels of summarization can be freely indexed • Data at the lower levels of detail is so voluminous that it can be indexed sparingly • The data model and formal design applies almost exclusively to the current level of detail • The data modeling activities do not apply to the levels of summarization

  26. Indexes and Data Model Data model

  27. Partitioning of DW Data • Partitioning can be done in two ways • at the DBMS level • the DBMS is aware of the partitions and manages them accordingly • the automatic management of the partitions is inflexible • at the application level • the responsibility for the management of the partitions is left up to the programmer • provides flexibility in the management of data in the data warehouse

  28. Current detail data is almost always partitioned

  29. vendor/supplier history 91 - present parts/order history 91 92 93 94 95 parts manufacture history q1 93 q2 93 q3 93 q4 93 q1 94 q2 94 q3 94 q4 94 q1 95 q2 95 q3 95 q4 95 q1 96 94 95 92 93 The internal structuring of data in a sample data warehouse current detailed data customer history order/customer 88 - present part part/order part part part/assembly parts shipments parts bill of material assembly history 91 92 93 94 95 87 - present

  30. An Example of a Data Warehouse • The levels of summarization are not shown, nor is the old detail archive shown • There are tables of the same type divided over time • For different types of tables there are different units of time physically dividing the units of data • Different tables are linked by means of a common identifier

  31. Other Anomalies • Public summary data is summary data that has been calculated outside the boundaries of the data warehouse but is used throughout the corporation • Another anomaly is that of external data • Another exceptional type of data sometimes found in a data warehouse is that of permanent detail data stored for ethical or legal reasons • the medium the data is stored on must be as safety proof as possible • the data must be able to be restored • the data needs special treatment in the indexing to be accessible

  32. Summary A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision needs There are four levels of data warehouse data: • old detail • current detail • lightly summarized data • highly summarized data Meta data is also an important part of the data warehouse environment

More Related