1 / 57

Scalable and Flexible Data Warehouse Design for Future Information Needs

This guide provides insights on designing a scalable and flexible data warehouse architecture to meet the future information requirements of the business. It includes tips on convincing business users and the IT department about the benefits of a pilot project, building an enterprise data warehouse, and avoiding proprietary solutions. The guide also covers requirements modeling and considerations for modeling dimensions, as well as concepts like slowly changing dimensions and warehouse keys.

teresax
Download Presentation

Scalable and Flexible Data Warehouse Design for Future Information Needs

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 3 Warehouse Design for IQ

  2. Information Framework for the future 1. Choose a Data Warehouse architecture which is scalable, flexible and provides an evolutionary path for Business information requirements. 2. Try to convince the business users and IS department that a Pilot project is a good idea. 3. Building a Enterprise Data Warehouse for the first project is OK, but build it small, try to reduce the number of data sources and minimize historical requirements 4. Don’t cut corners for expediency, keep the faith and build a new Data Warehouse environment which will meet the future Business informational needs of the organization. 5. Don’t Build a proprietary Data Warehouse, it will limit you in the future and cause career limiting observations. 6. Always consider performance of the Data Warehouse at all times.

  3. Requirements modelling • Produce detailed models for the dimensions • resulting multi-dimensional models could be a Star or Snowflake • Must capture details of the structure of the dimensions as well as attributes • Identify candidate measures to be sliced by dimensional attributes

  4. "What is the net sales, in terms of revenue (dollars) and quantities of items sold, by product, by store and by sales region, by day and aggregated over time, by customer and sales areas, over the last 2 weeks?" year quarter month day item Schema of the initial dimensional model Product department Time sub- category Sales category product brand item key week store key store customer customer key city day Sales Area country revenue Sales quantity sold Customer Region Store

  5. Role & purpose of dimensions • Dimensions contain business attributes which will be used as filters of facts and measures and and provide meaning to the measure. • Also used for aggregation paths for Facts • Dimensions are access paths and can contain text, indicators • History is not usually supported by this type of table • Can contain hierarchies sometimes not

  6. How to model a dimension • For each aggregation path: • Create an entity for each aggregation path in the hierarchy • Add keys to identify and link these entities (In hierarchy) • Populate with source attributes specified by business users • If no aggregation path exists: • All attributes must be at a similar grain, otherwise differing grains are aggregation paths

  7. Customer dimension contains Customer rolled up into Sales Areas #1

  8. Customer dimension contains Customer rolled up into Sales Areas #2

  9. Customer dimension contains Customer rolled up into Sales Areas #3

  10. Considerations • Sometimes adding other dimensional hierarchies than those defined by end users can produce good results. Especially low level detail grains. Hard to reengineer if requirements change • In all instances, attempt to create ‘Conformed’ dimensions - ‘A conformed dimension is a dimension that means the same thing with every possible fact table to which it can be joined’ (R Kimball) • Examples: Product, Time, Customer

  11. Modelling product conformed dimensions - Collapsed hierarchies

  12. Modelling Store conformed dimensions - Collapsed hierarchies Store dimension Store Dimension Key Store Sales fact City Country Product key Sales Area Store key Descriptors.... Customer key Properties.... Time key MEASURES

  13. Modelling Time conformed dimensions - Collapsed hierarchies Time dimension Sales fact Time dimension Key Day Product key Month Store key Quarter Customer key Year Time key Week Descriptors.... Properties..... MEASURES Non derivable attributes Derivable attributes

  14. Advance Modelling Concepts • Modelling Slowing changing dimensions • Dimensions change over time, how do we handle these changes • Model optimisation techniques • Design techniques

  15. Warehouse keys • What is a Warehouse key? • It is an integer value that is assigned to each dimensional row instance in the Data Staging area. • ‘All data warehouse keys must be meaningless surrogate keys. You must not use the original production keys’ (Ralph Kimball)

  16. Advantages High performance because of numeric integer value Can accept data from multiple data sources that have different production key structures If production keys are recycled the warehouse key will not be changed Many Front ends cannot use concatenated keys Disadvantages More complex Data Staging because the warehouse key must be assigned A lookup of related Foreign keys must occur to establish the relationship Loading dependencies can occur eg Dimensional rows must be loaded first before Facts can Warehouse keys Always use Warehouse keys, it is a strategic design rule that will pay off in the future as new data sources are found and production keys recycle.

  17. Slowly changing dimensions • Dimensions values change over time, these must be supported by population and loading techniques and are specified by dimensional modellers

  18. Temporal modelling issues • Dimensional attribute changes, eg Customer marital status changes from ‘single’ to ‘married’ etc • Dimensional Hierarchy changes, eg adding an aggregation level within the dimension or deleting an aggregation level etc • Reflecting changed data over time

  19. Key Start D. End D. Flag B. Data 001 01-01 31-12-2999 A XYZ…. 002 01-01 02-01-1998 A XYZ…. 003 01-01 31-12-2999 A XYZ…. 004 01-01 31-12-2999 A XYZ…. 005 01-04 31-12-2999 C XYZ…. 006 01-04 31-12-2999 C XYZ…. Constructive merge NK ABCD EFGH IJKL MNOP Constructive Merge process - Type 2 with end date. DW table rows day number 2 Delta records added to DW Table NK Key Start D. End D. Flag B. Data 001 01-01 31-12-2999 A XYZ ABCD EFGH 002 01-01 02-01-1998 A XYZ…. IJKL 003 Updated End D. rows 31-03-1998 01-01 A XYZ…. MNOP 004 31-03-1998 01-01 A XYZ…. IJKL New Rows added to DW MNOP Note: New rows are added to DW, old row instances for same Natural key value have had their End Date updated to a day before the Start date of the new row instance. Rows 003 has the same NK as inserted row 005, row 004 has the same NK as inserted row 006.

  20. Binary history

  21. What are State and Event models? Modelling the history of an entity: two base modelling approaches State model Each record represents a history StateO State1 State2 record of the state of the entity, (original valid during a given period of state) time State2 Event model State1 One record represents the initial state of the entity State State State0 All others represent state change1 change2 (original change events state) No initial state record if the entity itself is an event or a transaction

  22. Event Model dimensional entities for which state changes are of interest In principle states can be reconstructed from an event model State model dimensional entities for that state which is prime interest State versus Event models In many cases , state and event models are mixed in a so called hybrid history model

  23. Timestamp approach #1 • Instance timestamp • a timestamp in the form of a single time value which is associated with the history record Entity1 Entity_Identifier InstantTimestamp Attribute1 Attribute2 .... AttributeN

  24. Timestamp approach #2 • Interval timestamping • a timestamp in the form of an interval or a duration which is associated with the history record eg • interval = (BEGIN-TIME, END-TIME), OR (BEGIN-TIME, DURATION) Entity2 Entity3 Entity4 Entity_Identifier Entity_Identifier Entity_Identifier (BeginTime, EndTime) (BeginTime, Duration) DurationTimestamp Attribute1 Attribute1 Attribute1 Attribute2 Attribute2 Attribute2 .... .... .... AttributeM AttributeM AttributeM

  25. Other time domain ideas • A temporal model may contain several timestamps possibly belonging to different time domains • ‘Start date’ are most valuable for end-users • ‘Start date’ can be assigned at extract of source data, ‘End date’ is used when row is obsolete • Other date/s are appropriate in some instances Example of additional Time domains Time Valid Source Capture Apply Time Transaction Time Time Time

  26. Consequences of varying attributes

  27. Adding Dimension to dimension relationships • Simple one-to-many relationship • These can consist of direct relationship or via a Fact less Fact table when a many-to-many relationship exists • Updating relationship keys can be a useful technique For example (later)

  28. Adding Dimension to dimension relationship

  29. Example Demographics Table, row Instances

  30. Changing Dimension Example • John Smith • Birthday 30/6/67 • Initial Load on 3/11/97 has DemoID =1 • After Birthday on 7/1/97 DemoID now = 2 • No longer in the 20-29 age range Pre Load Post Load As of 7/1/97 Dimension row for John Smith ‘Demoid’ is updated to 2 by post load process

  31. Accurately Recording History • John Smith • Birthday 30/6/67 • Charge on 28/6/97 has DemoID =1 • Charge on 7/4/97 DemoID now = 2 • No longer in the 20-29 age range • Reflect correct demographics at time of charge

  32. Model enhancements & optimisations • ‘Adding aggregated facts is a technique that aims at optimising data retrieval by adding facts to the model that store measures which are aggregated along one or more dimensions’ • Why? • Optimises performance

  33. Aggregated Facts using time

  34. Aggregated Facts using product category

  35. Advantages improves performance if atomic detail not required (usually 70 % or queries can be supported by aggregated data) Less group by clauses Disadvantages Greater disk storage time to load and summarise Difficult to aggregate for all BU’s Advantages and disadvantages of aggregated facts Beware of aggregation explosion, very good for predefined query environment

  36. Merging fact tables • Merge Fact tables when ‘compatible (Conformed) dimensions are present • Advantages • Storage savings under certain conditions • Less indexes • simpler for BU • Disadvantages • Grains could change unless very careful

  37. Merging fact tables

  38. Dimensions Summary • Always attempt to build ‘Conformed’ dimensions • Attributes in the dimensions will be used for selection. Not the Key! • Ensure grains are consistent • Always assign a Warehouse key

  39. Facts Summary • Facts are meaningless without dimensions • No table scans • Are dependent on dimensions for filtering and selection • May not need warehouse key • Ensure grain is lowest as possible. This will facilitate flexibility

  40. Product The ‘Star’ Time department sub- year category Sales quarter category product month day item brand product key week store key customer key store customer customer sales city time key area country revenue Customer sales quantity sold region Store

  41. Levels of DW Information • Core Warehouse Data • Standard Aggregates • Special Aggregates • General Denorms • Multidimensional • Time series • Star Joins • Snowflake • Hypercube

  42. Core Warehouse Data:Atomic Data • Lowest level of detail (Transaction) • Once stored, is never changed • Essential to an enterprise data warehouse • 15-30% of management questions cannot be answered without transaction level data • Huge • Up to Billions of rows • Takes tremendous horsepower to process

  43. Aggregation • Summarization of detail fact tables • No need to read all the detail records if what we’re interested in is Sales by Region • Single most effective tool to control performance • May be lightly or highly summarized • Ten-fold to a thousand-fold improvement in runtime • Reduces total number of CPU cycles • May be created during load process • Needs special “aggregation” dimensions • Break off hierarchy before lowest level of detail

  44. The Information Cube Concept • View in multiple dimensions • Each dimension is composed of attributes • Slice & dice along each of its dimensions City Pairs Time Class of Service 3-Dimensional Cube

  45. Normalized Geography Data Model Geography Entity Hierarchy

  46. Geographic Dimension GeoKey Continent Country Region State/Province County/Area City PostalCode GridReference Denormalized Geography Dimension • Entire Attribute Hierarchy in each Geography Dimension Record • One record for each combination of attribute combinations kept in Legacy Systems

  47. The Information Cube Concept • View in multiple dimensions • Each dimension is composed of attributes • Slice & dice along each of its dimensions City Pairs Time Class of Service 3-Dimensional Cube

  48. Drilling • Drilling down • Increasing the number of records returned • Drilling up • Decreasing the number of records returned • Let’s look at dimension attribute hierarchies to understand how drill up and drill down are dependent on them

  49. One Attribute (Drill Up) Less Detail More Detail Many Attributes (Drill Down) What’s the Purpose of Dimension Attributes? • Controls the level of detail returned in the answer set • The fewer Dimension attributes in a query, the higher level the information returned • Let’s look at dimension attributes in more detail The Information Pyramid

  50. Geography Dimension Attributes Attribute Hierarchy Attributes Region State City….. Boston MA Natick Northeast Quincy City Pairs Windham NH Nashua Salem San Francisco CA Los Angeles West Burbank Portland OR Time Eugene Oregon City Class of Service 3-Dimensional Cube

More Related