Scalable and Flexible Data Warehouse Design for Future Information Needs

3 Warehouse Design for IQ

Information Framework for the future
1. Choose a Data Warehouse architecture that is scalable and flexible, and that provides an evolutionary path for business information requirements.
2. Try to convince the business users and the IS department that a pilot project is a good idea.
3. Building an Enterprise Data Warehouse as the first project is OK, but build it small: reduce the number of data sources and minimize historical requirements.
4. Don’t cut corners for expediency. Keep the faith and build a new Data Warehouse environment that will meet the future business information needs of the organization.
5. Don’t build a proprietary Data Warehouse. It will limit you in the future and cause career-limiting observations.
6. Consider the performance of the Data Warehouse at all times.

Requirements modelling
• Produce detailed models for the dimensions
• The resulting multi-dimensional models could be a Star or a Snowflake
• Must capture details of the structure of the dimensions as well as their attributes
• Identify candidate measures to be sliced by dimensional attributes

"What is the net sales, in terms of revenue (dollars) and quantities of items sold, by product, by store and by sales region, by day and aggregated over time, by customer and sales areas, over the last 2 weeks?"

Schema of the initial dimensional model
[Figure: a Sales fact (item key, store key, customer key, day; measures: revenue, quantity sold) surrounded by four dimensions: Product (item, brand, sub-category, category, department), Time (day, week, month, quarter, year), Store (store, city, country, sales region) and Customer (customer, sales area).]

Role & purpose of dimensions
• Dimensions contain business attributes that are used to filter facts and measures and that give the measures meaning
• Also used as aggregation paths for facts
• Dimensions are access paths and can contain text and indicators
• History is not usually supported by this type of table
• May or may not contain hierarchies

How to model a dimension
• For each aggregation path:
  • Create an entity for each aggregation level in the hierarchy
  • Add keys to identify and link these entities (within the hierarchy)
  • Populate with source attributes specified by business users
• If no aggregation path exists:
  • All attributes must be at a similar grain; attributes at differing grains imply aggregation paths

Customer dimension contains Customer rolled up into Sales Areas #1

Considerations
• Adding dimensional hierarchies beyond those defined by end users can sometimes produce good results, especially for low-level detail grains. A model is hard to re-engineer if requirements change
• In all instances, attempt to create ‘Conformed’ dimensions: ‘A conformed dimension is a dimension that means the same thing with every possible fact table to which it can be joined’ (R. Kimball)
• Examples: Product, Time, Customer

Modelling product conformed dimensions - Collapsed hierarchies

Modelling Store conformed dimensions - Collapsed hierarchies
Store dimension: Store dimension key, Store, City, Country, Sales Area, Descriptors...., Properties....
Sales fact: Product key, Store key, Customer key, Time key, MEASURES

Modelling Time conformed dimensions - Collapsed hierarchies
Time dimension: Time dimension key, Day, Week, Month, Quarter, Year, Descriptors...., Properties.....
Sales fact: Product key, Store key, Customer key, Time key, MEASURES
The figure labels the Time dimension attributes as either non-derivable or derivable.
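As an illustration of a collapsed Time hierarchy, the following is a minimal Python sketch that generates one row per day and carries every hierarchy level in that row. The function name, the running-integer key and the use of ISO week numbers are assumptions made for this example, not part of the original slides.

```python
from datetime import date, timedelta


def build_time_dimension(start, end):
    """Return one row per calendar day with the whole hierarchy in the row."""
    rows = []
    current = start
    key = 1  # hypothetical warehouse key: a simple running integer
    while current <= end:
        rows.append({
            "time_key": key,
            "day": current.isoformat(),
            "week": current.isocalendar()[1],       # ISO week number (assumed convention)
            "month": current.month,
            "quarter": (current.month - 1) // 3 + 1,
            "year": current.year,
        })
        key += 1
        current += timedelta(days=1)
    return rows


time_dim = build_time_dimension(date(1998, 1, 1), date(1998, 12, 31))
```

Every level here is computed from the calendar date. Attributes that cannot be computed this way (for example a holiday indicator) would have to be supplied from another source.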

Advanced Modelling Concepts
• Modelling slowly changing dimensions
• Dimensions change over time; how do we handle these changes?
• Model optimisation techniques
• Design techniques

Warehouse keys
• What is a Warehouse key?
• It is an integer value that is assigned to each dimensional row instance in the Data Staging area.
• ‘All data warehouse keys must be meaningless surrogate keys. You must not use the original production keys’ (Ralph Kimball)

Warehouse keys
Advantages
• High performance because of the numeric integer value
• Can accept data from multiple data sources that have different production key structures
• If production keys are recycled, the warehouse key will not be changed
• Many front ends cannot use concatenated keys
Disadvantages
• More complex Data Staging because the warehouse key must be assigned
• A lookup of related foreign keys must occur to establish the relationship
• Loading dependencies can occur, e.g. dimensional rows must be loaded before facts can be
Always use Warehouse keys. It is a strategic design rule that will pay off in the future as new data sources are found and production keys are recycled.
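The key assignment and foreign key lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `KeyAssigner` class, the source system names and the sample rows are hypothetical and do not come from the slides.

```python
class KeyAssigner:
    """Assigns meaningless integer warehouse keys in a (hypothetical) staging step."""

    def __init__(self):
        self._next_key = 1
        self._lookup = {}  # (source_system, production_key) -> warehouse key

    def key_for(self, source_system, production_key):
        natural = (source_system, production_key)
        if natural not in self._lookup:
            self._lookup[natural] = self._next_key
            self._next_key += 1
        return self._lookup[natural]


customers = KeyAssigner()

# Dimension rows are loaded first, so every production key gets a warehouse key.
customers.key_for("billing", "C-1001")
customers.key_for("crm", 77)

# Fact rows then swap their production keys for warehouse keys via a lookup.
staged_facts = [
    {"source": "billing", "customer": "C-1001", "revenue": 40.0},
    {"source": "crm", "customer": 77, "revenue": 15.5},
]
loaded_facts = [
    {"customer_key": customers.key_for(f["source"], f["customer"]),
     "revenue": f["revenue"]}
    for f in staged_facts
]
```

Keying the lookup on the source system as well as the production key is one way to accept sources with different production key structures. The sketch does not address recycled production keys, which would need extra logic such as effective dates.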

Slowly changing dimensions
• Dimension values change over time. These changes must be supported by population and loading techniques, which are specified by the dimensional modellers

Temporal modelling issues
• Dimensional attribute changes, e.g. Customer marital status changes from ‘single’ to ‘married’
• Dimensional hierarchy changes, e.g. adding or deleting an aggregation level within the dimension
• Reflecting changed data over time

Constructive Merge process - Type 2 with end date

DW table rows, and the day number 2 delta records to be added to the DW table:

NK    Key  Start D.  End D.      Flag  B. Data
ABCD  001  01-01     31-12-2999  A     XYZ….
EFGH  002  01-01     02-01-1998  A     XYZ….
IJKL  003  01-01     31-12-2999  A     XYZ….
MNOP  004  01-01     31-12-2999  A     XYZ….
IJKL  005  01-04     31-12-2999  C     XYZ….   (delta record)
MNOP  006  01-04     31-12-2999  C     XYZ….   (delta record)

DW table after the constructive merge:

NK    Key  Start D.  End D.      Flag  B. Data
ABCD  001  01-01     31-12-2999  A     XYZ….
EFGH  002  01-01     02-01-1998  A     XYZ….
IJKL  003  01-01     31-03-1998  A     XYZ….   (End D. updated)
MNOP  004  01-01     31-03-1998  A     XYZ….   (End D. updated)
IJKL  005  01-04     31-12-2999  C     XYZ….   (new row)
MNOP  006  01-04     31-12-2999  C     XYZ….   (new row)

Note: New rows are added to the DW. Old row instances for the same natural key (NK) value have their End Date updated to the day before the Start Date of the new row instance. Row 003 has the same NK as inserted row 005, and row 004 has the same NK as inserted row 006.
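The constructive merge described in the note can be sketched in Python as below. This is a minimal illustration, not the loader used by the authors: the function name and the dictionary layout are hypothetical, and the year 1998 for the start dates is an assumption because the slide shows them only as 01-01 and 01-04.

```python
from datetime import date, timedelta

HIGH_DATE = date(2999, 12, 31)  # open-ended End D. value, as in the table


def constructive_merge(dw_rows, delta_rows, next_key):
    """Apply delta rows to a Type 2 table in place; return the next unused key."""
    for delta in delta_rows:
        # Close the current row instance for the same natural key, if any.
        for row in dw_rows:
            if row["nk"] == delta["nk"] and row["end"] == HIGH_DATE:
                row["end"] = delta["start"] - timedelta(days=1)
        # Add the new row instance as the open-ended current version.
        dw_rows.append({"nk": delta["nk"], "key": next_key,
                        "start": delta["start"], "end": HIGH_DATE,
                        "data": delta["data"]})
        next_key += 1
    return next_key


dw = [
    {"nk": "IJKL", "key": 3, "start": date(1998, 1, 1), "end": HIGH_DATE, "data": "old"},
    {"nk": "MNOP", "key": 4, "start": date(1998, 1, 1), "end": HIGH_DATE, "data": "old"},
]
delta = [
    {"nk": "IJKL", "start": date(1998, 4, 1), "data": "new"},
    {"nk": "MNOP", "start": date(1998, 4, 1), "data": "new"},
]
next_key = constructive_merge(dw, delta, next_key=5)
```

After the call, rows 3 and 4 end on 31 March 1998 and rows 5 and 6 carry the open-ended end date, which mirrors the updated and new rows in the table.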

Binary history

What are State and Event models?
Modelling the history of an entity: two base modelling approaches
State model
• Each record represents a history record of the state of the entity, valid during a given period of time
• Sequence shown: State0 (original state), State1, State2
Event model
• One record represents the initial state of the entity
• All others represent state change events
• No initial state record if the entity itself is an event or a transaction
• Sequence shown: State0 (original state), State change1, State change2

State versus Event models
• Event model: for dimensional entities for which state changes are of interest
• In principle, states can be reconstructed from an event model
• State model: for dimensional entities for which the state is of prime interest
• In many cases, state and event models are mixed in a so-called hybrid history model
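To illustrate the remark that states can in principle be reconstructed from an event model, here is a small Python sketch. The record layout, the function name and the sample customer attributes are assumptions for this example only.

```python
def reconstruct_states(initial_state, change_events):
    """Replay change events over the initial state; return one state record per period."""
    states = []
    current = dict(initial_state["attributes"])
    valid_from = initial_state["at"]
    # ISO date strings sort chronologically, so a plain sort is enough here.
    for event in sorted(change_events, key=lambda e: e["at"]):
        states.append({"from": valid_from, "to": event["at"], "attributes": dict(current)})
        current.update(event["changes"])
        valid_from = event["at"]
    # The last state is still valid, so it has no end yet.
    states.append({"from": valid_from, "to": None, "attributes": dict(current)})
    return states


initial = {"at": "1997-01-01", "attributes": {"marital_status": "single", "city": "Boston"}}
events = [
    {"at": "1997-09-15", "changes": {"city": "Natick"}},
    {"at": "1997-06-01", "changes": {"marital_status": "married"}},
]
states = reconstruct_states(initial, events)
```

The event model stores one initial record plus two change events, while the derived state model holds three full records, each valid for a period.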

Timestamp approach #1
• Instant timestamp
• A timestamp in the form of a single time value that is associated with the history record
Entity1: Entity_Identifier, InstantTimestamp, Attribute1, Attribute2, ...., AttributeN

Timestamp approach #2
• Interval timestamping
• A timestamp in the form of an interval or a duration that is associated with the history record, e.g.
• interval = (BEGIN-TIME, END-TIME) or (BEGIN-TIME, DURATION)
Entity2: Entity_Identifier, (BeginTime, EndTime), Attribute1, Attribute2, ...., AttributeM
Entity3: Entity_Identifier, (BeginTime, Duration), Attribute1, Attribute2, ...., AttributeM
Entity4: Entity_Identifier, DurationTimestamp, Attribute1, Attribute2, ...., AttributeM

Other time domain ideas
• A temporal model may contain several timestamps, possibly belonging to different time domains
• ‘Start date’ is most valuable for end users
• ‘Start date’ can be assigned at extract of the source data; ‘End date’ is used when the row is obsolete
• Other dates are appropriate in some instances
Example of additional time domains: Valid Time, Transaction Time, Source Time, Capture Time, Apply Time

Consequences of varying attributes

Adding Dimension to dimension relationships
• Simple one-to-many relationship
• These can consist of a direct relationship, or go via a factless fact table when a many-to-many relationship exists
• Updating relationship keys can be a useful technique (example later)

Adding Dimension to dimension relationship

Example Demographics Table, row Instances

Changing Dimension Example
• John Smith
• Birthday 30/6/67
• Initial Load on 3/11/97 has DemoID = 1
• After Birthday, on 7/1/97, DemoID now = 2
• No longer in the 20-29 age range
Pre Load / Post Load: as of 7/1/97, the dimension row for John Smith has its DemoID updated to 2 by the post-load process

Accurately Recording History
• John Smith
• Birthday 30/6/67
• Charge on 28/6/97 has DemoID = 1
• Charge on 7/4/97 has DemoID = 2
• No longer in the 20-29 age range
• Reflect correct demographics at time of charge
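One way to reflect the correct demographics at the time of the charge is to derive the demographics key from the charge date when the fact row is built. The Python sketch below assumes the two charges fall on 28 June 1997 and 4 July 1997 and that DemoID 2 covers ages 30-39; both readings, and all names in the code, are assumptions rather than facts from the slides.

```python
from datetime import date

# Hypothetical demographics rows: DemoID -> inclusive age range.
DEMOGRAPHICS = {1: (20, 29), 2: (30, 39)}


def age_on(birthday, on_date):
    """Whole years of age on a given date."""
    had_birthday = (on_date.month, on_date.day) >= (birthday.month, birthday.day)
    return on_date.year - birthday.year - (0 if had_birthday else 1)


def demo_id_for(birthday, on_date):
    """Pick the demographics row whose age range covers the age on that date."""
    age = age_on(birthday, on_date)
    for demo_id, (low, high) in DEMOGRAPHICS.items():
        if low <= age <= high:
            return demo_id
    raise ValueError(f"no demographics row covers age {age}")


birthday = date(1967, 6, 30)
charges = [date(1997, 6, 28), date(1997, 7, 4)]
charge_facts = [{"charge_date": d, "demo_id": demo_id_for(birthday, d)} for d in charges]
```

Because each fact row stores the key that was valid on its own date, later updates to the customer's current DemoID do not rewrite the history of earlier charges.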

Model enhancements & optimisations
• ‘Adding aggregated facts is a technique that aims at optimising data retrieval by adding facts to the model that store measures which are aggregated along one or more dimensions’
• Why?
• Optimises performance

Aggregated Facts using time

Aggregated Facts using product category

Advantages and disadvantages of aggregated facts
Advantages
• Improves performance if atomic detail is not required (usually 70% of queries can be supported by aggregated data)
• Fewer GROUP BY clauses
Disadvantages
• Greater disk storage
• Time to load and summarise
• Difficult to aggregate for all BUs
Beware of aggregation explosion. Aggregated facts are very good for a predefined query environment.
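As a rough illustration of an aggregated fact along the Time dimension, the Python sketch below rolls day-level sales up to month level. The row layout, the sample figures and the function name are invented for this example.

```python
from collections import defaultdict

# Hypothetical atomic (day-level) sales fact rows.
atomic_sales = [
    {"day": "1998-01-05", "product_key": 1, "store_key": 10, "revenue": 20.0, "quantity": 2},
    {"day": "1998-01-19", "product_key": 1, "store_key": 10, "revenue": 30.0, "quantity": 3},
    {"day": "1998-02-02", "product_key": 1, "store_key": 10, "revenue": 10.0, "quantity": 1},
    {"day": "1998-01-07", "product_key": 2, "store_key": 10, "revenue": 5.0, "quantity": 1},
]


def aggregate_by_month(rows):
    """Build a monthly aggregated fact: measures summed along the Time dimension."""
    totals = defaultdict(lambda: {"revenue": 0.0, "quantity": 0})
    for row in rows:
        month = row["day"][:7]  # 'YYYY-MM' prefix of the ISO date
        group = (month, row["product_key"], row["store_key"])
        totals[group]["revenue"] += row["revenue"]
        totals[group]["quantity"] += row["quantity"]
    return [
        {"month": m, "product_key": p, "store_key": s, **measures}
        for (m, p, s), measures in sorted(totals.items())
    ]


monthly_sales = aggregate_by_month(atomic_sales)
```

Four atomic rows collapse into three monthly rows here. Queries that only need monthly totals can read the smaller table, at the cost of extra storage and load time.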

Merging fact tables
• Merge fact tables when compatible (‘Conformed’) dimensions are present
• Advantages
  • Storage savings under certain conditions
  • Fewer indexes
  • Simpler for BUs
• Disadvantages
  • Grains could change unless very careful

Merging fact tables

Dimensions Summary
• Always attempt to build ‘Conformed’ dimensions
• Attributes in the dimensions will be used for selection, not the key
• Ensure grains are consistent
• Always assign a Warehouse key

Facts Summary
• Facts are meaningless without dimensions
• No table scans
• Are dependent on dimensions for filtering and selection
• May not need a warehouse key
• Ensure the grain is as low as possible. This will facilitate flexibility

The ‘Star’
[Figure: star schema. A Sales fact (product key, store key, customer key, time key; measures: revenue, quantity sold) joined to the Product (item, brand, sub-category, category, department), Time (day, week, month, quarter, year), Store (store, city, country, sales region) and Customer (customer, customer sales area) dimensions.]
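A star join over such a schema can be tried out with Python's built-in sqlite3 module. The sketch below is deliberately cut down: it keeps only three of the four dimensions, and every table name, column name and sample value is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, item TEXT, brand TEXT);
CREATE TABLE store   (store_key INTEGER PRIMARY KEY, store TEXT, sales_region TEXT);
CREATE TABLE time_dim(time_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE sales   (product_key INTEGER, store_key INTEGER, time_key INTEGER,
                      revenue REAL, quantity_sold INTEGER);
INSERT INTO product VALUES (1, 'Widget', 'Acme'), (2, 'Gadget', 'Acme');
INSERT INTO store VALUES (10, 'Boston #1', 'Northeast'), (20, 'Portland #1', 'West');
INSERT INTO time_dim VALUES (100, '1998-01-05', '1998-01'), (101, '1998-01-06', '1998-01');
INSERT INTO sales VALUES (1, 10, 100, 20.0, 2), (1, 10, 101, 30.0, 3),
                         (2, 20, 100, 5.0, 1), (1, 20, 101, 10.0, 1);
""")

# Filter and group on dimension attributes; join to the fact on warehouse keys.
query = """
SELECT s.sales_region, p.item, SUM(f.revenue), SUM(f.quantity_sold)
FROM sales AS f
JOIN product  AS p ON p.product_key = f.product_key
JOIN store    AS s ON s.store_key   = f.store_key
JOIN time_dim AS t ON t.time_key    = f.time_key
WHERE t.day BETWEEN '1998-01-01' AND '1998-01-14'
GROUP BY s.sales_region, p.item
ORDER BY s.sales_region, p.item
"""
result = conn.execute(query).fetchall()
```

The query selects on dimension attributes (day, sales region, item) and uses the keys only for joining, in line with the Dimensions Summary.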

Levels of DW Information
• Core Warehouse Data
• Standard Aggregates
• Special Aggregates
• General Denorms
• Multidimensional
• Time series
• Star Joins
• Snowflake
• Hypercube

Core Warehouse Data: Atomic Data
• Lowest level of detail (Transaction)
• Once stored, is never changed
• Essential to an enterprise data warehouse
• 15-30% of management questions cannot be answered without transaction level data
• Huge
• Up to billions of rows
• Takes tremendous horsepower to process

Aggregation
• Summarization of detail fact tables
• No need to read all the detail records if what we’re interested in is Sales by Region
• Single most effective tool to control performance
• May be lightly or highly summarized
• Ten-fold to a thousand-fold improvement in runtime
• Reduces total number of CPU cycles
• May be created during load process
• Needs special “aggregation” dimensions
• Break off hierarchy before lowest level of detail

The Information Cube Concept
• View in multiple dimensions
• Each dimension is composed of attributes
• Slice & dice along each of its dimensions
[Figure: 3-Dimensional Cube with axes City Pairs, Time and Class of Service]

Normalized Geography Data Model Geography Entity Hierarchy

Denormalized Geography Dimension
Geographic Dimension: GeoKey, Continent, Country, Region, State/Province, County/Area, City, PostalCode, GridReference
• Entire attribute hierarchy in each Geography Dimension record
• One record for each combination of attribute values kept in legacy systems


Drilling
• Drilling down
  • Increasing the number of records returned
• Drilling up
  • Decreasing the number of records returned
• Let’s look at dimension attribute hierarchies to understand how drill up and drill down are dependent on them

What’s the Purpose of Dimension Attributes?
• Controls the level of detail returned in the answer set
• The fewer dimension attributes in a query, the higher-level the information returned
• Let’s look at dimension attributes in more detail
[Figure: The Information Pyramid. One attribute (drill up) gives less detail; many attributes (drill down) give more detail.]

Geography Dimension Attributes
Attribute hierarchy: Region, State, City

Region     State  City
Northeast  MA     Boston, Natick, Quincy
Northeast  NH     Windham, Nashua, Salem
West       CA     San Francisco, Los Angeles, Burbank
West       OR     Portland, Eugene, Oregon City
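Using the Region, State and City values above, drilling up and down can be sketched in Python as grouping by fewer or more dimension attributes. The sales amount of 10 per city and the `summarise` helper are made up for this example.

```python
from collections import defaultdict

ATTRIBUTES = ("region", "state", "city")

# Geography values from the slide; the amount of 10 per city is invented.
GEOGRAPHY = {
    ("Northeast", "MA"): ["Boston", "Natick", "Quincy"],
    ("Northeast", "NH"): ["Windham", "Nashua", "Salem"],
    ("West", "CA"): ["San Francisco", "Los Angeles", "Burbank"],
    ("West", "OR"): ["Portland", "Eugene", "Oregon City"],
}
sales = [(region, state, city, 10)
         for (region, state), cities in GEOGRAPHY.items()
         for city in cities]


def summarise(rows, attributes):
    """Total sales grouped by the chosen dimension attributes."""
    positions = [ATTRIBUTES.index(a) for a in attributes]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[p] for p in positions)] += row[-1]
    return dict(totals)


by_region = summarise(sales, ["region"])                  # drill up: fewest records
by_state = summarise(sales, ["region", "state"])
by_city = summarise(sales, ["region", "state", "city"])   # drill down: most records
```

With one attribute the answer set has 2 records, with two attributes 4, and with all three 12, which matches the description of drilling down as increasing the number of records returned.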
