slide1 l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Presentation # 506 PowerPoint Presentation
Download Presentation
Presentation # 506

Loading in 2 Seconds...

play fullscreen
1 / 71

Presentation # 506 - PowerPoint PPT Presentation


  • 126 Views
  • Uploaded on

IOUG Live! 2005. Design Tips for the Warehouse Architect. Presentation # 506. David Stanford President Red Sky Data Inc. david.stanford@redskydata.com. Objectives. Obtain a clear understanding of data warehouse design ‘hot points’ Identify solutions and alternatives for these ‘hot points’

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Presentation # 506' - dalton


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide1

IOUG Live! 2005

Design Tips for the Warehouse Architect

Presentation # 506

David Stanford

President

Red Sky Data Inc.

david.stanford@redskydata.com

objectives
Objectives
  • Obtain a clear understanding of data warehouse design ‘hot points’
  • Identify solutions and alternatives for these ‘hot points’
  • See how real world solutions are implemented
agenda
Agenda
  • Top 10 Gotchya’s
  • Design Traps
    • Loading Dirty Fact Data
    • Surrogate Keys
    • The Staging Area
  • Slowly Changing Dimensions
  • Tracking All History
  • Audit Considerations
  • Bad & Missing Data
  • Administrative Fields
  • Other Tidbits of Advice
slide4

Dave’s Top 10 Gotchya’s

  • Failing to model for both a) view of the data when the event occurred and b) view of the data as of today’s reality
  • Limiting the number of dimensions
  • Failing to model and populate a meta data repository
  • Failing to provide sufficient audit capabilities to verify loads against source systems
  • Not using surrogate keys for everything
slide5

Dave’s Top 10 Gotchya’s

  • Failing to design an error correction process
  • Normalizing too much
  • Not using a staging area
  • Failing to load ALL of the fact data
  • Failing to classify incorrect data
  • Making it too complex!
design traps

Design Traps

Design Review

Staging Area

Surrogate Keys

Facts – Surrogates and Dirty Data

data warehouse process
Data Warehouse Process

Data Characteristics

  • Raw Detail
  • No/Minimal History
  • Integrated
  • Scrubbed
  • History
  • Summaries
  • Targeted
  • Specialized (OLAP)

Source OLTP

Systems

Data Marts

Staging

Area

Data

Warehouse

  • Design
  • Mapping
  • Extract
  • Scrub
  • Transform
  • Load
  • Index
  • Aggregation
  • Replication
  • Data Set Distribution
  • Access & Analysis
  • Resource Scheduling & Distribution

Meta Data

System Monitoring

Source: Enterprise Group

where the work is
Where The Work Is

Source OLTP

Systems

Data Marts

Over 80% of the work is here

Staging

Area

Data

Warehouse

  • Design
  • Mapping
  • Extract
  • Scrub
  • Transform
  • Load
  • Index
  • Aggregation
  • Replication
  • Data Set Distribution
  • Access & Analysis
  • Resource Scheduling & Distribution

Meta Data

System Monitoring

Source: Enterprise Group

warehouse design
Warehouse Design
  • Normalized (Relational) Design
  • Dimensional Design – Star and Snowflake
  • Hybrid Design
    • In reality, the DW is more normalized but has elements of dimensional design
    • The data marts are star schemas but have elements of normalization
modelling is not straight forward
Modelling is not straight forward

Marital Status

Age

Member

Donation

Time

Gender

Campaign

Income

Location

should these be combined
Should These Be Combined?

MaritalStatus

Age

Member

Donation

Time

Gender

Campaign

Income

Location

behind the scenes
Behind The Scenes
  • There are several aspects of a design that users don’t directly see:
    • Meta Data
    • Error Correction
    • Audit
    • Load Control (if not using a scheduling tool)
    • Transformation Tables (used for transforming the data prior to being loaded into the DW)
behind the scenes13

Error Correction

Audit

Load Control

Transform Tables

Meta Data

Behind The Scenes

Data Marts

Source OLTP

Systems

Staging

Area

Data

Warehouse

a 10 step design process
A 10 Step Design Process
  • Identify major subject areas or topics
  • Declare the Grain
  • Add element of time to the tables
  • Create appropriate names for tables, columns, and views
  • Add derived fields where applicable
  • Add administrative fields
  • Consider security and privacy in design
  • Make sure data model answers the critical business questions
  • Consider meta data
  • Consider error correction
  • Performance considerations: Tune, Tune, Tune
independent of approach
Independent of Approach…

…the goal of the data model is to satisfy two primary criteria:

1. Meet Business Objectives

2. Provide Good Performance

staging area17
Staging Area

Data Characteristics

  • Raw Detail
  • No/Minimal History
  • Integrated
  • Scrubbed
  • History
  • Summaries
  • Targeted
  • Specialized (OLAP)

Source OLTP

Systems

Data Marts

Staging

Area

Data

Warehouse

  • Design
  • Mapping
  • Extract
  • Scrub
  • Transform
  • Load
  • Index
  • Aggregation
  • Replication
  • Data Set Distribution
  • Access & Analysis
  • Resource Scheduling & Distribution

Meta Data

System Monitoring

Source: Enterprise Group

the staging area
The Staging Area
  • Holds a mirror copy of the extract files
  • Allows pre-processing of the data before loading
  • Allows easier reloading (you WILL do this)
  • Keeps more control with the DW team, rather than with an external group (the extract team)
the staging area19
The Staging Area
  • Facilitates easier audit processes
  • Can facilitate error correction processes
  • Helps identifying the Record Type (translates into easier ETL processing and logic)
surrogate keys21
Surrogate Keys
  • A surrogate key is a system generated, unintelligent, single column, unique identifier for each row within a table
  • Always use surrogate keys for dimensions
  • Always use surrogate keys for the time dimension
  • Always use surrogate keys for transformation tables
  • Always use surrogate keys for EVERY table
  • ..and this includes FACT tables
surrogate keys avoid
Surrogate Keys Avoid…
  • Duplicate keys from different source systems
  • Recycling of primary keys
  • Use of the same key for different business rows
  • Lengthy composite key joins
  • Space in fact tables
  • Application changes or upgrades in source systems
using surrogates in fact tables
Using Surrogates In Fact Tables
  • You will need a surrogate key on the fact table if you allow ‘unknown’ values into the fact table (which is recommended by the way)
  • The Primary Key of a fact table is typically the combination of the base dimensions
slide27

Surrogates In Fact Tables

This results in a duplicate primary key in the table

surrogates in fact tables28
Surrogates In Fact Tables

Thus the need for a surrogate primary key

load dirty data into the fact
Load “Dirty” Data Into The Fact
  • Ties out to source systems
  • Gains credibility with end users
  • Requires a few design resolutions:
    • Bad & Missing (BAM) Logic
    • Surrogate Keys in the Fact tables
  • Still 100% accurate – we don’t load the bad values, we identify the bad values for correction later
  • Empowers the End Users to decide if the “dirty” data will invalidate their analysis
tracking history in dimensions
Tracking History in Dimensions
  • Type 1 – No history
  • Type 2 – All history
  • Type 3 – Some history
more dimension types combinations
More Dimension Types…Combinations
  • Type 3 Prime – Types 1 and 2 (the most common)
  • Type 4 – Types 1 and 3
  • Type 5 – Types 2 & 3
  • Type 6 – Types 1, 2, and 3 (the second most common)
trigger fields
Trigger Fields
  • Trigger Fields are fields within a table that you want to track history
  • Non-Trigger fields are those which you do not want to track history
type 3 prime all and no history
Type 3 Prime –All and No History

Non Trigger Fields

Trigger Field

expect to track everything
Expect To Track Everything
  • Users want to view the data as it was when the transaction or event occurred

AND…

  • Users want to view the data in the context of today’s realities

THUS, model for both!

add current columns
In order to provide these two views, consider adding ‘current’ columns to tables. This is a special Type 6.

These fields get updated in historical records when a trigger field changes value in the current record.

This simplifies the use of the DW by the users

It’s easier to understand than having to write complex SQL

Add ‘Current’ Columns
most recent flag
Most Recent Flag
  • Tracks the Most Recent record in time (not loaded, but based on a time series)
  • Should be added to the dimensions as a Yes/No (1/0) field
  • The most recently loaded record is set to Yes, all other records are set to No
  • Allows user to restrict on the Most Recent Flag to get a view of the world today
double keying type 2 dimensions
Double Keying Type 2 Dimensions
  • Double surrogate key in Type 2 dimensions
  • 1 key is unique for each individual row
  • 1 key is unique for each individual business key
  • Protects against:
    • Authoritative source system changes / duplication
rapidly changing dimensions
Rapidly Changing Dimensions
  • Rapidly Changing Dimensions (RCD’s) need to be partitioned
    • Use Oracle partitioning
    • Include the native partition key in the dimension
    • Or split into several tables
bad missing fact data
Bad & Missing Fact Data
  • Bad and/or missing data will be always be an issue
  • The source data is never completely clean
  • There are always exceptions
  • Recall that you need to tie back into the source systems for your audit, thus you must load this ‘incorrect’ data
  • Put the decisions into the hands of your users – don’t decide for them whether the data is good enough or not
  • Need to develop Bad & Missing (BAM) Rules
bam rules
BAM Rules
  • Used in the ETL process when loading data that references other tables (e.g. loading a fact table and looking up the dimension record)
  • Need a series of rules to follow if the lookup fails
  • Create a set of ‘dummy’ records for each referenced table (for Referential Integrity purposes)
  • In snapshots, may need a set of dummy records per snapshot period
error correction process
Error Correction Process
  • An area that you can report from and reload from
  • Hold or point to the original source record and be able to recreate it (the DW has lost the original value once tagged to a BAM rule)
  • Can be one summary table with standard error types
  • For more detail, create one error table for each target table
  • Create a series of error flag columns in the error table indicating what went wrong
error correction model detail mode

Stage

Target

Source

Reload

Error

Exists

Error Correction Model – Detail Mode

Load

Process

An Error Occurred

Per Table

Get The Record Ready

For Reloading

audit considerations
Audit Considerations
  • A key area that is quite often ignored
  • You must match to the source systems or be able to explain the differences
  • Auditing data loads (when did we start a load and what is the status?)
  • Without proof, you will not get the credibility!
administrative fields
Administrative Fields
  • Supports the ‘behind the scenes’ aspects
    • Loading
    • Querying
  • Different requirements for dimensions and facts
  • But try to standardize across all tables, even if the fields aren’t utilized today
dimension tables
Dimension Tables
  • Record Type – indicates New, Trigger Field Modify, Non-Trigger Field Modify, Delete, Correction
  • Active Flg - indicates a business key is active
  • Most Recent Flg - indicates the most recent row loaded within a business key
  • Effective Date - for the instance of that row
  • End Date - for the instance of that row
  • Create Date
  • Update Date
  • Create User
  • Update User
fact tables
Fact Tables
  • Record Type
  • Active Flg
  • Most Recent Flg
  • Row Cnt
  • Partition Date – store the actual date value
  • Create Date
  • Update Date
  • Create User
  • Update User
administrative field values
Administrative Field Values
  • Use 1’s and 0’s in flag and count fields – they’re easier to add (but it really depends on what the user can best understand)
  • Always fill in date fields (use dummy start and end dates in time if needed)
  • Use triggers to populate the create/update dates and users
random thoughts
Random Thoughts
  • Ensure you secure…
    • Budget
    • Top management commitment
  • Have focus (scope definition)
  • Develop incrementally
  • Have a business driven solution
  • Use experienced designers and implementers
  • Use industry tools for development
more random thoughts
…More Random Thoughts
  • Generally, make all of your column names unique across tables
  • Conform fact table measures (same name)
  • Don’t normalize too much – jump right into a dimensional design
  • Avoid retroactive changes
  • Don’t be afraid of many dimensions
too many dimensions
Too Many Dimensions?
  • 1 Fact, 41 CONFORMED dimensions
12 common dw design mistakes intelligent enterprise ralph kimball oct 2001
12 Common DW Design Mistakes(Intelligent Enterprise: Ralph Kimball Oct 2001)
  • Place text attributes in a fact table when you want to use them as constraints and groupings
  • Limit the use of verbose descriptions in your dimensions to save space
  • Split hierarchy and hierarchy levels into multiple dimension tables
  • Delay dealing with slowly changing dimensions
  • Use smart keys to join dimension and fact tables
  • Add dimensions to fact tables before declaring the grain
12 common dw design mistakes intelligent enterprise ralph kimball oct 200169
12 Common DW Design Mistakes(Intelligent Enterprise: Ralph Kimball Oct 2001)
  • Declare that the dimensional model is based on a specific report
  • Mixing different grains in one fact table
  • Leave lowest-level atomic data in non-dimensional format
  • Avoid building aggregates and use hardware for performance improvements
  • Fail to conform fact data
  • Fail to conform dimension data
in summary
In Summary
  • Avoid “Dave’s Gotchya’s”
  • Be careful in your design
  • Meet business requirements
  • Address the ‘behind the scenes’ issues
  • Remember: DW design is not a science, it is an art…so be creative!
slide71

Q

&

Q U E S T I O N S

A N S W E R S

Thank You!

David Stanford

david.stanford@redskydata.com

A