Implementing Slowly Changing Dimensions in SQL Server 2008: A Comprehensive Guide

Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200Colin Hardie

Agenda… …or how long do I have to sit here for? • Who am I? • Introduction – What is a Slowly Changing Dimension (SCD)? • Types of SCD • The Kimball Approach to SCDs – modelling an SCD in the Data Warehouse • Alternative designs and why you might want to use them • SCDs in the OLAP layer • Loading an SCD 1 – The SSIS SCD Wizard • Loading an SCD 2 – BYO (Build Your Own!) • Loading an SCD 3 – BYO Reloaded - The one that works! (I hope) • Loading an SCD 4 – SQL Server 2008 TSQL Approach – the MERGE statement! • Questions (and possibly answers)

Who am I? The name is familiar from somewhere… • Senior Business Intelligence Consultant with IMGROUP • Specialising in SSIS and SSAS with a smattering of RDBMS • MCTS & MCITP qualified for Business Intelligence • Co-Author of the most popular post on Chris Webb’s Blog – OLAP Jokes • http://cwebbbi.spaces.live.com/Blog/cns!1pi7ETChsJ1un_2s41jm9Iyg!290.entry • A girl goes speed dating and after a run of duff prospects a dimension sits down opposite her and they engage in sprightly conversation. • After the event is over, her friend asks "Did you get that dimension's number? You two really seemed to hit it off" • "No" replies the girl "He was really deep at times - had a lot of levels - but you know parent-child dimensions, just too much of a mummy's boy"

Subject Matter Specific Humour… …otherwise known as the tenuous link Q: Why did the dimension take all day to take off its suit and put on a pair of jeans? A: It was a slowly-changing dimension! (Yeah it was Chris’s)

What is a Slowly Changing Dimension? Time to Ask Kimball…or not • Somewhat cryptically, Kimball has started defining SCDs as “nearly constant dimensions”* • A more helpful definition might be “a dimension that records change over time” • For most people in the BI community there are nominally 3 different types of SCD, each with a different method of handling change • Type 1 where change is handled by overwriting old data, hence you always get the latest view • Type 2 where change is stored using a combination of effective dates and new rows of data (more to follow!) • Type 3 where change is handled by adding a new column! • Wikipedia also defines Types 0, 4 and 6 • Type 0 – where nothing is done at all (so this is a dimension then???) • Type 4 – where “history tables” are used to store old records • Type 6 – which is a hybrid (invented by Kimball) of Types 1 + 2 + 3 = 6 • * The Data Warehouse Toolkit, Second Edition (Ralph Kimball et al, 2002)

TYPES OF SCD

Types of SCD Type 1 SCD Handling • TECHNIQUE: Replace an old dimension attribute value with a new one • Example is the Customer Dimension in the Adventure Works DW database • After a freak accident one day, Jon Yang becomes Jane Yang and the row then becomes

Types of SCD Type 2 Handling • TECHNIQUE: Add a new row with all of the same values apart from the changed one • Kimball states that “A type 2 response is the predominant technique for supporting [the representation of prior data correctly]” • Example is the Employee dimension in the Adventure Works Data Warehouse database • One day after an unfortunate incident at the office party, Guy get re-assigned to a new department with a new boss and a new title. • Note that the new row receives a new surrogate key, the old row now has CurrentFlag status of 0 and a non-null EndDate

Types of SCD Type 3 Handling TECHNIQUE: Add a new column to capture the previous state of the attribute • This technique is used in those cases when we want the ability to compare results using both the old and current values of an attribute • Type 2 cannot support this as it partitions the Fact table by attribute value. It is equivalent to a UNION of results for all attribute values. • Type 3 is the equivalent of a JOIN of results for current and prior attribute values.

Types of SCD Type 6 Handling TECHNIQUE: Add new rows and columns with updates (1 + 2 + 3) • To be used solely in the case where the solution is required to preserve all historical changes whilst also allowing you to compare historical data with the current view

Summary What do we know so far? • Slowly Changing Dimensions are a method of handling change in dimensional data over time • There are various “Types” of SCD that each involve one or more techniques including updates, new rows and new columns • The most commonly used are Type 1 and Type 2 • For the remainder of the session we will be focusing on Type 2 and how they can be implemented

SCDs IN THE RELATIONAL LAYER

Modelling Type 2 SCDs The Relational Layer – The Kimball Approach • SCDs like all other Dimensions in the Kimball approach to Data Warehouse design are predominantly modelled using a de-normalized star schema. • This can lead to some data redundancy (e.g. repeated attribute values) but generally on the basis of ease of use and query performance (TSQL that is) this is acceptable. • Kimball specifically recommends against the use of normalized or snowflake schemas on the grounds that “since dimension tables are geometrically smaller than fact tables, improving storage efficiency…has virtually no impact“* • * Data Warehouse Toolkit, Second Edition (Ralph Kimball et al, 2002)

Modelling Type 2 SCDs The Relational Layer – The Kimball Approach • An important point to note about the Dimension table is the use of a surrogate key (EmployeeKey) rather than the natural business key • Without a surrogate key in place, Type 2 changes would not be possible as the business key could not be repeated in a new row without change, at which point it would represent something else • The surrogate key becomes the basis for the join to the fact table and for the partitioning of the fact table data based on dimensional change • Note that in a single dimension table, both Type 1 and Type 2 attributes can be present

Modelling Type 2 SCDs When one table is never enough… • In general, using a Star-Schema dimension is the best approach • They are easier to understand and in a relational sense perform better • ETL is more straightforward • Hierarchy relationships can be easily seen by reading across a row • However, it should be noted that Kimball tends to exclusively talk about relational BI systems • The use of an OLAP layer will significantly mitigate against the problems of a multi-table design • In particular there are two main scenarios where a multi-table approach offers a better solution • A long chain of Type 2 hierarchy attributes • A large, rapidly changing Type 2 dimension

Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain • Consider the following scenario… • A large multinational has implemented a central BI tool to manage its global sales force • The sales force has the following structure Global Head of Sales Regional Head Country Manager Area Manager Team Manager Sales Rep

Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain • In a Star Schema format this type of chain is modelled as one table • If all of the managers and efficiency targets are modelled as type 2, then data explosion can occur very quickly, especially for large numbers of granularity members • Large amounts of data redundancy is introduced

Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain • In a Snowflake Scheme this is modelled as a series of tables • This approach has the advantage that if a change occurs in a mid-hierarchy attribute, then the subsequent changes are restricted to that attribute and below • This not only reduces storage space but also the burden placed on the ETL system • Analysis Services also tends to process slightly more performantly against Snowflake schemas than Star schemas

Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain • Suppose we have the following row of data from the Star schema implementation of the dimension

Modelling Type 2 SCDs A Long Type 2 Hierarchy Chain

Modelling Type 2 SCDs A Rapidly Changing Type 2 Dimension • Kimball describes a scenario where you find yourself with what he terms a “rapidly changing dimension” - essentially very large dimensions with many Type 2 attributes that change frequently e.g. a Customer dimension • For multi-million row dimension tables, the standard Type 2 handling techniques fall down due to the data explosion they generate • Kimball’s suggested solution is to split the dimensions in two, with 2 keys attached to the fact table instead of one • In one table the original surrogate key and all of the slowly changing Type 2 attributes are left. • In the other, all of the remaining rapidly changing Type 2 attributes are stored with another surrogate key. The relationship between the two is then maintained through the fact table • This technique should only be considered for those scenarios described as “rapidly changing monster dimension”

Modelling Type 2 SCDs A Rapidly Changing Type 2 Dimension • The split dimension technique involves taking one dimension with a combination of slowly and rapidly changing Type 2 attributes and creating two

SCDs IN THE OLAP LAYER

Modelling Type 2 SCDs The OLAP Layer • Modelling Type 2 SCDs in SSAS is extremely straight forward • Because we use a surrogate key for the granularity of the dimension in the relational layer, all of the hard work has already been done • The changing surrogate key means that over time the “parent” attributes to which this granularity attribute belongs will change over time seamlessly and with no extra effort on our part • If we want to see a view across time for a particular member then we can also create an attribute hierarchy based on the business key of the dimension

Modelling Type 2 SCDs The OLAP Layer – An Example • Suppose we are a pharmaceutical company who want to track sales of a particular drug over time • We have a Type 2 SCD Product dimension that records details of the drug over time • And a Fact table detailing drug sales over time

Modelling Type 2 SCDs The OLAP Layer – An Example DEMO

LOADING SCDs

Loading a Hybrid Type1/Type 2 SCD Our example dimension • Over the next few slides and demos we’ll be looking at a fairly realistic scenario. What are the methods we can use to load an SCD and in this case a hybrid Type 1/ Type 2 SCD • The dimension table I’ll using as my test subject is one I created by using the new Red Gate SQL Generator. • It looks like this • It has 1 million rows • Each method of loading the table will employ the same data “feed”, a SQL table containing • 100, 000 new rows • 100, 000 existing rows • 100, 000 update rows

Loading a Hybrid Type1/Type 2 SCD The SCD Wizard • Out of the box, SSIS comes with the Slowly Changing Dimension task • It allows you to rapidly develop Slowly Changing Dimension handling in a very short amount of time • It has one major drawback though… IT’S RUBBISH!!!

Loading a Hybrid Type1/Type 2 SCD The SCD Wizard • In all but the smallest of tables with very few updates per cycle the task does not perform at all • The reason for this poor performance is twofold: • The Algorithm - column by column comparison between the destination and incoming rows can quickly become incredibly inefficient and a performance bottleneck • Row by Row Processing - When a row has been identified as an update row it is sent to an OLE DB Command which performs a single update statement for that row. All the traditional advantages of SQL set based processing are lost

Loading a Hybrid Type1/Type 2 SCD The SCD Wizard • DEMO

Loading a Hybrid Type1/Type 2 SCD Replacing the SCD Task • Any solution to the performance problem of the SCD wizard must be able to solve the indicated problems, whilst still retaining the accuracy of the logic • Essentially this means that any solution must be able to do the following: • Be able to differentiate between new or existing rows • Be able to remove the need for column by column comparison • Be able to leverage set based rather than row by row based processing

Loading a Hybrid Type1/Type 2 SCD Use MERGE JOINS & HASHING • We can create our own solution using MERGE JOIN tasks that will mimic the behaviour of the SCD Task • We can combine this with a workflow that allows us to use set based rather than row based processing • We can also use CHECKSUM or HASHBYTES to ensure we only have to compare one column

Loading a Hybrid Type1/Type 2 SCD CHECKSUM or HASHBYTES • CHECKSUM and HASHBYTES are both hashing algorithms within TSQL • Used for combining many columns together in one “hash” value and building hash indexes over • For our purposes this is important – it allows us to combine all of the incoming attributes into one column, thus instead of checking multiple columns we only need to check one • There is a problem though, the number of “collisions” with CHECKSUM – the number of times the same hash value will be created by different combinations of values is too high • Microsoft recommends the use of the HASHBYTES function instead, along with the SHA1 algorithm

Loading a Hybrid Type1/Type 2 SCD Use Lookups DEMO

Loading a Hybrid Type1/Type 2 SCD Use Lookups • There are obviously problems with this technique! • HASHBYTES, the VARBINARY(8000) it creates and the function used to transform it is a major bottleneck.. • MERGE JOIN is not a particularly efficient JOIN algorithm

Loading a Hybrid Type1/Type 2 SCD Tweaking the Solution • To get around the second point we can use Lookup tasks • The Lookup task essentially caches a whole table or query in memory at runtime and then compares incoming rows with the result of that cache • This means queries can be very quickly resolved • However, in very large dimensions the MERGE JOIN may perform better due to the memory limitation of the Lookup • As for the first point, in practice due to the size of hashing output we are better off comparing the separate columns (assuming they total less than 8000 bytes!)

Loading a Hybrid Type1/Type 2 SCD Tweaking the Solution DEMO

Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features • In SQL Server 2008 (as you have doubtless seen elsewhere today) there are additions to Transact SQL (TSQL) that are relevant for Slowly Changing Dimension handling • Specifically these are • The MERGE statement • The INSERT from DML output

Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features - MERGE • The MERGE statement performs insert, update, or delete operations on a target table based on the results of a join with a source table. • The MERGE statement provides three types of WHEN clauses: • WHEN MATCHED enables you to UPDATE or DELETE the given row in the target table when the source and target rows match some criteria or criterion. • WHEN NOT MATCHED [BY TARGET] enables you to INSERT a row into the target when it exists in the source but not in the target. • WHEN NOT MATCHED BY SOURCE enables you to UPDATE or DELETE the given row in the target table when it exists in the target but not in the source. • You can specify a search condition with each of the WHEN clauses to choose which type of DML operation should be performed on the row. • The OUTPUT clause for the MERGE statement includes a new virtual column called $action that you can use to identify the DML action that was performed on each row.

Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features - INSERT • In SQL Server 2008, INSERT has the new ability to consume the results of DML statements • If this new capability is combined with the output of a MERGE, we have the ability to perform Type 2 inserts based on Type 2 updates performed in the MERGE • If these pieces of syntax are nested in the right way, then we can perform hybrid updates as we have been doing previously

Loading a Hybrid Type1/Type 2 SCD SQL Server 2008 New TSQL Features - INSERT • DEMO

Summary What have we learned? • Slowly Changing Dimensions and in particular Type 2 SCDs are a powerful tool in tracking history in a data warehouse • There are various ways of modelling Type 2 SCDs in the relational layer • The easiest to understand is the Star Schema approach advocated by Kimball and generally this is the best choice. However in certain scenarios, a Snowflake or Split Dimension technique is more suitable • Whichever method you choose to model, in the OLAP layer the hard work has already been done and the Type 2 model should just drop out • SQL Server 2008 offers several options in loading SCDs • The in built SCD Wizard should only be used for the smallest of dimensions • It is possible to construct a performant alternative using Lookups (or MERGE JOINS for very large dimensions) and set based UPDATE statements • MERGE and INSERT functionality offers a viable alternative for loading in environments where a code based ETL approach has been taken

THE END QUESTIONS?

Implementing Slowly Changing Dimensions in SQL Server 2008: A Comprehensive Guide