1 / 44

# Advanced Dimensional Modelling - PowerPoint PPT Presentation

Advanced Dimensional Modelling. SQLBits 8, 9 th April 2011, Brighton. Vincent Rainardi [email protected] Blog: dwbi1.wordpress.com. Advanced Dimensional Modelling. 1. Dimensions - Structure SCD Type 6 1 or 2 Dimensions When To Snowflake A Dimension with Only 1 Attribute

Related searches for Advanced Dimensional Modelling

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

SQLBits 8, 9th April 2011, Brighton

Vincent Rainardi

Blog: dwbi1.wordpress.com

• 1. Dimensions - Structure

• SCD Type 6

• 1 or 2 Dimensions

• When To Snowflake

• A Dimension with Only 1 Attribute

• Transaction Level Dimension

• 2. Fact Tables

• Fact Table Primary Key

• Snapshotting Transaction Fact Tables

• Aggregate Fact Tables

• Vertical Fact Tables

• 3. Dimensions - Behavior

• Rapidly Changing Dimension

• Very Large Dimensions

• Banding Dimension Rows

• Stamping Dimension Rows

• Dimensions with Multi Valued Attributes

• 4. Combinations

• Real Time Fact Table

• Dealing with Currency Rates

• Dealing with Status

• 4 sections: 2 dims, 1 fact, 1 combi. Lots of material, may not able to finish.

• 44 slides, some slides we may have to touch lightly.

• Questions between sections, available after.

1/2

• SCD Type 6 is a combination of Type 1, 2 & 3

• e.g. type 2 + type 1 : DimAccount (telco example)

• 6 = 1 + 2 + 3 (Ref: Ross & Kimball 2005, Wikipedia)

http://www.rkimball.com/html/articles_search/articles%202005/0503IE.html

http://en.wikipedia.org/wiki/Slowly_changing_dimension

2/2

• Used for “As Was” reporting

• e.g. balances by tariff (price plan) at the end of last year,if the customers were on today’s tariff.

Fact

“Type 12”

Dim

Natural Key

1 or 2 dimensions

1/4

a) One Dimension

b) Two Dimensions

Fact

Table

DimAccount

Fact

Table

DimAccount

customerattributes

DimCustomer

• We can get the customer attributes without knowing the account key

• Disadvantage: can’t go from account to customer without going through the fact table - performance

• Simplicity, 1 dim

• Hierarchy from customer attribute &account attribute

• Use when we don’t have fact tables requiring customer grain.

1 or 2 dimensions

2/4

c) Snowflake

Fact

Table

DimCustomer

DimAccount

• Dim customer is needed by another fact table

• Modular: 2 separate dim tables but we can combine them easily to create a bigger dimension

• To get the breakdown of a measure by a customer attribute is a bit more complicated than a)

select c. attribute, sum(f.measure1) from fact1 finner join dim_account a on f.account_key = a.account_keyinner join dim_customer c on a.customer_key = c.customer_keygroup by c. attribute

1 or 2 dimensions

3/4

d) Two Dimensions with inter-dimension link

• Try to fix weakness on b and c:

• We can “go” direct from account dim to customer dim

• We can access dim customer directly from the fact table.

Fact

Table

DimAccount

DimCustomer

Weakness: maintain customer key in 2 places: fact table and dim account.

a.k.a. “Star with a Back Door”

1 or 2 dimensions

4/4

e) One Dimension with Customer Key

Fact

Table

Fact

Table

DimAccount

Try to fix weakness of a:

unable to build a fact table with grain = customer.

Add a column in dim account: customer key

Not as popular as c) and d) in solving Dim Customer issue. It is “indecisive” :

trying to create Dim Customer but doesn’t want to create Dim Customer.

Disadvantage: Dim Customer is hidden inside Dim Account, making it:

a) more difficult to maintain (especially for a type 2), and

b) less modular/flexible

1/3

1. When the sub dim is used by several dims

City-Country-Region columns exist in DimBroker, DimPolicy, DimOffice and DimInsured

Replaced by Location/GeoKey pointing to DimLocation / DimGeography

Advantage: consistent hierarchy, i.e. relationship between City, Country & Region.

Weakness: we would lose flexibility. City to Country are more or less fixed, but the grouping of countries might be different between dimensions.

2/3

2. When the sub dim is used by both the main dim and the fact table(s)

• DimCustomer is used in DimAccount, and is also used in the fact table.

• DimManufacturer is used in DimProduct, and is also used in the fact table.

• DimProductGroup is used in DimProduct, and is also used in some fact table.

The alternative is maintaining two full dimensions (star classic).

3/3

3. To make “base dim” and “detail dim”

4. To enrich a date attribute

Insurance classes, account types (banking), product lines, diagnosis, treatment (health care)

Policies for marine, aviation & property classes have different attributes.

Pull common attributes into 1 dim: DimBasePolicy

Put class-specific attributes into DimMarine, DimProperty, DimAviation

Ref: Kimball DW Toolkit 2nd edition page 213

Month, Quarter, Year, etc.

Like #1, a sub dim used by several dims.

1/2

Reasons for putting single attribute in its own dim:

• Keep fact table slim  (4 bytes int not 100 bytes varchar)

• When the value changes, we don’t have to update the BIGfact table – ETL performance

• Grain is much lower than fact table – small dim

• Yes it’s only 1 attribute today, but in the future there could be another attribute. Could become a junk dim.

Should we put the attribute in the fact table? (like DD = Degenerate Dim)

Probably, if the grain = fact table,and it’s short or it’s a number.

2/2

Exception: snapshot month (or day/week/quarter)

Snapshot month is used in periodic snapshot fact table. Snapshot month is in the form of an integer (201104 for April 2011). Doesn’t violate the 3 points above.

• It is an integer, not char(6).

• The value never changes, April 2011 will be April 2011 forever

• There will not be other attributes in the dim

1/5

A dim with grain = the transaction fact table

Transaction, not accumulative or periodic snapshot

Examples:

• IT Helpdesk DW: Dim Ticket

• Telco DW: Dim Call

TransactionLevel Dim

Most granular event in any business process

2/5

Query PerformanceDD columns are moved to a dim, away from the heavy traffic in fact tables. DW queries don’t touch those DD columns unless they need to– performance. DD attributes totalling 30 bytes, replaced by 4 bytes int column. Slimmer fact table, better for queries.

Periodic Snapshot Fact TableFor periodic snapshot fact table, saving is even greater. Monthly snapshot fact, 10 years / 120 months. Rather than specifying the DDs repeatedly 120x, they are specified once in the transaction dim. All that is left on the fact table is a slim 1 intcol: the transaction key.

3/5

Some fact tables have grains greater than the transactionA payment from a customer is posted into 4 accounts in the GL fact table. That single financial transaction becomes 4 fact rows but only has 1 row in the trans dim. Fact table with 10m rows, trans dim only 3 million rows.

Related TransactionsSome transactions are related, e.g. in retail, a purchase of a kitchen might need to be created as 2 related orders, because the worktop is made-to-order. Rather than creating a ‘related order’ column on the fact tables, it might be better (depends on how it’s used) to create it on the trans dim because: a) an order can consist of many fact rows (1 row per item) so the “related order number” will be duplicated across these fact rowsb) slimmer fact tablec) the transaction could be on many fact tables, not only one.

4/5

• Transaction fact table and the grain of the trans dim = grain of the fact table, and only 1 DD column: perhaps better leave the DD in the fact table. Not a lot of space/speed gain by putting it on trans dim.

• Mart/DW only used for SSAS: there is little point of having trans dim physically. In SSAS we can create the transaction dimension “on the fly” from the fact table (“fact dimension”).

• Using trans dim to put attributes as opposed to put them in the main dimensions, with the argument of: that’s the value of the attribute when the transaction happened – this is not right, use type 2 SCD for this.

Main

Acct type

Trans

Location

5/5

• Any dim with grain = fact table (like trans dim) is questionableDo we really need this dim at this grain? Perhaps it should be divided into several dims instead?

• A dim with grain = fact table - potential performance issue (unless the fact table is small). e.g. fact table = 10m rows, trans dim = 10m rows. Joining 10m to 10m potentially slow, especially if the physical ordering of the trans dim is not the joining column.

• 1. Dimensions - Structure

• SCD Type 6

• 1 or 2 Dimensions

• When To Snowflake

• A Dimension with Only 1 Attribute

• Transaction Level Dimension

• 2. Fact Tables

• Fact Table Primary Key

• Snapshotting Transaction Fact Tables

• Aggregate Fact Tables

• Vertical Fact Tables

• 3. Dimensions - Behavior

• Rapidly Changing Dimension

• Very Large Dimensions

• Banding Dimension Rows

• Stamping Dimension Rows

• Dimensions with Multi Valued Attributes

• 4. Combinations

• Real Time Fact Table

• Dealing with Currency Rates

• Dealing with Status

25%

Time, Questions

1/3

Should we have a PK?

Yes, if we need to be able to identify each fact row

• Need to refer to a fact row from another fact row e.g. chain of events

• Many identical fact rows and we need to update/delete only one

• To link the fact table to another fact table

Some experts totally disagree

Uniqueness

Related Trans

PK

PK

FK

PK

FK (no RI)

(not enforced)

previous/next transaction

2/3

Single or Multi Column?

Single Column: Generated Identity Multi Column: Dimension Keys

Single-column PK is better than multi-column PK because :

1) A multi-column PK may not be unique. A single-column PK guarantees that the PK is unique, because it is an identity column.

2) A single-column PK is slimmer than a multi-column PK, better query performance. To do a self join in the fact table (e.g. to link the current fact row to the previous fact row), we join on a single integer column.

3/3

• Advantage: Prevent duplicate rows, query performance

• Indexing the PK: cluster or not?

• Cluster the PK if: the PK is an identity column

• Don’t cluster the PK if: the PK is a composite, or when you need the cluster index for query performance (with partitioning)

Example of not having a PK

If duplicate fact rows are allowed.

e.g. retail DW: Store Key, Date Key, Product Key, Customer Key

Same customer buying the same milk in the same shop on the same day twice

--- Order Line ID as DD to make it unique (not all EPOS has it)

1/1

• Potentially huge – billions rows

• Only take what you need

• Smart date key/month, e.g. 20110409

• Monthly or daily

• Daily (4 wk), Weekly (1 yr), Monthly (10 yr)

• Purging & Archiving

• Index/partition on snapshot date

Trans

Staging

Snapshot

1/2

What are they?

• High level aggregation of base fact tables

• A “select group by” query on a 2 billion rows fact table can take 30 mins if it joins with two big fact tables, even with indexes in place

• So we do this query in advance as part of the DW load and store it as an Aggregate Fact Table

• The report only takes 1 second to run.

Base Fact Tables

30 mins

Aggregate

Fact Table

1 sec

Report

2/2

What For?

• For report performance (group by is costly)

• BO: aggregate aware

• Not SSAS: aggregate in cubes, not tables

• Best to load from staging (at the same time as loading the main fact table) not from the main fact table (this would be working 2x)

• Partition for data distribution or narrow query

• Indexing: by the main dim keys

1/1

• Normalised

• 1 measure column

• The meaning of that measure column depends on “measure type” column

• Used for Finance/GL mart

• Advantage: flexibility: using accounts, balance, Dr Cr

“Normal”

Fact Table

many measures

(actual & budget)

Measure Type

Dim Key

Vertical

Fact Table

1 measure

• 1. Dimensions - Structure

• SCD Type 6

• 1 or 2 Dimensions

• When To Snowflake

• A Dimension with Only 1 Attribute

• Transaction Level Dimension

• 2. Fact Tables

• Fact Table Primary Key

• Snapshotting Transaction Fact Tables

• Aggregate Fact Tables

• Vertical Fact Tables

• 3. Dimensions - Behavior

• Rapidly Changing Dimension

• Very Large Dimensions

• Banding Dimension Rows

• Stamping Dimension Rows

• Dimensions with Multi Valued Attributes

• 4. Combinations

• Real Time Fact Table

• Dealing with Currency Rates

• Dealing with Status

50%

Time, Questions

1/1

• Why is it a problem

• Large SCD2 dim – Attributes change every day

• Slow query when join with large fact tables

• What to do

• Put into a separate dim, link direct to fact table.

• Just store the latest, type 1 attributes (or dual)

• Store in the fact table (for small attribute, e.g. indicator)

Type2

Type2

Type2

Type2

Type1

Very Large Dimension

1/2

Why is it a problem

• SSAS: 4 GB string store limit for dimension

• SSAS: dim is “select distinct” on each attribute – long processing time

• “Valid date” join on SCD2 for as was

• Usually customer dim where the “quality stamp” changes daily or because of high number of distinct values

• Difficult to browse high cardinality attribute

• Join with fact tables – performance

2/2

What to do

• Split into 2 dims, same grain. Always cut vertically.

• Remove SCD2, or at least only certain columns.

• Most common: separate the attributes with high cardinality/change frequency

• Bucketing/banding, group values into ranges

VLD

1/1

• It is grouping numerical values (numerical attributes, not measure) into several bands, e.g. engine size, distance from station, amount purchased (last complete year).

• Benefits: easier for analysis & reporting, comparing between categories.

• Issue/problem: limit e.g. bucketing criteria1 hour to implement, 3 months to argue

1/1

Calculate internally or buy data from outside

• Customer categories (loyalty programme) e.g. A, B, C of customer class.

• To reflect c0nsumer interest on the product (product categorisation based on customer interest level)

• Any other dates or measures summarized as stamped attribute, i.e. “new customer”, “big spender”, or results from recommendation analysis/algorithm e.g. customer behaviour based on previous purchases.

• Used for analysis / reporting

“Stamped”

Attributes

1/4

What is a Multi Valued Attribute?

An attribute which has more than 1 value per dimension row.

MV Attribute or MV Dimension?

• MV Dim = For each fact row there could be more than 1 dimension row

Why do I need to know this?

• To be able to model it

• If wrong, difficult at BI/report

2/4

Approaches to deal with MV Attributes

1. Lower the grain of the dim

After

Before

Fact table requires that the product dimension is at Product Code grain, e.g. no sales info per size, but only per product code.

2. Put the MV attributes in a separate dim, link direct to the fact table

Before

After

Often we don’t have the allocation information e.g. 50-50 or 30-70, we only know that product1has 2 sizes

3/4

3. Use a bridge table to link the 2 dims

Fact Table

Dim Size

Bridge Table

Dim Product

4. Have several columns in the dim for that attribute

If the number of attributes is small and fixed, this is a popular approach. But if the number of attributes is large (e.g. >10) or if it’s variable (e.g. sometimes 2, sometimes 20), approach 2 and 3 above are more popular, and more appropriate.

4/4

5. Put the attribute in a snowflake sub dim

We can’t really do this, as it is 1 to many (1 row in the main dim corresponds to many rows in the sub dim). So we need a bridge table, which brings us back to approach 3.

6. Keep in one column using delimiters

e.g. “Small|Medium″. A crazy idea. More flexible than having several columns (approach 4) and simpler than approach 3 or 2.

If the purpose of the attribute is “display only” on a report (rather than analyse or slice & dice), there is an argument for using this approach, particularly if the number of attributes is small (e.g. 1 to 4).

• Dimensions - Structure

• SCD Type 6

• 1 or 2 Dimensions

• When To Snowflake

• A Dimension with Only 1 Attribute

• Transaction Level Dimension

• Fact Tables

• Fact Table Primary Key

• Snapshotting Transaction Fact Tables

• Aggregate Fact Tables

• Vertical Fact Tables

• Dimensions - Behavior

• Rapidly Changing Dimension

• Very Large Dimensions

• Banding Dimension Rows

• Stamping Dimension Rows

• Dimensions with Multi Valued Attributes

• Combinations

• Real Time Fact Table

• Dealing with Currency Rates

• Dealing with Status

75%

Time, Questions

1/1

• Reporting the transaction system in real time

• View to union with the normal fact table, or use partitions

• Freezing the dims for key lookup, -3 unknown key

• Key corrections next day

Dims as of

yesterday

Main partition

(up to last night)

Unknown keys:

-1 null in source

-2 not in dim table

-3 not in dim table as dim was frozen to be resolved next batch

Real time partition

dimkey

1/3

What for/background/requirements

• Report in 3 reporting currencies, using today rates or past

• Analyse over time without the impact of currency rates (using fixed currency rates, e.g. 2010 EOY rates)

• Had the transactions happened today

• Currency rates historical analysis

Reporting

Currency

DW

Currency

Transaction

Currency

Transaction

Rates

Reporting

Rates

100 countries

40 currencies

3-4 currencies

GBP, USD, EUR,

Original

1 currency

(many transaction

dates)

( 1 reporting

date)

e.g. GBP

2/3

Approaches

• Store in original currencies, convert to DW currency at runtime.Or convert at load, store in DW currency – inaccuracy. Or store in both original and DW currency

• Currency rate fact table (date, currency, rate)Or store rates in the fact table

• On report/cube: date input at run time (default = today)

Fact Tables

FX Fact Table

Rate

3/3

in original currency, DW currency or both

Concept of FX Rate Type/Profile

1/2

What/background

• Workflow (policies, contracts, documents)

• Bottleneck analysis (no of days between stages)

• How many on each stage

Status 1

Status 2

Status 4

Status 6

date2

date3

date4

date1

Status 3

Status 5

2/2

Approaches

• Accumulative Snapshot Fact, 1 row per application

• SCD2 on DimApp

• App Status fact table

• Email: [email protected]

• Blog: http://dwbi1.wordpress.comCovers many of the topics in this presentation

• This PowerPoint: in my blog, scroll to bottom, click on “SQLBit8”

• Special thanks to Guang Ming Xing and Simon Jensen who helped reviewing this presentation and provided useful comments (doesn’t mean that they agree with the content)