Dimensional Data Modeling

1. Dimensional Data Modeling Lecture 2 � Dimensional Modeling Fundamentals

2. 2 When to Begin Modeling We require to major inputs to create our dimensional model: Information requirements from the requirements analysis phase Source Data Analysis results

3. 3 Bus Architecture Monolithic approach does not work (Why?) �Stovepiped� independent marts don�t work (Why?) Kimball recommends an approach that is a compromise between these two extremes called the �Bus Architecture�: Start with a planning/architecture phase for entire warehouse Build data marts, in a sequence dictated by requirements, but in an architected fashion

4. 4 Bus Architecture Kimball�s Bus Architecture is based on conformed dimensions and conformed facts Conformed Dimension: A dimension has the same meaning for each fact that references it. One of the major responsibilities of centralized data warehouse design is development of conformed dimensions

5. 5 Bus Architecture � Conformed Dimensions Building Conformed dimensions can be a very big deal. Consider the customer dimension:

6. 6 Bus Architecture - Conformed Dimensions Customer data may need to assembled from a variety of operational systems. To create this dimension, may need to: Choose �best� information from that available (which system provides the best mailing address? Is this always true?) Clean up information (addresses without Postal Codes, etc) Conformed dimensions are the most important aspect of the bus architecture and must be implemented for warehouse to be successful

7. 7 Bus Architecture � Conformed Dimensions Conformed dimensions give the user the ability to Perform analysis across facts Consistently definition of attributes, enhancing users ability to understand the data in the warehouse. For example, everyone agrees on what �product� dimension is Consistent rollup of data across facts

8. 8 Bus Architecture � Conformed Dimensions Design considerations: Need to define the lowest �grain� of each dimension, e.g., Customer: an individual Product: Depends � is it the product sold? What about products that have sub-assemblies (computers) Date: Day? Time: Hour? Minute? Second? Assign surrogate keys. What to avoid any dependence on source system

9. 9 Surrogate Keys Assign Surrogate keys because: Production may reuse keys that it has purged Production may make a mistake and reuse a key even when it isn�t supposed to: UPCs Production may recompact its key space and reassign keys Production may legitimately overwrite some part of a product description or a customer description with new values but not change the product key or the customer key to a new value..

10. 10 Surrogate Keys Your company has just made an acquisition, and you need to merge more than a million new customers into the master customer list. You will now need to extract from two production systems, but the newly acquired production system has nasty customer keys that don�t look remotely like the others. Production may generalize its key format to handle some new situation in the transaction system. Now the production keys that used to be integers become alphanumeric. Or perhaps the 12-byte keys you are used to have become 20-byte keys.

11. 11 Surrogate Keys You can assign a 4 byte (typical) integer as the surrogate key which: Saves storage space (this is an issue in a data warehouse where there are billions of rows) Increases join performance � the smaller the key, the faster the join Isolates data warehouse design from operational system issues

12. 12 Bus Architecture � Conformed Facts Need consistent definition of each type of fact. Means that definition of things like �revenue� are consistent throughout the warehouse Also means that calculation of the fact (if required) is always the same. Example: Revenue = Gross Revenue � customer-specific discount � promotional discount Profit = Revenue � fixed cost � variable cost Units of measure for each fact must be consistent as well Product quantity � cases vs individual units

13. 13 Bus Architecture � Conformed Facts Where there are different interpretations required � create separate facts Calendar Month-End Revenue vs Fiscal Month-End Revenue Granularity: The lowest grain of each fact should correspond to the lowest grain of the dimensions associated with the fact This approach provides a great deal of flexibility in terms of how the data can be aggregated and used over time

14. 14 Dimensional Modeling Techniques Basic concept: most business data can be represented in a �cube� structure where cells are measures (facts) and edges are dimensions More than 3 dimensions allowed (hypercube) Typical cubes � 4 to 15 dimensions (based on industry experience

15. 15 Modeling Techniques Facts: Typically, facts are observations gleaned from business transactions Generally numeric, although there may be some textual facts Attributes: Usually text and describe a characteristic of a dimension. Example: product description, geography description, etc. Dimensions: Organize the attributes Designer discretion which dimensions get created to organize the attributes

16. 16 Dimensions Important to note that within a dimension there may be multiple logical hierarchies

17. 17 Dimensions Typically, Dimensions are denormalized into a single table for the dimension. This done to: Simplify the dimension for the user Speeds retrieval � eliminate some joins (remember: we constrain first on the dimensions, then �attack� the fact tables to retrieve just those fact records required) Allows use of specialized data warehouse oriented indexes, bit map, star join etc. When a dimension is not denormalized, this is called snowflaking. Permitted, but not recommended Benefit: space saving

18. 18 Snowflake vs Denormalized Dimension

19. 19 Dimension Attributes Typically, the Attributes of a dimension are employed by users to constrain query: Show me all sales for 3 horsepower mowers in Nova Scotia for June, 2001 As such, the attributes must be readable and understandable Warehouse team must ensure the quality of the attributes by: Eliminating descriptions that are codes (3HPMWR) to represent 3 Horsepower Mower Eliminating multi-valued descriptions (Green Paper Towel 3OZ 2 Ply) � create separate attributes

20. 20 Dimension Attributes Goal of the designer is to make all attributes Verbose (use ful words) Descriptive Complete � no missing values Quality assured Indexed, for those most heavily used Equally available (single dimension table) Documented in metadata

23. Bit-Map Index

24. 24 Slowly Changing Dimensions One major difference between an OLTP system and a data warehouse is the ability to accurately describe the past. OLTP systems are usually very poor at correctly representing a business as of a month or a year ago. The large volume of data in an OLTP system is typically purged every 90 to 180 days. it is difficult for an OLTP system to correctly represent the past. In an OLTP system, do you really want to keep old order statuses, product descriptions, supplier descriptions, and customer descriptions over a multiyear period?

25. 25 Slowly Changing Dimensions The data warehouse must accept the responsibility of accurately describing the past. By doing so, the data warehouse simplifies the responsibilities of the OLTP system. Not only does the data warehouse relieve the OLTP system of almost all forms of reporting, but the data warehouse contains special structures that have several ways of tracking historical data.

26. 26 Slowly Changing Dimensions There are three main techniques for handling slowly changing dimensions in a data warehouse: overwriting creating another dimension record creating a current value field. Each technique handles the problem differently. The designer chooses among th ese techniques depending on the users' needs.

27. 27 Overwriting The first technique is the simplest and fastest. But it doesn't maintain past history! Overwriting is frequently used when the data warehouse team legitimately decides that the old value of the changed dimension attribute is not interesting . For example, if you find incorrect values in the city and state attributes in a customer record, then overwriting would almost certainly be used.

28. 28 Create a new dimension record The second technique is the most common and has a number of powerful advantages. Example: Suppose you work in a manufacturing company and one of your main data warehouse schemas is the company's shipments. The product dimension is one of the most important dimensions in this dimensional schema. A typical product dimension in a shipments schema would have several thousand detailed records, each representing a distinguishable product capable of being shipped. A good product dimension table would have at least 50 attributes describing the products, including hierarchical attributes such as brand and category, as well as nonhierarchical attributes such as flavor and package type. An important attribute provided by manufacturing operations is the SKU number assigned to the product. You should start by using the SKU number as the key to the product dimension table.

29. 29 Create a New Dimension Record This technique for tracking slowly changing dimensions is very powerful because new dimension records automatically partition history in the fact table. The old version of the dimension record points to all history in the fact table prior to the change. The new version of the dimension record points to all history after the change. Another advantage of this technique is that you can gracefully track as many changes to a dimensional item as you wish. Each change generates a new dimension record, and each record partitions history perfectly.

30. 30 Create a new Dimension Record Suppose that manufacturing operations makes a slight change in packaging of SKU #38, and the packaging description changes from "glued box" to "pasted box." Along with this change, manufacturing operations decides not to change the SKU number of the product, or the bar code (UPC) that is printed on the box. If the data warehouse team decides to track this change, the best way to do this is to issue another product record, as if the pasted box version were a brand new product. The only difference between the two product records is the packaging description.

31. 31 Create a New Dimension Record

32. 32 Create a Current Value Field You use the third technique when you want to track a change in a dimension value, but it is legitimate to use the old value both before and after the change. This situation occurs most often in sales force realignments, where although you have changed the names of your sales regions, you still have a need to state today's sales in terms of yesterday's region names, just to "see how they would have done" using the old organization. You can attack this requirement, not by creating a new dimension record as in the second technique, but by creating a new "current value" field.

33. 33 Create a Current Value Field Suppose in a sales team dimension table, where the records represent sales teams, you have a field called "region." When you decide to rearrange the sales force and assign each team to newly named regions, you create a new field in the sales dimension table called "current_region." You should probably rename the old field "previous_region." These two fields now allow an application to group all sales fact records by either the old sales assignments (previous region) or the new sales assignments (current region).

34. 34 Create a Current Value Field This schema allows only the most recent sales force change to be tracked, but it offers the immense flexibility of being able to state all of the history by either of the two sales force assignment schemas. It is conceivable, although somewhat awkward, to generalize this approach to the two most recent changes. If many of these sales force realignments take place and it is desired to track them all, then the second technique should probably be used.

35. 35 Examples - technique Every 6 months, the GL Account associated with �Revenue� changes. Assume we do reporting by GL Account Dimension We determine that the Postal Code associated with Vendors has been incorrectly entered The geographic region that our customer is assigned is changed due to the acquisition of another company. We want to do analysis based on new and old geographies

Dimensional Data Modeling

Dimensional Data Modeling

Presentation Transcript

Dimensional Modeling

Dimensional Modeling

Dimensional Modeling

Dimensional Modeling

Three Dimensional Modeling Transformations

Dimensional Modeling Primer

Successful Dimensional Modeling of Very Large Data Warehouses

Dimensional Modeling

Kimball Method Dimensional Modeling

3-Dimensional Modeling

MIS2502: Data Analytics Dimensional Data Modeling

Principles of Dimensional Modeling

Introduction To Dimensional Modeling

3-Dimensional Modeling

MIS2502: Data Analytics Dimensional Data Modeling

More Relational Dimensional Modeling

Dimensional Modeling

3-Dimensional Modeling

Advanced Dimensional Modeling Concepts

Three-Dimensional Modeling

High Dimensional Data