session 12 13 data management for decision support l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Data Models for Warehouse PowerPoint Presentation
Download Presentation
Data Models for Warehouse

Loading in 2 Seconds...

play fullscreen
1 / 61

Data Models for Warehouse - PowerPoint PPT Presentation


  • 102 Views
  • Uploaded on

Session-12/13 Data Management for Decision Support . Data Models for Warehouse. Data Models. Data Models relations stars & snowflakes cubes Operators slice & dice roll-up, drill down pivoting other. Data Models.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Data Models for Warehouse' - zoie


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
session 12 13 data management for decision support
Session-12/13

Data Management for Decision Support

Data Models for Warehouse

data models
Data Models
  • Data Models
    • relations
    • stars & snowflakes
    • cubes
  • Operators
    • slice & dice
    • roll-up, drill down
    • pivoting
    • other
data models3
Data Models
  • Star schemas are database schemas that exploit the structure of data for decision support query
    • Queries in DSS tend to
      • Examine a set of factual transactions- POS, Customer events
      • Facts are analyzed in variety of ways - POS transaction by week, or store
    • For example a retail store
      • POS is at the center
      • Product information - SKU, hierarchy of ( section dept, BU)
      • Time information - day, week, month, year
      • Stores - Store-id, hierarchy (regions, city, locality)
      • Suppliers- Sup-id, location, discounts
data models4
Data Models

Products

Time

Sales Transactions

Stores

Suppliers

Information is split between two classes- Factual information and Reference information

fact data
FACT DATA
  • Fact data records the information on factual event that occurred in the business- POS, Phone calls, Banking transactions
  • Typically 70% of Warehouse data is Fact data
  • Important to identify and define structure right in the first place as restructuring is an expensive process
  • Detail content of FACT is derived from the business requirement
  • Recorded Facts do not change as they are events of past
dimension data
Dimension Data
  • Information that is used for analyzing the elemental data, for example, product hierarchy, time periods, customers, stores
  • It is the reference data used for analysis of Facts
  • Organizing the information in separate reference tables offers better query performance
  • It differs from Fact data as it changes over time, due to changes in business, reorganization
  • It should be structured to permit rapid changes
fact and dimensions
FACT and Dimensions
  • Tens to millions of rows
  • One primary key
  • Textual decription
  • Frequently modifies
  • Millions to billions of rows
  • Multiple foreign keys
  • Numeric
  • Does not change
decision support queries
Decision Support Queries
  • Examples
  • Average number of sales of Haldiram per store over last month (various types within the brand)
  • Projected sales of Deepavali gift packs against the actual
  • The top 20% customers (spending) over last quarter
  • The customers with average balance in excess of Rs. 25000 for past one year
  • ==> Each of these queries is based on Factual data
decision support queries9
Decision Support Queries
  • Examples

Sales of Haldiram

Customer Spend

Account Balance

POS Transaction

Membership card Transaction

Account transactions

Quantity Sold

Product

Store

Date, Time

Revenue Realized

Customer-Id

Store

Transaction Value

Date and Time

Customer

AC number

type of transaction

amount

star schema
Star Schema
  • The star schema is a data-modeling technique used to map multidimensional decision support into a relational database.
  • Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.
  • Four Components:
    • Facts
    • Dimensions
    • Attributes
    • Attribute hierarchies
star schema12
Star Schema
  • Facts
    • Facts are numeric measurements (values) that represent a specific business aspect or activity.
    • The fact table contains facts that are linked through their dimensions.
    • Facts can be computed or derived at run-time (metrics).
  • Dimensions
    • Dimensions are qualifying characteristics that provide additional perspectives to a given fact.
    • Dimensions are stored in dimension tables.
identifying facts and dimensions
Identifying Facts and Dimensions

Elemental Transaction

Determine Key Dimensions

Check if Fact is a dimension

Check if dimensions is a Fact

identification step 1
Identification: Step 1
  • Examine the enterprise model and identify the transaction that or of interest- driven by business requirement analysis
  • These will be transaction that describes events fundamental to the business e.g., #calls for Telecom, account transactions in banking
  • For each potential Fact ask a question- Is this information operated upon by business process? Daily sales versus POS, even if system reports daily sales POS may be the FACT
  • The limit of current recording should not influence Warehouse design
identification step 115
Identification: Step 1

Fact Table

POS Transaction

Stock movement and position

Customer events

Account transactions

Claims and receipts

Call events

Customer events(install, disconnect, payment)

Sector and Business

Retail

Sales

Shrinkage

Retail Banking

Customer profiling

Profitability

Insurance

Product Profitability

Telecom

Call Analysis

Customer Analysis

identification step 2
Identification: Step 2
  • Look at the logical model to find the entities associated with entities in the fact table. List out all such logically associate entities.
  • These are candidate References, the task is to find key dimension entities that may not be directly associated.
  • For example, retail banking account transaction are candidate fact table. The account transaction is candidate reference. But, the customer I indirectly related to transaction. Although, a better choice.
  • Analyze account transaction by account?
  • Analyze how customers use our services?
  • You store both relationships but customer becomes a dimension
identification step3
Identification: Step3

FACT is not actually a denormalized dimension table

  • Consider the following:
    • house-details
    • Cable-laid
    • Sales-persons visit
    • connected to the service
    • promotional material sent
    • subscription cancelled
  • Home-details - candidate fact
  • Operational events
  • Report on number of connections quarter-to-date
  • Time-lag between laying and subscrition
identification step 4
Identification: Step 4

Dimension is not a FACT

  • Lot depends on DSS requirements-
    • Customer can be FACT or Dimension
    • Promotions can be fact or dimensions
  • Ask questions using other dimensions- Using how many other dimensions, Can I view this entity.
    • Can I view promotion by Time?
    • Can I view promotions by product?
    • Can I view promotion by store?
    • Can I vie promotions by suppliers?
  • If answer to these question is yes, then it is a FACT
star schema19
Star Schema
  • Attributes
    • Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts.
    • Dimensions provide descriptive characteristics about the facts through their attributes.

Possible Attributes For Sales Dimensions

star schema22
Star Schema
  • Attribute Hierarchies
    • Attributes within dimensions can be ordered in a well-defined attribute hierarchy.
    • The attribute hierarchy provides a top-down data organization that is used for two main purposes:
      • Aggregation
      • Drill-down/roll-up data analysis
star schema25
Star Schema
  • Star Schema Representation
    • Facts and dimensions are normally represented by physical tables in the data warehouse database.
    • The fact table is related to each dimension table in a many-to-one (M:1) relationship.
    • Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.
the multi dimensional model
“Sales by product line over the past six months”

“Sales by store between 1990 and 1995”

The Multi-Dimensional Model

Store Info

Key columns joining fact table

to dimension tables

Numerical Measures

Prod Code Time Code Store Code Sales Qty

Fact table for

measures

Product Info

Dimension tables

Time Info

. . .

dimensional modeling
Dimensions are organized into hierarchies

E.g., Time dimension: days  weeks  quarters

E.g., Product dimension: product  product line  brand

Dimensions have attributes

Dimensional Modeling
dimension hierarchies
Dimension Hierarchies

Store Dimension

Product Dimension

Total

Total

Region

Manufacturer

District

Brand

Stores

Products

rolap dimensional modeling using relational dbms
ROLAP: Dimensional Modeling Using Relational DBMS
  • Special schema design: star, snowflake
  • Special indexes: bitmap, multi-table join
  • Special tuning: maximize query throughput
  • Proven technology (relational model, DBMS), tend to outperform specialized MDDB especially on large data sets
  • Products
    • IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
molap dimensional modeling using the multi dimensional model
MOLAP: Dimensional Modeling Using the Multi Dimensional Model
  • MDDB: a special-purpose data model
  • Facts stored in multi-dimensional arrays
  • Dimensions used to index array
  • Sometimes on top of relational DB
  • Products
    • Pilot, Arbor Essbase, Gentia
the classic star schema
The “Classic” Star Schema
  • A single fact table, with detail and summary data
  • Fact table primary key has only one key column per dimension
  • Each key is generated
  • Each dimension is a single table, highly denormalized

Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata

Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem

the classic star schema37
The “Classic” Star Schema

The biggest drawback: dimension tables must carry a level indicator for every record and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district will be pulled from the fact table, resulting in error.

Example:

Select A.STORE_KEY, A.PERIOD_KEY, A.dollars from Fact_Table A

where A.STORE_KEY in (select STORE_KEY

from Store_Dimension B

where region = “North” and Level = 2)

and etc...

Level is needed

whenever aggregates

are stored with detail

facts.

the level problem
The “Level” Problem
  • Level is a problem because because it causes potential for error. If the query builder, human or program, forgets about it, perfectly reasonable looking WRONG answers can occur.
  • One alternative: the FACT CONSTELLATION model...
the fact constellation schema
The “Fact Constellation” Schema

District Fact Table

Region Fact Table

District_ID

PRODUCT_KEY

PERIOD_KEY

Region_ID

PRODUCT_KEY

PERIOD_KEY

Dollars

Units

Price

Dollars

Units

Price

the fact constellation schema40
The “Fact Constellation” Schema

In the Fact Constellations,

aggregate tables are created

separately from the detail, therefor

it is impossible to pick up, for

example, Store detail when querying

the District Fact Table.

Major Advantage: No need for the “Level” indicator in the dimension tables,

since no aggregated data is stored with lower-level detail

Disadvantage:Dimension tables are still very large in some cases, which can slow performance; front-end must be able to detect existence of aggregate facts, which requires more extensive metadata

another alternative to level
Another Alternative to “Level”
  • Fact Constellation is a good alternative to the Star, but when dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay.
  • An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table, the “Snowflake Schema” ...
the snowflake schema
The “Snowflake” Schema

Store Dimension

STORE KEY

District_ID

Region_ID

Store Description

City

State

District ID

District Desc.

Region_ID

Region Desc.

Regional Mgr.

District Desc.

Region_ID

Region Desc.

Regional Mgr.

Store Fact Table

District Fact Table

RegionFact Table

Region_ID

PRODUCT_KEY

PERIOD_KEY

District_ID

PRODUCT_KEY

PERIOD_KEY

STORE KEY

PRODUCT KEY

Dollars

Units

Price

PERIOD KEY

Dollars

Units

Price

Dollars

Units

Price

the snowflake schema43
The “Snowflake” Schema
  • No LEVEL in dimension tables
  • Dimension tables are normalized by decomposing at the attribute level
  • Each dimension table has one key for each level of the dimensionís hierarchy
  • The lowest level key joins the dimension table to both the fact table and the lower level attribute table

How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.

the snowflake schema44
The “Snowflake” Schema
  • Additional features: The original Store Dimension table, completely de-normalized, is kept intact, since certain queries can benefit by its all-encompassing content.
  • In practice, start with a Star Schema and create the “snowflakes” with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table.

Advantage: Best performance when queries involve aggregation

Disadvantage: Complicated maintenance and metadata, explosion in the number

of tables in the database

advantages of rolap dimensional modeling
Advantages of ROLAP Dimensional Modeling
  • Define complex, multi-dimensional data with simple model
  • Reduces the number of joins a query has to process
  • Allows the data warehouse to evolve with rel. low maintenance
  • HOWEVER! Star schema and relational DBMS are not the magic solution
    • Query optimization is still problematic
aggregates
Aggregates
  • Add up amounts for day 1
  • In SQL: SELECT sum(amt) FROM SALE

WHERE date = 1

81

aggregates47
Aggregates
  • Add up amounts by day
  • In SQL: SELECT date, sum(amt) FROM SALE

GROUP BY date

another example
Another Example
  • Add up amounts by day, product
  • In SQL: SELECT date, sum(amt) FROM SALE
  • GROUP BY date, prodId

rollup

drill-down

aggregates49
Aggregates
  • Operators: sum, count, max, min, median, ave
  • “Having” clause
  • Using dimension hierarchy
    • average by region (within store)
    • maximum by month (within date)
rolap vs molap
ROLAP vs. MOLAP
  • ROLAP:Relational On-Line Analytical Processing
  • MOLAP:Multi-Dimensional On-Line Analytical Processing
the molap cube
The MOLAP Cube

Fact table view:

Multi-dimensional cube:

dimensions = 2

3 d cube
3-D Cube

Fact table view:

Multi-dimensional cube:

day 2

day 1

dimensions = 3

example
Example

roll-up to region

Dimensions:

Time, Product, Store

Attributes:

Product (upc, price, …)

Store …

Hierarchies:

Product  Brand  …

Day  Week  Quarter

Store  Region  Country

NY

Store

SF

roll-up to brand

LA

10

34

56

32

12

56

Juice

Milk

Coke

Cream

Soap

Bread

Product

roll-up to week

M T W Th F S S

Time

56 units of bread sold in LA on M

cube aggregation roll up

rollup

drill-down

Cube Aggregation: Roll-up

Example: computing sums

day 2

. . .

day 1

129

cube operators for roll up
Cube Operators for Roll-up

day 2

. . .

day 1

sale(s1,*,*)

129

sale(s2,p2,*)

sale(*,*,*)

extended cube
Extended Cube

*

day 2

sale(*,p2,*)

day 1

aggregation using hierarchies
Aggregation Using Hierarchies

store

day 2

day 1

region

country

(store s1 in Region A;

stores s2, s3 in Region B)

slicing
Slicing

day 2

day 1

TIME = day 1

slide59

Slicing &

Pivoting

summary of operations
Summary of Operations
  • Aggregation (roll-up)
    • aggregate (summarize) data to the next higher dimension element
    • e.g., total sales by city, year  total sales by region, year
  • Navigation to detailed data (drill-down)
  • Selection (slice) defines a subcube
    • e.g., sales where city =‘Gainesville’ and date = ‘1/15/90’
  • Calculation and ranking
    • e.g., top 3% of cities by average income
  • Visualization operations (e.g., Pivot)
  • Time functions
    • e.g., time average
query analysis tools
Query & Analysis Tools
  • Query Building
  • Report Writers (comparisons, growth, graphs,…)
  • Spreadsheet Systems
  • Web Interfaces
  • Data Mining