Data Models for Warehouse

Session-12/13 Data Management for Decision Support Data Models for Warehouse

Data Models • Data Models • relations • stars & snowflakes • cubes • Operators • slice & dice • roll-up, drill down • pivoting • other

Data Models • Star schemas are database schemas that exploit the structure of data for decision support query • Queries in DSS tend to • Examine a set of factual transactions- POS, Customer events • Facts are analyzed in variety of ways - POS transaction by week, or store • For example a retail store • POS is at the center • Product information - SKU, hierarchy of ( section dept, BU) • Time information - day, week, month, year • Stores - Store-id, hierarchy (regions, city, locality) • Suppliers- Sup-id, location, discounts

Data Models Products Time Sales Transactions Stores Suppliers Information is split between two classes- Factual information and Reference information

FACT DATA • Fact data records the information on factual event that occurred in the business- POS, Phone calls, Banking transactions • Typically 70% of Warehouse data is Fact data • Important to identify and define structure right in the first place as restructuring is an expensive process • Detail content of FACT is derived from the business requirement • Recorded Facts do not change as they are events of past

Dimension Data • Information that is used for analyzing the elemental data, for example, product hierarchy, time periods, customers, stores • It is the reference data used for analysis of Facts • Organizing the information in separate reference tables offers better query performance • It differs from Fact data as it changes over time, due to changes in business, reorganization • It should be structured to permit rapid changes

FACT and Dimensions • Tens to millions of rows • One primary key • Textual decription • Frequently modifies • Millions to billions of rows • Multiple foreign keys • Numeric • Does not change

Decision Support Queries • Examples • Average number of sales of Haldiram per store over last month (various types within the brand) • Projected sales of Deepavali gift packs against the actual • The top 20% customers (spending) over last quarter • The customers with average balance in excess of Rs. 25000 for past one year • ==> Each of these queries is based on Factual data

Decision Support Queries • Examples Sales of Haldiram Customer Spend Account Balance POS Transaction Membership card Transaction Account transactions Quantity Sold Product Store Date, Time Revenue Realized Customer-Id Store Transaction Value Date and Time Customer AC number type of transaction amount

Star Schema • The star schema is a data-modeling technique used to map multidimensional decision support into a relational database. • Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database. • Four Components: • Facts • Dimensions • Attributes • Attribute hierarchies

A Simple Star Schema

Star Schema • Facts • Facts are numeric measurements (values) that represent a specific business aspect or activity. • The fact table contains facts that are linked through their dimensions. • Facts can be computed or derived at run-time (metrics). • Dimensions • Dimensions are qualifying characteristics that provide additional perspectives to a given fact. • Dimensions are stored in dimension tables.

Identifying Facts and Dimensions Elemental Transaction Determine Key Dimensions Check if Fact is a dimension Check if dimensions is a Fact

Identification: Step 1 • Examine the enterprise model and identify the transaction that or of interest- driven by business requirement analysis • These will be transaction that describes events fundamental to the business e.g., #calls for Telecom, account transactions in banking • For each potential Fact ask a question- Is this information operated upon by business process? Daily sales versus POS, even if system reports daily sales POS may be the FACT • The limit of current recording should not influence Warehouse design

Identification: Step 1 Fact Table POS Transaction Stock movement and position Customer events Account transactions Claims and receipts Call events Customer events(install, disconnect, payment) Sector and Business Retail Sales Shrinkage Retail Banking Customer profiling Profitability Insurance Product Profitability Telecom Call Analysis Customer Analysis

Identification: Step 2 • Look at the logical model to find the entities associated with entities in the fact table. List out all such logically associate entities. • These are candidate References, the task is to find key dimension entities that may not be directly associated. • For example, retail banking account transaction are candidate fact table. The account transaction is candidate reference. But, the customer I indirectly related to transaction. Although, a better choice. • Analyze account transaction by account? • Analyze how customers use our services? • You store both relationships but customer becomes a dimension

Identification: Step3 FACT is not actually a denormalized dimension table • Consider the following: • house-details • Cable-laid • Sales-persons visit • connected to the service • promotional material sent • subscription cancelled • … • Home-details - candidate fact • Operational events • Report on number of connections quarter-to-date • Time-lag between laying and subscrition

Identification: Step 4 Dimension is not a FACT • Lot depends on DSS requirements- • Customer can be FACT or Dimension • Promotions can be fact or dimensions • Ask questions using other dimensions- Using how many other dimensions, Can I view this entity. • Can I view promotion by Time? • Can I view promotions by product? • Can I view promotion by store? • Can I vie promotions by suppliers? • If answer to these question is yes, then it is a FACT

Star Schema • Attributes • Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts. • Dimensions provide descriptive characteristics about the facts through their attributes. Possible Attributes For Sales Dimensions

Three Dimensional View Of Sales

Slice And Dice View Of Sales

Star Schema • Attribute Hierarchies • Attributes within dimensions can be ordered in a well-defined attribute hierarchy. • The attribute hierarchy provides a top-down data organization that is used for two main purposes: • Aggregation • Drill-down/roll-up data analysis

A Location Attribute Hierarchy

Attribute Hierarchies In Multidimensional Analysis

Star Schema • Star Schema Representation • Facts and dimensions are normally represented by physical tables in the data warehouse database. • The fact table is related to each dimension table in a many-to-one (M:1) relationship. • Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.

Star Schema For Sales

Orders Star Schema

“Sales by product line over the past six months” “Sales by store between 1990 and 1995” The Multi-Dimensional Model Store Info Key columns joining fact table to dimension tables Numerical Measures Prod Code Time Code Store Code Sales Qty Fact table for measures Product Info Dimension tables Time Info . . .

Dimensions are organized into hierarchies E.g., Time dimension: days  weeks  quarters E.g., Product dimension: product  product line  brand Dimensions have attributes Dimensional Modeling

Dimension Hierarchies Store Dimension Product Dimension Total Total Region Manufacturer District Brand Stores Products

ROLAP: Dimensional Modeling Using Relational DBMS • Special schema design: star, snowflake • Special indexes: bitmap, multi-table join • Special tuning: maximize query throughput • Proven technology (relational model, DBMS), tend to outperform specialized MDDB especially on large data sets • Products • IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

MOLAP: Dimensional Modeling Using the Multi Dimensional Model • MDDB: a special-purpose data model • Facts stored in multi-dimensional arrays • Dimensions used to index array • Sometimes on top of relational DB • Products • Pilot, Arbor Essbase, Gentia

Star Schema (in RDBMS)

Star Schema Example

Star Schema with Sample Data

The “Classic” Star Schema • A single fact table, with detail and summary data • Fact table primary key has only one key column per dimension • Each key is generated • Each dimension is a single table, highly denormalized Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem

The “Classic” Star Schema The biggest drawback: dimension tables must carry a level indicator for every record and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district will be pulled from the fact table, resulting in error. Example: Select A.STORE_KEY, A.PERIOD_KEY, A.dollars from Fact_Table A where A.STORE_KEY in (select STORE_KEY from Store_Dimension B where region = “North” and Level = 2) and etc... Level is needed whenever aggregates are stored with detail facts.

The “Level” Problem • Level is a problem because because it causes potential for error. If the query builder, human or program, forgets about it, perfectly reasonable looking WRONG answers can occur. • One alternative: the FACT CONSTELLATION model...

The “Fact Constellation” Schema District Fact Table Region Fact Table District_ID PRODUCT_KEY PERIOD_KEY Region_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price Dollars Units Price

The “Fact Constellation” Schema In the Fact Constellations, aggregate tables are created separately from the detail, therefor it is impossible to pick up, for example, Store detail when querying the District Fact Table. Major Advantage: No need for the “Level” indicator in the dimension tables, since no aggregated data is stored with lower-level detail Disadvantage:Dimension tables are still very large in some cases, which can slow performance; front-end must be able to detect existence of aggregate facts, which requires more extensive metadata

Another Alternative to “Level” • Fact Constellation is a good alternative to the Star, but when dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay. • An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table, the “Snowflake Schema” ...

The “Snowflake” Schema Store Dimension STORE KEY District_ID Region_ID Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. District Desc. Region_ID Region Desc. Regional Mgr. Store Fact Table District Fact Table RegionFact Table Region_ID PRODUCT_KEY PERIOD_KEY District_ID PRODUCT_KEY PERIOD_KEY STORE KEY PRODUCT KEY Dollars Units Price PERIOD KEY Dollars Units Price Dollars Units Price

The “Snowflake” Schema • No LEVEL in dimension tables • Dimension tables are normalized by decomposing at the attribute level • Each dimension table has one key for each level of the dimensionís hierarchy • The lowest level key joins the dimension table to both the fact table and the lower level attribute table How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.

The “Snowflake” Schema • Additional features: The original Store Dimension table, completely de-normalized, is kept intact, since certain queries can benefit by its all-encompassing content. • In practice, start with a Star Schema and create the “snowflakes” with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table. Advantage: Best performance when queries involve aggregation Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database

Advantages of ROLAP Dimensional Modeling • Define complex, multi-dimensional data with simple model • Reduces the number of joins a query has to process • Allows the data warehouse to evolve with rel. low maintenance • HOWEVER! Star schema and relational DBMS are not the magic solution • Query optimization is still problematic

Aggregates • Add up amounts for day 1 • In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81

Aggregates • Add up amounts by day • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

Another Example • Add up amounts by day, product • In SQL: SELECT date, sum(amt) FROM SALE • GROUP BY date, prodId rollup drill-down

Aggregates • Operators: sum, count, max, min, median, ave • “Having” clause • Using dimension hierarchy • average by region (within store) • maximum by month (within date)

ROLAP vs. MOLAP • ROLAP:Relational On-Line Analytical Processing • MOLAP:Multi-Dimensional On-Line Analytical Processing

Data Models for Warehouse

Data Models for Warehouse

Presentation Transcript

Data Warehouse

Data Warehouse Models and OLAP Operations

Data Warehouse

Data Warehouse

Data Warehouse

Data Warehouse

Data Warehouse

Data Warehouse

Data Warehouse Models and OLAP Operations

Data Warehouse

Data Warehouse

Data Modeling for Data Warehouse