1 / 61

Data Models for Warehouse

Session-12/13 Data Management for Decision Support . Data Models for Warehouse. Data Models. Data Models relations stars & snowflakes cubes Operators slice & dice roll-up, drill down pivoting other. Data Models.

zoie
Download Presentation

Data Models for Warehouse

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Session-12/13 Data Management for Decision Support Data Models for Warehouse

  2. Data Models • Data Models • relations • stars & snowflakes • cubes • Operators • slice & dice • roll-up, drill down • pivoting • other

  3. Data Models • Star schemas are database schemas that exploit the structure of data for decision support query • Queries in DSS tend to • Examine a set of factual transactions- POS, Customer events • Facts are analyzed in variety of ways - POS transaction by week, or store • For example a retail store • POS is at the center • Product information - SKU, hierarchy of ( section dept, BU) • Time information - day, week, month, year • Stores - Store-id, hierarchy (regions, city, locality) • Suppliers- Sup-id, location, discounts

  4. Data Models Products Time Sales Transactions Stores Suppliers Information is split between two classes- Factual information and Reference information

  5. FACT DATA • Fact data records the information on factual event that occurred in the business- POS, Phone calls, Banking transactions • Typically 70% of Warehouse data is Fact data • Important to identify and define structure right in the first place as restructuring is an expensive process • Detail content of FACT is derived from the business requirement • Recorded Facts do not change as they are events of past

  6. Dimension Data • Information that is used for analyzing the elemental data, for example, product hierarchy, time periods, customers, stores • It is the reference data used for analysis of Facts • Organizing the information in separate reference tables offers better query performance • It differs from Fact data as it changes over time, due to changes in business, reorganization • It should be structured to permit rapid changes

  7. FACT and Dimensions • Tens to millions of rows • One primary key • Textual decription • Frequently modifies • Millions to billions of rows • Multiple foreign keys • Numeric • Does not change

  8. Decision Support Queries • Examples • Average number of sales of Haldiram per store over last month (various types within the brand) • Projected sales of Deepavali gift packs against the actual • The top 20% customers (spending) over last quarter • The customers with average balance in excess of Rs. 25000 for past one year • ==> Each of these queries is based on Factual data

  9. Decision Support Queries • Examples Sales of Haldiram Customer Spend Account Balance POS Transaction Membership card Transaction Account transactions Quantity Sold Product Store Date, Time Revenue Realized Customer-Id Store Transaction Value Date and Time Customer AC number type of transaction amount

  10. Star Schema • The star schema is a data-modeling technique used to map multidimensional decision support into a relational database. • Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database. • Four Components: • Facts • Dimensions • Attributes • Attribute hierarchies

  11. A Simple Star Schema

  12. Star Schema • Facts • Facts are numeric measurements (values) that represent a specific business aspect or activity. • The fact table contains facts that are linked through their dimensions. • Facts can be computed or derived at run-time (metrics). • Dimensions • Dimensions are qualifying characteristics that provide additional perspectives to a given fact. • Dimensions are stored in dimension tables.

  13. Identifying Facts and Dimensions Elemental Transaction Determine Key Dimensions Check if Fact is a dimension Check if dimensions is a Fact

  14. Identification: Step 1 • Examine the enterprise model and identify the transaction that or of interest- driven by business requirement analysis • These will be transaction that describes events fundamental to the business e.g., #calls for Telecom, account transactions in banking • For each potential Fact ask a question- Is this information operated upon by business process? Daily sales versus POS, even if system reports daily sales POS may be the FACT • The limit of current recording should not influence Warehouse design

  15. Identification: Step 1 Fact Table POS Transaction Stock movement and position Customer events Account transactions Claims and receipts Call events Customer events(install, disconnect, payment) Sector and Business Retail Sales Shrinkage Retail Banking Customer profiling Profitability Insurance Product Profitability Telecom Call Analysis Customer Analysis

  16. Identification: Step 2 • Look at the logical model to find the entities associated with entities in the fact table. List out all such logically associate entities. • These are candidate References, the task is to find key dimension entities that may not be directly associated. • For example, retail banking account transaction are candidate fact table. The account transaction is candidate reference. But, the customer I indirectly related to transaction. Although, a better choice. • Analyze account transaction by account? • Analyze how customers use our services? • You store both relationships but customer becomes a dimension

  17. Identification: Step3 FACT is not actually a denormalized dimension table • Consider the following: • house-details • Cable-laid • Sales-persons visit • connected to the service • promotional material sent • subscription cancelled • … • Home-details - candidate fact • Operational events • Report on number of connections quarter-to-date • Time-lag between laying and subscrition

  18. Identification: Step 4 Dimension is not a FACT • Lot depends on DSS requirements- • Customer can be FACT or Dimension • Promotions can be fact or dimensions • Ask questions using other dimensions- Using how many other dimensions, Can I view this entity. • Can I view promotion by Time? • Can I view promotions by product? • Can I view promotion by store? • Can I vie promotions by suppliers? • If answer to these question is yes, then it is a FACT

  19. Star Schema • Attributes • Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts. • Dimensions provide descriptive characteristics about the facts through their attributes. Possible Attributes For Sales Dimensions

  20. Three Dimensional View Of Sales

  21. Slice And Dice View Of Sales

  22. Star Schema • Attribute Hierarchies • Attributes within dimensions can be ordered in a well-defined attribute hierarchy. • The attribute hierarchy provides a top-down data organization that is used for two main purposes: • Aggregation • Drill-down/roll-up data analysis

  23. A Location Attribute Hierarchy

  24. Attribute Hierarchies In Multidimensional Analysis

  25. Star Schema • Star Schema Representation • Facts and dimensions are normally represented by physical tables in the data warehouse database. • The fact table is related to each dimension table in a many-to-one (M:1) relationship. • Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.

  26. Star Schema For Sales

  27. Orders Star Schema

  28. “Sales by product line over the past six months” “Sales by store between 1990 and 1995” The Multi-Dimensional Model Store Info Key columns joining fact table to dimension tables Numerical Measures Prod Code Time Code Store Code Sales Qty Fact table for measures Product Info Dimension tables Time Info . . .

  29. Dimensions are organized into hierarchies E.g., Time dimension: days  weeks  quarters E.g., Product dimension: product  product line  brand Dimensions have attributes Dimensional Modeling

  30. Dimension Hierarchies Store Dimension Product Dimension Total Total Region Manufacturer District Brand Stores Products

  31. ROLAP: Dimensional Modeling Using Relational DBMS • Special schema design: star, snowflake • Special indexes: bitmap, multi-table join • Special tuning: maximize query throughput • Proven technology (relational model, DBMS), tend to outperform specialized MDDB especially on large data sets • Products • IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

  32. MOLAP: Dimensional Modeling Using the Multi Dimensional Model • MDDB: a special-purpose data model • Facts stored in multi-dimensional arrays • Dimensions used to index array • Sometimes on top of relational DB • Products • Pilot, Arbor Essbase, Gentia

  33. Star Schema (in RDBMS)

  34. Star Schema Example

  35. Star Schema with Sample Data

  36. The “Classic” Star Schema • A single fact table, with detail and summary data • Fact table primary key has only one key column per dimension • Each key is generated • Each dimension is a single table, highly denormalized Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem

  37. The “Classic” Star Schema The biggest drawback: dimension tables must carry a level indicator for every record and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district will be pulled from the fact table, resulting in error. Example: Select A.STORE_KEY, A.PERIOD_KEY, A.dollars from Fact_Table A where A.STORE_KEY in (select STORE_KEY from Store_Dimension B where region = “North” and Level = 2) and etc... Level is needed whenever aggregates are stored with detail facts.

  38. The “Level” Problem • Level is a problem because because it causes potential for error. If the query builder, human or program, forgets about it, perfectly reasonable looking WRONG answers can occur. • One alternative: the FACT CONSTELLATION model...

  39. The “Fact Constellation” Schema District Fact Table Region Fact Table District_ID PRODUCT_KEY PERIOD_KEY Region_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price Dollars Units Price

  40. The “Fact Constellation” Schema In the Fact Constellations, aggregate tables are created separately from the detail, therefor it is impossible to pick up, for example, Store detail when querying the District Fact Table. Major Advantage: No need for the “Level” indicator in the dimension tables, since no aggregated data is stored with lower-level detail Disadvantage:Dimension tables are still very large in some cases, which can slow performance; front-end must be able to detect existence of aggregate facts, which requires more extensive metadata

  41. Another Alternative to “Level” • Fact Constellation is a good alternative to the Star, but when dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay. • An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table, the “Snowflake Schema” ...

  42. The “Snowflake” Schema Store Dimension STORE KEY District_ID Region_ID Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. District Desc. Region_ID Region Desc. Regional Mgr. Store Fact Table District Fact Table RegionFact Table Region_ID PRODUCT_KEY PERIOD_KEY District_ID PRODUCT_KEY PERIOD_KEY STORE KEY PRODUCT KEY Dollars Units Price PERIOD KEY Dollars Units Price Dollars Units Price

  43. The “Snowflake” Schema • No LEVEL in dimension tables • Dimension tables are normalized by decomposing at the attribute level • Each dimension table has one key for each level of the dimensionís hierarchy • The lowest level key joins the dimension table to both the fact table and the lower level attribute table How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.

  44. The “Snowflake” Schema • Additional features: The original Store Dimension table, completely de-normalized, is kept intact, since certain queries can benefit by its all-encompassing content. • In practice, start with a Star Schema and create the “snowflakes” with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table. Advantage: Best performance when queries involve aggregation Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database

  45. Advantages of ROLAP Dimensional Modeling • Define complex, multi-dimensional data with simple model • Reduces the number of joins a query has to process • Allows the data warehouse to evolve with rel. low maintenance • HOWEVER! Star schema and relational DBMS are not the magic solution • Query optimization is still problematic

  46. Aggregates • Add up amounts for day 1 • In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81

  47. Aggregates • Add up amounts by day • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

  48. Another Example • Add up amounts by day, product • In SQL: SELECT date, sum(amt) FROM SALE • GROUP BY date, prodId rollup drill-down

  49. Aggregates • Operators: sum, count, max, min, median, ave • “Having” clause • Using dimension hierarchy • average by region (within store) • maximum by month (within date)

  50. ROLAP vs. MOLAP • ROLAP:Relational On-Line Analytical Processing • MOLAP:Multi-Dimensional On-Line Analytical Processing

More Related