Data Warehouse Schema • Star Schema • Snowflake Schema • Fact constellation
Star Schema • A single, large and central fact table and one table for each dimension. • The star schema separates business data into facts, which hold the measurable, quantitative data about a business, and dimensions which are descriptive attributes related to fact data. Examples of fact data include sales price, sale quantity, and time, distance, speed, and weight measurements. Related dimension attribute examples include product models, product colors, product sizes, geographic locations, and salesperson names.
Star Schema (contd..) Store Dimension Fact Table Time Dimension Product Dimension Benefits:Easy to understand, easy to define hierarchies, reduces no. of physical joins.
SnowFlake Schema • Variant of star schema model. • A single, large and central fact table and one or more tables for each dimension. • Dimension tables are normalized i.e. split dimension table data into additional tables
SnowFlake Schema (contd..) Fact Table Time Dimension Store Dimension City Dimension Product Dimension Drawbacks: Time consuming joins,report generation slow
Promotion Hotels Booking Checkout Travel Agents Room Type Customer Fact Constellation • Fact Constellation • Multiple fact tables that share many dimension tables • Booking and Checkout may share many dimension tables in the hotel industry
ROLAP and SQL • To understand a Multidimensional view of data and how it can be implemented in a relational database. ROLAP – Relational On-Line Analytical Processing. • To be aware of extensions have been added to SQL to make OLAP queries easier to write; in particular: • Rollup Operator • Cube Operator
Data Warehouse Definitions A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993). “The technique of taking large amounts of operational data and putting them into a large database, with the aim of analysing them every which way to yield useful information” Scoring Points – How Tesco Continues to Win Customer Loyalty. Terry Hunt Pub. Kogan Page
Three Complementary Trends • Data Warehousing: Consolidate data from many sources in one large repository. • Loading, periodic synchronization of replicas. • Semantic integration. • OLAP: • Complex SQL queries and views. • Queries based on spreadsheet-style operations and “multidimensional” view of data. • Interactive and “online” queries. • Data Mining: Exploratory search for interesting trends and anomalies. (Another module/topic!)
Design Issues – Further Example TIMES timeid date week month quarter year holiday_flag (Fact table) pid timeid locid sales SALES PRODUCTS LOCATIONS pid pname category price locid city state country • Fact table in BCNF; dimension tables un-normalized. • Dimension tables are small; updates/inserts/deletes are rare. So, anomalies less important than query performance. • This kind of schema is very common in OLAP applications, and is called a star schema; computing the join of all these relations is called a star join.
8 10 10 pid 11 12 13 30 20 50 25 8 15 1 2 3 timeid Multidimensional Data Model timeid locid sales pid Slice locid=1 is shown: locid • Collection of numeric measures, which depend on a set of dimensions. • E.g., measure Sales, dimensions Product (key: pid), Location (locid), and Time (timeid).
MOLAP vs ROLAP • Multidimensional data can be stored physically in a (disk-resident, persistent) array; called MOLAPsystems. Alternatively, can store as a relation; called ROLAPsystems. • The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table. • E.g., Products(pid, pname, category, price) • Fact tables are much larger than dimensional tables.
Dimension Hierarchies PRODUCT TIME LOCATION year quarter country category week month state pname date city For each dimension, the set of values can be organized in a hierarchy:
OLAP Queries • Influenced by SQL and by spreadsheets. • A common operation is to aggregate a measure over one or more dimensions. • Find total sales. • Find total sales for each city, or for each state. • Roll-up: Aggregating at different levels of a dimension hierarchy. • E.g., Given total sales by city, we can roll-up to get sales by state.
OLAP Queries WI CA Total 63 81 144 1995 38 107 145 1996 75 35 110 1997 176 223 399 • Drill-down: The inverse of roll-up. • E.g., Given total sales by state, can drill-down to get total sales by city. • E.g., Can also drill-down on different dimensions to get total sales by product for each state. • Pivoting: Aggregation on selected dimensions. • E.g., Pivoting on Location and Time yields this cross-tabulation: Total
Comparison with SQL Queries SELECT T.Year, L.state, SUM(S.sales) FROM Sales S, Times T, Locations L WHERE S.timeid=T.timeid AND S.timeid=L.timeid GROUP BY T.year, L.state This query generates the entries in the body of the pivot chart on slide 12. The cross-tabulation obtained by pivoting can also be computed using a collection of SQLqueries:
SELECT L.state, SUM(S.sales) FROM Sales S, Location L WHERE S.timeid=L.timeid GROUP BY L.state This query produces the summary row at the bottom of the pivot chart on slide 12.
SELECT T.Year, SUM(S.sales) FROM Sales S, Times T WHERE S.timeid=T.timeid GROUP BY T.Year; This query produces the summary column on the right of the pivot chart on slide 12.
The cumulative sum in the bottom-right corner of the chart is produced by this query: SELECT SUM (S.sales) FROM Sales S, Locations L WHERE S.locid=L.locid;
The Cube Operator The Group by clause with the CUBE keyword is equivalent to a collection of GROUP BY statements: The result of the previous four queries can be produced with one query using the CUBE keyword. SELECT T.year, L.state, SUM (S.sales) FROM Sales S, Times T, Locations L WHERE S.timid=T.timeid AND S.locid.L.locid GROUP BY CUBE (T.year, L.state); The results of this query (on the following slide) is a tabular representation of the cross tabulation on slide 18.
OLAP: Example of Queries using Rollup keyword The above table (called SP) shows Suppliers who supply Parts in a certain quantity. S# = Supplier No and P# = Part number.
Queries Get the total shipment quantity. Get total shipment quantities by supplier Get total shipment quantities by part Get total shipment quantities by supplier and part.
1 Get the total shipment quantity. • Select SUM (QTY) From SP; • 2 Get total shipment quantities by supplier • Select S# , SUM(QTY) From SP Group by (S#);
3 Get total shipment quantities by part. • Select P#, SUM(QTY) From SP Group by (P#); • 4 Get total shipment quantities by supplier and part. • Select S#, P#, SUM(QTY) From SP Group by (S#, P#);
Rollup • Consider the following query Select S#, P#, Sum (QTY) As TOTQTY From SP Group by Rollup (S#, P#); • The query is a bundled SQL formulation of Queries 1, 2 and 4. Result is on the next slide. • Called Rollup because the quantities have been “rolled up” along the Supplier dimension.
CUBE • Consider the following query Select S#, P#, Sum (QTY) As TOTQTY From SP Group by CUBE (S#, P#); The result of this query, on the next slide, is a bundle of all four of the original queries. Note C. J. Date describes this as “… a table (an SQL-style table, at any rate,) but it is hardly a relation….In fact the result table in this example can be regarded as an “outer union”….outer union is not a respectable relational operation. “An introduction to Database Systems” eighth edition. C.J.Date Addison Wesley 2004
Several Levels of Aggregation in same query – Rollup The drawbacks to the previous queries without CUBE and Rollup is obvious. Formulating several similar but distinct queries is tedious Using these additions to SQL we can attempt to try and represent several levels of aggregation in a single query. This is the motivation behind “Rollup” and “Cube” which are extra options on the GROUP BY clause in SQL
Summary Multi-Dimensional views of data can be stored in a relational database. The SQL Group by clause can be used to aggregate data but queries can be tedious. The new keywords of Rollup and Cube have been added to SQL for use in Relational OnLine Analytical Processing (ROLAP) Practical exercises in the lab next week.