Summarizing Data with CUBE and ROLLUP

Summarizing Data with CUBE and ROLLUP

SQL ROLLUP and CUBE commands • Quick, efficient way to summarize the data stored in your database • Offer a valuable tool for gaining some quick and dirty insight into your data • ROLLUP and CUBE are SQL extensions and they're available in MySQL, SQL Server 6.5 (and above) and Oracle 8i (and above)

CUBE Operation SQL: select Type, Store, sum(Number) fromoPet group by type, store WITH CUBE Analyzing the data, you'll notice that our chain has 27 cats, 31 dogs and 5 turtles spread among our three stores. Our Miami store has the largest number of pets in stock with a whopping inventory of 30 pets.

ROLLUP Operation Utilizing the ROLLUP operator instead of the CUBE operator will eliminate the results that contain a NULL in the first column. SQL: select Type, Store, sum(Number) from Pet group by type, store WITH ROLLUP We're not particularly interested in the total number of pets at each store -- we'd just like to know our statewide inventory of each species along with the standard GROUP BY data. ที่มา: http://databases.about.com/od/sql/l/aacuberollup.htm

Show ‘how to use Excel to do CUBE and ROLLUP operations’

Aggregations • It simply means a summarized, typically additive value. • The level of aggregation in a star schema depends on the scenario. • Many star schemas are aggregated to some base level, called the grain, although this is becoming somewhat less common as developers rely on cube building engines to summarize to a base level of granularity.

Cartesian-Product Operation – Example • Relations r, s: • r xs: • A=C(r x s):

Natural Join – Example Same domain • Relations r, s: Join operation is very slow. • r xs: • r s • A=C(r x s):

r s Natural Join – Example • Relations r, s:

Database Design? Is there any problem with this design?

Normalization instructor department

Comparison Normalization Denormalization + Fast query: query fewer tables + Reduce number of joins (fast query processing time) + Good for data retrieval + Easy for end user to understand data Increase Redundancy data Big table size Increase update, insert, and delete anomaly + Reduce space + Small table size + Reduce update, insert, and delete anomaly + Reduce data entry + Use less index + Fast index since we use numeric to represent index + Good for update, insert, and delete - Increase number of joins in multiple tables

Which table is much easier to understand? Creating a view that looks like this, however, still uses joins in the background and therefore does not achieve the best performance on the query.

Fortunately, there is a better way. Designing the Star Schema Database

Motivation • Creating a Star Schema Database is one of the most important step in creating a data warehouse. • It is important to understand how to move from a standard, on-line transaction processing (OLTP) system to a final star schema.

For example: • Assume a company sells only two products: dog food and cat food. • Each day, the company records the sales of each product. • At the end of a couple of days, the data looks like this: • Each day contains several transactions. • This is the data as stored in a standard OLTP system. • However, the data warehouse might not record this level of detail. Instead, it could summarize, or aggregate, the data to daily totals. • The records in the data warehouse might look something like this:

This summarization of data reduces the number of records by aggregating the individual transaction records into daily records that show the number of each product purchased each day. • In this simple example, it is easy to derive the table simply by running a query against Pet table. • select date, sum(Dog Food) as Dog Food, • sum(Cat Food) as Cat Food • from pet • group by date • However, many complexities enter the picture that will be discussed later.

OLTP Systems • Online Transaction Processing, systems are standard, normalized databases. • OLTP systems are optimized for inserts, updates, and deletes; in other words, transactions. • Transactions in this context can be thought of as the entry, update, or deletion of a record or set of records. • They minimize repeated data, and they limit the number of indexes

Problem: Repeat the header information in each detail record. Flat Structure in Excel

Normalization • Database normalization is the process of removing repeated information (field) into a separate table. • Normalization first removes repeated records in a table. For example, the following order table contains much repeated information and is not recommended: If there were twenty repeated sets of fields for detail records, the table would be unable to handle an order for twenty one or more products. If an order has just has one product ordered, all the other fields are useless. The first step in the normalization process is to break the repeated fields into a separate table, and end up with this: Now, an order can have any number of detail records. Reduce the data entry required Reduce the size of an OrderDetail record

ER-Diagram To reduce repeated data, the new detail records might look like this: OLTP schemas store as little data as possible to speed inserts, updates, and deletes. Therefore, any number that can be calculated at query time is calculated and not stored.

OLTP Advantages • OLTP allows for the minimization of data entry • Only the primary key value from the OrderHeader table, the primary key of the Product table, and then the order quantity is stored • Built for data inserts, updates, and deletes, but not data retrieval • OLTP reduces the size of an OrderDetail record. • help speed inserts, updates, and deletes. • Most of the fields that link to other tables are numeric. • Queries generally perform much better against numeric fields than they do against text fields. • Replacing a series of text fields with a numeric field can help speed queries. • Numeric fields also index faster and more efficiently. • With normalization, there are frequently fewer indexes per table. • With fewer indexes to maintain, inserts, updates, and deletes run faster.

OLTP Disadvantages • With a normalized structure, queries must utilize joins across multiple tables to get all the data. • Joins tend to be slower than reading from a single table, so minimizing the number of tables in a query will boost performance. • The data in an OLTP system is not user friendly. • Most customers don't know how to make sense of the normalized structure of the database. Hard to query • Joins are somewhat mysterious, and complex table structures are difficult for the average customer to use.

OLTP Disadvantages • The fewer indexes in the database, the faster inserts, updates, and deletes will be. • The fewer indexes in the database, the slower select queries will run. • For the purposes of data retrieval, a higher number of correct indexes helps speed retrieval. • Since one of the design goals to speed transactions is to minimize the number of indexes, OLTP databases trade faster transactions at the cost of slowing data retrieval. • This is one reason for creating two separate database structures: • an OLTP system for transactions • an OLAP system for data retrieval.

OLTP Systems • OLTP allows for the minimization of data entry • Developers minimize the number of indexes in an OLTP system. • Indexes are important but they slow down inserts, updates, and deletes. • Therefore, most schemas have just enough indexes to support lookups and other necessary queries. • Over-indexing can significantly decrease performance.

Reason to Denormalize • If customers want to spend the majority of their time performing analysis by looking at the data, the IT group should support their desire for fast, easy queries. • On the other hand, maintaining the speed requirements of the transaction-processing activities is critical. • If these two requirements seem to be in conflict, they are, at least partially. • Many companies have solved this by having a second copy of the data in a structure reserved for analysis. • This copy is more heavily indexed, and it allows customers to perform large queries against the data without impacting the inserts, updates, and deletes on the main data. • This copy of the data is often not just more heavily indexed, but also denormalized to make it easier for customers to understand.

Reason to Denormalize • Users query a lot. Therefore • Need to build many index to speed up query • Need to denormalize table to reduce join operation and remove users’ confusion of tables’ structure

How Humans View Information • How many bottles of Aniseed Syrup were sold last week? • Are overall sales of Condiments up or down this year compared to previous years? • On a quarterly and then monthly basis, are Dairy Product sales cyclical? • In what regions are sales down this year compared to the same period last year? What products in those regions account for the greatest percentage of the decrease?

How Humans View Information • Tracking measure • Aggregation data (sum, count) • Time involved • By certain condition Fact Table Dimension Table

Star Schema • What people want to see (Event, Measures) • Fact Table (e.g., Sales) • How they want to see it • Dimension Table View data by time, by location, by product

Dimension Table fat , short table • All dimension tables should have a single-field primary key. • This key is typically a surrogate key and is often just an identity column, consisting of an automatically incrementing number. • The value of the primary key is meaningless, hence the surrogate key. • The real information is stored in the other fields. • These fields do notcontain codes that link to other tables. Because the fields contain full descriptions, the dimension tables are often short and fat; they contain many large fields.

Dimension Table fat , short table • Short and fat. • Cannot compare in size to a normal fact table. • For example, • Assume a company has 30,000 products in the product table. • The company may track sales for these products each day for several years. • Assuming the company actually only sells 3,000 products in any given day. • If they track these sales each day for ten years, the fact table has • 3,000 products sold X 365 day/year * 10 years = 11,000,000 records • The dimension table with 30,000 records will be short compared to the fact table.

Fact Table skinny , long table • Keep fact or measure data by conditions (long) • Measures are numeric and additive across some or all of the dimension tables • For example, sales are numeric and users can look at total sales for a product, or category, or subcategory, and by any time period. The sales figures are valid no matter how the data is sliced. • Use Primary Key of each Dimension table (skinny)

Fact Table skinny , long table Long Table - hold the number of records represented by the product of the counts in all the dimension tables. Therefore, there are many many records. Skinny table - hold the fields which are the primary key (foreign keys ) of the dimension tables. These fields are typically integer values. In addition, the measures are also numeric. Therefore, the size of each record is generally much narrower than those in the dimension tables.

Fact Table skinny , long table • 10 years of daily data • 200 stores • 500 products • Fact Table size: 365,000,000 records (3650 days * 200 stores * 500 products) • This large number of records makes the fact table long. • If add another dimension, such as a dimension of 10,000 customers, the record of the fact table can be increased up to 10,000 times.

Query1: the total sales for a particular year and month for a particular category the total SalesDollars for March, 1999 for category = “Brass Goods” SQL query: SELECT Sum(SF.SalesDollars) AS SumOfSalesDollars FROM TimeDimension T , ProductDimension P, SalesFact SF WHERE T.TimeID = SF.TimeID AND P.ProductID = SF.ProductIDAND P.Category='Brass Goods' AND T.Month=3 AND T.Year=1999

To drill down to a subcategory, the SQL would change to look like this: SELECT Sum(SF.SalesDollars) AS SumOfSalesDollars FROM TimeDimensionT, ProductDimension P, SalesFactSF WHERE P.ProductID = SF.ProductIDAND T.TimeID = SF.TimeID AND P.SubCategory='Widgets' AND T.Month=3 AND T.Year=1999;

Query2: The total SalesDollars for March, 1999 by category SQL query: SELECT P.Category AS Category, Sum(SF.SalesDollars) AS SumOfSalesDollars FROM TimeDimension T , ProductDimension P, SalesFact SF WHERE T.TimeID = SF.TimeID AND P.ProductID = SF.ProductIDAND T.Month=3 AND T.Year=1999 GROUP BY P.Category

Query3: The total SalesDollars by month and by brand SQL query: SELECT P.Month AS Month, P. Brand AS Brand, Sum(SF.SalesDollars) AS SumOfSalesDollars FROM TimeDimension T , ProductDimension P, SalesFact SF WHERE T.TimeID = SF.TimeID AND P.ProductID = SF.ProductID GROUP BY T.Month, P. Brand

Snowflake Schemas • A Dimension tables is fat so we can broken out it into separate tables. Save space. • This is a more normalized structure, but leads to more difficult queries and slower response times. • Difficult to drill-down into data

Snowflake Schemas • Increases the number of joins and can slow queries • The purpose of an OLAP system is to speed queries, snowflaking is usually not productive. • Some people try to normalize the dimension tables to save space. • However, in the overall scheme of the data warehouse, the dimension tables usually only account for about 1% of the total storage. • Any space savings from normalizing, or snowflaking, are negligible.

Fact Granularity • The granularity, or frequency, of the data is determined by the lowest level of granularity of each dimension table. • For example, a fact table may store weekly or monthly totals for individual products. • The lower the granularity, the more records that will exist in the fact table. • The granularity also determines how far users can drill down without returning to the base, transaction-level data.

Fact Table Size • 500 products sold in 200 stores and tracked for 10 years • could produce 365,000,000 records in a fact table with a daily grain (maximum size for the table) • Star schemas do not store zero values unless zero has some significance. • So, if a product did not sell at a particular store for a particular day, the system would not store a zero value. • The fact table contains only the records that have a value. 500 x 200 x 10 x 365

Fact Table Size • The lower the granularity, the larger the fact table. • In the previous example, moving from a daily to weekly grain would reduce the potential number of records to only slightly more than 52,000,000 records (500 products x 200 stores x 10 years x 52 weeks) • The data types for the fields in the fact table are numeric, which can require less storage space than the long descriptions we find in the dimension tables. • Be aware that each added dimension can greatly increase the size of the fact table. • For example, adding customer dimension table that has 20 possible values (record), the potential number of records would reach 7.3 billion. 500 x 200 x 10 x 365 x 20 = 7,300,000,000 records

Reference • Designing the Star Schema Database http://www.ciobriefings.com/whitepapers/StarSchema.asp http://databases.about.com/od/sql/l/aacuberollup.htm

Summarizing Data with CUBE and ROLLUP

Summarizing Data with CUBE and ROLLUP

Presentation Transcript

Summarizing and Presenting Data

Data Warehouse and Data Cube

Summarizing Data

Summarizing Quantitative Data

Summarizing Data

DATA CUBE

Summarizing Data

Summarizing Data

DATA CUBE

Summarizing Data

Summarizing Data

Summarizing Data

Summarizing and Displaying Data

Summarizing Measured Data

SUMMARIZING DATA

Summarizing Measured Data

Summarizing Data

Summarizing and Presenting Data

Summarizing Data

Organizing and Summarizing Data

Summarizing Quantitative Data

Summarizing Data