1 / 57

OLAP

OLAP. CS 543 – Data Warehousing. Where Does OLAP Fit In? (1). OLAP = On-line analytical processing. OLAP is a characterization of applications, not a database design technique. Idea is to provide very fast response time in order to facilitate iterative decision-making.

Download Presentation

OLAP

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. OLAP CS 543 – Data Warehousing

  2. Where Does OLAP Fit In? (1) OLAP = On-line analytical processing. • OLAP is a characterization of applications, not a database design technique. • Idea is to provide very fast response time in order to facilitate iterative decision-making. • Analytical processing requires access to complex aggregations (as opposed to record-level access). CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  3. Where Does OLAP Fit In? (2) Information is conceptually viewed as “cubes” for simplifying the way in which users access, view, and analyze data. • Quantitative values are known as “facts” or “measures.” • e.g., sales $, units sold, etc. • Descriptive categories are known as “dimensions.” • e.g., geography, time, product, scenario (budget or actual), etc. • Dimensions are often organized in hierarchies that represent levels of detail in the data (e.g., UPC, SKU, product subcategory, product category, etc.). CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  4. OLAP FASMI Test Fast: Delivers information to the user at a fairly constant rate. Most queries should be delivered to the user in five seconds or less. Analysis: Performs basic numerical and statistical analysis of the data, pre-defined by an application developer or defined ad hoc by the user. Shared: Implements the security requirements necessary for sharing potentially confidential data across a large user population. Multi-dimensional: The essential characteristic of OLAP. Information: Accesses all the data and information necessary and relevant for the application, wherever it may reside and not limited by volume. ...from the OLAP Report by Pendse and Creeth. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  5. Need for Multidimensional Analysis • A simple analysis • How many units of product A did we sell in the store in DHA, Lahore • Typically, decision support requires more complex analyses • How much revenue did the new product X generate during the last three months, broken down by individual months, in the Southern Region, by individual stores, broken down by the promotions, compared to estimates, and compared to the previous version of the the product? CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  6. Kinds of Analyses • Roll-ups to provide summaries and aggregates along the hierarchies of the dimensions • Drill-downs from the top level to the lowest along the hierarchies of the dimensions • Calculations involving facts and metrics • Algebraic equations involving key performance indicators • Moving averages and growth percentages • Trend analyses using statistical methods CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  7. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  8. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  9. OLAP? • The name On-Line Analytical Processing was coined in a paper by E.F. Codd in 1993 (“Providing On-Line Analytical Processing for User Analysts”) • A definition • OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access in a a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  10. OLAP Features CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  11. Dimensional Analysis (1) CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  12. Dimensional Analysis (2) CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  13. Some Queries • Display the total sales of all products for past five years in all stores • Compare total sales for all stores, product by product, between years 2000 and 1999. • Show comparison of sales by individual stores, product by product, between years 2000 and 1999 only for those products with reduced sales. • Show the results of the previous queries, but rotating the columns with rows CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  14. Hypercubes • Multi-dimension cubes • Hard to visualize and display beyond three dimensions • Multi-dimensional domain structure (MDS) • Represents each dimension as a line showing the values CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  15. MDS CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  16. Display of Hypercubes CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  17. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  18. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  19. Drill-Down and Roll-Up CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  20. Slice-and-Dice or Rotation CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  21. OLAP Models/Implementations MOLAP: OLAP implemented with a multi-dimensional database. ROLAP: OLAP implemented with a relational database. HOLAP: OLAP implemented with a hybrid of multi-dimensional and relational database technologies. DOLAP: OLAP implemented for desktop decision support environments. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  22. ROLAP and MOLAP CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  23. MOLAP Implementations OLAP has historically been implemented through use of multi-dimensional databases (MDDs). • Dimensions are key business factors for analysis: • geographies (zip, state, region,...) • products (item, product category, product department,...) • dates (day, week, month, quarter, year,...) • Very high performance via fast look-up into “cube” data structure to retrieve pre-calculated results. • “Cube” data structures allow pre-calculation of aggregate results for each possible combination of dimensional values. • Use of application programming interface (API) for access via front-end tools. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  24. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  25. MOLAP Implementations Need to consider both maintenance and storage implications when designing strategy for when to build cubes. • Maintenance Considerations: Every data item received into MDD must be aggregated into every cube (assuming “to-date” summaries are maintained). • Storage Considerations: Although cubes get much smaller (e.g., more dense) as dimensions get less detailed (e.g., year vs. day), storage implications for building hundreds of cubes can be significant. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  26. MOLAP Implementations • Typically outperform relational database technology because all answers are pre-computed into cubes (and overhead for accessing cubes is very low). • Difficult to scale because of combinatorial explosion in the number and size of cubes when dimensions of significant cardinality are required. • Beyond tens (sometimes small hundreds) of thousands of entries in a single dimension will break the MOLAP model because the pre-computed cube model does not work well when the cubes are very sparse in the population of individual cells. See www.olapreport.com/DataExplosion.htm CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  27. Virtual Cubes Virtual cubes are used when there is a need to join information from two dissimilar cubes that share one or more common dimensions. • Similar to a relational view; two (or more) cubes are linked along common dimension(s). • Often used to save space by eliminating redundant storage of information. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  28. Partitioned Cubes • One logical cube of data can be spread across multiple physical cubes on separate (or same) servers. • The divide-and-conquer approach of partitioned cubes helps to mitigate the scalability limitations of a MOLAP environment. • Ideal cube partitioning is completely invisible to end users. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  29. ROLAP Implementations Advances in database technologies and front-end tools have begun to allow deployment of OLAP using ANSI SQL RDBMS implementations. • ROLAP facilitates deployment of much larger dimension tables than MOLAP implementations. • Front-end tools to facilitate GUI access to multi-dimensional analysis capabilities. • Aggregate awareness allows exploitation of pre-built summary tables for some front-end tools. Star schema designs are often used to facilitate OLAP against relational databases. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  30. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  31. Simplified Third Normal Form (Retail) CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  32. Simplified Star Schema CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  33. Simplified Star Schema A vastly simplified physical data model! Collapse dimensional hierarchies into a single table for each dimension and create a single fact table from the header and detail records: • Fewer tables. • Fewer joins to get results. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  34. Star Schema for High Performance Business question: How many $ in raincoats did I sell in the first week of January through stores in Boston? Assume: • 4 Billion rows in fact table. • 20 different kinds (size, color, style) of raincoats (product category) out of 50,000 UPCs in store. • 8 stores out of 400 are in BOSTON SMSA. • 2 years of POS history in DBMS. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  35. Star Schema for High Performance Simple (poor performance) approach to query execution: 1. Join item table with filtering on raincoat product category (very selective) to fact table. 2. Join date table with filtering by week (next most selective) to result table. 3. Join store table with filtering on store to result table from step 2. 4. Aggregate. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  36. Star Schema for High Performance Advanced (better performance) approach to query execution: 1. Cartesian product join between dimensional tables. * Result is 20 x 8 x 7 = 1,120 rows. 2. Use composite index on item:store:day into fact table for very selective access. * Access less than 0.00000008 percent of data in fact table! Sophisticated cost-based optimizers will figure this out. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  37. Forcing a Cartesian Product Join • Add an addition “join_value” column in each dimensional table. • Set join_value to same value in all rows of the dimensional tables. • Add additional where clause predicates joining on this columnbetween dimensional tables. NOTE: This shouldn't be necessary with a “smart” optimizer. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  38. Forcing a Cartesian Product Join Sample code: select sum(sales.sales_amt) from d_sales_detail ,store ,item ,period where d_sales_detail.store_id = store.store_id and d_sales_detail.item_id = item.item_id and d_sales_detail.day_dt = period.day_dt and period.day_dt between '23-NOV-2000' and '24-DEC-2000' and item.trade_style_cd = 'BARBIE' and store.state_cd = 'CA' and store.join_value = period.join_value and store.join_value = item.join_value and period.join_value = item.join_value ; CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  39. Star Schema for High Performance Problem: What if I want to know raincoat sales in first week of January regardless of store? Answer: Performance advantage of composite index in traditional RDBMS is severely impaired! • B-tree indexing techniques do not allow for flexibility in the use of dimensions for query purposes. • Bit indexing (and variations thereof) often allows much more generality in achieving high performance from a star schema. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  40. Star Schema for High Performance Bottom Line: • It is not at all unusual to obtain an order of magnitude (or more) in performance advantage using a star schema with advanced indexing versus a more traditional relational database implementation. • Despite what vendors may tell you, star schemas cannot be effectively implemented for all DSS business applications and/or data models. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  41. ROLAP • Relational OLAP often makes heavy use of summary tables to provide near instantaneous access for multi-dimensional queries. • Foundation is usually star schema or snowflake database design. • Allows OLAP with much larger data sets than multi-dimensional database (MDD) products using cube structures (MOLAP). CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  42. ROLAP Number of summary tables can get very large if discipline is not enforced... Assume a retail database with the following two dimensions on the fact table... Calendar: Day, Week, Period, Quarter, Year, All Days Geography: Store, Zone, District, Region, All Stores CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  43. ROLAP Summary tables in a naive implementation require all combinations of the dimensions at each aggregation level... 30 summary tables! ... Add in item, SKU, subcategory, category, and all items...now we are up to 150 pre-aggregates! CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  44. ROLAP Summary tables are more of a maintenance issue than a storage issue in most production implementations. • Notice that summary tables get much smaller as dimensions get less detailed (e.g., year vs. day). • Should plan for double the size of the unsummarized data for ROLAP summaries in most environments. • Every detail record that is received into warehouse must aggregate into EVERY summary table (assuming "to-date" summaries are maintained). CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  45. ROLAP Warning: Do not assume that dimensions are always simple hierarchies. Example: Items are not just category, subcategory, SKU, and atomic item.... what about trade styles or manufacturer? Now we need summary tables along these lines as well...another 120 summary tables! Calendar vs. accounting period vs. billing cycle can be even worse... CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  46. ROLAP Many ROLAP products have devised ways to reduce the number of summary tables: • Ability to build summaries on-the-fly as demanded by end-user applications. • Ability to aggregate efficiently from subset of the summary tables. • Tools exist in some products to assist in DBAs in selecting the "best” aggregations to build. • HOLAP (Hybrid OLAP) tools allow co-existence of pre-built cubes alongside relational OLAP structures. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  47. Intelligent Aggregation Selection • Maximum performance boost implies lots of disk for every pre-calculation. • Minimum performance boost implies no disk with zero pre-calculation. • Strategy is to use meta data to heuristically determine optimum set of aggregates from which all other aggregates can be derived. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  48. Aggregate Wizards CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  49. Fact Table Aggregates • Enhance performance on common queries at coarser granularities. • Save space to permit storing more history than possible with finer granularities. • Take advantage of need to store other facts (with similar samples) at a particular granularity. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

  50. Aggregate Advice • Coarser granularity decreases potential cardinality, but usually increases density (e.g., daily summary table is typically twice the size of weekly summary table - not seven times). • Strongly consider omitting candidate aggregates where expected cardinality is more than 10% that of next finer granularity stored. • Keep the detail for drill down, even if you deploy aggregates for performance. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS

More Related