1 / 47

Decision Support, Data Warehousing, and OLAP

Decision Support, Data Warehousing, and OLAP. Anindya Datta Director, iXL Center for E-Commerce Georgia Institute of Technology adatta@cc.gatech.edu. Outline. Data Warehousing Architecture Terminology: OLTP vs. OLAP Technologies Products Research Issues References. Data Warehouse.

avalon
Download Presentation

Decision Support, Data Warehousing, and OLAP

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Decision Support, Data Warehousing, and OLAP Anindya Datta Director, iXL Center for E-Commerce Georgia Institute of Technology adatta@cc.gatech.edu

  2. Outline • Data Warehousing Architecture • Terminology: OLTP vs. OLAP • Technologies • Products • Research Issues • References

  3. Data Warehouse • A decision support database that is maintained separately from the organization’s operational databases. • A data warehouse is a • subject-oriented • integrated • time-varying • non-volatile

  4. Data Warehouse • collection of data that is used primarily in organizational decision making • Decision support system

  5. OLTP • On-Line Transaction Processing OLTP • What we have been talking about in class • For day-to-day operations • Involves operational database • Queries to retrieve information from specific tuples

  6. OLAP and Decision Support • On-Line Analytical ProcessingOLAP is an element of decision support systems (DSS). • Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions. • What were the sales volumes by region and product category for the last year? • Which orders should we fill to maximize revenues? • Will a 10% discount increase sales volume sufficiently?

  7. Three-Tier Architecture • Warehouse database server • Almost always a relational DBMS • OLAP servers • Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations. • Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations. • Clients • Query and reporting tools. • Analysis tools • Data mining tools (e.g., trend analysis, prediction)

  8. Data Warehouse vs. Data Marts • Enterprise warehouse: collects all information about subjects for entire organization. • Requires extensive business modeling • May take years to design and build • Data Marts: Departmental subsets that focus on selected subjects: Marketing data mart • Faster roll out, but complex integration in the long run. • Virtual warehouse: views over operational dbs • Materialize some summary views for efficient query processing • Easier to build • Requisite excess capacity on operational db servers

  9. Data Warehousing Architecture Monitoring & Administration OLAP servers Metadata Repository Analysis Query/ Reporting Data Mining Data Warehouse Extract Transform Load Refresh External Sources Serve Operational dbs Data Marts

  10. Clerk, IT Professional Day to day operations Application-oriented (E-R based) Current, Isolated Detailed, Flat relational Structured, Repetitive Short, Simple transaction Read/write Index/hash on prim. Key Tens Thousands 100 MB-GB Trans. throughput Knowledge worker Decision support Subject-oriented (Star, snowflake) Historical, Consolidated Summarized, Multidimensional Ad hoc Complex query Read Mostly Lots of Scans Millions Hundreds 100GB-TB Query throughput, response OLTP vs. OLAP OLTP OLAP User Function DB Design Data View Usage Unit of work Access Operations # Records accessed #Users Db size Metric

  11. OLAP for Decision Support • Goal of OLAP is to support ad-hoc querying for the business analyst • Business analysts are familiar with spreadsheets • Extend spreadsheet analysis model to work with warehouse data • Multidimensional view of data is the foundation of OLAP

  12. Relational DBMS as Warehouse Server - ROLAP • Schema design • Specialized scan, indexing and join techniques • Handling of aggregate views (querying and materialization) • Supporting query language extensions beyond SQL • Complex query processing and optimization • Data partitioning and parallelism

  13. Warehouse Database Schema • ER design techniques not appropriate • Design should reflect multidimensional view • Star Schema • Snowflake Schema • Fact Constellation Schema

  14. Star Schema • A single fact (subject) table and a single table for each dimension • Every fact points to one tuple in each of the dimensions and has additional attributes • Does not capture hierarchies directly • Generated keys are used for performance and maintenance reasons

  15. Example of a Star Schema Order Product Order No Order Date ProductNO ProdName ProdDescr Category CategoryDescription UnitPrice Fact Table Customer OrderNO SalespersonID CustomerNO ProdNo DateKey CityName Quantity Total Price Customer No Customer Name Customer Address City Date DateKey Date Salesperson City SalespersonID SalespersonName City Quota CityName State Country

  16. Types of tables • Fact table • (OrderNO, SalespersonID, CustomerNo, ProdNo, Total Price) • Dimension tables • Order (Order No, Order Date) • Date (DateKey, Date)

  17. Dimensional SQL version: SELECT SUM(SA.NoOfItems), P.PCategory, V.Vendor, C.Qtr, GroupBy(C.Qtr)FROM Calendar C, Store S, Product P, Sales SAWHERE C.CalendarKey = SA.CalendarKey AND S.StoreKey = SA. StoreKey AND P.ProductKey = SA. ProductKey AND P.Category = ‘Pacifica’ and S.RegionName = ‘Midwest’ AND C.DayofWeek = ‘Saturday’ AND C.Year = 2010HAVING C.Qtr in (1,2) Non-dimensional SQL version: SELECT SUM(SW.NoOfItems), C.CategoryName, V.VendorName, EXTRACTQUARTER(ST.Date), GroupBy(EXTRACTQUARTER(ST.Tdate))FROM Region R, Store S, SalesTransaction ST, SoldWithin SW, Product P, Vendor V, Category CWHERE R.RegionID = S. RegionID AND S.StoreID = ST.StoreID AND ST.Tid = SW.Tid AND SW.ProductID = P.ProductID AND P. VendorID = V. VendorID AND P.CateoryID = C.CategoryID AND P.Category = ‘Pacifica’ and R.RegionName = ‘Midwest’ AND EXTRACTWEEKDAY(St.Date) = ‘Saturday’, AND EXTRACTYEAR(ST.Date) = 2010HAVING QUARTER(ST.Date) in (1,2)

  18. Snowflake Schema • Represent dimensional hierarchy directly by normalizing the dimension tables • Easy to maintain • Saves storage, but is alleged that it reduces effectiveness of browsing (Kimball)

  19. Example of a Snowflake Schema Order Product Category Order No Order Date ProductNO ProdName ProdDescr Category Category UnitPrice CategoryName CategoryDescr Fact Table Customer OrderNO SalespersonID CustomerNO ProdNo DateKey CityName Quantity Total Price Customer No Customer Name Customer Address City Date Month DateKey Date Month Year Month Year Salesperson Year SalespersonID SalespersonName City Quota City State CityName State Country StateName Country

  20. Multidimensional Data Model • Database is a set of facts (points) in a multidimensional space • A fact has a measure dimension • quantity that is analyzed, e.g., sale, budget • A set of dimensions on which data is analyzed • e.g. , store, product, date associated with a sale amount • Each dimension has a set of attributes • e.g., owner city and county of store • Attributes of a dimension may be related by partial order • Hierarchy: e.g., county > city > street • Lattice: e.g., year>month> date, year>week>date

  21. Multidimensional Data Model Sales Volume as a function of time, city and product NYC LA SF Juice Cola Milk Cream 10 47 30 12 22 18 30 3/1 3/2 3/3 3/4 Date

  22. Operations in Multidimensional Data Model – Roll-up • Aggregation (roll-up) • summarization over aggregate hierarchy – higher levels of summarization: • e.g., total sales by city and rollup to total sales by state or take daily sales totals and rollup to monthly sales totals • Subtracting group by columns

  23. Roll-UP NY AL CA Juice Cola Milk Cream 100 475 310 121 3/1 3/2 3/3 3/4 Date

  24. Drill Down • Navigation to detailed data (drill-down) • Used to adjust the level of detail – getting more detail – finer data • e.g., total sales by region and drill down to total sales by city or top 3% of cities • Adding group by columns

  25. Drill Down NYC LA SF Juice Cola Milk Cream 5 5 20 27 20 10 7 5 3/1 am 3/1 pm 3/2 am 3/2 pm Date

  26. Operations in Multidimensional Data Model • Selection (slice) defines a subcube • e.g., sales where city = SF and Product = Cream • Visualization Operations (e.g., Pivot) • Rotating summary tables (next page)

  27. Selection - Slice SF Cream 3/1 3/2 3/3 3/4

  28. A Visual Operation: Pivot (Rotate) NYC LA SF Month Juice Cola Milk Cream 10 Region 47 30 12 Product 3/1 3/2 3/3 3/4 Date

  29. Indexing Techniques • Exploiting indexes to reduce scanning of data is of crucial importance • Bitmap Indexes • Join Indexes

  30. Bit Map Index Region Index Base Table Rating Index Region = W Customers where And Rating = L

  31. BitMap Indexes • Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic improvement in processing time • Significant reduction in space and I/O (30:1) • Adapted for higher cardinality domains as well. • Compression (e.g., run-length encoding) exploit

  32. Issues in Handling of Aggregate Views (summary info) • Important component for ROLAP Servers • Logic for Aggregation Navigation • make optimum use of materialized aggregates to answer a query • Choice of aggregate views to materialize

  33. SQL Extensions for Front End Tools • Extended Family of Aggregate functions • rank (top 10) and N-Tile (“top 30%” of all products) • Median, mode….. • Reporting Features • running total, cumulative totals • Results of multiple group by: • total sales by month and total sales by product

  34. Approaches to OLAP Servers • Relational OLAP (ROLAP) • Relational and Specialized Relational DBMS to store and manage warehouse data • Example: Oracle Warehouse Builder, MetaCube (Informix) Multidimensional OLAP • Array-based storage structures • Direct access to array data structures • Example: Essbase (Arbor then Hyperion then Oracle)

  35. ROLAP Data stored in relational tables

  36. MOLAP – data stored in multidimensional array • The storage model is an n-dimensional array – Cube • (1,1,1) subject value stored for juice, 3/1, SF • 1,1,1= 10 1,2,1=37 1,3,1=40 1,4,1=12 • Array representation has good indexing properties but very poor storage utilization when data is sparse

  37. ROLAP vs MOLAP MOLAP – fast response times – data ready to display, cheaper if must start from scratch ROLAP – increased flexibility in choice of queries, cheaper if already have relational DB

  38. ETL • Data Extraction • Data cleaning • Data may have errors from multiple sources • Data Transformation • Convert from legacy/host format to warehouse format • Data Load • Sort, summarize, consolidate, compute views, check integrity, build indexes, partition

  39. Data Cleaning/Transformation Techniques • Extraction – decide which data • Uses domain-specific knowledge to do scrubbing • Discover facts that flag unusual patterns (auditing) • Some dealer has never received a single complaint • Products: QDB, SBStar, WizRule • Transformation Rules • Example: translate “gender” to “sex” • Products: Warehouse Manger (Prism), Extract (ETI)

  40. Data Load • Issues: • huge volumes of data to be loaded • small time window (usually at night) when the warehouse can be taken off-line • When to build indexes and summary tables • allow system administrator to monitor status, cancel suspend, resume load, or change load rate • restart after failure with no loss of data integrity • Techniques: • sort input records on clustering key and use sequential I/O; build indexes and derived tables • sequential loads still too long • use parallelism and incremental techniques

  41. Data Refresh • Issues: • when to refresh • on every update: too expensive, only necessary if OLAP queries need current data (e.g., up-the-minute stock quotes) • periodically (e.g., every 24 hours, every week) or after “significant” events • refresh policy set by administrator based on user needs and traffic • Techniques: • Full extract from base tables • read entire source table or database: expensive. • Incremental techniques (related to work on active DBS)

  42. Issues to Consider • Physical Design • design of summary tables, partitions, indexes • Query processing • selecting appropriate summary tables • approximate answers • Runaway queries • Warehouse Management • detecting runaway queries • resource management • incremental refresh techniques • computing summary tables during load

  43. Research Issues • Data warehouses in the cloud • XML and data warehouses

More Related