Advanced Querying - PowerPoint PPT Presentation

montgomery
advanced querying l.
Skip this Video
Loading SlideShow in 5 Seconds..
Advanced Querying PowerPoint Presentation
Download Presentation
Advanced Querying

play fullscreen
1 / 44
Download Presentation
Advanced Querying
139 Views
Download Presentation

Advanced Querying

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Advanced Querying OLAP Data Warehousing

  2. Database Applications • Transaction processing • Online setting • Supports day-to-day operation of business • Decision support • Offline setting • Strategic planning (statistics)

  3. Transaction processing Operational setting Up-to-date = critical Simple data Simple queries Flight reservations ticket sales do not sell a seat twice reservation, date, name Give flight details of X List flights to Y Transaction Processing

  4. Transaction Processing • Database must support • simple data • tables • simple queries • select from where … • consistency & integrity CRITICAL • concurrency • Relational databases, Object-Oriented, Object-Relational

  5. Decision support Off-line setting « Historical » data Summarized data Different databases Statistical queries Flight company Evaluate ROI flights Flights of last year # passengers on line L Passengers, fuel costs, maintenance info Average % of seats sold/month/destination Decision Support

  6. Data Warehouse A decision support DB that is maintained separately from the organization’s operational databases. Why Separate Data Warehouse? • High performance for both systems • DBMS— tuned for OLTP • access methods, indexing, concurrency control, recovery • Warehouse—tuned for OLAP • complex OLAP queries, multidimensional view, consolidation. • Different functions and different data • Missing data: Decision support requires historical data which operational DBs do not typically maintain • Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources • Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

  7. Monitor & Integrator OLAP Server Metadata Analysis other sources Serve Data Warehouse Extract Transform Load Refresh Query/Reporting Operational DBs Data Mining ROLAP Server Data Marts Data Sources OLAP Engine Front-End Tools Data Storage Three-Tier Architecture

  8. OLAP • OLAP = OnLine Analytical Processing • Online = no waiting for answers • OLAP system = system that supports analytical queries that are dimensionalin nature.

  9. This Lecture • Examples of decision support queries • Data Cubes • Conceptual data model • Typical operations • Implementation • ROLAP vs MOLAP • Indexing structures • SQL:1999 support for OLAP

  10. Examples of Queries • Flight company: evaluate ticket sales • give total, average, minimal, maximal amount • per date: week, month, year • by destination/source port/country/continent • by ticket type • by # of connections • …

  11. Characteristics • One special attribute: amount  measure • Other attributes: select relevant regions  dimensions Different levels of generality (month, year, …)  hierarchies • Measure data is summarized: sum, min, max, average  aggregations

  12. Supermarket example • Evaluate the sales of products • Product cost in $ • Customer: ID, city, state, country, • Store: chain, size, location, • Product: brand, type, … • … • What are the measure and dimensional attributes, where are the hierarchies? measure Dim. hierarchies

  13. Why dimensions? • Multidimensional view on the data store Cost in $ customer product

  14. Cross Tabulation • Cross-tabulations are highly useful • Sales of clothes JuneAugust ‘06 Product: color Date:month, JuneAugust 2006

  15. Data cubes • Extension of Cross-Tables to multiple dimensions • Conceptual notion Dimensions Data Points/ 1st level of aggregation Aggregated w.r.t. X-dim Aggregated w.r.t. X and Y Aggregated w.r.t. Y-dim

  16. All, All, All Data Cubes Date 2Qtr 1Qtr sum 3Qtr 4Qtr TV Product Ireland PC VCR sum France Country Germany sum

  17. Data Cubes • Base cuboid = n-dimensional cube with n number of dimensions • The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid • The lattice of cuboids forms a data cube

  18. Lattice of Cuboids all product country date date, country product, country product, date product, date, country

  19. Operations with Data Cubes Scenario: • Before starting the analysis task: • what data? • select a few relevant dimensions • define hierarchy • aggregation functions of interest • Pre-materialize • load data • compute counts/max, min, avg, … on beforehand

  20. Operations with Data Cubes • What operations can you think of an analyst might find useful? (e.g., store)

  21. Operations with Data Cubes • What operations can you think of that an analyst might find useful? (e.g., store) • only look at stores in the Netherlands • look at cities instead of individual stores • look at the cross-table for product-date • restrict analysis to 2006, product O1 • go back to a finer granularity at the store level

  22. Roll-Up • Move in one dimension from a lower granularity to a higher one • store  city • cities  country • product  product type

  23. Drill-down • Move in one dimension from a higher granularity to a lower one • city  store • country  cities • product type  product • Drill-through: • go back to the original, individual data records

  24. Pivoting • Change the dimensions that are “displayed”; select a cross-tab. • look at the cross-table for product-date • display cross-table for date-customer

  25. Slice & dice • Select a part of the cube by restricting one or more dimensions • restrict analysis to “city = Eindhoven”

  26. Summary of Concepts • Cube: Multidimensional view on data • dimensional attributes • measure attribute • Operations: • roll-up/drill-down • pivoting • slice and dice

  27. Implementation • To make query answering more efficient: consolidate (materialize) aggregations • Obvious implementation: multidimensional array. • Fast lookup: cell(prod. p, date d, prom. pr): • look up index of p1, index of d, index of pr: index = (p x D x PR) + (d x PR) + pr

  28. Implementation • Multidimensional array • obvious problem: sparse data can easily be solved, though. Example: binary search tree, key on index hash table.

  29. Implementation • However: very quickly people were confronted with the Data Explosion Problem Consolidating the summaries blows up the data enormously ! Reasons are often misunderstood and confusing.

  30. Data Explosion Problem • Why? Suppose: • n dimensions, every dimension has d values • dn possible tuples. • Number of cells in the cube: (d+1)n • So, this is not the problem

  31. Data Explosion Problem • Why? Suppose • n dimensions, every dimension has d values • every dimension has a hierarchy • most extreme case: binary tree  2d possibilities/dimension

  32. Data Explosion Problem • Why? Suppose • n dimensions, every dimension has d values • every dimension has a hierarchy • most extreme case: binary tree  2d possibilities/dimension 2n xdn cells Only partial explanation (factor 2n comes from an extremely pathological case)

  33. Data Explosion Problem • Why? • The problem is that most data is not dense, but sparse. • Hence, not all dn combinations are possible. Example: 10 dimensions with 10 values • 10 000 000 000 possibilities Suppose « only » 1 000 000 are present

  34. Data Explosion Problem Example: 10 dimensions with 10 values • 10 000 000 000 possibilities Suppose « only » 1 000 000 are present Every tuple increases count of 210 cells ! With hierarchies: effect even worse! If every hierarchy has 5 items: 510 =9 765 625 cells!

  35. View Selection Problem • Suffices to precompute some aggregates, and compute others on demand. • aggregate on (item-name, color) from an aggregate on (item-name, color, size) • For all but a few “non-decomposable” aggregates such as median • Several optimizations for computing multiple aggregates • Compute aggregate on (item-name, color) from an aggregate on (item-name, color, size) • Compute aggregates on (item-name, color, size), (item-name, color) and (item-name) in single DB sort

  36. View Selection Problem all product country date date, country product, country product, date product, date, country

  37. View Selection Problem all Which views to select: hard research problem ! product country date date, country product, country product, date product, date, country

  38. Implementation Nowadays systems can be divided in three categories: • ROLAP (Relational OLAP) • OLAP supported on top of a relational database • MOLAP (Multi-Dimensional OLAP) • Use of special multi-dimensional data structures • HOLAP: (Hybrid) • combination of previous two

  39. Month Prod. Cust. Price Jan p1 c1 10 Jan p2 c1 8 Jan p1 c2 10 Feb p1 c1 9 … all p1 c1 102 Jan all c1 18 Jan p1 all 1 230 all all c1 4 235 … all all all 1 253 458 ROLAP • Cubes can easily be represented in relational tables: special value “all”

  40. ROLAP • Typical database scheme: • star schema • fact table is central • links to dimensional tables • Extensions: • snowflake schema • dimensions have hierarchy/extra information attached • Star constellation • multiple star schemas sharing dimensions

  41. Example of a Star Schema Order Product Order No Order Date ProductNO ProdName ProdDescr Category CategoryDescription UnitPrice Fact Table Customer OrderNO SalespersonID CustomerNO ProdNo DateKey CityName Quantity Total Price Customer No Customer Name Customer Address City Date DateKey Date Salesperson City SalespersonID SalespersonName City Quota CityName State Country

  42. Example of a Snowflake Schema Order Product Category Order No Order Date ProductNO ProdName ProdDescr Category Category UnitPrice CategoryName CategoryDescr Fact Table Customer OrderNO SalespersonID CustomerNO ProdNo DateKey CityName Quantity Total Price Customer No Customer Name Customer Address City Date Month DateKey Date Month Month Year Salesperson Year Year SalespersonID SalespersonName City Quota City State CityName State Country StateName Country

  43. Item item_key item_name brand type supplier_key Time_key Time_key Item_key Item_key Branch_key shipper_key from_location Location_key to_location Unit_sold Euros_sold Euros_sold shipper Avg_sales unit_shipped shipper_key shipper_name location_key shipper_type Example of Fact Constellation Multiple fact tables share dimension tables Shipping Fact Table Time time_key day day_of_the_week month quarter year Sales Fact Table Branch Location branch_key branch_name branch_type location_key street city Province/street country Measures

  44. SQL 1999 support for OLAP • see other set of slides