1 / 28

Composite Subset Measures

Composite Subset Measures. Lei Chen , Paul Barford, Bee-Chung Chen, Vinod Yegneswaran University of Wisconsin - Madison Raghu Ramakrishnan Yahoo! Research and University of Wisconsin – Madison 09.12.2006. Motivation. Consider this query:

jtremblay
Download Presentation

Composite Subset Measures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Composite Subset Measures Lei Chen, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran University of Wisconsin - Madison Raghu Ramakrishnan Yahoo! Research and University of Wisconsin – Madison 09.12.2006

  2. Motivation • Consider this query: • “For each year and each country, compute the ratio of the average personal incomes between richest city and poorest city . Then find the number of countries where such ratio continuously decrease between 1990-2000“ • It is • Hard to write in SQL • Hard to optimize/understand the SQL query • This kind of queries is increasingly common: • Multi-step aggregation • Must scale to very large datasets, often distributed 2

  3. Contributions • A new framework for expressing such compositional aggregate queries • Key contribution is how we look at the computation, in terms of aggregating over related regions in “cube space” • An efficient evaluation framework based on sorted scans that take into account of multiple aggregation steps • Experimental results 3

  4. Background • Computing “measures” • Measures summarize some characteristic of data subsets (e.g., SUM, std dev, beta-value of a portfolio) • Approaches: Group by, data cubes, Hancock, Sawzall • Cube space • Partition feature space using attribute values; domain hierarchies organize this space into nested collections of regions • Regions: (2006, Korea), (2006/09, Seoul) • Region sets: (Year, Country), (Month, City) 4

  5. Composite Subset Measures • The measure of a cube region is computed by: • Aggregating data in a region directly (e.g., sales volumes for each day), or • Summarizing the measures for related regions, e.g.: • The maximum of daily volumes within a year • The ratio of average personal incomes between the richest and poorest cities in a country 5

  6. What is “Related” in Cube Space • Focus on relationships which • are commonly used • can be efficiently evaluated • Self • Parent/Child • E.g., Year/Day • Child/Parent • E.g., Day/Year • Sibling • E.g., Today/Tomorrow 6

  7. Examples (Network Analysis) • Data involved: • Stream of data records for IP packet information • Time (t), Source (U), Destination (D) , Size (s) • Queries: • For every minute, the number of outgoing packets from each given source IP • For every hour, the maximum number of minutely outgoing packets from a given source IP 7

  8. Expression Algebra • Each measure entity is defined as a collection of region/value pairs • Regions should belong to same region set • Fact Table • Aggregation • Selection • Match join • Combine join 8

  9. Example: Aggregation • For every hour and every unique IP, compute the number of outgoing packets 9

  10. Example: Selection • For every hour, compute the sum of outgoing packets from those source IP with at least five packets in that hour (High traffic count) Source time 10

  11. Example: Match • For each six hour time window, compute the average of the high traffic count 11

  12. Example: Combine • For each hour, compute the ratio between the six hour average and the high traffic count 12

  13. Aggregation Workflows • A diagrammatic way to express multiple composite subset measure expressions • Semantically equivalent to the algebra • Rectangles: Region sets • Ellipses: Measures associated with the Region sets • Arcs: Computational dependencies among measures 13

  14. Measure name Aggregation formula Selection condition Region set Match condition Example 14

  15. Example (cont.) 15

  16. Multi-step Execution Plan • Evaluation based on the topology order of the aggregation workflow • Materialize non-dependent measures • Then evaluate dependent measures • following the arcs of the aggregation workflow • May need to perform join • Problem • Intermediate measures: extra I/O 16

  17. Simple Scan Execution[*] • Build one hash table for each measure • “Insert” data into hash tables of low-level measures • Propagate the measures upwards after the scan is over • Distributive or algebraic aggregation function • Problem • Each hash table keeps all the entries • Bottleneck: Memory capacity [*] T. Johnson and D. Chatziantoniou, Extending complex ad-hoc OLAP, in CIKM, 1999, 170-179. 17

  18. Sort/Scan Execution • Simple scan requires large memory • For each hash table, we need to keep all the entries during the scan • When the data is ordered • Some hash entries can be flushed out before the scan is finished • The memory footprint can be reduced • One pass scan becomes feasible • CPU cost is reduced 18

  19. Evaluation t:Day U:IP t:Month U:IP t:Day COUNT3 count(*) COUNT0 count(*) COUNT2 count(*) month 1 month 2 Output stream for each hash table is still ordered! Sort by day 19

  20. Evaluation t:Day U:IP t:Month U:IP t:Day COUNT3 count(*) COUNT0 count(*) COUNT2 count(*) month 1 month 2 All the output stream is ordered by month! Sort by month 20

  21. Evaluation t:Month U:IP By carefully choosing the sort order of the raw data, we can greatly reduce the memory footprint COUNT3 count(*) month 1 month 2 1 1 1 2 Data are sorted by (t:month, U:IP) 21

  22. Order and Slack • Order • How the records are ordered in the stream • E.g., <t:day, U:IP> • Slack • The gap between the output stream of the measure and the scan progress of raw data • E.g., <t:day:[-3,+3]> • We have developed a mechanism to • Calculate the order/slack • Take advantage of the order/slack information during evaluation 22

  23. Evaluation Network Scan sorted data 23

  24. Optimization • How to find a good sort order? • Enumerate all possible orders • For each order estimate the memory usage • Use sort orders with minimal usage • Evaluation with multiple passes • What measure to compute during each pass? • What order to use in each pass? 24

  25. Experiments • 64 million records • Synthetic data set • Scenario 1 • The measures of a region are computed by combining the aggregated measures for different kinds of child region sets • Scenario 2 • The measures of a region are computed by aggregating the measures of multiple chained siblings 25

  26. Experimental Results (cont.) 26

  27. Experimental Results 27

  28. Conclusions • Composite measures as building blocks for complicated analysis process • Algebra provides the semantic foundation • Aggregation workflow offers intuitive interface • Sort/Scan execution plan evaluates multiple dependent measures in the same run • and hence improve the evaluation performance 28

More Related