Composite Subset Measures

Composite Subset Measures Lei Chen, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran University of Wisconsin - Madison Raghu Ramakrishnan Yahoo! Research and University of Wisconsin – Madison 09.12.2006

Motivation • Consider this query: • “For each year and each country, compute the ratio of the average personal incomes between richest city and poorest city . Then find the number of countries where such ratio continuously decrease between 1990-2000“ • It is • Hard to write in SQL • Hard to optimize/understand the SQL query • This kind of queries is increasingly common: • Multi-step aggregation • Must scale to very large datasets, often distributed 2

Contributions • A new framework for expressing such compositional aggregate queries • Key contribution is how we look at the computation, in terms of aggregating over related regions in “cube space” • An efficient evaluation framework based on sorted scans that take into account of multiple aggregation steps • Experimental results 3

Background • Computing “measures” • Measures summarize some characteristic of data subsets (e.g., SUM, std dev, beta-value of a portfolio) • Approaches: Group by, data cubes, Hancock, Sawzall • Cube space • Partition feature space using attribute values; domain hierarchies organize this space into nested collections of regions • Regions: (2006, Korea), (2006/09, Seoul) • Region sets: (Year, Country), (Month, City) 4

Composite Subset Measures • The measure of a cube region is computed by: • Aggregating data in a region directly (e.g., sales volumes for each day), or • Summarizing the measures for related regions, e.g.: • The maximum of daily volumes within a year • The ratio of average personal incomes between the richest and poorest cities in a country 5

What is “Related” in Cube Space • Focus on relationships which • are commonly used • can be efficiently evaluated • Self • Parent/Child • E.g., Year/Day • Child/Parent • E.g., Day/Year • Sibling • E.g., Today/Tomorrow 6

Examples (Network Analysis) • Data involved: • Stream of data records for IP packet information • Time (t), Source (U), Destination (D) , Size (s) • Queries: • For every minute, the number of outgoing packets from each given source IP • For every hour, the maximum number of minutely outgoing packets from a given source IP 7

Expression Algebra • Each measure entity is defined as a collection of region/value pairs • Regions should belong to same region set • Fact Table • Aggregation • Selection • Match join • Combine join 8

Example: Aggregation • For every hour and every unique IP, compute the number of outgoing packets 9

Example: Selection • For every hour, compute the sum of outgoing packets from those source IP with at least five packets in that hour (High traffic count) Source time 10

Example: Match • For each six hour time window, compute the average of the high traffic count 11

Example: Combine • For each hour, compute the ratio between the six hour average and the high traffic count 12

Aggregation Workflows • A diagrammatic way to express multiple composite subset measure expressions • Semantically equivalent to the algebra • Rectangles: Region sets • Ellipses: Measures associated with the Region sets • Arcs: Computational dependencies among measures 13

Measure name Aggregation formula Selection condition Region set Match condition Example 14

Example (cont.) 15

Multi-step Execution Plan • Evaluation based on the topology order of the aggregation workflow • Materialize non-dependent measures • Then evaluate dependent measures • following the arcs of the aggregation workflow • May need to perform join • Problem • Intermediate measures: extra I/O 16

Simple Scan Execution[*] • Build one hash table for each measure • “Insert” data into hash tables of low-level measures • Propagate the measures upwards after the scan is over • Distributive or algebraic aggregation function • Problem • Each hash table keeps all the entries • Bottleneck: Memory capacity [*] T. Johnson and D. Chatziantoniou, Extending complex ad-hoc OLAP, in CIKM, 1999, 170-179. 17

Sort/Scan Execution • Simple scan requires large memory • For each hash table, we need to keep all the entries during the scan • When the data is ordered • Some hash entries can be flushed out before the scan is finished • The memory footprint can be reduced • One pass scan becomes feasible • CPU cost is reduced 18

Evaluation t:Day U:IP t:Month U:IP t:Day COUNT3 count(*) COUNT0 count(*) COUNT2 count(*) month 1 month 2 Output stream for each hash table is still ordered! Sort by day 19

Evaluation t:Day U:IP t:Month U:IP t:Day COUNT3 count(*) COUNT0 count(*) COUNT2 count(*) month 1 month 2 All the output stream is ordered by month! Sort by month 20

Evaluation t:Month U:IP By carefully choosing the sort order of the raw data, we can greatly reduce the memory footprint COUNT3 count(*) month 1 month 2 1 1 1 2 Data are sorted by (t:month, U:IP) 21

Order and Slack • Order • How the records are ordered in the stream • E.g., <t:day, U:IP> • Slack • The gap between the output stream of the measure and the scan progress of raw data • E.g., <t:day:[-3,+3]> • We have developed a mechanism to • Calculate the order/slack • Take advantage of the order/slack information during evaluation 22

Evaluation Network Scan sorted data 23

Optimization • How to find a good sort order? • Enumerate all possible orders • For each order estimate the memory usage • Use sort orders with minimal usage • Evaluation with multiple passes • What measure to compute during each pass? • What order to use in each pass? 24

Experiments • 64 million records • Synthetic data set • Scenario 1 • The measures of a region are computed by combining the aggregated measures for different kinds of child region sets • Scenario 2 • The measures of a region are computed by aggregating the measures of multiple chained siblings 25

Experimental Results (cont.) 26

Experimental Results 27

Conclusions • Composite measures as building blocks for complicated analysis process • Algebra provides the semantic foundation • Aggregation workflow offers intuitive interface • Sort/Scan execution plan evaluates multiple dependent measures in the same run • and hence improve the evaluation performance 28

Composite Subset Measures