
MDL Summarization with Holes

Learn about MDL Summarization, a compression approach for expressing large sets of cells that satisfy OLAP query conditions, and how it can be improved with the concept of holes.


Presentation Transcript


  1. MDL Summarization with Holes • Shaofeng Bu, Laks V.S. Lakshmanan, Raymond T. Ng • University of British Columbia, Canada

  2. Introduction • Multi-dimensional OLAP queries typically produce data-intensive answers • Often the question is how to express the large answer set of cells that satisfy the OLAP query conditions: • Simple enumeration: accurate but not necessarily the most intuitive; • Summaries: not (necessarily) 100% accurate but can be more intuitive and informative. • Summarized answers can be more easily understood Shaofeng Bu UBC

  3. OLAP Data Cube Example • Each dimension is associated with a hierarchical tree [Figure: a 2-D data cube with a clothes axis and a location axis. Clothes axis labels: women's, men's, women's jeans, men's jeans, dress pants, formal wear, dress skirts, jackets, blouses, tops, skirts, ties. Location axis labels: Vancouver, Edmonton, San Jose, San Francisco, Chicago, Minneapolis, Boston, Summit, Albany, New York, grouped under northwest, midwest, and northeast.]

  4. OLAP Data Cube Example • Data Cell: (c1, c2), where c1 and c2 are leaf nodes in the axis trees, e.g. (Vancouver, ties) • Data Region: (x1, y1), where x1 and y1 are nodes in the axis trees; it describes all data cells covered by those nodes, e.g.: • (Vancouver, ties) • (Vancouver, women's) • (northwest, women's) [Figure: the clothes-by-location data cube with its two axis trees.]

  5. OLAP Data Cube Example • Blue cells: the cells that satisfy the query conditions; • How to find a summary of the blue cells in a data cube? [Figure: the clothes-by-location data cube with the blue cells marked.]

  6. MDL Summarization • MDL: Minimum Description Length • Use regions to cover the blue cells; • The length of an MDL description is the number of regions and cells it includes; • The MDL problem is to find the description with the minimum length.

  7. An Example of MDL Summarization [Figure: the clothes-by-location data cube with the blue cells covered by regions R1 to R9.]

  8. A Motivating Example: A New Case • MDL Summarization: 10 regions + 8 single blue cells • Total length = 18 [Figure: the clothes-by-location data cube with region labels R1 to R13; R1, R3, and R9 are marked "?", with the annotation "not blue cells any more".]

  9. Can we do better? • Yes! • We present a new compression approach: MDL with Holes: • Identify regions with blue cells, even if they contain non-blue cells; • Express the included blue cells by using regions with the exception of the covered non-blue cells; • Non-blue cells are called holes.

  10. A Motivating Example: MDL with Holes • MDL with Holes: • R1 - (Vancouver, skirts) and R3 - (Vancouver, skirts), written together as R1 + R3 - (Vancouver, skirts) • R9 - (Boston, ties) - (New York, dress skirts) • Plus the other 6 regions • Length = 6 + 3 + 3 = 12 • MDL Approach: • Length is 18 [Figure: the clothes-by-location data cube with regions R1 to R9; R1, R3, and R9 are marked "?" and contain the holes listed above.]

  11. Problem Statement • The MDL with Holes (MDLH) problem is to find a description with holes that has the minimum length and, equivalently, the maximum benefit. • In practice, we can drill down on regions to get additional details.

  12. Definitions: Length & Benefit • Given a set B of data cells (blue cells), an MDLH description for B is D = S – H, where: • S is a set of data regions, • H is a set of data cells, also called 'holes', • D covers exactly the data cells in B. • Length: total number of regions and cells included in the description: |D| = |S| + |H| • Benefit: how much shorter the MDLH summary is than the enumeration of B: Benefit(D) = |B| – |D| • Example 1: B1 = {a, b, c} • D1 = s – d • |D1| = 2 • Benefit(D1) = |B1| – |D1| = 1 • Example 2: B2 = {e, g} • D2 = t – f – h • |D2| = 3 • Benefit(D2) = |B2| – |D2| = -1 [Figure: a hierarchy tree with nodes x, s, t and leaf cells a to h.]
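The definitions on this slide can be checked with a short Python sketch. The `LEAVES` mapping is an assumption inferred from the slide's two examples (s – d covers {a, b, c}, and t – f – h covers {e, g}); the function names are illustrative and not taken from the presentation.

```python
# Leaves under each region. This tree shape is an assumption inferred from
# the slide's examples: s - d covers {a, b, c}; t - f - h covers {e, g}.
LEAVES = {"s": {"a", "b", "c", "d"}, "t": {"e", "f", "g", "h"}}


def covered_cells(regions, holes):
    """Cells covered by D = S - H."""
    cells = set()
    for r in regions:
        cells |= LEAVES.get(r, {r})  # a single cell covers only itself
    return cells - set(holes)


def description_length(regions, holes):
    """|D| = |S| + |H|."""
    return len(set(regions)) + len(set(holes))


def benefit(blue, regions, holes):
    """Benefit(D) = |B| - |D|; D must cover exactly the blue cells B."""
    assert covered_cells(regions, holes) == set(blue), "D must cover exactly B"
    return len(set(blue)) - description_length(regions, holes)


b1 = benefit({"a", "b", "c"}, ["s"], ["d"])   # D1 = s - d
b2 = benefit({"e", "g"}, ["t"], ["f", "h"])   # D2 = t - f - h
print(b1, b2)  # 1 -1
```

A negative benefit, as for D2, means that simply enumerating the blue cells is shorter than the description with holes.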

  13. Related Work • The Generalized MDL Approach for Summarization, Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB 2002 • Reduces description length by allowing non-blue cells to be covered in the regions • The regions are not pure. • Concise Descriptions of Subsets of Structured Sets, Alberto O. Mendelzon & Ken Q. Pu, PODS 2003 • Allows Cartesian products to be formed; • Not purely hierarchical: the NP-completeness result is less surprising; • What about the purely hierarchical case? • Intelligent Rollups in Multidimensional OLAP Data, Gayatri Sathe and Sunita Sarawagi, VLDB 2001 • Only reports consistent generalizations: a tuple can be generalized along a set of dimensions only if it can be generalized along all subsets of dimensions.

  14. Outline • Introduction to MDL with Holes • A motivating example • 1-D Case: MDLH is Tractable • 2-D Case: MDLH is NP-Hard • Heuristics • A Greedy Heuristic • Dynamic Programming • Quadratic Programming • Experimental Results • Summarization on Holes: An Extension • Conclusions & Contributions

  15. 1-D Case: MDLH is Tractable • MDLH is tractable: in the 1-D case, the optimal MDLH description, which has the maximum benefit, can be generated in polynomial time. • Node 'x': • D1 = x – d – f – j • Benefit(D1) = 7 – 4 = 3 • D2 = (s – d) + e + (u – j) • Benefit(D2) = 7 – 5 = 2 • Node 'y': • D3 = y – m – p – q – r • Benefit(D3) = 4 – 5 = -1 • D4 = (v – m) + o • Benefit(D4) = 4 – 3 = 1 • Node 'z': • D5 = z – d – f – j – m – p – q – r • Benefit(D5) = 11 – 8 = 3 • D6 = (x – d – f – j) + (v – m) + o • Benefit(D6) = 11 – 7 = 4 [Figure: a hierarchy tree with nodes z, x, y, s, t, u, v, w and leaf cells a to r.]
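One way to see why the 1-D case is tractable is a bottom-up pass over the hierarchy: at each node, either keep the best descriptions of the children or replace them with the node itself minus its non-blue cells, whichever is shorter. The Python sketch below illustrates that idea; it is not the authors' algorithm as published. The small tree (x over s and t, with leaves a to h) is an assumption inferred from the Length & Benefit examples, and all names are hypothetical.

```python
def best_description(node, children, blue):
    """Return (length, terms, holes, leaves) for the subtree rooted at node.

    terms are the regions/cells kept in the description, holes the excluded
    non-blue cells, leaves all leaf cells below node.
    """
    kids = children.get(node, [])
    if not kids:  # leaf cell: listed only if it is blue
        return (1, [node], [], [node]) if node in blue else (0, [], [], [node])

    length, terms, holes, leaves = 0, [], [], []
    for kid in kids:  # option 1: concatenate the children's best descriptions
        l, t, h, lv = best_description(kid, children, blue)
        length, terms, holes, leaves = length + l, terms + t, holes + h, leaves + lv

    non_blue = [c for c in leaves if c not in blue]
    whole = 1 + len(non_blue)  # option 2: the node itself minus its holes
    if len(non_blue) < len(leaves) and whole < length:
        return whole, [node], non_blue, leaves
    return length, terms, holes, leaves


# Assumed tree: x -> s, t; s -> a..d; t -> e..h (inferred, not given explicitly).
children = {"x": ["s", "t"], "s": list("abcd"), "t": list("efgh")}
blue = {"a", "b", "c", "e", "g"}
length, terms, holes, _ = best_description("x", children, blue)
print(length, terms, holes)  # 4 ['s', 'e', 'g'] ['d']
```

Each node is visited once, so the pass is polynomial in the size of the tree. In this example s – d beats enumerating a, b, c, while t – f – h is longer than listing e and g, so the result mixes a region with holes and single cells.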

  16. Outline • Introduction to MDL with Holes • A motivating example • 1-D Case: MDLH is Tractable • 2-D Case: MDLH is NP-Hard • Heuristics • A Greedy Heuristic • Dynamic Programming • Quadratic Programming • Experimental Results • Summarization on Holes: An Extension • Conclusions & Contributions

  17. 2-D Case: Optimality is not Preserved Any More • Rows (regions: length, benefit): (f,8), (g,8): 3, 2; (c,8), (d,8), (e,8): 4, 0; (a,8), (b,8): 5, -2 • Columns (regions: length, benefit): (i,1): 3, 2; (i,2), (i,3), (i,4): 4, 0; (i,5): 5, -2; (i,6), (i,7) • Optimal Solution: {(c,8)+(d,8)+(e,8)+(i,2)+(i,3)+(i,4)} - {(c,2)+(c,3)+(c,4)+(d,2)+(d,3)+(d,4)+(e,2)+(e,3)+(e,4)} + (f,1)+(g,1)+(f,6)+(g,7) • Length = 19, Benefit = 28 - 19 = 9 [Figure: a 2-D grid with column labels a to g and i, and row labels 1 to 8.]

  18. MDLH is NP-Hard in 2-D Case • It is NP-Hard to find the optimal MDLH description in a 2-D data cube; • Not a trivial proof: details are in the paper; • Reduction Strategy: Clique → Maximum Induced Subgraph in Complete Edge-Weighted (CEW) Bipartite Graph → MDL with Holes

  19. Outline • Introduction to MDL with Holes • A motivating example • 1-D Case: MDLH is Tractable • 2-D Case: MDLH is NP-Hard • Heuristics • A Greedy Heuristic • Dynamic Programming • Quadratic Programming • Experimental Results • Summarization on Holes: An Extension • Conclusions & Contributions

  20. Heuristics for MDLH • Greedy • At each step, choose the row/column with the largest benefit • Dynamic Programming • A bottom-up method that builds the description of a region from the descriptions of its child regions • Quadratic Programming • Use a quadratic function to represent the benefit of a 2-D data cube
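The greedy rule can be illustrated with a simplified Python sketch. It assumes a flat grid in which the only candidate regions are whole rows and whole columns, and it scores a candidate by its marginal benefit (newly covered blue cells, minus one for the region, minus any new holes). The slides do not spell out the exact scoring or tie-breaking, so those details, along with every name below, are assumptions.

```python
def greedy_mdlh(rows, cols, blue):
    """Greedy sketch: repeatedly pick the row/column with the largest gain."""
    candidates = {("row", r): {(r, c) for c in cols} for r in rows}
    candidates.update({("col", c): {(r, c) for r in rows} for c in cols})

    chosen, covered, holes = [], set(), set()
    while True:
        best, best_gain = None, 0
        for name, cells in candidates.items():
            if name in chosen:
                continue
            new_blue = (cells & blue) - covered   # blue cells newly covered
            new_holes = (cells - blue) - holes    # a shared hole counts once
            gain = len(new_blue) - 1 - len(new_holes)
            if gain > best_gain:                  # ties keep the first found
                best, best_gain = name, gain
        if best is None:                          # no candidate helps any more
            break
        chosen.append(best)
        covered |= candidates[best] & blue
        holes |= candidates[best] - blue

    singles = sorted(blue - covered)              # leftover blue cells
    length = len(chosen) + len(holes) + len(singles)
    return chosen, sorted(holes), singles, length


rows, cols = [1, 2, 3], ["a", "b", "c", "d", "e"]
blue = {(1, "a"), (1, "b"), (1, "c"), (1, "d"), (2, "a"), (3, "a"), (2, "c")}
chosen, holes, singles, length = greedy_mdlh(rows, cols, blue)
print(chosen, holes, singles, length)
# [('row', 1), ('col', 'a')] [(1, 'e')] [(2, 'c')] 4
```

On this toy grid the description is row 1 minus (1, e), plus column a, plus the single cell (2, c), for a length of 4 against 7 blue cells. Because each choice is made locally, nothing in this procedure guarantees an optimal result.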

  21. Example for Comparison with Heuristics • The optimal description for this example: (e,1)-(a,1)+(e,2)-(b,2)+(e,3)-(b,3)+(d,4)+(b,5)+(e,6)+(e,8)+(a,11)-(a,8) • Length = 12, Benefit = 8 [Figure: a 2-D grid with column labels a to e and row labels 1 to 12.]

  22. Heuristics: A Greedy Heuristic • Candidate regions (region: length, benefit, holes): (e,6): 1, 3, none; (d,10): 2, 2, (d,5); (e,1): 2, 1, (a,1); (e,2): 2, 1, (b,2); (e,3): 2, 1, (b,3); (e,8): 2, 1, (a,8); (a,11): 2, 1, (a,8); (c,10): 3, 0, (c,4)(c,5) • Description by Greedy: (e,6)+(a,11)+(e,8)-(a,8)+(d,10)-(d,5)+(a,2)+(a,3)+(b,1)+(b,5)+(c,1)+(c,2)+(c,3) • The length is 13 • The benefit is 20 - 13 = 7 [Figure: a 2-D grid with column labels a to e and row labels 1 to 12.]

  23. Greedy: Why is it not optimal? • Selecting a row/column may reduce the total benefit obtainable from the remaining rows/columns [Figure: two grids side by side, the optimal description and the description from greedy.]

  24. Heuristics: Dynamic Programming • L: the length of a region • S: selection of rows & columns • (a,10): (a,2) + (a,3) • L(a,10)=2, S(a,10)='t2' • (e,4): (d,4) • L(e,4)=1, S(e,4)='t1' • (d,10): (d,10) – (d,5) • L(d,10)=2, S(d,10)='g' [Figure: a 2-D grid with column labels a to e and row labels 1 to 12; the two axis trees are labeled t1 and t2.]

  25. Heuristics: Dynamic Programming (2) • D(x1,x2): description for region (x1,x2) • D(e,12) = D(e,10) + D(e,11) • D(e,10) = D(e,1)+D(e,2)+D(e,3)+D(e,4)+D(e,5) • D(e,11) = D(e,6)+D(e,7)+D(e,8)+D(e,9) • Row descriptions: D(e,1) = (e,1)-(a,1); D(e,2) = (e,2)-(b,2); D(e,3) = (e,3)-(b,3); D(e,4) = (d,4); D(e,5) = (b,5); D(e,6) = (e,6); D(e,7) = (a,7); D(e,8) = (e,8)-(a,8); D(e,9) = (a,9) • S(e,12)='t2', S(e,10)='t2', S(e,11)='t2' • Generated Description: (e,1)-(a,1)+(e,2)-(b,2)+(e,3)-(b,3)+(d,4)+(b,5)+(e,6)+(a,7)+(e,8)-(a,8)+(a,9) • The length is 13 and the benefit is 20 - 13 = 7 [Figure: a 2-D grid with column labels a to e and row labels 1 to 12; the two axis trees are labeled t1 and t2.]

  26. Dynamic Programming: Why is it not optimal? • It misses the combination of rows and columns [Figure: two grids side by side, the description by dynamic programming and the optimal description.]

  27. Heuristics: Quadratic Programming • Use variables to represent rows/columns; for a variable v: • v=1: the corresponding row/column is selected; • v=0: the corresponding row/column is not selected; • f = – Benefit(D) • Maximizing the benefit is equivalent to minimizing the value of f • For the previous example, quadratic programming generates the optimal description; • In general, optimality is not guaranteed.

  28. Outline • Introduction to MDL with Holes • A motivating example • 1-D Case: MDLH is Tractable • 2-D Case: MDLH is NP-Hard • Heuristics • A Greedy Heuristic • Dynamic Programming • Quadratic Programming • Experimental Results • Summarization on Holes: An Extension • Conclusions & Contributions

  29. Experiments • We ran a set of experiments on the TPC-H benchmark data set; • We compared the three MDLH heuristics with MDL and GMDL.

  30. Experimental Results: Comparison of All Methods • Compression Ratio: • MDLH-Quadratic generates the most concise descriptions: a yardstick of quality; • MDLH-Dynamic is a very close second.

  31. Experimental Results: Compression Ratio • The more children per parent node, the greater the benefit

  32. Experimental Results: Summary • Running time & Scalability: • MDLH-Greedy is the fastest; • MDLH-Dynamic runs slower than MDLH-Greedy, but it is still scalable w.r.t. the number of cells.

  33. Outline • Introduction to MDL with Holes • A motivating example • 1-D Case: MDLH is Tractable • 2-D Case: MDLH is NP-Hard • Heuristics • A Greedy Heuristic • Dynamic Programming • Quadratic Programming • Experimental Results • Summarization on Holes: An Extension • Conclusions & Contributions

  34. Extension: Summarization on Holes • As the blue density becomes high, a large part of the MDLH description is made up of holes. • Can we further reduce the total length by summarizing the holes? • The MDLH description is: • (a,11)-{(a,6)+(a,8)+(a,9)} + (d,11)-{(d,6)+(d,7)+(d,8)} + (b,6)+(c,8) • Total length is 10. • Summarization on holes: • (a,6)+(a,8)+(a,9) = (a,10)-(a,7) • (d,6)+(d,7)+(d,8) = (d,10)-(d,9) • After summarization on holes: • (a,11)-{(a,10)-(a,7)} + (d,11)-{(d,10)-(d,9)} + (b,6)+(c,8) • Total length is 8. [Figure: a 2-D grid with column labels a to e and row labels 1 to 11.]
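The two lengths on this slide can be reproduced by counting every region and cell that appears in a nested description. The Python sketch below uses a hypothetical encoding (not from the presentation): a term is either a string naming a region or cell, or a pair of a region and the list of terms excluded from it.

```python
def term_length(term):
    """Count the regions and cells in one (possibly nested) term."""
    if isinstance(term, str):          # a plain region or cell
        return 1
    region, exceptions = term          # region minus its exceptions
    return 1 + sum(term_length(e) for e in exceptions)


def total_length(description):
    return sum(term_length(t) for t in description)


# MDLH description with holes listed one by one.
flat = [
    ("(a,11)", ["(a,6)", "(a,8)", "(a,9)"]),
    ("(d,11)", ["(d,6)", "(d,7)", "(d,8)"]),
    "(b,6)",
    "(c,8)",
]

# The same description after summarizing the holes as regions with exceptions.
summarized = [
    ("(a,11)", [("(a,10)", ["(a,7)"])]),
    ("(d,11)", [("(d,10)", ["(d,9)"])]),
    "(b,6)",
    "(c,8)",
]

print(total_length(flat), total_length(summarized))  # 10 8
```

Each group of three holes is replaced by one region and one exception, which saves one term per group and accounts for the drop from 10 to 8.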

  35. Conclusions & Contributions • We presented a new method, MDLH, to compress the answers of OLAP queries; • We presented a bottom-up algorithm for the 1-D cube; • We proved the NP-Hardness of the MDLH problem; • We provided three heuristics for MDLH: greedy, dynamic programming, and quadratic programming; • We extended MDLH with summarization on holes to further reduce the total length; • We ran a set of experiments on the TPC-H benchmark data to compare the heuristics.

  36. Ongoing Work • Based on the summarization on blue cells and the summarization on holes, build a visualization tool with MDLH summarization: • Return summarized answers to users' queries; • Provide a drill-down operation for users: • Browse details on blue cells • Browse details on holes • Design a k-approximation algorithm for MDLH: • What is the best quality we can guarantee?
