
# MDL Summarization with Holes

Shaofeng Bu, Laks V.S. Lakshmanan, Raymond T. Ng. University of British Columbia, Canada.


### MDL Summarization with Holes

Shaofeng Bu

Laks V.S. Lakshmanan

Raymond T. Ng

• Multi-dimensional OLAP queries typically produce data-intensive answers

• Often the question is: how to express the large answer set of cells that satisfy the OLAP query conditions:

• Simple enumeration: accurate but not necessarily the most intuitive;

• Summaries: not (necessarily) 100% accurate but can be more intuitive and informative.

• Summarized answers can be more easily understood

Shaofeng Bu UBC

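• Each dimension is associated with a hierarchical tree

[Figure: axis trees for the two dimensions. The clothes tree has children women’s and men’s, with leaf categories women’s jeans, men’s jeans, dress pants, formal wear, dress skirts, jackets, blouses, tops, skirts, and ties. The location tree groups the cities Vancouver, Edmonton, San Jose, San Francisco, Chicago, Minneapolis, Boston, Summit, Albany, and New York under regions including northwest, midwest, and northeast.]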

• Data Cell: (c1,c2), where c1 and c2 are leaf nodes in the axis trees, e.g. (Vancouver, ties)

• Data Region: (x1, y1) describes all data cells covered by the given nodes in the axis trees, e.g.:

• (Vancouver, ties)

• (Vancouver, women’s)

• (midwest, women’s)

[Figure: the clothes × location grid, illustrating data cells and data regions.]
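As an illustration of how a data region expands into the data cells it covers, here is a minimal Python sketch. The parent/child assignments are assumptions made for the example (the exact trees in the figure did not survive extraction), and `leaves` and `cells_of` are hypothetical helper names.

```python
from itertools import product

# Hypothetical fragments of the two axis trees (parent -> children).
LOCATION = {
    "northwest": ["Vancouver", "Edmonton"],
    "midwest": ["Chicago", "Minneapolis"],
}
CLOTHES = {
    "women's": ["blouses", "skirts"],
    "men's": ["ties", "jackets"],
}


def leaves(tree, node):
    """Return the leaf nodes under `node`; a node without children is a leaf."""
    children = tree.get(node)
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves(tree, child))
    return found


def cells_of(region):
    """Expand a data region (x1, y1) into the set of data cells it covers."""
    x1, y1 = region
    return set(product(leaves(LOCATION, x1), leaves(CLOTHES, y1)))


print(cells_of(("Vancouver", "ties")))          # a single data cell
print(len(cells_of(("midwest", "women's"))))    # 2 cities x 2 items = 4
```

A data cell is simply the special case of a region whose two nodes are both leaves.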

• Blue cells: the cells that satisfy the query conditions;

• How to find a summary of the blue cells in a data cube?

[Figure: the clothes × location grid with the cells that satisfy the query shaded blue.]

• MDL: Minimum Description Length

• Use regions to cover the blue cells;

• Length of an MDL description is the number of included regions and cells;

• The MDL problem is to find the description with the minimum length.

### An Example of MDL Summarization

[Figure: the clothes × location grid with the blue cells covered by regions; the surviving region labels are R2, R3, R4, R5, R6, R7, R8, and R9.]

### A Motivating Example: A New Case

• MDL Summarization:

• 10 regions

• 8 single blue cells

• Total length = 18

[Figure: the clothes × location grid with regions labeled R1 to R13; R1, R3, and R9 are marked with a question mark.]

• Yes!

• We present a new compression approach: MDL with Holes:

• Identify regions with blue cells, even if they contain non-blue cells;

• Express the included blue cells by using regions with the exception of the covered non-blue cells;

• Non-blue cells are called holes.

### A Motivating Example: MDL with Holes

• MDL with Holes:

• Length = 6+3+3 = 12

• MDL Approach:

• Length is 18

[Figure: the clothes × location grid annotated with R1+R3-(Vancouver, skirts), R9-(Boston, ties)-(New York, dress skirts), and the note "Plus other 6 regions".]

• The MDL with Holes (MDLH) problem is to find a description with holes that has the minimum length and therefore the maximum benefit.

• In practice, we can drill down on regions to get additional details.

### Definitions: Length & Benefit

[Figure: a small hierarchy with nodes s and t over the leaf cells a to h.]

• Given a set B of data cells (blue cells), an MDLH description for B is:

• D = S - H,

• S is a set of data regions,

• H is a set of data cells, also called ‘holes’,

• D covers exactly the data cells in B.

• Length: the total number of regions and cells included in the description.

• |D| = |S| + |H|

• Benefit: how much shorter the MDLH summary is than the enumeration of B.

• Benefit(D) = |B| - |D|

• B1 = {a, b, c}

• D1 = s - d

• |D1| = 2

• Benefit(D1) = |B1| - |D1| = 1

• B2 = {e, g}

• D2 = t - f - h

• |D2| = 3

• Benefit(D2) = |B2| - |D2| = -1
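The two definitions are simple enough to check mechanically. The following sketch recomputes the length and benefit of D1 and D2; the function names `length` and `benefit` are illustrative, not taken from the talk.

```python
def length(regions, holes):
    """|D| = |S| + |H|: the number of regions plus the number of holes."""
    return len(regions) + len(holes)


def benefit(blue, regions, holes):
    """Benefit(D) = |B| - |D|: cells saved compared with enumerating B."""
    return len(blue) - length(regions, holes)


# D1 = s - d describes B1 = {a, b, c}
print(length({"s"}, {"d"}), benefit({"a", "b", "c"}, {"s"}, {"d"}))        # 2 1

# D2 = t - f - h describes B2 = {e, g}
print(length({"t"}, {"f", "h"}), benefit({"e", "g"}, {"t"}, {"f", "h"}))   # 3 -1
```

A negative benefit, as for D2, means that plain enumeration of the blue cells is shorter than the description with holes.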


• The Generalized MDL Approach for Summarization, Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB 2002

• Reduce description length by allowing non-blue cells to be covered in the regions

• The regions are not pure.

• Concise Descriptions of Subsets of Structured Sets, Alberto O. Mendelzon & Ken Q. Pu, PODS 2003

• Allow Cartesian products to be formed;

• Not purely hierarchical: the NP-completeness result is less surprising;

• What about the purely hierarchical case?

• Intelligent Rollups in Multidimensional OLAP Data, Gayatri Sathe and Sunita Sarawagi, VLDB 2001

• Only report consistent generalization: A tuple can be generalized along a set of dimensions only if it can be generalized along all subsets of dimensions.

• Introduction to MDL with Holes

• A motivating example

• 1-D Case: MDLH is Tractable

• 2-D Case: MDLH is NP-Hard

• Heuristics

• A Greedy Heuristic

• Dynamic Programming

• Experimental Results

• Summarization on Holes: An Extension

• Conclusions & Contributions

### 1-D Case: MDLH is Tractable

[Figure: a one-dimensional hierarchy with internal nodes s, t, u, v, w, x, and y over the leaf cells a to r.]

• MDLH is tractable in the 1-D case: the optimal MDLH description, i.e. the one with the maximum benefit, can be generated in polynomial time.

• Node ‘x’:

• D1 = x - d - f - j

• Benefit(D1) = 7 - 4 = 3

• D2 = (s - d) + e + (u - j)

• Benefit(D2) = 7 - 5 = 2

• Node ‘y’:

• D3 = y - m - p - q - r

• Benefit(D3) = 4 - 5 = -1

• D4 = (v - m) + o

• Benefit(D4) = 4 - 3 = 1

• Node ‘z’:

• D5 = z - d - f - j - m - p - q - r

• Benefit(D5) = 11 - 8 = 3

• D6 = (x - d - f - j) + (v - m) + o

• Benefit(D6) = 11 - 7 = 4
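One plausible bottom-up recurrence for the 1-D case can be sketched as follows; it is an illustration, not necessarily the authors' algorithm. For each node it takes the cheaper of (a) concatenating the best descriptions of its children and (b) naming the node itself and listing every non-blue leaf below it as a hole. The tree shape in `TREE` is an assumption, chosen only to be consistent with the numbers on the slide, and `summarize_1d` is a hypothetical name.

```python
# Assumed hierarchy (parent -> children), consistent with D1..D6 above.
TREE = {
    "z": ["x", "y"],
    "x": ["s", "t", "u"],
    "y": ["v", "w"],
    "s": list("abcd"),
    "t": list("ef"),
    "u": list("ghij"),
    "v": list("klmn"),
    "w": list("opqr"),
}
BLUE = set("abceghiklno")  # the 11 blue leaves; d, f, j, m, p, q, r are non-blue


def summarize_1d(tree, blue, node):
    """Return (best length, number of blue leaves, number of leaves) for `node`."""
    children = tree.get(node)
    if not children:                       # leaf: costs 1 only if it is blue
        is_blue = 1 if node in blue else 0
        return is_blue, is_blue, 1
    parts = [summarize_1d(tree, blue, child) for child in children]
    split_length = sum(p[0] for p in parts)
    n_blue = sum(p[1] for p in parts)
    n_leaves = sum(p[2] for p in parts)
    if n_blue == 0:                        # nothing to describe in this subtree
        return 0, 0, n_leaves
    with_holes = 1 + (n_leaves - n_blue)   # the node itself plus one hole per non-blue leaf
    return min(split_length, with_holes), n_blue, n_leaves


for node in ("x", "y", "z"):
    best, n_blue, _ = summarize_1d(TREE, BLUE, node)
    print(node, "length", best, "benefit", n_blue - best)
# x length 4 benefit 3
# y length 3 benefit 1
# z length 7 benefit 4
```

Each node is visited once, so the sketch runs in time linear in the size of the tree, and under the assumed tree it reproduces the benefits of D1, D4, and D6.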


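### 2-D Case: Optimality is not Preserved Any More

[Figure: a 2-D grid with rows a to g and columns 1 to 7. In the region notation, node i covers all rows and node 8 covers all columns.]

| rows | length | benefit |
|---|---|---|
| (f,8), (g,8) | 3 | 2 |
| (c,8), (d,8), (e,8) | 4 | 0 |
| (a,8), (b,8) | 5 | -2 |

| columns | length | benefit |
|---|---|---|
| (i,1) | 3 | 2 |
| (i,6), (i,7) | 4 | 0 |
| (i,5) | 5 | -2 |

• Optimal Solution:

• {(c,8)+(d,8)+(e,8)+(i,2)+(i,3)+(i,4)} - {(c,2)+(c,3)+(c,4)+(d,2)+(d,3)+(d,4)+(e,2)+(e,3)+(e,4)} + (f,1)+(g,1)+(f,6)+(g,7)

• Length = 19, Benefit = 28 - 19 = 9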
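The figures for this example can be cross-checked with a short script. The blue cells below are reconstructed from the optimal solution itself (an assumption, since the grid figure is not available), and the check then confirms that this reconstruction agrees with the per-row lengths and benefits in the table. `cells` and `row_stats` are hypothetical helper names.

```python
ROWS = "abcdefg"      # leaf rows; node "i" is taken to cover all of them
COLS = range(1, 8)    # leaf columns 1..7; node 8 is taken to cover all of them

# Blue cells implied by the optimal solution (reconstruction, not source data).
BLUE_BY_ROW = {
    "a": {2, 3, 4}, "b": {2, 3, 4},
    "c": {1, 5, 6, 7}, "d": {1, 5, 6, 7}, "e": {1, 5, 6, 7},
    "f": {1, 2, 3, 4, 6}, "g": {1, 2, 3, 4, 7},
}
BLUE = {(r, c) for r, cols in BLUE_BY_ROW.items() for c in cols}


def cells(region):
    """Expand a region such as ("c", 8) or ("i", 2) into its data cells."""
    r, c = region
    rows = ROWS if r == "i" else r
    cols = COLS if c == 8 else [c]
    return {(x, y) for x in rows for y in cols}


def row_stats(row):
    """Length and benefit of describing one full row as a region minus holes."""
    covered = cells((row, 8))
    row_length = 1 + len(covered - BLUE)
    return row_length, len(covered & BLUE) - row_length


regions = [("c", 8), ("d", 8), ("e", 8), ("i", 2), ("i", 3), ("i", 4)]
holes = {(r, c) for r in "cde" for c in (2, 3, 4)}
singles = [("f", 1), ("g", 1), ("f", 6), ("g", 7)]

described = set().union(*(cells(reg) for reg in regions + singles)) - holes
total_length = len(regions) + len(holes) + len(singles)

print(described == BLUE)                                 # True
print(total_length, len(BLUE) - total_length)            # 19 9
print(row_stats("f"), row_stats("c"), row_stats("a"))    # (3, 2) (4, 0) (5, -2)
```

Rows c, d, e and columns 2, 3, 4 each have benefit 0 on their own, yet together they share the same nine holes, which is why the combination reaches a benefit of 9.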

### MDLH is NP-Hard in 2-D Case

• It is NP-hard to find the optimal MDLH description in a 2-D data cube;

• The proof is not trivial: details are in the paper;

• Reduction Strategy: Maximum Induced Subgraph in a Complete Edge-Weighted (CEW) Bipartite Graph → MDL with Holes


• Greedy

• Each time, choose the row/column with the most benefit

• Dynamic Programming

• A bottom-up method to get the description of a region from the descriptions of its children regions

• Quadratic Programming

• Using a quadratic function to represent the benefit of a 2-D data cube

### Example for Comparison with Heuristics

[Figure: a 2-D grid with rows a to d (parent node e) and columns 1 to 9 (parent nodes 10 and 11), with the blue cells shaded.]

• The optimal description for this example:

• (e,1)-(a,1)+(e,2)-(b,2)+(e,3)-(b,3)+(d,4)+(b,5)+(e,6)+(e,8)+(a,11)-(a,8)

• Length = 12

• Benefit = 8

### Heuristics: A Greedy Heuristic

[Figure: the example grid with rows a to d (parent node e) and columns 1 to 9 (parent nodes 10 and 11).]

• Description by Greedy:

• (e,6)+(a,11)+(e,8)-(a,8)+(d,10)-(d,5)+(a,2)+(a,3)+(b,1)+(b,5)+(c,1)+(c,2)+(c,3)

• The length is 13

• The benefit is 20 - 13 = 7

| region | length | benefit | holes |
|---|---|---|---|
| (e,6) | 1 | 3 | - |
| (d,10) | 2 | 2 | (d,5) |
| (e,1) | 2 | 1 | (a,1) |
| (e,2) | 2 | 1 | (b,2) |
| (e,3) | 2 | 1 | (b,3) |
| (e,8) | 2 | 1 | (a,8) |
| (a,11) | 2 | 1 | (a,8) |
| (c,10) | 3 | 0 | (c,4), (c,5) |
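The greedy idea can be sketched in a few lines of Python. This is one possible reading of the heuristic rather than the authors' exact procedure: the candidates are assumed to be row strips (a leaf row within column group 10 or 11) and full columns, the gain of a candidate is recomputed after every pick, and only picks with positive gain are accepted. The blue cells are reconstructed from the descriptions on the slides, and columns 1 to 5 and 6 to 9 are assumed to sit under nodes 10 and 11.

```python
ROWS = ["a", "b", "c", "d"]                               # leaf rows under node "e"
COL_GROUPS = {10: [1, 2, 3, 4, 5], 11: [6, 7, 8, 9]}      # assumed column parents

# Blue cells reconstructed from the slide's descriptions (20 cells).
BLUE_BY_ROW = {
    "a": {2, 3, 6, 7, 9},
    "b": {1, 5, 6, 8},
    "c": {1, 2, 3, 6, 8},
    "d": {1, 2, 3, 4, 6, 8},
}
BLUE = {(r, c) for r, cols in BLUE_BY_ROW.items() for c in cols}


def candidates():
    """Map each candidate region to the set of cells it covers."""
    out = {}
    for row in ROWS:
        for parent, cols in COL_GROUPS.items():
            out[(row, parent)] = {(row, c) for c in cols}
    for cols in COL_GROUPS.values():
        for c in cols:
            out[("e", c)] = {(r, c) for r in ROWS}
    return out


def greedy(blue):
    """Repeatedly pick the candidate with the largest positive marginal gain."""
    cands = candidates()
    regions, holes, covered = [], set(), set()
    while True:
        best, best_gain = None, 0
        for region, region_cells in cands.items():
            new_blue = (region_cells & blue) - covered
            new_holes = (region_cells - blue) - holes
            gain = len(new_blue) - 1 - len(new_holes)     # saved singles - region - holes
            if gain > best_gain:
                best, best_gain = region, gain
        if best is None:
            break
        regions.append(best)
        holes |= cands[best] - blue
        covered |= cands[best] & blue
    return regions, holes, blue - covered                 # leftovers are listed singly


regions, holes, singles = greedy(BLUE)
total_length = len(regions) + len(holes) + len(singles)
print(regions)
print(total_length, len(BLUE) - total_length)             # 13 7
```

Under these assumptions the sketch arrives at the same length (13) and benefit (7) as the greedy description above: once (d,10) is taken, columns 1 to 3 no longer have a positive gain.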

### Greedy: Why is it not optimal?

• Selecting one row/column may reduce the benefit of the remaining rows/columns by more than it gains

[Figure: two copies of the example grid side by side, captioned "Optimal Description" and "Description from Greedy".]

### Heuristics: Dynamic Programming

[Figure: the example grid with rows a to d (parent node e) and columns 1 to 9 (parent nodes 10 and 11), with the labels t1 and t2.]

• L: The Length of a Region

• S: Selection of Rows & Columns

• (a,10) : (a,2) + (a,3)

• L(a,10)=2, S(a,10)=‘t2’

• (e,4) : (d,4)

• L(e,4)=1, S(e,4)=‘t1’

• (d,10): (d,10) - (d,5)

• L(d,10)=2, S(d,10)=‘g’

### Heuristics: Dynamic Programming (2)

[Figure: the example grid with the labels t1 and t2, and a tree showing how D(e,12) is assembled from the descriptions of its child regions.]

• D(x1,x2): description for region (x1,x2)

• D(e,12) = D(e,10) + D(e,11), S(e,12)=‘t2’

• D(e,10) = D(e,1)+D(e,2)+D(e,3)+D(e,4)+D(e,5), S(e,10)=‘t2’

• D(e,11) = D(e,6)+D(e,7)+D(e,8)+D(e,9), S(e,11)=‘t2’

• Column descriptions: (e,1)-(a,1), (e,2)-(b,2), (e,3)-(b,3), (d,4), (b,5), (e,6), (a,7), (e,8)-(a,8), (a,9)

• Generated Description:

• (e,1)-(a,1)+(e,2)-(b,2)+(e,3)-(b,3)+(d,4)+(b,5)+(e,6)+(a,7)+(e,8)-(a,8)+(a,9)

• The length is 13 and the benefit is 20 - 13 = 7

### Dynamic Programming: Why is it not optimal?

• It misses combinations of rows and columns

[Figure: two copies of the example grid side by side, captioned "Description by Dynamic Programming" and "Optimal Description".]

### Heuristics: Quadratic Programming

• Use variables to represent rows/columns; for a variable v:

• v=1: the corresponding row/column is selected;

• v=0: the corresponding row/column is not selected;

• f = - Benefit(D)

• Maximizing the benefit is equivalent to minimizing the value of f

• For the previous example, quadratic programming generates the optimal description;

• Optimality is not guaranteed.
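The slides do not spell out the quadratic function, so the following is only one formulation consistent with the definitions of length and benefit. With a 0/1 variable per row and per column, a cell is covered when `r + c - r*c` equals 1, which makes f quadratic in the variables. The sketch applies this to the earlier 7 × 7 grid (rows a to g, columns 1 to 7) with blue cells reconstructed from that slide's optimal solution, and an exhaustive search over all assignments stands in for a real quadratic programming solver.

```python
from itertools import product

ROWS = "abcdefg"
COLS = range(1, 8)

# Blue cells of the 7 x 7 example, reconstructed from its optimal solution.
BLUE_BY_ROW = {
    "a": {2, 3, 4}, "b": {2, 3, 4},
    "c": {1, 5, 6, 7}, "d": {1, 5, 6, 7}, "e": {1, 5, 6, 7},
    "f": {1, 2, 3, 4, 6}, "g": {1, 2, 3, 4, 7},
}
BLUE = {(r, c) for r, cols in BLUE_BY_ROW.items() for c in cols}


def f(row_vars, col_vars):
    """f = -Benefit(D) for the rows/columns selected by the 0/1 variables."""
    total = sum(row_vars) + sum(col_vars)            # one unit per selected region
    for i, r in enumerate(ROWS):
        for j, c in enumerate(COLS):
            covered = row_vars[i] + col_vars[j] - row_vars[i] * col_vars[j]
            blue = 1 if (r, c) in BLUE else 0
            # A covered non-blue cell is a hole; an uncovered blue cell is listed singly.
            total += covered * (1 - blue) + (1 - covered) * blue
    return total - len(BLUE)


# Exhaustive search over 2**14 assignments (a stand-in for a QP solver).
best_f, best_rows, best_cols = min(
    (f(rv, cv), rv, cv)
    for rv in product((0, 1), repeat=len(ROWS))
    for cv in product((0, 1), repeat=len(COLS))
)

print([r for r, v in zip(ROWS, best_rows) if v])     # ['c', 'd', 'e']
print([c for c, v in zip(COLS, best_cols) if v])     # [2, 3, 4]
print(-best_f)                                       # 9
```

Exhaustive search is exponential in the number of rows and columns, so it only serves to make the objective concrete on a small grid.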


• We ran a set of experiments on the TPC-H benchmark data set;

• We compared the three MDLH heuristics with MDL and GMDL.

### Experimental Results: Comparison of All Methods

• Compression Ratio:

• MDLH-Quadratic generates the most concise descriptions: a yardstick of quality;

• MDLH-Dynamic is a very close second.

• The more children per parent node, the greater the benefit

• Running time & Scalability:

• MDLH-Greedy is the fastest;

• MDLH-Dynamic runs slower than MDLH-Greedy, but it is still scalable w.r.t. the number of cells;

### Extension: Summarization on holes

[Figure: an example grid with row labels a to e and column labels 1 to 10.]

• As the blue density becomes high, a large part of the MDLH description is made up of holes.

• Can we further reduce the total length by summarizing ‘Holes’?

• MDLH description is:

• (a,11)-{(a,6)+(a,8)+(a,9)} + (d,11)-{(d,6)+(d,7)+(d,8)} + (b,6)+(c,8)

• Total length is 10.

• Summarization on holes:

• (a,6)+(a,8)+(a,9) = (a,10)-(a,7)

• (d,6)+(d,7)+(d,8) = (d,10)-(d,9)

• After summarization on holes:

• (a,11)-{(a,10)-(a,7)} + (d,11)-{(d,10)-(d,9)} + (b,6)+(c,8)

• Total length is 8.
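To make the nesting concrete, the sketch below represents a description as a list of `(row, column node, holes)` terms in which `holes` is itself a description, then checks that the flat and the summarized forms denote the same cells with lengths 10 and 8. That node 10 covers columns 6 to 9 follows from the identities above; that node 11 covers columns 1 to 9 is an assumption, and the representation itself is hypothetical.

```python
# Column nodes: 10 covers columns 6..9; 11 is assumed to cover columns 1..9.
COL_CHILDREN = {10: [6, 7, 8, 9], 11: list(range(1, 10))}


def cols_of(node):
    """Leaf columns under a column node (a leaf column covers only itself)."""
    return COL_CHILDREN.get(node, [node])


def cells(desc):
    """Cells denoted by a description: each term is a region minus its holes."""
    out = set()
    for row, col, holes in desc:
        out |= {(row, c) for c in cols_of(col)} - cells(holes)
    return out


def length(desc):
    """One unit per region or cell named, counted recursively through holes."""
    return sum(1 + length(holes) for _, _, holes in desc)


flat = [
    ("a", 11, [("a", 6, []), ("a", 8, []), ("a", 9, [])]),
    ("d", 11, [("d", 6, []), ("d", 7, []), ("d", 8, [])]),
    ("b", 6, []),
    ("c", 8, []),
]
summarized = [
    ("a", 11, [("a", 10, [("a", 7, [])])]),   # holes = (a,10) - (a,7)
    ("d", 11, [("d", 10, [("d", 9, [])])]),   # holes = (d,10) - (d,9)
    ("b", 6, []),
    ("c", 8, []),
]

print(length(flat), length(summarized))       # 10 8
print(cells(flat) == cells(summarized))       # True
```

The saving comes from describing the three holes in each row as one region minus one cell, i.e. two terms instead of three.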

### Conclusions & Contributions

• We presented a new method, MDLH, to compress the answers of OLAP queries;

• We presented a bottom-up algorithm for the 1-D cube;

• We proved the NP-hardness of the MDLH problem in the 2-D case;

• We provided three heuristics for MDLH: greedy, dynamic programming, and quadratic programming;

• We extended MDLH with summarization on holes to further reduce the total length;

• We ran a set of experiments on the TPC-H benchmark data to compare the heuristics.


• Based on the summarization on blue cells and summarization on holes, build a visualization tool with MDLH summarization:

• Return summarized answers to user’s queries;

• Provide drill down operation for users:

• Browse details on blue cells

• Browse details on holes

• Design a k-approximation algorithm for MDLH:

• What is the best quality we can guarantee?
