
MDL Summarization with Holes - PowerPoint PPT Presentation



MDL Summarization with Holes. Shaofeng Bu Laks V.S. Lakshmanan Raymond T. Ng University of British Columbia, Canada. Introduction. Multi-dimensional OLAP queries typically produce data intensive answers

Presentation Transcript

MDL Summarization with Holes

Shaofeng Bu

Laks V.S. Lakshmanan

Raymond T. Ng

University of British Columbia, Canada


Introduction

  • Multi-dimensional OLAP queries typically produce data-intensive answers

  • Often the question is how to express the large answer set of cells that satisfy the OLAP query conditions:

    • Simple enumeration: accurate, but not necessarily the most intuitive;

    • Summaries: not (necessarily) 100% accurate, but can be more intuitive and informative;

    • Summarized answers can be more easily understood.

Shaofeng Bu UBC


OLAP Data Cube Example

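  • Each dimension is associated with a hierarchical tree

[Figure: a 2-D data cube with two axis trees. The clothes axis has the nodes women’s and men’s above the leaves women’s jeans, men’s jeans, dress pants, formal wear, dress skirts, jackets, blouses, tops, skirts and ties. The location axis has the nodes northwest, midwest and northeast above the leaves Vancouver, Edmonton, San Jose, San Francisco, Chicago, Minneapolis, Boston, Summit, Albany and New York.]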


OLAP Data Cube Example

  • Data Cell: a pair (c1, c2) where c1 and c2 are leaf nodes in the axis trees, e.g. (Vancouver, ties)

  • Data Region: a pair (x1, y1) of nodes in the axis trees; it describes all data cells covered by those nodes, e.g.:

    • (Vancouver, ties)

    • (Vancouver, women’s)

    • (midwest, women’s)

[Figure: the clothes × location data cube with its two axis trees.]
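The relation between data regions and data cells can be sketched in a few lines of Python. This is only an illustration: the two trees below are a reduced subset of the slide's axes, and the grouping of leaves under each parent is an assumption, since the figure itself is not part of the transcript. The helper names `leaves` and `region_cells` are hypothetical.

```python
# Assumed, reduced axis trees: the parent/child grouping is illustrative only.
LOCATION = {"midwest": ["Chicago", "Minneapolis"],
            "northwest": ["Vancouver", "Edmonton"]}
CLOTHES = {"women's": ["blouses", "skirts"],
           "men's": ["ties", "men's jeans"]}

def leaves(tree, node):
    """Return the leaf nodes under `node`; a leaf covers only itself."""
    return tree.get(node, [node])

def region_cells(x, y):
    """All data cells covered by the data region (x, y)."""
    return {(loc, item) for loc in leaves(LOCATION, x) for item in leaves(CLOTHES, y)}

print(region_cells("Vancouver", "ties"))            # a region that is a single data cell
print(sorted(region_cells("Vancouver", "women's")))  # one city, every women's item
print(sorted(region_cells("midwest", "women's")))    # every midwest city, every women's item
```

A region whose two coordinates are both leaves is simply a data cell, which is why the slide lists (Vancouver, ties) under both definitions.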


OLAP Data Cube Example

  • Blue cells: the cells that satisfy the query conditions;

  • How can we find a summary of the blue cells in a data cube?

[Figure: the clothes × location data cube with the cells that satisfy the query shown in blue.]


MDL Summarization

  • MDL: Minimum Description Length

    • Use regions to cover the blue cells;

    • The length of an MDL description is the number of regions and cells it includes;

    • The goal of MDL is to find the description with the minimum length.



An Example of MDL Summarization

[Figure: the clothes × location data cube with the blue cells covered by nine regions, R1 to R9.]


A Motivating Example: A New Case

[Figure: the clothes × location data cube in which some cells are not blue any more. The regions R1, R3 and R9 are marked with question marks; the other regions shown are R2, R4 to R8 and R10 to R13.]

  • MDL Summarization:

    • 10 regions

    • 8 single blue cells

    • Total length = 18


Can we do better?

  • Yes!

  • We present a new compression approach, MDL with Holes:

    • Identify regions with blue cells, even if they also contain non-blue cells;

    • Express the blue cells in such a region as the region minus the non-blue cells it covers;

    • The non-blue cells are called holes.



A Motivating Example: MDL with Holes

[Figure: the clothes × location data cube with the regions R1 to R9. R1, R3 and R9 are marked with question marks and annotated R1-(Vancouver,Skirts), R3-(Vancouver,Skirts) and R9-(Boston,ties)-(New York, dress skirts).]

  • MDL with Holes:

    • R1+R3-(Vancouver,Skirts)

    • R9-(Boston,ties)-(New York, dress skirts)

    • Plus other 6 regions

    • Length = 6+3+3=12

  • MDL Approach:

    • Length is 18


Problem Statements

  • MDL with Holes (MDLH): find a description with holes that has the minimum length, and therefore the maximum benefit.

  • In practice, we can drill down on regions to get additional details.



Definitions: Length & Benefit

[Figure: a hierarchy with the nodes x, s and t over the leaf cells a to h.]

  • Given a set B of data cells (blue cells), an MDLH description for B:

    • D = S – H,

      • S is a set of data regions,

      • H is a set of data cells, also called ‘holes’,

      • D covers exactly the data cells in B.

    • Length: total number of regions and cells included in the description.

      |D| = |S| + |H|

    • Benefit: how much shorter the MDLH summary is than the enumeration of B.

      Benefit(D) = |B| – |D|

  • B1 = {a, b, c}

    • D1 = s – d

    • |D1| = 2

    • Benefit(D1) = |B1| – |D1| = 1

  • B2 = {e, g}

    • D2 = t – f – h

    • |D2| = 3

    • Benefit(D2) = |B2| – |D2| = –1
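The two definitions can be checked mechanically. The sketch below assumes that region s covers the cells a to d and region t covers the cells e to h; this grouping is inferred from the two examples, not stated on the slide. The function names are hypothetical.

```python
# Assumed hierarchy: s covers the cells a-d and t covers e-h (inferred from the examples).
REGIONS = {"s": {"a", "b", "c", "d"}, "t": {"e", "f", "g", "h"}}

def cells_of(name):
    """A region name expands to its cells; any other name is a single cell."""
    return REGIONS.get(name, {name})

def covered(S, H):
    """Cells described by D = S - H."""
    cells = set().union(*(cells_of(x) for x in S))
    return cells - set(H)

def length(S, H):
    """|D| = |S| + |H|."""
    return len(S) + len(H)

def benefit(B, S, H):
    """Benefit(D) = |B| - |D|, defined only if D covers exactly B."""
    assert covered(S, H) == set(B), "D must cover exactly the blue cells"
    return len(B) - length(S, H)

print(benefit({"a", "b", "c"}, S=["s"], H=["d"]))    # D1 = s - d
print(benefit({"e", "g"}, S=["t"], H=["f", "h"]))     # D2 = t - f - h
```

A negative benefit, as for D2, means the description with holes is longer than simply listing the blue cells.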



Related Work

  • The Generalized MDL Approach for Summarization, Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB 2002

    • Reduces the description length by allowing non-blue cells to be covered by the regions;

    • The regions are not pure.

  • Concise Descriptions of Subsets of Structured Sets, Alberto O. Mendelzon & Ken Q. Pu, PODS 2003

    • Allows Cartesian products to be formed;

    • Not purely hierarchical: the NP-completeness result is less surprising;

    • What about the purely hierarchical case?

  • Intelligent Rollups in Multidimensional OLAP Data, Gayatri Sathe and Sunita Sarawagi, VLDB 2001

    • Only reports consistent generalizations: a tuple can be generalized along a set of dimensions only if it can be generalized along all subsets of those dimensions.


Outline

  • Introduction to MDL with Holes

    • A motivating example

  • 1-D Case: MDLH is Tractable

  • 2-D Case: MDLH is NP-Hard

  • Heuristics

    • A Greedy Heuristic

    • Dynamic Programming

    • Quadratic Programming

  • Experimental Results

  • Summarization on Holes: An Extension

  • Conclusions & Contributions



1-D Case: MDLH is Tractable

[Figure: a 1-D hierarchy with the root z, the internal nodes x, y, s, t, u, v and w, and the leaf cells a to r.]

  • MDLH is tractable: in the 1-D case, the optimal MDLH description, which has the maximum benefit, can be generated in polynomial time.

  • ‘x’

    • D1 = x – d – f – j

      • Benefit(D1) = 7 – 4 = 3

    • D2 = (s – d) + e + (u – j)

      • Benefit(D2) = 7 – 5 = 2

  • ‘y’

    • D3 = y – m – p – q – r

      • Benefit(D3) = 4 – 5 = –1

    • D4 = (v – m) + o

      • Benefit(D4) = 4 – 3 = 1

  • ‘z’

    • D5 = z – d – f – j – m – p – q – r

      • Benefit(D5) = 11 – 8 = 3

    • D6 = (x – d – f – j) + (v – m) + o

      • Benefit(D6) = 11 – 7 = 4
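The numbers above are consistent with a simple bottom-up recurrence: at each node, compare "the node minus all its holes" against "the sum of the children's best descriptions" and keep the shorter one. The sketch below is one such recurrence, not necessarily the algorithm from the paper. The tree shape and the set of blue cells are assumptions inferred from the descriptions D1 to D6, since the figure is not in the transcript.

```python
# Tree and blue cells inferred from the slide's descriptions D1-D6 (an assumption).
CHILDREN = {"z": ["x", "y"], "x": ["s", "t", "u"], "y": ["v", "w"],
            "s": list("abcd"), "t": list("ef"), "u": list("ghij"),
            "v": list("klmn"), "w": list("opqr")}
BLUE = set("abceghiklno")  # 11 blue cells

def leaves(node):
    """Leaf cells under `node` (a leaf is its own only cell)."""
    kids = CHILDREN.get(node)
    return [node] if kids is None else [l for k in kids for l in leaves(k)]

def best(node):
    """Return (length, regions, holes) of a shortest description under `node`."""
    cells = leaves(node)
    if not any(c in BLUE for c in cells):
        return 0, [], []                       # nothing to describe here
    holes = [c for c in cells if c not in BLUE]
    whole = (1 + len(holes), [node], holes)    # the node itself minus its holes
    if node not in CHILDREN:
        return whole                           # a blue leaf costs 1
    parts = [best(k) for k in CHILDREN[node]]
    split = (sum(p[0] for p in parts),
             [r for p in parts for r in p[1]],
             [h for p in parts for h in p[2]])
    return whole if whole[0] < split[0] else split

length, regions, holes = best("z")
print(length, regions, holes, len(BLUE) - length)
```

Under these assumptions the result for z is the description D6, (x – d – f – j) + (v – m) + o, with length 7 and benefit 4. Each node is visited once, which is in line with the polynomial-time claim, although this sketch recomputes the leaf lists at every node for brevity.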




2-D Case: Optimality is not Preserved Any More

[Figure: a 2-D grid with the row labels a to g and i, and the column labels 1 to 8.]

rows               length  benefit
(f,8),(g,8)        3       2
(c,8),(d,8),(e,8)  4       0
(a,8),(b,8)        5       -2

columns            length  benefit
(i,1)              3       2
(i,2),(i,3),(i,4)  4       0
(i,5)              5       -2
(i,6),(i,7)

  • Optimal Solution:

    {(c,8)+(d,8)+(e,8)+(i,2)+(i,3)+(i,4)}
    -{(c,2)+(c,3)+(c,4)+(d,2)+(d,3)+(d,4)+(e,2)+(e,3)+(e,4)}
    +(f,1)+(g,1)+(f,6)+(g,7)

  • Length = 19, Benefit = 28-19 = 9


MDLH is NP-Hard in 2-D Case

  • It is NP-hard to find the optimal MDLH description in a 2-D data cube;

  • Not a trivial proof: details are in the paper;

  • Reduction strategy:

    Clique → Maximum Induced Subgraph in Complete Edge-Weighted (CEW) Bipartite Graph → MDL with Holes


Heuristics for MDLH

  • Greedy

    • At each step, choose the row/column with the most benefit

  • Dynamic Programming

    • A bottom-up method that builds the description of a region from the descriptions of its child regions

  • Quadratic Programming

    • Uses a quadratic function to represent the benefit of a 2-D data cube



Example for Comparison with Heuristics

[Figure: a 2-D grid with the row labels a to e and the column labels 1 to 12, with the blue cells marked.]

  • The optimal description for this example:

    (e,1)-(a,1)+(e,2)-(b,2)+(e,3)-(b,3)+(d,4)+(b,5)+(e,6)+(e,8)+(a,11)-(a,8)

    Length = 12

    Benefit = 8



Heuristics: A Greedy Heuristic

[Figure: the example grid with the row labels a to e and the column labels 1 to 12.]

region   length  benefit  holes
(e,6)    1       3        -
(d,10)   2       2        (d,5)
(e,1)    2       1        (a,1)
(e,2)    2       1        (b,2)
(e,3)    2       1        (b,3)
(e,8)    2       1        (a,8)
(a,11)   2       1        (a,8)
(c,10)   3       0        (c,4)(c,5)

Description by Greedy:

(e,6)+(a,11)+(e,8)-(a,8)+(d,10)-(d,5)+(a,2)+(a,3)+(b,1)+(b,5)+(c,1)+(c,2)+(c,3)

The length is 13

The benefit is 20-13 = 7
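A greedy selection of this kind can be sketched as follows. Several things here are assumptions rather than facts from the slides: the grid of blue cells is reconstructed from the descriptions in this example (rows a to d under e; columns 1 to 5 under 10, columns 6 to 9 under 11, everything under 12), the candidate set is restricted to single rows over a column node and single columns over the row root, and the marginal-benefit rule (newly covered blue cells, minus 1 for the region, minus newly created holes) is one plausible reading of "the most benefit".

```python
# Blue cells reconstructed from the slide's descriptions (an assumption).
BLUE_COLS = {"a": [2, 3, 6, 7, 9], "b": [1, 5, 6, 8],
             "c": [1, 2, 3, 6, 8], "d": [1, 2, 3, 4, 6, 8]}
BLUE = {(r, c) for r, cols in BLUE_COLS.items() for c in cols}
ROWS = ["a", "b", "c", "d"]                       # leaves under the row root e
COL_NODES = {10: [1, 2, 3, 4, 5], 11: [6, 7, 8, 9],
             12: [1, 2, 3, 4, 5, 6, 7, 8, 9]}     # column nodes and their leaves

def candidates():
    """Candidate strips: one row over a column node, or one column over row root e."""
    strips = {}
    for r in ROWS:
        for node, cols in COL_NODES.items():
            strips[(r, node)] = {(r, c) for c in cols}
    for c in COL_NODES[12]:
        strips[("e", c)] = {(r, c) for r in ROWS}
    return strips

def greedy(blue):
    strips = candidates()
    chosen, holes, covered = [], set(), set()

    def gain(cells):
        new_blue = (cells & blue) - covered       # blue cells not yet described
        new_holes = (cells - blue) - holes        # holes not yet paid for
        return len(new_blue) - 1 - len(new_holes)

    while strips:
        best = max(sorted(strips), key=lambda k: gain(strips[k]))
        if gain(strips[best]) <= 0:
            break                                 # no strip shortens the description
        cells = strips.pop(best)
        chosen.append(best)
        covered |= cells & blue
        holes |= cells - blue
    singles = sorted(blue - covered)              # remaining blue cells listed one by one
    length = len(chosen) + len(holes) + len(singles)
    return chosen, sorted(holes), singles, length

chosen, holes, singles, length = greedy(BLUE)
print(chosen, holes, singles, length, len(BLUE) - length)
```

On the reconstructed grid this picks (e,6), (d,10), (e,8) and (a,11), with the holes (a,8) and (d,5) and seven single cells, for a length of 13 and a benefit of 7, which agrees with the description by Greedy on the slide. Note that the hole (a,8) is shared by (e,8) and (a,11) and is counted once.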



Greedy: Why is it not optimal?

  • Selecting one row/column can reduce the benefit still available from the other rows/columns by more than it gains

[Figure: the example grid shown twice, once with the optimal description and once with the description from Greedy.]


Heuristics: Dynamic Programming

[Figure: the example grid with the row labels a to e and the column labels 1 to 12; the labels t1 and t2 appear on the figure.]

L: The Length of a Region

S: Selection of Rows & Columns

  • (a,10) : (a,2) + (a,3)

    • L(a,10)=2, S(a,10)=‘t2’

  • (e,4) : (d,4)

    • L(e,4)=1, S(e,4)=‘t1’

  • (d,10): (d,10) – (d,5)

    • L(d,10)=2, S(d,10)=‘g’


Heuristics: Dynamic Programming(2)

[Figure: the example grid with the row labels a to e and the column labels 1 to 12; the labels t1 and t2 appear on the figure.]

D(x1,x2): description for region (x1,x2)

  • D(e,12) = D(e,10)+D(e,11), S(e,12)=‘t2’

  • D(e,10) = D(e,1)+D(e,2)+D(e,3)+D(e,4)+D(e,5), S(e,10)=‘t2’

    • (e,1)-(a,1), (e,2)-(b,2), (e,3)-(b,3), (d,4), (b,5)

  • D(e,11) = D(e,6)+D(e,7)+D(e,8)+D(e,9), S(e,11)=‘t2’

    • (e,6), (a,7), (e,8)-(a,8), (a,9)

Generated Description:

(e,1)-(a,1)+(e,2)-(b,2)+(e,3)-(b,3)+(d,4)+(b,5)+(e,6)+(a,7)+(e,8)-(a,8)+(a,9)

The length is 13 and the benefit is 20-13 = 7
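The L and S values on these two slides are consistent with a three-way choice per region, sketched below. The reading of the labels is an assumption: 'g' is taken to mean "use the region itself minus its holes", 't1' to mean "split along the row tree" and 't2' to mean "split along the column tree". The grid of blue cells is the same reconstruction from the slide's descriptions, not data taken from the figure.

```python
from functools import lru_cache

# Blue cells and axis trees reconstructed from the slide's descriptions (an assumption).
BLUE_COLS = {"a": [2, 3, 6, 7, 9], "b": [1, 5, 6, 8],
             "c": [1, 2, 3, 6, 8], "d": [1, 2, 3, 4, 6, 8]}
BLUE = {(r, c) for r, cols in BLUE_COLS.items() for c in cols}
ROW_CHILDREN = {"e": ["a", "b", "c", "d"]}
COL_CHILDREN = {12: [10, 11], 10: [1, 2, 3, 4, 5], 11: [6, 7, 8, 9]}

def leaves(node, children):
    """Leaf nodes under `node` in the given axis tree."""
    kids = children.get(node)
    if kids is None:
        return [node]
    return [leaf for k in kids for leaf in leaves(k, children)]

@lru_cache(maxsize=None)
def best(row, col):
    """Return (L, S): shortest length found for region (row, col) and how it was obtained."""
    cells = {(r, c) for r in leaves(row, ROW_CHILDREN) for c in leaves(col, COL_CHILDREN)}
    blue = cells & BLUE
    if not blue:
        return 0, "empty"
    options = [(1 + len(cells - blue), "g")]          # region minus its holes
    if row in ROW_CHILDREN:                           # split along the row tree
        options.append((sum(best(k, col)[0] for k in ROW_CHILDREN[row]), "t1"))
    if col in COL_CHILDREN:                           # split along the column tree
        options.append((sum(best(row, k)[0] for k in COL_CHILDREN[col]), "t2"))
    return min(options)

for region in [("a", 10), ("e", 4), ("d", 10), ("e", 12)]:
    print(region, best(*region))
```

Under these assumptions the sketch reproduces L(a,10)=2 with 't2', L(e,4)=1 with 't1', L(d,10)=2 with 'g', and a total length of 13 for (e,12). Because each region only combines descriptions of its own children along one axis at a time, it cannot mix a row strip with a column strip that share a hole.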


Dynamic Programming: Why is it not optimal?

  • Misses the combination of rows and columns

[Figure: the example grid shown twice, once with the description by Dynamic Programming and once with the optimal description.]


Heuristics: Quadratic Programming

  • Use variables to represent rows/columns; for a variable v:

    • v=1: the corresponding row/column is selected;

    • v=0: the corresponding row/column is not selected;

  • f = – Benefit(D)

    • Maximizing the benefit is equivalent to minimizing the value of f

  • For the previous example, quadratic programming generates the optimal description;

  • In general, optimality is not guaranteed.
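To see why f becomes a quadratic function of the selection variables, consider a simplified setting that is an assumption of this note rather than the formulation in the paper: a flat grid where the only candidate regions are whole rows and whole columns. The symbols r_i, c_j, b_ij and cov_ij are introduced here for illustration. A cell is covered if its row or its column is selected; covered non-blue cells become holes, and uncovered blue cells are listed individually.

```latex
% r_i, c_j \in \{0,1\}: row i / column j selected; b_{ij} = 1 if cell (i,j) is blue, else 0.
\begin{align*}
\mathrm{cov}_{ij} &= r_i + c_j - r_i c_j \\
|D| &= \sum_i r_i + \sum_j c_j
      + \sum_{i,j} b_{ij}\,(1 - \mathrm{cov}_{ij})
      + \sum_{i,j} (1 - b_{ij})\,\mathrm{cov}_{ij} \\
f = -\mathrm{Benefit}(D) &= |D| - |B|
   = \sum_i r_i + \sum_j c_j + \sum_{i,j} (1 - 2 b_{ij})\,(r_i + c_j - r_i c_j)
\end{align*}
```

The product terms r_i c_j, which account for cells covered by both a selected row and a selected column, are what make the objective quadratic rather than linear.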



Experiments

  • We ran a set of experiments on the TPC-H benchmark data set;

  • We compared the three MDLH heuristics with MDL and GMDL.

Experimental Results: Comparison of All Methods

  • Compression Ratio:

    • MDLH-Quadratic generates the most concise descriptions: a yardstick of quality;

    • MDLH-Dynamic is a very close second.


Experimental Results: Compression Ratio

  • The more children per parent node, the greater the benefit


Experimental Results: Running time

  • Running time & Scalability:

    • MDLH-Greedy is the fastest;

    • MDLH-Dynamic runs slower than MDLH-Greedy, but it is still scalable w.r.t. the number of cells;



Extension: Summarization on Holes

[Figure: a 2-D grid with the row labels a to e and the column labels 1 to 11.]

  • As the blue density becomes high, a large part of the MDLH description is made up of holes.

  • Can we further reduce the total length by summarizing ‘Holes’?

  • MDLH description is:

    • (a,11)-{(a,6)+(a,8)+(a,9)}+(d,11)-{(d,6)+(d,7)+(d,8)}+(b,6)+(c,8)

    • Total length is 10.

  • Summarization on holes:

    • (a,6)+(a,8)+(a,9) = (a,10)-(a,7)

    • (d,6)+(d,7)+(d,8) = (d,10)-(d,9)

  • After summarization on holes:

    • (a,11)-{(a,10)-(a,7)}+(d,11)-{(d,10)-(d,9)}+(b,6)+(c,8)

    • Total length is 8.
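The two totals can be verified by counting terms in a nested description. In the sketch below a description is a list of (region, holes) pairs, where the holes are themselves a nested description; this representation and the helper names `cell` and `length` are hypothetical, chosen only to make the count explicit.

```python
def cell(row, col):
    """A term with no holes of its own."""
    return ((row, col), [])

def length(terms):
    """Each term costs 1, plus the length of whatever describes its holes."""
    return sum(1 + length(holes) for _, holes in terms)

# MDLH description with holes listed cell by cell.
before = [
    (("a", 11), [cell("a", 6), cell("a", 8), cell("a", 9)]),
    (("d", 11), [cell("d", 6), cell("d", 7), cell("d", 8)]),
    cell("b", 6),
    cell("c", 8),
]

# The same description after summarizing each group of holes as a region minus a cell.
after = [
    (("a", 11), [(("a", 10), [cell("a", 7)])]),
    (("d", 11), [(("d", 10), [cell("d", 9)])]),
    cell("b", 6),
    cell("c", 8),
]

print(length(before), length(after))
```

Each group of three holes is replaced by two terms, a region and one exception inside it, which is where the saving of 2 comes from.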


Conclusions & Contributions

  • We presented a new method, MDLH, to compress the answers of OLAP queries;

  • We presented a bottom-up algorithm for the 1-D cube;

  • We proved the NP-hardness of the MDLH problem in the 2-D case;

  • We provided three heuristics for MDLH: greedy, dynamic programming, and quadratic programming;

  • We extended MDLH with summarization on holes to further reduce the total length;

  • We ran a set of experiments on the TPC-H benchmark data to compare the heuristics.



Ongoing Work

  • Based on the summarization on blue cells and summarization on holes, build a visualization tool with MDLH summarization:

    • Return summarized answers to user’s queries;

    • Provide drill down operation for users:

      • Browse details on blue cells

      • Browse details on holes

  • Design a k-approximation algorithm for MDLH:

    • What is the best quality we can guarantee?
