Pushing Aggregate Constraints by Divide-and-Approximate

Ke Wang, Yuelong Jiang, Jeffrey Xu Yu, Guozhu Dong and Jiawei Han


No Easy to Push Constraints

  • There exists a gap between the interestingness criterion and the techniques used in mining patterns from a large amount of data.

    • Anti-monotonicity is too loose as a pruning strategy.

    • Anti-monotonicity is too restricted as an interesting criterion.

  • Should we design new algorithms to mine those patterns that can only be found using anti-monotonicity?

    • Mining patterns with “general” constraints


Iceberg-Cube Mining

  • An iceberg-cube mining query:
      select A, B, C, count(*)
      from R
      cube by A, B, C
      having count(*) >= 2

  • count(*) >= 2 is an anti-monotone constraint.
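
A minimal sketch of the query above, assuming a tiny invented relation R (the values are not from the slides). It materializes every non-empty group-by cell with its count, keeps those with count(*) >= 2, and checks the anti-monotone property on the result. The helper names `iceberg_count` and `is_downward_closed` are hypothetical, and the empty group-by is left out for brevity.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy relation R with dimensions A, B, C (values invented for illustration).
R = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a1", "b2", "c1"),
    ("a2", "b1", "c1"),
]
DIMS = ("A", "B", "C")


def iceberg_count(rows, dims, min_count):
    """Return {cell: count} for every non-empty group-by cell with count(*) >= min_count.

    A cell is a frozenset of (dimension, value) pairs.
    """
    counts = Counter()
    for row in rows:
        pairs = list(zip(dims, row))
        for k in range(1, len(dims) + 1):
            for combo in combinations(pairs, k):
                counts[frozenset(combo)] += 1
    return {cell: n for cell, n in counts.items() if n >= min_count}


def is_downward_closed(cells):
    """Anti-monotonicity check: every non-empty sub-cell of a qualifying cell qualifies."""
    return all(
        frozenset(sub) in cells
        for cell in cells
        for k in range(1, len(cell))
        for sub in combinations(cell, k)
    )


cube = iceberg_count(R, DIMS, 2)
print(len(cube), is_downward_closed(cube))
```

Because the count can only shrink when a cell gains a dimension value, a cell that fails the threshold can be discarded together with all of its super-cells.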


Iceberg-Cube Mining

  • Another query:
      select A, B, C, sum(M)
      from R
      cube by A, B, C
      having sum(M) >= 150

  • sum(M) >= 150 is an anti-monotone constraint when all values in M are positive.

  • sum(M) >= 150 is not an anti-monotone constraint when some values in M are negative.

[Figure: tables labeled R1 and R2]
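
A small counterexample may make the second claim concrete. The sketch below assumes a two-tuple relation invented for illustration (it is not the R2 table from the slide): the cell (a1) fails sum(M) >= 150 while its super-cell (a1, b1) satisfies it, which anti-monotonicity forbids. The helper `cell_sum` is hypothetical.

```python
# Hypothetical relation with columns (A, B, M); values invented for illustration.
R2 = [
    ("a1", "b1", 200),
    ("a1", "b2", -100),
]
THRESHOLD = 150


def cell_sum(rows, a=None, b=None):
    """sum(M) over the tuples matching the given values (None means 'any value')."""
    return sum(
        m
        for (ra, rb, m) in rows
        if (a is None or ra == a) and (b is None or rb == b)
    )


sub_cell_ok = cell_sum(R2, a="a1") >= THRESHOLD              # sum(M) = 100
super_cell_ok = cell_sum(R2, a="a1", b="b1") >= THRESHOLD    # sum(M) = 200
print(sub_cell_ok, super_cell_ok)
```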


The Main Idea

  • Study Iceberg-Cube Mining

  • Consider f(v) θ σ

    • f is a function with SQL-like aggregates and arithmetic operators (+, -, *, /); v is a variable; σ is a constant; and θ is either ≤ or ≥.

  • Can we push constraints that are neither anti-monotone nor monotone into iceberg-cube mining? If so, what is a pushing method that is not specific to a particular constraint?

    • Divide-and-Approximate: find a “stronger approximator” for the constraint in a subspace.


Some Definitions

  • A relation with many dimensions Di and one or more measures Mi.

  • A cell is a combination of values di…dk from dimensions Di, …, Dk.

    • Use c as a cell variable

    • Use di…dk for a cell value (representative)

  • SAT(d1…dk) (or SAT(c)) contains all tuples that contain all values in d1…dk (or c).

  • c’ is a super-cell of c, or c is a sub-cell of c’, if c’ contains all the values in c.

  • Let C be a constraint (f(v) θ σ). CUBE(C) denotes the set of cells that satisfy C.

  • A constraint C is weaker than C’ if CUBE(C’) ⊆ CUBE(C)
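
The definitions above can be sketched in a few lines. This is an illustrative reading, assuming a toy relation with two dimensions and one measure (all values invented); cells are represented as dictionaries from dimension to value, and the function names `sat`, `is_super_cell`, and `cube` are hypothetical.

```python
from itertools import combinations

DIMS = ("A", "B")
# Hypothetical tuples: dimension values followed by one measure M.
R = [("a1", "b1", 120), ("a1", "b2", 40), ("a2", "b1", 90)]


def sat(cell):
    """SAT(c): all tuples that contain every value in cell c (a dict dim -> value)."""
    return [t for t in R if all(t[DIMS.index(d)] == v for d, v in cell.items())]


def is_super_cell(c2, c1):
    """c2 is a super-cell of c1 if c2 contains all the values in c1."""
    return all(c2.get(d) == v for d, v in c1.items())


def all_cells():
    """Every non-empty cell that occurs in R."""
    seen = []
    for t in R:
        for k in range(1, len(DIMS) + 1):
            for dims in combinations(DIMS, k):
                cell = {d: t[DIMS.index(d)] for d in dims}
                if cell not in seen:
                    seen.append(cell)
    return seen


def cube(constraint):
    """CUBE(C): the cells whose SAT set satisfies constraint C."""
    return [c for c in all_cells() if constraint(sat(c))]


def total(tuples):
    """sum(M) over a list of tuples."""
    return sum(t[-1] for t in tuples)


strong = cube(lambda ts: total(ts) >= 100)  # plays the role of C'
weak = cube(lambda ts: total(ts) >= 50)     # plays the role of C, weaker than C'
print(all(c in weak for c in strong))
```

Here sum(M) >= 50 is weaker than sum(M) >= 100 because every cell admitted by the second is also admitted by the first.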


An Example

  • Iceberg-cube mining:
      select A, B, C, sum(M)
      from R
      cube by A, B, C
      having sum(M) >= 150

  • sum(c) >= 150 is neither anti-monotone nor monotone.

  • Let the space be S = {ABC, AB, AC, BC, A, B, C}

  • Let sum(c) = psum(c) – nsum(c) >= 150.

    • psum(c) is the profit, and nsum(c) is the cost.

    • Push an anti-monotone approximator

      • Use psum(c) >= 150, and ignore nsum(c).

        • If nsum(c) is large, there are many false positives.

      • Use a min nsum in S: psum(c) – nsum_min(ABC) >= 150.

        • nsum_min(ABC) is the minimum nsum in S.

      • Use a min nsum in a subspace of S (a stronger constraint)
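
To compare the first two approximators in this list, here is a rough sketch on an invented relation (values are assumptions, not from the slides). It computes psum and nsum for every non-empty cell, then counts how many cells pass the exact constraint, the loose approximator psum(c) >= 150, and the tighter approximator psum(c) – nsum_min >= 150. The name `cells` is hypothetical, and nsum is taken as the sum of the absolute values of the negative measures.

```python
from itertools import combinations

DIMS = ("A", "B", "C")
# Hypothetical tuples: (A, B, C, M). Positive M is profit, negative M is cost.
R = [
    ("a1", "b1", "c1", 200),
    ("a1", "b1", "c1", -60),
    ("a1", "b2", "c1", 100),
    ("a1", "b2", "c1", -80),
    ("a2", "b1", "c2", 160),
    ("a2", "b1", "c2", -30),
]
SIGMA = 150


def cells(rows):
    """Aggregate (psum, nsum) for every non-empty group-by cell."""
    agg = {}
    for *values, m in rows:
        pairs = tuple(zip(DIMS, values))
        for k in range(1, len(DIMS) + 1):
            for cell in combinations(pairs, k):
                p, n = agg.get(cell, (0, 0))
                agg[cell] = (p + max(m, 0), n + max(-m, 0))
    return agg


agg = cells(R)
nsum_min = min(n for _, n in agg.values())  # minimum nsum over the whole space S

exact = {c for c, (p, n) in agg.items() if p - n >= SIGMA}
loose = {c for c, (p, n) in agg.items() if p >= SIGMA}
tighter = {c for c, (p, n) in agg.items() if p - nsum_min >= SIGMA}
print(len(exact), len(tighter), len(loose))
```

On this data both approximators keep every true answer, and subtracting the minimum nsum roughly halves the number of false positives. Restricting the minimum to a subspace can only raise it, which is why the third option yields a stronger constraint.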


The Search Strategy (using a lexicographic tree)

  • A node represents a group-by.

  • BUC (BottomUpCube):

    • Partition the database in the depth-first order of the lexicographic tree.

[Figure: lexicographic tree of group-bys over dimensions A, B, C, D, E, from the root (0) down to ABCDE]
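
The depth-first order of the lexicographic tree can be sketched as follows. This only enumerates the group-by nodes and does not partition any data; the dimension order A < B < C < D < E and the function names are assumptions made for illustration.

```python
def children(node, dims):
    """Children of a group-by node: append one dimension that comes later in the order."""
    start = dims.index(node[-1]) + 1 if node else 0
    return [node + d for d in dims[start:]]


def depth_first(node, dims):
    """Yield group-by nodes in depth-first, left-to-right order."""
    yield node
    for child in children(node, dims):
        yield from depth_first(child, dims)


order = list(depth_first("", "ABCDE"))  # "" stands for the root node 0
print(order[:7])
```

The traversal visits A, AB, ABC, ABCD, ABCDE before backtracking to ABCE, which is the order the later pruning slides rely on.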


Another Example

  • Iceberg-cube mining:
      select A, B, C, D, E, sum(M)
      from R
      cube by A, B, C, D, E
      having sum(M) >= 200

  • At node ABCDE, sum(12345) = psum(12345) – nsum(12345) = 200 – 250 = -50 (fails).

  • Backtracking to ABC, psum(123) – nsum_min(12345) = 290 – 100 = 190 (fails).

  • Then, at node ABCE, the cell 1235 must also fail. Therefore, all tuples that match 1235 on ABCE can be pruned.



  • Find a cell p at u0 that fails C, and then extract an anti-monotone approximator Cp.

  • Consider an ancestor uk of u0, where u0 is the left-most leaf in tree(uk).

  • p[u] denotes p projected onto u (a cell of u).

  • tree(uk, p) = {p[u] | u is a node in tree(uk)}.

    • p is the max cell in tree(uk, p) and p[uk] is the min cell.

  • In tree(uk, p):

    • If p[uk] fails Cp, all cells in tree(uk, p) fail.

    • Note: tree(uk, p) ≠ tree(uk, p’) if p’ ≠ p.

  • A node in tree(uk) is a set of group-by attributes.

  • A cell in tree(uk, p) is a set of group-by values.

[Figure: lexicographic tree over dimensions A to E, marking a node uk with its subtree tree(uk) and left-most leaf u0, and a second node uk’ with leaf u0’]
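
The notation above can be made concrete with a short sketch, assuming the dimension order A < B < C < D < E and a cell p with invented values 1 to 5. The helper names (`subtree`, `leftmost_leaf`, `project`, `tree_cells`) are hypothetical.

```python
DIMS = "ABCDE"


def subtree(node):
    """tree(u): the node u and all of its descendants in the lexicographic tree."""
    start = DIMS.index(node[-1]) + 1 if node else 0
    nodes = [node]
    for d in DIMS[start:]:
        nodes.extend(subtree(node + d))
    return nodes


def leftmost_leaf(node):
    """u0: follow the first child until no child is left."""
    start = DIMS.index(node[-1]) + 1 if node else 0
    return leftmost_leaf(node + DIMS[start]) if start < len(DIMS) else node


def project(p, node):
    """p[u]: the cell p restricted to the dimensions of node u."""
    return {d: p[d] for d in node}


def tree_cells(uk, p):
    """tree(uk, p) = {p[u] | u is a node in tree(uk)}."""
    return [project(p, u) for u in subtree(uk)]


p = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}  # hypothetical cell at u0 = ABCDE
scope = tree_cells("ABC", p)
print(subtree("ABC"), leftmost_leaf("ABC"))
```

For uk = ABC the scope holds four cells: p[ABC] is the min cell, and p itself (at ABCDE) is the max cell.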



The Pruning

  • On the backtracking from u0 to uk

    • Check if u0 is on the left-most path in tree(uk)

    • Check if p[uk] can use the same anti-monotone approximator as p[u0]

    • Check if p[uk] fails Cp.

  • If all conditions are met, then

    • For every unexplored child ui of uk, we prune all the tuples that match p on tail(ui), because such tuples generate only cells in tree(uk, p), which fail Cp.

      • tail(u): the set of all dimensions appearing in tree(u).

[Figure: lexicographic tree over dimensions A to E, marking uk, tree(uk), and the left-most leaf u0]
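
A hedged sketch of the backtracking test for a constraint of the form sum >= σ. It covers the first and third checks only; the second check is assumed to hold and is modeled by passing in the same nsum floor that defined Cp at u0. The function names and the parameter `nsum_floor` are hypothetical, and the numbers in the usage line reuse the ones from the earlier example (psum(123) = 290, minimum nsum = 100, threshold 200).

```python
DIMS = "ABCDE"


def leftmost_path(uk):
    """Nodes from uk down to the left-most leaf of tree(uk)."""
    path = [uk]
    start = DIMS.index(uk[-1]) + 1 if uk else 0
    for d in DIMS[start:]:
        path.append(path[-1] + d)
    return path


def may_prune(uk, u0, psum_at_uk, nsum_floor, sigma):
    """Return True when u0 lies on the left-most path of tree(uk) and p[uk] fails Cp.

    Cp is taken to be psum(v) - nsum_floor >= sigma (an assumption for this sketch).
    """
    on_leftmost_path = u0 in leftmost_path(uk)
    fails_cp = psum_at_uk - nsum_floor < sigma
    return on_leftmost_path and fails_cp


print(may_prune("ABC", "ABCDE", psum_at_uk=290, nsum_floor=100, sigma=200))
```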



  • Suppose that a cell p[ABCDE] fails.

  • On the backtracking from ABCDE to ABC,

    • If conditions are met (p[ABC] fails)

      • Prune tuples such that t[ABCE] = p[ABCE]

  • On the backtracking from ABC to AB,

    • If conditions are met (p[AB] fails)

      • Prune tuples such that t[ABDE] = p[ABDE] from tree(ABD)

      • Prune tuples such that t[ABE] = p[ABE] from tree(ABE)

  • Given a leaf node u0 and a cell p at u0:

  • Let uk…u0 be the leftmost path in tree(uk), k >= 0.

  • p is a pruning anchor wrt (uk, u0).

  • tree(uk, p) is the pruning scope.

[Figure: lexicographic tree over dimensions A to E, marking uk, an unexplored child ui, the leaf u0, and a second pair uk’, ui’]
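
The pruning keys in this example follow from tail(u). The sketch below assumes the dimension order A < B < C < D < E, treats the left-most child of uk as the one already explored, and uses an invented cell p; the function names are hypothetical.

```python
DIMS = "ABCDE"


def children(node):
    """Children of a node in the lexicographic tree, left to right."""
    start = DIMS.index(node[-1]) + 1 if node else 0
    return [node + d for d in DIMS[start:]]


def tail(node):
    """tail(u): all dimensions appearing in tree(u)."""
    start = DIMS.index(node[-1]) + 1 if node else 0
    return node + DIMS[start:]


def pruning_keys(uk, p):
    """For each unexplored child ui of uk, the values a tuple must match to be pruned."""
    return {ui: {d: p[d] for d in tail(ui)} for ui in children(uk)[1:]}


def prune(tuples, key):
    """Drop the tuples that match the anchor on every dimension in the key."""
    return [t for t in tuples if any(t[d] != v for d, v in key.items())]


p = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}  # hypothetical pruning anchor at ABCDE
print(pruning_keys("ABC", p))
print(pruning_keys("AB", p))
```

Backtracking to ABC yields a single key on ABCE, and backtracking to AB yields keys on ABDE (for ABD) and ABE (for ABE), matching the bullets above.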


The D&A Algorithm

  • Modify BUC.

  • Push up a pruning anchor p along the leftmost path from u0 to uk.

  • Partition the pruning anchors pushed up to the current node, in addition to partitioning the tuples.


With Min-Support

  • Suppose cell abcd is frequent, but cell abcde is infrequent. (Should stop at abcd.)

  • If cell abcd is anchored at node A, we cannot prune ae, abe, ace, ade in tree(A, abcd).

[Figure: lexicographic tree over dimensions A to E, annotated with Min-sup = 3 and sum(M) >= 100]


Rollback tree

  • RBtree(AD), RBtree(AC), RBtree(ABD), RBtree(D), RBtree(C), and RBtree(B) do not have E.

  • If abcd is anchored at the root, we can prune tuples from RBtree(D), RBtree(C), and RBtree(B).

[Figure: rollback tree over dimensions A to E, annotated with Min-sup = 3 and sum(M) >= 100]


Constraint/Function Monotonicity

  • A constraint C is a-monotone if whenever a cell is not in CUBE(C), neither is any super-cell.

  • A constraint C is m-monotone if whenever a cell is in CUBE(C), so is every super-cell.

  • A function x(y) is a-monotone wrt y if x decreases as y grows (for cell-valued y) or as y increases (for real-valued y).

  • A function x(y) is m-monotone wrt y if x increases as y grows (for cell-valued y) or as y increases (for real-valued y).

  • An example: sum(v) = psum(v) – nsum(v)

    • sum(v) is m-monotone wrt psum(v)

    • sum(v) is a-monotone wrt nsum(v)


Constraint/Function Monotonicity

  • Let ā denote m, and m̄ denote a. Let τ denote either a or m.

    • Example: psum(v) ≥ σ is a-monotone, so psum(v) ≤ σ is m-monotone.

      • If psum(c1) ≥ σ does not hold, then psum(c2) ≥ σ does not hold either, where c2 is a super-cell of c1 (say c1 is a cell of ABC, and c2 is a cell of ABCD).

  • f(v) ≥ σ is τ-monotone if and only if f(v) is τ-monotone wrt v.

  • f(v) ≤ σ is τ-monotone if and only if f(v) is τ̄-monotone wrt v.

  • An example: sum(v) = psum(v) – nsum(v) ≥ σ.

    • sum(v) ≥ σ is m-monotone wrt psum(v), because sum(v) is m-monotone wrt psum(v).

    • sum(v) ≥ σ is a-monotone wrt nsum(v), because sum(v) is a-monotone wrt nsum(v).
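
The constraint-level definitions can be checked by brute force on a small cube. The sketch below assumes an invented two-dimensional relation and tests each property only on that data, so a True result is evidence on this instance rather than a proof. The names `aggregate`, `is_a_monotone`, and `is_m_monotone` are hypothetical.

```python
from itertools import combinations

DIMS = ("A", "B")
# Hypothetical tuples: (A, B, M); values invented for illustration.
R = [("a1", "b1", 80), ("a1", "b2", -30), ("a2", "b1", 50)]
SIGMA = 60


def aggregate(rows):
    """Map every non-empty cell to its (psum, nsum) pair."""
    agg = {}
    for *values, m in rows:
        pairs = tuple(zip(DIMS, values))
        for k in range(1, len(DIMS) + 1):
            for cell in combinations(pairs, k):
                p, n = agg.get(cell, (0, 0))
                agg[cell] = (p + max(m, 0), n + max(-m, 0))
    return agg


AGG = aggregate(R)


def sub_super_pairs():
    """All (sub-cell, super-cell) pairs among the non-empty cells."""
    return [(c1, c2) for c1 in AGG for c2 in AGG if set(c1) < set(c2)]


def is_a_monotone(holds):
    """Whenever a cell fails the constraint, every super-cell fails too."""
    return all(holds(c1) or not holds(c2) for c1, c2 in sub_super_pairs())


def is_m_monotone(holds):
    """Whenever a cell satisfies the constraint, every super-cell does too."""
    return all(not holds(c1) or holds(c2) for c1, c2 in sub_super_pairs())


def psum_ge(c):
    return AGG[c][0] >= SIGMA


def psum_le(c):
    return AGG[c][0] <= SIGMA


def sum_ge(c):
    return AGG[c][0] - AGG[c][1] >= SIGMA


print(is_a_monotone(psum_ge), is_m_monotone(psum_le))
print(is_a_monotone(sum_ge), is_m_monotone(sum_ge))
```

On this data psum(v) ≥ σ behaves as a-monotone and psum(v) ≤ σ as m-monotone, while sum(v) ≥ σ is neither, which is the situation that motivates the approximators.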


Find Approximators

  • Consider f(v) ≥ σ.

  • Divide the terms of f(v) into two groups.

    • A+: As cell v grows (becomes a super-cell), f monotonically increases.

    • A-: As cell v grows (becomes a super-cell), f monotonically decreases.

  • Consider sum(v) = psum(v) – nsum(v) ≥ σ.

    • A+ = {nsum(v)}

    • A- = {psum(v)}

  • f(A+; A-/cmin) ≥ σ and f(A+/cmin; A-) ≤ σ are m-monotone approximators in a subspace Si, where cmin is a min cell instantiation in Si.

  • f(A+/cmax; A-) ≥ σ and f(A+; A-/cmax) ≤ σ are a-monotone approximators in a subspace Si, where cmax is a max cell instantiation in Si.

    • sum(nsum/cmax; psum) ≥ σ
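
As an illustration of the two ≥ σ approximators for sum(v) = psum(v) – nsum(v), the sketch below uses a hand-made subspace of four cells with invented (psum, nsum) values, chosen so that both aggregates shrink as a cell grows. The a-monotone approximator fixes nsum at the max cell, and the m-monotone one fixes psum at the min cell. All names and numbers here are assumptions.

```python
# Hypothetical subspace: cell -> (psum, nsum). "ab" is the min cell, "abcd" the max cell.
SUBSPACE = {
    "ab": (290, 180),
    "abc": (250, 150),
    "abd": (240, 120),
    "abcd": (200, 100),
}
C_MIN, C_MAX = "ab", "abcd"


def exact(sigma):
    """Cells satisfying psum(v) - nsum(v) >= sigma."""
    return {c for c, (p, n) in SUBSPACE.items() if p - n >= sigma}


def a_monotone_approx(sigma):
    """f(A+/cmax; A-) >= sigma, i.e. psum(v) - nsum(cmax) >= sigma."""
    nsum_at_cmax = SUBSPACE[C_MAX][1]
    return {c for c, (p, _) in SUBSPACE.items() if p - nsum_at_cmax >= sigma}


def m_monotone_approx(sigma):
    """f(A+; A-/cmin) >= sigma, i.e. psum(cmin) - nsum(v) >= sigma."""
    psum_at_cmin = SUBSPACE[C_MIN][0]
    return {c for c, (_, n) in SUBSPACE.items() if psum_at_cmin - n >= sigma}


print(sorted(exact(120)), sorted(a_monotone_approx(120)), sorted(m_monotone_approx(120)))
print(a_monotone_approx(200))
```

Both approximators admit every cell that the exact constraint admits. With σ = 200 the a-monotone approximator already fails at the min cell (290 – 100 = 190), so the whole subspace can be discarded without evaluating the exact constraint on each cell.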


Separate Monotonicity

  • Consider function rewriting:

    • (E1 + E2) * E into E1 * E + E2 * E.

  • Consider space division

    • divide a space into subspaces, Si.

  • Find approximators using equation rewriting techniques for a subspace, Si.


Experimental Studies

  • Consider sum(v) = psum(v) – nsum(v)

  • Three algorithms

    • BUC: push only the minimum support.

    • BUC+: push approximators and minimum support.

    • D&A: push approximators and minimum support.



Without minimum support

*) psum(v) >= sigma



Conclusion

  • General aggregate constraints, rather than only well-behaved constraints.

  • SQL-like tuple-based aggregates, rather than item-based aggregates.

  • Constraint independent techniques, rather than constraint specific techniques

  • A new push strategy: divide-and-approximate

