Pushing Aggregate Constraints by Divide-and-Approximate


### Pushing Aggregate Constraints by Divide-and-Approximate

Ke Wang, Yuelong Jiang, Jeffrey Xu Yu, Guozhu Dong and Jiawei Han

Not-Easy-to-Push Constraints
• There exists a gap between the interestingness criterion and the techniques used to mine patterns from large amounts of data.
• Anti-monotonicity is too loose as a pruning strategy.
• Anti-monotonicity is too restrictive as an interestingness criterion.
• Should we design new algorithms to mine those patterns that can only be found using anti-monotonicity?
• Mining patterns with “general” constraints
Iceberg-Cube Mining
• An iceberg-cube mining query: select A, B, C, count(*) from R cube by A, B, C having count(*) >= 2
• count(*) >= 2 is an anti-monotone constraint.
Iceberg-Cube Mining

• Another query: select A, B, C, sum(M) from R cube by A, B, C having sum(M) >= 150
• sum(M) >= 150 is an anti-monotone constraint when all values in M are positive.
• sum(M) >= 150 is not an anti-monotone constraint when some values in M are negative.

[Tables: example relations R1 and R2]

The Main Idea
• Study Iceberg-Cube Mining
• Consider constraints of the form f(v) θ σ.
• f is a function with SQL-like aggregates and arithmetic operators (+, -, *, /); v is a variable; σ is a constant; and θ is either ≤ or ≥.
• Can we push constraints that are neither anti-monotone nor monotone into iceberg-cube mining? If so, what is a pushing method that is not specific to a particular constraint?
• Divide-and-Approximate: find a “stronger approximator” for the constraint in a subspace.
Some Definitions
• A relation has many dimensions Di and one or more measures Mi.
• A cell is a combination of values di…dk from dimensions Di, …, Dk.
• Use c as a cell variable.
• Use di…dk for a cell value (representative).
• SAT(d1…dk) (or SAT(c)) contains all tuples that contain all values in d1…dk (or c).
• c’ is a super-cell of c, or c is a sub-cell of c’, if c’ contains all the values in c.
• Let C be a constraint (f(v) θ σ). CUBE(C) denotes the set of cells that satisfy C.
• A constraint C is weaker than C’ if CUBE(C’) ⊆ CUBE(C).
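
The definitions above can be made concrete with a small sketch. The relation, the helper names (`sat`, `is_super_cell`, `all_cells`, `cube`, `sum_at_least`) and the representation of a cell as a tuple of (dimension, value) pairs are all assumptions made for illustration; they are not taken from the slides.

```python
from itertools import combinations

DIMS = ("A", "B", "C")
# Hypothetical relation R: three dimensions and one measure M.
ROWS = [
    {"A": 1, "B": 1, "C": 1, "M": 120},
    {"A": 1, "B": 1, "C": 2, "M": 60},
    {"A": 1, "B": 2, "C": 1, "M": -40},
    {"A": 2, "B": 1, "C": 1, "M": 90},
]

def sat(cell, rows=ROWS):
    """SAT(c): all tuples that contain every (dimension, value) pair in c."""
    return [r for r in rows if all(r[d] == v for d, v in cell)]

def is_super_cell(big, small):
    """big is a super-cell of small if big contains all the values in small."""
    return set(small) <= set(big)

def all_cells(rows=ROWS, dims=DIMS):
    """Every non-empty cell that occurs in the data, over every group-by."""
    cells = set()
    for r in rows:
        for k in range(1, len(dims) + 1):
            for group in combinations(dims, k):
                cells.add(tuple((d, r[d]) for d in group))
    return cells

def cube(constraint, rows=ROWS):
    """CUBE(C): the set of cells whose SAT set satisfies the constraint."""
    return {c for c in all_cells(rows) if constraint(sat(c, rows))}

def sum_at_least(sigma):
    """Build the constraint sum(M) >= sigma as a predicate over tuples."""
    return lambda tuples: sum(t["M"] for t in tuples) >= sigma

weak = cube(sum_at_least(150))    # C : sum(M) >= 150
strong = cube(sum_at_least(200))  # C': sum(M) >= 200
print(strong <= weak)             # True: C is weaker than C'
```

In this toy data the cell A=1 fails sum(M) >= 150 (its sum is 140) while its super-cell A=1, B=1 satisfies it (180), which is the kind of behavior that makes the constraint hard to push when M has negative values.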
An Example
• Iceberg-cube mining query: select A, B, C, sum(M) from R cube by A, B, C having sum(M) >= 150
• sum(c) >= 150 is neither anti-monotone nor monotone.
• Let the space be S = {ABC, AB, AC, BC, A, B, C}
• Write the constraint as sum(c) = psum(c) – nsum(c) >= 150.
• psum(c) is the profit, and nsum(c) is the cost.
• Push an anti-monotone approximator
• Use psum(c) >= 150, and ignore nsum(c).
• If nsum(c) is large, there are many false positives.
• Use a min nsum in S: psum(c) – nsum_min(ABC) >= 150.
• nsum_min(ABC) is the minimum nsum in S.
• Use a min nsum in a subspace of S (a stronger constraint)
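
A minimal sketch of the two approximators in this example, on a made-up relation. The data, the threshold variable `SIGMA`, and all helper names are assumptions for illustration; only the constraints psum(c) >= 150 and psum(c) – nsum_min >= 150 come from the slide.

```python
from itertools import combinations

DIMS = ("A", "B", "C")
# Hypothetical relation: positive M counts as profit, negative M as cost.
ROWS = [
    {"A": 1, "B": 1, "C": 1, "M": 200},
    {"A": 1, "B": 1, "C": 1, "M": -30},
    {"A": 1, "B": 1, "C": 2, "M": 160},
    {"A": 1, "B": 1, "C": 2, "M": -20},
    {"A": 2, "B": 1, "C": 1, "M": 170},
    {"A": 2, "B": 1, "C": 1, "M": -60},
]
SIGMA = 150

def sat(cell):
    """Tuples matching every (dimension, value) pair of the cell."""
    return [r for r in ROWS if all(r[d] == v for d, v in cell)]

def psum(tuples):
    """Sum of the positive measure values."""
    return sum(t["M"] for t in tuples if t["M"] > 0)

def nsum(tuples):
    """Sum of the absolute values of the negative measure values."""
    return sum(-t["M"] for t in tuples if t["M"] < 0)

def all_cells():
    """Every non-empty cell occurring in the data, over every group-by in S."""
    cells = set()
    for r in ROWS:
        for k in range(1, len(DIMS) + 1):
            for group in combinations(DIMS, k):
                cells.add(tuple((d, r[d]) for d in group))
    return cells

CELLS = all_cells()
NSUM_MIN = min(nsum(sat(c)) for c in CELLS)  # minimum nsum over the space S

exact = {c for c in CELLS if psum(sat(c)) - nsum(sat(c)) >= SIGMA}
loose = {c for c in CELLS if psum(sat(c)) >= SIGMA}               # ignores nsum
tighter = {c for c in CELLS if psum(sat(c)) - NSUM_MIN >= SIGMA}  # uses min nsum

print(len(loose - exact), len(tighter - exact))  # false positives: 8 and 4
```

Both approximators keep every cell that satisfies the original constraint, so no answer is lost; using the minimum nsum simply admits fewer false positives than ignoring nsum altogether.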
The Search Strategy (using a lexicographic tree)
• A node represents a group-by
• BUC (BottomUpCube):
• Partition the database in the depth-first order of the lexicographic tree.

[Figure: lexicographic tree of the group-bys over dimensions A, B, C, D, E, from the empty group-by 0 at the root to ABCDE]
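
The depth-first order of the lexicographic tree can be sketched as follows. This only enumerates the group-by nodes in the order a BUC-style traversal would visit them; it does not partition any data, and the function names are assumptions for illustration.

```python
from typing import Iterator, List, Sequence, Tuple

Node = Tuple[str, ...]

def children(node: Node, dims: Sequence[str]) -> List[Node]:
    """Children of a node: extend it with each dimension after its last one."""
    start = dims.index(node[-1]) + 1 if node else 0
    return [node + (d,) for d in dims[start:]]

def depth_first(node: Node, dims: Sequence[str]) -> Iterator[Node]:
    """Visit a node, then each of its subtrees from left to right."""
    yield node
    for child in children(node, dims):
        yield from depth_first(child, dims)

DIMS = ["A", "B", "C", "D", "E"]
# The empty group-by at the root is printed as "0", as in the figure.
ORDER = ["".join(n) or "0" for n in depth_first((), DIMS)]
print(ORDER[:8])  # ['0', 'A', 'AB', 'ABC', 'ABCD', 'ABCDE', 'ABCE', 'ABD']
```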

Another Example
• Iceberg-cube mining query: select A, B, C, D, E, sum(M) from R cube by A, B, C, D, E having sum(M) >= 200
• At node ABCDE, sum(12345) = psum(12345) – nsum(12345) = 200 – 250 = -50 (fails).
• Backtracking to ABC, psum(123) – nsum_min(12345) = 290 – 100 = 190 (fails).
• Then, at node ABCE, the cell p[ABCE] = 1235 must fail. Therefore, all tuples t with t[ABCE] = 1235 can be pruned.

[Figure: lexicographic tree over A to E, marking a node uk with its subtree tree(uk) and left-most leaf u0, and a second node uk’ with left-most leaf u0’]

• Find a cell p at u0 that fails C, and then extract an anti-monotone approximator Cp.
• Consider an ancestor uk of u0, where u0 is the left-most leaf in tree(uk).
• p[u] denotes p projected onto u (a cell of u).
• tree(uk, p) = {p[u] | u is a node in tree(uk)}.
• p is the max cell in tree(uk, p) and p[uk] is the min cell.
• In tree(uk, p): if p[uk] fails Cp, all cells in tree(uk, p) fail.
• Note: tree(uk, p) ≠ tree(uk, p’) if p’ ≠ p.
• A node in tree(uk) is a set of group-by attributes.
• A cell in tree(uk, p) is a set of group-by values.
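
A small sketch of tree(uk), the left-most leaf, and tree(uk, p), assuming group-by nodes are written as strings such as "ABC" and a cell is a dict from dimension to value. The function names and the anchor cell `p` are illustrative assumptions.

```python
DIMS = ["A", "B", "C", "D", "E"]

def subtree(node, dims=DIMS):
    """All group-by nodes in tree(node), in depth-first order."""
    start = dims.index(node[-1]) + 1 if node else 0
    nodes = [node]
    for d in dims[start:]:
        nodes.extend(subtree(node + d, dims))
    return nodes

def leftmost_leaf(node, dims=DIMS):
    """Follow the first child repeatedly: append every remaining dimension."""
    start = dims.index(node[-1]) + 1 if node else 0
    return node + "".join(dims[start:])

def project(p, u):
    """p[u]: the cell p projected onto the dimensions of node u."""
    return {d: p[d] for d in u}

def tree_cells(uk, p, dims=DIMS):
    """tree(uk, p) = {p[u] | u is a node in tree(uk)}."""
    return [project(p, u) for u in subtree(uk, dims)]

p = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}  # a cell at the leaf ABCDE
print(subtree("ABC"))        # ['ABC', 'ABCD', 'ABCDE', 'ABCE']
print(leftmost_leaf("ABC"))  # ABCDE
for cell in tree_cells("ABC", p):
    print(cell)
```

Here p[ABC] = 123 is the min cell and p = 12345 is the max cell of tree(ABC, p), so an anti-monotone approximator that fails on 123 also fails on 1234, 12345 and 1235.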

The Pruning

[Figure: lexicographic tree over A to E, marking node uk, its subtree tree(uk), and the left-most leaf u0]

• On the backtracking from u0 to uk
• Check if u0 is on the left-most path in tree(uk)
• Check if p[uk] can use the same anti-monotone approximator as p[u0]
• Check if p[uk] fails Cp.
• If all conditions are met, then
• For every unexplored child ui of uk, we prune all the tuples that match p on tail(ui), because such tuples generate only cells in tree(uk, p), which fail Cp.
• tail(u): the set of all dimensions appearing in tree(u).

[Figure: lexicographic tree over A to E, marking nodes uk, ui and u0, and a second group uk’, ui’ and u0’]

• Suppose that a cell p at ABCDE fails.
• On the backtracking from ABCDE to ABC,
• If conditions are met (p[ABC] fails)
• Prune tuples such that t[ABCE] = p[ABCE]
• On the backtracking from ABC to AB,
• If conditions are met (p[AB] fails)
• Prune tuples such that t[ABDE] = p[ABDE] from tree(ABD)
• Prune tuples such that t[ABE] = p[ABE] from tree(ABE)

• Given a leaf node u0 and a cell p at u0.
• Let uk…u0 be the left-most path in tree(uk), k >= 0.
• p is a pruning anchor wrt (uk, u0).
• tree(uk, p) is the pruning scope.
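
The pruning step can be sketched as below, assuming the three conditions have already been checked and that only the left-most child of uk has been explored, as is the case when backtracking along the left-most path. The helper names (`children`, `tail`, `prune_patterns`, `prune`) and the sample tuples are assumptions for illustration.

```python
DIMS = ["A", "B", "C", "D", "E"]

def children(node, dims=DIMS):
    """Children of a group-by node in the lexicographic tree."""
    start = dims.index(node[-1]) + 1 if node else 0
    return [node + d for d in dims[start:]]

def tail(node, dims=DIMS):
    """tail(u): all dimensions appearing in tree(u)."""
    start = dims.index(node[-1]) + 1 if node else 0
    return node + "".join(dims[start:])

def prune_patterns(uk, p, dims=DIMS):
    """For each unexplored child ui of uk, the values of p on tail(ui)."""
    unexplored = children(uk, dims)[1:]  # assumption: left-most child was just explored
    return {ui: {d: p[d] for d in tail(ui, dims)} for ui in unexplored}

def prune(tuples, pattern):
    """Keep only tuples that differ from the anchor on some dimension of the pattern."""
    return [t for t in tuples if any(t[d] != v for d, v in pattern.items())]

p = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}  # pruning anchor at ABCDE
print(prune_patterns("ABC", p))  # ABCE: match p on A, B, C, E
print(prune_patterns("AB", p))   # ABD: match p on A, B, D, E; ABE: match p on A, B, E

tuples = [
    {"A": 1, "B": 2, "C": 3, "D": 9, "E": 5, "M": 10},  # matches p on ABCE
    {"A": 1, "B": 2, "C": 3, "D": 9, "E": 6, "M": 10},  # differs on E
]
kept = prune(tuples, prune_patterns("ABC", p)["ABCE"])
print(len(kept))  # 1
```

The patterns reproduce the example above: backtracking to ABC prunes tuples with t[ABCE] = p[ABCE], and backtracking to AB prunes tuples with t[ABDE] = p[ABDE] for tree(ABD) and t[ABE] = p[ABE] for tree(ABE).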
The D&A Algorithm
• Modify BUC.
• Push up a pruning anchor p along the leftmost path from u0 to uk.
• Partition the pruning anchors pushed up to the current node, in addition to partitioning the tuples.
With Min-Support

[Figure: lexicographic tree over A to E, annotated with Min-sup = 3 and sum(M) >= 100]

• Suppose cell abcd is frequent, but cell abcde is infrequent. (Should stop at abcd)
• If cell abcd is anchored at node A, we cannot prune ae, abe, ace, ade in tree(A, abcd).

Rollback tree

[Figure: rollback tree over A to E, annotated with Min-sup = 3 and sum(M) >= 100]

• RBtree(AD), RBtree(AC), RBtree(ABD), RBtree(D), RBtree(C), and RBtree(B) do not have E.
• If abcd is anchored at the root, we can prune tuples from RBtree(D), RBtree(C), and RBtree(B).

Constraint/Function Monotonicity
• A constraint C is a-monotone if whenever a cell is not in CUBE(C), neither is any super-cell.
• A constraint C is m-monotone if whenever a cell is in CUBE(C), so is every super-cell.
• A function x(y) is a-monotone wrt y if x decreases as y grows (for cell-valued y) or as y increases (for real-valued y).
• A function x(y) is m-monotone wrt y if x increases as y grows (for cell-valued y) or as y increases (for real-valued y).
• An example: sum(v) = psum(v) – nsum(v)
• sum(v) is m-monotone wrt psum(v)
• sum(v) is a-monotone wrt nsum(v)
Constraint/Function Monotonicity
• Let ā denote m, and m̄ denote a. Let τ denote either a or m.
• Example: if psum(v) ≥ σ is a-monotone, then psum(v) ≤ σ is m-monotone.
• If psum(c1) ≥ σ does not hold, then psum(c2) ≥ σ does not hold either, where c2 is a super-cell of c1 (say, c1 is a cell of ABC and c2 is a cell of ABCD).
• f(v) ≥ σ is τ-monotone if and only if f(v) is τ-monotone wrt v.
• f(v) ≤ σ is τ-monotone if and only if f(v) is τ̄-monotone wrt v.
• An example: sum(v) = psum(v) – nsum(v) ≥ σ.
• sum(v) ≥ σ is m-monotone wrt psum(v), because sum(v) is m-monotone wrt psum(v).
• sum(v) ≥ σ is a-monotone wrt nsum(v), because sum(v) is a-monotone wrt nsum(v).
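
The monotonicity of a function with respect to the cell variable can be checked by brute force on a small relation. The sketch below uses a made-up relation and treats "decreases" and "increases" non-strictly; the data and all helper names are assumptions for illustration.

```python
from itertools import combinations

DIMS = ("A", "B", "C")
# Hypothetical relation with one negative measure value.
ROWS = [
    {"A": 1, "B": 1, "C": 1, "M": 120},
    {"A": 1, "B": 1, "C": 2, "M": 60},
    {"A": 1, "B": 2, "C": 1, "M": -40},
    {"A": 2, "B": 1, "C": 1, "M": 90},
]

def sat(cell):
    """Tuples matching every (dimension, value) pair of the cell."""
    return [r for r in ROWS if all(r[d] == v for d, v in cell)]

def all_cells():
    """Every non-empty cell occurring in the data, over every group-by."""
    cells = set()
    for r in ROWS:
        for k in range(1, len(DIMS) + 1):
            for group in combinations(DIMS, k):
                cells.add(tuple((d, r[d]) for d in group))
    return cells

def psum(tuples):
    return sum(t["M"] for t in tuples if t["M"] > 0)

def nsum(tuples):
    return sum(-t["M"] for t in tuples if t["M"] < 0)

def total(tuples):
    return psum(tuples) - nsum(tuples)

def cell_pairs():
    """All (cell, strict super-cell) pairs."""
    cells = all_cells()
    return [(c, s) for c in cells for s in cells if set(c) < set(s)]

def is_a_monotone(f):
    """f never increases when a cell grows into a super-cell."""
    return all(f(sat(s)) <= f(sat(c)) for c, s in cell_pairs())

def is_m_monotone(f):
    """f never decreases when a cell grows into a super-cell."""
    return all(f(sat(s)) >= f(sat(c)) for c, s in cell_pairs())

print(is_a_monotone(psum), is_a_monotone(nsum))    # True True
print(is_a_monotone(total), is_m_monotone(total))  # False False
```

On this data psum and nsum are each a-monotone wrt the cell, while their difference is neither a-monotone nor m-monotone, which is why the two aggregates are treated separately.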
Find Approximators
• Consider f(v) ≥ σ.
• Divide the aggregates in f(v) into two groups.
• A+: as cell v grows (becomes a super-cell), f monotonically increases.
• A-: as cell v grows (becomes a super-cell), f monotonically decreases.
• Consider sum(v) = psum(v) – nsum(v) ≥ σ.
• A+ = {nsum(v)}
• A- = {psum(v)}
• f(A+; A-/cmin) ≥ σ and f(A+/cmin; A-) ≤ σ are m-monotone approximators in a subspace Si, where cmin is a min cell instantiation in Si.
• f(A+/cmax; A-) ≥ σ and f(A+; A-/cmax) ≤ σ are a-monotone approximators in a subspace Si, where cmax is a max cell instantiation in Si.
• Example: sum(nsum/cmax; psum) ≥ σ, that is, psum(v) – nsum(cmax) ≥ σ.
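
A short derivation of why the last example is a valid a-monotone approximator, under the assumption that cmax is a super-cell of every cell v in the subspace Si (as p is for tree(uk, p)):

```latex
\begin{align*}
v \in S_i \;&\Rightarrow\; \mathrm{nsum}(v) \ge \mathrm{nsum}(c_{\max})
  && \text{(nsum decreases as the cell grows)} \\
\mathrm{sum}(v) &= \mathrm{psum}(v) - \mathrm{nsum}(v)
  \;\le\; \mathrm{psum}(v) - \mathrm{nsum}(c_{\max}) \\
\mathrm{sum}(v) \ge \sigma \;&\Rightarrow\; \mathrm{psum}(v) - \mathrm{nsum}(c_{\max}) \ge \sigma
\end{align*}
```

So every cell in Si that satisfies the original constraint also satisfies the approximator, and the approximator is a-monotone because psum(v) decreases as v grows while nsum(cmax) is a constant.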
Separate Monotonicity
• Consider function rewriting:
• (E1 + E2) * E into E1 * E + E2 * E.
• Consider space division
• divide a space into subspaces, Si.
• Find approximators using equation rewriting techniques for a subspace, Si.
Experimental Studies
• Consider sum(v) = psum(v) – nsum(v)
• Three algorithms
• BUC: push only the minimum support.
• BUC+: push approximators and minimum support.
• D&A: push approximators and minimum support.
Without minimum support

*) psum(v) >= sigma

Conclusion
• General aggregate constraints, rather than only well-behaved constraints.
• SQL-like tuple-based aggregates, rather than item-based aggregates.
• Constraint independent techniques, rather than constraint specific techniques
• A new push strategy: divide-and-approximate