Organization: “Association Analysis”
- What is Association Analysis?
- Association Rules
- The APRIORI Algorithm
- Interestingness Measures
- Sequence Mining
- Coping with Continuous Attributes in Association Analysis
- Mining for Other Associations (short)
Goal: Find Interesting Relationships between Sets of Variables (Descriptive Data Mining)
Relationships can take several forms (e.g., association rules, colocation patterns, sequences).
What associations are interesting is determined by measures of interestingness.
[Figure: spatial dataset divided into regions R1–R4, illustrating regional colocation patterns]
Task: Find Colocation patterns for the following dataset.
Global colocation: two item types are colocated in the whole dataset.
Market-basket transactions
Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Brute-force enumeration of all possible rules is computationally prohibitive!
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
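These (s, c) values can be reproduced in a few lines, assuming the classic five-transaction market-basket dataset these slides appear to use (items Bread, Milk, Diaper, Beer, Eggs, Coke):

```python
# Five-basket dataset (assumed; it matches the s and c values quoted above).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs."""
    s = support_count(lhs | rhs) / len(transactions)
    c = support_count(lhs | rhs) / support_count(lhs)
    return round(s, 2), round(c, 2)

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}))   # (0.4, 0.67)
print(rule_metrics({"Milk", "Beer"}, {"Diaper"}))   # (0.4, 1.0)
print(rule_metrics({"Milk"}, {"Diaper", "Beer"}))   # (0.4, 0.5)
```

Note that all six rules above come from the same frequent itemset {Milk, Diaper, Beer}, so they share the same support but differ in confidence.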
Given d items, there are 2^d possible candidate itemsets.
The total number of possible rules is R = 3^d − 2^(d+1) + 1; for d=6, R = 602 rules.
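The rule count follows because each item can go into the antecedent, the consequent, or neither (3^d assignments), minus the assignments where either side is empty. A quick check:

```python
def rule_count(d):
    """Number of possible association rules over d items:
    3^d item assignments, minus 2^d with an empty antecedent,
    minus 2^d with an empty consequent, plus 1 (both empty, subtracted twice)."""
    return 3**d - 2**(d + 1) + 1

print(rule_count(6))  # 602, matching the slide
```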
[Lattice figure: an itemset found to be infrequent; all of its supersets are pruned]
What do we learn from that? If we create (k+1)-itemsets from k-itemsets, we avoid generating candidates that cannot be frequent, due to the Apriori principle.
Minimum support count = 3
Items (1-itemsets)
Pairs (2-itemsets): no need to generate candidates involving Coke or Eggs
Triplets (3-itemsets)
If every subset is considered:
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates
With support-based pruning:
6 + 6 + 1 = 13 candidates
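Both counts can be verified on the five-basket example (dataset assumed, as above) with a minimum support count of 3: six 1-itemset candidates, six pair candidates built from the four frequent items, and a single triplet candidate from the prefix join.

```python
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup = 3

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# Level 1: every distinct item is a candidate.
items = sorted({i for t in transactions for i in t})
c1 = [(i,) for i in items]
f1 = [c for c in c1 if support(c) >= minsup]

# Level 2: pairs built from frequent items only.
c2 = [tuple(sorted(p)) for p in combinations([i for (i,) in f1], 2)]
f2 = [c for c in c2 if support(c) >= minsup]

# Level 3: join frequent pairs sharing a common 1-item prefix.
c3 = sorted({tuple(sorted(set(a) | set(b)))
             for a in f2 for b in f2 if a < b and a[:-1] == b[:-1]})

print(len(c1), len(c2), len(c3))             # candidates per level: 6 6 1
print(comb(6, 1) + comb(6, 2) + comb(6, 3))  # brute-force count: 41
```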
For rule A → C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
Min. support 50%
Min. confidence 50%
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
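The loop, the join step, and the prune step above translate into a compact Python sketch (function and variable names are mine, not from the slides):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining: L1 -> C2 -> L2 -> C3 -> ..."""
    transactions = [frozenset(t) for t in transactions]
    count = lambda c: sum(1 for t in transactions if c <= t)

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) >= minsup}
    frequent, k = {}, 1
    while Lk:
        frequent.update({c: count(c) for c in Lk})
        # Join step: merge two k-itemsets sharing the same (k-1)-prefix.
        sorted_sets = sorted(tuple(sorted(s)) for s in Lk)
        Ck1 = set()
        for a, b in combinations(sorted_sets, 2):
            if a[:-1] == b[:-1]:
                cand = frozenset(a) | frozenset(b)
                # Prune step: keep cand only if every k-subset is frequent.
                if all(frozenset(s) in Lk for s in combinations(cand, k)):
                    Ck1.add(cand)
        Lk = {c for c in Ck1 if count(c) >= minsup}
        k += 1
    return frequent
```

On the five-basket example with a minimum support count of 3 this returns the four frequent items and four frequent pairs; only one triplet candidate is generated, and it turns out to be infrequent.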
An itemset is maximal frequent if none of its immediate supersets is frequent
Maximal Itemsets
Maximal itemsets can be used as a compact summary of all frequent itemsets, but they lack support information.
Infrequent Itemsets
Border
Closed but not maximal
Minimum support = 2
Closed and maximal
# Closed = 9
# Maximal = 4
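Closed and maximal itemsets can be identified directly from their definitions. A brute-force sketch, run on the earlier five-basket dataset (the slide's own lattice example, with 9 closed and 4 maximal itemsets, is not reproduced in the text, so the counts below differ):

```python
from itertools import combinations

transactions = [frozenset(t) for t in [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]]
minsup = 2
support = lambda s: sum(1 for t in transactions if s <= t)

# Enumerate all frequent itemsets by brute force (fine for 6 items).
items = sorted({i for t in transactions for i in t})
frequent = {frozenset(c): support(frozenset(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= minsup}

# Closed: no proper superset has the same support.
closed = {s for s in frequent
          if not any(s < t and frequent[t] == frequent[s] for t in frequent)}
# Maximal: no proper superset is frequent at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

print(len(frequent), len(closed), len(maximal))  # 17 11 4
```

Every maximal frequent itemset is closed, but not vice versa, which the counts above illustrate.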
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y
Computing Interestingness Measures: contingency table for X → Y
Used to define various measures
Association Rule: Tea → Coffee
Problem: confidence can be large for independent and even negatively correlated patterns, if the right-hand-side pattern has high support.
Lift(X → Y) = P(X,Y) / (P(X)·P(Y)): a probability multiplier, i.e., a kind of measure of independence.
The difference form P(X,Y) − P(X)·P(Y) is similar to a covariance.
The φ-coefficient is the analogue of correlation for binary variables.
Variables with high prior probabilities cannot have a high lift; lift is biased towards variables with low prior probability.
Statistical independence: if P(X,Y) = P(X)·P(Y), then Lift = 1.
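The Tea → Coffee problem can be made concrete with illustrative contingency counts (the numbers below are assumptions for the sketch, not taken from the slides):

```python
# Illustrative counts for Tea -> Coffee:
#             Coffee   no Coffee   total
# Tea            150          50     200
# no Tea         650         150     800
n = 1000
p_tea, p_coffee, p_both = 200 / n, 800 / n, 150 / n

confidence = p_both / p_tea           # 0.75: looks like a strong rule...
lift = p_both / (p_tea * p_coffee)    # 0.9375 < 1: Tea and Coffee are
                                      # actually negatively correlated
print(confidence, lift)
```

Confidence is high only because P(Coffee) = 0.8 is high; lift corrects for the prior and exposes the negative correlation.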
There are lots of measures proposed in the literature
Some measures are good for certain applications, but not for others
What criteria should we use to determine whether a measure is good or bad?
What about Apriori-style support-based pruning? How does it affect these measures?
The φ-coefficient is the same for both tables.
10 examples of contingency tables:
Rankings of contingency tables using various measures:
Does M(A,B) = M(B,A)?
Symmetric measures (M(A,B) = M(B,A)): e.g., lift/interest, cosine, Jaccard
Asymmetric measures (M(A,B) ≠ M(B,A) in general): e.g., confidence, conviction
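The distinction is easy to check numerically; with hypothetical joint probabilities for two items A and B (numbers chosen for illustration):

```python
# Hypothetical probabilities: P(A), P(B), P(A and B).
pA, pB, pAB = 0.4, 0.25, 0.2

lift = lambda px, py: pAB / (px * py)   # symmetric in its two arguments
conf = lambda px: pAB / px              # depends on the antecedent only

print(lift(pA, pB) == lift(pB, pA))     # True: lift(A,B) = lift(B,A)
print(conf(pA), conf(pB))               # 0.5 vs 0.8: conf(A->B) != conf(B->A)
```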
Notation: + = pattern expected to be frequent; − = pattern expected to be infrequent.
Expected patterns: expected frequent and found frequent (+, +), or expected infrequent and found infrequent (−, −).
Unexpected patterns: expected infrequent but found frequent (−, +), or expected frequent but found infrequent (+, −).