
## Associations and Frequent Item Analysis


**Outline**

- Transactions
- Frequent itemsets
- Subset property
- Association rules
- Applications

**Transaction database: Example**

- Instances = transactions
- Items: A = milk, B = bread, C = cereal, D = sugar, E = eggs
- Attributes are converted to binary flags, one per item

**Definitions**

- Item: an attribute=value pair, or simply a value
  - Attributes are usually converted to binary flags for each value, e.g. product = "A" is written simply as "A"
- Itemset I: a subset of the possible items
  - Example: I = {A, B, E} (order unimportant)
- Transaction: (TID, itemset), where TID is the transaction ID

**Support and Frequent Itemsets**

- Support of an itemset: sup(I) = number of transactions t that support (i.e. contain) I
- In the example database: sup({A,B,E}) = 2, sup({B,C}) = 4
- A frequent itemset I is one with at least the minimum support count: sup(I) >= minsup

**Subset Property**

- Every subset of a frequent itemset is frequent!
- Q: Why is this so?
- A: Suppose {A,B} is frequent. Since every occurrence of {A,B} contains both A and B, each of A and B must occur at least as often, so both are also frequent.
- The same argument applies to larger itemsets.
- Almost all association rule algorithms are based on this subset property.

**Association Rules**

- Association rule R: Itemset1 => Itemset2
  - Itemset1 and Itemset2 are disjoint, and Itemset2 is non-empty
  - Meaning: if a transaction includes Itemset1, then it also includes Itemset2
- Examples: A,B => E,C and A => B,C

**From Frequent Itemsets to Association Rules**

- Q: Given the frequent set {A,B,E}, what are the possible association rules?
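All of them can be enumerated mechanically: every proper subset of the itemset (including the empty set) can serve as the antecedent, with the remaining items as the consequent. A minimal Python sketch (the helper name `candidate_rules` is mine, not from the slides):

```python
from itertools import combinations

# Enumerate every candidate rule I1 => I2 from a frequent itemset:
# each proper subset (including the empty set) serves as the
# antecedent; the remaining items form the (non-empty) consequent.
def candidate_rules(itemset):
    items = sorted(itemset)
    rules = []
    for r in range(len(items)):               # antecedent sizes 0 .. k-1
        for lhs in combinations(items, r):
            rhs = tuple(x for x in items if x not in lhs)
            rules.append((lhs, rhs))
    return rules

for lhs, rhs in candidate_rules({"A", "B", "E"}):
    print(f"{','.join(lhs) or 'true'} => {','.join(rhs)}")
```

A k-item set yields 2^k - 1 candidate rules (7 for k = 3), which is why support and confidence filtering matter once itemsets grow.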
- A => B,E
- A,B => E
- A,E => B
- B => A,E
- B,E => A
- E => A,B
- __ => A,B,E (the empty rule), i.e. true => A,B,E

**Classification vs Association Rules**

| Classification rules | Association rules |
| --- | --- |
| Focus on one target field | Many target fields |
| Specify class in all cases | Applicable in some cases |
| Measures: accuracy | Measures: support, confidence, lift |

**Rule Support and Confidence**

- Suppose R: I => J is an association rule
- sup(R) = sup(I ∪ J) is the support count
  - the support of the itemset I ∪ J (all items of I and J together)
- conf(R) = sup(I ∪ J) / sup(I) is the confidence of R
  - the fraction of transactions containing I that also contain J
- Association rules with minimum support and minimum confidence are sometimes called "strong" rules

**Association Rules Example**

- Q: Given the frequent set {A,B,E}, which association rules have minsup = 2 and minconf = 50%?
- Qualify:
  - A,B => E : conf = 2/4 = 50%
  - A,E => B : conf = 2/2 = 100%
  - B,E => A : conf = 2/2 = 100%
  - E => A,B : conf = 2/2 = 100%
- Don't qualify:
  - A => B,E : conf = 2/6 = 33% < 50%
  - B => A,E : conf = 2/7 ≈ 29% < 50%
  - __ => A,B,E : conf = 2/9 ≈ 22% < 50%

**Find Strong Association Rules**

- A rule has the parameters minsup and minconf:
  - sup(R) >= minsup and conf(R) >= minconf
- Problem: find all association rules with the given minsup and minconf
- First, find all frequent itemsets

**Finding Frequent Itemsets**

- Start by finding one-item sets (easy)
- Q: How?
- A: Simply count the frequencies of all items

**Finding Itemsets: Next Level**

- Apriori algorithm (Agrawal & Srikant)
- Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
- If (A B) is a frequent itemset, then (A) and (B) must be frequent itemsets as well
- In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
- Compute candidate k-item sets by merging (k-1)-item sets

**An Example**

- Given: five three-item sets (A B C), (A B D), (A C D), (A C E), (B C D)
- Lexicographic order improves efficiency
- Candidate four-item set: (A B C D). Q: OK?
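The question can be answered mechanically with the subset-property check (Apriori pruning): a candidate k-itemset survives only if all of its (k-1)-item subsets are frequent. A sketch, assuming only the five three-item sets above (the helper name is mine):

```python
from itertools import combinations

# The five frequent three-item sets given above.
frequent3 = {frozenset(s) for s in [("A", "B", "C"), ("A", "B", "D"),
                                    ("A", "C", "D"), ("A", "C", "E"),
                                    ("B", "C", "D")]}

# Apriori pruning: a candidate k-itemset is kept only if every one of
# its (k-1)-item subsets is itself frequent (the subset property).
def all_subsets_frequent(candidate, frequent_smaller):
    k = len(candidate)
    return all(frozenset(sub) in frequent_smaller
               for sub in combinations(sorted(candidate), k - 1))

print(all_subsets_frequent({"A", "B", "C", "D"}, frequent3))  # True
print(all_subsets_frequent({"A", "C", "D", "E"}, frequent3))  # False: (C D E) is not frequent
```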
- A: Yes, because all of its 3-item subsets are frequent
- Candidate: (A C D E). Q: OK?
- A: No, because (C D E) is not frequent

**Generating Association Rules**

- Two-stage process:
  - Determine the frequent itemsets, e.g. with the Apriori algorithm
  - For each frequent itemset I, and for each non-empty subset J of I, consider the association rule I − J => J
- The main idea used in both stages: the subset property

**Example: Generating Rules from an Itemset**

- Frequent itemset from the golf data
- Seven potential rules

**Rules for the Weather Data**

- Rules with support > 1 and confidence = 100%
- In total: 3 rules with support four, 5 with support three, and 50 with support two

**Weka Associations**

- File: weather.nominal.arff
- MinSupport: 0.2

**Filtering Association Rules**

- Problem: any large dataset can lead to a very large number of association rules, even with reasonable minimum confidence and support
- Confidence by itself is not sufficient
  - e.g. if all transactions include Z, then any rule I => Z will have confidence 100%
- Other measures are used to filter rules

**Association Rule Lift**

- The lift of an association rule I => J is defined as: lift = P(J | I) / P(J)
  - Note: for any itemset X, P(X) = (support of X) / (no. of transactions)
- Lift is the ratio of confidence to expected confidence
- Interpretation:
  - lift > 1: I and J are positively correlated
  - lift < 1: I and J are negatively correlated
  - lift = 1: I and J are independent

**Other Issues**

- The ARFF format is very inefficient for typical market basket data
  - Attributes represent items in a basket, and most items are usually missing
- Interestingness of associations
  - Find unusual associations: milk usually goes with bread, but soy milk does not

**Beyond Binary Data**

- Hierarchies: drink → milk → low-fat milk → Stop&Shop low-fat milk → ...
  - Find associations on any level
- Sequences over time
- ...

**Sampling**

- Large databases: sample the database and apply Apriori to the sample
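This first step, finding the large itemsets of the sample, can be sketched as follows. The brute-force `frequent_itemsets` routine below is a stand-in for Apriori (fine for tiny examples), and the toy transaction list is my own illustration:

```python
import random
from itertools import combinations

# Brute-force stand-in for Apriori: return every itemset whose
# support count in `transactions` is at least `minsup`.
def frequent_itemsets(transactions, minsup):
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(1 for t in transactions if set(cand) <= t)
            if sup >= minsup:
                result[frozenset(cand)] = sup
    return result

db = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
      {"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]

random.seed(0)
Ds = random.sample(db, 2)             # Ds: sample of the database
PL = frequent_itemsets(Ds, minsup=1)  # PL: large itemsets in the sample
print(len(PL), "itemsets are potentially large")
```

In the real algorithm a support threshold smaller than s ("smalls") is used on the sample, to reduce the chance of missing itemsets that are large in the full database.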
- Potentially Large itemsets (PL): the large itemsets found in the sample
- Negative Border (BD⁻):
  - A generalization of Apriori-Gen applied to itemsets of varying sizes
  - The minimal set of itemsets which are not in PL, but all of whose subsets are in PL

**Negative Border Example**

- (figure: PL together with its negative border BD⁻(PL))

**Sampling Algorithm**

- Ds = sample of database D
- PL = large itemsets in Ds using smalls (a support threshold smaller than s)
- C = PL ∪ BD⁻(PL)
- Count C in database D using s
- ML = itemsets in BD⁻(PL) that turn out to be large in D
- If ML = ∅ then done
- else C = repeated application of BD⁻; count C in database D

**Sampling Example**

- Find association rules assuming s = 20%
- Ds = {t1, t2}; smalls = 10%
- PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
- BD⁻(PL) = {{Beer}, {Milk}}
- ML = {{Beer}, {Milk}}
- Repeated application of BD⁻ generates all remaining itemsets

**Sampling Adv/Disadv**

- Advantages:
  - Reduces the number of database scans to one in the best case and two in the worst
  - Scales better
- Disadvantages:
  - Potentially a large number of candidates in the second pass

**Partitioning**

- Divide the database into partitions D1, D2, ..., Dp
- Apply Apriori to each partition
- Any large itemset must be large in at least one partition

**Partitioning Algorithm**

- Divide D into partitions D1, D2, ..., Dp
- For i = 1 to p do: Li = Apriori(Di)
- C = L1 ∪ ... ∪ Lp
- Count C on D to generate L

**Partitioning Example**

- s = 10%
- Partition D1: L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter}, {Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
- Partition D2: L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread, Milk}, {Bread, PeanutButter}, {Milk, PeanutButter}, {Bread, Milk, PeanutButter}, {Beer}, {Beer, Bread}, {Beer, Milk}}

**Partitioning Adv/Disadv**

- Advantages:
  - Adapts to available main memory
  - Easily parallelized
  - Maximum number of database scans is two
- Disadvantages:
  - May have many candidates during the second scan

**Count Distribution Algorithm (CDA)**

- Place a data partition at each site
- In parallel at each site do:
  - C1 = itemsets of size one in I
  - Count C1
  - Broadcast counts to all sites
  - Determine the global large itemsets of size 1, L1
  - i = 1
  - Repeat:
    - i = i + 1
    - Ci = Apriori-Gen(Li-1)
    - Count Ci
    - Broadcast counts to all sites
    - Determine the global large itemsets of size i, Li
  - until no more large itemsets are found

**Data Distribution Algorithm (DDA)**

- Place a data partition at each site
- In parallel at each site do:
  - Determine local candidates of size 1 to count
  - Broadcast local transactions to the other sites
  - Count local candidates of size 1 on all data
  - Determine the large itemsets of size 1 among the local candidates
  - Broadcast large itemsets to all sites
  - Determine L1
  - i = 1
  - Repeat:
    - i = i + 1
    - Ci = Apriori-Gen(Li-1)
    - Determine local candidates of size i to count
    - Count, broadcast, and find Li
  - until no more large itemsets are found

**Applications**

- Market basket analysis
  - Store layout, client offers
- ...

**Application Difficulties**

- Wal-Mart knows that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars
- What does Wal-Mart do with information like that? "I don't have a clue," says Wal-Mart's chief of merchandising, Lee Scott
- See KDnuggets 98:01 for many ideas: www.kdnuggets.com/news/98/n01.html
- The "diapers and beer" urban legend

**Summary**

- Frequent itemsets
- Association rules
- Subset property
- Apriori algorithm
- Application difficulties
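As a closing worked example tying together the definitions of support, confidence, and lift: a minimal Python sketch. The slides' transaction table is not reproduced in this text, so the `db` below is my own reconstruction, chosen to be consistent with the counts quoted earlier (nine transactions, sup({A,B,E}) = 2, sup({B,C}) = 4, conf(A,B => E) = 2/4, conf(B => A,E) = 2/7):

```python
# Support count, confidence, and lift for a rule I => J, following the
# definitions in the slides. `db` is a reconstructed transaction table
# consistent with the support counts quoted above, not the original.
def measures(transactions, I, J):
    n = len(transactions)
    sup_IJ = sum(1 for t in transactions if I | J <= t)  # sup(I ∪ J)
    sup_I = sum(1 for t in transactions if I <= t)
    sup_J = sum(1 for t in transactions if J <= t)
    conf = sup_IJ / sup_I      # fraction of transactions with I that also have J
    lift = conf / (sup_J / n)  # P(J|I) / P(J)
    return sup_IJ, conf, lift

db = [{"A", "B", "E"}, {"A", "B", "C", "E"}, {"A", "B"}, {"A", "B", "C"},
      {"A"}, {"A", "C"}, {"B", "C"}, {"B", "C"}, {"B"}]

sup, conf, lift = measures(db, {"A", "B"}, {"E"})
print(sup, conf, lift)  # sup = 2, conf = 2/4 = 0.5, lift = 0.5 / (2/9) ≈ 2.25
```

The lift above is well over 1, so in this reconstruction A,B and E are positively correlated, which is exactly the kind of signal confidence alone cannot distinguish from a merely popular consequent.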