Association Rules AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology
Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemsets mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports
The Market-Basket Model • A large set of items • e.g., things sold in a supermarket • A large set of baskets, each of which is a small set of the items • e.g., each basket lists the things one customer buys in a single trip • Goal: Find “interesting” connections between items • Can be used to model any many-to-many relationship, not just in the retail setting
The Frequent Itemsets • Simplest question: Find sets of items that appear together “frequently” in baskets
Definition: Frequent Itemset • Itemset • A collection of one or more items • Example: {Milk, Bread, Diaper} • k-itemset • An itemset that contains k items • Support count (σ) • Frequency of occurrence of an itemset • E.g. σ({Milk, Bread, Diaper}) = 2 • Support (s) • Fraction of transactions that contain an itemset • E.g. s({Milk, Bread, Diaper}) = 2/5 • Frequent Itemset • An itemset whose support is greater than or equal to a minsup threshold Example: Set minsup = 0.5. The frequent 2-itemsets: {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Diaper, Beer}
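The transaction table itself is a figure on the original slide and is not reproduced in the text. The counts above (σ = 2 for {Milk, Bread, Diaper}, support 2/5) are consistent with the classic five-transaction example from Tan, Steinbach and Kumar, which the minimal sketch below assumes; the dataset literal is therefore an assumption, not the slide's exact table.

```python
from itertools import combinations

# Hypothetical five-transaction database, assumed to match the slide's example
# (the counts sigma({Milk, Bread, Diaper}) = 2 and s = 2/5 hold for it).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4

# Frequent 2-itemsets at minsup = 0.5
items = sorted(set().union(*transactions))
for pair in combinations(items, 2):
    if support(set(pair), transactions) >= 0.5:
        print(pair)
```

Run on this assumed table, the 2-itemset loop prints exactly the four frequent pairs listed above.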
Definition: Association Rule • Association Rule • If-then rules: an implication expression of the form X → Y, where X and Y are itemsets • Example: {Milk, Diaper} → {Beer} • Rule Evaluation Metrics • Support (s) • Fraction of transactions that contain both X and Y • Confidence (c) • Measures how often items in Y appear in transactions that contain X
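Written out as formulas (the standard definitions, stated here because the slide shows them only graphically), with |T| the total number of transactions:

```latex
s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|},
\qquad
c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
```

For {Milk, Diaper} → {Beer}, σ({Milk, Diaper, Beer}) = 2 out of 5 transactions gives s = 0.4, and σ({Milk, Diaper}) = 3 gives c = 2/3 ≈ 0.67, matching the rule list shown a few slides later.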
Applications of Association Rules 1. Items = products; Baskets = sets of products a customer bought; e.g., many people buy beer and diapers together → run a sale on diapers and raise the price of beer 2. Items = documents; Baskets = documents containing a similar sentence; items that appear together too often could represent plagiarism 3. Items = words; Baskets = Web pages; co-occurrence of relatively rare words may indicate an interesting relationship
Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemsets mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports
Association Rule Mining Task • Given a set of transactions T, the goal of association rule mining is to find all rules having • support ≥ minsup threshold • confidence ≥ minconf threshold • Brute-force approach: • List all possible association rules • Compute the support and confidence for each rule • Prune rules that fail the minsup and minconf thresholds Computationally prohibitive!
Mining Association Rules Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67) {Milk, Beer} → {Diaper} (s=0.4, c=1.0) {Diaper, Beer} → {Milk} (s=0.4, c=0.67) {Beer} → {Milk, Diaper} (s=0.4, c=0.67) {Diaper} → {Milk, Beer} (s=0.4, c=0.5) {Milk} → {Diaper, Beer} (s=0.4, c=0.5) • Observations: • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} • Rules originating from the same itemset have identical support but can have different confidence • Thus, we may decouple the support and confidence requirements
Mining Association Rules • Two-step approach: • Frequent Itemset Generation • Generate all itemsets whose support ≥ minsup • Rule Generation • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset • Frequent itemset generation is still computationally expensive
Computational Complexity • Given d unique items in all transactions: • Total number of itemsets = 2^d − 1 • Total number of possible association rules R grows even faster (see the formula below) • If d = 6, R = 602 rules • d (#items) can be 100K (Wal-Mart)
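The rule-count expression appears only as an image on the original slide; the standard formula (which indeed gives the quoted R = 602 for d = 6) is:

```latex
R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j}
  \;=\; 3^{d} - 2^{d+1} + 1,
\qquad
d = 6:\; R = 729 - 128 + 1 = 602
```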
Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemsets mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports
Reduce the number of candidates • Complete search over all itemsets: 2^d − 1 candidates • Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent: if Y is frequent, every X ⊆ Y is frequent • If an itemset is infrequent, then all of its supersets must also be infrequent: if X is infrequent, every Y ⊇ X is infrequent
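The principle rests on support being anti-monotone, which can be stated compactly (a standard formulation, not quoted from the slide):

```latex
\forall\, X \subseteq Y : \quad \sigma(X) \;\ge\; \sigma(Y)
```

Hence if Y clears the minsup threshold, every subset X does too, and if X fails the threshold, every superset Y fails as well; this is what lets Apriori discard a candidate as soon as one of its subsets is known to be infrequent.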
Illustrating the Apriori Principle (figure: database D, minimum support = 3/5; scan D to count items (1-itemsets) and eliminate infrequent ones, generate candidate pairs (2-itemsets), scan D and eliminate, then generate candidate triplets (3-itemsets), prune, and scan D; a 3-itemset containing Milk has count 2 and is therefore not a frequent 3-itemset)
Apriori Algorithm • Let k=1 • Generate frequent itemsets of length 1 • Repeat until no new frequent itemsets are identified • Generate length (k+1) candidate itemsets from length k frequent itemsets • Prune candidate itemsets containing subsets of length k that are infrequent • Count the support of each candidate by scanning the DB • Eliminate candidates that are infrequent, leaving only those that are frequent
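A minimal runnable sketch of the loop above, assuming transactions are Python sets as in the earlier snippet; the candidate-generation step merges pairs of frequent k-itemsets and the prune step applies the Apriori principle, but the exact generation strategy here is our choice, not prescribed by the slide.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    n_min = minsup * len(transactions)
    # Frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= n_min}
    result = dict(frequent)
    k = 1
    while frequent:
        # Generate (k+1)-candidates by merging frequent k-itemsets
        keys = list(frequent)
        candidates = set()
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == k + 1:
                    candidates.add(union)
        # Prune: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k))}
        # Count supports with one scan of the database
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: cnt for c, cnt in counts.items() if cnt >= n_min}
        result.update(frequent)
        k += 1
    return result
```

On the assumed five-transaction table with minsup = 0.5, this returns the frequent single items plus the four frequent pairs listed earlier (no 3-itemset survives the threshold).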
Factors Affecting Complexity • Choice of minimum support threshold • lowering the support threshold results in more frequent itemsets • this may increase the number of candidates and the max length of frequent itemsets • Dimensionality (number of items) of the data set • more space is needed to store the support count of each item • if the number of frequent items also increases, both computation and I/O costs may increase • Size of database • since Apriori makes multiple passes, run time may increase with the number of transactions • Average transaction width • transaction width increases with denser data sets • this may increase the max length of frequent itemsets
Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemsets mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports
Mining Frequent Patterns Without Candidate Generation • Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure • highly condensed, but complete for frequent pattern mining • avoids costly database scans • Develop an efficient FP-tree-based frequent pattern mining method • A divide-and-conquer methodology: decompose mining tasks into smaller ones • Avoid candidate generation: sub-database test only!
FP-tree Construction Header Table (item : frequency): a:8, b:7, c:6, d:5, e:3 • Steps: • Scan DB once, find the frequent 1-itemsets (single-item patterns) • Scan DB again, construct the FP-tree • (one transaction → one path in the FP-tree)
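A minimal sketch of the two-scan construction, in terms of the same hypothetical set-of-sets transactions used above; node-links are kept as plain Python lists hanging off the header table, and frequency ties are broken alphabetically, a detail the slide does not specify.

```python
class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item label (None for the root)
        self.count = 1          # number of transactions sharing this prefix
        self.parent = parent
        self.children = {}      # item -> FPNode

def build_fp_tree(transactions, minsup_count):
    # Pass 1: frequency of each item; keep only frequent items
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= minsup_count}

    root = FPNode(None, None)
    header = {item: [] for item in freq}   # item -> list of nodes (node-links)

    # Pass 2: insert each transaction as a path, most frequent items first
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # maintain node-link pointer
            else:
                child.count += 1
            node = child
    return root, header, freq
```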
FP-tree Construction: Pointers Header-table pointers (node-links) connect all nodes carrying the same item; these pointers are used to assist frequent itemset generation
Benefits of the FP-tree Structure • Completeness: • never breaks a long pattern of any transaction • preserves complete information for frequent pattern mining • Compactness: • one transaction → one path in the FP-tree • paths may overlap: the more the paths overlap, the more compression is achieved • never larger than the original database (not counting node-links and counts) • best case: one single branch of nodes • The size of an FP-tree depends on how the items are ordered
Different FP-trees from Different Item Orderings Header Table 1 (descending frequency): a:8, b:7, c:6, d:5, e:3 Header Table 2 (ascending frequency): e:3, d:5, c:6, b:7, a:8
Generating Frequent Itemsets FP-Growth algorithm: works in a bottom-up fashion, deriving the frequent itemsets ending with a particular item (suffix); the paths ending in that item can be accessed rapidly using the node-link pointers associated with it (e.g., with e)
Generating Frequent Itemsets FP-Growth algorithm: decompose the frequent itemset generation problem into multiple sub-problems, one per suffix item
Example: find frequent itemsets including e How to solve the sub-problem? • Construct the conditional tree for e (if e is frequent) • Start from paths containing node e • Prefix paths ending in {de} → conditional tree for {de} • Prefix paths ending in {ce} → conditional tree for {ce} • Prefix paths ending in {ae} → conditional tree for {ae}
Example: find frequent itemsets including e • How to solve the sub-problem? • Check whether e is frequent • Convert the prefix paths into the conditional FP-tree • Update support counts • Remove node e • Ignore infrequent node b Count(e) = 3 > Minsup = 2 Frequent itemsets: {e}
Example: find frequent itemsets including e • How to solve the sub-problem? • (prefix paths and conditional tree for {de}) • Check whether d is frequent in the prefix paths ending in {de} Count(d,e) = 2 ≥ Minsup = 2 Frequent itemsets: {e}, {d,e}
Example: find frequent itemsets including e • How to solve the sub-problem? • (prefix paths and conditional tree for {de}) • Convert the prefix paths (ending in {de}) into the conditional FP-tree for {de} • Update support counts • Remove node d • Ignore infrequent node c Frequent itemsets: {e}, {d,e}
Example: find frequent itemsets including e • How to solve the sub-problem? • (prefix paths and conditional tree for {de}) • Check whether a is frequent in the conditional FP-tree for {de} Count(a) = 2 ≥ Minsup = 2 Frequent itemsets: {e}, {d,e}, {a,d,e}
Example: find frequent itemsets including e • How to solve the sub-problem? • (prefix paths and conditional tree for {ce}) • Check whether c is frequent in the prefix paths ending in {e} Count(c,e) = 2 ≥ Minsup = 2 Frequent itemsets: {e}, {d,e}, {a,d,e}, {c,e}
Example: find frequent itemsets including e • How to solve the sub-problem? • (prefix paths and conditional tree for {ce}) • Convert the prefix paths (ending in {ce}) into the conditional FP-tree for {ce} • Update support counts • Remove node c Frequent itemsets: {e}, {d,e}, {a,d,e}, {c,e}
Example: find frequent itemsets including e • How to solve the sub-problem? • (prefix paths and conditional tree for {ce}) • Check whether a is frequent in the conditional FP-tree for {ce} Count(a) = 1 < Minsup = 2, so {a,c,e} is not frequent Frequent itemsets: {e}, {d,e}, {a,d,e}, {c,e}
Example: find frequent itemsets including e • How to solve the sub-problem? • (prefix paths and conditional tree for {ae}) • Check whether a is frequent in the prefix paths ending in {ae} Count(a,e) = 2 ≥ Minsup = 2 Frequent itemsets: {e}, {d,e}, {a,d,e}, {c,e}, {a,e}
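The whole suffix-by-suffix procedure just walked through can be sketched as a short recursion over the hypothetical FPNode/header structures built earlier; the conditional pattern base is gathered by following parent pointers from every node-link of the suffix item, and build_fp_tree is reused to form each conditional FP-tree.

```python
def fp_growth(header, freq, minsup_count, suffix=()):
    """Yield (frequent_itemset, support_count) pairs, mining suffixes bottom-up."""
    # Visit items from least to most frequent (bottom-up over the header table)
    for item in sorted(freq, key=lambda i: freq[i]):
        support = sum(node.count for node in header[item])
        if support < minsup_count:
            continue
        itemset = (item,) + suffix
        yield itemset, support

        # Conditional pattern base: the prefix path of every node carrying `item`,
        # each path repeated according to that node's count
        conditional_db = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            conditional_db.extend([path] * node.count)

        # Conditional FP-tree for `itemset`, then recurse on it
        _, cond_header, cond_freq = build_fp_tree(conditional_db, minsup_count)
        yield from fp_growth(cond_header, cond_freq, minsup_count, itemset)
```

On the (not fully shown) example database behind these slides, the e-suffix branch of this recursion would report {e}, {d,e}, {a,d,e}, {c,e}, and {a,e}, as derived above.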
Mining Frequent Itemsets using the FP-tree • General idea (divide-and-conquer) • Recursively grow frequent patterns using the FP-tree • At each step, a conditional FP-tree is constructed by updating the frequency counts along the prefix paths and removing all infrequent items • Properties • Sub-problems are disjoint → no duplicate itemsets • FP-growth is an order of magnitude faster than the Apriori algorithm (the speed-up depends on the compaction factor of the data) • No candidate generation, no candidate test • Uses a compact data structure • Eliminates repeated database scans • Basic operations are counting and FP-tree building
Compacting the Output of Frequent Itemsets: Maximal vs Closed Itemsets • Maximal frequent itemset: a frequent itemset none of whose immediate supersets is frequent • Closed itemset: an itemset (with count > 0) none of whose immediate supersets has the same count • Closed frequent itemsets store not only which itemsets are frequent, but also their exact support counts
Example: Maximal vs Closed Itemsets
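The worked example on this slide is a figure; as a stand-in, here is a small sketch that applies the two definitions to an arbitrary support-count dictionary such as the one returned by the apriori sketch earlier (the function and argument names are ours).

```python
def maximal_and_closed(supports, all_items, minsup_count):
    """supports: {frozenset: count} for every itemset with count > 0."""
    frequent = {s for s, c in supports.items() if c >= minsup_count}

    def immediate_supersets(s):
        return [s | {i} for i in all_items if i not in s]

    # Maximal frequent: frequent, and no immediate superset is frequent
    maximal = {s for s in frequent
               if not any(sup in frequent for sup in immediate_supersets(s))}
    # Closed: no immediate superset has the same support count
    closed = {s for s, c in supports.items()
              if not any(supports.get(sup, 0) == c
                         for sup in immediate_supersets(s))}
    return maximal, closed
```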
Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemsets mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports
Rule Generation • Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement • If {A,B,C,D} is a frequent itemset, the candidate rules are: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB • If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
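A small sketch that enumerates exactly these candidate rules for any frequent itemset; for a k-itemset it yields 2^k − 2 rules, e.g., the 14 rules listed above for {A,B,C,D}.

```python
from itertools import combinations

def candidate_rules(itemset):
    """All rules X -> itemset - X for non-empty proper subsets X."""
    items = sorted(itemset)
    for r in range(1, len(items)):            # antecedent sizes 1 .. k-1
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))   # 14 == 2**4 - 2
```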
Rule Generation • How to efficiently generate rules from frequent itemsets? • In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D), since c(ABC → D) = s(ABCD)/s(ABC) and c(AB → D) = s(ABD)/s(AB), and we only know s(ABC) ≤ s(AB) and s(ABCD) ≤ s(ABD) (support never increases as the itemset grows) • But the confidence of rules generated from the same itemset does have an anti-monotone property with respect to the size of the consequent, e.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD) • Computing the confidence of a rule does not require additional scans of the data
Rule Generation for the Apriori Algorithm • Apriori algorithm: • level-wise approach for generating association rules • at each level, all rules have the same number of items in the rule consequent • rules with t consequent items are used to generate rules with t+1 consequent items
Rule Generation for the Apriori Algorithm • A candidate rule is generated by merging two rules that share the same prefix in the rule consequent • Join(CD → AB, BD → AC) would produce the candidate rule D → ABC • Prune rule D → ABC if its subset rule AD → BC does not have high confidence
Rule Generation for the Apriori Algorithm (figure: once a low-confidence rule is found, the rules below it are pruned; here, all rules containing item A as consequent are pruned)
Outline: Mining Association Rules • Motivation and Definition • High computational complexity • Frequent itemsets mining • Apriori algorithm – reduce the number of candidates • Frequent-Pattern tree (FP-tree) • Rule Generation • Rule Evaluation • Mining rules with multiple minimum supports
Interesting Rules? • Rules with high support and confidence may be useful, but not “interesting” • e.g., when the “then” part of an if-then rule is already a frequent event on its own • Example: {Milk, Beer} → {Diaper} (s=0.4, c=1.0), but Prob(Diaper) = 4/5 = 0.8 • High interest suggests a cause that might be worth investigating
Computing Interestingness Measures • Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table for X → Y: • f11: the # of transactions containing both X and Y • f10: the # of transactions containing X but not Y • f01: the # of transactions containing Y but not X • f00: the # of transactions containing neither X nor Y • These counts are used to define various measures: support (f11/|T|), confidence (f11/f1+, where f1+ = f11 + f10), lift, Gini, J-measure, etc.
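A minimal sketch computing three of these measures straight from the four contingency counts; the variable names mirror the f-notation above, and the example call uses the Tea/Coffee numbers from the next slide (15 both, 5 tea only, 75 coffee only, 5 neither, which follow from the totals given there).

```python
def measures(f11, f10, f01, f00):
    """Support, confidence and lift of X -> Y from its contingency table."""
    n = f11 + f10 + f01 + f00               # |T|: total number of transactions
    support = f11 / n
    confidence = f11 / (f11 + f10)          # P(Y | X)
    lift = confidence / ((f11 + f01) / n)   # P(Y | X) / P(Y)
    return support, confidence, lift

# Tea -> Coffee example: f11=15, f10=5, f01=75, f00=5
print(measures(15, 5, 75, 5))   # (0.15, 0.75, 0.8333...)
```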
Drawback of Confidence Association Rule: Tea → Coffee • Confidence = P(Coffee|Tea) = 15/20 = 0.75 • but P(Coffee) = 90/100 = 0.9 • Although the confidence is high, the rule is misleading • P(Coffee | no Tea) = 75/80 = 0.9375, i.e., customers who do not buy tea are even more likely to buy coffee
Example: Lift/Interest • Association Rule: Tea → Coffee • Confidence = P(Coffee|Tea) = 15/20 = 0.75 • but P(Coffee) = 90/100 = 0.9 • Lift = P(Coffee|Tea) / P(Coffee) = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
There are many other interestingness measures proposed in the literature (see Tan, Steinbach, Kumar, Introduction to Data Mining)