Market Basket Analysis and Association Rules

Presentation Transcript


  1. Market Basket Analysis and Association Rules, Chapter 14

  2. Customers Tend to Buy Things Together…

  3. Market Basket Analysis (MBA)? • Relationships through associations • Examples of association rules: • {Bread} → {Milk} • {Diaper} → {?} • {Milk, Bread} → {?} • In summary, we want to know, “what product in a shopping basket is likely to go with what other product”

  4. Similar Ideas Apply to Many Industries (not just shopping baskets) • Retailer’s point-of-sale (POS) data • Credit card data (possibly cross-merchant purchases) • Services ordered by telecom customers • Banking services ordered • Records of insurance claims (for fraud detection) • Medical records • … • Generalizing MBA to association rules (also called affinity analysis): we want to know “what goes with what”

  5. Why Do We Care? • Product placement • Whole Foods: next to flowers are birthday cards • Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997] • Recommendations • Amazon.com: as you are looking at HDTVs, you might also want HDMI cables • Bundling • E.g., travel “packages” – flight, hotel, car • Other Applications • Price discrimination • Website / catalog design • Fraud detection (multiple suspicious insurance claims) • Medical complications (based on combinations of treatments)

  6. Example: Recommendations in Amazon.com

  7. Association Rules – A Definition • Given a transactional database (set of transactions), • find rules that • predict the occurrence of an item • based on the occurrences of other items in the database. • Implication means co-occurrence, not causality

  8. Rule Format • IF {set of items} THEN {set of items} • Example: IF {diapers} THEN {beer} • “IF” part: antecedent • “THEN” part: consequent • “Itemset” = the set of items (e.g., products) comprising the antecedent and consequent • The antecedent and consequent are disjoint (i.e., have no items in common)

  9. Many Rules are Possible • Consider the example table to the right: Transaction 2 supports several rules, such as • “If bread, then diapers” • “If beer, then diapers” • “If bread and beer, then eggs” • + many more … • Given n items, the total number of itemsets with at least two items (i.e., itemsets that can be split into a rule) is 2^n − n − 1!
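
For concreteness, here is a short worked instance of that count (illustrative n, not taken from the slides):

```latex
% With n = 5 distinct items, the number of itemsets containing at least
% two items (i.e., itemsets that can be split into a rule) is
\[
2^{n} - n - 1 \;=\; 2^{5} - 5 - 1 \;=\; 32 - 5 - 1 \;=\; 26.
\]
```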

  10. Frequent “Itemsets” • Ideally, we want to create all possible combinations of items • Problem: computation time grows exponentially as # of items increases • Solution: consider only “frequent itemsets” • Criterion for “frequent”: support

  11. Support (measures the relevance of a rule) • Support for {bread} → {diapers} is 2/4 • In other words, 50% of transactions include this pair of items • Support quantifies the significance of the co-occurrence of the items involved in a rule • In practice, we only care about itemsets with strong enough support
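
A minimal sketch of how support could be computed, using a hypothetical four-transaction list reconstructed to match the numbers quoted on the slides (the actual table appears only as an image in the deck):

```python
# Support of an itemset = fraction of transactions that contain
# every item in the itemset.
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

# Hypothetical transactions, chosen so that {bread, diapers} appears in 2 of 4.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
]

print(support({"bread", "diapers"}, transactions))  # 0.5, i.e., 2 out of 4
```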

  12. Exercise on Support: Phone Faceplates • What is the support of {white}? • What is the support of {red, white}? • If we only care about itemsets with minimum support 50%, do we care about {white}? Do we care about {red, white}?

  13. Confidence (measures the strength of a rule) • Confidence for {bread} → {diapers} is 2/3 • In other words: conditional on the fact that a basket contains bread, the probability that the same basket also contains diapers is 2/3
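
A companion sketch for confidence, reusing the same hypothetical transaction list (again an assumed reconstruction of the slide’s table):

```python
# Confidence of X -> Y: among transactions containing X,
# the fraction that also contain Y.
def confidence(antecedent, consequent, transactions):
    antecedent, consequent = set(antecedent), set(consequent)
    with_x = [t for t in transactions if antecedent <= set(t)]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if consequent <= set(t)) / len(with_x)

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
]

print(confidence({"bread"}, {"diapers"}, transactions))  # 2/3 ≈ 0.667
print(confidence({"diapers"}, {"bread"}, transactions))  # 2/2 = 1.0
```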

  14. Exercise on Confidence • Confidence for {diapers} → {bread} is? • Confidence for {Milk, Eggs} → {Shoes} is? • Confidence for {red, white} → {green} is?

  15. Valid Association Rules • A valid rule is one that meets both a minimum support threshold and a minimum confidence threshold. • Both thresholds are determined by the decision maker. • Why do we need both thresholds? • A strong rule (high confidence) is not necessarily relevant (high support) • Example of a rule with high confidence but low support: • A cell phone company database contains all call destinations for each account • {Germany} → {France, Belgium} with confidence = 100% • Support is 1 out of 100K accounts.
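
A small sketch of the two-threshold check, phrased as a single hypothetical helper over the same assumed transaction list used above:

```python
# A rule X -> Y is "valid" only if it clears BOTH thresholds.
def is_valid_rule(antecedent, consequent, transactions, min_sup, min_conf):
    antecedent, consequent = set(antecedent), set(consequent)
    with_x = [t for t in transactions if antecedent <= set(t)]
    with_both = [t for t in with_x if consequent <= set(t)]
    sup = len(with_both) / len(transactions)               # support of the full itemset
    conf = len(with_both) / len(with_x) if with_x else 0.0
    return sup >= min_sup and conf >= min_conf

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
]
print(is_valid_rule({"bread"}, {"diapers"}, transactions, 0.5, 0.5))  # True
print(is_valid_rule({"bread"}, {"diapers"}, transactions, 0.5, 0.8))  # False: confidence is only 2/3
```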

  16. Valid Association Rules • Suppose we use: Minimum support: 50% • Minimum confidence: 50% • Check support: • Check rules – only two survive: • Bread → Diapers (Support = 50%, Confidence = 66.67%) • Diapers → Bread (Support = 50%, Confidence = 100%)

  17. Is a “Valid Rule” Always a “Good Rule”? • Consider: • Tea → Coffee • Confidence = #(Coffee and Tea)/#(Tea) = 15/20 = 75% • i.e., the probability that someone who has bought tea will also buy coffee is 75%. • Seems good?

  18. Caveat About Confidence • Tea → Coffee • Recall that Confidence = #(Coffee and Tea)/#(Tea) = 15/20 = 75% • But P(Coffee) = #(Coffee)/100 = 90/100 = 90% • i.e., the probability that someone would have bought coffee (regardless of tea) is 90% • So, given that tea has been bought, the probability of buying coffee has actually dropped. • Although confidence is high, the rule is misleading! • In fact, the confidence of “NOT Tea → Coffee” is 75/80 = 93.75%

  19. Statistical (In)Dependence • Population of 1000 students • 600 students know how to swim (S) • 700 students know how to bike (B) • 300 students know how to swim and bike (S, B) • P(S and B) = 300/1000 = 0.30 • P(S) × P(B) = 0.6 × 0.7 = 0.42 • P(S and B) = P(S) × P(B): statistical independence • P(S and B) > P(S) × P(B): positively correlated • P(S and B) < P(S) × P(B): negatively correlated • P(Coffee and Tea) = 15/100 = 0.15 • P(Coffee) × P(Tea) = 0.9 × 0.2 = 0.18 > 0.15: negatively correlated

  20. Another Performance Measure: Lift • The lift of a rule measures how much more likely the consequent is, given the antecedent, relative to its overall frequency • Tea → Coffee • Confidence is 75% • Support of Coffee (the consequent) is 90% • Lift = 0.75/0.9 = 0.833 < 1, so this rule is worse than not having any rule.
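
The same lift calculation in a few lines, using the tea/coffee counts from the slides (100 transactions, 20 containing tea, 90 containing coffee, 15 containing both):

```python
# Lift of X -> Y = confidence(X -> Y) / support(Y).
# Lift < 1 means the antecedent makes the consequent LESS likely than its baseline.
n_total, n_tea, n_coffee, n_both = 100, 20, 90, 15

confidence_tea_coffee = n_both / n_tea          # 15/20 = 0.75
support_coffee = n_coffee / n_total             # 90/100 = 0.90
lift = confidence_tea_coffee / support_coffee   # ≈ 0.833

print(lift)  # 0.833... < 1, so the rule is worse than no rule at all
```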

  21. More on Lift • Another example: {diapers} → {beer} • # of customers in database: 1000 • # of customers buying diapers: 200 • # of customers buying beer: 50 • # of customers buying diapers & beer: 20 • Confidence: 20/200 = 0.1 (or 10%) • Support of consequent: 50/1000 = 0.05 (or 5%) • Lift: 0.1/0.05 = 2 > 1 • i.e., diapers and beer are positively correlated – WHY?

  22. Exercise on Lift • Lift for {red, white} → {green} is ?

  23. More About Performance Measures • Are support, confidence, and lift together enough? • Example: • {maternity ward} → {patient is woman} • Confidence 100%, lift >> 1, but obvious and uninteresting • How do we screen for rules that are of particular interest and significance? • Use domain-specific conditions to filter generated rules. • Some thoughts: • “Actionability”: keep only rules that can be acted upon. • “Interestingness”: various measures of how unexpected a rule is. • E.g., a rule is interesting if it contradicts what is currently known.

  24. Other Evaluation Criteria (Optional) • Many measures have been proposed in the literature • Some measures are good for some applications, but not for others • What criteria should determine whether a measure is good or bad? • Piatetsky-Shapiro suggests 3 properties of a good measure M: • M(A, B) = 0 if A and B are statistically independent • M(A, B) increases monotonically with P(A, B) when P(A) and P(B) remain unchanged • M(A, B) decreases monotonically with P(A) [or P(B)] when P(A, B) and P(B) [or P(A)] remain unchanged • Support and lift are symmetric measures, i.e., M(A, B) = M(B, A) • Confidence is an asymmetric measure, i.e., M(A, B) ≠ M(B, A) • Piatetsky-Shapiro, G. and Frawley, W. J., eds. (1991), Knowledge Discovery in Databases, AAAI/MIT Press

  25. Generating Association Rules • Standard approach: Apriori • Developed by Agrawal et al. (1994) • The problem was defined as: • Generate all association rules that have • support greater than the user-specified support threshold min_sup (minimum support), and • confidence greater than the user-specified confidence threshold min_conf (minimum confidence) • The algorithm performs a (relatively) efficient search over the data to find all such rules.

  26. Generating Association Rules (Cont.) • The problem is decomposed into two sub-problems: • Find all sets of items (itemsets) with support above min_sup • Itemsets with support ≥ min_sup are called frequent itemsets. • From each frequent itemset, generate rules that use items from that frequent itemset. • Given a frequent itemset Y and a subset X of Y: • Take the support of Y and divide it by the support of X • This gives the confidence c of the rule X → (Y \ X) • If c ≥ min_conf, then X → (Y \ X) is a valid association rule
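
Written as a formula (restating the step above in symbols):

```latex
\[
c\bigl(X \Rightarrow (Y \setminus X)\bigr)
  = \frac{\mathrm{support}(Y)}{\mathrm{support}(X)},
\qquad
\text{keep the rule if } c \ge \texttt{min\_conf}.
\]
```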

  27. Phase 1: Finding Frequent Itemsets • Subsets of frequent itemsets must also be frequent • If a frequent itemset has size n, all of its subsets of size (n−1) are also frequent • If {diaper, beer} is frequent, then {diaper} and {beer} are also frequent • Therefore, if an itemset is not frequent, then no itemset that includes it can be frequent. • If {wine} is not frequent, then {wine, beer} cannot be frequent. • We start by finding all itemsets of size 1 that are frequent. • We then try to “expand” these by counting the frequency of only those itemsets of size 2 that are built from frequent itemsets of size 1. • Next, we take itemsets of size 2 that are frequent and try expanding them to itemsets of size 3. • We continue this process until further expansion is not possible (as sketched below).
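
A minimal sketch of this level-wise search (a simplified Apriori, not the textbook’s exact pseudocode), again using the hypothetical transaction list introduced earlier:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Level-wise search: only candidates built from frequent itemsets are counted."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def sup(itemset):
        s = set(itemset)
        return sum(1 for t in transactions if s <= t) / n

    # Level 1: frequent single items.
    current = [frozenset([i]) for i in items if sup([i]) >= min_sup]
    frequent = {fs: sup(fs) for fs in current}

    # Expand level by level; prune candidates that have an infrequent subset.
    while current:
        candidates = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
        current = [c for c in candidates
                   if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))
                   and sup(c) >= min_sup]
        frequent.update({c: sup(c) for c in current})
    return frequent

# Hypothetical transactions (assumed, not the textbook's table).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
]
for itemset, s in frequent_itemsets(transactions, min_sup=0.5).items():
    print(sorted(itemset), s)
```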

  28. Exercise on Phase 1 • Requirement: • Minimum support: 40% • Minimum confidence: 80% • Find all 1-item itemsets that meet the minimum support • What are the 2-item itemsets that you need to investigate? • Find all 2-item itemsets that meet the minimum support • Do you need to investigate any 3-item itemsets?

  29. Phase 2: Finding Association Rules • For each frequent itemset, find all possible rules of the form • Antecedent → Consequent • using items contained in the itemset • Only keep the rules that meet min_conf (minimum confidence). • Example: • Suppose {Milk, Bread, Butter} is a frequent itemset. • Does {Milk} → {Bread, Butter} have the minimum confidence? • Similarly: {Bread} → {Milk, Butter}, {Butter} → {Milk, Bread}, {Bread, Butter} → {Milk}, {Milk, Butter} → {Bread}, {Milk, Bread} → {Butter} • The confidence of the rule {Milk} → {Bread, Butter} is calculated as support({Milk, Bread, Butter}) / support({Milk}), as in the sketch below.
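
A sketch of this enumeration for a single frequent itemset, splitting it into every antecedent/consequent pair and keeping only rules that meet min_conf. The support values below are hypothetical, in the spirit of the {Milk, Bread, Butter} example:

```python
from itertools import combinations

# Hypothetical supports for the frequent itemset and all of its subsets
# (illustrative numbers only; in practice these come from Phase 1).
supports = {
    frozenset({"Milk", "Bread", "Butter"}): 0.10,
    frozenset({"Milk", "Bread"}): 0.20,
    frozenset({"Milk", "Butter"}): 0.15,
    frozenset({"Bread", "Butter"}): 0.12,
    frozenset({"Milk"}): 0.25,
    frozenset({"Bread"}): 0.40,
    frozenset({"Butter"}): 0.20,
}

def rules_from_itemset(itemset, supports, min_conf):
    """Generate X -> (Y \\ X) for every non-empty proper subset X of Y."""
    itemset = frozenset(itemset)
    rules = []
    for k in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, k)):
            conf = supports[itemset] / supports[antecedent]
            if conf >= min_conf:
                rules.append((sorted(antecedent), sorted(itemset - antecedent), conf))
    return rules

for x, y, conf in rules_from_itemset({"Milk", "Bread", "Butter"}, supports, min_conf=0.5):
    print(f"{x} -> {y}  (confidence = {conf:.2f})")
```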

  30. Exercise on Phase 2 • Requirement: • Minimum support: 40% • Minimum confidence: 80% • Recall that we have already found the following frequent itemsets: • {Beer} (support = 0.8), {Diapers} (support = 0.6), {Chocolates} (support = 0.4) and {Beer, Diapers} (support = 0.6) • Do we need to consider any 1-item frequent itemsets? • For each multi-item itemset, list all possible association rules and calculate their confidence. Then identify all valid association rules.

  31. On the Symmetry/Asymmetry of Metrics • Consider these two rules: • R1: {bread} → {diapers} • R2: {diapers} → {bread} • Which of these statements is (are) correct? (1) Support of R1 > Support of R2 (2) Confidence of R1 > Confidence of R2 (3) Lift of R1 > Lift of R2

  32. “Confidence” Is Not Symmetric • If A → B meets the minimum confidence threshold, B → A does NOT necessarily meet it! • Example: • Support of {Yogurt} is 0.2 (20%) • Support of {Yogurt, Bread, Butter} is 0.1 (10%) • Support of {Bread, Butter} is 0.5 (50%) • Confidence of {Yogurt} → {Bread, Butter} is 0.1/0.2 = 0.5 (50%) • Confidence of {Bread, Butter} → {Yogurt} is 0.1/0.5 = 0.2 (20%)
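
The same asymmetry, checked directly with the numbers on this slide:

```python
# Supports taken from the slide above.
sup_yogurt = 0.2
sup_bread_butter = 0.5
sup_all_three = 0.1   # support of {Yogurt, Bread, Butter}

conf_forward = sup_all_three / sup_yogurt         # {Yogurt} -> {Bread, Butter}
conf_backward = sup_all_three / sup_bread_butter  # {Bread, Butter} -> {Yogurt}

print(conf_forward, conf_backward)  # 0.5 vs. 0.2: one direction may pass min_conf while the other does not
```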

  33. Think Wildly -- Applications of Association Rules (Revisiting “Why Do We Care?”) • Product placement • Should a store put associated items together? • Recommendations • What if there are competing products to recommend? • Fraud detection • Finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity • Is it useful for website / catalog design? • Is dissociation important? • If A and NOT B → C • “Database” and NOT “Systems Analysis” → “Business Intelligence”

  34. Variants • Multiple-level Association Rules • Items often form a hierarchy, e.g., Milk [support = 10%] at Level 1, with 2% Milk [support = 6%] and Skim Milk [support = 4%] at Level 2 • Flexible support settings: uniform support (e.g., min_sup = 5% at both levels) vs. reduced support (e.g., min_sup = 5% at Level 1, min_sup = 3% at Level 2) • Items at the lower level are expected to have lower support.

  35. Variants • Analyzing sequential patterns • Given a set of sequences and a support threshold, find the complete set of frequent subsequences • Transaction databases vs. sequence databases • Example: customer shopping sequences: • More examples: • Medical treatment, natural disasters (e.g., earthquakes), telephone calling patterns, Weblog click streams, …

  36. Variants • Continuous attributes, or categorical attributes • Spatial and Multi-Media Association • Constraint-based Data Mining • Knowledge type constraint: • Classification, association, etc. • Data constraint — using SQL-like queries • Find product pairs sold together in stores in Richardson in May 2010. • Dimension/level constraint • Region, price, brand, customer category • Rule (or pattern) constraint • Small sales (price < $10) triggers big sales (sum > $200)

  37. Chapter 14 • Read Chapter 14 for details on this topic • Only Section 14.1
