
Business Intelligence Technologies – Data Mining


Presentation Transcript


  1. Business Intelligence Technologies – Data Mining Lecture 4 Market Basket Analysis, Association Rules

2. Market Basket Analysis (MBA) • MBA in the retail setting • Find out which products are bought together • Cross-selling • Optimize shelf layout • Product bundling • Timing promotions • Discount planning • Product selection under limited space • Targeted advertisement, personalized coupons, item recommendations • Usage beyond the market basket • Medical (associated symptoms) • Financial (customers with a mortgage account often also have a savings account)

3. Break-out Session Given a sample data set, discuss the useful information you can derive from the data. Think deeper, not just the information on the surface. 15-20 minutes of discussion in pre-assigned groups. You will be evaluated on the quality of your ideas.

  4. Data Description • A retailer (e.g. Target) has the following data sources. • Shopping transactions • Shopper information • Census data with information for each zip code

5. Data Format [Sample tables shown on the slide: Transaction data set, Shopper Information, Census Data]

  6. Present Your Answers

  7. What the data contains

  8. Rules Discovered from MBA • Actionable Rules • Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars • Trivial Rules • Customers who purchase large appliances are very likely to purchase maintenance agreements • Inexplicable Rules • When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners

9. Products Co-purchased Frequently – Frequent Itemsets Itemsets: {Beer, Diaper}, {Chocolate, Cheese}, {Beer, Diaper, Cheese}. Frequent itemset: a set of items/products that occurs often in the data, e.g. {Beer, Diaper}. How do we find the frequent itemsets (sets of products co-purchased together frequently)? Is it enough just to find all the frequent itemsets?

10. Association Rules For a frequent itemset {Diaper, Beer}, is Diaper promoting the purchase of Beer, or is Beer increasing the chance of a Diaper purchase? We need direction. Examples: Shoppers who buy Diaper are very likely to buy Beer: If buy {Diaper} → Then buy {Beer}. Shoppers who buy Beer and Diaper are likely to buy Cheese and Chocolate: If buy {Beer, Diaper} → Then buy {Cheese, Chocolate}.

11. Association Rules Rule format: If {set of items} → Then {set of items}. The LHS implies the RHS, e.g. If {Diaper, Baby Food} → Then {Beer, Wine}. An association rule is valid if it satisfies some evaluation measures.

12. Rule Evaluation • Milk & Wine co-occur • But… only 2 out of 200K transactions contain both items

13. Rule Evaluation – Support Support: the frequency with which the items in the LHS and RHS co-occur. Support = (No. of transactions containing the items in LHS and RHS) / (Total no. of transactions in the dataset). E.g., the support of the rule {Diaper} → {Beer} is 3/5: 60% of the transactions contain both items.
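Support is straightforward to compute by counting. Below is a minimal Python sketch; the five toy transactions are illustrative, constructed only to reproduce the 3/5 count quoted on this slide (the slide's own transaction table did not survive transcription).

# Minimal sketch: support of a rule LHS -> RHS over toy transactions.

def support(transactions, lhs, rhs):
    """Fraction of transactions containing every item in LHS and RHS."""
    items = set(lhs) | set(rhs)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

# Illustrative data matching the slide's counts.
transactions = [
    {"Diaper", "Beer", "Milk", "Chocolate"},
    {"Diaper", "Beer", "Milk", "Chocolate"},
    {"Diaper", "Beer", "Milk", "Chocolate"},
    {"Beer", "Chocolate"},
    {"Milk"},
]

print(support(transactions, {"Diaper"}, {"Beer"}))  # 3/5 = 0.6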

14. Rule Evaluation – Confidence Is Beer leading to Diaper purchases, or Diaper leading to Beer purchases? Confidence = (No. of transactions containing both LHS and RHS) / (No. of transactions containing LHS). • Among the transactions with Diaper, 100% have Beer: P(Beer|Diaper) = 100% • Among the transactions with Beer, 75% have Diaper: P(Diaper|Beer) = 75% • Confidence for {Diaper} → {Beer}: 3/3. When Diaper is purchased, the likelihood of a Beer purchase is 100% • Confidence for {Beer} → {Diaper}: 3/4. When Beer is purchased, the likelihood of a Diaper purchase is 75% • So {Diaper} → {Beer} is the more important rule according to confidence.
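Confidence is the same count restricted to the transactions that contain the LHS. A sketch, reusing the illustrative transactions from the support example:

# Confidence of LHS -> RHS: P(RHS | LHS), estimated by counting.

def confidence(transactions, lhs, rhs):
    lhs, both = set(lhs), set(lhs) | set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_lhs

transactions = [
    {"Diaper", "Beer", "Milk", "Chocolate"},
    {"Diaper", "Beer", "Milk", "Chocolate"},
    {"Diaper", "Beer", "Milk", "Chocolate"},
    {"Beer", "Chocolate"},
    {"Milk"},
]

print(confidence(transactions, {"Diaper"}, {"Beer"}))  # 3/3 = 1.0
print(confidence(transactions, {"Beer"}, {"Diaper"}))  # 3/4 = 0.75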

15. Rule Evaluation – Lift What are the support and confidence of the rule {Chocolate} → {Milk}? Support = 3/5, Confidence = 3/4. Very high support and confidence. Does Chocolate really lead to Milk purchases? No! Because Milk occurs in 4 out of 5 transactions, Chocolate is even decreasing the chance of a Milk purchase: 3/4 < 4/5, i.e. P(Milk|Chocolate) < P(Milk). Lift = (3/4)/(4/5) = 0.9375 < 1.

16. Rule Evaluation – Lift (cont.) • Measures how much more likely the RHS is given the LHS than the RHS alone • Lift = confidence of the rule / frequency of the RHS, i.e. P(RHS|LHS)/P(RHS). Example: {Diaper} → {Beer} • Total number of customers in the database: 1000 • No. of customers buying Diaper: 200 • No. of customers buying Beer: 50 • No. of customers buying Diaper & Beer: 20 • Frequency of Beer = 50/1000 (5%) • Confidence = 20/200 (10%) • Lift = 10%/5% = 2 • Lift above 1 implies people have a higher chance of buying Beer when they buy Diaper. Lift below 1 implies people have a lower chance of buying Milk when they buy Chocolate.
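Lift divides the rule's confidence by the baseline frequency of the RHS. A sketch on the same illustrative transactions, reproducing the {Chocolate} → {Milk} result:

# Lift of LHS -> RHS: P(RHS | LHS) / P(RHS). Lift < 1 means the LHS
# actually lowers the chance of the RHS.

def lift(transactions, lhs, rhs):
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= set(t))
    n_rhs = sum(1 for t in transactions if rhs <= set(t))
    n_both = sum(1 for t in transactions if (lhs | rhs) <= set(t))
    return (n_both / n_lhs) / (n_rhs / n)

transactions = [
    {"Diaper", "Beer", "Milk", "Chocolate"},
    {"Diaper", "Beer", "Milk", "Chocolate"},
    {"Diaper", "Beer", "Milk", "Chocolate"},
    {"Beer", "Chocolate"},
    {"Milk"},
]

print(lift(transactions, {"Chocolate"}, {"Milk"}))  # (3/4)/(4/5) = 0.9375
# The slide's 1000-customer example is plain arithmetic:
# lift = (20/200) / (50/1000) = 0.10 / 0.05 = 2.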

  17. Algorithm to Extract Association Rules • The standard algorithm: Apriori Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499 • The Association Rules problem was defined as: Generate all association rules that have • support greater than the user-specified minimum support • and confidence greater than the user-specified minimum confidence • The base algorithm uses support and confidence, but we can also use lift to rank the rules discovered by Apriori. • The algorithm performs an efficient search over the data to find all such rules.

18. Finding Association Rules from Data The association rule discovery problem is decomposed into two sub-problems: • Find all sets of items (itemsets) whose support is above the minimum support; these are called frequent itemsets or large itemsets. • From each frequent itemset, generate the rules whose confidence is above the minimum confidence: given a large itemset Y and a subset X of Y, calculate the confidence of the rule X → (Y - X); if it is above the minimum confidence, then X → (Y - X) is an association rule we are looking for.

19. Example • A data set with 5 transactions • Minimum support = 40%, minimum confidence = 80% • Phase 1: find all frequent itemsets: {Beer} (support = 80%), {Diaper} (60%), {Chocolate} (40%), {Beer, Diaper} (60%) • Phase 2: Beer → Diaper (conf. 3/4 = 75%, rejected), Diaper → Beer (conf. 3/3 = 100%, kept).
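At this scale, Phase 1 can be checked by brute force. The transactions below are illustrative, constructed so the supports match the numbers on this slide (the slide's own table did not survive transcription):

from itertools import combinations

# Illustrative data matching the slide's counts: {Beer} 80%, {Diaper} 60%,
# {Chocolate} 40%, {Beer, Diaper} 60%.
transactions = [
    {"Beer", "Diaper"},
    {"Beer", "Diaper", "Chocolate"},
    {"Beer", "Diaper"},
    {"Beer", "Milk"},
    {"Chocolate", "Wine"},
]
n = len(transactions)
items = sorted({i for t in transactions for i in t})

# Enumerate every non-empty itemset (fine at toy scale, hopeless at 2^N).
for size in range(1, len(items) + 1):
    for cand in combinations(items, size):
        sup = sum(1 for t in transactions if set(cand) <= t) / n
        if sup >= 0.4:  # minimum support
            print(set(cand), f"support = {sup:.0%}")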

20. Phase 1: Finding All Frequent Itemsets How can we search for all frequent itemsets efficiently? • A naïve way is to calculate the support of every possible itemset, but with N items there are 2^N possible itemsets: impossible to do! • We need a smarter method. Every itemset of size n-1 contained in a frequent itemset of size n must itself be frequent. Example: if {Diaper, Beer} is frequent, then {Diaper} and {Beer} are each frequent as well. • This means that if an itemset is not frequent (e.g., {Wine}), then no itemset that includes Wine, such as {Wine, Beer}, can be frequent either. • We therefore first find all frequent itemsets of size 1, then try to "expand" these by counting the frequency of the size-2 itemsets built from frequent size-1 itemsets. Example: if {Wine} is not frequent, we need not check whether {Wine, Beer} is frequent; but if both {Wine} and {Beer} are frequent, then it is possible (though not guaranteed) that {Wine, Beer} is also frequent. • Then take only the frequent itemsets of size 2, try to expand those, and so on.
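A compact sketch of this level-wise search, reusing the illustrative transactions from the example slide:

from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search: a size-(k+1) candidate is counted only
    if all of its size-k subsets are already known to be frequent."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)
    # Level 1: every distinct item is a candidate.
    candidates = {frozenset([i]) for t in transactions for i in t}
    found = {}
    while candidates:
        # One pass over the data counts all candidates at this level.
        level = {}
        for c in candidates:
            sup = sum(1 for t in transactions if c <= t) / n
            if sup >= min_support:
                level[c] = sup
        found.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, pruning any
        # candidate with an infrequent k-subset (the Apriori property).
        candidates = {
            a | b for a, b in combinations(level, 2)
            if len(a | b) == len(a) + 1
            and all(frozenset(s) in level for s in combinations(a | b, len(a)))
        }
    return found

transactions = [
    {"Beer", "Diaper"},
    {"Beer", "Diaper", "Chocolate"},
    {"Beer", "Diaper"},
    {"Beer", "Milk"},
    {"Chocolate", "Wine"},
]
for itemset, sup in frequent_itemsets(transactions, 0.4).items():
    print(set(itemset), f"support = {sup:.0%}")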

21. Phase 2: Generating Association Rules Assume {Milk, Bread, Butter} is a frequent itemset. • Using the items contained in the itemset, list all possible rules: • {Milk} → {Bread, Butter} • {Bread} → {Milk, Butter} • {Butter} → {Milk, Bread} • {Milk, Bread} → {Butter} • {Milk, Butter} → {Bread} • {Bread, Butter} → {Milk} • Calculate the confidence of each rule • Keep the rules with confidence above the minimum confidence. Confidence of {Milk} → {Bread, Butter} = Support({Milk, Bread, Butter}) / Support({Milk}) = (No. of transactions containing {Milk, Bread, Butter}) / (No. of transactions containing {Milk}).
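A sketch of this enumeration. The support counts below are hypothetical, invented only to make the example run; in practice they come out of Phase 1:

from itertools import combinations

def rules_from_itemset(itemset, counts, min_confidence):
    """Split a frequent itemset into every LHS -> RHS pair and keep the
    rules whose confidence clears the threshold. `counts` maps frozensets
    to their transaction counts (as produced by Phase 1)."""
    itemset = frozenset(itemset)
    rules = []
    for k in range(1, len(itemset)):  # LHS sizes 1 .. |itemset| - 1
        for lhs in map(frozenset, combinations(itemset, k)):
            conf = counts[itemset] / counts[lhs]
            if conf >= min_confidence:
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# Hypothetical counts for the slide's {Milk, Bread, Butter} example.
counts = {
    frozenset({"Milk"}): 40,
    frozenset({"Bread"}): 30,
    frozenset({"Butter"}): 25,
    frozenset({"Milk", "Bread"}): 25,
    frozenset({"Milk", "Butter"}): 22,
    frozenset({"Bread", "Butter"}): 21,
    frozenset({"Milk", "Bread", "Butter"}): 20,
}
for lhs, rhs, conf in rules_from_itemset({"Milk", "Bread", "Butter"}, counts, 0.8):
    print(lhs, "->", rhs, f"confidence = {conf:.0%}")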

22. Agrawal (1994)'s Apriori Algorithm: An Example [Diagram: three scans of the transaction table, generating candidate sets C1, C2, C3 and frequent ("large") sets L1, L2, L3 at each level. Exercise: are {A, B, C} and {A, C, E} frequent?]

23. Sequential Patterns Instead of finding associations between items within a single transaction, find associations between items across related transactions over time. • Sequence: {Laptop}, then {Wireless Card, Router} • A sequence has to satisfy some predetermined minimum support

24. Applications of Association Rules • Market-Basket Analysis: e.g. product assortment optimization, store shelf layout • Recommendations: determine which books are frequently purchased together and recommend associated books or products to people who express interest in an item • Healthcare: by studying the side-effects in patients with multiple prescriptions, we can discover previously unknown interactions and warn patients about them • Fraud detection: finding in insurance data that a certain doctor often works with a certain lawyer may indicate potential fraudulent activity (virtual items) • Sequence Discovery: looks for associations between items bought over time. E.g., we may notice that people who buy chili tend to buy antacid within a month. Knowledge like this can be used to plan inventory levels.

25. WEKA • Data Preprocessing • bank-data.csv → bank-data-final.arff • Remove the id attribute • Save in .arff format • Take a look at the .arff file in a text editor (e.g. WordPad) • Process the attributes using Filters • Association rules can only handle categorical data types • Children attribute: Filter → unsupervised → attribute → NumericToNominal • Age, income attributes: Filter → unsupervised → attribute → Discretize • Find association rules • Apriori: set parameters; class association rules • Read the results
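The steps above drive WEKA's GUI. For comparison only, here is a rough Python analogue of the same pipeline using the pandas and mlxtend libraries (assumed installed); the transactions are the toy data from earlier slides, not the bank data:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Beer", "Diaper"],
    ["Beer", "Diaper", "Chocolate"],
    ["Beer", "Diaper"],
    ["Beer", "Milk"],
    ["Chocolate", "Wine"],
]

# One-hot encode the transactions (the analogue of WEKA's nominal attributes).
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Apriori for frequent itemsets, then rules filtered by minimum confidence.
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])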

26. Binning • Binning (discretization) converts numeric values to discrete categories, e.g. low-income is <= 30, high-income is > 30. Two common schemes: • Equal-interval binning: bin intervals of equal width, irrespective of the number of items per bin • Equal-frequency binning: an equal number of items per bin, irrespective of bin width. Example values: 25, 26, 26, 29, 35, 37, 42, 45, 48. Equal-interval bins: 21-30 holds {25, 26, 26, 29}, 31-40 holds {35, 37}, 41-50 holds {42, 45, 48}. Equal-frequency bins: 21-26 holds {25, 26, 26}, 29-37 holds {29, 35, 37}, 38-48 holds {42, 45, 48}.
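A sketch of both schemes on the slide's nine values (bin boundaries taken from the slide):

# Equal-interval vs. equal-frequency binning of the slide's values.
values = sorted([25, 26, 26, 29, 35, 37, 42, 45, 48])

# Equal-interval: three bins of width 10 over 21-50 (21-30, 31-40, 41-50);
# bin widths are equal, counts per bin are not.
lo, width, k = 21, 10, 3
equal_interval = {i: [] for i in range(k)}
for v in values:
    equal_interval[min((v - lo) // width, k - 1)].append(v)

# Equal-frequency: three values per bin; counts are equal, widths are not.
per_bin = len(values) // k
equal_frequency = [values[i:i + per_bin] for i in range(0, len(values), per_bin)]

print(equal_interval)   # {0: [25, 26, 26, 29], 1: [35, 37], 2: [42, 45, 48]}
print(equal_frequency)  # [[25, 26, 26], [29, 35, 37], [42, 45, 48]]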

27. Exercise 1 Take the beer.csv file and find the association rules. First process the data into the right format.

  28. Exercise 2 – by hand Given the above list of transactions, do the following: 1) Find all the frequent itemsets (minimum support 40%) 2) Find all the association rules (minimum confidence 70%) 3) For the discovered association rules, calculate the lift
