CS4445/B12 Homework 4: Solutions Provided by: Kenneth J. Loomis
Homework 4 Solutions CLASSIFICATION RULES: RIPPER ALGORITHM
RIPPER: First Rule • The first thing that needs to be determined is the consequence of the rule: recall that a rule is made up of an antecedent and a consequence (antecedent → consequence). • The table below contains the frequency counts of the possible consequences of the rules from the userprofile dataset using budget as the classification attribute: • We can see that budget=high has the lowest frequency count in our training dataset, so we choose budget=high as the consequence of the first rule (RIPPER builds rules for the least frequent class first); a small code sketch of this step follows. • Note: I have included missing values here since one could classify the target as missing. Alternatively, these instances could be removed.
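Below is a minimal sketch of this first step, assuming the instances are read into a list of dicts; the function name and the class_attr parameter are illustrative, not part of the original solution.

```python
from collections import Counter

def least_frequent_class(instances, class_attr="budget"):
    """Pick the least frequent class value; RIPPER grows rules for that class first."""
    counts = Counter(inst[class_attr] for inst in instances)
    return min(counts, key=counts.get)  # e.g. "high" for this training set
```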
RIPPER: First Rule • Next we attempt to find the first condition in the antecedent. We need only look at possible conditions that exist in the 5 instances that have budget=high. • The list of possible conditions is in the table below.
RIPPER: First Rule • Next we determine the information gain for each of the candidate rules in the table. • Below is a detailed example of the calculation for the rule smoker = true → budget = high. • Given: p0 is the number of instances such that budget = high, n0 is the number of instances such that budget ≠ high, p1 is the number of instances such that smoker = true and budget = high, and n1 is the number of instances such that smoker = true but budget ≠ high, the FOIL information gain is: Gain = p1 × ( log2( p1 / (p1 + n1) ) - log2( p0 / (p0 + n0) ) ). A short code sketch of this formula follows.
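A small sketch of this gain computation, with the symbols as defined above; this is illustrative, not the original solution code.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL gain: p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    if p1 == 0:
        return 0.0  # a condition that covers no positive examples gives no gain
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# For smoker = true -> budget = high, plug in the four counts read off the table,
# with p0 = 5 (the number of instances that have budget = high).
```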
RIPPER: First Rule • Here we see a list of the information gain for each of the possible first conditions in the antecedent.
RIPPER: First Rule • Since the following rule results in the highest information gain, we select it as the first condition of our rule: • transport = car owner → budget = high • Now we can use the number of instances covered by this rule as the new p0 and n0, and we calculate all the possible second conditions as in the next set of calculations.
RIPPER: First Rule • Next we attempt to find the second condition in the antecedent. We need only look at possible conditions that exist in the 4 instances that have transport = car owner and budget=high. • The list of possible conditions is in the table below.
RIPPER: First Rule • Here we see a list of the information gain for each of the possible second conditions in the antecedent.
RIPPER: First Rule • Since the following rule results in the highest information gain, we select it as the second condition of our rule: • transport = car owner and drink_level = abstemious → budget = high • Now we can use the number of instances covered by this rule as the new p0 and n0, and we calculate all the possible third conditions as in the next set of calculations.
RIPPER: First Rule • Next we attempt to find the third condition in the antecedent. We need only look at possible conditions that exist in the 3 instances that have transport = car owner and drink_level = abstemious and budget=high. • The list of possible conditions is in the table below.
RIPPER: First Rule • Here we see a list of the information gain for each of the possible third conditions in the antecedent
RIPPER: First Rule • Since the following rule results in the highest information gain, we select it as the third condition of our rule: • transport = car owner and drink_level = abstemious and ambience = friends → budget = high • Note that this rule covers only positive examples (i.e., budget=high data instances). Since it covers no negative examples, there is no need to add more conditions to the rule. RIPPER's construction of the first rule is now complete.
RIPPER: Pruning the First Rule • First rule: transport = car owner and drink_level = abstemious and ambience = friends → budget = high • In order to decide if/how to prune this rule, RIPPER will: • use a validation set (that is, a piece of the training set that was kept apart and not used to construct the rule) • use a metric for pruning: v = (p - n)/(p + n) where • p: # of positive examples covered by the rule in the validation set • n: # of negative examples covered by the rule in the validation set • pruning method: deletes any final sequence of conditions that maximizes v. That is, it calculates v for each of the following pruned versions of the rule and keeps the version of the rule with maximum v (a short code sketch of this step follows the list): • transport = car owner & drink_level = abstemious & ambience = friends → budget = high • transport = car owner & drink_level = abstemious → budget = high • transport = car owner → budget = high • budget = high
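A minimal sketch of this pruning loop, assuming a rule is a list of (attribute, value) conditions and a covers helper that tests whether an instance satisfies all of them; these names are illustrative, not from the original solution.

```python
def covers(conditions, instance):
    """True if the instance satisfies every (attribute, value) condition."""
    return all(instance.get(attr) == val for attr, val in conditions)

def prune_rule(rule, validation_set, target=("budget", "high")):
    """Keep the version of the rule (full rule, or with any final sequence of
    conditions deleted) that maximizes v = (p - n) / (p + n) on the validation set."""
    attr, val = target
    best_rule, best_v = rule, float("-inf")
    for k in range(len(rule), -1, -1):      # k = number of leading conditions kept
        candidate = rule[:k]
        p = sum(1 for x in validation_set if covers(candidate, x) and x[attr] == val)
        n = sum(1 for x in validation_set if covers(candidate, x) and x[attr] != val)
        if p + n > 0 and (p - n) / (p + n) > best_v:
            best_rule, best_v = candidate, (p - n) / (p + n)
    return best_rule
```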
Homework 4 Solutions ASSOCIATION RULES: APRIORI ALGORITHM
Apriori: Level 1 • We begin the Apriori algorithm by determining the item order: • Here I will use the order in which the attributes appear in the dataset, with the values for each attribute in alphabetical order. • Then all the possible single-item itemsets are generated and the support calculated for each one. • The following slide shows the complete list of possible items. • Support is calculated in the following manner: support(X) = (number of instances containing itemset X) / (total number of instances). • Since we know the minimum acceptable support count is 55, we need only look at the numerator of this ratio to determine whether or not to keep an item (a short code sketch follows).
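A small sketch of the support computation, assuming each instance is a dict of attribute/value pairs and an itemset is a collection of (attribute, value) pairs; the names are illustrative, not the original solution code.

```python
def support_count(itemset, instances):
    """Number of instances that contain every (attribute, value) pair in the itemset."""
    return sum(1 for inst in instances
               if all(inst.get(attr) == val for attr, val in itemset))

def support(itemset, instances):
    """Support as a fraction of the whole dataset."""
    return support_count(itemset, instances) / len(instances)

# An itemset is kept only if support_count(itemset, dataset) >= 55,
# the minimum support count used in this homework.
```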
Apriori: Level 1 • We keep the ones in bold as they meet the minimum support threshold.
Apriori: Level 1 • We keep the following item sets as they contain enough support, and use these item sets to generate candidate item sets for the next level.
Apriori: Level 2 • We merge pairs from the level 1 set. Since there are no prefixes to match at this level, we must consider all combinations. (Continued on next slide)
Apriori: Level 2 • We keep the following item sets as they contain enough support, and use these item sets to generate candidate item sets for the next level.
Apriori: Level 3 • We generate the next level of candidate sets, but before we calculate the support we can use the Apriori principle to determine if they are viable candidates.
Apriori: Level 3 • First we determine the candidates by “joining” itemsets with like prefixes (i.e., the first k-1 items in the itemsets are the same). • Here we need only match the first item in each itemset (a code sketch of this join follows).
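A sketch of this prefix join, assuming each frequent itemset is kept as a tuple sorted in the fixed item order described earlier; the function is illustrative, not the original solution code.

```python
def join_candidates(frequent_itemsets):
    """Merge two frequent k-itemsets whenever their first k-1 items agree."""
    items = sorted(frequent_itemsets)
    candidates = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if a[:-1] == b[:-1]:              # same (k-1)-item prefix
                candidates.append(a + (b[-1],))
    return candidates
```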
Apriori: Level 3 • That results in this set of potential candidate itemsets.
Apriori: Level 3 • We have one final step before calculating the support: we can eliminate unnecessary candidates. We must check that all subsets of size 2 in each of these itemsets also exist in the level 2 set. We can make this a little easier by ignoring the prefix subsets, as we know those exist because we used them to create the itemsets. • The following itemsets can be removed as the bolded subsets do not appear in the Level 2 itemsets. This leaves us with the candidate itemsets on the next slide (a code sketch of the subset check follows).
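A sketch of this Apriori subset check, assuming the previous level's frequent itemsets are stored as a set of sorted tuples; illustrative only.

```python
from itertools import combinations

def prune_candidates(candidates, frequent_prev):
    """Keep a size-k candidate only if every (k-1)-subset is a frequent itemset
    from the previous level."""
    return [cand for cand in candidates
            if all(subset in frequent_prev
                   for subset in combinations(cand, len(cand) - 1))]
```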
Apriori: Level 3 • Finally we can calculate the support for these candidate itemsets.
Apriori: Level 3 • We keep the following item sets as they contain enough support, and use these item sets to generate candidate item sets for the next level.
Apriori: Level 4 • We generate the next level of candidate sets, but before we calculate the support we can use the Apriori principle to determine if they are viable candidates.
Apriori: Level 4 • First we determine the candidates by “joining” itemsets with like prefixes (i.e., the first k-1 items in the itemsets match). • Here we need only match the first two items in the itemset. • Finally we can calculate the support for these candidate itemsets.
Apriori: Level 4 • That results in this set of candidate itemsets. • We have one final step before calculating the support: we can eliminate unnecessary candidates. We must check that all subsets of size 3 in each of these itemsets also exist in the level 3 set. We can make this a little easier by ignoring the prefix subsets, as we know those exist because we used them to create the itemsets. • Here we again eliminate candidates from consideration; the offending subsets are bolded.
Apriori: Level 4 • In the end we keep only a single itemset that has enough support at this level. • The following slide depicts the complete itemset.
Rule Construction Largest itemset: Let’s call this itemset I4: I4: smoker=false, marital_status=single, religion=catholic, activity=student Rules constructed from I4 with 2 items in the antecedent (a short confidence sketch follows the list): • R1: smoker=false, marital_status=single → religion=catholic, activity=student conf(R1) = supp(I4)/supp(smoker=false, marital_status=single) = 63/98 = 64.28% • R2: smoker=false, religion=catholic → marital_status=single, activity=student conf(R2) = supp(I4)/supp(smoker=false, religion=catholic) = 63/79 = 79.74% • R3: smoker=false, activity=student → marital_status=single, religion=catholic conf(R3) = supp(I4)/supp(smoker=false, activity=student) = 63/90 = 70% • R4: marital_status=single, religion=catholic → smoker=false, activity=student conf(R4) = supp(I4)/supp(marital_status=single, religion=catholic) = 63/91 = 69.23% • R5: marital_status=single, activity=student → smoker=false, religion=catholic conf(R5) = supp(I4)/supp(marital_status=single, activity=student) = 63/107 = 58.87% • R6: religion=catholic, activity=student → smoker=false, marital_status=single conf(R6) = supp(I4)/supp(religion=catholic, activity=student) = 63/84 = 75%
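A small sketch of the confidence computation used for R1-R6, working directly with support counts; the function name is illustrative.

```python
def confidence(itemset_support_count, antecedent_support_count):
    """conf(A -> B) = supp(A and B) / supp(A), computed here from raw counts."""
    return itemset_support_count / antecedent_support_count

# Example (R1): confidence(63, 98) = 0.6428..., i.e. the 64.28% shown above.
```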