Learn about association rules in data mining, including the concept, benefits, and how to find them using the Apriori algorithm. Understand support and confidence percentages and interpret the significance of generated rules.
Chapter Five • Data Mining for the Masses by Matthew North
Explain the concept of association rules, how they are found, and their benefits • Discuss the Apriori algorithm, support percent, and confidence percent • Interpret the rules generated by an association rule model and explain their significance, if any
Association rules are a data mining methodology that seeks to find frequent connections between attributes in a data set • An example of this could be a shopping basket analysis where marketers and vendors try to find which products are most frequently purchased together
Although association rule operators require binominal data types, it’s helpful to evaluate the average (avg) and standard deviation for each attribute. An example of the calculations in RapidMiner is shown above
Standard deviations are measurements of how dispersed or varied the values in an attribute are • A good rule of thumb: any value that is more than two standard deviations below the mean or more than two standard deviations above the mean is a statistical outlier • For example, if Average = 36.731 and Standard Deviation = 10.647, the acceptable range of values is 15.437 to 58.025 • This is not a hard-and-fast rule
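The rule of thumb above is simple arithmetic, and a minimal Python sketch can reproduce the range from the slide. The function names are illustrative assumptions, not part of RapidMiner.

```python
def two_sigma_range(mean, std_dev):
    """Return the (low, high) bounds lying two standard deviations from the mean."""
    return mean - 2 * std_dev, mean + 2 * std_dev


def is_outlier(value, mean, std_dev):
    """Flag a value that falls outside the two-standard-deviation range."""
    low, high = two_sigma_range(mean, std_dev)
    return value < low or value > high


# Average and standard deviation taken from the example above.
low, high = two_sigma_range(36.731, 10.647)
print(f"{low:.3f} to {high:.3f}")  # 15.437 to 58.025
```

Because this is only a rule of thumb, a flagged value is a candidate for review rather than something to delete automatically.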
RapidMiner uses binominal instead of binomial • Binomial means one of two numbers (usually 0 and 1), meaning the basic underlying data type is still numeric • Binominal means one of two values which may be numeric or character based
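To illustrate the distinction, here is a small Python sketch that turns a two-valued attribute, numeric or character based, into true/false flags. It is not RapidMiner's own conversion logic; the function name and the `true_value` parameter are assumptions made for this example.

```python
def to_binominal(column, true_value=1):
    """Map a two-valued column to booleans.

    Assumption: the caller says which of the two values means "true".
    """
    distinct = set(column)
    if len(distinct) > 2:
        raise ValueError(f"expected at most two distinct values, got {len(distinct)}")
    return [value == true_value for value in column]


# A numeric (binomial-style) attribute coded as 0 and 1.
print(to_binominal([0, 1, 1, 0]))  # [False, True, True, False]

# A character-based attribute with two values.
print(to_binominal(["yes", "no", "yes"], true_value="yes"))  # [True, False, True]
```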
An example of a data type transformation of all attributes in RapidMiner
Frequency Pattern analysis is handy for many kinds of data mining and is a necessary component of association rule mining • We use this to determine whether any of the patterns in the data occur often enough to be considered rules • Results of an FP-Growth operator in RapidMiner
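The idea of checking whether patterns occur often enough can be sketched in Python with a brute-force count of item combinations. This is not the FP-Growth algorithm, which avoids enumerating every combination; the function name, the `max_size` limit, and the basket data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: fraction of baskets} for itemsets meeting min_support."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        # Count every combination of 1..max_size items found in this basket.
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    total = len(baskets)
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical baskets: 10 in total, "bread" is filler for the example.
baskets = (
    [{"cookies", "milk"}] * 3
    + [{"cookies"}]
    + [{"milk"}] * 4
    + [{"bread"}] * 2
)

patterns = frequent_itemsets(baskets, min_support=0.3)
for itemset, share in sorted(patterns.items()):
    print(itemset, f"{share:.0%}")
```

With a 30% minimum, "bread" (2 of 10 baskets) drops out while the cookies and milk patterns remain.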
Two main factors that dictate whether or not frequency patterns get translated into association rules: • Confidence percent - how confident we are that when one attribute is flagged as true, the associated attribute will also be flagged as true • Support percent - the number of times that the rule did occur, divided by the number of observations in the data set
Confidence Percent • Out of 10 shopping baskets, we find that cookies were purchased in 4 of them and milk was purchased in 7 • In 3 of the 4 baskets where cookies were purchased, milk was also purchased • Therefore, we have 75% confidence in the association rule cookies → milk • The rule cookies → milk had a chance to occur 4 times, but occurred only 3 times
Confidence Percent (cont.) • However, the confidence for milk → cookies is not the same • Milk and cookies were purchased together in only 3 of the 7 baskets where milk was purchased • Therefore, the confidence in this rule is only 43% (3/7 = .429)
Confidence Percent (cont.) • Premise (or antecedent) → Conclusion (or consequent) • When evaluating associations between three or more attributes, the confidence percentages are calculated based on how often two of the attributes are found together with the third
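The confidence calculation can be checked with a short Python sketch. The basket data is made up to match the counts in the example (cookies in 4 baskets, milk in 7, both in 3, out of 10), and the function name is an assumption for illustration.

```python
def confidence(baskets, premise, conclusion):
    """Share of baskets containing the premise that also contain the conclusion."""
    premise, conclusion = set(premise), set(conclusion)
    with_premise = [basket for basket in baskets if premise <= set(basket)]
    if not with_premise:
        return None  # the rule never had a chance to occur
    with_both = sum(1 for basket in with_premise if conclusion <= set(basket))
    return with_both / len(with_premise)


# Hypothetical baskets; "bread" is filler so the total reaches 10.
baskets = (
    [{"cookies", "milk"}] * 3
    + [{"cookies"}]
    + [{"milk"}] * 4
    + [{"bread"}] * 2
)

print(f"{confidence(baskets, {'cookies'}, {'milk'}):.0%}")  # 75%
print(f"{confidence(baskets, {'milk'}, {'cookies'}):.0%}")  # 43%
```

The premise is passed as a set, so a rule with two attributes in the premise and a third in the conclusion can be evaluated with the same function.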
Support Percent • If cookies and milk were found together in 3 out of the 10 shopping baskets, the support percentage is calculated as 30% (3/10 = .3) • There is no reciprocal for support percentages since this metric is simply the number of times the association occurred over the number of times it could have occurred in the data set
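A minimal Python sketch of the support calculation follows, again using made-up baskets that match the counts in the example. The function name is an assumption for illustration.

```python
def support(baskets, items):
    """Share of all baskets that contain every item in the association."""
    items = set(items)
    matches = sum(1 for basket in baskets if items <= set(basket))
    return matches / len(baskets)


# Hypothetical baskets: cookies and milk appear together in 3 of 10.
baskets = (
    [{"cookies", "milk"}] * 3
    + [{"cookies"}]
    + [{"milk"}] * 4
    + [{"bread"}] * 2
)

print(f"{support(baskets, ['cookies', 'milk']):.0%}")  # 30%
print(f"{support(baskets, ['milk', 'cookies']):.0%}")  # 30%
```

Both calls give the same result because the order of the items does not matter, which is why support has no reciprocal.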
An example of association rules found with a 50% confidence threshold in RapidMiner • An example of association rules found with a 50% confidence threshold in R