Association Rules: Finding Frequent Connections in Data

Chapter Five • Data Mining for the Masses by Matthew North

Learning Objectives

Explain the concept of association rules, how they are found, and their benefits • Discuss the Apriori algorithm, support percent, and confidence percent • Interpret the rules generated by an association rule model and explain their significance, if any

Association Rules

Association rules are a data mining methodology that seeks to find frequent connections between attributes in a data set • An example of this could be a shopping basket analysis where marketers and vendors try to find which products are most frequently purchased together

Although association rule operators require binominal data types, it’s helpful to evaluate the average (avg) and standard deviation for each attribute • Figure: an example of these calculations in RapidMiner

Standard deviations are measurements of how dispersed or varied the values in an attribute are • A good rule of thumb: any value that is more than two standard deviations below the mean or more than two standard deviations above the mean is a statistical outlier • For example, if Average = 36.731 and Standard Deviation = 10.647, the acceptable range of values is 15.437 to 58.025 • This is not a hard-and-fast rule
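The two-standard-deviation rule of thumb can be checked with a few lines of Python. This is a minimal sketch using the slide's figures; the function names `two_sigma_range` and `is_outlier` are hypothetical and are not part of RapidMiner.

```python
def two_sigma_range(mean, std_dev):
    """Return the (low, high) bounds two standard deviations from the mean."""
    return mean - 2 * std_dev, mean + 2 * std_dev


def is_outlier(value, mean, std_dev):
    """Flag a value that falls outside the two-standard-deviation range."""
    low, high = two_sigma_range(mean, std_dev)
    return value < low or value > high


# Figures taken from the slide: Average = 36.731, Standard Deviation = 10.647
low, high = two_sigma_range(36.731, 10.647)
print(f"{low:.3f} to {high:.3f}")  # 15.437 to 58.025
```

Since the rule is only a guideline, a flagged value is a candidate for review rather than something to delete automatically.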

RapidMiner uses binominal instead of binomial • Binomial means one of two numbers (usually 0 and 1), meaning the basic underlying data type is still numeric • Binominal means one of two values which may be numeric or character based
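To illustrate the distinction, the sketch below turns numeric 0/1 flags into two-valued character labels. The attribute names and the "true"/"false" labels are assumptions chosen for illustration; they are not taken from RapidMiner's own transformation operator.

```python
def to_binominal(value):
    """Map a raw flag to one of exactly two labels."""
    return "true" if value else "false"


# Hypothetical row: 1 means the item was in the basket, 0 means it was not
row = {"cookies": 1, "milk": 0}

# After conversion each attribute holds one of two character values
converted = {name: to_binominal(flag) for name, flag in row.items()}
print(converted)  # {'cookies': 'true', 'milk': 'false'}
```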

An example of a data type transformation of all attributes in RapidMiner

Frequency Pattern analysis is handy for many kinds of data mining and is a necessary component of association rule mining • We use this to determine whether any of the patterns in the data occur often enough to be considered rules • Results of an FP-Growth operator in RapidMiner
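The idea of checking whether patterns occur often enough can be sketched by counting item combinations directly. This is a brute-force count, not the FP-Growth algorithm itself, and the ten baskets and the 30% cutoff are hypothetical values made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support, max_size=2):
    """Count every item combination up to max_size and keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        items = sorted(basket)  # sort so each combination has one canonical order
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    total = len(baskets)
    return {combo: n / total for combo, n in counts.items() if n / total >= min_support}


# Hypothetical data: 10 baskets, cookies in 4, milk in 7, both in 3
baskets = (
    [{"cookies", "milk"}] * 3
    + [{"cookies"}]
    + [{"milk"}] * 4
    + [{"bread"}] * 2
)

itemsets = frequent_itemsets(baskets, min_support=0.3)
for combo, share in sorted(itemsets.items()):
    print(combo, share)
```

Enumerating every combination becomes expensive as the number of attributes grows, which is the cost that algorithms such as FP-Growth and Apriori are designed to reduce.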

Two main factors that dictate whether or not frequency patterns get translated into association rules: • Confidence percent - how confident we are that when one attribute is flagged as true, the associated attribute will also be flagged as true • Support percent - the number of times that the rule did occur, divided by the number of observations in the data set

Confidence Percent • Out of 10 shopping baskets, we find that cookies were purchased in 4 of them and milk was purchased in 7 • In 3 of the 4 instances where cookies were purchased, milk was also in those baskets • Therefore, we have 75% confidence in the association rule cookies → milk • The rule cookies → milk had a chance to occur 4 times, but occurred only 3 times

Confidence Percent (cont.) • However, the confidence for milk → cookies is not the same • Milk and cookies are purchased together in only 3 of the 7 baskets in which milk was purchased • Therefore, the confidence in this rule is only 43% (3/7 = .429)

Confidence Percent (cont.) • Premise (or antecedent) → Conclusion (or consequent) • When evaluating associations between three or more attributes, the confidence percentages are calculated based on the two attributes being found with the third
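The confidence arithmetic from the cookies and milk example can be reproduced with a short sketch. The basket contents are hypothetical, arranged only so that the counts match the slides (cookies in 4 baskets, milk in 7, both in 3), and the function name `confidence` is an assumption. The premise and conclusion are passed as sets, so a two-attribute premise works the same way.

```python
def confidence(baskets, premise, conclusion):
    """Share of baskets containing the premise that also contain the conclusion."""
    premise, conclusion = set(premise), set(conclusion)
    premise_count = sum(1 for b in baskets if premise <= b)
    if premise_count == 0:
        return 0.0  # the rule never had a chance to occur
    both_count = sum(1 for b in baskets if (premise | conclusion) <= b)
    return both_count / premise_count


# Hypothetical data: 10 baskets, cookies in 4, milk in 7, both in 3
baskets = (
    [{"cookies", "milk"}] * 3
    + [{"cookies"}]
    + [{"milk"}] * 4
    + [{"bread"}] * 2
)

cookies_to_milk = confidence(baskets, {"cookies"}, {"milk"})
milk_to_cookies = confidence(baskets, {"milk"}, {"cookies"})
print(f"cookies -> milk: {cookies_to_milk:.0%}")  # 75%
print(f"milk -> cookies: {milk_to_cookies:.0%}")  # 43%
```

The two directions differ because the denominator changes: 4 baskets contain cookies, while 7 contain milk.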

Support Percent • If cookies and milk were found together in 3 out of the 10 shopping baskets, the support percentage is calculated as 30% (3/10 = .3) • There is no reciprocal for support percentages since this metric is simply the number of times the association occurred over the number of times it could have occurred in the data set
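A similar sketch shows why support has no reciprocal: the calculation depends only on which items appear together, not on their order. The baskets are hypothetical and the function name `support` is an assumption.

```python
def support(baskets, items):
    """Share of all baskets that contain every item in the given set."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)


# Hypothetical data: 10 baskets, cookies and milk together in 3 of them
baskets = (
    [{"cookies", "milk"}] * 3
    + [{"cookies"}]
    + [{"milk"}] * 4
    + [{"bread"}] * 2
)

cookies_milk = support(baskets, ["cookies", "milk"])
milk_cookies = support(baskets, ["milk", "cookies"])
print(f"{cookies_milk:.0%}")  # 30%
```

Both calls count the same 3 baskets out of 10, so the result is identical whichever item is listed first.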

An example of association rules found with a 50% confidence threshold in RapidMiner • An example of association rules found with a 50% confidence threshold in R
