
Association Rules: Finding Frequent Connections in Data

Learn about association rules in data mining, including the concept, benefits, and how to find them using the Apriori algorithm. Understand support and confidence percentages and interpret the significance of generated rules.

lidiae


Presentation Transcript


  1. Chapter Five • Data Mining for the Masses by Matthew North

  2. Learning Objectives

  3. Explain the concept of association rules, how they are found, and their benefits • Discuss the Apriori algorithm, support percent, and confidence percent • Interpret the rules generated by an association rule model and explain their significance, if any

  4. Association Rules

  5. Association rules are a data mining methodology that seeks to find frequent connections between attributes in a data set • An example of this could be a shopping basket analysis where marketers and vendors try to find which products are most frequently purchased together

  6. Although association rule operators require binominal data types, it’s helpful to evaluate the average (avg) and standard deviation for each attribute • [Figure: an example of these calculations in RapidMiner]

  7. Standard deviations are measurements of how dispersed or varied the values in an attribute are • A good rule of thumb: any value that is more than two standard deviations below the mean or more than two standard deviations above the mean is a statistical outlier • For example, if Average = 36.731 and Standard Deviation = 10.647, the acceptable range of values is 15.437 to 58.025 (36.731 ± 2 × 10.647) • Not a hard-and-fast rule
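
The two-standard-deviation rule of thumb from this slide can be checked with a few lines of Python. This is only a sketch: the function names `outlier_bounds` and `is_outlier` are made up for illustration, and the multiplier of 2 is the slide's rule of thumb, not a fixed standard.

```python
def outlier_bounds(mean, std_dev, k=2.0):
    """Return the (lower, upper) range lying within k standard deviations of the mean."""
    return mean - k * std_dev, mean + k * std_dev


def is_outlier(value, mean, std_dev, k=2.0):
    """Flag a value that falls outside the range returned by outlier_bounds."""
    lower, upper = outlier_bounds(mean, std_dev, k)
    return value < lower or value > upper


# Figures taken from the slide: Average = 36.731, Standard Deviation = 10.647
lower, upper = outlier_bounds(36.731, 10.647)
print(f"{lower:.3f} to {upper:.3f}")      # 15.437 to 58.025
print(is_outlier(60, 36.731, 10.647))     # True: above the upper bound
print(is_outlier(40, 36.731, 10.647))     # False: inside the range
```

Because this is a rule of thumb, a flagged value is a candidate for review, not automatically an error.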

  8. RapidMiner uses binominal instead of binomial • Binomial means one of two numbers (usually 0 and 1), meaning the basic underlying data type is still numeric • Binominal means one of two values which may be numeric or character based
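
To make the binomial/binominal distinction concrete, here is a small Python sketch that turns any two-valued column, numeric or character based, into true/false flags. It is not RapidMiner's conversion logic; `to_binominal` is a hypothetical helper, and the choice of which value counts as "true" (the later one in sorted order) is an assumption made for the example.

```python
def to_binominal(values):
    """Map a column with exactly two distinct values to True/False flags.

    Assumption for this sketch: the value that sorts last (compared as text)
    is treated as the "true" value.
    """
    distinct = sorted(set(values), key=str)
    if len(distinct) != 2:
        raise ValueError(f"expected exactly two distinct values, got {distinct}")
    true_value = distinct[1]
    return [v == true_value for v in values]


print(to_binominal([0, 1, 1, 0]))         # numeric input: [False, True, True, False]
print(to_binominal(["no", "yes", "no"]))  # character input: [False, True, False]
```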

  9. An example of a data type transformation of all attributes in RapidMiner

  10. Frequency Pattern analysis is handy for many kinds of data mining and is a necessary component of association rule mining • We use this to determine whether any of the patterns in the data occur often enough to be considered rules • Results of an FP-Growth operator in RapidMiner
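
The idea of checking whether patterns "occur often enough" can be sketched in Python by counting every item and item pair and keeping those that meet a minimum frequency. This is a brute-force count, not the FP-Growth algorithm that the RapidMiner operator implements; the baskets, the 50% threshold, and the function name `frequent_itemsets` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support, max_size=2):
    """Count every itemset up to max_size items and keep the frequent ones.

    Returns a dict mapping each kept itemset (a sorted tuple) to the
    fraction of baskets that contain it.
    """
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    n = len(baskets)
    return {itemset: count / n
            for itemset, count in counts.items()
            if count / n >= min_support}


# Hypothetical shopping baskets, invented for this example
baskets = [
    {"cookies", "milk"},
    {"milk", "bread"},
    {"cookies", "milk", "bread"},
    {"bread"},
]

for itemset, share in sorted(frequent_itemsets(baskets, min_support=0.5).items()):
    print(itemset, share)
```

Enumerating every combination becomes expensive as the number of attributes grows, which is the problem that algorithms such as Apriori and FP-Growth are designed to avoid.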

  11. Two main factors that dictate whether or not frequency patterns get translated into association rules: • Confidence percent - how confident we are that when one attribute is flagged as true, the associated attribute will also be flagged as true • Support percent - the number of times that the rule did occur, divided by the number of observations in the data set
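
The two definitions on this slide can be written compactly as formulas. The notation is an assumption introduced here, not taken from the slides: $A$ is the premise, $B$ the conclusion, $N$ the number of observations, and $\operatorname{count}(\cdot)$ the number of observations in which the listed attributes are all flagged true.

```latex
\[
\operatorname{confidence}(A \rightarrow B) = \frac{\operatorname{count}(A \text{ and } B)}{\operatorname{count}(A)},
\qquad
\operatorname{support}(A \rightarrow B) = \frac{\operatorname{count}(A \text{ and } B)}{N}
\]
```

Confidence divides by the number of observations containing the premise, so it depends on the direction of the rule; support divides by the size of the whole data set, so it does not.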

  12. Confidence Percent • Out of 10 shopping baskets, we find that cookies were purchased in 4 of them and milk was purchased in 7 • In 3 of the 4 baskets where cookies were purchased, milk was also purchased • Therefore, we have 75% confidence in the association rule cookies → milk • The rule cookies → milk had a chance to occur 4 times, but occurred only 3 times

  13. Confidence Percent (cont.) • However, the confidence for milk → cookies is not the same • Milk and cookies are purchased together in only 3 of the 7 baskets where milk was purchased • Therefore, the confidence in this rule is only 43% (3/7 = .429)
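
The asymmetry between the two directions of the rule can be reproduced in Python. The slides give only the counts (10 baskets, cookies in 4, milk in 7, both in 3), so the individual baskets below are a hypothetical reconstruction that matches those counts; `confidence` is an illustrative function name.

```python
def confidence(baskets, premise, conclusion):
    """Share of the baskets containing `premise` that also contain `conclusion`."""
    with_premise = [b for b in baskets if premise in b]
    if not with_premise:
        raise ValueError(f"no basket contains {premise!r}")
    both = sum(1 for b in with_premise if conclusion in b)
    return both / len(with_premise)


# Hypothetical baskets built to match the slide's counts
baskets = (
    [{"cookies", "milk"}] * 3   # cookies and milk together
    + [{"cookies"}] * 1         # cookies without milk
    + [{"milk"}] * 4            # milk without cookies
    + [{"bread"}] * 2           # neither (the item "bread" is invented)
)

print(f"cookies -> milk: {confidence(baskets, 'cookies', 'milk'):.0%}")  # 75%
print(f"milk -> cookies: {confidence(baskets, 'milk', 'cookies'):.0%}")  # 43%
```

The numerator (3 baskets with both items) is the same in each direction; only the denominator changes, from the 4 cookie baskets to the 7 milk baskets.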

  14. Confidence Percent (cont.) • Premise (or antecedent) → Conclusion (or consequent) • When evaluating associations between three or more attributes, the confidence percentages are calculated based on the two attributes being found with the third
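
For rules involving three attributes, the same calculation applies with a two-item premise. The Python sketch below treats the premise and conclusion as sets; the baskets and the name `rule_confidence` are hypothetical and chosen only to illustrate the calculation.

```python
def rule_confidence(baskets, premise, conclusion):
    """Confidence of premise -> conclusion, where both are sets of items.

    Returns None when no basket contains the whole premise.
    """
    premise, conclusion = set(premise), set(conclusion)
    with_premise = [b for b in baskets if premise <= b]  # premise is a subset of b
    if not with_premise:
        return None
    hits = sum(1 for b in with_premise if conclusion <= b)
    return hits / len(with_premise)


# Hypothetical baskets, invented for this example
baskets = [
    {"cookies", "milk", "bread"},
    {"cookies", "milk", "bread"},
    {"cookies", "milk"},
    {"milk", "bread"},
    {"cookies"},
]

# Cookies and milk appear together in 3 baskets; bread is in 2 of those
print(rule_confidence(baskets, {"cookies", "milk"}, {"bread"}))
```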

  15. Support Percent • If cookies and milk were found together in 3 out of the 10 shopping baskets, the support percentage is 30% (3/10 = .3) • There is no reciprocal for support percentages since this metric is simply the number of times the association occurred over the number of times it could have occurred in the data set
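
A short Python sketch shows why support has no reciprocal: it depends only on which items appear together, not on the direction of the rule. The baskets are a hypothetical reconstruction matching the slide's counts (10 baskets, cookies and milk together in 3), and `support` is an illustrative function name.

```python
def support(baskets, items):
    """Share of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)


# Hypothetical baskets built to match the slide's counts
baskets = (
    [{"cookies", "milk"}] * 3   # cookies and milk together
    + [{"cookies"}] * 1         # cookies without milk
    + [{"milk"}] * 4            # milk without cookies
    + [{"bread"}] * 2           # neither (the item "bread" is invented)
)

print(support(baskets, ["cookies", "milk"]))  # 0.3
print(support(baskets, ["milk", "cookies"]))  # 0.3, the order makes no difference
```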

  16. [Figures: examples of association rules found with a 50% confidence threshold, one in RapidMiner and one in R]
