
Basic Data Mining Techniques



  1. Basic Data Mining Techniques Decision Trees

  2. Basic concepts • Decision trees are constructed using only those attributes best able to differentiate the concepts to be learned • A decision tree is built by initially selecting a subset of instances from a training set • This subset is then used to construct the tree • The remaining training set instances are used to test the accuracy of the constructed tree

  3. The Accuracy Score and the Goodness Score • The accuracy score is the ratio (usually expressed as a percentage) of correctly classified samples to the total number of samples in the training set • The goodness score is the ratio of the accuracy score to the total number of branches added to the tree by the attribute used to make the decision • The tree with the higher accuracy and goodness scores is the better one
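Both scores are easy to compute directly. The short Python sketch below is ours; the example numbers (11 of 15 instances classified correctly by an attribute that adds 4 branches) are taken from the single-node tree scored on slide 8 below.

```python
# Minimal sketch of the accuracy and goodness scores; function names are illustrative.

def accuracy_score(correct, total):
    """Fraction of training set instances the tree classifies correctly."""
    return correct / total

def goodness_score(accuracy, branches):
    """Accuracy divided by the number of branches the attribute adds to the tree."""
    return accuracy / branches

# Example numbers from the single-node tree scored on slide 8.
acc = accuracy_score(11, 15)     # ~0.73
good = goodness_score(acc, 4)    # ~0.18
print(f"accuracy ~ {acc:.2f}, goodness ~ {good:.2f}")
```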

  4. An Algorithm for Building Decision Trees
  1. Let T be the set of training instances.
  2. Choose an attribute that best differentiates the instances in T.
  3. Create a tree node whose value is the chosen attribute.
     - Create child links from this node, where each link represents a unique value of the chosen attribute.
     - Use the child link values to further subdivide the instances into subclasses.
  4. For each subclass created in step 3:
     - If the instances in the subclass satisfy predefined criteria (e.g., a minimum training set classification accuracy), or if the set of remaining attribute choices for this path is empty, verify the classification for the remaining training set instances following this decision path and STOP.
     - If the subclass does not satisfy the criteria and there is at least one attribute with which to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
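A rough Python sketch of these four steps follows. The use of split accuracy as the attribute-selection measure and the data layout (a list of dictionaries with a "class" key) are our own assumptions for illustration; the algorithm itself does not fix these details.

```python
# Illustrative sketch of the four-step tree-building algorithm above.
from collections import Counter

def majority_class(instances):
    """Most common class label among the given instances."""
    return Counter(i["class"] for i in instances).most_common(1)[0][0]

def split_accuracy(instances, attr):
    """Accuracy obtained by predicting the majority class inside each branch of attr."""
    correct = 0
    for value in {i[attr] for i in instances}:
        subset = [i for i in instances if i[attr] == value]
        majority = majority_class(subset)
        correct += sum(1 for i in subset if i["class"] == majority)
    return correct / len(instances)

def build_tree(instances, attributes, min_accuracy=0.9):
    # Step 4 stopping criteria: node is accurate enough, or no attributes remain.
    node_majority = majority_class(instances)
    node_accuracy = sum(1 for i in instances if i["class"] == node_majority) / len(instances)
    if not attributes or node_accuracy >= min_accuracy:
        return node_majority                                  # leaf node
    # Step 2: choose the attribute that best differentiates the instances in T.
    best = max(attributes, key=lambda a: split_accuracy(instances, a))
    remaining = [a for a in attributes if a != best]
    # Step 3: create one child link per unique value of the chosen attribute.
    children = {}
    for value in {i[best] for i in instances}:
        subset = [i for i in instances if i[best] == value]
        children[value] = build_tree(subset, remaining, min_accuracy)  # back to step 2
    return (best, children)
```

Production algorithms such as C4.5 select the splitting attribute with an entropy-based measure (information gain) rather than raw accuracy, but the control flow is the same.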

  5. Attribute Selection • The attribute choice made when building a decision tree determines the size of the constructed tree • A main goal is to minimize the number of tree levels and tree nodes and to maximize data generalization

  6. Example: The Credit Card Promotion Database • The life insurance promotion is designated as the output attribute • The input attributes are: income range, credit card insurance, sex, and age

  7. Partial Decision Trees for the Credit Card Promotion Database

  8. Accuracy score: 11/15 ≈ 0.73 (73%). Goodness score: 0.73/4 ≈ 0.18

  9. Accuracy score: 9/15 = 0.6 (60%). Goodness score: 0.6/2 = 0.3

  10. Accuracy score: 12/15 = 0.8 (80%). Goodness score: 0.8/2 = 0.4

  11. Multiple-Node Decision Trees for the Credit Card Promotion Database

  12. Accuracy score: 14/15 ≈ 0.93 (93%). Goodness score: 0.93/6 ≈ 0.16

  13. Accuracy score: 13/15 ≈ 0.87 (87%). Goodness score: 0.87/4 ≈ 0.22

  14. Decision Tree Rules

  15. A Rule for the Tree in Figure 3.4 IF Age <= 43 & Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No

  16. A Simplified Rule Obtained by Removing Attribute Age IF Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No
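Expressed in code, the simplified rule is a two-condition predicate. The sketch below is illustrative; the dictionary keys are our own naming, not part of the original database.

```python
# The simplified rule above as a predicate; the record layout is illustrative.
def rule_fires(record):
    """IF Sex = Male & Credit Card Insurance = No THEN Life Insurance Promotion = No."""
    return record["sex"] == "Male" and record["credit_card_insurance"] == "No"

print(rule_fires({"sex": "Male", "credit_card_insurance": "No"}))  # True -> predict No
```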

  17. Advantages of Decision Trees Easy to understand. Map nicely to a set of production rules. Have been successfully applied to real-world problems. Make no prior assumptions about the data. Able to process both numerical and categorical data.

  18. Disadvantages of Decision Trees Output attribute must be categorical. Limited to one output attribute. Decision tree algorithms are unstable: small changes in the training data can produce a very different tree. Trees created from numeric datasets can be complex.

  19. Generating Association Rules

  20. Confidence and Support • Traditional classification rules usually limit a consequent of a rule to a single attribute • Association rule generators allow the consequent of a rule to contain one or several attribute values

  21. Example • Suppose we want to find out whether there are any interesting relationships in customer purchasing trends among the following grocery store products: • Milk • Cheese • Bread • Eggs

  22. Possible associations: • If customers purchase milk they also purchase bread • If customers purchase bread they also purchase milk • If customers purchase milk and eggs they also purchase cheese and bread • If customers purchase milk, cheese, and eggs they also purchase bread

  23. Confidence • Analyzing the first rule, we come to the natural question: “How likely is a milk purchase to lead to a bread purchase?” • To answer this question, a rule has an associated confidence, which in our case is the conditional probability of a bread purchase given a milk purchase

  24. Rule Confidence Given a rule of the form “If A then B”, rule confidence is the conditional probability that B is true when A is known to be true.

  25. Rule Support The minimum percentage of instances in the database that contain all items listed in a given association rule.
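Both measures can be sketched directly over a list of market-basket transactions. The helper functions and the toy transactions below are our own; only the item names come from the grocery example above.

```python
# Illustrative confidence and support for rules of the form "If A then B";
# the transaction data is made up.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "cheese"},
    {"milk", "cheese", "bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item listed in the rule."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability that the consequent holds when the antecedent holds."""
    covers_a = [t for t in transactions if antecedent <= t]
    return sum(1 for t in covers_a if consequent <= t) / len(covers_a)

# "If customers purchase milk they also purchase bread"
print(confidence({"milk"}, {"bread"}))    # 0.75
print(support({"milk", "bread"}))         # 0.6
```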

  26. Mining Association Rules: An Example

  27. Apriori Algorithm • This algorithm generates item sets • Item sets are attribute-value combinations that meet a specified coverage requirement • Those attribute-value combinations that do not meet the coverage requirement are discarded

  28. Apriori Algorithm • The first step: item set generation • The second step: creation of a set of association rules using the generated item sets

  29. The “income range” and “age” attributes are eliminated

  30. Generation of the item sets • First, we generate “single-item” sets • Minimum attribute-value coverage requirement: four items (a combination must appear in at least four instances) • Single-item sets represent individual attribute-value combinations extracted from the original data set

  31. Single-Item Sets

  32. Two-Item Sets and Multiple-Item Sets • Two-item sets are created by combining single-item sets (usually with the same coverage restriction) • The next step is to use the attribute-value combinations from the two-item sets to create three-item sets, and so on • The process continues until an n is reached for which the n-item set covers only a single instance
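The level-wise generation just described can be sketched as follows. Items are treated as (attribute, value) pairs and the coverage threshold of four matches the slides; the function names and data layout are our own.

```python
# Illustrative level-wise item set generation with a coverage requirement.
from itertools import combinations

def coverage(itemset, instances):
    """Number of instances containing every attribute-value pair in the item set."""
    return sum(1 for inst in instances if all(inst.get(a) == v for a, v in itemset))

def generate_item_sets(instances, min_coverage=4):
    items = {(a, v) for inst in instances for a, v in inst.items()}
    # Single-item sets that meet the coverage requirement.
    level = [frozenset([item]) for item in items
             if coverage(frozenset([item]), instances) >= min_coverage]
    all_sets = list(level)
    # Combine surviving n-item sets into (n+1)-item sets, keeping only those
    # that still meet the coverage requirement; stop when nothing survives.
    while level:
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if coverage(c, instances) >= min_coverage]
        all_sets.extend(level)
    return all_sets
```

The full Apriori algorithm also prunes candidate sets whose subsets already failed the coverage test before counting them, which this sketch omits.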

  33. Two-Item Sets

  34. Three-Item Set • The only three-item set that satisfies the coverage criterion is: • (Watch Promotion = No) & (Life Insurance Promotion = No) & (Credit Card Insurance = No)

  35. Rule Creation • The first step is to specify a minimum rule confidence • Next, association rules are generated from the two- and three-item set tables • Any rule not meeting the minimum confidence value is discarded

  36. Two Possible Two-Item Set Rules IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes (5/7) (rule confidence: 5/7 × 100% ≈ 71%) IF Life Insurance Promotion = Yes THEN Magazine Promotion = Yes (5/5) (rule confidence: 5/5 × 100% = 100%)

  37. Three-Item Set Rules IF Watch Promotion = No & Life Insurance Promotion = No THEN Credit Card Insurance = No (4/4) (rule confidence: 4/4 × 100% = 100%) IF Watch Promotion = No THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6) (rule confidence: 4/6 × 100% ≈ 66.7%)
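Whether a candidate rule is kept comes down to comparing its confidence with the chosen minimum. The check below reuses the counts quoted on this slide; the 0.8 threshold is an arbitrary choice of ours.

```python
# Keep a rule only if its confidence meets the minimum; the threshold is illustrative.
def keep_rule(antecedent_count, rule_count, min_confidence=0.8):
    return (rule_count / antecedent_count) >= min_confidence

# IF Watch = No & Life Insurance = No THEN Credit Card Insurance = No   (4/4)
print(keep_rule(4, 4))   # True  (confidence 1.0)
# IF Watch = No THEN Life Insurance = No & Credit Card Insurance = No   (4/6)
print(keep_rule(6, 4))   # False (confidence ~0.67)
```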

  38. General Considerations We are interested in association rules that show a lift in product sales, where the lift results from the product’s association with one or more other products. We are also interested in association rules that show a lower-than-expected confidence for a particular association.
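The slide does not give a formula, but a common way to quantify such a lift is to divide the rule's confidence by the consequent's support: values above 1 indicate a positive association, values below 1 a lower-than-expected confidence. The sketch below follows that standard definition with made-up transactions.

```python
# Standard lift = confidence(A -> B) / support(B); the transactions are made up.
transactions = [
    {"milk", "bread"}, {"milk", "bread"}, {"bread"},
    {"milk"}, {"cheese"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(antecedent, consequent):
    conf = support(antecedent | consequent) / support(antecedent)
    return conf / support(consequent)

print(lift({"milk"}, {"bread"}))   # (2/3) / 0.6 ~ 1.11 -> mild positive lift
```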

  39. Homework • Problems 2, 3 (p. 102 of the book)
