
Rule Learning



  1. Rule Learning Intelligent Systems – Lecture 10 Prof. D. Fensel & R. Krummenacher

  2. Agenda • Motivation • Associative Rules • Decision Trees • Entropy • ID3 Algorithm • Refinement of Rule Sets • Generalization and Specialization • RELAX and JoJo Algorithms • Incremental Refinement of Rule Sets • Summary and Conclusions

  3. Motivation • Data warehouses and other large-scale data collections provide vast amounts of data and hide knowledge that is not obvious to discover. • Rule learning (association rule learning) is a popular means for discovering interesting relations between data sets. • Rules enable the inference of knowledge and make hidden facts explicit.

  4. Motivating Example • Association rule learning is very popular in marketing: • People buying Coke and beer are very likely to also buy potato chips: {Coke, beer} ⇒ {potato chips}. • Retail Advertising and Marketing Association: "For example, convenience stores realized that new fathers came in to buy Pampers and they also bought beer. So where are the Pampers now? Near the beer. It's about getting a feel for what other things you can merchandise." • Other popular application domains include, amongst others, Web usage mining, intrusion detection and bioinformatics.

  5. Definition by Agrawal • Let I = {i1, i2, ..., in} be a set of binary attributes called items, and let D = {t1, t2, ..., tm} be a set of transactions called the database. • Each transaction (tuple in the database) contains a subset of the items in I. • A rule is then defined to be an implication of the form A → C, where A, C ⊆ I and A ∩ C = ∅. • The left-hand side (A) is called the antecedent, the right-hand side (C) is referred to as the consequent.

  6. Simple Example • From Wikipedia (supermarket domain): • I = {milk, bread, butter, beer} • Example database • A possible rule would be {milk, bread} → butter • In reality, a rule typically needs the support of hundreds of transactions before it can be considered statistically significant.

  7. Significance of Rules • Association rule learning only makes sense in the context of very large data sets. • In very large data sets there are obviously hundreds if not thousands of implications discoverable. • Significance of and interest in a rule is therefore an important selection criterion, and only those rules that represent a bigger share of the whole can be considered relevant. • The support of an itemset A is defined as the proportion of transactions ti ∈ D that contain A: supp(A) = |{ti ∈ D : A ⊆ ti}| / |D|.

  8. Confidence • The confidence in a rule depends on the support of the itemset A and the support of the union of A and C. • Confidence: conf(A → C) = supp(A ⋃ C) / supp(A). • The confidence is an estimate of the conditional probability P(C|A), and gives an indication of the probability that the consequent holds given the antecedent.
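A minimal sketch of these two measures (the helper functions and the toy transactions are illustrative, not part of the lecture):

```python
# Toy transaction database: each transaction is a set of items (illustrative data)
D = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def supp(itemset, transactions):
    """Support: proportion of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(antecedent, consequent, transactions):
    """Confidence: conf(A -> C) = supp(A ∪ C) / supp(A)."""
    return supp(set(antecedent) | set(consequent), transactions) / supp(antecedent, transactions)

print(supp({"milk", "bread"}, D))               # 2/5 = 0.4
print(conf({"milk", "bread"}, {"butter"}, D))   # 0.2 / 0.4 = 0.5
```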

  9. Finding Rules • Association rules need to satisfy a given level of confidence and a given degree of support at the same time. • A two-step process is generally applied to discover rules that satisfy both requirements: • Minimal support is used to determine frequent itemsets. • Minimum confidence thresholds are applied to determine the rules. • The first step is significantly more challenging than the second one!

  10. Finding Frequent Itemsets • The number of possible itemsets is given by the power set of I and is thus equal to 2^n − 1 (excluding the empty set), with n = |I|. • Consequently, the number of potential itemsets grows exponentially in the size of I. • Different algorithms nonetheless allow the frequent itemsets to be computed efficiently: Apriori (BFS), Eclat (DFS), Frequent Pattern-Growth (FP-tree). • These algorithms exploit the downward-closure property (aka anti-monotonicity): all subsets of a frequent itemset are frequent, and consequently all supersets of an infrequent itemset are infrequent.
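A minimal Apriori-style sketch of how the downward-closure property prunes the search (function names and the toy data are illustrative assumptions, not the lecture's own implementation):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Apriori sketch: return all itemsets with support >= min_support."""
    n = len(transactions)
    support = lambda itemset: sum(itemset <= t for t in transactions) / n

    # Frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = set(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets ...
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # ... and prune candidates that have an infrequent (k-1)-subset (downward closure)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
        k += 1
    return result

D = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "butter"}, {"milk"}]
print(apriori(D, min_support=0.5))
```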

  11. Alternative Indicators • All-confidence: all rules that can be derived from itemset A have at least a confidence of all-confidence(A) = supp(A) / max{supp({a}) : a ∈ A}, where the denominator is the support of the item with the highest support in A. • Collective strength is given by cs(A) = (1 − v(A)) / (1 − E[v(A)]) * E[v(A)] / v(A), with v(A) the violation rate and E[] the expected value under item independence; the violation rate is defined as the fraction of transactions which contain some but not all of the items in an itemset. Collective strength gives 0 for perfectly negatively correlated items, infinity for perfectly positively correlated items, and 1 if the items co-occur as expected under independence.

  12. Alternative Indicators (2) • Coverage (aka antecedent support) measures how often a rule is applicable in a database: coverage(A → C) = supp(A) = P(A). • The conviction of a rule is the ratio of the expected frequency of A occurring without C (if A and C were independent) to the observed frequency of A without C: conv(A → C) = (1 − supp(C)) / (1 − conf(A → C)). • Leverage measures the difference between how often A and C appear together in the data set and what would be expected if A and C were statistically independent: leverage(A → C) = P(A and C) − P(A)P(C).

  13. Alternative Indicators (3) • The lift of a rule is the ratio between the observed confidence and the confidence expected if A and C were statistically independent, and thus measures how many times more often A and C occur together than expected under independence: lift(A → C) = conf(A → C) / supp(C) = supp(A ⋃ C) / (supp(A) * supp(C)).
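A sketch computing several of these indicators for a single rule (the toy data and the measures helper are illustrative assumptions):

```python
def supp(itemset, transactions):
    """Support: proportion of transactions containing the whole itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def measures(A, C, transactions):
    """Coverage, confidence, lift, leverage and conviction of the rule A -> C."""
    s_a, s_c = supp(A, transactions), supp(C, transactions)
    s_ac = supp(set(A) | set(C), transactions)
    confidence = s_ac / s_a
    return {
        "coverage":   s_a,                          # supp(A) = P(A)
        "confidence": confidence,                   # supp(A ∪ C) / supp(A)
        "lift":       s_ac / (s_a * s_c),           # confidence / supp(C)
        "leverage":   s_ac - s_a * s_c,             # P(A and C) - P(A)P(C)
        "conviction": (1 - s_c) / (1 - confidence) if confidence < 1 else float("inf"),
    }

D = [{"milk", "bread", "butter"}, {"milk", "bread"}, {"bread"}, {"butter"}]
print(measures({"milk", "bread"}, {"butter"}, D))
```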

  14. Decision Trees • Many inductive knowledge acquisition algorithms generate classifiers in the form of decision trees. • A decision tree is a simple recursive structure for expressing a sequential classification process. • Leaf nodes denote classes. • Intermediate nodes represent tests.

  15. Decision Trees and Rules • Rules can represent a decision tree: if item1 then subtree1 elseif item2 then subtree2 elseif... • There are as many rules as there are leaf nodes in the decision tree. • Advantage of rules over decision trees: • Rules are a widely-used and well-understood representation formalism for knowledge in expert systems; • Rules are easier to understand, modify and combine; and • Rules can significantly improve classification performance by eliminating unnecessary tests.

  16. Decision-Driven Rules • The following definitions apply to rules that aim to conclude a fact out of a given set of attribute value assignments. • The decision tree takes the following form: if attribute1 = value1 then subtree1 elseif attribute1 = value2 then subtree2 elseif... • The critical question is then: which attribute should be the first one to evaluate, i.e. which attribute is the most selective determiner and should be the first one in the decision tree.

  17. Entropy • Entropy is a measure of 'degree of doubt' and is a well-studied concept in information theory. • Let {c1, c2, ..., cn} be the set of conclusions C of a rule (consequents); let {x1, x2, ..., xm} be the set of possible values of an attribute X. • The probability that ci is true given that X has value xj is given by p(ci|xj). • The entropy for a value xj is then defined as entropy = − Σ p(ci|xj) log2[p(ci|xj)] for i ∈ 1...|C|. • The logarithm (− log2[p(ci|xj)]) indicates the 'amount of information' that xj has to offer about the conclusion ci.

  18. Most Useful Determiner • The lower the entropy of xj with respect to C, the more information xj has to offer about C. • The entropy of an attribute X with respect to C is then given by − Σj p(xj) Σi p(ci|xj) log2[p(ci|xj)]. • The attribute with the lowest entropy is the most useful determiner, as it has the lowest 'degree of doubt'.
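A small sketch of this computation (representing examples as dicts with a "conclusion" key is an assumption for illustration):

```python
from collections import Counter, defaultdict
from math import log2

def attribute_entropy(examples, attribute):
    """Entropy of an attribute with respect to the conclusion:
    - sum_j p(x_j) * sum_i p(c_i | x_j) * log2 p(c_i | x_j)."""
    groups = defaultdict(list)
    for ex in examples:                      # group conclusions by attribute value x_j
        groups[ex[attribute]].append(ex["conclusion"])
    total = len(examples)
    entropy = 0.0
    for conclusions in groups.values():
        p_x = len(conclusions) / total       # p(x_j)
        counts = Counter(conclusions)        # counts of each conclusion c_i given x_j
        entropy -= p_x * sum((n / len(conclusions)) * log2(n / len(conclusions))
                             for n in counts.values())
    return entropy
```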

  19. ID3 Algorithm by Quinlan • 1. For each attribute, compute its entropy with respect to the conclusion. • 2. Select the attribute with the lowest entropy (say X). • 3. Divide the data into separate sets so that within a set, X has a fixed value (X=x1 in one set, X=x2 in another, ...). • 4. Build a tree with branches: if X=x1 then ... (subtree1), if X=x2 then ... (subtree2), ... • 5. For each subtree, repeat from step 1. • At each iteration, one attribute is removed from consideration. STOP if no more attributes are left, or if all attribute values lead to the same conclusion.
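A minimal recursive sketch of these steps (examples are dicts with attribute values and a "conclusion" key; all names are illustrative assumptions, not Quinlan's code):

```python
from collections import Counter, defaultdict
from math import log2

def id3(examples, attributes):
    """Minimal ID3 sketch: returns a nested dict {attribute: {value: subtree}},
    or a conclusion label at the leaves."""
    conclusions = [ex["conclusion"] for ex in examples]
    # Stop if all examples agree on the conclusion, or no attributes are left
    if len(set(conclusions)) == 1 or not attributes:
        return Counter(conclusions).most_common(1)[0][0]

    def entropy(attr):
        groups = defaultdict(list)
        for ex in examples:
            groups[ex[attr]].append(ex["conclusion"])
        h = 0.0
        for vals in groups.values():
            p_x = len(vals) / len(examples)
            h -= p_x * sum((n / len(vals)) * log2(n / len(vals))
                           for n in Counter(vals).values())
        return h

    # Most useful determiner = attribute with the lowest entropy
    best = min(attributes, key=entropy)
    tree = {}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree[value] = id3(subset, [a for a in attributes if a != best])
    return {best: tree}
```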

  20. Robinson Crusoe Example • Identifying what is good to eat, based on sixteen rules: • Next step: determine the entropy of each attribute with respect to the conclusion.

  21. Robinson Crusoe Example (2) • Considering Size as an example yields: • p(safe | large) = 5/7 • p(unsafe | large) = 2/7 • p(large) = 7/16 • p(safe | small) = 5/9 • p(unsafe | small) = 4/9 • p(small) = 9/16 • Entropy of Size = −[7/16 (5/7 log2(5/7) + 2/7 log2(2/7)) + 9/16 (5/9 log2(5/9) + 4/9 log2(4/9))] = 0.935095...
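A quick check of this arithmetic (log2 denotes the base-2 logarithm):

```python
from math import log2

h_large = -(5/7 * log2(5/7) + 2/7 * log2(2/7))   # entropy of the conclusion given Size = large
h_small = -(5/9 * log2(5/9) + 4/9 * log2(4/9))   # entropy of the conclusion given Size = small
print(7/16 * h_large + 9/16 * h_small)            # 0.935095... as on the slide
```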

  22. Robinson Crusoe Example (3) • Calculating all entropies shows that Color has the smallest one • Color is thus the most useful determiner • The resulting six rules describe the same space as the initial, bigger set of sixteen rules: • if color=brown then unsafe • if color=green and size=large then safe • if color=green and size=small then unsafe • if color=red and skin=hairy then safe • if color=red and skin=smooth and size=large then unsafe • if color=red and skin=smooth and size=small then safe

  23. Counter Example • Consider the following data: • X=3, Y=3 ⇒ yes • X=2, Y=1 ⇒ no • X=3, Y=4 ⇒ no • X=1, Y=1 ⇒ yes • X=2, Y=2 ⇒ yes • X=3, Y=2 ⇒ no • The entropy-based ID3 algorithm is incapable of spotting the obvious relationship "if X = Y then yes else no", as only one attribute is considered at a time!

  24. Accuracy of ID3 • ID3 forms rules by eliminating conditions from a path in the decision tree, and thus the rules tend to be over-generalized with respect to the training data. • Can rules keep up with the decision trees? • Experimental results by Quinlan in 1987 show that rules are not only simpler in the general case, but that they are sometimes even more accurate! J.R. Quinlan: Generating Production Rules From Decision Trees, IJCAI’87

  25. Refinement of Rule Sets • There is a four-step procedure for the refinement of rules: • 1. Rules that become incorrect because of new examples are refined: incorrect rules are replaced by new rules that cover the positive examples, but not the new negative ones. • 2. Complete the rule set to cover new positive examples. • 3. Redundant rules are deleted to correct the rule set. • 4. Minimize the rule set. • Steps 1 and 2 are handled by the JoJo algorithm, which integrates generalization and specialization via a heuristic search procedure.

  26. Specialization • Specialization algorithms start from very general descriptions and specialize those until they are correct. • This is done by adding additional premises to the antecedent of a rule, or by restricting the range of an attribute which is used in an antecedent. • Algorithms relying on specialization generally have the problem of overspecialization: previous specialization steps could become unnecessary due to subsequent specialization steps. • This brings along the risk of ending up with results that are not maximal-general. • Some examples of (heuristic) specialization algorithms are: AQ, C4, CN2, CABRO, FOIL and PRISM; references at the end of the lecture.

  27. Generalization • Generalization starts from very specific descriptions and generalizes them as long as they are not incorrect, i.e. in every step some unnecessary premises are deleted from the antecedent. • The generalization procedure stops if no more removable premises exist. • Generalization avoids the maximal-general issue of specialization; in fact it guarantees most-general descriptions. • However, generalization risks deriving final results that are not most-specific. • RELAX is an example of a generalization-based algorithm; references at the end of the lecture.

  28. RELAX • RELAX is a generalization algorithm, and proceeds as long as the resulting rule set is not incorrect. • Interestingly, the motivation for RELAX came from algorithms of a different domain: the minimization of electronic circuits. • In fact, the minimization of the description of two classes with binary attributes is identical to the discovery of a minimal Boolean expression according to McCluskey'56.

  29. RELAX (2) • Every example is considered to be a specific rule that is generalized. • The algorithm starts from a first rule and relaxes (drops) its first premise. • The resulting rule is tested against the negative examples. • If the new (generalized) rule covers negative examples, the premise is added again, and the next premise is relaxed. • A rule is considered minimal if any further relaxation would destroy the correctness of the rule. • The search for further minimal rules starts from any not yet considered example, i.e. an example that is not covered by already discovered minimal rules.

  30. RELAX - Example • Consider the following positive example for a consequent C: (pos, (x=1, y=0, z=1)) • This example is represented as a rule: x ∩ ¬y ∩ z → C • In case of no negative examples, RELAX constructs and tests the following set of rules: 1) x ∩ ¬y ∩ z → C 2) ¬y ∩ z → C 3) x ∩ z → C 4) x ∩ ¬y → C 5) x → C 6) ¬y → C 7) z → C 8) → C
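A small sketch of the relaxation loop described above (the dict-based rule representation and the covers helper are illustrative assumptions):

```python
def relax(example, negatives):
    """RELAX-style generalization sketch: try to drop each premise in turn and
    keep the relaxation only if the rule still covers no negative example."""
    def covers(premises, case):
        return all(case.get(attr) == val for attr, val in premises.items())

    premises = dict(example)                  # the example itself is the initial rule
    for attr in list(premises):
        relaxed = {a: v for a, v in premises.items() if a != attr}
        if not any(covers(relaxed, neg) for neg in negatives):
            premises = relaxed                # relaxation is safe: keep it
        # otherwise the premise is added back, i.e. kept
    return premises                           # antecedent of a minimal rule "premises -> C"

# The positive example (pos, (x=1, y=0, z=1)) with no negative examples:
print(relax({"x": 1, "y": 0, "z": 1}, negatives=[]))   # {} : everything relaxes down to "-> C"
```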

  31. Summary: Specialization and Generalization • Specialization and generalization are dual search directions in a given rule set. • Specialization starts at the 'Top' element, which initially covers negative examples that must be excluded. • Generalization starts with the 'Bottom' element, which initially leaves positive examples uncovered that must be covered.

  32. JoJo – Refinement of Rule Sets • In general, it cannot be determined which search direction is the better one. • Note that ID3 in fact already makes use of both, by first constructing rules via specialization and then generalizing the rule set. • JoJo is an algorithm that combines both search directions in one heuristic search procedure. • JoJo can start at an arbitrary point in the lattice of complexes and generalizes and specializes as long as the quality and correctness can be improved, i.e. until a local optimum is found or no more search resources are available (e.g., time, memory).

  33. JoJo (2) • While specialization moves solely from 'Top' towards 'Bottom' and generalization from 'Bottom' towards 'Top', JoJo is able to move freely in the search space. • Either of the two strategies can be used interchangeably, which makes JoJo more expressive than comparable algorithms that apply the two in sequential order (e.g. ID3).

  34. JoJo (3) • A starting point in JoJo is described by two parameters: • Vertical position (length of the description) • Horizontal position (chosen premises) • Reminder: JoJo can start at any arbitrary point, while specialization requires a most general starting point and generalization a most specific one. • In general, it is possible to carry out several runs of JoJo with different starting points. Rules that were already produced can be used as subsequent starting points.

  35. JoJo – Choosing a Starting Point • Criteria for choosing a vertical position: • Approximation of the possible length, or experience from earlier runs. • Random production of rules; distribution by means of the average correctness of the rules of the same length (so-called quality criterion). • Start with a small sample or very limited resources to discover a real starting point from an arbitrary one. • Randomly chosen starting point (same average expectation of success as starting with 'Top' or 'Bottom'). • Heuristic: few positive examples and maximal-specific descriptions suggest long rules; few negative examples and maximal-general descriptions suggest rather short rules.

  36. JoJo – Choosing a Starting Point (2) • Criteria for choosing a horizontal position: • Given the vertical position, one can select the premises with the highest correlation with the goal concept (consequent).

  37. JoJo Principal Components • JoJo consists of three components: • Specializer and Generalizer • Scheduler • The former two can be provided by any such components, depending on the chosen strategies and preference criteria. • The Scheduler is responsible for selecting the next description out of all possible generalizations and specializations available (by means of a t-preference, total preference). • Simple example scheduler: • Specialize if the error rate is above the threshold; • Otherwise, choose the best generalization with an allowable error rate; • Otherwise stop.
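A sketch of such a simple scheduler loop (the specialize/generalize callables, error_rate and max_error are assumed interfaces for illustration, not the lecture's own code):

```python
def schedule(rule, specialize, generalize, error_rate, max_error, max_steps=100):
    """Simple JoJo-style scheduler sketch: specialize while the error rate is too
    high, otherwise take the best generalization with an allowable error rate."""
    for _ in range(max_steps):
        if error_rate(rule) > max_error:
            candidates = specialize(rule)                 # one-step specializations
        else:
            candidates = [r for r in generalize(rule)     # one-step generalizations ...
                          if error_rate(r) <= max_error]  # ... with allowable error
        if not candidates:
            break                                         # local optimum: stop
        rule = min(candidates, key=error_rate)            # t-preference: pick the best
    return rule
```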

  38. Incremental Refinement of Rules with JoJo • Refinement of rules refers to the modification of a given rule set based on additional examples. • The input to the task is a so-called hypothesis (a set of rules) and a set of old and new positive and negative examples. • The output of the algorithm is a refined set of rules and the total set of examples. • The new set of rules is correct, complete, non-redundant and (if necessary) minimal.

  39. Incremental Refinement of Rules with JoJo (2) • Correctness: • Modify overly general rules that cover too many negative examples. • Replace a rule by a new set of rules that cover the positive examples, but not the negative ones. • Completeness: • Compute new correct rules that cover the not yet considered positive examples (up to a threshold). • Non-redundancy: • Remove rules that are more specific than other rules, i.e. rules whose premises are a superset of the premises of another rule (see the sketch below).
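A minimal sketch of this non-redundancy check (the rule representation is an illustrative assumption):

```python
def remove_redundant(rules):
    """Drop rules whose premises are a proper superset of another rule's premises
    (with the same consequent): the more general rule already covers them."""
    kept = []
    for premises, consequent in rules:
        redundant = any(other_c == consequent and other_p < premises   # proper subset
                        for other_p, other_c in rules
                        if (other_p, other_c) != (premises, consequent))
        if not redundant:
            kept.append((premises, consequent))
    return kept

# Rules as (frozenset of premises, consequent); illustrative data
rules = [(frozenset({"color=red", "size=large"}), "unsafe"),
         (frozenset({"color=red"}), "unsafe")]
print(remove_redundant(rules))   # keeps only the more general rule: color=red -> unsafe
```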

  40. Summary • Associative rules help to discover otherwise hidden knowledge. • To discover rules it is important to understand the significance of data sets and rules, and to have relevance measures in place, e.g. coverage, leverage or confidence. • Decision trees are an alternative notation for rules. • The ID3 algorithm by Quinlan is a very important example. • In this context, we need to understand the principle of entropy (information theory) to determine the 'amount of information' that a given attribute brings along.

  41. Summary (2) • Rules cover positive examples and should not cover negative examples. • There are two main approaches for determining rules: • Generalization • Specialization • RELAX is the presented example of a generalization algorithm. • JoJo combines the two and traverses the entire search space by either generalizing or specializing rules interchangeably. • JoJo can also be applied to incrementally refine rule sets.

  42. Specialization and Generalization Algorithms • AQ: Michalski, Mozetic, Hong and Lavrac: The Multi-Purpose Incremental Learning System AQ15 and its Testing Application to Three Medical Domains. AAAI-86, pp. 1041-1045. • C4: Quinlan: Learning Logical Definitions from Relations. Machine Learning 5(3), 1990, pp. 239-266. • CN2: Clark and Boswell: Rule Induction with CN2: Some Recent Improvements. EWSL-91, pp. 151-163. • CABRO: Huyen and Bao: A method for generating rules from examples and its application. CSNDAL-91, pp. 493-504. • FOIL: Quinlan: Learning Logical Definitions from Relations. Machine Learning 5(3), 1990, pp. 239-266. • PRISM: Cendrowska: PRISM: An algorithm for inducing modular rules. Journal of Man-Machine Studies 27, 1987, pp. 349-370. • RELAX: Fensel and Klein: A new approach to rule induction and pruning. ICTAI-91.

  43. Further References • Agrawal, Imielinski and Swami: Mining Association Rules between Sets of Items in Large Databases. ACM SIGMOD Conference, 1993, pp. 207-216. • Quinlan: Generating Production Rules From Decision Trees. 10th Int'l Joint Conference on Artificial Intelligence, 1987, pp. 304-307. • Fensel and Wiese: Refinement of Rule Sets with JoJo. European Conference on Machine Learning, 1993, pp. 378-383.
