
Data Mining Tutorial



Presentation Transcript


  1. Data Mining Tutorial Tomasz Imielinski Rutgers University

  2. What is data mining? • Finding interesting, useful, unexpected • Finding patterns, clusters, associations, classifications • Answering inductive queries • Aggregations and their changes on multidimensional cubes

  3. Table of Contents • Association Rules • Interesting Rules • OLAP • Cubegrades – unification of association rules and OLAP • Classification and Clustering methods – not included in this tutorial

  4. Association Rules • [AIS 1993] – Agrawal, Imielinski, Swami “Mining Association Rules” SIGMOD 1993 • [AS 1994] – Agrawal, Srikant “Fast algorithms for mining association rules in large databases” VLDB 94 • [B 1998] – Bayardo “Efficiently Mining Long Patterns from Databases” SIGMOD 98 • [SA 1996] – Srikant, Agrawal “Mining Quantitative Association Rules in Large Relational Tables”, SIGMOD 96 • [T 1996] – Toivonen “Sampling Large Databases for Association Rules”, VLDB 96 • [BMS 1997] – Brin, Motwani, Silverstein “Beyond Market Baskets: Generalizing Association Rules to Correlations” SIGMOD 97 • [IV 1999] – Imielinski, Virmani “MSQL: A query language for database mining” DMKD 1999

  5. Baskets • I1,…,Im – a set of (binary) attributes called items • T is a database of transactions • t[k] = 1 if transaction t bought item k • Association rule X => I with support s and confidence c • Support – the fraction of transactions in T that satisfy both X and I • Confidence – the fraction of transactions satisfying X that also satisfy I
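
To make these definitions concrete, here is a small illustrative sketch (not part of the original tutorial) that computes support and confidence of a rule over a toy transaction database; the item names and transactions are made up.

```python
# Illustrative sketch: support and confidence of a rule X => I
# over a small, made-up transaction database.

transactions = [
    {"milk", "bread", "cereal"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, I, transactions):
    """Fraction of transactions containing X that also contain I."""
    return support(X | I, transactions) / support(X, transactions)

# Rule {milk, cereal} => {bread}
X, I = {"milk", "cereal"}, {"bread"}
print(support(X | I, transactions))    # support of the rule
print(confidence(X, I, transactions))  # confidence of the rule
```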

  6. Baskets • Minsup, minconf • Frequent sets – sets of items X such that their support sup(X) > minsup • If X is frequent, all its subsets are frequent (downward closure)

  7. Examples • 20% of transactions which bought cereal and milk also bought bread (support 2%) • Worst case – exponential number (in terms of size of the set of items) of such rules. • What is the set of transactions which leads to exponential blow up of the rule set? • Fortunately worst cases are unlikely – not typical. Support provides excellent pruning ability.

  8. General Strategy • Generate frequent sets • Get association rules X => I with support s = support(X + I) and confidence c = support(X + I) / support(X) • Key property: downward closure of the frequent sets – no need to consider supersets of X if X is not frequent
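
The two-step strategy can be sketched in a few lines; this is an illustrative brute-force version (not the tutorial's code), with candidate generation kept deliberately crude since the following slides refine it.

```python
# Sketch of the two-step strategy: 1) enumerate frequent itemsets level by
# level, 2) derive from them all rules that meet the confidence threshold.
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    items = {i for t in transactions for i in t}
    frequent, k, level = {}, 1, [frozenset([i]) for i in items]
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n / len(transactions) for c, n in counts.items()
                   if n / len(transactions) >= minsup}
        frequent.update(current)
        # Expand only frequent sets (downward closure); crude candidate join.
        level = {a | b for a in current for b in current if len(a | b) == k + 1}
        k += 1
    return frequent  # maps frozenset -> support fraction

def rules(frequent, minconf):
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for X in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[X]  # X is frequent by downward closure
                if conf >= minconf:
                    yield set(X), set(itemset - X), sup, conf
```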

  9. General strategies • Make repetitive passes through the database of transactions • In each pass count support of CANDIDATE frequent sets • In the next pass continue with frequent sets obtained so far by “expanding” them. Do not expand sets which were determined NOT to be frequent

  10. AIS Algorithm (R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, SIGMOD’93)

  11. AIS – generating association rules (R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, SIGMOD’93)

  12. AIS – estimation part (R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, SIGMOD’93)

  13. Apriori (R. Agrawal, R. Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)

  14. Apriori algorithm (R. Agrawal, R. Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)

  15. Pruning in Apriori through self-join (R. Agrawal, R. Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)
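
A minimal sketch of the self-join candidate generation this slide refers to (the apriori-gen step), assuming frequent itemsets are stored as sorted tuples; it illustrates the join and prune steps, not the paper's implementation.

```python
# Apriori candidate generation: self-join L_{k-1} with itself, then prune
# any candidate that has an infrequent (k-1)-subset.
from itertools import combinations

def apriori_gen(L_prev):
    """L_prev: set of frequent (k-1)-itemsets as sorted tuples; returns C_k."""
    k = len(next(iter(L_prev))) + 1
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            # Join step: first k-2 items equal, last item of a < last of b.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    return {c for c in candidates
            if all(tuple(s) in L_prev for s in combinations(c, k - 1))}

# Example: frequent 2-itemsets over items a, b, c, d
L2 = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")}
print(apriori_gen(L2))  # {('a', 'b', 'c')}; ('b', 'c', 'd') is pruned
```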

  16. Performance improvement due to Apriori pruning (R. Agrawal, R. Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)

  17. Other pruning techniques • Key question: at any point in time, how to determine which extensions of a given candidate set are “worth” counting • Apriori – only those for which all subsets are frequent • Only those for which the estimated upper bound of the count is above minsup • Take a risk – count a large superset of the given candidate set. If it is frequent, then all its subsets are also frequent – a large saving. If not, at least we have pruned all its supersets.

  18. Jump ahead schemes: Bayardo’s Max-miner (R. Bayardo, “Efficiently Mining Long Patterns from Databases”, SIGMOD’98)

  19. Jump ahead scheme • h(g) and t(g): head and tail of an item group g. The tail is the maximal set of items with which g can possibly be extended
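
A tiny sketch of the lookahead idea, assuming `head` and `tail` are plain Python sets; this is not Bayardo's implementation, only the "take a risk" test from the previous slide expressed over a candidate group.

```python
# "Jump ahead": if h(g) ∪ t(g) is frequent, then every subset of it is
# frequent too, so the whole enumeration subtree below group g can be skipped.

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def lookahead_prune(head, tail, transactions, minsup_count):
    """True if head ∪ tail is frequent, i.e. no item-by-item expansion of
    this group is needed (it is a candidate maximal frequent set)."""
    return support_count(head | tail, transactions) >= minsup_count
```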

  20. Max-miner (R. Bayardo, “Efficiently Mining Long Patterns from Databases”, SIGMOD’98)

  21. Max-miner (R. Bayardo, “Efficiently Mining Long Patterns from Databases”, SIGMOD’98)

  22. Max-miner (R. Bayardo, “Efficiently Mining Long Patterns from Databases”, SIGMOD’98)

  23. Max-miner (R. Bayardo, “Efficiently Mining Long Patterns from Databases”, SIGMOD’98)

  24. Max-miner vs Apriori vs Apriori LB • Max-miner is over two orders of magnitude faster than Apriori in identifying maximal frequent patterns on data sets with long max patterns • Considers fewer candidate sets • Indexes only on head items • Dynamic item reordering

  25. Quantitative Rules • Rules which involve continuous/quantitative attributes • Standard approach: discretize into intervals • Problem: the discretization is arbitrary, so we may miss rules • MinSup problem: if the number of intervals is large, their support will be low • MinConf problem: if intervals are large, rules may not meet minimum confidence
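
An illustrative sketch of the standard discretization approach (the attribute name "age" and the bucket boundaries are made up), which also makes the arbitrariness problem visible: the resulting rule set depends entirely on where the boundaries fall.

```python
# Turn a quantitative attribute into interval "items" so that the usual
# boolean association-rule machinery applies. Many narrow intervals lower
# support; a few wide intervals dilute confidence.

def discretize(value, boundaries):
    """Map a numeric value to an interval item such as 'age:30-40'."""
    for lo, hi in zip(boundaries, boundaries[1:]):
        if lo <= value < hi:
            return f"age:{lo}-{hi}"
    return f"age:{boundaries[-1]}+"

boundaries = [0, 20, 30, 40, 65]
print(discretize(27, boundaries))  # 'age:20-30'
print(discretize(70, boundaries))  # 'age:65+'
```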

  26. Correlation Rules [BMS 1997] • Suppose the conditional probability that a customer buys coffee given that he buys tea is 80%; is this an important/interesting rule? • It depends… if the a priori probability of a customer buying coffee is 90%, then it is not • Need 2x2 contingency tables rather than just pure association rules: a chi-square test for correlation rather than the support/confidence framework alone, which can be misleading

  27. Correlation Rules • Events A and B are independent if p(AB) = p(A) x p(B) • If any of AB, A(notB), (notA)B, (notA)(notB) are dependent, then A and B are correlated; likewise for three items, if any of the eight combinations of A, B and C are dependent, then A, B, C are correlated • I = {i1,…,in} is a correlation rule iff the occurrences of i1,…,in are correlated • Correlation is upward closed; if S is correlated, so is any superset of S
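
A minimal sketch of the chi-square test on a 2x2 contingency table for the tea/coffee example above; the counts are invented so that P(coffee | tea) = 80% and P(coffee) = 90%, matching the earlier slide.

```python
# Chi-square statistic of a 2x2 contingency table (made-up counts):
# rows = tea / no tea, columns = coffee / no coffee.

table = [[20, 5],   # bought tea:  20 with coffee, 5 without
         [70, 5]]   # no tea:      70 with coffee, 5 without

def chi_square(table):
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for r in range(len(table)):
        for c in range(len(table[0])):
            expected = row_tot[r] * col_tot[c] / n  # independence assumption
            chi2 += (table[r][c] - expected) ** 2 / expected
    return chi2

# Compare against the chi-square cutoff at the chosen significance level,
# e.g. 3.84 for alpha = 0.05 with one degree of freedom.
print(chi_square(table))
```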

  28. Downward vs upward closure • Downward closure (frequent sets) is a pruning property • Upward closure – look for minimal correlated itemsets, such that no subsets of them are correlated. Then finding a correlation is itself a pruning step – prune all supersets of a correlated itemset because they are not minimal. • Border of correlation

  29. Pruning based on support-correlation • Correlation can be an additional pruning criterion next to support • Unlike the support/confidence framework, where confidence is not upward closed

  30. Chi-square (S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)

  31. Correlation Rules (S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)

  32. (S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)

  33. Algorithms for Correlation Rules • The border can be large, exponential in the size of the item set – need better pruning functions • A support function needs to be defined, also covering negative dependencies • A set of items S has support s at the p% level if at least p% of the cells in the contingency table for S have value at least s • Problem: for p < 50% all itemsets have support at level one • For p > 25% at least two cells in the contingency table will have support s
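
A small sketch of the support test at the p% level just defined; the flat list of contingency-table cell counts is an assumed input format, not the paper's data structure.

```python
# An itemset S is supported (at the p% level) if at least p percent of the
# cells in its contingency table hold at least s baskets.

def supported(cell_counts, s, p):
    """cell_counts: counts of all 2^|S| contingency-table cells for itemset S."""
    enough = sum(1 for c in cell_counts if c >= s)
    return enough / len(cell_counts) >= p

# Contingency table of a pair {A, B}: cells AB, A(notB), (notA)B, (notA)(notB)
print(supported([40, 12, 9, 3], s=10, p=0.25))  # True: 3 of 4 cells >= 10
```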

  34. Pruning… • Antisupport (for rare events) • Prune itemsets with very high chi-square to eliminate obvious correlations • Combine chi-squared correlation rules with pruning via support • Itemset is significant iff it is supported and minimally correlated

  35. Algorithm 2-support • INPUT: A chi-squared significance level , support s, support fraction p > 0.25. • Basket data B. • OUTPUT: A set of minimal correlated itemsets, from B. • For each item , do count O(i). We can use these values to calculate any necessary • expected value. • Initialize • For each pair of items such that and , do add to • . • 5. If is empty, then return SIG and terminate. • For each itemset in , do construct the contingency table for the itemset. If less than • p percent of the cells have count s, then goto Step 8. • 7. If the value for contingency table is at least , then add the itemset to SIG, • else add the items to NOTSIG. • Continue with the next itemset in . If there are no more itemsets in , • then set to be the set of all sets S such that every subset of size |S| - 1 is not . • Goto Step 4. (S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)

  36. Sampling Large Databases for Association Rules [T 1996] • Pick a random sample • Find all association rules which hold in that sample • Verify the results with the rest of the database • Missing rules can be found in a second pass

  37. Key idea – more detail • Find a collection of frequent sets in the sample using a lowered support threshold. This collection is likely to be a superset of the frequent sets in the entire database • Concept of the negative border: the minimal sets which are not in the set collection S, i.e. sets all of whose proper subsets are in S
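
An illustrative brute-force computation of the negative border (exponential in the number of items, so for intuition only); the function name and the representation of itemsets as frozensets are assumptions, not Toivonen's code.

```python
# Negative border: the minimal itemsets NOT in the collection `frequent` whose
# proper subsets all are. These are the extra sets that must be counted when
# verifying the sample's result against the full database.
from itertools import combinations

def negative_border(frequent, items):
    frequent = set(frequent) | {frozenset()}  # empty set is trivially frequent
    border = set()
    for k in range(1, len(items) + 1):
        for cand in map(frozenset, combinations(sorted(items), k)):
            if cand not in frequent and all(
                frozenset(sub) in frequent for sub in combinations(cand, k - 1)
            ):
                border.add(cand)
    return border

freq = {frozenset(x) for x in [{"a"}, {"b"}, {"c"}, {"a", "b"}]}
print(negative_border(freq, {"a", "b", "c"}))
# {frozenset({'a', 'c'}), frozenset({'b', 'c'})} - the "closest" sets that
# might still turn out to be frequent in the full database
```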

  38. Algorithm (H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

  39. Second pass • Negative border consists of the “closest” itemsets which can be frequent too • These have to be tried (measured)

  40. (H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

  41. Probability that a sample s has exactly c rows that contain X (H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)
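
The formula shown on this slide is not in the transcript; under the usual assumption of sampling with replacement, the number of sampled rows that contain X is binomially distributed, so the probability would be:

```latex
\Pr\bigl[\,\lvert\{\, t \in s : X \subseteq t \,\}\rvert = c\,\bigr]
  \;=\; \binom{|s|}{c}\,\mathrm{fr}(X)^{c}\,\bigl(1-\mathrm{fr}(X)\bigr)^{|s|-c}
```

where fr(X) is the frequency (support) of X in the whole database and |s| is the sample size.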

  42. Bounding error (H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

  43. Approximate mining (H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

  44. Approximate mining (H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

  45. Summary • Discover all frequent sets in one pass in a fraction 1 - Δ of the cases, where Δ is given by the user; missing sets may be found in a second pass

  46. Rules and what’s next? • Querying rules • Embedding rules in applications (API)

  47. MSQL (T. Imielinski, A. Virmani, “MSQL: A Query Language for Database Mining”, Data Mining and Knowledge Discovery 3, 99)

  48. MSQL (T. Imielinski, A. Virmani, “MSQL: A Query Language for Database Mining”, Data Mining and Knowledge Discovery 3, 99)

  49. Applications with embedded rules (what are rules good for) • Typicality • Characteristic of • Changing patterns • Best N • What if • Prediction • Classification

  50. OLAP • Multidimensional queries • Dimensions • Measures • Cubes
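
A compact sketch of what "dimensions, measures, cubes" amount to computationally: aggregate a measure over every subset of the dimensions; all column names and numbers below are made up.

```python
# Build a tiny data cube: group a fact table by every subset of the
# dimensions and sum the measure for each group.
from itertools import combinations
from collections import defaultdict

facts = [
    {"region": "east", "product": "milk",  "sales": 10},
    {"region": "east", "product": "bread", "sales": 5},
    {"region": "west", "product": "milk",  "sales": 7},
]
dimensions, measure = ("region", "product"), "sales"

cube = {}
for k in range(len(dimensions) + 1):
    for dims in combinations(dimensions, k):
        totals = defaultdict(int)
        for row in facts:
            key = tuple(row[d] for d in dims)
            totals[key] += row[measure]
        cube[dims] = dict(totals)

print(cube[()])                     # grand total: {(): 22}
print(cube[("region",)])            # {('east',): 15, ('west',): 7}
print(cube[("region", "product")])  # finest-grained cell totals
```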
