
Frequent itemset mining and temporal extensions



  1. Frequent itemset mining and temporal extensions Sunita Sarawagi sunita@it.iitb.ac.in http://www.it.iitb.ac.in/~sunita

  2. Association rules • Given several sets of items, for example: • set of items purchased • set of pages visited on a website • set of doctors visited • Find all rules that correlate the presence of one set of items with another • Rules are of the form X → Y, where X and Y are sets of items • E.g.: purchase of books A & B → purchase of book C

  3. Parameters: Support and Confidence • Every rule X → Z has two parameters • Support: probability that a transaction contains both X and Z • Confidence: conditional probability that a transaction containing X also contains Z • Two parameters to association rule mining: • minimum support s • minimum confidence c • Example (s = 50%, c = 50%): • A → C (support 50%, confidence 66.6%) • C → A (support 50%, confidence 100%)
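A minimal sketch (not from the slides) of how support and confidence are computed; the four toy baskets below are an assumption chosen so that the numbers reproduce the A → C and C → A example above.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, z, transactions):
    """Conditional probability that a transaction containing X also contains Z."""
    return support(set(x) | set(z), transactions) / support(x, transactions)

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A", "C"}, db))       # 0.5    -> support of A => C is 50%
print(confidence({"A"}, {"C"}, db))  # 0.666  -> confidence of A => C
print(confidence({"C"}, {"A"}, db))  # 1.0    -> confidence of C => A
```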

  4. Applications of fast itemset counting • Cross selling in retail, banking • Catalog design and store layout • Applications in medicine: find redundant tests • Improve predictive capability of classifiers that assume attribute independence • Improved clustering of categorical attributes

  5. Finding association rules in large databases • Number of transactions: in the millions • Number of distinct items: tens of thousands • Lots of work on scalable algorithms • Typically two parts to the algorithm: • finding all frequent itemsets with support > s • finding rules with confidence greater than c • The frequent-itemset search is the more expensive step • Apriori algorithm, FP-tree algorithm

  6. The Apriori Algorithm • L1 = {frequent items of size one} • for (k = 1; Lk ≠ ∅; k++) • Ck+1 = candidates generated from Lk by: • joining Lk with itself • pruning any (k+1)-itemset that has a k-subset not in Lk • for each transaction t in the database do • increment the count of all candidates in Ck+1 that are contained in t • Lk+1 = candidates in Ck+1 with count ≥ min_support • return ∪k Lk
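A compact Python sketch of this loop (illustrative, not the original implementation): itemsets are kept as lexicographically sorted tuples and min_support is an absolute count.

```python
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    frequent = set(Lk)
    while Lk:
        k = len(next(iter(Lk)))
        # C_{k+1}: join L_k with itself, prune candidates with an infrequent k-subset
        Ck1 = set()
        for a in Lk:
            for b in Lk:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    cand = a + (b[-1],)
                    if all(sub in Lk for sub in combinations(cand, k)):
                        Ck1.add(cand)
        # one pass over the database to count the surviving candidates
        counts = {c: 0 for c in Ck1}
        for t in transactions:
            for c in Ck1:
                if set(c) <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent |= Lk
    return frequent

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(db, min_support=2))  # ('A',), ('B',), ('C',), ('A', 'C')
```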

  7. How to Generate Candidates? • Suppose the items in Lk-1 are listed in an order • Step 1: self-joining Lk-1: insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1 • Step 2: pruning: forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck
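The same join-and-prune step written out in Python (a sketch, assuming the frequent (k-1)-itemsets are stored as lexicographically sorted tuples):

```python
from itertools import combinations

def generate_candidates(L_prev):
    """Join L_{k-1} with itself, then prune candidates with an infrequent subset."""
    k_minus_1 = len(next(iter(L_prev)))
    joined = set()
    for p in L_prev:
        for q in L_prev:
            # join condition: first k-2 items agree, p's last item precedes q's
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    # prune: every (k-1)-subset of a candidate must itself be frequent
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, k_minus_1))}

L2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
print(generate_candidates(L2))
# {('A', 'B', 'C')} -- ('B', 'C', 'D') is pruned because ('C', 'D') is not frequent
```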

  8. The Apriori Algorithm: Example • [Worked example figure: scan database D to count the candidate 1-itemsets C1 and keep the frequent ones L1; join to get C2, scan D to count it and keep L2; join to get C3, scan D to count it and keep L3]

  9. Improvements to Apriori • Apriori with well-designed data structures works well in practice when frequent itemsets are not too long (the common case) • Lots of enhancements proposed: • Sampling: count in two passes • Invert the database to column-major instead of row-major form and count by intersection • Count itemsets of multiple lengths in one pass • Reducing passes is not very useful since I/O is not the bottleneck • Main bottleneck: candidate generation and counting, which are not optimized for long itemsets
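A small sketch of the column-major (vertical) idea mentioned above: keep, for each item, the set of transaction ids containing it, and count an itemset by intersecting those tid-sets. Function names and data here are illustrative assumptions.

```python
def vertical_index(transactions):
    """item -> set of transaction ids (tids) containing that item."""
    tids = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tids.setdefault(item, set()).add(tid)
    return tids

def support_count(itemset, tids):
    """Support of an itemset = size of the intersection of its items' tid-sets."""
    items = iter(itemset)
    acc = set(tids[next(items)])
    for item in items:
        acc &= tids[item]
    return len(acc)

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support_count({"A", "C"}, vertical_index(db)))  # 2
```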

  10. Mining Frequent Patterns Without Candidate Generation • Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure • highly condensed, but complete for frequent pattern mining • Develop an efficient, FP-tree-based frequent pattern mining method • A divide-and-conquer methodology: decompose mining tasks into smaller ones • Avoid candidate generation

  11. Construct FP-tree from Database (min_support = 0.5) • Scan DB once, find frequent 1-itemsets • Order frequent items by decreasing frequency • Scan DB again, construct the FP-tree by inserting each transaction's frequent items in that order
      TID   Items bought                  (Ordered) frequent items
      100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
      200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
      300   {b, f, h, j, o}               {f, b}
      400   {b, c, k, s, p}               {c, b, p}
      500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}
      Item frequencies: f 4, c 4, a 3, b 3, m 3, p 3
      [FP-tree figure: root {} with paths f:4 → c:3 → a:3 → m:2 → p:2, f:4 → c:3 → a:3 → b:1 → m:1, f:4 → b:1, and c:1 → b:1 → p:1]
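A minimal FP-tree construction sketch following the recipe on this slide (an illustration: ties between equally frequent items are broken arbitrarily here, so the exact tree shape may differ slightly from the figure).

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(transactions, min_count):
    # pass 1: frequent items, ordered by decreasing frequency
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    order = sorted(freq, key=lambda i: -freq[i])
    # pass 2: insert each transaction's ordered frequent items into a prefix tree
    root, header = FPNode(None, None), {i: [] for i in order}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=order.index):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # node links for later traversal
            child.count += 1
            node = child
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(db, min_count=3)   # min_support 0.5 of 5 baskets
print({i: sum(n.count for n in header[i]) for i in header})
# item -> total count in the tree: f 4, c 4, a 3, b 3, m 3, p 3
```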

  12. Step 1: FP-tree to Conditional Pattern Base • Start from the frequent-item header table of the FP-tree • Traverse the FP-tree by following the node links of each frequent item • Accumulate all transformed prefix paths of that item to form its conditional pattern base
      Conditional pattern bases (item : prefix paths with counts):
      c : f:3
      a : fc:3
      b : fca:1, f:1, c:1
      m : fca:2, fcab:1
      p : fcam:2, cb:1

  13. Step 2: Construct Conditional FP-tree • For each pattern base: • accumulate the count of each item in the base • construct the FP-tree over the frequent items of the pattern base • m-conditional pattern base: fca:2, fcab:1 • m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3 • All frequent patterns containing m: m, fm, cm, am, fcm, fam, cam, fcam

  14. Mining Frequent Patterns by Creating Conditional Pattern-Bases
      Item   Conditional pattern base      Conditional FP-tree
      p      {(fcam:2), (cb:1)}            {(c:3)} | p
      m      {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)} | m
      b      {(fca:1), (f:1), (c:1)}       empty
      a      {(fc:3)}                      {(f:3, c:3)} | a
      c      {(f:3)}                       {(f:3)} | c
      f      empty                         empty
      Repeat this recursively for higher items…
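A sketch of one step in the table above: starting from m's conditional pattern base, count the items in it, keep the frequent ones, and enumerate every pattern ending in m. This shortcut relies on the m-conditional FP-tree being a single path, as it is in this example; the general algorithm recurses as noted above.

```python
from collections import Counter
from itertools import combinations

def patterns_from_single_path_base(suffix, pattern_base, min_count):
    # count each item inside the conditional pattern base
    counts = Counter()
    for path, n in pattern_base.items():
        for item in path:
            counts[item] += n
    frequent = [i for i, c in counts.items() if c >= min_count]  # here: f, c, a
    # because the conditional FP-tree is a single path, every combination of
    # its frequent items, extended by the suffix, is a frequent pattern
    return [combo + (suffix,)
            for r in range(len(frequent) + 1)
            for combo in combinations(frequent, r)]

m_base = {("f", "c", "a"): 2, ("f", "c", "a", "b"): 1}
print(patterns_from_single_path_base("m", m_base, min_count=3))
# (m), (f,m), (c,m), (a,m), (f,c,m), (f,a,m), (c,a,m), (f,c,a,m)
```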

  15. FP-growth vs. Apriori: Scalability With the Support Threshold • [Run-time chart on data set T25I20D10K]

  16. Criticism of Support and Confidence • Example: X and Y are positively correlated and X and Z are negatively correlated, yet the support and confidence of X => Z dominate • Need to measure departure from what is expected • For two items: compare the observed joint support P(X ∧ Z) with the value P(X) P(Z) expected under independence • For k items, the expected support is derived from the supports of the (k-1)-itemsets using iterative scaling methods
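A tiny illustration of the two-item measure above, the ratio of observed joint support to the support expected under independence (often called interest or lift); the numbers are made up.

```python
def interest(support_xz, support_x, support_z):
    """Observed joint support divided by the support expected under independence."""
    return support_xz / (support_x * support_z)

print(interest(0.25, 0.50, 0.50))  # 1.0 -> X and Z look independent
print(interest(0.40, 0.50, 0.50))  # 1.6 -> positively correlated
print(interest(0.10, 0.50, 0.50))  # 0.4 -> negatively correlated
```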

  17. Prevalent correlations are not interesting • Analysts already know about prevalent rules • Interesting rules are those that deviate from prior expectation • Mining’s payoff is in finding surprising phenomena • [Cartoon: in 1995, “bedsheets and pillow covers sell together!” is news; by 1998 the same rule draws a “Zzzz...”]

  18. What makes a rule surprising? • Does not match prior expectation • e.g. the correlation between milk and cereal remains roughly constant over time • Cannot be trivially derived from simpler rules • Milk 10%, cereal 10%; milk and cereal 10% … surprising • Eggs 10%; milk, cereal and eggs 0.1%, expected 1% … surprising!

  19. Finding surprising temporal patterns • Algorithms to mine for surprising patterns • Encode itemsets into bit streams using two models • Mopt: the optimal model, which allows change along time • Mcons: the constrained model, which does not allow change along time • Surprise = difference in the number of bits under Mopt and Mcons

  20. One item: optimal model • Milk-buying habits modeled by biased coin • Customer tosses this coin to decide whether to buy milk • Head or “1” denotes “basket contains milk” • Coin bias is Pr[milk] • Analyst wants to study Pr[milk] along time • Single coin with fixed bias is not interesting • Changes in bias are interesting

  21. The coin segmentation problem • Players A and B • A has a set of coins with different biases • A repeatedly: picks an arbitrary coin, tosses it an arbitrary number of times • B observes the H/T sequence and guesses the transition points and biases • [Figure: A picks and tosses coins, B observes the resulting head/tail stream]

  22. How to explain the data • Given n head/tail observations • Can assume n different coins, each with bias 0 or 1 • Data fits perfectly (with probability one) • But many coins are needed • Or assume one coin • May fit the data poorly • The “best explanation” is a compromise • [Figure: one segmentation of the toss sequence into three segments with biases 5/7, 1/3, 1/4]

  23. Coding examples • A sequence of k zeroes • Naïve encoding takes k bits • Run-length encoding takes about log k bits • 1000 bits, 10 randomly placed 1’s, the rest 0’s • Posit a coin with bias 0.01 • Data encoding cost (Shannon’s theorem): -Σ log2 Pr[xi] = -10 log2(0.01) - 990 log2(0.99) ≈ 81 bits
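A quick check of that cost (a sketch): the Shannon code length of the sequence under a coin with bias 0.01 is the sum of -log2 of the probability of each observed bit.

```python
import math

def data_cost(bias, ones, zeroes):
    """Shannon code length in bits: -sum(log2 Pr[x_i]) over the sequence."""
    return -(ones * math.log2(bias) + zeroes * math.log2(1 - bias))

print(data_cost(0.01, 10, 990))  # ~80.8 bits, far below the naive 1000 bits
```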

  24. How to find optimal segments • A sequence of 17 tosses gives a derived graph with 18 nodes; the best segmentation is a shortest path in this graph • Edge cost = model cost + data cost • Model cost = one node ID + one Pr[head] • Data cost for a segment with Pr[head] = 5/7: cost of encoding its 5 heads and 2 tails at that bias
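A sketch of the segmentation-as-shortest-path idea under simplifying assumptions: nodes are toss positions 0..n, an edge (i, j) stands for one coin covering tosses i..j-1, and its cost is a fixed model cost plus the data cost at the segment's empirical bias. The constant MODEL_COST is an arbitrary stand-in for the node-ID plus Pr[head] cost, not the value used in the paper.

```python
import math

MODEL_COST = 10.0  # assumed bits for one node id + one Pr[head] (illustrative)

def edge_cost(tosses, i, j):
    heads = sum(tosses[i:j])
    n = j - i
    p = heads / n
    if p in (0.0, 1.0):                       # a pure segment costs no data bits
        return MODEL_COST
    return MODEL_COST - heads * math.log2(p) - (n - heads) * math.log2(1 - p)

def best_segmentation(tosses):
    n = len(tosses)
    cost = [0.0] + [math.inf] * n             # shortest-path cost to each node
    prev = [0] * (n + 1)
    for j in range(1, n + 1):                 # edges only go forward, so plain DP works
        for i in range(j):
            c = cost[i] + edge_cost(tosses, i, j)
            if c < cost[j]:
                cost[j], prev[j] = c, i
    segments, j = [], n
    while j > 0:
        segments.append((prev[j], j))
        j = prev[j]
    return cost[n], segments[::-1]

tosses = [1] * 8 + [0] * 10 + [1] * 7         # bias shifts twice
print(best_segmentation(tosses))              # segments (0, 8), (8, 18), (18, 25)
```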

  25. Two or more items • “Unconstrained” segmentation • k items induce a 2^k-sided coin • “milk and cereal” = 11, “milk, not cereal” = 10, “neither” = 00, etc. • Shortest path finds significant shifts in any of the coin-face probabilities • Problem: some of these shifts may be completely explained by the marginals

  26. Example • A drop in the joint sale of milk and cereal is completely explained by the drop in the sale of milk • Pr[milk & cereal] / (Pr[milk] Pr[cereal]) remains constant over time • Call this ratio the correlation ratio

  27. Constant-ratio segmentation • Compute the global ratio (observed support / support expected under independence) over all time • All coins must share this common value of the ratio • Segment as before • Compare with the unconstrained coding cost

  28. Is all this really needed? • Simpler alternative: • Aggregate data into suitable time windows • Compute support, correlation, the ratio, etc. in each window • Use a variance threshold to choose itemsets • Pitfalls: • Arbitrary choices of windows and thresholds • May miss fine detail • Over-sensitive to outliers
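A minimal sketch of that simpler alternative: fixed time windows, per-window support, and a variance threshold. The window size, threshold, and toy data are exactly the kind of arbitrary choices the pitfalls point at.

```python
from statistics import pvariance

def windowed_supports(baskets, itemset, window):
    """Support of `itemset` in consecutive windows of `window` baskets."""
    itemset = set(itemset)
    return [sum(itemset <= b for b in baskets[s:s + window]) / len(baskets[s:s + window])
            for s in range(0, len(baskets), window)]

baskets = [{"milk", "cereal"}] * 30 + [{"milk"}] * 30 + [{"milk", "cereal"}] * 30
supports = windowed_supports(baskets, {"milk", "cereal"}, window=30)
print(supports, pvariance(supports))  # [1.0, 0.0, 1.0], variance ~0.22 -> flagged as "surprising"
```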

  29. Experiments • Millions of baskets over several years • Two algorithms • Complete MDL approach • MDL segmentation + statistical tests (MStat) • Data set • 2.8 million transactions • 7 years, 1987 to 1993 • 15800 items • Average 2.62 items per basket

  30. Little agreement in itemset ranks • Simpler methods do not approximate MDL

  31. MDL has high selectivity • Scores of the best itemsets stand out from the rest under MDL

  32. Three anecdotes • [Charts: the correlation ratio against time for three itemsets] • High MStat score, small marginals: polo shirt & shorts • High correlation, small % variation: bedsheets & pillow cases • High MDL score, significant gradual drift: men’s & women’s shorts

  33. Conclusion • New notion of surprising patterns based on • Joint support expected from marginals • Variation of joint support along time • Robust MDL formulation • Efficient algorithms • Near-optimal segmentation using shortest path • Pruning criteria • Successful application to real data

  34. References • R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB '94, 487-499, Santiago, Chile, 1994. • S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), 1998. • J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD '00, 1-12, Dallas, TX, May 2000. • J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. (Some of the slides in this talk are taken from this book.) • H. Toivonen. Sampling large databases for association rules. VLDB '96, 134-145, Bombay, India, Sept. 1996.
