Cartesian Contour: A Concise Representation for Frequent Pattern Mining

Cartesian Contour: A Concise Representation for a Collection of Frequent Sets • Ruoming Jin, Kent State University • Joint work with Yang Xiang and Lin Liu (KSU)

Frequent Pattern Mining • Summarizing the underlying datasets, providing key insights • Key building block for data mining toolbox • Association rule mining • Classification • Clustering • Change Detection • etc… • Application Domains • Business, biology, chemistry, WWW, computer/networking security, software engineering, …

The Problem • The number of patterns is too large • Attempts • Maximal Frequent Itemsets • Closed Frequent Itemsets • Non-Derivable Itemsets • Compressed or Top-k Patterns • … • Tradeoff • Significant Information Loss • Large Size

Pattern Summarization • Using a small number of itemsets to best represent the entire collection of frequent itemsets • The Spanning Set Approach [Afrati-Gionis-Mannila, KDD04] • Exact Description = Maximal Frequent Itemsets • Our problem: • Can we find a concise representation which can allow both exact and approximate summarization of a collection of frequent itemsets?

Basic Idea • Frequent itemsets: {A,B,G,H}, {A,B,I,J}, {A,B,K,L}, {C,D,G,H}, {C,D,I,J}, {C,D,K,L}, {E,F,G,H}, {E,F,I,J}, {E,F,K,L} • Listed explicitly: 9 itemsets, 36 items • Covering with a Cartesian product: {{A,B},{C,D},{E,F}} × {{G,H},{I,J},{K,L}} • Cost: 1 biclique, 6 itemsets, 12 items
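The counts on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the function name `expand` is an assumption and does not come from the talk.

```python
from itertools import product

def expand(x_side, y_side):
    """Itemsets generated by the Cartesian product of two collections of itemsets."""
    return [frozenset(x) | frozenset(y) for x, y in product(x_side, y_side)]

x_side = [{"A", "B"}, {"C", "D"}, {"E", "F"}]
y_side = [{"G", "H"}, {"I", "J"}, {"K", "L"}]

itemsets = expand(x_side, y_side)

# Size of the explicit listing.
explicit_items = sum(len(s) for s in itemsets)

# Size of the Cartesian-product representation (both sides together).
compact_itemsets = len(x_side) + len(y_side)
compact_items = sum(len(s) for s in x_side + y_side)

print(len(itemsets), explicit_items)    # 9 36
print(compact_itemsets, compact_items)  # 6 12
```

The single product stores 6 itemsets and 12 items, yet regenerates all 9 itemsets of the explicit listing.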

Cartesian Covering Non-frequent itemsets

Problem Formulation • Cartesian product • e.g. • Cost of a Cartesian product • e.g. 1 biclique, 3 itemsets, and 5 items • Covering • e.g. How can we use Cartesian products to concisely represent a collection of frequent itemsets?

Exact and Approximate Covering Exact Representation Cost: 2 biclique, 4 itemsets, 6 items False positive: none Approximate Representation Cost: 1 biclique, 3 itemsets, 5 items False positive: {G,C},{G,D},{G,C,D}
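False positives of an approximate covering can be computed as the covered itemsets that are not frequent. The sketch below uses hypothetical data: the slide's figure is not available, so the frequent collection (all subsets of {A,B,C,D} plus {G}) and the product {{A,B},{G}} × {{C,D}} are assumptions chosen to reproduce the stated cost (1 biclique, 3 itemsets, 5 items) and the stated false positives.

```python
from itertools import combinations, product

def downward_closure(itemsets):
    """All non-empty subsets of the given itemsets."""
    closure = set()
    for s in itemsets:
        items = sorted(s)
        for k in range(1, len(items) + 1):
            closure.update(frozenset(c) for c in combinations(items, k))
    return closure

def covered_by(x_side, y_side):
    """Itemsets covered by one Cartesian product: every subset of every X-Y union."""
    return downward_closure(frozenset(x) | frozenset(y)
                            for x, y in product(x_side, y_side))

# Hypothetical frequent collection (assumption, not taken from the slide's figure).
frequent = downward_closure([{"A", "B", "C", "D"}, {"G"}])

# Hypothetical approximate representation: 1 biclique, 3 itemsets, 5 items.
approx = covered_by([{"A", "B"}, {"G"}], [{"C", "D"}])

false_positives = approx - frequent   # covered but not frequent
missed = frequent - approx            # frequent but not covered
```

Under these assumptions `false_positives` holds {G,C}, {G,D}, and {G,C,D}, and `missed` is empty, so the cheaper representation over-covers but loses nothing.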

Covering Maximal Frequent Itemsets MNOVWX CDEJKL CDEVWX MNOGHI CDEGHI PQRJKL CDESTU {{GHI}, {JKL}} ABCGHI ABCSTU {{STU}, {VWX}} {{ABC}, {CDE}} {{MNO}, {PQR}}

Problem Reformulation Given Maximal Frequent Itemsets: Exact representation Approximate representation Frequent Itemsets C1 C2 C1 C2

Minimal Biclique Set Cover Problem Ground Set: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 1, 2,3,4,6,7,8,9 5,10,11

NP-hardness • By reducing Minimal Biclique Set Cover to our problem, we can prove that both Problem 1 (exact) and Problem 2 (approximate) are NP-hard. • Minimal Biclique Set Cover is a variant of the classical Set Cover problem • Can we use the standard set-cover greedy algorithm?

Naïve greedy algorithm • Greedy algorithm: • Each time, choose the biclique with the lowest price (its cost relative to the elements it newly covers). • This method has a logarithmic approximation bound. • The problem? • The number of candidate bicliques is 2^(|X|+|Y|)!
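For reference, the standard greedy algorithm for weighted set cover looks roughly like this. It is a generic sketch, not the authors' code: price is taken to be cost divided by the number of newly covered elements, and the candidate names, costs, and element sets are hypothetical.

```python
def greedy_cover(universe, candidates):
    """Standard greedy weighted set cover.

    candidates maps a name to (cost, set of covered elements).
    Returns the chosen names in selection order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_name, best_price = None, None
        for name, (cost, elems) in candidates.items():
            gain = len(elems & uncovered)
            if gain == 0:
                continue
            price = cost / gain  # cost per newly covered element
            if best_price is None or price < best_price:
                best_name, best_price = name, price
        if best_name is None:
            raise ValueError("candidates cannot cover the universe")
        chosen.append(best_name)
        uncovered -= candidates[best_name][1]
    return chosen

# Hypothetical candidates over a ground set of 11 elements.
universe = range(1, 12)
candidates = {
    "b1": (4, {1, 2, 3, 4, 6, 7, 8, 9}),
    "b2": (3, {5, 10, 11}),
    "b3": (5, {1, 2, 5}),
}
chosen = greedy_cover(universe, candidates)
print(chosen)  # ['b1', 'b2']
```

The loop itself is cheap; the difficulty named on the slide is that enumerating every candidate biclique is infeasible. With |X| + |Y| = 40, there are already 2^40 (about 1.1 × 10^12) vertex subsets.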

Candidate Reduction • Assume one side of the biclique candidate is known, how to choose the other side?

Greedy Algorithm Biclique Candidate Split and sort Covering 4 Covering 3 Covering 3 Add 1st single Y-vertex Biclique Add 2nd single Y-vertex Biclique Add 3rd single Y-vertex Biclique Fixed! Cheapest sub-biclique! Cost = 1; Cost = 5/7; Cost = 6/8 > 5/7
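One way to read this slide: with the X side fixed, Y vertices are added one at a time and the procedure stops as soon as the price stops improving. The sketch below follows that reading under stated assumptions: cost counts the itemsets on both sides, price is cost divided by the number of covered targets, and the X-side size (3) and the coverage sets are hypothetical values chosen to reproduce the ratios 1, 5/7, and 6/8 shown on the slide.

```python
from fractions import Fraction

def cheapest_sub_biclique(x_cost, y_coverage):
    """Greedy choice of Y vertices for a fixed X side.

    x_cost: cost of the fixed X side (assumed: its number of itemsets).
    y_coverage: maps each Y vertex to the set of targets it covers.
    Returns (best price, chosen Y vertices).
    """
    covered, chosen = set(), []
    best_price = None
    remaining = dict(y_coverage)
    while remaining:
        # Pick the Y vertex that covers the most not-yet-covered targets.
        y = max(remaining, key=lambda v: len(remaining[v] - covered))
        gain = remaining.pop(y) - covered
        if not gain:
            break
        price = Fraction(x_cost + len(chosen) + 1, len(covered) + len(gain))
        if best_price is not None and price >= best_price:
            break  # price no longer improves: keep the current sub-biclique
        covered |= gain
        chosen.append(y)
        best_price = price
    return best_price, chosen

# Hypothetical candidate: prices evolve as 4/4 = 1, then 5/7, then 6/8 > 5/7.
y_coverage = {"y1": {1, 2, 3, 4}, "y2": {5, 6, 7}, "y3": {7, 8}}
best_price, chosen_y = cheapest_sub_biclique(3, y_coverage)
print(best_price, chosen_y)  # 5/7 ['y1', 'y2']
```

Exact fractions avoid floating-point ties when comparing prices such as 5/7 and 6/8.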

Approximation Bound of the Greedy Algorithm • The greedy SubBiclique procedure finds a sub-biclique whose price is at most e/(e-1) times the price of the optimal (cheapest) sub-biclique!

Further Reduction • Using only Idea 1, the time complexity is still exponential. • How can we reduce this further? • Are all the combinations equally important? • No, because some are more likely to connect to the Y side. • Our solution: Frequent itemset mining!

Using Frequent Itemset Mining

Overall Algorithm • Step 1: Use the frequent itemset mining tool to find all the (one-side maximal) biclique candidates; • Step 2: Compute the cheapest sub-biclique for each candidate using the greedy procedure; • Step 3: Compare all the sub-bicliques and choose the cheapest one; • Step 4: If the MFI are fully covered, done; otherwise go to Step 2.

Approximation Bound • Our algorithm has an approximation ratio of e/(e-1) · (ln n + 1) with respect to the candidate set (all the sub-bicliques with one side coming from the frequent itemset mining).
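To get a feel for the size of this ratio, it can be evaluated numerically. The sketch reads the constant as e/(e-1), about 1.58, which is the usual max-k-cover factor and the only reading that yields a ratio above 1. The function name is hypothetical, and n is taken as on the slide, presumably the number of elements to cover.

```python
import math

def approximation_ratio(n):
    """Evaluate e/(e-1) * (ln n + 1) for a given n >= 1."""
    return math.e / (math.e - 1) * (math.log(n) + 1)

for n in (10, 1000, 100000):
    print(n, round(approximation_ratio(n), 2))
```

The ratio grows only logarithmically in n; for n = 1000 it is roughly 12.5.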

Speed-up techniques (1) • Using closed itemsets for X and Y • Initially, X and Y each contain all the frequent itemsets (FI). • Using a Cartesian product to cover MFI is similar to factorizing MFI; • The maximal factor itemsets of MFI are closed itemsets, whose number is much smaller!
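As a reminder of why closed itemsets are fewer: an itemset is closed when no proper superset has the same support. The brute-force sketch below illustrates this on a hypothetical four-transaction database; it is only meant for toy inputs and is not the mining tool used in the talk.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of frequent itemsets with their supports."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            sup = support(s, transactions)
            if sup >= min_support:
                result[s] = sup
    return result

def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {s: sup for s, sup in frequent.items()
            if not any(s < t and sup == frequent[t] for t in frequent)}

# Hypothetical toy database.
transactions = [frozenset("ABC"), frozenset("ABC"), frozenset("AB"), frozenset("AD")]
frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)
print(len(frequent), len(closed))  # 7 3
```

Here 7 frequent itemsets shrink to 3 closed ones ({A}, {A,B}, {A,B,C}), so seeding X and Y with closed itemsets gives far fewer vertices to consider.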

Speed-up techniques (2) Dense Graph Sparse Graph TRADEOFF Frequent Itemset Supporting Transaction • The number of frequent itemsets is small: valuable biclique candidates are not fully used! • The number of frequent itemsets is big: handling those candidates is too slow!

Speed-up techniques (3) • Iterative procedure • There is a large number of closed itemsets; • Covering MFI in one pass can produce a huge number of biclique candidates; • So cover MFI in several passes; • The support level is reduced gradually!

Experiments • Data sets:

Conclusion • We propose an interesting summarization problem which considers the interaction between frequent patterns • We transform this problem into a generalized minimal biclique covering problem and design an approximation algorithm with a provable bound • The experimental results demonstrate the effectiveness and efficiency of our approach

Thank you !!!

References
[Bayardo98] Roberto J. Bayardo Jr. Efficiently mining long patterns from databases. SIGMOD98.
[Pasquier99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. ICDT99.
[Calder07] Toon Calders and Bart Goethals. Non-derivable itemset mining. Data Min. Knowl. Discov. 07.
[Han02] Jiawei Han, Jianyong Wang, Ying Lu, and Petre Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM02.
[Xin06] Dong Xin, Hong Cheng, Xifeng Yan, and Jiawei Han. Extracting redundancy-aware top-k patterns. KDD06.
[Xin05] Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. Mining compressed frequent-pattern sets. VLDB05.
[Afrati04] Foto Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. KDD04.
[Yan05] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based approach. KDD05.
[Wang06] Chao Wang and Srinivasan Parthasarathy. Summarizing itemset patterns using probabilistic models. KDD06.
[Jin08] Ruoming Jin, Muad Abu-Ata, Yang Xiang, and Ning Ruan. Effective and efficient itemset pattern summarization: regression-based approaches. KDD08.
[Xiang08] Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. KDD08.

Related Work • K-itemset approximation: [Afrati04]. • Difference: • their work is a special case of our work; • their work is expensive for exact description; • our work uses set cover and max-k cover methods. • Restoring the frequency of frequent itemsets: [Yan05, Wang06, Jin08]. • Hyperrectangle covering problem: [Xiang08].
