An Efficient Polynomial Delay Algorithm for Pseudo Frequent Itemset Mining

Takeaki Uno (National Institute of Informatics), Hiroki Arimura (Hokkaido University)
2/Oct/2007, Discovery Science 2007

Frequent Pattern Mining
• Problem of finding all frequently appearing patterns in a (large-scale) database
  database: transaction, tree, string, graph, vector
  pattern: subset, tree, path, sequence, graph, geograph…
[Figure: example databases (experiment records such as "ex1 ●, ex3 ▲"; genome information such as ATGCGCCGTA, TAGCGGGTGG, …) and patterns such as ATGCAT, CCCGGGTAA, GGCGTTA, ATAAGGG]

This Research
• Addresses transaction databases
  transaction database: a database D in which each record (transaction) T is a subset of the itemset E, i.e., ∀T ∈ D, T ⊆ E
  frequent itemset: a subset of E included in at least σ transactions (σ: minimum support threshold)
• Problems
  - too many patterns to find the valuable ones
  - inclusion is strict; to deal with errors, "patterns ambiguously included in many transactions" are important
We introduce an ambiguous inclusion and propose an efficient mining algorithm

Related Works
• Frequent itemset mining with such ambiguity is known as fault-tolerant pattern, degenerate pattern, or soft occurrence mining
  - ambiguity for inclusion: "a pattern is included if the ratio of included items is more than the threshold"
  - another approach: find combinations of an itemset and a transaction set such that few (item, transaction) pairs violate the inclusion relation
  - similarity is used for string matching and homology search
• Little "enumeration type" research with completeness
We look at practical models and algorithms from the viewpoint of algorithm theory

Notations for F.I.M.
• For an itemset K:
  occurrence of K: a transaction of D including K
  Occ(K): occurrence set of K, the set of all occurrences of K
  frq(K): frequency of K, the size of Occ(K)
Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
  Occ({1,2}) = { {1,2,5,6,7,9}, {1,2,7,8,9} }
  Occ({2,7,9}) = { {1,2,5,6,7,9}, {1,2,7,8,9}, {2,7,9} }

Frequent Itemset
• Frequent itemset: an itemset with frequency no less than σ (σ is called the minimum support (threshold))
Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
  Itemsets included in no less than 3 transactions:
  {1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}
Frequent itemset mining: the problem of enumerating all frequent itemsets for a given database D and minimum support σ
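The definitions of Occ(K), frq(K) and frequent itemsets can be checked on the example database with a few lines of Python. This is only a brute-force sketch for illustration (it tries every itemset, which is exponential); the function names `occ`, `frq` and `frequent_itemsets` are chosen here and do not come from the slides.

```python
from itertools import combinations

# Example database D from the slides; each transaction is a frozenset of items.
D = [frozenset(t) for t in (
    [1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
    [1, 7, 9], [2, 7, 9], [2],
)]

def occ(K, D):
    """Occ(K): the transactions of D that include every item of K."""
    return [T for T in D if K <= T]

def frq(K, D):
    """frq(K): the number of occurrences of K."""
    return len(occ(K, D))

def frequent_itemsets(D, sigma):
    """All non-empty itemsets with frequency >= sigma (brute force)."""
    items = sorted(set().union(*D))
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if frq(frozenset(c), D) >= sigma]

for K in frequent_itemsets(D, 3):
    print(sorted(K), frq(K, D))
```

For σ = 3 this lists the same eleven itemsets as the example above.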

Inclusion with Ambiguity
• Ambiguous inclusion relation between an itemset P and a transaction T
• Popular definition: |P∩T| / |P| ≥ θ for a threshold θ < 1
  Examples for θ = 0.6:
  {1,2,3} and {1,2,4,5}: 2/3 ≥ 0.6, included
  {1,2,3,4,5,6,7} and {1,3,5,6,7}: 5/7 ≥ 0.6, included
  {1,2,3} and {1,4,5}: 1/3 < 0.6, not included
  → monotonicity of frequent itemsets is lost
  → there can be a frequent itemset such that "any of its subsets is infrequent"
  → computation is costly
[Figure, θ = 0.6, sets {1,2}, {2,3}, {1,3}, {1,2,3}; labels: "{1,2,3}: included in all", "subset: not for any"]
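The loss of monotonicity under the ratio-based definition can be reproduced with a tiny example. The sketch below evaluates the three θ = 0.6 examples from the slide and then uses a three-transaction database `toy` that is an assumption made here for illustration (it is not stated to be the slide's own figure); the function names are likewise hypothetical.

```python
def ratio_included(P, T, theta):
    """Ratio-based inclusion: |P ∩ T| / |P| >= theta."""
    return len(P & T) / len(P) >= theta

def ratio_frequency(P, D, theta):
    """Number of transactions that include P under the ratio-based relation."""
    return sum(1 for T in D if ratio_included(P, T, theta))

theta = 0.6
print(ratio_included({1, 2, 3}, {1, 2, 4, 5}, theta))
print(ratio_included({1, 2, 3, 4, 5, 6, 7}, {1, 3, 5, 6, 7}, theta))
print(ratio_included({1, 2, 3}, {1, 4, 5}, theta))

# Hypothetical database, chosen only to show non-monotonicity.
toy = [{1, 2}, {2, 3}, {1, 3}]
for P in ({1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1}, {2}, {3}):
    print(sorted(P), ratio_frequency(P, toy, theta))
```

On `toy`, {1,2,3} has frequency 3 while every proper non-empty subset has frequency at most 2, so for σ = 3 the only frequent itemset has no frequent subset and a backtrack search that prunes on infrequent subsets would miss it.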

k-pseudo Inclusion
• Use a threshold on the number of non-included items:
  k-pseudo inclusion: |P＼T| ≤ k for a threshold k ≥ 0
  (likewise k-pseudo occurrence / occurrence set / frequency)
  → monotonicity is kept
  → able to find characterizations such as "many transactions include at least 3 items of P"
Examples for k = 1:
  {1,2,3} and {1,2,4,5}: |P＼T| = 1, included
  {1,2,3,4,5,6,7} and {1,3,5,6,7}: |P＼T| = 2, not included
  {1,2,3} and {1,4,5}: |P＼T| = 2, not included
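A minimal sketch of the k-pseudo inclusion test, applied to the three k = 1 examples, together with a brute-force check that inclusion of P carries over to every subset of P. The helper name `pseudo_included` is an assumption made for this illustration.

```python
from itertools import combinations

def pseudo_included(P, T, k):
    """k-pseudo inclusion: at most k items of P are missing from T."""
    return len(P - T) <= k

k = 1
examples = [
    ({1, 2, 3}, {1, 2, 4, 5}),
    ({1, 2, 3, 4, 5, 6, 7}, {1, 3, 5, 6, 7}),
    ({1, 2, 3}, {1, 4, 5}),
]
for P, T in examples:
    print(sorted(P), sorted(T), len(P - T), pseudo_included(P, T, k))

def subsets_also_included(P, T, k):
    """Monotonicity check: every subset Q of P satisfies |Q \\ T| <= |P \\ T|."""
    return all(pseudo_included(set(Q), T, k)
               for r in range(len(P) + 1)
               for Q in combinations(sorted(P), r))

print(subsets_also_included({1, 2, 3}, {1, 2, 4, 5}, k))
```

Removing an item from P can only shrink P＼T, which is why the relation is monotone while the ratio-based one is not.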

k Pseudo Frequent Itemset
• k-pseudo frequent itemset: an itemset k-pseudo included in at least σ transactions of D
Ex.) D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
  1-pseudo frequent itemsets for σ = 3 (the list shows those of size 3 and 4):
  {1,2,3} {1,2,4} {1,2,5} {1,2,7} {1,2,9} {1,3,7} {1,3,9} {1,4,7} {1,4,9} {1,5,7} {1,5,9} {1,6,7} {1,6,9} {1,7,8} {1,7,9} {1,8,9}
  {2,3,7} {2,3,9} {2,4,7} {2,4,9} {2,5,7} {2,5,8} {2,5,9} {2,6,7} {2,6,9} {2,7,8} {2,7,9} {2,8,9}
  {3,7,9} {4,7,9} {5,7,9} {6,7,9} {7,8,9}
  {1,2,7,9} {1,3,7,9} {1,4,7,9} {1,5,7,9} {1,6,7,9} {1,7,8,9} {2,3,7,9} {2,4,7,9} {2,5,7,9} {2,6,7,9} {2,7,8,9}
→ Many trivial patterns. How to enumerate them efficiently?

Enumeration using Monotonicity
• Pseudo frequent itemsets have the monotone property, so a simple backtrack algorithm works
• For each k-pseudo frequent itemset P, compute the k-pseudo frequency of each P+e
• If the k-pseudo frequency of P+e is no less than σ, make a recursive call to enumerate the k-pseudo frequent itemsets including P+e
→ Polynomial time enumeration. How to compute it efficiently?
[Figure: lattice of the itemsets over {1,2,3,4}, from φ (000…0) up to {1,2,3,4} (111…1), with a region labeled "freq"]
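The backtrack enumeration described above can be sketched as follows. To avoid generating an itemset twice, this sketch only adds items larger than the last one added (tail extension), which is one common way to organize a backtrack search and is an assumption here rather than something the slide spells out. Frequencies are recomputed from scratch for clarity, not speed.

```python
# Example database D from the slides.
D = [frozenset(t) for t in (
    [1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
    [1, 7, 9], [2, 7, 9], [2],
)]

def pseudo_frequency(P, D, k):
    """Number of transactions T with |P \\ T| <= k."""
    return sum(1 for T in D if len(P - T) <= k)

def enumerate_pseudo_frequent(D, k, sigma):
    """Backtrack over itemsets; prune as soon as P+e is not k-pseudo frequent."""
    items = sorted(set().union(*D))
    found = []

    def backtrack(P, start):
        found.append(P)
        for i in range(start, len(items)):
            Q = P | {items[i]}
            # Monotonicity: if Q is infrequent, so is every superset of Q.
            if pseudo_frequency(Q, D, k) >= sigma:
                backtrack(Q, i + 1)

    if pseudo_frequency(frozenset(), D, k) >= sigma:
        backtrack(frozenset(), 0)
    return found

result = enumerate_pseudo_frequent(D, k=1, sigma=3)
for P in sorted(result, key=lambda s: (len(s), sorted(s))):
    if len(P) == 4:
        print(sorted(P))
```

With k = 1 and σ = 3 the itemsets of size 4 printed here are the eleven four-item sets listed on the previous slide; the result also contains the empty set and the small, trivial itemsets.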

Computing k-Pseudo Occurrences
• Define Occ_{=h}(P) = { T∈D | |P＼T| = h }, the set of transactions missing exactly h items of P
  → Occ_{≤k}(P) = ∪_{h≤k} Occ_{=h}(P)
• Occ_{=h}(P∪e) = (Occ_{=h}(P) ∩ Occ(e)) ∪ (Occ_{=h-1}(P) ＼ Occ(e))
  → the pseudo occurrence set is updated by taking intersections
• Compute Occ_{=h}(P) ∩ Occ(e) for all pairs of e and h
[Figure: transactions over items A to G grouped into Occ0, Occ1, Occ2 according to how many items of P they miss]
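The update rule can be sketched as a function that takes the levels Occ_{=0}(P), …, Occ_{=k}(P) and the occurrence set of the new item e. Transactions are represented by their index in D; the names `initial_levels`, `extend_levels` and `direct_levels` are assumptions made for this illustration.

```python
# Example database D from the slides.
D = [frozenset(t) for t in (
    [1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
    [1, 7, 9], [2, 7, 9], [2],
)]

def occ_ids(e, D):
    """Occ(e) as a set of transaction indices."""
    return {i for i, T in enumerate(D) if e in T}

def initial_levels(D, k):
    """Levels for the empty itemset: every transaction misses 0 items."""
    return [set(range(len(D)))] + [set() for _ in range(k)]

def extend_levels(levels, occ_e):
    """Occ_{=h}(P+e) = (Occ_{=h}(P) & Occ(e)) | (Occ_{=h-1}(P) - Occ(e))."""
    new_levels = []
    for h, ids in enumerate(levels):
        kept = ids & occ_e                                   # contain e: still miss h items
        moved = levels[h - 1] - occ_e if h > 0 else set()    # lack e: now miss h items
        new_levels.append(kept | moved)
    return new_levels

def direct_levels(P, D, k):
    """Reference computation straight from the definition."""
    return [{i for i, T in enumerate(D) if len(P - T) == h} for h in range(k + 1)]

k = 1
levels = initial_levels(D, k)
for e in (1, 2, 3):
    levels = extend_levels(levels, occ_ids(e, D))
print(levels)
print(direct_levels(frozenset({1, 2, 3}), D, k))
```

Transactions that would move past level k are simply dropped, so the k-pseudo frequency of P is the total size of the levels.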

Taking Intersections Efficiently
• Occ_{=h}(P∪e) = (Occ_{=h}(P) ∩ Occ(e)) ∪ (Occ_{=h-1}(P) ＼ Occ(e))
  → has the same properties as usual occurrence sets
  → many existing techniques for updating occurrence sets can be used (down project, delivery, bitmap…)
• Database reduction (FP-tree) is also available
• In deeper levels of the recursion, fewer transactions have to be scanned, so the computation is fast
Example database, by transaction:
  A: 1,2,5,6,7,9  B: 2,3,4,5  C: 1,2,7,8,9  D: 1,7,9  E: 2,7,9  F: 2
and by item:
  1: A,C,D  2: A,B,C,E,F  3: B  4: B  5: A,B  6: A  7: A,C,D,E  8: C  9: A,C,D,E
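One way to obtain Occ_{=h}(P) ∩ Occ(e) for all pairs of e and h in a single scan is a delivery-style pass: each transaction in Occ_{=h}(P) is appended to the bucket of every item it contains. The sketch below is an illustration under that reading, with hypothetical function names. It also uses a consequence of the update rule: transactions missing fewer than k items of P stay in the occurrence set whatever e is, so the k-pseudo frequency of P+e equals Σ_{h<k} |Occ_{=h}(P)| + |Occ_{=k}(P) ∩ Occ(e)|.

```python
from collections import defaultdict

# Example database from the slide, keyed by transaction name.
D = {
    "A": {1, 2, 5, 6, 7, 9}, "B": {2, 3, 4, 5}, "C": {1, 2, 7, 8, 9},
    "D": {1, 7, 9}, "E": {2, 7, 9}, "F": {2},
}

def vertical_layout(D):
    """Item -> names of the transactions containing it (the 'by item' table)."""
    vertical = defaultdict(set)
    for name, T in D.items():
        for e in T:
            vertical[e].add(name)
    return dict(vertical)

def levels_of(P, D, k):
    """Occ_{=h}(P) for h = 0..k, computed from the definition."""
    return [{n for n, T in D.items() if len(P - T) == h} for h in range(k + 1)]

def deliver(D, levels, P):
    """buckets[e][h] = Occ_{=h}(P) & Occ(e) for every item e not in P."""
    buckets = defaultdict(lambda: [set() for _ in levels])
    for h, names in enumerate(levels):
        for name in names:
            for e in D[name]:
                if e not in P:
                    buckets[e][h].add(name)
    return buckets

def extension_frequencies(D, P, k):
    """k-pseudo frequency of P+e for every item e outside P."""
    levels = levels_of(P, D, k)
    buckets = deliver(D, levels, P)
    base = sum(len(names) for names in levels[:k])   # transactions missing < k items
    candidates = set().union(*D.values()) - P
    return {e: base + len(buckets[e][k]) for e in candidates}

print(vertical_layout(D)[7])
print(extension_frequencies(D, {1, 2}, k=1))
```

For P = {1,2} and k = 1, transactions A and C miss nothing and B, D, E, F miss one item, so for instance P+7 has 1-pseudo frequency 2 + |{D, E}| = 4.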

Using Bottom-wideness
• Backtrack (depth-first search) generates several recursive calls in each iteration
  → the computation tree spreads exponentially going down
  → the computation time is dominated by the bottom-level iterations of the recursion tree
• Since few occurrences have to be computed in the lower levels, the amortized computation time is reduced to that of the bottom levels
[Figure: recursion tree; iterations near the top take a long time, iterations near the bottom a short time]

For Large Minimum Support
• When σ is large, many transactions are accessed even on the bottom levels
  → the improvement by bottom-wideness is not drastic
• Reduce the database to speed up the bottom levels
  (1) delete items less than the maximum item in P
  (2) delete items that are infrequent on the occurrence set database (they are never added in the recursive calls)
  (3) unify identical transactions
• In practice, the database size is constant in the bottom levels
  → no big difference from small σ
[Figure: example of the reduction for P={1,3}, k=1, σ=4]
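The three reduction rules can be sketched as one function that returns the reduced occurrence database as weighted transactions. This is an illustration under several assumptions made here: items are added in increasing order (so items up to max(P) are no longer needed), an item e counts as infrequent when P+e is not k-pseudo frequent, and each reduced transaction keeps the number h of items of P it misses, so only transactions with the same h are merged. The example uses the slides' database D with σ = 3, not the data of the slide's figure.

```python
from collections import Counter

# Example database D from the slides.
D = [frozenset(t) for t in (
    [1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
    [1, 7, 9], [2, 7, 9], [2],
)]

def reduce_database(D, P, k, sigma):
    """Return Counter{(h, reduced transaction): multiplicity} for Occ_{<=k}(P)."""
    top = max(P)
    # Keep only k-pseudo occurrences of P, tagged with their number of misses h.
    occ = [(len(P - T), T) for T in D if len(P - T) <= k]
    # Frequency of P+e = (#transactions with h < k) + (#transactions with h = k containing e).
    below_k = sum(1 for h, _ in occ if h < k)
    support_at_k = Counter(e for h, T in occ if h == k for e in T)
    all_items = set().union(*(T for _, T in occ))
    keep = {e for e in all_items
            if e > top                                   # rule (1)
            and below_k + support_at_k[e] >= sigma}      # rule (2)
    # Rule (3): identical reduced transactions on the same level are unified.
    return Counter((h, frozenset(T & keep)) for h, T in occ)

reduced = reduce_database(D, frozenset({1, 3}), k=1, sigma=3)
for (h, T), weight in reduced.items():
    print(h, sorted(T), weight)
```

Here four transactions 1-pseudo include P = {1,3}, only items 7 and 9 survive, and three of the four reduced transactions collapse into a single {7,9} with weight 3.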

Small & Trivial Patterns
• Under k-pseudo inclusion, every itemset of size no more than k is included in any transaction
• Itemsets of size a bit greater than k are also included in many transactions
  → many small and trivial frequent itemsets
• We want to ignore these itemsets in practice
  → consider the problem of directly finding pseudo frequent itemsets of size l

Directly Finding Large Itemset
• Searching all itemsets of size l needs exponential time
  → pruning unnecessary search is crucial
  → take candidates according to partial structure
• Let P be a k-pseudo frequent itemset of size l
• WLOG, P = {1,…,l}, sorted in increasing order of |Occ_{=k}(P)＼Occ({e})|
• Consider the (k-1)-pseudo frequency of the itemset {1,…,y}
• Any transaction in Occ_{=k}(P)＼Occ({e}) with e>y (k-1)-pseudo includes {1,…,y}, since it misses e and at most k-1 other items of P

Search Route to Itemset of Size l
• Any transaction in Occ_{=k}(P)＼Occ({e}) with e>y (k-1)-pseudo includes {1,…,y}
  → |Occ_{≤k-1}({1,…,y})| ≥ |∪_{e=y+1,…,|P|} (Occ_{=k}(P)＼Occ({e}))|
• The average of |Occ_{=k}(P)＼Occ({e})| over e ∈ P is no less than (k / |P|) |Occ_{=k}(P)|
• The items 1,…,|P| are sorted in increasing order of |Occ_{=k}(P)＼Occ({e})|
  → |Occ_{≤k-1}({1,…,y})| ≥ |Occ_{=k}(P)| × (|P|-y) / |P|   (partial frequency condition)
There is a sequence of itemsets from the empty set to P composed only of itemsets satisfying the partial frequency condition
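The counting argument behind the partial frequency condition can be written out step by step. The derivation below is a sketch that fills in the intermediate steps under the reading that the sets on the right-hand side are Occ_{=k}(P) and the left-hand side is the (k-1)-pseudo occurrence set, with k ≥ 1; the shorthand m_e = |Occ_{=k}(P)＼Occ({e})| is introduced here and is not on the slides.

```latex
\begin{align*}
% Every transaction in Occ_{=k}(P) misses exactly k items of P.
\sum_{e \in P} m_e &= k\,\lvert Occ_{=k}(P)\rvert,
  \qquad m_e = \lvert Occ_{=k}(P) \setminus Occ(\{e\})\rvert \\
% Items are sorted by increasing m_e, so the last |P|-y terms are the largest ones.
\sum_{e > y} m_e &\ge \frac{\lvert P\rvert - y}{\lvert P\rvert} \sum_{e \in P} m_e
  = \frac{k\,(\lvert P\rvert - y)}{\lvert P\rvert}\,\lvert Occ_{=k}(P)\rvert \\
% A transaction misses only k items, so it lies in at most k of the sets in the union.
\Bigl\lvert \bigcup_{e > y} \bigl(Occ_{=k}(P) \setminus Occ(\{e\})\bigr) \Bigr\rvert
  &\ge \frac{1}{k} \sum_{e > y} m_e
  \ge \frac{\lvert P\rvert - y}{\lvert P\rvert}\,\lvert Occ_{=k}(P)\rvert \\
% Each transaction in the union (k-1)-pseudo includes {1,...,y}.
\lvert Occ_{\le k-1}(\{1,\dots,y\})\rvert
  &\ge \Bigl\lvert \bigcup_{e > y} \bigl(Occ_{=k}(P) \setminus Occ(\{e\})\bigr) \Bigr\rvert
  \ge \frac{\lvert P\rvert - y}{\lvert P\rvert}\,\lvert Occ_{=k}(P)\rvert
\end{align*}
```

The first line also shows that the average of m_e over e ∈ P is exactly (k / |P|) |Occ_{=k}(P)|, which is consistent with the "no less than" statement above.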

Example for Partial Frequency Condition
D = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
• Itemsets satisfying the partial frequency condition, for k=1, σ=3, l=3:
  {1} {2} {5} {7} {9}
  {1,2} {1,5} {1,6} {1,7} {1,8} {1,9} {2,3} {2,4} {2,5} {2,6} {2,7} {2,8} {2,9} {3,5} {4,5} {5,6} {5,7} {5,9} {6,7} {6,9} {7,8} {7,9} {8,9}
→ The number of frequent itemsets to be searched decreases, so an efficient search is expected

Restricted Search Route by P.F.C.
• Any k-pseudo frequent itemset of size l can be found by passing only through itemsets satisfying the partial frequency condition
  → let's do a backtrack search
• There always exists an item whose removal yields an itemset satisfying the condition
• Tail extension is not available (removing the tail may violate the condition)
• Simple hill climbing generates duplicates
• So, use a generation rule to avoid duplication (reverse search)

Reverse Search for P.F.C.
• Rule: generate itemset P only from the P＼{e} maximizing |Occ_{≤k-1}(P＼{e})| (ties are broken by choosing the minimum index)
ReverseSearch (P)
1. if |P| = l then output P; return
2. for each e ∉ P do
     if P+e is a k-pseudo frequent itemset satisfying P.F.C. then
       if e is the item e' of P+e maximizing |Occ_{≤k-1}((P+e)＼{e'})| then ReverseSearch (P+e)
3. end for
• |Occ_{≤k-1}(P＼{e})| can be efficiently computed by existing methods
  → O(|P|×||D||) time for one iteration
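The parent rule of the reverse search can be sketched in Python as follows. Several points are assumptions made for this illustration and are not confirmed by the slides: the partial frequency condition is tested as |Occ_{≤k-1}(Q)| ≥ σ (l - |Q|) / l, that is, with σ in place of the occurrence count of the unknown target itemset; k ≥ 1; and frequencies are recomputed from scratch instead of being updated incrementally. The sketch shows how the rule rules out duplicates; it does not claim to reproduce the authors' exact procedure or its completeness guarantee.

```python
# Example database D from the slides.
D = [frozenset(t) for t in (
    [1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
    [1, 7, 9], [2, 7, 9], [2],
)]

def pseudo_frequency(P, D, k):
    """Number of transactions T with |P \\ T| <= k."""
    return sum(1 for T in D if len(P - T) <= k)

def parent(Q, D, k):
    """Drop the item e maximizing |Occ_{<=k-1}(Q - {e})|; ties go to the smallest e."""
    # max() returns the first maximal element, so sorting gives the minimum-index tie-break.
    best = max(sorted(Q), key=lambda e: pseudo_frequency(Q - {e}, D, k - 1))
    return Q - {best}

def satisfies_pfc(Q, D, k, sigma, l):
    """ASSUMED form of the condition: |Occ_{<=k-1}(Q)| >= sigma * (l - |Q|) / l."""
    return pseudo_frequency(Q, D, k - 1) * l >= sigma * (l - len(Q))

def reverse_search(P, D, k, sigma, l, items, out):
    if len(P) == l:
        out.append(P)
        return
    for e in items:
        if e in P:
            continue
        Q = P | {e}
        if pseudo_frequency(Q, D, k) >= sigma and satisfies_pfc(Q, D, k, sigma, l):
            # Recurse only if P is the unique parent of Q, so Q is generated once.
            if parent(Q, D, k) == P:
                reverse_search(Q, D, k, sigma, l, items, out)

found = []
reverse_search(frozenset(), D, k=1, sigma=3, l=3, items=range(1, 10), out=found)
for P in sorted(found, key=sorted):
    print(sorted(P))
```

Because every itemset has exactly one parent under the rule, each output is reached along a single route from the empty set; for example {1,7,9} is generated via {9} and {7,9} only.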

Conclusion
• Introduced an ambiguous inclusion relation in which at most k items of the pattern are not included
• Pseudo frequent itemset mining under this inclusion (monotonicity, intersection, many small and trivial patterns)
• Reverse search for directly finding frequent itemsets of a fixed size
Future works
• implementation and experiments
• extension of the technique to other pattern mining problems
• approach to inclusion with "ratio r %"
