SECURED OUTSOURCING OF FREQUENT ITEMSET MINING

SECURED OUTSOURCING OF FREQUENT ITEMSET MINING Hana Chih-Hua Tai Institute of Information Science, Academia Sinica

OUTLINE • Preliminary – Frequent ItemSet Mining • Motivation • Privacy Model – K-Support Anonymity • Algorithm • Performance Studies • Conclusion

OUTLINE • Preliminary – Frequent ItemSet Mining • Motivation

FREQUENT ITEMSET MINING (FIM) • Discover what happened frequently When threshold set as 3 (=60%), {wine} and {cigar} are frequent. When threshold set as 2 (=40%), {wine}, {cigar}, {tea}, {beer}, {wine, cigar}, and {wine, beer} are frequent.

FREQUENT ITEMSET MINING (FIM) • Discover what happened frequently • Frequent itemset mining (FIM) When threshold set as 3 (=60%), {wine} and {cigar} are frequent. When threshold set as 2 (=40%), {wine}, {cigar}, {tea}, {beer}, {wine, cigar}, and {wine, beer} are frequent.

THE NEEDS OF OUTSOURCING FIM • For those who lack of expertise in FIM and/or computing resources, they have the need of outsourcing the mining tasks to a professional third party. Data Owner Mining Services Provider (Cloud Computing)

THE NEEDS OF OUTSOURCING FIM • For those who lack of expertise in FIM and/or computing resources, they have the need of outsourcing the mining tasks to a professional third party. Data Owner Privacy?! Mining Services Provider (Cloud Computing)

THE RISKS OF OUTSOURCING FIM Data Owner • Encryption/decryption method is believed as the possible solution. Mining Services Provider (Cloud Computing)

THE RISKS OF OUTSOURCING FIM Data Owner • Encryption/decryption method is believed as the possible solution. How to achieve the encryption and decryption? • Privacy protected Mining Services Provider (Cloud Computing) • Correct mining results • Reasonable overhead

THE RISKS OF OUTSOURCING FIM Encrypt

THE RISKS OF OUTSOURCING FIM • Top frequency attack • Wine is the most frequent item  ‘a’ is ‘wine’ • Approximate support attack • The support of cigar is about 55%~60% ‘c’ is ‘cigar’ Encrypt

THE RISKS OF OUTSOURCING FIM The support information about the frequent itemsets can be utilized to effectively reveal the raw data as well as the sensitive information from the anonymized transactions. T. Mielik¨ainen. Privacy problems with anonymized transaction databases. In Proc. of Discovery Science, 2004. • Top frequency attack • Wine is the most frequent item  ‘a’ is ‘wine’ • Approximate support attack • The support of cigar is about 55%~60% ‘c’ is ‘cigar’ The Risks of Outsourcing FIM Encrypt

RELATED WORKS • Encrypt each real items by a one-many mapping function. Wong, W. K., Cheung, D. W., Hung, E., Kao, B., Mamoulis, N.: Security in Outsourcing of Association Rule Mining. In: Proc. of VLDB, 2007. • However, it does not try to anonymize the support information. • Recently it is cracked. Molloy, I., Li, N., Li, T.: On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining. In: Proc. of ICDM, 2009.

OUTLINE • Preliminary – Frequent ItemSet Mining • Motivation • Privacy Model – K-Support Anonymity

K-SUPPORT ANONYMITY &ANONYMIZATION • For every sensitive item, there are at least k-1 other items of the same support. • The probability of an item being correctly re-identified is limited to 1/k, even when the precise support information is known. • Given a transactional database T, encrypt T into E(T) such that • There exist a decryption function D such that MiningResult(T, Δ)= D(MiningResult(E(T), Δ)), for any minimal support Δ. • E(T) is k-support anonymous.

SOLUTION 1: A NAÏVE APPROACH • For each set of real items of the same support, add enough fake items randomly into transactions to make the fake items as frequent as real ones. For k = 3, 16 additional items are required. 4 x 2 = 8 (e, f) for wine 3 x 2 = 6 (g, h) for cigar 2 x 1 = 2 (i) for beer and tea

A NAÏVE SOLUTION • For each set of real items of the same support, add enough fake items randomly into transactions to make the fake items as frequent as real ones. There could be too large storage overhead when k is large. For k = 3, 16 additional items are required. 4 x 2 = 8 (e, f) for wine 3 x 2 = 6 (g, h) for cigar 2 x 1 = 2 (i) for beer and tea

all prod. GENERALIZED FIM beverage cigar alcoholic tea • Discover all frequent items across concept levels, given a taxonomy indicating the hierarchical concepts between items beer wine When threshold set as 3 (=60%), {wine}, {cigar}, {alcoholic}, {beverage} and {all prod.} are frequent. {beverage, cigar} are also frequent.

all prod. GENERALIZED FIM beverage cigar alcoholic tea • Discover all frequent items across concept levels, given a taxonomy indicating the hierarchical concepts between items beer wine When threshold set as 3 (=60%), {wine}, {cigar}, {alcoholic}, {beverage} and {all prod.} are frequent. {beverage, cigar} are also frequent. • 1. The support of a parent node comes from the • supports of it child nodes. • 2. Only lead nodes need to appear in the transactions.

OUTLINE • Preliminary – Frequent ItemSet Mining • Motivation • Privacy Model – K-Support Anonymity • Algorithm

ANONYMIZATION: OVERVIEW • For storage efficiency, we suggest to convert FIM to GFIM. Data Owner Third Party Encrypt Transaction Data Encrypted Transaction Data Pseudo Taxonomy Transaction Data Pseudo Taxonomy Generation in the Encryption Generalized Frequent Itemset Mining Frequent Itemsets Decrypt Frequent Itemsets

ANONYMIZATION: STORAGE EFFICIENCY • In GFIM, items can be at multiple levels of a taxonomy and only the items at leaf level need to appear in the database. 1 2 1 Encrypt with k=3 wine {e, f, j} cigar {b, c, d} beer and tea {a, g, h} 4 additional items required k j i e f g h wine tea a b c d beer cigar

ANONYMIZATION: STORAGE EFFICIENCY • In GFIM, items can be at multiple levels of a taxonomy and only the items at leaf level need to appear in the database. Small storage overhead compared to the naïve method. 1 2 1 Encrypt with k=3 wine {e, f, j} cigar {b, c, d} beer and tea {a, g, h} 4 additional items required k j i e f g h wine tea a b c d beer cigar

ANONYMIZATION: EASY DECRYPTION • The real frequent itemsets can be obtained by filtering out patterns containing any fake item in 1 scan of the returned results. min_sup = 2 Results = {{beer}, {cigar}, {wine}, {tea}, {beer, wine}, {cigar, wine}} Results = {a, b, c, d, e, f, g, h, i, j, k, ac, af, bf, ce, …} k i j e f g h wine tea a b c d beer cigar

ANONYMIZATION: EASY DECRYPTION • The real frequent itemsets can be obtained by filtering out patterns containing any fake item in 1 scan of the returned results. The data owner can obtain the real results in 1 scan of the returned itemsets. min_sup = 2 Results = {{beer}, {cigar}, {wine}, {tea}, {beer, wine}, {cigar, wine}} Results = {a, b, c, d, e, f, g, h, i, j, k, ac, af, bf, ce, …} k i j e f g h wine tea a b c d beer cigar

ANONYMIZATION: ENCRYPTION The problem is how to build the taxonomy and encrypt T for k-support anonymity. Encrypt with k=3 k i j e f g h wine tea a b c d beer cigar

ANONYMIZATION: ENCRYPTION • 1: Generalization of the Mining Task • To generate a pseudo taxonomy that can • (a) conserve the correct and complete mining results, • (b) facilitate k-support anonymization. • 2: Anonymization with Taxonomy Tree • To encrypt T for k-support anonymity with the help of the constructed taxonomy tree.

1: GENERALIZATION OF THE MINING TASK • Build a k-bud tree of T • All real items at the leaf level • The number of nodes in three categories is equal to or greater than k Let xM denote the most frequent real item in T • A> = { v| sup(v) > sup(xM) and vis leaf}, • A= = { v| sup(v) = sup(xM)}, and • A< = { v| sup(v) < sup(xM) < sup(u), where u is the parent node of v }. 5 2 5 (tea) 4 4 (wine) 2 3 (beer) (cigar) 3-bud tree

1: GENERALIZATION OF THE MINING TASK 3 groups beer cigar wine tea

1: GENERALIZATION OF THE MINING TASK 3 subtrees 4 4 2 3 2 (beer) (cigar) (wine) (tea)

1: GENERALIZATION OF THE MINING TASK Iteratively connect a subtree which sup(root) ≧ sup(wine) with the other subtree 5 2 (tea) 4 4 (wine) (beer) 2 3 (cigar)

1: GENERALIZATION OF THE MINING TASK 5 3 bud-tree 2 5 (tea) 4 4 (wine) 2 3 (beer) (cigar)

2: ANONYMIZATION WITH TAXONOMY TREE • Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity • Insertion • Split • Increase

2: ANONYMIZATION WITH TAXONOMY TREE • Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity • Insertion (Ex.) • Split • Increase u u v p sup(v) < target-sup < sup(u) p: the node with target support q: randomly select sup(p) – sup(v) transactions from T(u) – T(v) T(x) is the set of transactions containing the item x. v q sup(u) and sup(v) should not be changed.

2: ANONYMIZATION WITH TAXONOMY TREE For wine 5 y 4 5 x 2 5 2 2 (tea) p1 (tea) 4 4 (wine) 2 3 3-bud tree (beer) (cigar) insertion

2: ANONYMIZATION WITH TAXONOMY TREE v • Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity • Insertion • Split (Ex.) • Increase v p q target-sup < sup(v) p: randomly select target-sup transactions from T(v) q: T(p) = T(v) – T(q) T(x) is the set of transactions containing the item x. sup(v) should not be change. Split operation can raise up leaf nodes to internal nodes!

2: ANONYMIZATION WITH TAXONOMY TREE For wine For cigar 5 5 4 (wine) 5 3 1 y 4 5 x 2 5 2 2 (tea) p1 (tea) 4 4 (wine) p2 p3 2 3 3-bud tree (beer) (cigar) insertion split

2: ANONYMIZATION WITH TAXONOMY TREE u u • Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity • Insertion • Split • Increase (Ex.) v v randomly select target-sup – sup(v) transactions from T(u) – T(v) sup(v) < target-sup sup(v) should not be changed. So, Increase operation is applicable only on node that does not belong to any anonymous group!

2: ANONYMIZATION WITH TAXONOMY TREE 5 5 4 For cigar For wine For cigar (wine) 3 5 5 4 (wine) 5 3 1 y 4 5 x 2 5 2 2 (tea) p1 (tea) 4 4 (wine) p2 p3 p3 2 3 3-bud tree (beer) (cigar) insertion split increase

2: ANONYMIZATION WITH TAXONOMY TREE 5 5 3-support anonymity 4 For cigar For wine For cigar (wine) 3 5 5 k 5 4 5 i j 4 2 2 4 4 (wine) f (wine) h (tea) e g 5 3 1 y 4 5 2 3 3 3 x a (beer) b (cigar) c d 2 5 2 2 (tea) p1 (tea) 4 4 (wine) p2 p3 p3 2 3 3-bud tree (beer) (cigar) insertion split increase

OUTLINE • Preliminary – Frequent ItemSet Mining • Motivation • Privacy Model – K-Support Anonymity • Algorithm • Performance Studies • Conclusion

PERFORMANCE STUDIES • Data sets • Retail dataset • 88162 transactions with 2117 different items • T10I1kD100k dataset • 100k transactions with 1000 different items • Security • Against precise item support attacks • Against precise itemset support attacks • Storage overhead • Execution efficiency

SECURITY • Against precise item support attacks • Item accuracy: The ratio of items being re-identified • DB accuracy: The avg. ratio of items in a transaction being re-identified 43 (a) Retail dataset (b) T10I1kD100k dataset

SECURITY • Against precise itemset support attacks • Item accuracy: The ratio of items being re-identified • DB accuracy: The avg. ratio of items in a transaction being re-identified 44 (a) Retail dataset (b) T10I1kD100k dataset

STORAGE OVERHEAD & EXECUTION EFFICIENCY (a) Retail dataset (a) Retail dataset (b) T10I1kD100k dataset (b) T10I1kD100k dataset

SUMMARY • We proposed k-support anonymity to enhance the privacy protection in outsourcing of frequent itemset mining (FIM). • For storage efficiency, we transformed FIM to GFIM, and proposed a taxonomy-based anonymization algorithm. • Our method allows the data owner to obtain the real frequent itemsets in 1 scan of the returned results. • Experimental results on both real and synthetic data sets showed that our method can achieve very good privacy protection with moderate storage overhead.

Q & A

SECURED OUTSOURCING OF FREQUENT ITEMSET MINING