1 / 47

SECURED OUTSOURCING OF FREQUENT ITEMSET MINING

SECURED OUTSOURCING OF FREQUENT ITEMSET MINING. Hana Chih-Hua Tai Institute of Information Science, Academia Sinica. OUTLINE. Preliminary – Frequent ItemSet Mining Motivation Privacy Model – K-Support Anonymity Algorithm Performance Studies Conclusion. OUTLINE.

jock
Download Presentation

SECURED OUTSOURCING OF FREQUENT ITEMSET MINING

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SECURED OUTSOURCING OF FREQUENT ITEMSET MINING Hana Chih-Hua Tai Institute of Information Science, Academia Sinica

  2. OUTLINE • Preliminary – Frequent ItemSet Mining • Motivation • Privacy Model – K-Support Anonymity • Algorithm • Performance Studies • Conclusion

  3. OUTLINE • Preliminary – Frequent ItemSet Mining • Motivation

  4. FREQUENT ITEMSET MINING (FIM) • Discover what happened frequently When threshold set as 3 (=60%), {wine} and {cigar} are frequent. When threshold set as 2 (=40%), {wine}, {cigar}, {tea}, {beer}, {wine, cigar}, and {wine, beer} are frequent.

  5. FREQUENT ITEMSET MINING (FIM) • Discover what happened frequently • Frequent itemset mining (FIM) When threshold set as 3 (=60%), {wine} and {cigar} are frequent. When threshold set as 2 (=40%), {wine}, {cigar}, {tea}, {beer}, {wine, cigar}, and {wine, beer} are frequent.

  6. THE NEEDS OF OUTSOURCING FIM • For those who lack of expertise in FIM and/or computing resources, they have the need of outsourcing the mining tasks to a professional third party. Data Owner Mining Services Provider (Cloud Computing)

  7. THE NEEDS OF OUTSOURCING FIM • For those who lack of expertise in FIM and/or computing resources, they have the need of outsourcing the mining tasks to a professional third party. Data Owner Privacy?! Mining Services Provider (Cloud Computing)

  8. THE RISKS OF OUTSOURCING FIM Data Owner • Encryption/decryption method is believed as the possible solution. Mining Services Provider (Cloud Computing)

  9. THE RISKS OF OUTSOURCING FIM Data Owner • Encryption/decryption method is believed as the possible solution. How to achieve the encryption and decryption? • Privacy protected Mining Services Provider (Cloud Computing) • Correct mining results • Reasonable overhead

  10. THE RISKS OF OUTSOURCING FIM Encrypt

  11. THE RISKS OF OUTSOURCING FIM • Top frequency attack • Wine is the most frequent item  ‘a’ is ‘wine’ • Approximate support attack • The support of cigar is about 55%~60% ‘c’ is ‘cigar’ Encrypt

  12. THE RISKS OF OUTSOURCING FIM The support information about the frequent itemsets can be utilized to effectively reveal the raw data as well as the sensitive information from the anonymized transactions. T. Mielik¨ainen. Privacy problems with anonymized transaction databases. In Proc. of Discovery Science, 2004. • Top frequency attack • Wine is the most frequent item  ‘a’ is ‘wine’ • Approximate support attack • The support of cigar is about 55%~60% ‘c’ is ‘cigar’ The Risks of Outsourcing FIM Encrypt

  13. RELATED WORKS • Encrypt each real items by a one-many mapping function. Wong, W. K., Cheung, D. W., Hung, E., Kao, B., Mamoulis, N.: Security in Outsourcing of Association Rule Mining. In: Proc. of VLDB, 2007. • However, it does not try to anonymize the support information. • Recently it is cracked. Molloy, I., Li, N., Li, T.: On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining. In: Proc. of ICDM, 2009.

  14. OUTLINE • Preliminary – Frequent ItemSet Mining • Motivation • Privacy Model – K-Support Anonymity

  15. K-SUPPORT ANONYMITY &ANONYMIZATION • For every sensitive item, there are at least k-1 other items of the same support. • The probability of an item being correctly re-identified is limited to 1/k, even when the precise support information is known. • Given a transactional database T, encrypt T into E(T) such that • There exist a decryption function D such that MiningResult(T, Δ)= D(MiningResult(E(T), Δ)), for any minimal support Δ. • E(T) is k-support anonymous.

  16. SOLUTION 1: A NAÏVE APPROACH • For each set of real items of the same support, add enough fake items randomly into transactions to make the fake items as frequent as real ones. For k = 3, 16 additional items are required. 4 x 2 = 8 (e, f) for wine 3 x 2 = 6 (g, h) for cigar 2 x 1 = 2 (i) for beer and tea

  17. A NAÏVE SOLUTION • For each set of real items of the same support, add enough fake items randomly into transactions to make the fake items as frequent as real ones. There could be too large storage overhead when k is large. For k = 3, 16 additional items are required. 4 x 2 = 8 (e, f) for wine 3 x 2 = 6 (g, h) for cigar 2 x 1 = 2 (i) for beer and tea

  18. all prod. GENERALIZED FIM beverage cigar alcoholic tea • Discover all frequent items across concept levels, given a taxonomy indicating the hierarchical concepts between items beer wine When threshold set as 3 (=60%), {wine}, {cigar}, {alcoholic}, {beverage} and {all prod.} are frequent. {beverage, cigar} are also frequent.

  19. all prod. GENERALIZED FIM beverage cigar alcoholic tea • Discover all frequent items across concept levels, given a taxonomy indicating the hierarchical concepts between items beer wine When threshold set as 3 (=60%), {wine}, {cigar}, {alcoholic}, {beverage} and {all prod.} are frequent. {beverage, cigar} are also frequent. • 1. The support of a parent node comes from the • supports of it child nodes. • 2. Only lead nodes need to appear in the transactions.

  20. OUTLINE • Preliminary – Frequent ItemSet Mining • Motivation • Privacy Model – K-Support Anonymity • Algorithm

  21. ANONYMIZATION: OVERVIEW • For storage efficiency, we suggest to convert FIM to GFIM. Data Owner Third Party Encrypt Transaction Data Encrypted Transaction Data Pseudo Taxonomy Transaction Data Pseudo Taxonomy Generation in the Encryption Generalized Frequent Itemset Mining Frequent Itemsets Decrypt Frequent Itemsets

  22. ANONYMIZATION: STORAGE EFFICIENCY • In GFIM, items can be at multiple levels of a taxonomy and only the items at leaf level need to appear in the database. 1 2 1 Encrypt with k=3 wine {e, f, j} cigar {b, c, d} beer and tea {a, g, h} 4 additional items required k j i e f g h wine tea a b c d beer cigar

  23. ANONYMIZATION: STORAGE EFFICIENCY • In GFIM, items can be at multiple levels of a taxonomy and only the items at leaf level need to appear in the database. Small storage overhead compared to the naïve method. 1 2 1 Encrypt with k=3 wine {e, f, j} cigar {b, c, d} beer and tea {a, g, h} 4 additional items required k j i e f g h wine tea a b c d beer cigar

  24. ANONYMIZATION: EASY DECRYPTION • The real frequent itemsets can be obtained by filtering out patterns containing any fake item in 1 scan of the returned results. min_sup = 2 Results = {{beer}, {cigar}, {wine}, {tea}, {beer, wine}, {cigar, wine}} Results = {a, b, c, d, e, f, g, h, i, j, k, ac, af, bf, ce, …} k i j e f g h wine tea a b c d beer cigar

  25. ANONYMIZATION: EASY DECRYPTION • The real frequent itemsets can be obtained by filtering out patterns containing any fake item in 1 scan of the returned results. The data owner can obtain the real results in 1 scan of the returned itemsets. min_sup = 2 Results = {{beer}, {cigar}, {wine}, {tea}, {beer, wine}, {cigar, wine}} Results = {a, b, c, d, e, f, g, h, i, j, k, ac, af, bf, ce, …} k i j e f g h wine tea a b c d beer cigar

  26. ANONYMIZATION: ENCRYPTION The problem is how to build the taxonomy and encrypt T for k-support anonymity. Encrypt with k=3 k i j e f g h wine tea a b c d beer cigar

  27. ANONYMIZATION: ENCRYPTION • 1: Generalization of the Mining Task • To generate a pseudo taxonomy that can • (a) conserve the correct and complete mining results, • (b) facilitate k-support anonymization. • 2: Anonymization with Taxonomy Tree • To encrypt T for k-support anonymity with the help of the constructed taxonomy tree.

  28. 1: GENERALIZATION OF THE MINING TASK • Build a k-bud tree of T • All real items at the leaf level • The number of nodes in three categories is equal to or greater than k Let xM denote the most frequent real item in T • A> = { v| sup(v) > sup(xM) and vis leaf}, • A= = { v| sup(v) = sup(xM)}, and • A< = { v| sup(v) < sup(xM) < sup(u), where u is the parent node of v }. 5 2 5 (tea) 4 4 (wine) 2 3 (beer) (cigar) 3-bud tree

  29. 1: GENERALIZATION OF THE MINING TASK 3 groups beer cigar wine tea

  30. 1: GENERALIZATION OF THE MINING TASK 3 subtrees 4 4 2 3 2 (beer) (cigar) (wine) (tea)

  31. 1: GENERALIZATION OF THE MINING TASK Iteratively connect a subtree which sup(root) ≧ sup(wine) with the other subtree 5 2 (tea) 4 4 (wine) (beer) 2 3 (cigar)

  32. 1: GENERALIZATION OF THE MINING TASK 5 3 bud-tree 2 5 (tea) 4 4 (wine) 2 3 (beer) (cigar)

  33. 2: ANONYMIZATION WITH TAXONOMY TREE • Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity • Insertion • Split • Increase

  34. 2: ANONYMIZATION WITH TAXONOMY TREE • Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity • Insertion (Ex.) • Split • Increase u u v p sup(v) < target-sup < sup(u) p: the node with target support q: randomly select sup(p) – sup(v) transactions from T(u) – T(v) T(x) is the set of transactions containing the item x. v q sup(u) and sup(v) should not be changed.

  35. 2: ANONYMIZATION WITH TAXONOMY TREE For wine 5 y 4 5 x 2 5 2 2 (tea) p1 (tea) 4 4 (wine) 2 3 3-bud tree (beer) (cigar) insertion

  36. 2: ANONYMIZATION WITH TAXONOMY TREE v • Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity • Insertion • Split (Ex.) • Increase v p q target-sup < sup(v) p: randomly select target-sup transactions from T(v) q: T(p) = T(v) – T(q) T(x) is the set of transactions containing the item x. sup(v) should not be change. Split operation can raise up leaf nodes to internal nodes!

  37. 2: ANONYMIZATION WITH TAXONOMY TREE For wine For cigar 5 5 4 (wine) 5 3 1 y 4 5 x 2 5 2 2 (tea) p1 (tea) 4 4 (wine) p2 p3 2 3 3-bud tree (beer) (cigar) insertion split

  38. 2: ANONYMIZATION WITH TAXONOMY TREE u u • Alternate k-bud tree and modify T simultaneously to achieve k-support anonymity • Insertion • Split • Increase (Ex.) v v randomly select target-sup – sup(v) transactions from T(u) – T(v) sup(v) < target-sup sup(v) should not be changed. So, Increase operation is applicable only on node that does not belong to any anonymous group!

  39. 2: ANONYMIZATION WITH TAXONOMY TREE 5 5 4 For cigar For wine For cigar (wine) 3 5 5 4 (wine) 5 3 1 y 4 5 x 2 5 2 2 (tea) p1 (tea) 4 4 (wine) p2 p3 p3 2 3 3-bud tree (beer) (cigar) insertion split increase

  40. 2: ANONYMIZATION WITH TAXONOMY TREE 5 5 3-support anonymity 4 For cigar For wine For cigar (wine) 3 5 5 k 5 4 5 i j 4 2 2 4 4 (wine) f (wine) h (tea) e g 5 3 1 y 4 5 2 3 3 3 x a (beer) b (cigar) c d 2 5 2 2 (tea) p1 (tea) 4 4 (wine) p2 p3 p3 2 3 3-bud tree (beer) (cigar) insertion split increase

  41. OUTLINE • Preliminary – Frequent ItemSet Mining • Motivation • Privacy Model – K-Support Anonymity • Algorithm • Performance Studies • Conclusion

  42. PERFORMANCE STUDIES • Data sets • Retail dataset • 88162 transactions with 2117 different items • T10I1kD100k dataset • 100k transactions with 1000 different items • Security • Against precise item support attacks • Against precise itemset support attacks • Storage overhead • Execution efficiency

  43. SECURITY • Against precise item support attacks • Item accuracy: The ratio of items being re-identified • DB accuracy: The avg. ratio of items in a transaction being re-identified 43 (a) Retail dataset (b) T10I1kD100k dataset

  44. SECURITY • Against precise itemset support attacks • Item accuracy: The ratio of items being re-identified • DB accuracy: The avg. ratio of items in a transaction being re-identified 44 (a) Retail dataset (b) T10I1kD100k dataset

  45. STORAGE OVERHEAD & EXECUTION EFFICIENCY (a) Retail dataset (a) Retail dataset (b) T10I1kD100k dataset (b) T10I1kD100k dataset

  46. SUMMARY • We proposed k-support anonymity to enhance the privacy protection in outsourcing of frequent itemset mining (FIM). • For storage efficiency, we transformed FIM to GFIM, and proposed a taxonomy-based anonymization algorithm. • Our method allows the data owner to obtain the real frequent itemsets in 1 scan of the returned results. • Experimental results on both real and synthetic data sets showed that our method can achieve very good privacy protection with moderate storage overhead.

  47. Q & A

More Related