HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION

Presentation Transcript


  1. HIDING EMERGING PATTERNS WITH LOCAL RECODING GENERALIZATION Presented by: Michael Cheng Supervisor: Dr. William Cheung Co-Supervisor: Dr. Byron Choi

  2. Presentation Flow • Privacy-Preserving Data Publishing • Introduction to Emerging Patterns (EPs) • Introduction to Equivalence Class • Introduction to Generalization • Proposed Problem and Motivation • Heuristic for the Problem • Experimental Results • Future research plan

  3. Privacy-Preserving Data Publishing: Introduction • Organizations often need to publish or share their data for legitimate reasons • Sensitive information (e.g. personal identities, restrictive patterns) may be inferred from the published data

  4. Privacy-Preserving Data Publishing: Objective • Transform the dataset before publishing, such that: • Sensitive information is hidden • In our case: Emerging Patterns (EPs) • Subsequent analysis remains possible • In our case: Frequent Itemset (FIS) Mining

  5. Introduction to Emerging Patterns (EPs) • Emerging Patterns (EPs) are itemsets, found in a pair of datasets, whose supports are significant in one dataset but insignificant in the other • Example: {MSE, Exec} is an Emerging Pattern

     Edu   Occup     Marital
     MSE   Exec      Married
     MSE   Exec      Married
     BA    Exec      Married
     BA    Manager   Married
     BA    Repair    Never

     Classes: Income >= 50k, Income < 50k

  6. Introduction to Emerging Patterns (EPs) • Formally, growth rate and EPs are defined as follows:
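
For reference, a sketch of the standard growth-rate definition from the emerging-patterns literature, which this slide presumably follows. The notation ($supp_1$, $supp_2$ for the support of itemset $X$ in datasets $D_1$ and $D_2$, and $\rho$ for the growth-rate threshold) is assumed here rather than taken from the slide.

```latex
\[
GR(X) =
\begin{cases}
0, & \text{if } supp_1(X) = 0 \text{ and } supp_2(X) = 0,\\
\infty, & \text{if } supp_1(X) = 0 \text{ and } supp_2(X) > 0,\\
\dfrac{supp_2(X)}{supp_1(X)}, & \text{otherwise.}
\end{cases}
\]
```

Under this definition, $X$ is an EP from $D_1$ to $D_2$ when $GR(X) \ge \rho$ for a chosen threshold $\rho > 1$.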

  7. Introduction to Equivalence Class • Tuples are said to be in the same Equivalence Class w.r.t. a set of attributes A if they take the same values on A • Example: tuples {1,2,3} are in the same Equivalence Class w.r.t. {Occup, Marital}

     ID   Edu   Occup     Marital
     1    MSE   Exec      Married
     2    MSE   Exec      Married
     3    BA    Exec      Married
     4    BA    Manager   Married
     5    BA    Repair    Never
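
The grouping described on this slide can be sketched in a few lines of Python. The function name `equivalence_classes` and the tuple layout are illustrative assumptions; the data is the slide's five-row example.

```python
from collections import defaultdict

# Rows of the slide's example table: (ID, Edu, Occup, Marital).
COLUMNS = ("ID", "Edu", "Occup", "Marital")
ROWS = [
    (1, "MSE", "Exec", "Married"),
    (2, "MSE", "Exec", "Married"),
    (3, "BA", "Exec", "Married"),
    (4, "BA", "Manager", "Married"),
    (5, "BA", "Repair", "Never"),
]


def equivalence_classes(rows, attributes):
    """Group tuple IDs by the values they take on the given attributes."""
    idx = [COLUMNS.index(a) for a in attributes]
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in idx)
        classes[key].append(row[0])
    return dict(classes)


classes = equivalence_classes(ROWS, ["Occup", "Marital"])
```

With {Occup, Marital} as the attribute set, tuples 1, 2 and 3 share the key ("Exec", "Married"), while tuples 4 and 5 each form a class of their own.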

  8. Introduction to Generalization • Extensively studied for achieving k-Anonymity • Not studied before for hiding itemsets • Modify the original values in the dataset into more general values according to a user-given hierarchy, so that more tuples share the same set of attribute values • Example: In Adult, “BA” and “MSE” may be generalized to “Degree Holder”

  9. Types of Generalization • Single Dimensional Global Recoding • Multi Dimensional Global Recoding • Multi Dimensional Local Recoding

  10. Single Dimensional Global Recoding • If we decide to generalize some values to a single value, all tuples that contain these values are affected [Figure: Single Dimensional Global Recoding]

  11. Multi Dimensional Global Recoding • If we decide to generalize some values to a single value, all tuples in the same equivalence class that contain those values are affected [Figure: Multi Dimensional Global Recoding of the Occup column; labels: Occupation, Manager, Repair]

  12. Multi Dimensional Local Recoding • Same as Multi Dimensional Global Recoding, but without the Equivalence Class constraint: tuples in the same equivalence class need not be generalized together [Figure: Multi Dimensional Local Recoding of the Occup column; labels: Occupation, Exec, Manager, Repair]
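
A minimal sketch contrasting the two extremes, single dimensional global recoding and local recoding, on one column. The hierarchy `PARENT`, the function names, and the choice of positions are assumptions for illustration; "White col" is borrowed from the deck's later example.

```python
# Hypothetical one-level hierarchy (an assumption, not the deck's hierarchy).
PARENT = {"Exec": "White col", "Manager": "White col"}

# The Occup column of the five-tuple example.
OCCUP = ["Exec", "Exec", "Exec", "Manager", "Repair"]


def global_recode(values, targets):
    """Single dimensional global recoding: every occurrence is generalized."""
    return [PARENT[v] if v in targets else v for v in values]


def local_recode(values, targets, positions):
    """Local recoding: only the tuples at the chosen positions are generalized."""
    return [
        PARENT[v] if i in positions and v in targets else v
        for i, v in enumerate(values)
    ]


global_result = global_recode(OCCUP, {"Exec"})
local_result = local_recode(OCCUP, {"Exec"}, positions={0, 1})
```

Global recoding rewrites all three "Exec" values, whereas local recoding rewrites only the first two and leaves the third tuple untouched, even though it carries the same value.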

  13. Proposed Problem: Why EP and FIS? • Emerging Patterns may reveal sensitive information • E.g. in the Adult dataset from the UCI Repository, we found that: • {Never-Married, Own-Child} is an EP from the class “Income < 50k” to the class “Income >= 50k” • Growth Rate: 35 • Frequent Itemset mining is a popular data mining task and is supported by commercial data-mining software

  14. Proposed Problem: Why Generalization? • Other methods studied in PPDP • For example: • Adding unknowns, removing tuples, adding fake tuples randomly • These produce either: • Incomplete information • Fake information • In some applications, completeness and truthfulness of data are important • By using generalization, we can preserve the completeness and truthfulness of the data

  15. Proposed Problem: Problem Illustration [Figure: dataset D, with its Emerging Patterns and Frequent Itemsets, is transformed by local recoding into D’, with its own Emerging Patterns and Frequent Itemsets]

  16. Intuition of Local Recoding • Support of FIS = 40% • Growth Rate of EP = 3 • Frequent Itemset = {Exec, Married} • Emerging Pattern = {MSE, Exec}

     Income >= 50k                 Income < 50k
     Edu   Occup     Marital       Edu   Occup     Marital
     BA    Exec      Married       MSE   Exec      Married
     BA    Exec      Married       MSE   Exec      Married
     BA    Exec      Married       BA    Exec      Married
     BA    Worker    Married       BA    Manager   Married
     MSE   Manager   Never         BA    Repair    Never

  17. Intuition of Local Recoding

     Before local recoding:

     Income >= 50k                 Income < 50k
     Edu   Occup     Marital       Edu   Occup     Marital
     BA    Exec      Married       MSE   Exec      Married
     BA    Exec      Married       MSE   Exec      Married
     BA    Exec      Married       BA    Exec      Married
     BA    Worker    Married       BA    Manager   Married
     MSE   Manager   Never         BA    Repair    Never

     After local recoding:

     Income >= 50k                 Income < 50k
     Edu   Occup     Marital       Edu   Occup     Marital
     BA    Exec      Married       MSE   White col Married
     BA    Exec      Married       MSE   White col Married
     BA    Exec      Married       BA    Exec      Married
     BA    Worker    Married       BA    Manager   Married
     MSE   White col Never         BA    Repair    Never
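
The effect of this recoding can be checked numerically. The sketch below is illustrative only: it assumes the flattened slide reads as two five-row tables (left: Income >= 50k, right: Income < 50k), that FIS support is measured over both classes pooled, and that growth rate follows the standard ratio-of-supports definition. The names `support` and `growth_rate` are hypothetical helpers.

```python
HIGH = [  # Income >= 50k (assumed to be the left-hand table)
    ("BA", "Exec", "Married"),
    ("BA", "Exec", "Married"),
    ("BA", "Exec", "Married"),
    ("BA", "Worker", "Married"),
    ("MSE", "Manager", "Never"),
]
LOW = [  # Income < 50k (assumed to be the right-hand table)
    ("MSE", "Exec", "Married"),
    ("MSE", "Exec", "Married"),
    ("BA", "Exec", "Married"),
    ("BA", "Manager", "Married"),
    ("BA", "Repair", "Never"),
]
# The same tables after the local recoding shown on the slide.
HIGH_RECODED = HIGH[:4] + [("MSE", "White col", "Never")]
LOW_RECODED = [("MSE", "White col", "Married")] * 2 + LOW[2:]


def support(itemset, rows):
    """Fraction of rows that contain every item of the itemset."""
    hits = sum(1 for row in rows if set(itemset) <= set(row))
    return hits / len(rows)


def growth_rate(itemset, d1, d2):
    """Growth rate from d1 to d2 (0 if absent in both, inf if absent in d1)."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        return 0.0 if s2 == 0 else float("inf")
    return s2 / s1


before_ep = growth_rate({"MSE", "Exec"}, HIGH, LOW)
after_ep = growth_rate({"MSE", "White col"}, HIGH_RECODED, LOW_RECODED)
before_fis = support({"Exec", "Married"}, HIGH + LOW)
after_fis = support({"Exec", "Married"}, HIGH_RECODED + LOW_RECODED)
```

Under these assumptions, {MSE, Exec} never occurs in the Income >= 50k table before recoding, so its growth rate is unbounded. After recoding, the generalized pattern {MSE, White col} has growth rate 2, below the threshold of 3, while {Exec, Married} drops from 60% to 40% support and so still meets the 40% threshold.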

  18. Heuristic for the Problem: Greedy Approach • Repeat: mine the Emerging Patterns of D (EP 1, EP 2, EP 3, EP 4), compute the utility gain of each Equivalence Class, and apply the generalization • Until: all Emerging Patterns are removed

     Equivalence Class   Utility Gain
     Class 1             40
     Class 2             90
     Class 3             60
     Class 4             20
     Class 5             15
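
The loop on this slide can be sketched as follows. This is only an outline under stated assumptions: `greedy_hide` and its callback parameters (`mine_eps`, `candidates`, `utility_gain`, `apply_recoding`) are hypothetical names, and `max_rounds` is an added safety guard that the slide does not mention.

```python
def greedy_hide(data, mine_eps, candidates, utility_gain, apply_recoding,
                max_rounds=1000):
    """Greedy loop: mine EPs, score candidates, apply the best, repeat."""
    for _ in range(max_rounds):
        eps = mine_eps(data)
        if not eps:
            break  # all Emerging Patterns are removed
        options = candidates(data, eps)
        if not options:
            break  # nothing left to generalize
        best = max(options, key=lambda c: utility_gain(data, c, eps))
        data = apply_recoding(data, best)
    return data


# Selection step with the utility gains shown on the slide.
GAINS = {"Class 1": 40, "Class 2": 90, "Class 3": 60, "Class 4": 20, "Class 5": 15}
best_class = max(GAINS, key=GAINS.get)
```

With the slide's numbers, the greedy step picks Class 2, the equivalence class with the highest utility gain.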

  19. Heuristic for the Problem: Greedy Approach • Drawback: • May become trapped in a local minimum • Solution: • Simulated Annealing Style Approach for choosing the equivalence class

  20. Heuristic for the Problem: Simulated Annealing Style Approach • Choose the Equivalence Class probabilistically • Two parameters: • Initial temperature ($T_0$) • Cooling rate ($\alpha$) • Acceptance probability: • $\exp(\text{Utility Gain} / \text{Temperature})$ • Temperature updating: • $T_n = \alpha T_{n-1}$ [Figure: acceptance probability for different utility gains and temperatures]
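
A hedged sketch of the cooling schedule and a probabilistic choice. The geometric cooling follows the slide's update rule; the normalized (softmax-style) selection over exp(gain / T) is an assumption about how the acceptance weights turn into probabilities, and all function names are hypothetical. The sample gains are the ones shown for the greedy approach, and the schedule uses the initial temperature 10 and cooling rate 0.4 from the experimental setup.

```python
import math
import random


def temperatures(t0, alpha, steps):
    """Geometric cooling: T_n = alpha * T_(n-1), starting from T_0."""
    return [t0 * alpha**n for n in range(steps)]


def selection_probabilities(gains, temperature):
    """Assumed softmax form: weight exp(gain / T), normalized to sum to 1."""
    top = max(gains.values())  # subtracting the maximum avoids overflow
    weights = {c: math.exp((g - top) / temperature) for c, g in gains.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def choose_class(gains, temperature, rng=random):
    """Draw one equivalence class according to the probabilities above."""
    probs = selection_probabilities(gains, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


GAINS = {"Class 1": 40, "Class 2": 90, "Class 3": 60, "Class 4": 20, "Class 5": 15}
schedule = temperatures(10, 0.4, 4)
hot = selection_probabilities(GAINS, 100.0)
cold = selection_probabilities(GAINS, 1.0)
```

At a high temperature the choice is spread across classes, which lets the search leave a local minimum. As the temperature falls, the probability mass concentrates on the class with the largest utility gain and the behaviour approaches the greedy approach.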

  21. Heuristic for the Problem: Simulated Annealing Style Approach • Repeat: mine the Emerging Patterns of D (EP 1, EP 2, EP 3, EP 4), choose an Equivalence Class according to its probability, apply the generalization, and decrease the temperature • Until: all Emerging Patterns are removed

     Equivalence Class   Probability
     Class 1             0.2
     Class 2             0.4
     Class 3             0.1
     Class 4             0.25
     Class 5             0.05

  22. Two questions • How to choose an EP for generalization? • How to calculate the utility gain?

  23. How to choose an EP for generalization? • Choose the EP that overlaps the most with the remaining EPs • It is more likely to hide other EPs simultaneously • Emerging Patterns: {MSE, Never-Married}; {BA, Divorced}; {BA, Divorced, Worker}; {BA, Divorced, Repairman}; {BA, Divorced, Own-Child}
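
One way to make "overlaps the most" concrete is to count the items an EP shares with every other EP. This scoring, the tie-break in favour of the shorter pattern, and the reading of the slide's flattened list as five itemsets are all assumptions made for illustration.

```python
# The slide's example EPs, as read from its flattened list (an assumption).
EPS = [
    frozenset({"MSE", "Never-Married"}),
    frozenset({"BA", "Divorced"}),
    frozenset({"BA", "Divorced", "Worker"}),
    frozenset({"BA", "Divorced", "Repairman"}),
    frozenset({"BA", "Divorced", "Own-Child"}),
]


def overlap_score(ep, all_eps):
    """Total number of items this EP shares with each of the other EPs."""
    return sum(len(ep & other) for other in all_eps if other is not ep)


def pick_ep(all_eps):
    """Pick the EP with the highest overlap; ties go to the shorter pattern."""
    return max(all_eps, key=lambda ep: (overlap_score(ep, all_eps), -len(ep)))


chosen = pick_ep(EPS)
```

Here every EP containing {BA, Divorced} scores the same, so the tie-break selects {BA, Divorced} itself. Generalizing its items also affects the three longer patterns that contain it.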

  24. How to calculate utility gain? • Utility gain is a function of: • Recoding Distance (RD) • Reduction of Growth Rate (RG)

  25. How to calculate utility gain? Recoding Distance (RD) • The detailed derivation is given in the paper • Intuitively, it measures: • How many FIS have been generalized, and by how much? • How many FIS have disappeared? • High-level definition: $RD = \theta_q \times (\text{generalized FIS}) + (1 - \theta_q) \times (\text{disappeared FIS})$, where $\theta_q$ is a user-defined parameter • The larger the value of RD, the greater the distortion of the Frequent Itemsets

  26. How to calculate utility gain? Reduction of Growth Rate (RG) • After a local recoding is applied, RG is defined as: • The total reduction in growth rate over all EPs • Example: a local recoding lowers the growth rate from 60 to 25, so RG = 60 – 25 = 35

  27. How to calculate utility gain? • Putting these together, utility gain is defined as: $\theta_p \times RG - (1 - \theta_p) \times RD$, where $\theta_p$ is a user-defined parameter • It favors: • Local recodings that reduce the growth rate substantially • It penalizes: • Local recodings that cause large distortion of the FIS
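
The two weighted sums can be written directly as functions. This is a sketch of the high-level definitions only: the function names are hypothetical, the inputs to `recoding_distance` are made-up numbers (the deck defers the detailed measures to the paper), and both weights are set to 0.5 purely for illustration. RG reuses the 60 – 25 = 35 example from the earlier slide.

```python
def recoding_distance(generalized, disappeared, theta_q):
    """High-level RD: weighted sum of the generalized and disappeared FIS terms."""
    return theta_q * generalized + (1 - theta_q) * disappeared


def utility_gain(rg, rd, theta_p):
    """Reward growth-rate reduction (RG), penalize distortion of the FIS (RD)."""
    return theta_p * rg - (1 - theta_p) * rd


rg = 60 - 25  # reduction of growth rate from the RG slide's example
# Hypothetical distortion terms, chosen only to make the arithmetic visible.
rd = recoding_distance(generalized=8.0, disappeared=2.0, theta_q=0.5)
gain = utility_gain(rg, rd, theta_p=0.5)
```

With these illustrative inputs, RD = 0.5 × 8 + 0.5 × 2 = 5 and the utility gain is 0.5 × 35 – 0.5 × 5 = 15. Raising $\theta_p$ shifts the balance toward hiding EPs, and lowering it shifts the balance toward preserving the FIS.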

  28. Experimental Setup • Dataset: Adult dataset from UCI Repository • Popular benchmark dataset used for generalization • Total number of records: 30162 • Income > 50k : 7508 • Income <= 50k : 22654 • Use only 8 categorical attributes for experiment • A well accepted hierarchy is defined • Parameters: • Support of FIS : 40% • Growth rate of EP : 5 • Initial Temperature : 10 • Cooling Rate : 0.4

  29. Performance • Maximum RD: 623.1 [Results shown: RD / No. of FIS disappeared for the Greedy Approach; RD / No. of FIS disappeared for the Simulated Annealing Style Approach (Best of 5)]

  30. Runtime (in minutes) [Results shown: Greedy Approach; Simulated Annealing Style Approach (Best of 5)]

  31. Future Research Plan • Hide EPs in temporal datasets • Consider multi-level FIS • Hide a group of emerging patterns at a time

  32. Q & A Any Questions?
