
De-anonymizing Data


Presentation Transcript


  1. De-anonymizing Data Source (http://xkcd.org/834/) CompSci 590.03 Instructor: Ashwin Machanavajjhala Lecture 2 : 590.03 Fall 12

  2. Announcements • Project ideas will be posted on the site by Friday. • You are welcome to send me (or talk to me about) your own ideas. Lecture 2 : 590.03 Fall 12

  3. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks • Passive Attacks • Active Attacks Lecture 2 : 590.03 Fall 12

  4. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks • Passive Attacks • Active Attacks Lecture 2 : 590.03 Fall 12

  5. Personal Big-Data [Figure: individuals Person 1 … Person N contribute records r1 … rN to collectors such as Google, the Census, and hospitals; the resulting databases serve information-retrieval researchers, recommendation algorithms, medical researchers, doctors, and economists.] Lecture 2 : 590.03 Fall 12

  6. The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002] • Governor of MA uniquely identified using ZipCode, Birth Date, and Sex; name linked to diagnosis. • Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge. • Voter List: Name, Address, Date Registered, Party Affiliation, Date Last Voted, Zip, Birth Date, Sex. Lecture 2 : 590.03 Fall 12

  7. The Massachusetts Governor Privacy Breach [Sweeney IJUFKS 2002] • Governor of MA uniquely identified using ZipCode, Birth Date, and Sex, a quasi-identifier combination that is unique for 87% of the US population. • Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge. • Voter List: Name, Address, Date Registered, Party Affiliation, Date Last Voted, Zip, Birth Date, Sex. Lecture 2 : 590.03 Fall 12
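A minimal sketch of the linkage idea behind this breach, assuming two hypothetical toy tables whose quasi-identifier columns are named zip, dob, and sex (the real datasets, names, and values are not shown on the slide; everything below is made up for illustration):

```python
# Hypothetical illustration of a quasi-identifier linkage attack; all values are made up.
medical_rows = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "..."},
]
voter_rows = [
    {"name": "J. Doe", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
]

QUASI_ID = ("zip", "dob", "sex")

def link(medical_rows, voter_rows):
    """Join the two tables on the quasi-identifier to re-attach names to diagnoses."""
    index = {}
    for v in voter_rows:
        index.setdefault(tuple(v[k] for k in QUASI_ID), []).append(v)
    for m in medical_rows:
        matches = index.get(tuple(m[k] for k in QUASI_ID), [])
        if len(matches) == 1:   # quasi-identifier is unique in the voter list
            yield matches[0]["name"], m["diagnosis"]

print(list(link(medical_rows, voter_rows)))   # [('J. Doe', '...')]
```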

  8. Statistical Privacy (Trusted Collector) Problem [Figure: Individuals 1 … N send their records r1 … rN to a trusted server holding DB.] • Goals: Utility, and Privacy (no breach about any individual). Lecture 2 : 590.03 Fall 12

  9. Statistical Privacy (Untrusted Collector) Problem [Figure: each Individual 1 … N applies a function f( ) to their record r1 … rN before it reaches the server's DB.] Lecture 2 : 590.03 Fall 12

  10. Randomized Response • Flip a coin • heads with probability p, and • tails with probability 1-p (p > ½) • Answer question according to the following table: Lecture 2 : 590.03 Fall 12
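The response table itself is not reproduced in the transcript. Below is a sketch of the classic Warner-style scheme that this coin setup suggests (heads: report the true answer; tails: report the opposite), together with the standard unbiased estimate of the population proportion; the parameter values at the end are illustrative only.

```python
import random

def randomized_response(true_answer: bool, p: float) -> bool:
    """Warner-style randomized response: tell the truth with probability p (p > 1/2),
    otherwise report the opposite answer."""
    return true_answer if random.random() < p else not true_answer

def estimate_proportion(reports, p):
    """Unbias the observed fraction of 'yes' reports:
       Pr[yes] = p*pi + (1-p)*(1-pi)  =>  pi = (Pr[yes] + p - 1) / (2p - 1)."""
    frac_yes = sum(reports) / len(reports)
    return (frac_yes + p - 1) / (2 * p - 1)

# Example: 10,000 respondents, 30% true 'yes', coin with p = 0.75.
random.seed(0)
truth = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(t, 0.75) for t in truth]
print(round(estimate_proportion(reports, 0.75), 3))   # close to 0.30
```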

  11. Statistical Privacy (Trusted Collector) Problem [Figure: Individuals 1 … N send their records r1 … rN to a trusted server holding DB.] Lecture 2 : 590.03 Fall 12

  12. Query Answering [Figure: Individuals 1 … N contribute records r1 … rN to the hospital's DB, which answers queries such as "How many allergy patients?" and "Correlate genome to disease".] Lecture 2 : 590.03 Fall 12

  13. Query Answering • Need to know the list of questions up front. • Each answer leaks some information about individuals; after answering a few questions, the server will exhaust its privacy budget and not be able to answer any more. • Will see this in detail later in the course. Lecture 2 : 590.03 Fall 12
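A schematic illustration (not any particular system's API) of the budget idea: every noisy answer spends part of a fixed budget, and once it is exhausted the server stops answering. The noise distribution and cost accounting below are simplified placeholders.

```python
import random

class BudgetedQueryServer:
    """Toy illustration of a privacy budget: each noisy answer costs some epsilon,
    and queries are refused once the total budget is spent."""
    def __init__(self, data, total_budget):
        self.data = data
        self.remaining = total_budget

    def count(self, predicate, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        true_count = sum(1 for row in self.data if predicate(row))
        # Noise scale grows as the per-query budget shrinks (simplified; differentially
        # private mechanisms would typically use Laplace noise calibrated to sensitivity).
        return true_count + random.gauss(0, 1.0 / epsilon)

server = BudgetedQueryServer(data=[{"allergy": True}] * 40 + [{"allergy": False}] * 60,
                             total_budget=1.0)
print(server.count(lambda r: r["allergy"], epsilon=0.5))
print(server.count(lambda r: r["allergy"], epsilon=0.5))
# A third query of the same cost would now be refused.
```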

  14. Anonymous/Sanitized Data Publishing [Figure: Individuals 1 … N contribute records r1 … rN to the hospital's DB; a data user declares "I won't tell you what questions I am interested in!" (image: writingcenterunderground.wordpress.com)] Lecture 2 : 590.03 Fall 12

  15. Anonymous/Sanitized Data Publishing [Figure: the hospital publishes a sanitized version DB’ of its database DB of records r1 … rN from Individuals 1 … N.] • Answer any # of questions directly on DB’ without any modifications. Lecture 2 : 590.03 Fall 12

  16. Today’s class • Identifying individual records and their sensitive values from data publishing (with insufficient sanitization). Lecture 2 : 590.03 Fall 12

  17. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks • Passive Attacks • Active Attacks Lecture 2 : 590.03 Fall 12

  18. Terms • Coin tosses of an algorithm • Union Bound • Heavy Tailed Distribution Lecture 2 : 590.03 Fall 12
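For reference, the union bound (used in the Scoreboard analysis later in these slides) states that for any events A_1, …, A_N:

\Pr\left[\bigcup_{i=1}^{N} A_i\right] \;\le\; \sum_{i=1}^{N} \Pr[A_i]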

  19. Terms (contd.) • Heavy Tailed Distribution: the Normal distribution is not heavy tailed. Lecture 2 : 590.03 Fall 12

  20. Terms (contd.) • Heavy Tailed Distribution: the Laplace distribution is heavy tailed. Lecture 2 : 590.03 Fall 12

  21. Terms (contd.) • Heavy Tailed Distribution: the Zipf distribution is heavy tailed. Lecture 2 : 590.03 Fall 12
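For intuition, the standard density forms behind these three examples (stated here for reference, with the usual parameterizations):

f_{\text{Normal}}(x) \propto e^{-x^2/(2\sigma^2)}, \qquad f_{\text{Laplace}}(x) \propto e^{-|x|/b}, \qquad \Pr[X = k] \propto k^{-s}\ \ (\text{Zipf})

Gaussian tails decay fastest, Laplace tails decay only exponentially, and Zipf tails decay polynomially, so rare items (for example, obscure movies) still carry substantial probability mass.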

  22. Terms (contd.) • Cosine Similarity: the cosine of the angle θ between two vectors. • Collaborative filtering: the problem of recommending new items to a user based on their ratings of previously seen items. Lecture 2 : 590.03 Fall 12
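Concretely, for two rating vectors x and y the cosine similarity is

\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}

and values near 1 indicate nearly identical rating profiles.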

  23. Netflix Dataset [Figure: a Users × Movies matrix; each record (r) is a user, each column/attribute is a movie, and each cell holds a rating plus a timestamp.] Lecture 2 : 590.03 Fall 12

  24. Definitions • Support • Set (or number) of non-null attributes in a record or column • Similarity • Sparsity Lecture 2 : 590.03 Fall 12
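The Similarity and Sparsity formulas on this slide are not reproduced in the transcript. As a hedged sketch of how sparsity is typically formalized in the Netflix-paper framework (an assumption, not a quote from the slide): a dataset D is (ε, δ)-sparse if, for a record r drawn at random from D, it is unlikely that any other record is almost as similar to r as r is to itself:

\Pr_{r \leftarrow D}\big[\exists\, r' \ne r:\ \mathrm{Sim}(r, r') \ge \mathrm{Sim}(r, r) - \epsilon\big] \le \delta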

  25. Adversary Model • Aux(r) – some subset of attributes from r Lecture 2 : 590.03 Fall 12

  26. Privacy Breach • Definition 1: An algorithm A outputs an r’ such that • Definition 2: (When only a sample of the dataset is input) Lecture 2 : 590.03 Fall 12
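The formulas for these two definitions are likewise missing from the transcript. In the Netflix paper the corresponding notion is roughly that D can be (θ, ω)-de-anonymized: there is an adversary A that, given Aux(r) for a randomly drawn record r, outputs a record at least θ-similar to r with probability at least ω (stated here as a paraphrase, not the slide's exact definition):

\Pr_{r \leftarrow D}\big[\mathrm{Sim}\big(r,\ \mathcal{A}(\mathrm{Aux}(r))\big) \ge \theta\big] \ge \omega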

  27. Algorithm ScoreBoard • For each record r’, compute Score(r’, aux) to be the minimum similarity of an attribute in aux to the same attribute in r’. • Pick r’ with the maximum score OR • Return all records with Score > α Lecture 2 : 590.03 Fall 12
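A minimal sketch of Scoreboard as stated on this slide, assuming records are dictionaries from attribute names to values and that a per-attribute similarity function is supplied; the function names and toy data below are illustrative assumptions.

```python
def scoreboard(records, aux, attr_sim, alpha=None):
    """Score(r', aux) = min over aux attributes of attr_sim(aux[a], r'[a]).
    Return the best-scoring record, or all records scoring above alpha if given."""
    def score(r):
        return min(attr_sim(aux[a], r.get(a)) for a in aux)

    scored = [(score(r), r) for r in records]
    if alpha is not None:
        return [r for s, r in scored if s > alpha]
    return max(scored, key=lambda sr: sr[0])[1]

# Toy example: exact-match similarity on movie ratings (1.0 if equal, else 0.0).
attr_sim = lambda a, b: 1.0 if a == b else 0.0
records = [{"m1": 5, "m2": 3, "m3": 1}, {"m1": 4, "m2": 3, "m3": 2}]
aux = {"m1": 5, "m3": 1}               # adversary knows a few of the victim's ratings
print(scoreboard(records, aux, attr_sim))   # -> {'m1': 5, 'm2': 3, 'm3': 1}
```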

  28. Analysis Theorem 1: Suppose we use Scoreboard with α = 1 – ε. If Aux contains m randomly chosen attributes with m > log(N/ε) / log(1/(1 – δ)), then Scoreboard returns a record r’ such that Pr[ Sim(Aux, r’) > 1 – ε – δ ] > 1 – ε. Lecture 2 : 590.03 Fall 12

  29. Proof of Theorem 1 • Call r’ a false match if Sim(Aux, r’) < 1 – ε – δ. • For any false match and a randomly chosen attribute i in Aux, Pr[ Sim(Aux_i, r’_i) > 1 – ε ] < 1 – δ. • Sim(Aux, r’) = min_i Sim(Aux_i, r’_i), therefore Pr[ Sim(Aux, r’) > 1 – ε ] < (1 – δ)^m. • By the union bound, Pr[ some false match has similarity > 1 – ε ] < N(1 – δ)^m. • N(1 – δ)^m < ε when m > log(N/ε) / log(1/(1 – δ)). Lecture 2 : 590.03 Fall 12
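A worked instance of the final bound, with illustrative numbers not taken from the slides: for N = 500{,}000 records, ε = 0.05 and δ = 0.35,

m > \frac{\log(N/\epsilon)}{\log\big(1/(1-\delta)\big)} = \frac{\ln(10^{7})}{\ln(1/0.65)} \approx \frac{16.1}{0.43} \approx 37.4,

so m = 38 auxiliary attributes already push the failure probability N(1 – δ)^m below ε.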

  30. Other results • If dataset D is (1 – ε – δ, ε)-sparse, then D can be (1, 1 – ε)-deanonymized. • Analogous results hold when a list of candidate records is returned. Lecture 2 : 590.03 Fall 12

  31. Netflix Dataset • Slightly different algorithm Lecture 2 : 590.03 Fall 12
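The slide does not spell the variant out. The sketch below reflects the main changes described in the published Netflix attack (rare movies weighted more heavily, ratings and dates matched approximately, and the best match accepted only if it clearly stands out from the rest); the parameter names, tolerances, and threshold are illustrative assumptions, not the slide's exact algorithm.

```python
import math
import statistics

def netflix_score(aux, record, movie_support, rating_tol=1, day_tol=14):
    """Weighted score in the spirit of the published Netflix attack: an aux movie
    counts only if its rating and date approximately match, and rare movies
    (small support) contribute more identifying weight."""
    score = 0.0
    for movie, (rating, day) in aux.items():
        if movie not in record:
            continue
        r2, d2 = record[movie]
        if abs(rating - r2) <= rating_tol and abs(day - d2) <= day_tol:
            # Popular movies carry little identifying information.
            score += 1.0 / math.log(max(movie_support[movie], 2))
    return score

def best_match(records, aux, movie_support, phi=1.5):
    """Accept the top-scoring record only if it is clearly separated from the
    runner-up (an 'eccentricity' test); otherwise report no match.
    Assumes at least two candidate records."""
    scores = [netflix_score(aux, r, movie_support) for r in records]
    best = max(range(len(records)), key=lambda i: scores[i])
    runner_up = max(s for i, s in enumerate(scores) if i != best)
    sigma = statistics.pstdev(scores)
    if sigma > 0 and (scores[best] - runner_up) / sigma >= phi:
        return records[best]
    return None
```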

  32. Summary of Netflix Paper • Adversary can use a subset of ratings made by a user to uniquely identify the user’s record from the “anonymized” dataset with high probability • Simple Scoreboard algorithm provably guarantees identification of records. • A variant of Scoreboard can de-anonymize Netflix dataset. • Algorithms are robust to noise in the adversary’s background knowledge Lecture 2 : 590.03 Fall 12

  33. Outline • Recap & Intro to Anonymization • Algorithmically De-anonymizing Netflix Data • Algorithmically De-anonymizing Social Networks • Passive Attacks • Active Attacks Lecture 2 : 590.03 Fall 12

  34. Social Network Data • Social networks: graphs where each node represents a social entity and each edge represents a certain relationship between two entities • Examples: email communication graphs, social interactions as in Facebook, Yahoo! Messenger, etc. Lecture 2 : 590.03 Fall 12

  35. Anonymizing Social Networks • Naïve anonymization: remove the label of each node and publish only the structure of the network • Information leaks: nodes may still be re-identified based on network structure [Figure: example graph with nodes Alice, Bob, Cathy, Diane, Ed, Fred, Grace] Lecture 2 : 590.03 Fall 12

  36. Passive Attacks on an Anonymized Network • Consider the above email communication graph • Each node represents an individual • Each edge between two individuals indicates that they have exchanged emails Lecture 2 : 590.03 Fall 12

  37. Passive Attacks on an Anonymized Network • Alice has sent emails to three individuals only Lecture 2 : 590.03 Fall 12

  38. Passive Attacks on an Anonymized Network • Alice has sent emails to three individuals only • Only one node in the anonymized network has degree three • Hence, Alice can re-identify herself Lecture 2 : 590.03 Fall 12

  39. Passive Attacks on an Anonymized Network • Cathy has sent emails to five individuals Lecture 2 : 590.03 Fall 12

  40. Passive Attacks on an Anonymized Network • Cathy has sent emails to five individuals • Only one node has degree five • Hence, Cathy can re-identify herself Lecture 2 : 590.03 Fall 12
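A small sketch of this degree-based passive attack on a toy graph; the edge list below is illustrative and is not the figure from the slides, which is not reproduced in the transcript.

```python
from collections import defaultdict

# Toy "anonymized" graph: labels removed, only numeric ids 0..6 and edges remain.
# In the underlying data, id 0 happens to be Alice and id 2 happens to be Cathy.
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (1, 4), (1, 5),
         (2, 3), (2, 4), (2, 5), (3, 6), (4, 6)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def nodes_with_degree(d):
    return [n for n, nbrs in adj.items() if len(nbrs) == d]

# Alice knows she emailed exactly three people, Cathy exactly five.
print(nodes_with_degree(3))   # [0] -> a unique candidate, so Alice re-identifies herself
print(nodes_with_degree(5))   # [2] -> likewise for Cathy
```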

  41. Passive Attacks on an Anonymized Network • Now consider that Alice and Cathy share their knowledge about the anonymized network • What can they learn about the other individuals? Lecture 2 : 590.03 Fall 12

  42. Passive Attacks on an Anonymized Network • First, Alice and Cathy know that only Bob has sent emails to both of them Lecture 2 : 590.03 Fall 12

  43. Passive Attacks on an Anonymized Network • First, Alice and Cathy know that only Bob has sent emails to both of them • Bob can be identified Lecture 2 : 590.03 Fall 12

  44. Passive Attacks on an Anonymized Network • Alice has sent emails to Bob, Cathy, and Ed only Lecture 2 : 590.03 Fall 12

  45. Passive Attacks on an Anonymized Network • Alice has sent emails to Bob, Cathy, and Ed only • Ed can be identified Lecture 2 : 590.03 Fall 12

  46. Passive Attacks on an Anonymized Network • Alice and Cathy can learn that Bob and Ed are connected Lecture 2 : 590.03 Fall 12

  47. Passive Attacks on an Anonymized Network • The above attack is based on knowledge about the degrees of nodes. [Liu and Terzi, SIGMOD 2008] • More sophisticated attacks can be launched given additional knowledge about the network structure, e.g., a subgraph of the network. [Zhou and Pei, ICDE 2008; Hay et al., VLDB 2008] • Protecting privacy becomes even more challenging when the nodes in the anonymized network are labeled. [Pang et al., SIGCOMM CCR 2006] Lecture 2 : 590.03 Fall 12

  48. Inferring Sensitive Values on a Network • Each individual has a single sensitive attribute. • Some individuals share the sensitive attribute, while others keep it private • GOAL: Infer the private sensitive attributes using • Links in the social network • Groups that the individuals belong to • Approach: Learn a predictive model (think classifier) using public profiles as training data. [Zheleva and Getoor, WWW 2009] Lecture 2 : 590.03 Fall 12

  49. Inferring Sensitive Values on a Network • Baseline: Most commonly appearing sensitive value amongst all public profiles. Lecture 2 : 590.03 Fall 12

  50. Inferring Sensitive Values on a Network • LINK: Each node x has a list of binary features Lx, one for every node in the social network. • Feature value Lx[y] = 1 if and only if (x,y) is an edge. • Train a model on all pairs (Lx, sensitive value(x)), for x’s with public sensitive values. • Use learnt model to predict private sensitive values Lecture 2 : 590.03 Fall 12
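A compact sketch of the LINK idea under the setup described above, using scikit-learn's logistic regression as the predictive model; the choice of classifier and all names and data below are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def link_attack(adjacency, labels):
    """LINK-style inference: node x's feature vector L_x is its row of the adjacency
    matrix; train on users whose sensitive value is public, predict the private ones."""
    public = [x for x, y in enumerate(labels) if y is not None]
    private = [x for x, y in enumerate(labels) if y is None]

    model = LogisticRegression(max_iter=1000)
    model.fit(adjacency[public], [labels[x] for x in public])   # train on public profiles
    return dict(zip(private, model.predict(adjacency[private])))

# Tiny illustrative network: 6 users, sensitive value public for 4, private for 2.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 1, 1],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 1, 1, 0, 1],
              [0, 0, 1, 1, 1, 0]])
labels = ["a", "a", None, "b", "b", None]
print(link_attack(A, labels))   # predicted sensitive values for users 2 and 5
```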
