Ιδιωτικότητα σε Βάσεις Δεδομένων

Ιδιωτικότητα σε Βάσεις Δεδομένων Οκτώβρης 2011

Roadmap • Motivation • Core ideas • Extensions

Reasons for privacy preserving data publishing • Vast amount of data collected nowadays • Estimated user data per day • 8-10 GB public content • ~ 4 TB private content (emails, SMSs, content annotations, social networks…)

Reasons for privacy preserving data publishing • Organizations (hospitals, ministries, internet providers, …) publicly release data concerning individual records (internet searches, medical records, …) • The laws oblige these agencies to protect the individuals’ privacy

Reasons for privacy preserving data publishing • So, data are stripped of the attributes that can reveal the individuals’ identities • Unfortunately, this is not enough…

Sweeney’s breach of governor’s medical record • “ … In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. • For twenty dollars I purchased the voter registration list for Cambridge Massachusetts and received the information on two diskettes. • The rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date, and gender of each voter. • This information can be linked using ZIP code, birth date and gender to the medical information, thereby linking diagnosis, procedures, and medications to particularly named individuals. …”

Sweeney’s breach of governor’s medical record • “ … For example, William Weld was governor of Massachusetts at that time and his medical records were in the GIC data. Governor Weld lived in Cambridge Massachusetts. According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code.…”

AOL’s exposure of user 4417749 • August 2006: AOL publicizes anonymized data for 21M user queries • User’s 4417749 had a strong essence of geo and thematic locality • Researchers would focus more and more the search, based on these queries • Ms Arnold, 62, would prove to be the user search for medication, resorts, dogs and family members, …

The context of privacy-preserving data publishing Deborah, a star DBA & a TRUSTED data publisher Ben, the benevolent, data miner Detailed microdata T Anonymized public data T* Alice, the external attacker Bob (the victim) to be hidden

Anonymization • To retain privacy one must: • Remove the attributes that directly identify individuals (name, SSN, …) • Organize the tuples and the cell values of the data set in such a way that: • The statistical properties of the data set are retained • The attacker cannot guess to which individual a tuple correspondswith statistical meaningful guarantee

Fundamentals • Identifier(s): attribute(s) that explicitly reveal the identity of a person (name, SSN, …). These attributes are removed from the public data set • Quasi identifier: attribute(s) that if joined with external data can reveal sensitive information (zip code, birth date, sex,…) • Typically accompanied by “generalization hierarchies” • Sensitive attribute: containing the values that should be kept private (disease, salary,…)

Generalization hierarchies

General methods for Anonymization • “Hide tuples in the crowd” • Generalization • Anatomization • “Lies to the attacker, truth to the statistician” • Noise injection • Value perturbation

k-anonymity (TKDE 01, IJUFKS 02) • A relationΤ isk-anonymous when every tuple of the relation is identical to k-1 other tuples with respect to their Quasi-Identifierset of attributes.

Naïvel-diversity A relation Tsatisfies the naïve l-diversity property whenever every group of the relation contains at least l different values in its sensitive attributes.

Information utility • Must prevent the attackers, by satisfying the privacy criterion (k for k-anonymity, l for l-diversity) • Fundamental anonymization technique: hide individual in groups of identical QI values!! • Must serve the well-meaning users, by maximizing information utility i.e., by minimizing • The tuples we remove (see next) • theamount of generalization that we apply to the QI attributes.

Generalization vs suppression • This anonymization suppressed no tuples, and guarantees 3-anonymity. • What if we want 4-anonymity?

Generalization vs suppression Low height, 6 tuples suppressed Higher height, no tuples suppressed //the difference is in the work_class field

Incognito (SIGMOD 2005) • Two fundamental ideas can be exploited with hierarchies: • If a data set generalized at a certain level (e.g., 1345*) is k-anonymous, then it is also k-anonymous if it is even more generalized (e.g., 134**) • If a data set of N attributes is not k-anonymous if • n attributes are not fully anonymized (age) and N-n are fully anonymized (sex, zip) • then, the same data set is still not k- anonymous with • n+1 attributes are not fully anonymized (age,sex) and N-n-1 are fully anonymized (zip)

Incognito Birth date, zip code, sex Combinations of 2 attributes

Incognito Birth date, zip code, sex Combinations of 3 attributes, after non-anonymous gener. are pruned

What disease Bob is suffering from? Since Alice is Bob’s neighbor, she knows that Bob is a 31-year-old American male who lives in the zip code 13053. Therefore, Alice knows that Bob’s record number is 9,10,11, or 12. Now, all of those patients have the same medical condition (cancer), and so Alice concludes that Bob has cancer. Umeko is a 21 year old Japanese female who currently lives in zip code 13068. Therefore, Umeko’s information is contained in record number 1,2,3, or 4. +BCGR Knowledge: it is well-known that Japanese have an extremely low incidence of heart disease. Therefore, Alice concludes with near certainty that Umeko has a viral infection.

L-diversity (ICDE 2006) • Everyq-block group, has • At least k tuples • At least l well-represented values • Well-represented? • Ούτε όλες οι τιμές σε ένα group είναι ίδιες (έχω τουλάχιστον l, l>=2) • Ούτε κάποια τιμή είναι απίθανο να υπάρχει => μπορώ να συνάγω ότι ισχύει κάποια άλλη αν το l είναι σχετικά μικρό

Well-represented • Distinct l-diversity: simply l different values • Entropy l-diversity: for each pair (public value q*, sensitive value s) measure the value p(q*,s)logp(q*,s) • Entropy of a q-block with value q* is -Σsp()logp() over all sensitive values s • You need to have E > log(p) for all groups (and this can be guaranteed if it holds for the whole table, too) • Recursive l-diversity: the most frequent values do not appear too frequently and the less frequent do not appear too rarely

Mondrian (ICDE 2006) zip age Why must we generalize fully every attributes? Some records are in regions with many records and anonymity is easily preserved even by giving out more information. Some others are in sparse areas and need to be generalized more…

Mondrian (ICDE 2006) Global recoding Original data local recoding

M-invariance (SIGMOD ‘07) If I know that Bob is in group 1 + he has been taken to the hospital twice, I can deduce bronc. from: {dysp., bronch.}  {dysp.,gastr.}

M-invariance

Many other extensions • Concerning multi-relational privacy • Data perturbations • More sophisticated “local recoding” a-la Mondrian • Trajectory, set-valued, OLAP, … data • …

Ιδιωτικότητα σε Βάσεις Δεδομένων

Ιδιωτικότητα σε Βάσεις Δεδομένων

Presentation Transcript