Beyond k-Anonymity: A Decision Theoretic Framework for Assessing Privacy Risk M. Scannapieco, G. Lebanon, M.R. Fouad and E. Bertino
Introduction • Release of data • Private organizations can benefit from sharing data with others • Public organizations see shared data as a value for society • Privacy preservation • Data disclosure can lead to economic damage, threats to national security, etc. • Regulated by law in both the private and public sectors
Two Facets of Data Privacy • Identity disclosure • Uncontrolled data release: identifiers may even be present • Anonymous data release: identifiers are suppressed, but there is no control over possible linking with other sources
Linkage of Anonymous Data • [Figure: two anonymized tables T1 and T2 linked through a shared quasi-identifier]
Two Facets of Data Privacy (cont.) • Sensitive information disclosure • Once identity disclosure occurs, the loss due to such disclosure depends on how sensitive the related data are • Data sensitivity is subjective • E.g., age is in general more sensitive for women than for men
Our proposal • A framework for assessing privacy risk that takes into account both facets of privacy • Based on statistical decision theory • Definition and analysis of disclosure policies, modelled by disclosure rules, and of several privacy risk functions • Estimated risk as an upper bound of the true risk, and related complexity analysis • An algorithm for finding the disclosure rule that minimizes the privacy risk
Disclosure rules • A disclosure rule δ is a function that maps a record x to a new record δ(x) in which some attributes may have been suppressed: δ_j(x) = ⊥ if the j-th attribute is suppressed, δ_j(x) = x_j otherwise
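The rule above can be sketched in Python. The tuple record layout, the use of `None` as the suppression symbol ⊥, and the helper names are illustrative assumptions, not notation from the paper:

```python
# Sketch of a disclosure rule: a function mapping a record to a copy in
# which some attributes are suppressed (replaced by a marker standing in
# for the "bottom" symbol).

SUPPRESSED = None  # illustrative stand-in for the suppression symbol

def make_rule(suppressed_attrs):
    """Build a disclosure rule that suppresses the given attribute indices."""
    def rule(record):
        return tuple(SUPPRESSED if j in suppressed_attrs else v
                     for j, v in enumerate(record))
    return rule

rule = make_rule({2})                        # suppress the 3rd attribute
print(rule(("Alice", "Smith", "555-0100")))  # ('Alice', 'Smith', None)
```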
Loss function • Let θ be the side information used by the attacker in the identification attempt • The loss function ℓ(δ(x), θ) measures the loss incurred by disclosing the data z = δ(x), due to possible identification based on θ • The empirical distribution p is associated with the records x1, …, xn
Risk Definition • The risk of the disclosure rule δ in the presence of the side information θ is the average loss of disclosing x1, …, xn: R(δ, θ) = (1/n) Σ_{i=1..n} ℓ(δ(x_i), θ)
Putting the pieces together so far… • A hypothetical attacker performs an identification attempt on a disclosed record y = δ(x) on the basis of side information θ, which can be a dictionary • The dictionary is used to link y with some entry it contains • Example: • y has the form (name, surname, phone#), θ is a phone book • If all attributes are revealed, y is likely linked with one entry • If phone# is suppressed (or missing), y may or may not be linked to a single entry, depending on the popularity of (name, surname)
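The phone-book example can be sketched as follows; the data and the helper functions are made up for illustration, with `None` standing in for a suppressed value:

```python
# Illustrative linkage attempt: count the dictionary entries consistent
# with a disclosed record, treating None (suppressed) as unconstrained.

def matches(disclosed, entry):
    return all(d is None or d == e for d, e in zip(disclosed, entry))

def candidates(disclosed, dictionary):
    return [e for e in dictionary if matches(disclosed, e)]

phone_book = [
    ("John", "Smith", "555-0101"),
    ("John", "Smith", "555-0102"),
    ("Mary", "Jones", "555-0103"),
]

# All attributes revealed: linked to a unique entry.
print(len(candidates(("John", "Smith", "555-0101"), phone_book)))  # 1
# Phone suppressed: ambiguous, because (John, Smith) is popular.
print(len(candidates(("John", "Smith", None), phone_book)))        # 2
```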
Risk formulation • Let’s decompose the loss function into an identification part and a sensitivity part • Identification part: formalized by the random variable Z, with Z = 1 if the disclosed record is identified (linked to the correct dictionary entry), Z = 0 otherwise
Risk formulation (cont.) • Sensitivity part: a function Φ(δ(x)), where higher values indicate higher sensitivity • Therefore the loss is: ℓ(δ(x), θ) = Φ(δ(x)) · P(Z = 1) = Φ(δ(x)) / |ρ(δ(x), θ)|, where ρ(y, θ) is the set of entries of θ consistent with y
Risk formulation (cont.) • Risk: R(δ, θ) = (1/n) Σ_{i=1..n} Φ(δ(x_i)) / |ρ(δ(x_i), θ)|, with ρ(y, θ) the set of entries of θ consistent with y
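A minimal sketch of this risk computation, assuming the dictionary-matching interpretation of the identification part. The data, the constant sensitivity function, and the zero-loss convention for unlinkable records are illustrative assumptions:

```python
# Risk = average over the table of: sensitivity of the disclosed record
# divided by the number of dictionary entries consistent with it.

def matches(disclosed, entry):
    return all(d is None or d == e for d, e in zip(disclosed, entry))

def loss(disclosed, dictionary, sensitivity):
    n_candidates = sum(matches(disclosed, e) for e in dictionary)
    if n_candidates == 0:
        return 0.0                    # assumed convention: no link, no loss
    return sensitivity(disclosed) / n_candidates

def risk(records, rule, dictionary, sensitivity):
    return sum(loss(rule(x), dictionary, sensitivity)
               for x in records) / len(records)

table = [("John", "Smith", "555-0101"), ("Mary", "Jones", "555-0103")]
phone_book = table + [("John", "Smith", "555-0102")]

def suppress_phone(x):
    return (x[0], x[1], None)

def constant_phi(y):
    return 1.0                        # constant sensitivity, for simplicity

# John Smith matches 2 entries (loss 0.5); Mary Jones matches 1 (loss 1.0).
print(risk(table, suppress_phone, phone_book, constant_phi))  # 0.75
```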
Disclosure Rule vs. Privacy Risk • Suppose that θtrue is the attacker’s true dictionary, which is publicly available, and that θ* is the actual database from which the data will be published • Under the following assumptions: • θtrue contains more records than θ* (θ* ⊆ θtrue) • The non-⊥ values in θtrue will be more limited than the non-⊥ values in θ* Theorem: If θ* contains records that correspond to x1, …, xn and θ* ⊆ θtrue, then: R(δ, θtrue) ≤ R(δ, θ*)
Disclosure Rule vs. Privacy Risk (cont.) • The theorem proves that the true risk is bounded by R(δ, θ*) • Under the hypothesis that the distribution underlying θ factorizes into a product form Theorem: The rule that minimizes the risk, δ* = arg min_δ R(δ, θ), can be found in O(nNm) computation
k-Anonymity • k-Anonymity is simply a special case of our framework, in which: • θtrue = T • Φ is a constant • the loss function is underspecified • Our framework thus exposes some questionable hypotheses underlying k-anonymity!
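For comparison, a minimal k-anonymity check over a made-up table: a release is k-anonymous with respect to a quasi-identifier if every combination of quasi-identifier values occurs in at least k records:

```python
# Minimal k-anonymity check. The table (birth year, ZIP, diagnosis) and
# the choice of quasi-identifier columns are illustrative.

from collections import Counter

def is_k_anonymous(table, qi_columns, k):
    groups = Counter(tuple(row[j] for j in qi_columns) for row in table)
    return all(count >= k for count in groups.values())

released = [
    ("1975", "10001", "flu"),
    ("1975", "10001", "cold"),
    ("1980", "10002", "flu"),
    ("1980", "10002", "flu"),
]

print(is_k_anonymous(released, qi_columns=(0, 1), k=2))  # True
print(is_k_anonymous(released, qi_columns=(0, 1), k=3))  # False
```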
Conclusions • A new framework for privacy risk that takes sensitivity into account • Risk estimation as an upper bound for the true privacy risk • An efficient algorithm for risk computation • A generalization of k-anonymity