
To Do or Not To Do: The Dilemma of Disclosing Anonymized Data
Lakshmanan L, Ng R, Ramesh G
Univ. of British Columbia

Oren Fine

Nov. 2008

CS Seminar in Databases (236826)

Once Upon a Time…

  • The police are after Edgar, a suspected drug lord.

    • Intelligence has gathered call and meeting data records as a transactional database

    • In order to incriminate Edgar, the police must find hard evidence, and wish to outsource data mining tasks to “We Mind your Data Ltd.”

    • But the police are subject to the law, and are obligated to protect the privacy of the people in the database – including Edgar, who is innocent until proven otherwise.

    • Furthermore, Edgar is watching for the smallest hint that he should disappear…

I have the pleasure of introducing: Edgar vs. The Police



  • The Classic Dilemma:

    • Keep your data close to your chest and never risk privacy or confidentiality, or…

    • Disclose the data and gain potential valuable knowledge and benefits

  • In order to decide, we need to answer a major question

    • “Just how safe is the anonymized data?”

    • Safe = protecting the identities of the objects.


  • Anonymization

  • Model the Attacker’s Knowledge

  • Determine the risk to our data

Anonymization or De-Identification

  • Replace sensitive values with generated unique content (strings, numbers)

  • Example
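The example table from the slide is not reproduced here; as a substitute, here is a minimal sketch of this style of anonymization (the names and the token scheme are hypothetical):

```python
import secrets

def anonymize(transactions):
    """Replace each distinct item with a generated opaque token,
    consistently across all transactions."""
    mapping = {}

    def token(item):
        if item not in mapping:
            mapping[item] = "P" + secrets.token_hex(4)  # e.g. 'P3fa1b2c4'
        return mapping[item]

    return [{token(item) for item in t} for t in transactions], mapping

# Hypothetical call/meeting records
db = [{"Edgar", "Steve"}, {"Angela", "Edgar"}]
anon_db, mapping = anonymize(db)
print(anon_db)  # same structure, opaque identifiers
```

Note that item frequencies are preserved exactly, which is precisely the signal the attacker will exploit below.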

Anonymization or De-Identification

  • Advantages

    • Very simple

    • Does not affect final outcome or perturb data characteristics

  • We do not suggest that anonymization is the “right” way, but it is probably the most common

Frequent Set Mining Crash Course

  • Transactional database

  • Each transaction has TID and a set of items

  • An association rule of the form X → Y has

    • Support s if s% of the transactions include X ∪ Y

    • Confidence c if c% of the transactions that include X also include Y

  • A support threshold yields frequent sets

  • A confidence threshold yields association rules

  • A k-itemset is a set of k items


Example (Cont.)

  • First, we look for frequent sets, according to a support threshold

  • 2-itemsets: {Angela, Edgar}, {Edgar, Steve} have 50% support (4 out of 8 transactions).

  • 3-itemsets: {Angela, Edgar, Steve}, {Benny, Edgar, Steve} and {Tommy, Edgar, Steve} have only 25% support (2 out of 8 transactions)

  • The rule {Edgar, Steve} → {Angela} has 50% confidence (2 out of 4 transactions), and the rule {Tommy} → {Edgar, Steve} has 66.7% confidence (2 out of 3).
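The transaction table itself did not survive the transcript; the sketch below uses a hypothetical 8-transaction database, constructed so that it reproduces all of the supports and confidences quoted above:

```python
# Hypothetical reconstruction: 8 transactions chosen to match the
# supports and confidences quoted on the slides
db = [
    {"Angela", "Edgar", "Steve", "Benny"},
    {"Angela", "Edgar", "Steve", "Tommy"},
    {"Benny", "Edgar", "Steve"},
    {"Tommy", "Edgar", "Steve"},
    {"Angela", "Edgar"},
    {"Angela", "Edgar"},
    {"Tommy"},
    {"Benny"},
]

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Angela", "Edgar"}))                # 0.5
print(support({"Angela", "Edgar", "Steve"}))       # 0.25
print(confidence({"Edgar", "Steve"}, {"Angela"}))  # 0.5
print(confidence({"Tommy"}, {"Edgar", "Steve"}))   # 0.666...
```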

Frequent Set Mining Crash Course (You’re Qualified!)

  • Widely used in market basket analysis, intrusion detection, Web usage mining and bioinformatics

  • Aimed at discovering non-trivial, not necessarily intuitive relations between items/variables in large databases: “extracting wisdom out of data”

  • Who knows what the most famous frequent set is?

Big Mart’s Database

Modeling the Attacker’s Knowledge

  • We believe that the attacker has prior knowledge about the items in the original domain

  • The prior information concerns the frequencies of items in the original domain

  • We capture the attacker’s knowledge with “Belief Functions”

Examples of Belief Functions

Consistent Mapping

  • Mapping anonymized entities to original entities only according to the belief function

Ignorant Belief Function (Q)

  • What does the graph look like?

  • What is the expected number of cracks?

  • Suppose there are n items. Further suppose that we are only interested in a partial group of size n1

  • What is the expected number of cracks now?

  • Don’t underestimate Edgar…

Ignorant Belief Function (A)
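The answer figure did not survive the transcript. Reasoning it out under the model above (and consistently with the O-estimate formula introduced later): an ignorant belief function rules nothing out, so the graph is the complete bipartite graph between the n anonymized items and the n original items. Every outdegree is n, each item is cracked with probability 1/n, and the expected number of cracks is n · (1/n) = 1, however large n is. Restricted to a partial group of size n1, the expectation is n1 · (1/n) = n1/n, i.e. less than one crack.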

Compliant Point-Valued Belief Function (Q)

  • What does the graph look like?

  • What is the expected number of cracks?

  • Suppose there are n items. Further suppose that we are only interested in a partial group of size n1

  • What is the expected number of cracks now?

  • Unless he has an inside source, we shouldn’t overestimate Edgar either…

Compliant Point-Valued Belief Function (A)
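Again the answer figure is lost; within the same model: a point-valued belief function that complies with the true frequencies maps each anonymized item exactly to the original items in its own frequency group, so the graph is a disjoint union of complete bipartite blocks, one per frequency group. An item in a group of size m has outdegree m and is cracked with probability 1/m; each group thus contributes m · (1/m) = 1 expected crack, and the total expectation equals the number of frequency groups considered. In particular, an item that is alone in its frequency group is cracked with certainty.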

Compliant Interval Belief Functions

  • Direct Computation Method

    • Build a graph G and its adjacency matrix A_G

    • The probability of cracking k out of n items is expressed in terms of the permanent of A_G

  • Computing the permanent is known to be a #P-complete problem; the state-of-the-art approximation algorithm runs in O(n^22) time!!

  • What the !#$!% is a permanent or #P-complete?


The Permanent

  • The permanent of an n×n matrix A is perm(A) = Σ_σ Π_{i=1..n} a_{i,σ(i)}

  • The sum is over all permutations σ of 1, 2, …, n

  • Calculating the permanent is #P-complete

  • Which brings us to…


#P

  • Unlike the well-known complexity classes, which consist of decision problems, #P is a class of function problems

  • A #P problem asks “compute f(x),” where f(x) is the number of accepting paths of an NP machine on input x

  • Example

    • NP: Are there any subsets of a list of integers that add up to zero?

    • #P: How many subsets of a list of integers add up to zero?
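For intuition, a minimal brute-force sketch of the permanent; it enumerates all n! permutations, which is exactly why the #P-completeness above bites. For the 0/1 adjacency matrix of a bipartite mapping graph, the permanent counts the perfect matchings, i.e. the consistent mappings:

```python
from itertools import permutations
from math import prod

def permanent(A):
    """Brute-force permanent: sum over all permutations sigma of
    prod_i A[i][sigma(i)] -- the determinant without the signs.
    O(n! * n) time, so only feasible for tiny matrices."""
    n = len(A)
    return sum(prod(A[i][s[i]] for i in range(n)) for s in permutations(range(n)))

# 0/1 adjacency matrix of a small (hypothetical) bipartite mapping graph
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
print(permanent(A))  # 3 consistent one-to-one mappings
```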

Chain Belief Functions


  • A general belief function does not always produce a chain…

  • We seek a way to estimate the number of cracks.

The O-estimate Heuristic

  • Suppose a graph G and an interval belief function β.

  • For each x, let O_x denote the outdegree of x in G.

  • The probability of cracking x is simply 1/O_x

  • The expected number of cracks is Oest = Σ_x 1/O_x
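A minimal sketch of the heuristic, assuming the graph is represented as a mapping from each anonymized item to the set of original items its belief function permits (names are hypothetical):

```python
def o_estimate(graph):
    """O-estimate heuristic: an item x with outdegree O_x is cracked
    with probability 1/O_x; the expected number of cracks is the sum."""
    return sum(1 / len(candidates) for candidates in graph.values())

# Hypothetical bipartite graph: anonymized item -> plausible originals
graph = {
    "P1": {"Edgar", "Steve", "Benny"},   # outdegree 3
    "P2": {"Angela", "Tommy"},           # outdegree 2
    "P3": {"Angela", "Tommy"},           # outdegree 2
}
print(o_estimate(graph))  # 1/3 + 1/2 + 1/2 ≈ 1.33
```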

Properties of O-estimate

  • Inexact (hence “estimate”)

  • Monotonic

Partially Compliant Belief Functions

  • Suppose we “somehow” know which items are guessed wrong

  • We sum the O-estimates only over the compliant frequency groups

Risk Assessment

  • Worst case / best case – unrealistic

  • Determine the interval width

    • Twice the median gap of all successive frequency groups

    • Why?

  • Determine the degree of compliancy

    • Perform a binary search on the degree of compliancy, subject to a specified “degree of tolerance”.

End-to-End Example

  • These intelligence call & meeting data records are classified “Top Secret”

We Anonymize the Database

Frequency Groups

  • The gaps between the frequency groups: 1/8, 1/8, 1/8, 1/8, 2/8

  • The median gap = 1/8, so the interval width is twice that: 1/4

The Attacker’s Prior Knowledge

The Graph, By the Way…

Calculating the Risk

  • Oest = 1/4 + 1/7 + 1/3 + 1/4 + 1/7 + 1/9 + 1/7 + 1/9 + 1/9 + 1/7 + 1/7 + 1/7 ≈ 2.024

  • Now it’s a question of how much you would tolerate...

  • Note that this is the expected number of cracks over the whole database. If we are interested in Edgar specifically, then as we’ve seen in the previous lemmas, his probability of being cracked is 1/3.
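As a sanity check, the twelve outdegrees implied by the sum above can be fed to the same formula:

```python
outdegrees = [4, 7, 3, 4, 7, 9, 7, 9, 9, 7, 7, 7]  # read off the sum above
print(sum(1 / d for d in outdegrees))  # 2.0238..., the O-estimate above
```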


Open Problems

  • The attacker’s prior knowledge remains a largely unsolved issue

  • This article does not really deal with frequent sets, but rather with frequent items

    • Frequent sets can add more information and differentiate objects within one frequency group

Modeling the Attacker’s Knowledge in the Real World

  • A report for the Canadian Privacy Commissioner includes a broad mapping of adversary knowledge sources:

    • Mapping phone directories

    • CVs

    • Inferring gender, year of birth, and postal code from other details

    • Data remnants on second-hand hard disks

    • Etc.

All’s Well That Ends Well


References

  • Lakshmanan L., Ng R., Ramesh G. To Do or Not To Do: The Dilemma of Disclosing Anonymized Data. In Proc. ACM SIGMOD Conference, 2005.

  • Agrawal R., Srikant R. Fast Algorithms for Mining Association Rules. In Proc. 20th Int. Conf. on Very Large Data Bases (VLDB ’94), Santiago, Chile, 1994, pp. 487–499.

  • El Emam K., et al. Pan-Canadian De-Identification Guidelines for Personal Health Information. April 2007.

  • Wikipedia

    • Association rule

    • #P

    • Permanent

Questions?
