Oren Fine Nov. 2008 CS Seminar in Databases (236826)


### To Do or Not To Do: The Dilemma of Disclosing Anonymized Data

Lakshmanan L., Ng R., Ramesh G., Univ. of British Columbia

Oren Fine

Nov. 2008

CS Seminar in Databases (236826)

Once Upon a Time…

- The police are after Edgar, a suspected drug lord.
- Intelligence has gathered call and meeting data records as a transactional database
- To build a solid case against Edgar, the police must find hard evidence, and wish to outsource the data mining tasks to “We Mind your Data Ltd.”
- But the police are subject to the law and are obligated to protect the privacy of the people in the database, including Edgar, who is innocent until proven otherwise.
- Furthermore, Edgar is watching for the smallest hint that it is time to disappear…

Motivation

- The Classic Dilemma:
- Keep your data close to your chest and never risk privacy or confidentiality or…
- Disclose the data and gain potentially valuable knowledge and benefits

- In order to decide, we need to answer a major question
- “Just how safe is the anonymized data?”
- Safe = protecting the identities of the objects.

Agenda

- Anonymization
- Model the Attacker’s Knowledge
- Determine the risk to our data

Anonymization or De-Identification

- Transform sensitive data into generated unique content (strings, numbers)
- Example
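The example from this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the idea, assuming anonymization means replacing every item with a unique, randomly assigned integer that is used consistently across all transactions. The function name `anonymize` and the toy records are hypothetical, not taken from the slides.

```python
import random


def anonymize(transactions, seed=None):
    """Replace every item with a unique integer label, consistently everywhere."""
    items = sorted({item for t in transactions for item in t})
    labels = list(range(1, len(items) + 1))
    # Shuffle so that the label order reveals nothing about the item order.
    random.Random(seed).shuffle(labels)
    mapping = dict(zip(items, labels))
    anonymized = [{mapping[item] for item in t} for t in transactions]
    return anonymized, mapping


# Hypothetical toy records, not the table from the slides.
records = [{"Angela", "Edgar"}, {"Edgar", "Steve"}, {"Angela", "Edgar", "Steve"}]
anon_records, secret_mapping = anonymize(records, seed=0)
print(anon_records)
```

The mapping is one-to-one, so item frequencies and co-occurrences are unchanged. That is why mining results are unaffected, and also why frequencies can leak identities.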

Anonymization or De-Identification

- Advantages
- Very simple
- Does not affect final outcome or perturb data characteristics

- We do not suggest that anonymization is the “right” way, but it is probably the most common

Frequent Set Mining Crash Course

- Transactional database
- Each transaction has TID and a set of items
- An association rule of the form X → Y has
- Support s if s% of the transactions include both X and Y
- Confidence c if c% of the transactions that include X also include Y

- Support determines the frequent sets
- Confidence determines the association rules
- A k-itemset is a set of k items
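The two definitions above can be sketched in a few lines. This is an illustrative helper, not code from the paper; the names `support` and `confidence` and the toy database are assumptions, and exact fractions are used instead of percentages.

```python
from fractions import Fraction


def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(transactions, lhs, rhs):
    """Among transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


# Hypothetical toy database of four transactions.
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(support(toy_db, {"a", "b"}))        # 1/2
print(confidence(toy_db, {"a"}, {"b"}))   # 2/3
```

Note that `confidence` divides by the support of the left-hand side, so it is only defined for rules whose left-hand side occurs at least once.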

Example (Cont.)

- First, we look for frequent sets, according to a support threshold
- 2-itemsets: {Angela, Edgar} and {Edgar, Steve} have 50% support (4 out of 8 transactions).
- 3-itemsets: {Angela, Edgar, Steve}, {Benny, Edgar, Steve} and {Tommy, Edgar, Steve} have only 25% support (2 out of 8 transactions)
- The rule {Edgar, Steve} → {Angela} has 50% confidence (2 out of 4 transactions) and the rule {Tommy} → {Edgar, Steve} has 66.6% confidence (2 out of 3 transactions).

Frequent Set Mining Crash Course (You’re Qualified!)

- Widely used in market basket analysis, intrusion detection, Web usage mining and bioinformatics
- Aimed at discovering non-trivial or not necessarily intuitive relations between items/variables of large databases: “Extracting wisdom out of data”
- Who knows what the most famous frequent set is?

Modeling the Attacker’s Knowledge

- We believe that the attacker has prior knowledge about the items in the original domain
- The prior information regards the frequencies of items in the original domain
- We capture the attacker’s knowledge with “Belief Functions”

Consistent Mapping

- Mapping anonymized entities to original entities only according to the belief function

Ignorant Belief Function (Q)

- What does the graph look like?
- What is the expected number of cracks?
- Suppose there are n items. Further suppose that we are only interested in a subset of them, of size n1
- What is the expected number of cracks now?
- Don’t you underestimate Edgar…
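One way to explore these questions is brute force. The sketch below assumes that an ignorant belief function rules nothing out, so the attacker's guess is a uniformly random one-to-one mapping and a "crack" is an item mapped back to itself. The function name and the choice n = 5 are illustrative only.

```python
from fractions import Fraction
from itertools import permutations


def expected_cracks_ignorant(n, items_of_interest=None):
    """Average number of correctly guessed items over all n! mappings."""
    interest = set(range(n) if items_of_interest is None else items_of_interest)
    total = 0
    count = 0
    for guess in permutations(range(n)):
        # guess[i] == i means anonymized item i was mapped to its true identity.
        total += sum(1 for i in interest if guess[i] == i)
        count += 1
    return Fraction(total, count)


print(expected_cracks_ignorant(5))          # 1
print(expected_cracks_ignorant(5, [0, 1]))  # 2/5
```

Under this assumption the expectation is 1 regardless of n, and n1/n when only n1 items matter, since each item is guessed correctly with probability 1/n.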

Compliant Point-Valued Belief Function (Q)

- What does the graph look like?
- What is the expected number of cracks?
- Suppose there are n items. Further suppose that we are only interested in a subset of them, of size n1
- What is the expected number of cracks now?
- Unless he has an inside source, we shouldn’t overestimate Edgar either…
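A rough sketch of how these questions could be answered, assuming the attacker knows every item's exact frequency and guesses uniformly among the mappings that respect those frequencies. Items that share a frequency are then indistinguishable, so each is cracked with probability one over the size of its frequency group. The function name and toy frequencies are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def expected_cracks_point_valued(frequencies, items_of_interest=None):
    """Sum over items of 1 / (size of the item's frequency group)."""
    group_size = Counter(frequencies.values())
    if items_of_interest is None:
        items_of_interest = frequencies.keys()
    return sum(Fraction(1, group_size[frequencies[x]]) for x in items_of_interest)


# Hypothetical frequencies: "a" and "b" share a group, "c" is alone in its group.
toy_freq = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(1, 4)}
print(expected_cracks_point_valued(toy_freq))         # 2
print(expected_cracks_point_valued(toy_freq, ["a"]))  # 1/2
```

Over all items the sum equals the number of distinct frequency groups, because each group contributes exactly one expected crack.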

Compliant Point-Valued Belief Function (A)

Compliant Interval Belief Functions

- Direct Computation Method
- Build a graph G and its adjacency matrix $A_G$
- The probability of cracking k out of n items:

- Computing the permanent is known to be a #P-complete problem; the state-of-the-art approximation runs in $O(n^{22})$!!
- What the !#$!% is a permanent or #P-complete?

Permanent

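- The permanent of an n×n matrix $A = (a_{i,j})$ is $\operatorname{perm}(A) = \sum_{\sigma \in S_n} \prod_{i=1}^{n} a_{i,\sigma(i)}$
- The sum is over all permutations σ of 1, 2, …, n
- Calculating the permanent is #P-complete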
- Which brings us to…
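For intuition, the permanent can be computed straight from its definition. The sketch below is illustrative only: it enumerates all n! permutations, which is exactly why it is hopeless for anything but tiny matrices.

```python
from itertools import permutations
from math import prod


def permanent(matrix):
    """Sum, over all permutations sigma, of the product of matrix[i][sigma[i]]."""
    n = len(matrix)
    return sum(
        prod(matrix[i][sigma[i]] for i in range(n))
        for sigma in permutations(range(n))
    )


print(permanent([[1, 2], [3, 4]]))                   # 1*4 + 2*3 = 10
print(permanent([[1, 1, 1], [1, 1, 1], [1, 1, 1]]))  # 3! = 6
```

For a 0/1 adjacency matrix of a bipartite graph, each non-zero term corresponds to one perfect matching, so the permanent counts the one-to-one mappings the graph allows.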

#P-Complete

- Unlike the well-known complexity classes, which consist of decision problems, #P is a class of function (counting) problems
- "Compute f(x)", where f is the number of accepting paths of a nondeterministic polynomial-time (NP) machine
- Example
- NP: Are there any subsets of a list of integers that add up to zero?
- #P: How many subsets of a list of integers add up to zero?
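The difference between the two questions can be shown with a brute-force sketch. It assumes that only non-empty subsets count (the empty subset sums to zero trivially); the function names and the sample list are illustrative.

```python
from itertools import combinations


def zero_sum_subsets(numbers):
    """Yield every non-empty subset (as a tuple) whose elements sum to zero."""
    for size in range(1, len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == 0:
                yield subset


def exists_zero_sum_subset(numbers):
    """Decision version (NP): is there at least one such subset?"""
    return any(True for _ in zero_sum_subsets(numbers))


def count_zero_sum_subsets(numbers):
    """Counting version (#P): how many such subsets are there?"""
    return sum(1 for _ in zero_sum_subsets(numbers))


values = [3, -3, 1, 2, -2]
print(exists_zero_sum_subset(values))  # True
print(count_zero_sum_subsets(values))  # 4
```

The decision version can stop at the first witness, while the counting version has to account for all of them.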

Unfortunately…

- A general belief function does not always produce a chain…
- We seek a way to estimate the expected number of cracks.

The O-estimate Heuristic

- Suppose a graph G and an interval belief function β.
- For each x, let $O_x$ denote the outdegree of x in G.
- The probability of cracking x is simply $1/O_x$
- The expected number of cracks is $\sum_x 1/O_x$
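A minimal sketch of the heuristic, under one reading of the graph: an anonymized item has an edge to every original item whose believed frequency interval contains the anonymized item's observed frequency, and its outdegree is the number of such candidates. The function name, data layout and toy numbers are assumptions made for illustration.

```python
from fractions import Fraction


def o_estimate(observed_freq, belief):
    """Sum of 1/outdegree over all anonymized items.

    observed_freq: anonymized item -> observed frequency
    belief:        original item   -> (low, high) believed frequency interval
    """
    total = Fraction(0)
    for freq in observed_freq.values():
        outdegree = sum(1 for low, high in belief.values() if low <= freq <= high)
        # With a compliant belief function the true item is always a candidate,
        # so outdegree >= 1; items without candidates are skipped defensively.
        if outdegree:
            total += Fraction(1, outdegree)
    return total


# Hypothetical toy data: three anonymized items, three original items.
observed = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
beliefs = {
    "a": (Fraction(3, 8), Fraction(5, 8)),
    "b": (Fraction(1, 8), Fraction(3, 8)),
    "c": (Fraction(1, 8), Fraction(1, 2)),
}
print(o_estimate(observed, beliefs))  # 3/2
```

Each anonymized item here has two candidates, so the estimate is 3 × 1/2. The heuristic looks only at outdegrees and ignores how the choices interact, which is why it is an estimate rather than the exact expectation.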

Properties of O-estimate

- Inexact (hence “estimate”)
- Monotonic

α-Compliant Belief Function

- Suppose we “somehow” know which items are guessed wrong
- We sum the O-estimates only over the compliant frequency groups

Risk Assessment

- Worst case / best case: unrealistic
- Determine the interval width
- Twice the median gap between successive frequency groups
- Why?

- Determine the degree of compliancy
- Perform a binary search on α, subject to a “degree of tolerance”.

End to End Example

- These Intel. Calls & Meeting DR are classified “Top Secret”

Frequency Groups

- The gaps between the frequency groups: 1/8, 1/8, 1/8, 1/8, 2/8
- The median gap = 1/8
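The interval-width rule from the Risk Assessment slide (twice the median gap) can be checked against these numbers. The frequency table itself is missing from the transcript, so the group frequencies below are hypothetical, chosen only so that their successive gaps match the ones listed above.

```python
from fractions import Fraction
from statistics import median


def interval_width(group_frequencies):
    """Twice the median gap between successive distinct frequencies."""
    ordered = sorted(set(group_frequencies))
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return 2 * median(gaps)


# Hypothetical group frequencies with gaps 1/8, 1/8, 1/8, 1/8, 2/8.
groups = [Fraction(k, 8) for k in (1, 2, 3, 4, 5, 7)]
print(interval_width(groups))  # 1/4
```

With a median gap of 1/8 the interval width comes out as 1/4. The helper needs at least two distinct frequency groups, otherwise there is no gap to take a median of.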

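The Graph, By the Way…

[Figure: graph whose node labels, as far as they survive extraction, are the numbers 2 through 12 and the names Angela, Ariel, Edgar, Steve, Benny, Hassan, Tommy, Joe, Sara, Israel, Noa and Mahhmud]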

Calculating the Risk

- O-estimate = 1/4 + 1/7 + 1/3 + 1/4 + 1/7 + 1/9 + 1/7 + 1/9 + 1/9 + 1/7 + 1/7 + 1/7 ≈ 2.024
- Now it’s a question of how much you would tolerate...
- Note that this is the expected number of cracks. If we are interested only in Edgar, however, the probability of a crack is 1/3, as we’ve seen in the previous lemmas.
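The sum on this slide is easy to verify with exact arithmetic. The denominators are copied from the slide in the order given; treating them as the outdegrees of the twelve items is the only assumption.

```python
from fractions import Fraction

# Denominators of the twelve terms, in the order they appear on the slide.
outdegrees = [4, 7, 3, 4, 7, 9, 7, 9, 9, 7, 7, 7]

o_est = sum(Fraction(1, d) for d in outdegrees)
print(o_est)         # 85/42
print(float(o_est))  # roughly 2.0238
```

So about two of the twelve identities are expected to be cracked, while the single term 1/3 is the largest individual crack probability in the sum.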

Open Problems

- The attacker’s prior knowledge remains a largely unsolved issue
- This article does not really deal with frequent sets but rather frequent items
- Frequent sets can add more information and differentiate objects from one frequency group

Modeling the Attacker’s Knowledge in the Real World

- A report for the Canadian Privacy Commissioner gives a broad mapping of adversary knowledge
- Mapping phone directories
- CVs
- Inferring gender, year of birth and postal code from different details
- Data remnants on 2nd hand hard disks
- Etc.

Bibliography

- Lakshmanan L., Ng R., Ramesh G. To Do or Not To Do: The Dilemma of Disclosing Anonymized Data. ACM SIGMOD Conference, 2005.
- Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), Santiago, Chile, pp. 487–499.
- Pan-Canadian De-Identification Guidelines for Personal Health Information, Khaled El-Emam et al., April 2007.
- Wikipedia
- Association rule
- #P
- Permanent
