Privacy-Preserving Data Mining
Presentation Transcript

Privacy-Preserving Data Mining

Rakesh Agrawal Ramakrishnan Srikant

IBM Almaden Research Center

650 Harry Road, San Jose, CA 95120

Published in: ACM SIGMOD International Conference on Management of Data, 2000.

Slides by: Adam Kaplan (for CS259, Fall’03)


What is Data Mining?

  • Information Extraction

    • Performed on databases.

  • Combines theory from

    • machine learning

    • statistical analysis

    • database technology

  • Finds patterns/relationships in data

  • Trains the system to predict future results

  • Typical applications:

    • customer profiling

    • fraud detection

    • credit risk analysis

Definition borrowed from the Two Crows corporate website: http://www.twocrows.com


A Simple Example of Data Mining

[Figure: TRAINING DATA passes through Data Mining to produce a decision-tree classifier for CREDIT RISK; internal nodes test Age < 25 and Salary < 50k, and leaf nodes label the risk High or Low.]

  • Recursively partition the training data to build a decision-tree classifier (a minimal sketch follows this slide)

    • Non-leaf nodes = split points; test a specific data attribute

    • Leaf nodes = entirely or “mostly” represent data from the same class

  • Previous well-known methods to automate classification

    • [Mehta, Agrawal, Rissanen EDBT’96] – SLIQ paper

    • [Shafer, Agrawal, Mehta VLDB’96] – SPRINT paper
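
To make the recursive-partitioning idea concrete, here is a minimal Python sketch. The impurity measure (Gini), the leaf-purity threshold, and the toy Age/Salary records are all illustrative assumptions; this sketches the general technique, not the SLIQ or SPRINT algorithms cited above.

```python
# Minimal recursive-partitioning sketch (illustrative, not SLIQ/SPRINT).
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every attribute/threshold pair; return the split minimizing impurity."""
    best = None
    for attr in rows[0]:
        for threshold in {r[attr] for r in rows}:
            left = [i for i, r in enumerate(rows) if r[attr] < threshold]
            right = [i for i in range(len(rows)) if i not in left]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, threshold, left, right)
    return best

def build_tree(rows, labels):
    """Leaf when the node is (mostly) pure; otherwise split and recurse."""
    split = best_split(rows, labels)
    if gini(labels) < 0.1 or split is None:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    _, attr, threshold, left, right = split
    return {"split": (attr, threshold),
            "yes": build_tree([rows[i] for i in left], [labels[i] for i in left]),
            "no":  build_tree([rows[i] for i in right], [labels[i] for i in right])}

# Toy training data in the spirit of the slide's credit-risk example.
data = [{"age": 23, "salary": 30}, {"age": 40, "salary": 40},
        {"age": 45, "salary": 90}, {"age": 22, "salary": 80}]
risk = ["High", "High", "Low", "High"]
print(build_tree(data, risk))
```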


Where does privacy fit in?

  • Data mining performed on databases

    • Many of them contain sensitive/personal data

      • e.g. Salary, Age, Address, Credit History

    • Much of data mining concerned with aggregates

      • Statistical models of many records

      • May not require precise information from each record

  • Is it possible to…

    • Build a decision-tree classifier accurately

      • without accessing actual fields from user records?

        • (…thus protecting privacy of the user)


Preserving Data Privacy (1)

  • Value-Class Membership

    • Discretization: values for an attribute are discretized into intervals

      • Intervals need not be of equal width.

      • Use the interval covering the data in computation, rather than the data itself.

    • Example:

      • Perhaps Adam doesn’t want people to know he makes $4000/year.

        • Maybe he’s more comfortable saying he makes between $0 - $20,000 per year.

    • Discretization is the most commonly used method for hiding individual values (a small sketch follows).
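
A minimal sketch of value-class membership; the helper name and the salary boundaries are made up for this example:

```python
def to_interval(value, boundaries):
    """Report the interval covering `value` instead of the value itself.
    `boundaries` lists the interval endpoints in increasing order; the
    intervals need not be of equal width."""
    for low, high in zip(boundaries, boundaries[1:]):
        if low <= value < high:
            return (low, high)
    return (boundaries[-1], float("inf"))   # open-ended top interval

# Adam's $4,000/year is reported only as "between $0 and $20,000".
print(to_interval(4_000, [0, 20_000, 50_000, 100_000]))   # (0, 20000)
```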


Preserving Data Privacy (2)

  • Value Distortion

    • Instead of using the actual value xi,

    • use xi + r, where r is a random value drawn from a distribution:

      • Uniform Distribution

        • r is uniformly distributed over [-α, +α]

        • the mean of r is 0

      • Gaussian Distribution

        • r has a normal distribution

        • the mean μ of r is 0

        • the standard deviation of r is σ
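
A small sketch of both distortion methods using Python's standard library; the parameter values are illustrative:

```python
import random

def distort_uniform(x, alpha):
    """Report x + r, with r drawn uniformly from [-alpha, +alpha] (mean 0)."""
    return x + random.uniform(-alpha, alpha)

def distort_gaussian(x, sigma):
    """Report x + r, with r drawn from N(0, sigma^2) (mean 0, std dev sigma)."""
    return x + random.gauss(0.0, sigma)

salary = 40_000
print(distort_uniform(salary, alpha=10_000))   # e.g. 34251.7
print(distort_gaussian(salary, sigma=5_000))   # e.g. 42873.2
```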


What do we mean by “private?”

W = width of intervals in discretization

  • If we can estimate with c% confidence that the value x lies within the interval [x1, x2], then the privacy offered at that confidence level is the interval width (x2 - x1).

  • If we want very high privacy

    • choose 2α > W

    • at higher confidence levels, the value-distortion methods (Uniform, Gaussian) then provide more privacy than discretization (a sketch of the metric follows).
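
A sketch of this privacy metric for the two distortion methods; the closed forms below follow from the uniform and normal densities, and the parameter values are illustrative:

```python
# Width of the interval that contains the true value x with confidence c,
# for each distortion method.
from statistics import NormalDist

def privacy_uniform(alpha, c):
    """Uniform noise on [-alpha, +alpha]: an interval of width w captures x
    with probability w / (2*alpha), so the c-confidence width is 2*alpha*c."""
    return 2 * alpha * c

def privacy_gaussian(sigma, c):
    """Gaussian noise N(0, sigma^2): the central interval capturing x with
    probability c has half-width sigma * Phi^-1((1 + c) / 2)."""
    return 2 * sigma * NormalDist().inv_cdf((1 + c) / 2)

print(privacy_uniform(alpha=1000, c=0.95))   # 1900.0  (i.e., 1.9 * alpha)
print(privacy_gaussian(sigma=1000, c=0.95))  # ~3919.9 (i.e., ~3.92 * sigma)
```

Note how the Gaussian interval is wider at high confidence, which matches the conclusion slide's point that Gaussian noise provides more privacy than Uniform at higher confidence levels.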


Reconstructing Original Distribution From Distorted Values (1)

  • Define:

    • Original data values: x1, x2, …, xn

    • Random variable distortion: y1, y2, …, yn

    • Distorted samples: x1+y1, x2+y2, …, xn+yn

    • FY: the Cumulative Distribution Function (CDF) of the random distortion values yi

    • FX: the CDF of the original data values xi


Reconstructing Original Distribution From Distorted Values (2)

  • The Reconstruction Problem

    • Given

      • FY

      • distorted samples (x1+y1,…, xn+yn)

    • Estimate FX


Reconstruction Algorithm (1)

  • How it works (incremental refinement of the estimate of the original distribution FX):

  • Initialize f(x, 0) to the uniform distribution

  • For j = 0, 1, 2, … until the stopping criterion is met:

    • compute f(x, j+1) as a function of f(x, j) and FY (the update rule is given below)

  • When the loop stops, f(x, j) is the estimate of FX
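
Concretely, the refinement step applies Bayes' rule to the observed distorted values wi = xi + yi, working with the density functions fX and fY (the derivatives of the CDFs defined earlier):

$$
f_X^{j+1}(a) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X^{j}(z)\, dz}
$$

Each iteration re-weights the current estimate by how well each candidate value a explains the observed samples.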


Reconstruction Algorithm (2)

  • Stopping Criterion

  • Compare successive estimates f(x, j) and f(x, j+1).

  • Stop when the difference between successive estimates becomes very small (a runnable sketch of the whole loop follows).
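
A minimal runnable sketch of the whole loop, discretizing the density onto a grid. The grid, the noise model, the tolerance, and the triangular example data are all assumptions for illustration; the paper's own implementation approximates the densities over interval partitions, which the grid stands in for here.

```python
import numpy as np

def reconstruct(w, noise_pdf, grid, tol=1e-4, max_iter=500):
    """Estimate the density of the original values x on `grid`, given the
    distorted samples w = x + y and the known noise density `noise_pdf`."""
    dx = grid[1] - grid[0]
    f = np.full(len(grid), 1.0 / (len(grid) * dx))    # f(x, 0): uniform start
    k = noise_pdf(w[:, None] - grid[None, :])         # f_Y(w_i - a), shape (n, m)
    for _ in range(max_iter):
        denom = (k * f).sum(axis=1) * dx              # per-sample normalizer (the integral)
        f_new = (k / denom[:, None]).mean(axis=0) * f # Bayes-rule update
        f_new /= f_new.sum() * dx                     # keep it a proper density
        if np.abs(f_new - f).max() < tol:             # stopping criterion
            return f_new
        f = f_new
    return f

# Example: triangular original data, uniform noise on [-1, +1].
rng = np.random.default_rng(0)
x = rng.triangular(0.0, 2.0, 4.0, size=2000)          # original values (never shared)
w = x + rng.uniform(-1.0, 1.0, size=2000)             # what the miner actually sees
grid = np.linspace(-2.0, 6.0, 81)
uniform_pdf = lambda d: np.where(np.abs(d) <= 1.0, 0.5, 0.0)
f_hat = reconstruct(w, uniform_pdf, grid)             # estimate of the original density
```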


Distribution Reconstruction Results (1)

[Figure: plot comparing three curves. Original = original distribution; Randomized = effect of randomization on the original distribution; Reconstructed = reconstructed distribution.]


Distribution Reconstruction Results (2)

[Figure: a second example comparing the same three curves: Original, Randomized, and Reconstructed.]


Summary of Reconstruction Experiments

  • The authors are able to reconstruct

    • the original shape of the data

    • almost the same aggregate distribution

  • This works even when the distribution of the randomized data looks nothing like the original.


Decision-Tree Classifiers w/ Perturbed Data

[Figure: the same CREDIT RISK decision tree as before, with splits on Age < 25 and Salary < 50k and High/Low leaves, now to be built from perturbed training data.]

  • When and how should the original distributions be recovered in order to build the tree? Three strategies are compared (sketched after this list):

  • Global – for each attribute, reconstruct the original distribution once over all training records, then build the tree

  • ByClass – for each attribute, split the training data by class and reconstruct the distribution separately for each class; then build the tree

  • Local – like ByClass, reconstruct the distribution separately for each class, but repeat the reconstruction at each node while the decision tree is being built
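
A sketch of how Global and ByClass differ in what gets reconstructed. Here `reconstruct_attr` stands in for any distribution-reconstruction routine (such as the Bayes-rule loop sketched earlier), and the record layout is an assumption:

```python
# `records`: list of dicts of perturbed attribute values; `labels`: class labels.
# `reconstruct_attr(values)` is a placeholder for a distribution-reconstruction
# routine, e.g. the Bayes-rule loop sketched earlier.

def global_reconstruction(records, labels, attrs, reconstruct_attr):
    """Global: one reconstruction per attribute, over all records,
    done once before the tree is built."""
    return {a: reconstruct_attr([r[a] for r in records]) for a in attrs}

def byclass_reconstruction(records, labels, attrs, reconstruct_attr):
    """ByClass: partition the records by class label first, then reconstruct
    each attribute's distribution separately within each class."""
    out = {}
    for cls in set(labels):
        rows = [r for r, lbl in zip(records, labels) if lbl == cls]
        out[cls] = {a: reconstruct_attr([r[a] for r in rows]) for a in attrs}
    return out
```

Local is structured like ByClass but repeats the per-class reconstruction at every node as the tree is built, using only the records that reach that node.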


Experimental Results – Classification w/ Perturbed Data

  • Compare Global, ByClass, Local algorithms against control series:

    • Original – result of classification of unperturbed training data

    • Randomized – result of classification on perturbed data with no correction

  • Run on five classification functions, Fn1 through Fn5, each of which classifies records into groups based on their attributes




Experimental Results – Varying Privacy

  • Using ByClass algorithm on each classification function (except Fn4)

    • Vary the privacy level from 10% to 200%

    • Show

      • Original – unperturbed data

      • ByClass(G) – ByClass with Gaussian perturbation

      • ByClass(U) – ByClass with Uniform perturbation

      • Random(G) – uncorrected data with Gaussian perturbation

      • Random(U) – uncorrected data with Uniform perturbation



Results – Accuracy vs. Privacy (2)

Note: Function 4 is skipped because its results are almost identical to those for Function 5.


Conclusion

  • Perturb sensitive values in a user record by adding random noise.

  • This preserves privacy because the original values are never revealed.

  • Aggregate functions and classifiers can still be computed accurately over the perturbed values.

  • At higher confidence levels, the Gaussian noise distribution provides more privacy than the Uniform distribution.

