Architectures and Algorithms for Data Privacy

Dilys Thomas

Stanford University, April 30th, 2007

Advisor: Rajeev Motwani


RoadMap
  • Motivation for Data Privacy Research
  • Sanitizing Data for Privacy
    • Privacy Preserving OLAP
    • K-Anonymity/ Clustering for Anonymity
    • Probabilistic Anonymity
    • Masketeer
  • Auditing for Privacy
  • Distributed Architectures for Privacy
Motivation 1: Data Privacy in Enterprises

  • Health: personal medical details, disease history, clinical research data
  • Govt. Agencies: census records, economic surveys, hospital records
  • Banking: bank statements, loan details, transaction history
  • Manufacturing: process details, blueprints, production data
  • Finance: portfolio information, credit history, transaction records, investment details
  • Outsourcing: customer data for testing, remote DB administration, BPO & KPO
  • Insurance: claims records, accident history, policy details
  • Retail Business: inventory records, individual credit card details, audits

All of these domains handle data whose privacy must be protected.

Motivation 3: Personal Information
  • Emails
  • Searches on Google/Yahoo
  • Profiles on Social Networking sites
  • Passwords / Credit Card / Personal information at multiple E-commerce sites / Organizations
  • Documents on the Computer / Network
Losses due to Lack of Privacy: ID-Theft
  • 3% of households in the US affected by ID-Theft
  • US $5-50B losses/year
  • UK £1.7B losses/year
  • AUS $1-4B losses/year
RoadMap
  • Motivation for Data Privacy Research
  • Sanitizing Data for Privacy
    • Privacy Preserving OLAP
    • K-Anonymity/ Clustering for Anonymity
    • Probabilistic Anonymity
    • Masketeer
  • Auditing for Privacy
  • Distributed Architectures for Privacy
Privacy Preserving Data Analysis, i.e., Online Analytical Processing (OLAP)

Computing statistics of data collected from multiple data sources while maintaining the privacy of each individual source

Agrawal, Srikant, Thomas

SIGMOD 2005

Privacy Preserving OLAP

  • Motivation
  • Problem Definition
  • Query Reconstruction
    • Inversion method
      • Single attribute
      • Multiple attributes
    • Iterative method
  • Privacy Guarantees
  • Experiments
Horizontally Partitioned Personal Information

Each client Ci perturbs its original row ri into a perturbed row pi before sending it to the server, which assembles the perturbed table T for analysis.

EXAMPLE: What number of children in this county go to college?

Vertically Partitioned Enterprise Information

Original relations D1 and D2 are perturbed into D'1 and D'2, which are joined to form the perturbed relation D' used for analysis.

EXAMPLE: What fraction of United customers to New York fly Virgin Atlantic to travel to London?

Privacy Preserving OLAP: Problem Definition

Compute

select count(*) from T
where P1 and P2 and P3 and ... and Pk

E.g., find the number of people with age in [30-50] and salary in [80-150],

i.e. COUNT_T(P1 and P2 and P3 and ... and Pk)

Goals:

  • Provide error bounds to the analyst
  • Provide privacy guarantees to the data sources
  • Scale to a larger number of attributes

Perturbation Example: Uniform Retention Replacement

Throw a biased coin (bias = 0.2) for each value:

  • Heads: retain the value
  • Tails: replace it with a random number drawn from a predefined p.d.f.

[Figure: a column of values (5, 4, 3, 1, ...) where heads outcomes are retained and tails outcomes are replaced uniformly at random from [1-5].]

Retention Replacement Perturbation

  • Done for each column
  • The replacing p.d.f. need not be uniform
    • Best to use the original p.d.f. if available/estimable
  • Different columns can have different retention biases (see the sketch below)
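A minimal Python sketch of retention replacement perturbation as described above; the column values, bias, and replacing distribution are illustrative assumptions taken from the figure, not part of any released implementation:

    import random

    def perturb_column(values, p=0.2, replace_pdf=lambda: random.randint(1, 5)):
        """Retention replacement: keep each value with probability p (heads),
        otherwise replace it with a draw from the replacing p.d.f. (tails)."""
        out = []
        for v in values:
            if random.random() < p:   # heads: retain
                out.append(v)
            else:                     # tails: replace
                out.append(replace_pdf())
        return out

    perturbed = perturb_column([5, 4, 3, 1, 3])   # values from the slide's figure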

Single Attribute Example

What is the fraction of people in this building with age 30-50?

  • Assume age is between 0-100
  • Whenever a person enters the building, they flip a coin with heads probability p=0.2.
    • Heads -- report true age (RETAIN)
    • Tails -- report a random number uniform in 0-100 (PERTURB)
  • In total, 100 randomized numbers are collected.
  • Of these, 22 are in 30-50.
  • How many among the original are in 30-50?
Privacy Preserving OLAP

  • Motivation
  • Problem Definition
  • Query Reconstruction
    • Inversion method
      • Single attribute
      • Multiple attributes
    • Iterative method
  • Privacy Guarantees
  • Experiments
Analysis

Out of 100 rows: 80 perturbed (0.8 fraction), 20 retained (0.2 fraction), in expectation.

Analysis Contd.

20% of the 80 randomized rows, i.e. 16 of them, satisfy Age[30-50]; the remaining 64 don't.

Analysis Contd.

Since there were 22 randomized rows in [30-50], 22-16=6 of them come from the 20 retained rows. The breakdown: 6 retained in Age[30-50], 14 retained outside it; 16 perturbed in Age[30-50], 64 perturbed outside it.

Scaling up

The 6 retained in-range rows are a 0.2 sample of the original table, so scaling up gives 6/0.2 = 30: thus 30 people had age 30-50, in expectation.

Formally: Select count(*) from R where Pred

p = retention probability (0.2 in the example)

1-p = probability that an element is replaced by the replacing p.d.f.

b = probability that an element from the replacing p.d.f. satisfies predicate Pred (0.2 in the example)

a = 1-b

Transition matrix

Let x = (x0, x1) be the fractions of original rows satisfying ¬Pred and Pred respectively, and y = (y0, y1) the corresponding fractions for the perturbed rows. Then

A = | p + (1-p)a    (1-p)b     |
    | (1-p)a        p + (1-p)b |

i.e. solve xA = y.

A00 = probability that an original element satisfies ¬Pred and after perturbation satisfies ¬Pred:

p = probability it was retained

(1-p)a = probability it was perturbed and satisfies ¬Pred

A00 = (1-p)a + p
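As a sanity check, a minimal numpy sketch of the inversion method that reproduces the running example (p = 0.2, b = 0.2, 22 of 100 perturbed rows in range); the variable names are illustrative:

    import numpy as np

    p, b = 0.2, 0.2                   # retention prob.; P(replacement satisfies Pred)
    a = 1 - b
    A = np.array([[p + (1-p)*a, (1-p)*b],
                  [(1-p)*a,     p + (1-p)*b]])   # transition matrix
    y = np.array([0.78, 0.22])        # observed: 22 of 100 perturbed rows in [30-50]
    x = y @ np.linalg.inv(A)          # solve xA = y
    print(x[1] * 100)                 # ~30 original rows in [30-50]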

Multiple Attributes

For k attributes,

  • x, y are vectors of size 2^k
  • x = y A^{-1}

where A = A1 ⊗ A2 ⊗ ... ⊗ Ak (tensor product), and Ai is the transition matrix for column i. (A numpy sketch follows.)
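Since the tensor product of the per-column transition matrices is the Kronecker product, the multi-attribute matrix can be built as below; a sketch under the assumption of two identically perturbed columns, with a made-up observed vector y:

    from functools import reduce
    import numpy as np

    p, b = 0.2, 0.2
    A1 = np.array([[p + (1-p)*(1-b), (1-p)*b],
                   [(1-p)*(1-b),     p + (1-p)*b]])
    A = reduce(np.kron, [A1, A1])             # 2^k x 2^k matrix for k = 2 columns
    y = np.array([0.55, 0.15, 0.20, 0.10])    # hypothetical observed cell fractions
    x = y @ np.linalg.inv(A)                  # reconstructed original fractions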

Error Bounds

  • In our example, we want to say: when the estimated answer is 30, the actual answer lies in [28-32] with probability greater than 0.9.
  • Given perturbation T → T', with n rows, f(T) is (n, ε, δ)-reconstructible by g(T') if |f(T) - g(T')| < max(ε, ε f(T)) with probability greater than (1 - δ).

ε f(T) = 2, δ = 0.1 in the above example.

Theoretical Basis and Results

Theorem: The fraction f of rows in [low, high] in the original table, estimated by matrix inversion on the table obtained after uniform perturbation, is an (n, ε, δ) estimator for f if n > 4 log(2/δ)(pε)^{-2}, by Chernoff bounds.

Theorem: The vector x obtained by matrix inversion is the MLE (maximum likelihood estimator), by using the Lagrange multiplier method and showing that the Hessian is negative.

Iterative Algorithm [AS00]

Initialize: x^0 = y

Iterate:

x_p^{T+1} = Σ_{q=0..t} y_q · ( a_pq x_p^T / Σ_{r=0..t} a_rq x_r^T )

(by application of Bayes' rule)

Stop condition: two consecutive x iterates do not differ much. (A numpy sketch follows.)
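A minimal numpy sketch of the iterative (Bayes-rule) update above; the tolerance and iteration cap are illustrative choices, not values from the paper:

    import numpy as np

    def iterative_reconstruct(A, y, tol=1e-9, max_iter=10000):
        """Iterate x_p <- x_p * sum_q a_pq * y_q / (sum_r a_rq * x_r)."""
        x = y.copy()
        for _ in range(max_iter):
            denom = x @ A                    # denom[q] = sum_r a_rq x_r
            x_new = x * (A @ (y / denom))    # Bayes-rule update for each p
            if np.abs(x_new - x).sum() < tol:
                break
            x = x_new
        return x

    # e.g. iterative_reconstruct(A, y) with A and y from the inversion sketch above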

Iterative Algorithm

We had proved:

  • Theorem: The inversion algorithm gives the MLE.
  • Theorem [AA01]: The iterative algorithm gives the MLE under the additional constraint that

0 ≤ x_i for all 0 ≤ i ≤ 2^k - 1

    • Models the fact that the probabilities are non-negative
    • Results are better, as shown in the experiments
Privacy Guarantees

Say we initially know with probability < 0.3 that Alice's age > 25, but after seeing the perturbed value we can say so with probability > 0.95. Then we say there is a (0.3, 0.95) privacy breach.

A more subtle differential-privacy analysis appears in the thesis.

Privacy Preserving OLAP
  • Motivation
  • Problem Definition
  • Query Reconstruction
  • Privacy Guarantees
  • Experiments
Experiments

  • Real data: census data from the UCI Machine Learning Repository, with 32,000 rows
  • Synthetic data: generated multiple columns of Zipfian data; the number of rows varied between 1,000 and 1,000,000
  • Error metric: L1 norm of the difference between x and y, i.e., the L1 distance between the two probability distributions
    • E.g., for 1-dim queries: |x1 - y1| + |x0 - y0|

Inversion vs Iterative Reconstruction

[Figures: reconstruction error with 2 attributes and with 3 attributes on census data.]

The iterative algorithm (MLE on the constrained space) outperforms inversion (global MLE).

Error as a Function of Number of Columns: Iterative Algorithm, Zipf Data

The error in the iterative algorithm flattens out, as its maximum value is bounded by 2.

Error as a Function of Number of Columns: Census Data

[Figures: inversion algorithm vs. iterative algorithm.]

Error increases exponentially with the number of columns.

Error as a Function of Number of Rows

Error decreases as the number of rows n increases.

Conclusion

It is possible to run OLAP on data across multiple servers so that probabilistically approximate answers are obtained while data privacy is maintained.

The techniques have been tested experimentally on real and synthetic data; more experiments appear in the paper.

Privacy Preserving OLAP is practical.

RoadMap
  • Motivation for Data Privacy Research
  • Sanitizing Data for Privacy
    • Privacy Preserving OLAP
    • K-Anonymity/ Clustering for Anonymity
    • Probabilistic Anonymity
    • Masketeer
  • Auditing for Privacy
  • Distributed Architectures for Privacy
Anonymizing Tables: ICDT 2005

Creating tables that do not identify individuals, for research or outsourced software development purposes.

Aggarwal, Feder, Kenthapadi, Motwani, Panigrahy, Thomas, Zhu

Achieving Anonymity via Clustering: PODS 2006

Aggarwal, Feder, Kenthapadi, Khuller, Panigrahy, Thomas, Zhu

Probabilistic Anonymity: (submitted)

Lodha, Thomas

Data Privacy

  • Value disclosure: what is the value of attribute salary for person X?
    • Perturbation
      • Privacy Preserving OLAP
  • Identity disclosure: whether an individual is present in the database table
    • Randomization, k-anonymity, etc.
      • Data for outsourcing / research
Quasi-Identifiers

Quasi-identifiers are approximate foreign keys: combinations of attributes that can uniquely identify you!

k-Anonymity Model [Swe00]
  • Modify some entries of quasi-identifiers
    • each modified row becomes identical to at least k-1 other rows with respect to quasi-identifiers
  • Individual records hidden in a crowd of size k
2-Anonymity with Clustering

Cluster centers are published instead of the raw values:

  • 27 = (25+27+29)/3 and 70 = (50+60+100)/3
  • 37 = (35+39)/2 and 115 = (110+120)/2

The clustering formulation is NP-hard. (A toy sketch follows.)
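A toy Python sketch of publishing cluster means in place of raw values; the grouping into clusters is taken from the arithmetic on this slide:

    import numpy as np

    clusters = [[25, 27, 29], [50, 60, 100], [35, 39], [110, 120]]
    centers = [float(np.mean(c)) for c in clusters]   # [27.0, 70.0, 37.0, 115.0]
    # each original value is replaced by its cluster center,
    # hiding it among at least k-1 other rows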

Clustering Metrics

[Figure: example clusters of 10 points with radius 5, 20 points with radius 10, and 50 points with radius 15.]
Cellular Clustering: Linear Program

Minimize Σ_c ( Σ_i x_ic d_c + f_c y_c )   (sum of cellular cost and facility cost)

subject to:

Σ_c x_ic ≥ 1      each point belongs to a cluster
x_ic ≤ y_c        a cluster must be opened for a point to belong to it
0 ≤ x_ic ≤ 1      points belong to clusters positively
0 ≤ y_c ≤ 1       clusters are opened positively

(A scipy sketch follows.)
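A minimal sketch of this LP with scipy on a hypothetical instance (3 points, 2 candidate clusters; the costs d and f are made-up numbers, not from the paper):

    import numpy as np
    from scipy.optimize import linprog

    n, m = 3, 2                      # points, candidate clusters (hypothetical)
    d = np.array([1.0, 2.0])         # cellular cost d_c per assigned point
    f = np.array([5.0, 3.0])         # facility cost f_c
    # variables: x_ic for i in 0..n-1, c in 0..m-1 (row-major), then y_c
    c = np.concatenate([np.tile(d, n), f])
    A_ub, b_ub = [], []
    for i in range(n):               # -sum_c x_ic <= -1: each point is covered
        row = np.zeros(n * m + m)
        row[i * m:(i + 1) * m] = -1
        A_ub.append(row); b_ub.append(-1.0)
    for i in range(n):               # x_ic - y_c <= 0: cluster must be open
        for cl in range(m):
            row = np.zeros(n * m + m)
            row[i * m + cl] = 1
            row[n * m + cl] = -1
            A_ub.append(row); b_ub.append(0.0)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  bounds=[(0, 1)] * (n * m + m))
    print(res.x)                     # fractional assignments x_ic and openings y_c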

Quasi-identifier

A 0.6 fraction is uniquely identified by Fruit; hence Fruit is a 0.6-quasi-identifier.

A 0.87 fraction of the U.S. population is uniquely identified by (DOB, Gender, Zipcode); hence it is a 0.87-quasi-identifier.

Quasi-Identifier

  • Find the probability distribution over D distinct values that maximizes the expected fraction of uniquely identified records.
  • D distinct values, n rows:
    • If D ≤ n: D/(en) (skewed distribution)
    • Else: e^{-n/D} (uniform distribution)
Distinct Values: Identifier

  • DOB: 60 × 365 ≈ 2×10^4
  • Gender: 2
  • Zipcode: 10^5
  • (DOB, Gender, Zipcode) together: 2×10^4 × 2 × 10^5 = 4×10^9 distinct values
  • US population = 3×10^8
  • Fraction of singletons: e^{-3×10^8 / 4×10^9} ≈ 0.92 (checked in the sketch below)
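A one-line Python check of the arithmetic above (the constants are the slide's estimates):

    import math

    D = 2e4 * 2 * 1e5        # distinct (DOB, Gender, Zipcode) combinations
    n = 3e8                  # US population
    print(math.exp(-n / D))  # ~0.93, close to the ~0.92 fraction of singletons above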

Distinct Values and k-Anonymity

  • E.g., apply HIPAA to (Age in years, Zipcode, Gender, Doctor details)
  • Want k = 20,000 = 2×10^4 anonymity with n = 300 million = 3×10^8 people.
  • The number of distinct values should be D = n/k = 1.5×10^4.
  • D = z (zipcodes) × 100 (age in years) × 2 (gender) = 200z
  • 1.5×10^4 = 200z gives z ≈ 75, i.e. roughly 10^2.
  • Retain the first two digits of the zipcode (i.e., retain states).
Experiments

  • Efficient algorithms based on randomized algorithms for finding quantiles in small space
    • 10 seconds to anonymize a quarter million rows, i.e. approximately 3 GB per hour, on a machine with a 2.66 GHz processor and 504 MB RAM running Windows XP with Service Pack 2
      • an order of magnitude better in running time for a quasi-identifier of size 10 than the previous implementation
  • Optimal algorithms to anonymize the dataset
  • Scalable:
    • almost independent of the anonymity parameter k
    • linear in quasi-identifier size (previously exponential)
    • linear in dataset size
Masketeer: A tool for data privacy

Das, Lodha, Patwardhan, Sundaram, Thomas.

RoadMap
  • Motivation for Data Privacy Research
  • Sanitizing Data for Privacy
    • Privacy Preserving OLAP
    • K-Anonymity/ Clustering for Anonymity
    • Probabilistic Anonymity
    • Masketeer
  • Auditing for Privacy
  • Distributed Architectures for Privacy
Auditing Batches of SQL Queries

Motwani, Nabar, Thomas

PDM Workshop with ICDE 2007

Given a set of SQL queries that have been posed over a database, determine whether some subset of these queries has revealed private information about an individual or a group of individuals.

Example

SELECT zipcode
FROM Patients p
WHERE p.disease = 'diabetes'

AUDIT zipcode
FROM Patients p
WHERE p.disease = 'high blood pressure'
(not suspicious with respect to this query)

AUDIT disease
FROM Patients p
WHERE p.zipcode = 94305
(suspicious if someone in 94305 has diabetes)

Query Suspicious wrt an Audit Expression

A query is suspicious with respect to an audit expression if:

  • all columns of the audit expression are covered by the query, and
  • the audit expression and the query have at least one tuple in common
SQL Batch Auditing

A query batch is suspicious wrt an audit expression iff the queries together cover all audited columns of at least one audited tuple:

  • semantically, if the coverage holds on the actual table T
  • syntactically, if it holds on some table T

(A sketch of the coverage check follows.)
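A minimal Python sketch of the coverage condition; the data model (each query represented by the columns it reveals and the tuple ids it touches) is an illustrative assumption, not the paper's implementation:

    def batch_suspicious(queries, audit_columns, audited_tuple_ids):
        """True iff the queries together cover all audited columns
        of at least one audited tuple."""
        for t in audited_tuple_ids:
            covered = set()
            for cols, tuple_ids in queries:   # each query: (columns, tuples touched)
                if t in tuple_ids:
                    covered |= set(cols)
            if covered >= set(audit_columns):
                return True
        return False

    # e.g. batch_suspicious([({'zipcode'}, {1, 2}), ({'disease'}, {2})],
    #                       {'zipcode', 'disease'}, {1, 2})   -> True via tuple 2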

Syntactic and Semantic Auditing

  • Checking for semantic suspiciousness has a polynomial-time algorithm
  • Checking for syntactic suspiciousness is NP-complete
RoadMap
  • Motivation for Data Privacy Research
  • Sanitizing Data for Privacy
    • Privacy Preserving OLAP
    • K-Anonymity/ Clustering for Anonymity
    • Probabilistic Anonymity
    • Masketeer
  • Auditing for Privacy
  • Distributed Architectures for Privacy
Two Can Keep a Secret: A Distributed Architecture for Secure Database Services

How to distribute data across multiple sites for (1) redundancy and (2) privacy, so that a single site being compromised does not lead to data loss.

Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani, Srivastava, Thomas, Xu

CIDR 2005


Distributing data and Partitioning and Integrating Queries for Secure Distributed Databases

Feder, Ganapathy, Garcia-Molina, Motwani, Thomas

Work in Progress

Motivation

  • Data outsourcing is growing in popularity
    • Cheap, reliable data storage and management
      • 1 TB for $399, i.e. < $0.5 per GB
      • $5,000 for Oracle 10g / SQL Server
      • $68k/year for a database administrator
  • Privacy concerns looming ever larger
    • High-profile thefts (often insiders)
      • UCLA lost 900k records
      • Berkeley lost a laptop with sensitive information
      • Acxiom, JP Morgan, Choicepoint
      • www.privacyrights.org
Present Solutions

  • Application level: Salesforce.com
    • On-demand customer relationship management: $65/user/month, or $995 / 5 users / 1 year
  • Amazon Elastic Compute Cloud
    • 1 instance = 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, 250 Mb/s network bandwidth
    • Elastic, completely controlled, reliable, secure
    • $0.10 per instance-hour
    • $0.20 per GB of data in/out of Amazon
    • $0.15 per GB-month of Amazon S3 storage used
  • Google Apps for your domain
    • Small businesses, enterprise, school, family, or group

Encryption-Based Solution

[Figure: the client encrypts its data and stores it at the DSP; a query Q is rewritten as Q' over the encrypted data, and the DSP returns the "relevant data" for a client-side processor to decrypt and compute the answer.]

Problem: Q' can degenerate to "SELECT *".

The Power of Two

[Figure: the client distributes its data across two providers, DSP1 and DSP2.]

The client-side processor splits query Q into Q1 (sent to DSP1) and Q2 (sent to DSP2).

Key: ensure Cost(Q1) + Cost(Q2) ≈ Cost(Q)

SB1386 Privacy

  • {Name, SSN}, {Name, LicenceNo}, {Name, CaliforniaID}, {Name, AccountNumber}, and {Name, CreditCardNo, SecurityCode} are all to be kept private.
  • A set is private if at least one of its elements is "hidden".
    • The element appearing in encrypted form is OK.
Techniques

  • Vertical Fragmentation
    • Partition attributes across R1 and R2
    • E.g., to obey constraint {Name, SSN}: R1 ← Name, R2 ← SSN
    • Use tuple IDs for reassembly: R = R1 JOIN R2
  • Encoding (sketched below)
    • One-time pad: for each value v, construct a random bit sequence r; R1 ← v XOR r, R2 ← r
    • Deterministic encryption: R1 ← E_K(v), R2 ← K; can detect equality and push down selections with equality predicates
    • Random addition: R1 ← v + r, R2 ← r; can push down the aggregate SUM
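A minimal Python sketch of the one-time pad and random-addition encodings; the function names and modulus are illustrative choices, and deterministic encryption (which would use a real cipher) is omitted:

    import os

    def xor_split(value: bytes):
        """One-time pad: R1 stores v XOR r, R2 stores r."""
        r = os.urandom(len(value))
        return bytes(a ^ b for a, b in zip(value, r)), r

    def additive_split(value: int, modulus: int = 2**64):
        """Random addition: R1 stores v + r, R2 stores r (mod modulus).
        SUM can be computed per site: SUM(v) = SUM(v+r) - SUM(r)."""
        r = int.from_bytes(os.urandom(8), "big") % modulus
        return (value + r) % modulus, r

    s1, s2 = additive_split(1234)
    assert (s1 - s2) % 2**64 == 1234    # reassembly recovers the value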
Example
  • An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}
  • Privacy Constraints
    • {Telephone}, {Email}
    • {Name, Salary}, {Name, Position}, {Name, DoB}
    • {DoB, Gender, ZipCode}
    • {Position, Salary}, {Salary, DoB}
  • Will use just Vertical Fragmentation and Encoding.
Example (2)

[Figure: the constraints from the previous slide shown next to fragments R1 and R2. The attributes are partitioned between R1 and R2, reassembled via tuple ID; Email and Telephone, being singleton constraints, appear only in encoded form, and each multi-attribute constraint is broken by separating or encoding its attributes across the two fragments. A constraint-checking sketch follows.]
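A small Python sketch of checking that a fragmentation obeys the privacy constraints ("a set is private if at least one of its elements is hidden"); the example partition is hypothetical, not the one in the figure:

    def obeys_constraints(visible_at_site, constraints):
        """A site is safe if no constraint set is fully visible there in the clear."""
        return all(not set(c) <= set(visible_at_site) for c in constraints)

    constraints = [{"Telephone"}, {"Email"}, {"Name", "Salary"},
                   {"Name", "Position"}, {"Name", "DoB"},
                   {"DoB", "Gender", "ZipCode"},
                   {"Position", "Salary"}, {"Salary", "DoB"}]
    r1_clear = {"ID", "Name", "Gender", "ZipCode"}   # hypothetical clear attributes at R1
    r2_clear = {"ID", "DoB", "Position"}             # hypothetical clear attributes at R2
    # Salary, Email, Telephone would be stored encoded across both sites
    print(obeys_constraints(r1_clear, constraints),
          obeys_constraints(r2_clear, constraints))  # True True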

Partitioning, Execution

  • Partitioning problem
    • Partition to minimize communication cost for a given workload
    • Even a simplified version is hard to approximate
    • Hill-climbing algorithm, starting from a weighted set cover solution
  • Query reformulation and execution
    • Consider only centralized plans
    • Algorithm to partition SELECT and WHERE clause predicates between the two partitions
Acknowledgements: Stanford Faculty
  • Advisor: Rajeev Motwani
  • Members of Orals Committee:

Rajeev Motwani, Hector Garcia-Molina,

Dan Boneh, John Mitchell, Ashish Goel

  • Many other professors at Stanford, esp. Jennifer Widom
Acknowledgements: Projects
  • STREAM: Jennifer Widom, Rajeev Motwani
  • PORTIA: Hector Garcia-Molina, Rajeev Motwani, Dan Boneh, John Mitchell
  • TRUST: Dan Boneh, John Mitchell, Rajeev Motwani, Hector Garcia-Molina
  • RAIN: Rajeev Motwani, Ashish Goel, Amin Saberi
Acknowledgements: Internship Mentors

Rakesh Agrawal, Ramakrishnan Srikant, Surajit Chaudhuri, Nicolas Bruno, Phil Gibbons, Sachin Lodha, Anand Rajaraman

Acknowledgements: CoAuthors[A-K]

Gagan Aggarwal, Rakesh Agrawal, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayank Bawa, Nicolas Bruno, Renato Carmo, Surajit Chaudhuri, Mayur Datar, Prasenjit Das, A A Diwan, Tomás Feder, Vignesh Ganapathy, Prasanna Ganesan, Hector Garcia-Molina, Keith Ito, Krishnaram Kenthapadi, Samir Khuller, Yoshiharu Kohayakawa,

Acknowledgements: CoAuthors[L-Z]

Eduardo Sany Laber, Sachin Lodha, Nina Mishra, Rajeev Motwani, Shubha Nabar, Itaru Nishizawa, Liadan Boyen, Rina Panigrahy, Nikhil Patwardhan, Ramakrishnan Srikant, Utkarsh Srivastava, S. Sudarshan, Sharada Sundaram, Rohit Varma, Jennifer Widom, Ying Xu, An Zhu

Acknowledgements: Others not in previous list
  • Aristides, Gurmeet, Aleksandra, Sergei, Damon, Anupam, Arnab, Aaron, Adam, Mukund, Vivek, Anish, Parag, Vijay, Piotr, Moses, Sudipto, Bob, David, Paul, Zoltan etc.
  • Members of Rajeev’s group, Stanford Theory, Database, Security groups, Also many PhD students of the incoming year 2002 -- Paul etc. and many other students at Stanford
  • Lynda, Maggie, Wendy, Jam, Kathi, Claire, Meredith for administrative help
  • Andy, Miles, Lilian for keeping the machines running!
  • Various outing clubs and groups at Stanford, Catholic community here, SIA, RAINS groups, Ivgrad, DB Movie and Social Committee
Acknowledgements: More!
  • Jojy Michael, Joshua Easow and families
  • Roommates: Omkar Deshpande, Alex Joseph, Mayur Naik, Rajiv Agrawal, Utkarsh Srivastava, Rajat Raina, Jim Cybluski, Blake Blailey
  • Batchmates and Professors from IITs
  • Friends and relatives, grandparents
  • sister Dina, and Parents
Data Streams

  • Traditional DBMS: data stored in finite, persistent data sets
  • New applications: data input as continuous, ordered data streams
    • Network and traffic monitoring
    • Telecom call records
    • Network security
    • Financial applications
    • Sensor networks
    • Web logs and clickstreams
    • Massive data sets
Scheduling Algorithms for Data Streams
  • Minimizing the overhead over the disk system. Motwani, Thomas. SODA 2004
  • Operator Scheduling in Data Stream Systems – Minimizing memory consumption and latency. Babu, Babcock, Datar, Motwani, Thomas. VLDB Journal 2004
  • Stanford STREAM Data Manager. Stanford Stream Group. IEEE Bulletin 2003