### Architectures and Algorithms for Data Privacy

### Privacy Preserving Data Analysis i.e. Online Analytical Processing OLAP

### Probabilistic Anonymity: (submitted)

### Two Can Keep a Secret: A Distributed Architecture for Secure Database Services

### Distributing data and Partitioning and Integrating Queries for Secure Distributed Databases

Dilys Thomas

Stanford University, April 30th, 2007

Advisor: Rajeev Motwani


RoadMap

- Motivation for Data Privacy Research
- Sanitizing Data for Privacy
- Privacy Preserving OLAP
- K-Anonymity/ Clustering for Anonymity
- Probabilistic Anonymity
- Masketeer
- Auditing for Privacy
- Distributed Architectures for Privacy

Personal medical details

Disease history

Clinical research data

Govt. Agencies

Census records

Economic surveys

Hospital Records

Banking

Bank statement

Loan Details

Transaction history

Manufacturing

Process details

Blueprints

Production data

Finance

Portfolio information

Credit history

Transaction records

Investment details

Outsourcing

Customer data for testing

Remote DB Administration

BPO & KPO

Insurance

Claims records

Accident history

Policy details

Retail Business

Inventory records

Individual credit card details

Audits

Motivation 1: Data Privacy in Enterprises

Motivation 3: Personal Information

- Emails
- Searches on Google/Yahoo
- Profiles on Social Networking sites
- Passwords / Credit Card / Personal information at multiple E-commerce sites / Organizations
- Documents on the Computer / Network

Losses due to Lack of Privacy: ID-Theft

- 3% of households in the US affected by ID-Theft
- US $5-50B losses/year
- UK £1.7B losses/year
- AUS $1-4B losses/year

RoadMap

- Motivation for Data Privacy Research
- Sanitizing Data for Privacy
- Privacy Preserving OLAP
- K-Anonymity/ Clustering for Anonymity
- Probabilistic Anonymity
- Masketeer
- Auditing for Privacy
- Distributed Architectures for Privacy

Computing statistics of data collected from multiple data sources while maintaining the privacy of each individual source

Agrawal, Srikant, Thomas

SIGMOD 2005

Privacy Preserving OLAP

- Motivation
- Problem Definition
- Query Reconstruction

Inversion method

Single attribute

Multiple attributes

Iterative method

- Privacy Guarantees
- Experiments

Horizontally Partitioned Personal Information

Client C1

Original Row r1

Perturbed p1

Table T for analysis

at server

Client C2

Original Row r2

Perturbed p2

EXAMPLE: What number of children in this county go to college?

Client Cn

Original Row rn

Perturbed pn

Vertically Partitioned Enterprise Information

EXAMPLE: What fraction of United customers to New York fly Virgin Atlantic to travel to London?

Original Relation D1

Perturbed Relation D’1

Perturbed Joined Relation D’

Original Relation D2

Perturbed Relation D’2

Privacy Preserving OLAP: Problem Definition

Compute

select count(*) from T

where P1 and P2 and P3 and … Pk

E.g., find the number of people with age in [30-50] and salary in [80-150]

i.e. COUNT_T(P1 and P2 and P3 and … Pk)

Goal:

provide error bounds to analyst.

provide privacy guarantees to data sources.

scale to larger # of attributes

Perturbation Example: Uniform Retention Replacement

Throw a biased coin

Heads: Retain

Tails: Replace with a random number from a predefined pdf

[Figure: a column of values from [1-5] perturbed with BIAS=0.2. Heads: retain the value. Tails: replace it with a value drawn uniformly at random from [1-5].]

Retention Replacement Perturbation

- Done for each column
- The replacing pdf need not be uniform
- Best to use original pdf if available/ estimable
- Different columns can have different biases for retention
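The perturbation described above can be sketched in a few lines. This is a minimal illustration, assuming a uniform replacing pdf over integers; the function name `retention_replacement`, the sample column and the fixed seed are hypothetical choices, not taken from the slides.

```python
import random

def retention_replacement(values, p, low, high, rng):
    """Perturb one column: keep each value with probability p (heads),
    otherwise replace it with a uniform integer from [low, high] (tails)."""
    perturbed = []
    for v in values:
        if rng.random() < p:
            perturbed.append(v)                       # heads: retain
        else:
            perturbed.append(rng.randint(low, high))  # tails: replace
    return perturbed

rng = random.Random(0)        # fixed seed only to make the sketch repeatable
column = [5, 4, 3, 1, 3]      # illustrative values in [1, 5]
print(retention_replacement(column, p=0.2, low=1, high=5, rng=rng))
```

Each column would get its own call, possibly with a different `p`, which matches the per-column bias mentioned in the list.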

Single Attribute Example

What is the fraction of people in this building with age 30-50?

- Assume age between 0-100
- Whenever a person enters the building, they flip a coin with heads probability p=0.2.
- Heads -- report true age RETAIN
- Tails -- random number uniform in 0-100 reported PERTURB
- In total, 100 randomized numbers are collected.
- Of these 22 are 30-50.
- How many among the original are 30-50?

Privacy Preserving OLAP

- Motivation
- Problem Definition
- Query Reconstruction

Inversion method

Single attribute

Multiple attributes

Iterative method

- Privacy Guarantees
- Experiments

Analysis Contd.

[Figure: of the 100 rows, 20 are retained and 80 are replaced; the replaced rows split into 16 with Age[30-50] and 64 with NOT Age[30-50].]

20% of the 80 replaced rows, i.e. 16 of them, satisfy Age[30-50]. The remaining 64 don't.

Analysis Contd.

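Since there were 22 randomized rows in [30-50], 22-16=6 of them come from the 20 retained rows. The retained rows are a random sample of the original rows, so the estimated fraction of original rows with Age[30-50] is 6/20 = 30%.

[Figure: the 100 rows split into 6 Retained, Age[30-50]; 14 Retained, NOT Age[30-50]; 16 Perturbed, Age[30-50]; 64 Perturbed, NOT Age[30-50].]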
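The arithmetic of the single-attribute example can be replayed step by step. This is only a sketch of the reasoning on the slides; the variable names are illustrative.

```python
n = 100                      # randomized numbers collected
p = 0.2                      # retention probability (heads)
b = (50 - 30) / (100 - 0)    # chance a uniform replacement lands in [30, 50]
observed_in_range = 22       # randomized numbers seen in [30, 50]

retained = p * n                             # rows that kept their true age
replaced = n - retained                      # rows that were replaced
replaced_in_range = b * replaced             # expected replaced rows in [30, 50]
retained_in_range = observed_in_range - replaced_in_range
estimate = retained_in_range / retained      # estimated original fraction in [30, 50]
print(round(estimate, 2))
```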

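Formally: select count(*) from R where Pred

p = retention probability (0.2 in example)

1-p = probability that an element is replaced by a value from the replacing p.d.f.

b = probability that an element from the replacing p.d.f. satisfies predicate Pred (0.2 in example)

a = 1-b

Let $x = (x_0, x_1)$ be the counts of original rows that do not satisfy / satisfy Pred, and $y = (y_0, y_1)$ the same counts in the perturbed table.

Transition matrix

$$A = \begin{pmatrix} (1-p)a + p & (1-p)b \\ (1-p)a & (1-p)b + p \end{pmatrix}$$

i.e. solve $xA = y$

$A_{00}$ = probability that an original element satisfies $\neg P$ and after perturbation satisfies $\neg P$

p = probability it was retained

(1-p)a = probability it was perturbed and satisfies $\neg P$

$A_{00} = (1-p)a + p$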
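As a hedged illustration of the inversion method for one attribute, the sketch below builds the 2x2 transition matrix (state 0: does not satisfy Pred, state 1: satisfies Pred) and solves xA = y by explicit inversion. The off-diagonal entries follow the same reasoning as A00; the helper names are hypothetical.

```python
def transition_matrix(p, b):
    """A[i][j] = P(perturbed state j | original state i)."""
    a = 1.0 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]

def solve_row_system(A, y):
    """Solve x A = y for a 2x2 matrix A, i.e. x = y A^{-1}."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[ A[1][1] / det, -A[0][1] / det],
           [-A[1][0] / det,  A[0][0] / det]]
    return [y[0] * inv[0][0] + y[1] * inv[1][0],
            y[0] * inv[0][1] + y[1] * inv[1][1]]

A = transition_matrix(p=0.2, b=0.2)
# Perturbed table of the age example: 78 rows outside [30, 50], 22 inside.
x = solve_row_system(A, y=[78, 22])
print([round(v, 6) for v in x])
```

On the age example this recovers the same estimate as the counting argument: about 30 of the 100 original rows fall in [30, 50].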

Multiple Attributes

For k attributes,

- x, y are vectors of size $2^k$
- $x = y A^{-1}$

where $A = A_1 \otimes A_2 \otimes \cdots \otimes A_k$ [Tensor Product]

$A_i$ is the transition matrix for column i
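A small sketch of the multi-attribute case, assuming two columns and the state order 00, 01, 10, 11. It relies on the standard identity that the inverse of a Kronecker product is the Kronecker product of the inverses; all function names and the sample counts are illustrative.

```python
from functools import reduce

def transition_matrix(p, b):
    """Per-column 2x2 transition matrix (state 1 = predicate satisfied)."""
    a = 1.0 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]

def inverse_2x2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[ A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det,  A[0][0] / det]]

def kron(A, B):
    """Kronecker (tensor) product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def reconstruct(y, matrices):
    """x = y A^{-1}, where A is the tensor product of the per-column matrices."""
    inv = reduce(kron, [inverse_2x2(A) for A in matrices])
    size = len(y)
    return [sum(y[i] * inv[i][j] for i in range(size)) for j in range(size)]

A1 = transition_matrix(p=0.2, b=0.2)
A2 = transition_matrix(p=0.5, b=0.3)
A = kron(A1, A2)
x_true = [40.0, 30.0, 20.0, 10.0]   # made-up original counts
y = [sum(x_true[i] * A[i][j] for i in range(4)) for j in range(4)]
print([round(v, 6) for v in reconstruct(y, [A1, A2])])
```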

Error Bounds

- In our example, we want to say that when the estimated answer is 30, the actual answer lies in [28-32] with probability greater than 0.9
- Given $T \to_{\alpha} T'$ with n rows, f(T) is $(n, \epsilon, \delta)$ reconstructible by g(T') if $|f(T) - g(T')| < \max(\epsilon, \epsilon f(T))$ with probability greater than $(1 - \delta)$.

$\epsilon f(T) = 2$, $\delta = 0.1$ in the above example

Theoretical Basis and Results

Theorem: The fraction, f, of rows in [low,high] in the original table, estimated by matrix inversion on the table obtained after uniform perturbation, is an $(n, \epsilon, \delta)$ estimator for f if $n > 4 \log(2/\delta)(p\epsilon)^{-2}$, by Chernoff bounds

Theorem: Vector, x, obtained by matrix inversion is the MLE (maximum likelihood estimator), by using Lagrangian Multiplier method and showing that the Hessian is negative

Iterative Algorithm [AS00]

Initialize:

$x^0 = y$

Iterate:

$$x_p^{T+1} = \sum_{q=0}^{t} y_q \, \frac{a_{pq}\, x_p^{T}}{\sum_{r=0}^{t} a_{rq}\, x_r^{T}}$$

[By application of Bayes' rule]

Stop Condition:

Two consecutive x iterates do not differ much
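The iteration can be sketched as below. This is an illustrative implementation of the update rule, not the authors' code; the tolerance, the iteration cap and the function name are assumptions, and the matrix is the one from the age example (p = 0.2, b = 0.2).

```python
def iterative_reconstruct(A, y, tol=1e-10, max_iter=10000):
    """Repeat the Bayes-rule update until two consecutive iterates are close."""
    size = len(y)
    x = list(y)                                   # initialize x^0 = y
    for _ in range(max_iter):
        # Perturbed counts predicted by the current estimate: (x A)_q.
        predicted = [sum(A[r][q] * x[r] for r in range(size))
                     for q in range(size)]
        new_x = [sum(y[q] * A[p][q] * x[p] / predicted[q]
                     for q in range(size) if predicted[q] > 0)
                 for p in range(size)]
        if sum(abs(u - v) for u, v in zip(new_x, x)) < tol:
            return new_x                          # stop condition met
        x = new_x
    return x

A = [[0.84, 0.16],
     [0.64, 0.36]]
x = iterative_reconstruct(A, [78.0, 22.0])
print([round(v, 4) for v in x])
```

Because every term in the update is non-negative, the iterates never leave the non-negative region, unlike plain matrix inversion.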

Iterative Algorithm

We had proved,

- Theorem: Inversion Algorithm gives the MLE
- Theorem [AA01]: The Iterative Algorithm gives the MLE with the additional constraint that

$0 \le x_i \;\; \forall\, 0 \le i \le 2^k - 1$

- Models the fact that the probabilities are non-negative
- Results better as shown in experiments

Privacy Guarantees

Say we initially know with probability < 0.3 that Alice's age > 25.

After seeing the perturbed value we can say so with probability > 0.95.

Then we say there is a (0.3, 0.95) privacy breach.

More subtle differential privacy in the thesis

Privacy Preserving OLAP

- Motivation
- Problem Definition
- Query Reconstruction
- Privacy Guarantees
- Experiments

Experiments

- Real data: Census data from the UCI Machine Learning Repository having 32000 rows
- Synthetic data: Generated multiple columns of Zipfian data, number of rows varied between 1000 and 1000000
- Error metric: l1 norm of the difference between x and y, i.e. the L1 norm between two probability distributions

E.g., for 1-dim queries: |x1 – y1| + |x0 – y0|

Inversion vs Iterative Reconstruction

2 attributes: Census Data

3 attributes: Census Data

Iterative algorithm (MLE on constrained space) outperforms Inversion (global MLE)

Error as a function of Number of Columns:

Iterative Algorithm: Zipf Data

The error in the iterative algorithm flattens out as its maximum value is bounded by 2

Error as a function of Number of Columns

Census Data

Inversion Algorithm

Iterative Algorithm

Error increases exponentially with increase in number of columns

Error as a function of number of Rows

Error decreases as the number of rows, n, increases

Conclusion

Possible to run OLAP on data across multiple servers so that probabilistically approximate answers are obtained and data privacy is maintained

The techniques have been tested experimentally on real and synthetic data. More experiments in the paper.

Privacy Preserving OLAP is Practical

RoadMap

- Motivation for Data Privacy Research
- Sanitizing Data for Privacy
- Privacy Preserving OLAP
- K-Anonymity/ Clustering for Anonymity
- Probabilistic Anonymity
- Masketeer
- Auditing for Privacy
- Distributed Architectures for Privacy

Creating tables that do not identify individuals for research or out-sourced software development purposes

Aggarwal, Feder, Kenthapadi, Motwani, Panigrahy, Thomas, Zhu

Achieving Anonymity via Clustering: PODS06

Aggarwal, Feder, Kenthapadi, Khuller, Panigrahy, Thomas, Zhu

Lodha, Thomas

Data Privacy

- Value disclosure: What is the value of attribute salary of person X
- Perturbation
- Privacy Preserving OLAP
- Identity disclosure: Whether an individual is present in the database table
- Randomization, K-Anonymity etc.
- Data for Outsourcing / Research

k-Anonymity Model [Swe00]

- Modify some entries of quasi-identifiers
- each modified row becomes identical to at least k-1 other rows with respect to quasi-identifiers
- Individual records hidden in a crowd of size k

2-Anonymity with Clustering

Cluster centers published

27=(25+27+29)/3

70=(50+60+100)/3

37=(35+39)/2

115=(110+120)/2
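The published cluster centers are per-attribute means over each cluster. A minimal sketch follows; how the two attribute values pair up into rows is an assumption (the slide only shows the sums), and `cluster_centers` is a hypothetical helper.

```python
def cluster_centers(clusters, k):
    """Return (center, size) per cluster, refusing clusters smaller than k."""
    published = []
    for rows in clusters:
        if len(rows) < k:
            raise ValueError("cluster smaller than k would not be k-anonymous")
        # Mean of each attribute, computed column by column.
        center = tuple(sum(col) / len(rows) for col in zip(*rows))
        published.append((center, len(rows)))
    return published

clusters = [
    [(25, 50), (27, 60), (29, 100)],   # assumed pairing of the two attributes
    [(35, 110), (39, 120)],
]
for center, size in cluster_centers(clusters, k=2):
    print(center, size)
```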

Clustering formulation: NP Hard

Cellular Clustering: Linear Program

Minimize $\sum_c \left( \sum_i x_{ic} d_c + f_c y_c \right)$

Sum of cellular cost and facility cost

Subject to:

$\sum_c x_{ic} \ge 1$: each point belongs to a cluster

$x_{ic} \le y_c$: a cluster must be opened for a point to belong to it

$0 \le x_{ic} \le 1$: points belong to clusters positively

$0 \le y_c \le 1$: clusters are opened positively

Quasi-identifier

0.6 fraction uniquely identified by Fruit. Hence Fruit is a 0.6 quasi-identifier.

0.87 fraction of the U.S. population is uniquely identified by (DOB, Gender, Zipcode), hence it is a 0.87 quasi-identifier.

Quasi-Identifier

- Find the probability distribution over D distinct values that maximizes the expected fraction of uniquely identified records.
- D distinct values, n rows
- If $D \le n$: $D/(en)$ (skewed distribution)
- Else: $e^{-n/D}$ (uniform distribution)

Distinct values- Identifier

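- DOB: $60 \cdot 365 \approx 2 \cdot 10^4$
- Gender: 2
- Zipcode: $10^5$
- (DOB, Gender, Zipcode) together has $2 \cdot 10^4 \cdot 2 \cdot 10^5 = 4 \cdot 10^9$ distinct values
- US population = $3 \cdot 10^8$
- Fraction of singletons = $e^{-(3 \cdot 10^8)/(4 \cdot 10^9)} \approx 0.93$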
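The two-case bound on the uniquely identified fraction can be evaluated directly. This sketch only encodes the formulas stated on the slides; `max_unique_fraction` is a hypothetical name.

```python
import math

def max_unique_fraction(D, n):
    """Bound from the slides: D/(e*n) if D <= n, otherwise exp(-n/D)."""
    if D <= n:
        return D / (math.e * n)
    return math.exp(-n / D)

D = (2 * 10**4) * 2 * 10**5    # DOB x Gender x Zipcode distinct values
n = 3 * 10**8                  # US population
print(round(max_unique_fraction(D, n), 3))
```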

Distinct values and K-anonymity

- E.g., apply HIPAA to

(Age in Years, Zipcode, Gender, Doctor details)

- Want $k = 20{,}000 = 2 \cdot 10^4$ anonymity with n = 300 million = $3 \cdot 10^8$ people.
- The number of distinct values is $D = n/k = 1.5 \cdot 10^4$
- D = distinct values = z (zipcode) * 100 (age in years) * 2 (gender) = 200z
- $1.5 \cdot 10^4 = 200z$, so $z \approx 10^2$.
- Retain first two digits of zipcode (retain states)
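The sizing argument above is plain arithmetic and can be checked as follows. Rounding log10(z) to the nearest integer mirrors the slide's "approximately 10^2" step and is an assumption about how the digit count is chosen.

```python
import math

n = 3 * 10**8          # people
k = 2 * 10**4          # desired anonymity parameter
D = n // k             # distinct quasi-identifier values allowed
z = D / (100 * 2)      # zipcode groups left after age (100) and gender (2)
digits = round(math.log10(z))   # zipcode digits to retain, rounded as on the slide
print(D, z, digits)
```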

Experiments

- Efficient Algorithms based on randomized algorithms to find quantiles in small space
- 10 seconds to anonymize quarter million rows. Or approximately 3GB per hour on a machine running 2.66Ghz Processor, 504 MB RAM, Windows XP with Service Pack 2
- order of magnitude better in running time for a quasi-identifier of size 10 than previous implementation
- Optimal algorithms to anonymize the dataset.
- Scalable
- Almost independent of anonymity parameter k
- linear in quasi-identifier size (previously exponential)
- linear in dataset size

Masketeer: A tool for data privacy

Das, Lodha, Patwardhan, Sundaram, Thomas.

- Motivation for Data Privacy Research
- Sanitizing Data for Privacy
- Privacy Preserving OLAP
- K-Anonymity/ Clustering for Anonymity
- Probabilistic Anonymity
- Masketeer
- Auditing for Privacy
- Distributed Architectures for Privacy

Auditing Batches of SQL Queries

Motwani, Nabar, Thomas

PDM Workshop with ICDE 2007

Given a set of SQL queries that have been posed over a database, determine whether some subset of these queries has revealed private information about an individual or a group of individuals

Example

Query:

SELECT zipcode FROM Patients p WHERE p.disease = 'diabetes'

Audit expression 1 (the query is not suspicious wrt this):

AUDIT zipcode FROM Patients p WHERE p.disease = 'high blood pressure'

Audit expression 2 (the query is suspicious wrt this if someone in 94305 has diabetes):

AUDIT disease FROM Patients p WHERE p.zipcode = 94305

Query Suspicious wrt an Audit Expression

- If all columns of audit expression are covered by the query
- If the audit expression and the query have one tuple in common
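The two conditions can be checked mechanically on a toy table. This is a simplified sketch, assuming that a query "covers" a column if it mentions it in its SELECT or WHERE clause and that both conditions must hold together; the dictionary layout and sample rows are hypothetical.

```python
def is_suspicious(table, query, audit):
    """query/audit: {'columns': set of column names, 'where': row predicate}."""
    covers = audit["columns"] <= query["columns"]          # all audited columns covered
    shares_tuple = any(query["where"](row) and audit["where"](row)
                       for row in table)                   # at least one common tuple
    return covers and shares_tuple

patients = [
    {"zipcode": 94305, "disease": "diabetes"},
    {"zipcode": 94301, "disease": "high blood pressure"},
]
query = {"columns": {"zipcode", "disease"},
         "where": lambda r: r["disease"] == "diabetes"}
audit_zip = {"columns": {"zipcode"},
             "where": lambda r: r["disease"] == "high blood pressure"}
audit_disease = {"columns": {"disease"},
                 "where": lambda r: r["zipcode"] == 94305}

print(is_suspicious(patients, query, audit_zip))
print(is_suspicious(patients, query, audit_disease))
```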

SQL Batch Auditing

[Figure: Query 1, Query 2, Query 3 and Query 4 together cover the columns of a tuple selected by the audit expression.]

A query batch is suspicious wrt an audit expression iff the queries together cover all audited columns of at least one audited tuple:

- semantically: on table T
- syntactically: on some table T

Syntactic and Semantic Auditing

- Checking for semantic suspiciousness has polynomial time algorithm
- Checking for syntactic suspiciousness is NP complete

- Motivation for Data Privacy Research
- Sanitizing Data for Privacy
- Privacy Preserving OLAP
- K-Anonymity/ Clustering for Anonymity
- Probabilistic Anonymity
- Masketeer
- Auditing for Privacy
- Distributed Architectures for Privacy

- How to distribute data across multiple sites for (1) redundancy and (2) privacy so that a single site being compromised does not lead to data loss

Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani, Srivastava, Thomas, Xu

CIDR 2005

Feder, Ganapathy, Garcia-Molina, Motwani, Thomas

Work in Progress

Motivation

- Data outsourcing growing in popularity
- Cheap, reliable data storage and management
- 1TB $399 < $0.5 per GB
- $5000 – Oracle 10g / SQL Server
- $68k/year DBAdmin
- Privacy concerns looming ever larger
- High-profile thefts (often insiders)
- UCLA lost 900k records
- Berkeley lost laptop with sensitive information
- Acxiom, JP Morgan, Choicepoint
- www.privacyrights.org

Present solutions

- Application level: Salesforce.com
- On-Demand Customer Relationship Management $65/User/Month ---- $995 / 5 Users / 1 Year
- Amazon Elastic Compute Cloud
- 1 instance = 1.7Ghz x86 processor, 1.75GB RAM, 160GB local disk, 250 Mb/s network bandwidth

Elastic, Completely controlled, Reliable, Secure

$0.10 per instance hour

$0.20 per GB of data in/out of Amazon

$0.15 per GB-Month of Amazon S3 storage used

- Google Apps for your domain

Small businesses, Enterprise, School, Family or Group

Encryption Based Solution

Encrypt

DSP

Client

Query Q

Q’

Client-side Processor

Answer

“Relevant Data”

Problem: Q’ often degenerates to “SELECT *”

SB1386 Privacy

- { Name, SSN},

{ Name, LicenceNo}

{ Name, CaliforniaID}

{ Name, AccountNumber}

{ Name, CreditCardNo, SecurityCode}

are all to be kept private.

- A set is private if at least one of its elements is “hidden”.
- Element in encrypted form ok

Techniques

- Vertical Fragmentation
- Partition attributes across R1 and R2
- E.g., to obey constraint {Name, SSN},

R1 ← Name, R2 ← SSN

- Use tuple IDs for reassembly. R = R1 JOIN R2
- Encoding

One-time Pad

- For each value v, construct random bit seq. r
- R1 ← v XOR r, R2 ← r

Deterministic Encryption

- R1 ← E_K(v), R2 ← K
- Can detect equality and push selections with equality predicate

Random addition

- R1 ← v+r, R2 ← r
- Can push aggregate SUM
Example

- An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}
- Privacy Constraints
- {Telephone}, {Email}
- {Name, Salary}, {Name, Position}, {Name, DoB}
- {DoB, Gender, ZipCode}
- {Position, Salary}, {Salary, DoB}
- Will use just Vertical Fragmentation and Encoding.

Example (2)

[Figure: the Employee attributes Name, DoB, Position, Salary, Gender, Email, Telephone and ZipCode are distributed between fragments R1 and R2, each carrying a tuple ID, so that the constraints below hold.]

Constraints

- {Telephone}
- {Email}
- {Name, Salary}
- {Name, Position}
- {Name, DoB}
- {DoB, Gender, ZipCode}
- {Position, Salary}
- {Salary, DoB}
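Whether a candidate decomposition obeys the privacy constraints can be checked mechanically: a constraint is violated if all of its attributes appear in the clear in the same fragment. The decomposition below is one possible assignment chosen for illustration, not necessarily the one on the slide, and `violated` is a hypothetical helper.

```python
def violated(constraints, r1_clear, r2_clear):
    """Return the constraints whose attributes all sit unencoded in one fragment."""
    return [c for c in constraints if c <= r1_clear or c <= r2_clear]

constraints = [
    {"Telephone"}, {"Email"},
    {"Name", "Salary"}, {"Name", "Position"}, {"Name", "DoB"},
    {"DoB", "Gender", "ZipCode"},
    {"Position", "Salary"}, {"Salary", "DoB"},
]

# Assumed decomposition: Salary, Email and Telephone are stored encoded,
# so they appear in neither clear-text set.
r1_clear = {"Name", "Gender", "ZipCode"}
r2_clear = {"Position", "DoB"}

print(violated(constraints, r1_clear, r2_clear))
print(violated(constraints, r1_clear | {"Salary"}, r2_clear))
```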

Partitioning, Execution

- Partitioning Problem
- Partition to minimize communication cost for given workload
- Even simplified version hard to approximate
- Hill Climbing algorithm after starting with weighted set cover
- Query Reformulation and Execution
- Consider only centralized plans
- Algorithm to partition select and where clause predicates between the two partitions

Acknowledgements: Stanford Faculty

- Advisor: Rajeev Motwani
- Members of Orals Committee:

Rajeev Motwani, Hector Garcia-Molina,

Dan Boneh, John Mitchell, Ashish Goel

- Many other professors at Stanford, esp. Jennifer Widom

Acknowledgements: Projects

- STREAM: Jennifer Widom, Rajeev Motwani
- PORTIA: Hector Garcia-Molina, Rajeev Motwani, Dan Boneh, John Mitchell
- TRUST: Dan Boneh, John Mitchell, Rajeev Motwani, Hector Garcia-Molina
- RAIN: Rajeev Motwani, Ashish Goel, Amin Saberi

Acknowledgements: Internship Mentors

Rakesh Agrawal, Ramakrishnan Srikant, Surajit Chaudhuri, Nicolas Bruno, Phil Gibbons, Sachin Lodha, Anand Rajaraman

Acknowledgements: CoAuthors[A-K]

Gagan Aggarwal, Rakesh Agrawal, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayank Bawa, Nicolas Bruno, Renato Carmo, Surajit Chaudhuri, Mayur Datar, Prasenjit Das, A A Diwan, Tomás Feder, Vignesh Ganapathy, Prasanna Ganesan, Hector Garcia-Molina, Keith Ito, Krishnaram Kenthapadi, Samir Khuller, Yoshiharu Kohayakawa,

Acknowledgements: CoAuthors[L-Z]

Eduardo Sany Laber, Sachin Lodha, Nina Mishra, Rajeev Motwani, Shubha Nabar, Itaru Nishizawa, LiadanBoyen, Rina Panigrahy, Nikhil Patwardhan, Ramakrishnan Srikant, Utkarsh Srivastava, S. Sudarshan, Sharada Sundaram, Rohit Varma, Jennifer Widom, Ying Xu, An Zhu

Acknowledgements: Others not in previous list

- Aristides, Gurmeet, Aleksandra, Sergei, Damon, Anupam, Arnab, Aaron, Adam, Mukund, Vivek, Anish, Parag, Vijay, Piotr, Moses, Sudipto, Bob, David, Paul, Zoltan etc.
- Members of Rajeev’s group, Stanford Theory, Database, Security groups, Also many PhD students of the incoming year 2002 -- Paul etc. and many other students at Stanford
- Lynda, Maggie, Wendy, Jam, Kathi, Claire, Meredith for administrative help
- Andy, Miles, Lilian for keeping the machines running!
- Various outing clubs and groups at Stanford, Catholic community here, SIA, RAINS groups, Ivgrad, DB Movie and Social Committee

Acknowledgements: More!

- Jojy Michael, Joshua Easow and families
- Roommates: Omkar Deshpande, Alex Joseph, Mayur Naik, Rajiv Agrawal, Utkarsh Srivastava, Rajat Raina, Jim Cybluski, Blake Blailey
- Batchmates and Professors from IITs
- Friends and relatives, grandparents
- sister Dina, and Parents

Data Streams

- Traditional DBMS – data stored in finite, persistent data sets
- New Applications – data input as continuous, ordered data streams
- Network and traffic monitoring
- Telecom call records
- Network security
- Financial applications
- Sensor networks
- Web logs and clickstreams
- Massive data sets

Scheduling Algorithms for Data Streams

- Minimizing the overhead over the disk system. Motwani, Thomas. SODA 2004
- Operator Scheduling in Data Stream Systems – Minimizing memory consumption and latency. Babu, Babcock, Datar, Motwani, Thomas. VLDB Journal 2004
- Stanford STREAM Data Manager. Stanford Stream Group. IEEE Bulletin 2003
