Hippocratic Data Management


Hippocratic Data Management. Rakesh Agrawal IBM Almaden Research Center. Thesis. We need information systems that respect the privacy of data they manage AND do not impede the useful flow of information. It is feasible to reconcile the apparent contradiction.

Hippocratic Data Management

Rakesh Agrawal

IBM Almaden Research Center

Thesis
  • We need information systems that
    • respect the privacy of data they manage

AND

    • do not impede the useful flow of information.
  • It is feasible to reconcile the apparent contradiction
Outline
  • Why Privacy in Data Systems
  • Some Technology Directions
  • Some Challenging Problems
Drivers for Privacy
  • Privacy Surveys:
    • 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users’ attitude about online privacy, April 99)
    • 83% would stop doing business with a company if it misused customer information (Privacy on and off the Internet: What consumers want, Nov. 2001)
  • Government legislation & guidelines:
    • Fair Information Practices Act (US, 1974)
    • OECD Guidelines (Europe, 1980)
    • Canadian Standards Association’s Model Code (1995)
    • Australian Privacy Amendment (2000)
    • Japan: proposed legislation (2003)
    • HIPAA, GLB, Recent U.S. Federal & State Initiatives
Privacy Violations
  • Accidents:
    • Kaiser, GlobalHealthtrax
  • Lax security:
    • Massachusetts govt.
  • Ethically questionable behavior:
    • Lotus & Equifax, Lexis-Nexis, Medical Marketing Service, Boston University, CVS & Giant Food
  • Illegal:
    • Toysmart
Assertion
  • Enterprises lack tools and technologies for managing private data and enforcing privacy policies.
Founding Tenets of Current Database Systems
  • Ullman, “Principles of Database and Knowledgebase Systems”
  • Fundamental:
    • Manage persistent data.
    • Access a large amount of data efficiently.
  • Desirable:
    • Support for data model, high-level languages, transaction management, access control, and resiliency.
  • Similar list in other database textbooks.
Statistical & Secure Databases
  • Statistical Databases
    • Provide statistical information (sum, count, etc.) without compromising sensitive information about individuals, e.g. [AW89]
  • Multilevel Secure Databases
    • Multilevel relations, e.g., records tagged “secret”, “confidential”, or “unclassified”, e.g. [JS91]
  • Need to protect privacy in transactional databases that support daily operations.
    • Cannot restrict queries to statistical queries.
    • Cannot tag all the records “top secret”.
Our Research Directions
  • Privacy Preserving Data Mining
  • Hippocratic Databases
Data Mining and Privacy
  • The primary task in data mining: development of models about aggregated data.
  • Can we develop accurate models without access to precise information in individual data records?

R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. on Management of Data (SIGMOD), May 2000.

Privacy Preserving Data Mining
[Figure: original records (e.g., 30 | 25K | …, 50 | 40K | …) pass through a Randomizer, giving perturbed records (e.g., 65 | 50K | …, 35 | 60K | …). The age distribution and the salary distribution are reconstructed from the perturbed data and fed to the data mining algorithm, which produces the model.]
Reconstruction Problem
  • Original values $x_1, x_2, \ldots, x_n$
    • from probability distribution X
  • To hide these values, we use $y_1, y_2, \ldots, y_n$
    • from probability distribution Y
  • Given
    • $x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n$
    • the probability distribution of Y
  • Estimate the probability distribution of X.

Intuition (Reconstruct single point)
  • Use Bayes\' rule for density functions
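
A sketch of the Bayes step for a single point, in assumed notation: $w = x + y$ is one observed randomized value, and $f_X$, $f_Y$ are the densities of X and Y. The posterior density of the original value is then

$$
f_X(a \mid X + Y = w) \;=\; \frac{f_Y(w - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w - z)\, f_X(z)\, dz}
$$

The numerator weighs each candidate value $a$ by how likely the noise $w - a$ is; the denominator only normalizes. Since $f_X$ itself is unknown, this step has to be applied iteratively, starting from a guess.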
Reconstruction: Intuition
  • Combine estimates of where a point came from for all the points:
    • yields estimate of original distribution.
Reconstruction Algorithm
  • $f_X^0$ := Uniform distribution
  • j := 0
  • repeat
    • $f_X^{j+1}(a)$ := Bayes’ Rule
    • j := j+1
  • until (stopping criterion met)
  • Converges to maximum likelihood estimate.
    • D. Agrawal & C.C. Aggarwal, PODS 2001.
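
The loop above can be sketched in Python on a discretized domain. This is a minimal illustration, not the authors' implementation: the function names, the equal-width binning, the fixed iteration count used as the stopping criterion, and the Gaussian noise helper are all assumptions made for the example.

```python
import math


def gaussian_pdf(sd):
    """Return the density function of zero-mean Gaussian noise (assumed noise model)."""
    def pdf(d):
        return math.exp(-d * d / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))
    return pdf


def reconstruct(w, noise_pdf, lo, hi, bins=20, iters=50):
    """Estimate the distribution of X on [lo, hi] from values w_i = x_i + y_i.

    Returns one probability per equal-width bin.
    """
    width = (hi - lo) / bins
    mids = [lo + (k + 0.5) * width for k in range(bins)]
    # like[i][k] = f_Y(w_i - a_k): likelihood of w_i if the original value was a_k
    like = [[noise_pdf(wi - a) for a in mids] for wi in w]
    fx = [1.0 / bins] * bins                 # f_X^0 := uniform distribution
    for _ in range(iters):                   # fixed count stands in for the stopping criterion
        new = [0.0] * bins
        for row in like:
            denom = sum(l * p for l, p in zip(row, fx))
            if denom == 0.0:
                continue                     # w_i has zero likelihood under the current estimate
            for k in range(bins):
                new[k] += row[k] * fx[k] / denom   # Bayes posterior for this point
        total = sum(new)
        fx = [v / total for v in new]        # average the per-point posteriors
    return fx
```

Each pass averages the per-point posteriors to obtain the next estimate of the distribution, which is the "combine estimates for all the points" idea from the intuition slide.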
Classification
  • Naïve Bayes
    • Assumes independence between attributes.
  • Decision Tree
    • Correlations are weakened by randomization.
Experimental Methodology
  • Compare accuracy against
    • Original: unperturbed data without randomization.
    • Randomized: perturbed data but without making any corrections for randomization.
  • Test data not randomized.
  • Synthetic data benchmark from [AGI+92].
  • Training set of 100,000 records, split equally between the two classes.
So far…
  • Question: Can we develop accurate models without access to precise information in individual data records?
  • Answer: yes, by randomization.
    • for numerical attributes, classification
  • How about Association Rules?
Associations Recap
  • A transaction t is a set of items (e.g. books)
  • All transactions form a set T of transactions
  • Any itemset A has support s in T if $s = \frac{\#\{t \in T : A \subseteq t\}}{|T|}$
  • Itemset A is frequent if $s \ge s_{min}$
  • Task: Find all frequent itemsets
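
As a small illustration of these definitions, the following Python sketch computes support and enumerates frequent itemsets by brute force. The function names and the `max_size` cap are assumptions for the example; a real miner would use a pruning algorithm such as Apriori rather than full enumeration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, s_min, max_size=3):
    """Return {itemset: support} for all itemsets with support >= s_min.

    Brute-force enumeration, suitable only for toy data.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            s = support(candidate, transactions)
            if s >= s_min:
                result[frozenset(candidate)] = s
    return result
```

For example, with transactions `{a, b}`, `{a, c}`, `{a, b, c}`, `{b}` and `s_min = 0.5`, the itemset `{a, b}` has support 2/4 and is frequent, while `{b, c}` has support 1/4 and is not.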
The Problem
  • How to randomize transactions so that
    • we can find frequent itemsets
    • while preserving privacy at transaction level?

A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int’l Conf. on Knowledge Discovery in Databases and Data Mining, July 2002.

Randomization Overview
[Figure: Alice (J.S. Bach, painting, nasa.gov), Bob (B. Spears, baseball, cnn.com), and Chris (B. Marley, camping, linux.org) send their transactions to a Recommendation Service, which mines associations and returns recommendations. With randomization, each client perturbs its transaction before sending it: Alice sends (Metallica, painting, nasa.gov), Bob sends (B. Spears, soccer, bbc.co.uk), and Chris sends (B. Marley, camping, ibm.com). The service applies support recovery before mining associations and producing recommendations.]


Uniform Randomization
  • Given a transaction,
    • keep item with, say 20% probability,
    • replace with a new random item with 80% probability.
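
A minimal Python sketch of this operator follows. The function name, the `universe` list of all items, and the `rng` parameter are assumptions for the example; the slide only fixes the keep and replace probabilities.

```python
import random


def uniform_randomize(transaction, universe, keep_prob=0.2, rng=random):
    """Keep each item with probability keep_prob, else swap in a random item.

    `universe` is a list of all possible items (hypothetical parameter).
    A replacement may coincide with an item already present, so the
    result can be smaller than the input transaction.
    """
    randomized = set()
    for item in transaction:
        if rng.random() < keep_prob:
            randomized.add(item)                   # item survives
        else:
            randomized.add(rng.choice(universe))   # item is replaced
    return randomized
```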
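Example: {x, y, z}

10 M transactions of size 10 with 10 K items:
  • 1% have {x, y, z}
    • Probability that the randomized transaction contains {x, y, z}: $0.2^3$
    • 0.008% of all transactions (800 transactions); 97.8% of the randomized transactions that contain {x, y, z}
  • 5% have {x, y}, {x, z}, or {y, z} only
    • Probability: at most $0.2^2 \cdot 8/10{,}000$
    • 0.00016% (16 transactions); 1.9%
  • 94% have one or zero items of {x, y, z}
    • Probability: at most $0.2 \cdot (9/10{,}000)^2$
    • less than 0.00002% (2 transactions); 0.3%

Privacy Breach: Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one.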
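
The figures in this example can be re-derived with a short calculation. The sketch below only restates the slide's numbers; the variable names are assumptions, and the two smaller probabilities are the upper bounds quoted on the slide.

```python
N = 10_000_000      # number of transactions
ITEMS = 10_000      # number of distinct items
KEEP = 0.2          # probability that an item is kept

# case -> (share of original transactions,
#          probability that the randomized transaction contains {x, y, z})
cases = {
    "all three": (0.01, KEEP ** 3),
    "exactly two": (0.05, KEEP ** 2 * 8 / ITEMS),
    "one or zero": (0.94, KEEP * (9 / ITEMS) ** 2),
}

# Expected number of randomized transactions containing {x, y, z}, per case
counts = {name: N * share * p for name, (share, p) in cases.items()}
total = sum(counts.values())

# Posterior: given {x, y, z} in t', which case did the original t come from?
posterior = {name: c / total for name, c in counts.items()}
```

The "all three" case contributes 800 of roughly 818 expected transactions, which is where the "about 98% certainty" comes from.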

Privacy Breach
  • Suppose:
    • t is an original transaction;
    • t’ is the corresponding randomized transaction;
    • A is a (frequent) itemset.
  • Definition: Itemset A causes a privacy breach of level $\rho$ if, for some item $z \in A$, $\Pr[z \in t \mid A \subseteq t'] \ge \rho$
Our Solution

“Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?” “He grows a forest to hide it in.”

G.K. Chesterton

  • Insert many false items into each transaction
  • Hide true itemsets among false ones

Can we still find frequent itemsets while having sufficient privacy?

Cut and Paste Randomization
  • Given transaction t of size m, construct t’:
    • Choose a number j between 0 and $K_m$ (cutoff);
    • Include j items of t into t’;
    • Each other item is included into t’ with probability $p_m$.
  • The choice of $K_m$ and $p_m$ is based on the desired level of privacy.
  • Example with j = 4:
    • t = a, b, c, u, v, w, x, y, z
    • t’ = b, v, x, z, œ, å, ß, ξ, ψ, €, א, ъ, ђ, …
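
The three steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: j is drawn uniformly and capped at the transaction size, "each other item" is read as every item of the universe not already chosen (including the rest of t), and the names `universe`, `K_m`, `p_m`, and `rng` are chosen for the example.

```python
import random


def cut_and_paste(t, universe, K_m, p_m, rng=random):
    """Randomize transaction t into t' with the cut-and-paste operator."""
    t = list(t)
    # Step 1: choose the cutoff j between 0 and K_m (capped at |t|, an assumption)
    j = min(rng.randint(0, K_m), len(t))
    # Step 2: include j randomly selected items of t in t'
    t_prime = set(rng.sample(t, j))
    # Step 3: every other item enters t' independently with probability p_m
    for item in universe:
        if item not in t_prime and rng.random() < p_m:
            t_prime.add(item)
    return t_prime
```

With `p_m = 0` the result is just a random subset of t of size at most `K_m`; raising `p_m` adds the false items that hide the true ones.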

Partial Supports

To recover original support of an itemset, we need randomized supports of its subsets.

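  • Given an itemset A of size k and transaction size m,
  • A vector of partial supports of A is $\vec{s} = (s_0, s_1, \ldots, s_k)$, where $s_l = \frac{1}{|T|} \cdot \#\{t \in T : |t \cap A| = l\}$
    • Here $s_k$ is the same as the support of A.
    • Randomized partial supports are denoted by $\vec{s}\,'$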
Transition Matrix
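  • Let k = |A|, m = |t|.
  • Transition matrix P = P(k, m) connects randomized partial supports with original ones: $\mathbf{E}\,\vec{s}\,' = P \cdot \vec{s}$, where $P_{l'l} = \Pr[\,|t' \cap A| = l' \mid |t \cap A| = l\,]$
  • Randomized supports are distributed as a sum of multinomial distributions.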
The Unbiased Estimators
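  • Given randomized partial supports, we can estimate original partial supports: $\vec{s}_{est} = Q \cdot \vec{s}\,'$, where $Q = P^{-1}$
  • Covariance matrix for this estimator: $\mathrm{Cov}\,\vec{s}_{est} = \frac{1}{|T|} \sum_{l=0}^{k} s_l\, Q\, D[l]\, Q^{T}$, where $D[l]_{ij} = P_{il}\,\delta_{ij} - P_{il} P_{jl}$
  • To estimate it, substitute $s_l$ with $(s_{est})_l$.
    • Special case: estimators for support and its variance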
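
A minimal sketch of the estimation step: recovering the original partial supports amounts to solving a small linear system with the transition matrix. The helper names and the 2×2 example matrix are assumptions for illustration (the entry `P[i][l]` is taken to be the probability of moving from original count `l` to randomized count `i`); the code does not compute the covariance.

```python
def solve(P, b):
    """Solve P x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in P[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot_row][col]) < 1e-12:
            raise ValueError("transition matrix is singular")
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def estimate_partial_supports(P, s_randomized):
    """Unbiased estimate of the original partial supports: P^{-1} applied to s'."""
    return solve(P, s_randomized)
```

For instance, with the hypothetical matrix `P = [[0.9, 0.3], [0.1, 0.7]]` and original partial supports (0.8, 0.2), the expected randomized supports are (0.78, 0.22), and applying `estimate_partial_supports` to them returns the original pair.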
Privacy Breach Analysis
  • How many added items are enough to protect privacy?
    • Have to satisfy $\Pr[z \in t \mid A \subseteq t'] < \rho$ (no privacy breaches)
    • Select parameters so that it holds for all itemsets.
    • Use formula ( ):
  • Parameters are to be selected in advance!
    • Construct a privacy-challenging test: an itemset whose all subsets have maximum possible support.
    • Enough to know maximal support of an itemset for each size.
Lowest Discoverable Support
  • LDS is the support that, when predicted, is $4\sigma$ away from zero.
  • Roughly, LDS is proportional to $1/\sqrt{|T|}$

|t| = 5, $\rho$ = 50%

LDS vs. Breach Level

|t| = 5, |T| = 5 M

  • Reminder: breach level is the limit on $\Pr[z \in t \mid A \subseteq t']$
Real Datasets: soccer, mailorder
  • Soccer is the clickstream log of WorldCup’98 web site, split into sessions of HTML requests.
    • 11 K items (HTMLs), 6.5 M transactions
  • Mailorder is a purchase dataset from a certain on-line store
    • Products are replaced with their categories
    • 96 items (categories), 2.9 M transactions
Results

Breach level = 50%.
  • Soccer: $s_{min}$ = 0.2%, $\sigma \approx$ 0.07% for 3-itemsets
  • Mailorder: $s_{min}$ = 0.2%, $\sigma \approx$ 0.05% for 3-itemsets


Summary
  • Can have our cake and mine it too!
  • Randomization is an interesting approach for building data mining models while preserving user privacy!!!
  • Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000.
  • S. Rizvi, J. Haritsa. Privacy-Preserving Association Rule Mining. VLDB 2002.
  • J. Vaidya, C.W. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD 2002.

The Hippocratic Oath

“What I may see or hear in the course of treatment or even outside of the treatment in regard to the life of men, which on no account [ought to be] spread abroad, I will keep to myself, holding such things shameful to be spoken about.”

– Hippocratic Oath, 8 (circa 400 BC)

Hippocratic Databases

Founding tenet:

Responsibility for the privacy of data they manage.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int’l Conf. on Very Large Databases (VLDB), August 2002.

Approach
  • Derive founding principles from current privacy legislation.
  • Strawman Design
Ten Principles of Hippocratic Databases
  • Collection Group
    • Purpose Specification, Consent, Limited Collection
  • Use Group
    • Limited Use, Limited Disclosure, Limited Retention, Accuracy
  • Security & Openness Group
    • Safety, Openness, Compliance
Collection Group
  • Purpose Specification
    • For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information.
  • Consent
    • The purposes associated with personal information shall have consent of the donor (person whose information is being stored).
  • Limited Collection
    • The information collected shall be limited to the minimum necessary for accomplishing the specified purposes.
Use Group
  • Limited Use
    • The database shall run only those queries that are consistent with the purposes for which the information has been collected.
  • Limited Disclosure
    • Personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information.
Use Group (2)
  • Limited Retention
    • Personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected.
  • Accuracy
    • Personal information stored in the database shall be accurate and up-to-date.
Security & Openness Group
  • Safety
    • Personal information shall be protected by security safeguards against theft and other misappropriations.
  • Openness
    • A donor shall be able to access all information about the donor stored in the database.
  • Compliance
    • A donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance.
Strawman Architecture
[Figure: four components (Privacy Policy, Data Collection, Queries, Other) on top of the Store.]


Architecture: Policy
  • Privacy Policy → Privacy Metadata Creator: converts privacy policy into privacy metadata tables.
    • For each purpose & piece of information (attribute):
      • External recipients
      • Retention period
      • Authorized users
    • Different designs possible.
  • Principles addressed: Limited Disclosure, Limited Retention
  • Store: Privacy Metadata


Architecture: Data Collection
  • Privacy Constraint Validator (Consent): Is the privacy policy compatible with the user’s privacy preference?
  • Data Accuracy Analyzer (Accuracy): Data cleansing, e.g., errors in address.
  • Audit Info (Compliance): Audit trail for compliance.
  • Record Access Control (Purpose Specification): Associate set of purposes with each record.
  • Store: Privacy Metadata, Audit Trail


Architecture: Queries
  • Attribute Access Control (Safety):
    • 1. Telemarketing cannot issue query tagged “charge”.
    • 2. Query tagged “telemarketing” cannot see credit card info.
  • Record Access Control (Limited Use):
    • 3. Telemarketing query only sees records that include “telemarketing” in set of purposes.
  • Query Intrusion Detector (Safety):
    • e.g., a telemarketing query that asks for all phone numbers.
  • Audit Info (Compliance), used for:
    • Compliance
    • Training data for query intrusion detector
  • Store: Privacy Metadata, Audit Trail
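
The three query checks can be sketched as a small in-memory example. Everything named here (`PRIVACY_METADATA`, `Record`, `run_query`, the user and attribute names) is hypothetical and only mirrors the telemarketing example; a real system would enforce these checks inside the database engine.

```python
from dataclasses import dataclass, field

# Hypothetical privacy metadata: purpose -> authorized users and visible attributes
PRIVACY_METADATA = {
    "telemarketing": {"users": {"telemarketer"}, "attributes": {"name", "phone"}},
    "charge": {"users": {"billing"}, "attributes": {"name", "credit_card"}},
}


@dataclass
class Record:
    values: dict
    purposes: set = field(default_factory=set)  # purposes the donor consented to


def run_query(user, purpose, attributes, records):
    """Return the requested attributes of the records visible for this purpose."""
    meta = PRIVACY_METADATA.get(purpose)
    # Check 1: the user must be authorized to issue queries for this purpose
    if meta is None or user not in meta["users"]:
        raise PermissionError(f"{user} may not issue queries tagged {purpose!r}")
    # Check 2: the purpose must cover every requested attribute
    blocked = set(attributes) - meta["attributes"]
    if blocked:
        raise PermissionError(f"purpose {purpose!r} cannot see {sorted(blocked)}")
    # Check 3: only records whose purpose set includes this purpose are visible
    return [
        {a: r.values[a] for a in attributes}
        for r in records
        if purpose in r.purposes
    ]
```

Checks 1 and 2 run before the query touches any data, while check 3 filters record by record during execution.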

Architecture: Other
  • Data Collection Analyzer (Limited Collection): Analyze queries to identify unnecessary collection, retention & authorizations.
  • Data Retention Manager (Limited Retention): Delete items in accordance with privacy policy.
  • Encryption Support (Safety): Additional security for sensitive data.
  • Store: Privacy Metadata
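
As an illustration of the Data Retention Manager, the sketch below drops records whose retention period has expired. The retention table, the record layout, and the rule that a record is kept for the longest period among its purposes are assumptions made for the example.

```python
from datetime import date, timedelta

# Hypothetical retention periods (in days) per purpose
RETENTION_DAYS = {"purchase": 30, "recommendations": 365}


def purge_expired(records, today):
    """Return only the records that are still within their retention period.

    Each record is a dict with a `collected` date and a set of `purposes`.
    A record with no remaining purpose is dropped.
    """
    kept = []
    for rec in records:
        limit = max((RETENTION_DAYS[p] for p in rec["purposes"]), default=0)
        if limit > 0 and today - rec["collected"] <= timedelta(days=limit):
            kept.append(rec)
    return kept
```

This only covers the live table; removing the same data from logs and checkpoints is a separate problem.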

Strawman Architecture
[Figure: the complete strawman architecture.]
  • Privacy Policy: Privacy Metadata Creator
  • Data Collection: Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info
  • Queries: Attribute Access Control, Query Intrusion Detector, Audit Info
  • Other: Data Collection Analyzer, Data Retention Manager
  • Store: Record Access Control, Encryption Support, Privacy Metadata, Audit Trail


Status
  • Prototyping core functionality of the design
  • Nibbling at some of the open problems (see VLDB-2002 paper)
Privacy-Preserving Synthetic Datasets for Data Mining Research
  • How to randomize to be able to build multiple types of models
  • How to handle combination of data types
  • How to handle rare events
Network is the Database
  • What if private data never leaves a person’s data store?
    • Computations travel to data
[Figure: a credit application sends an approval function to Jane’s data store; the function runs against Jane’s data there, and only the result (the decision) is returned.]
Decision-Making Across Private Data Repositories

Minimal Necessary Sharing

  • Separate databases due to statutory, competitive, or security reasons.
    • Selective, minimal sharing on need-to-know basis.
  • Example: Among those who took a particular drug, how many had an adverse reaction and have DNA that contains a specific sequence?
    • Researchers must not learn anything beyond counts.

  • Computing R ∩ S:
    • R must not know that S has b & y
    • S must not know that R has a & x
  • Computing Count (R ∩ S):
    • R & S do not learn anything except that the result is 2.
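
One known way to compute such a count is with commutative encryption, where each party encrypts hashed values with its own secret exponent and the order of encryption does not matter. The sketch below simulates both parties in one process; it is a toy illustration under assumed names and parameters (the modulus, `keygen`, `h`, `encrypt`, `intersection_size`), not a secure or complete protocol.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime used as a toy modulus, not a vetted parameter


def keygen():
    """Pick a secret exponent e with gcd(e, P - 1) = 1, so x -> x^e mod P is a bijection."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if math.gcd(e, P - 1) == 1:
            return e


def h(value):
    """Hash a value into the range [1, P - 1]."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(values, key):
    """Commutative encryption: (x^a)^b = (x^b)^a mod P."""
    return {pow(v, key, P) for v in values}


def intersection_size(R, S):
    key_r, key_s = keygen(), keygen()          # each party keeps its key private
    r_once = encrypt({h(v) for v in R}, key_r)  # R sends this to S
    s_once = encrypt({h(v) for v in S}, key_s)  # S sends this to R
    r_twice = encrypt(r_once, key_s)            # S re-encrypts R's values
    s_twice = encrypt(s_once, key_r)            # R re-encrypts S's values
    # Doubly encrypted values match exactly when the underlying values match
    return len(r_twice & s_twice)
```

With R = {a, u, v, x} and S = {b, u, v, y}, `intersection_size` returns 2, and neither side sees the other's values in the clear.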
Closing Thoughts
  • The right to privacy: the most cherished of human freedoms

-- Warren & Brandeis, 1890

  • Code is law … it is all a matter of code: the software and hardware that now rule

-- L. Lessig

  • We can architect computing systems to protect values we believe are fundamental, or we can architect them to allow those values to disappear.
  • What do we want to do as computer scientists?
References
  • R. Agrawal, R. Srikant. Information Integration Across Autonomous Enterprises. ACM Int’l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.
  • R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath Based Preference Language for P3P. 12th Int’l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
  • R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int’l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
  • R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.
  • R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int’l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
  • R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int’l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
  • A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int’l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002.
  • R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. on Management of Data (SIGMOD), Dallas, Texas, May 2000.
New Challenges
  • General
    • Language
    • Efficiency
  • Use
    • Limited Collection
    • Limited Disclosure
    • Limited Retention
  • Security and Openness
    • Safety
    • Openness
    • Compliance
Language
  • Need a language for privacy policies & user preferences.
  • P3P can be used as starting point.
    • Developed primarily for web shopping.
    • What about richer domains?
  • How do we balance expressibility and usability?
  • Arrange concepts in hierarchy or subsumption relationship.
    • Purpose (hierarchy figure): contact; email, phone; home, work
  • P3P recipients: Ours, Same, Delivery, Unrelated, Public


Language (2)
  • How do we accommodate user negotiation models?
    • User willing to disclose information only if fairly compensated.
    • Value of privacy as coalitional game [KPR2001]
Efficiency
  • How do we minimize the cost of privacy checking?
  • How do we incorporate purpose into database design and query optimization?
  • Tradeoffs between space & running time.
    • Only tag records in customer table with purpose, not all records. But now need to do a join when scanning records in order table.
  • How does the secure-database work on decomposing multilevel relations into single-level relations [JS91] apply here?
Limited Collection
  • How do we identify attributes that are collected but not used?
    • Assets are only needed for mortgage when salary is below some threshold.
  • What’s the needed granularity for numeric attributes?
    • Queries only ask “Salary > threshold” for rent application.
  • How do we generate minimal queries?
    • Redundancy may be hidden in application code.
Limited Disclosure
  • Can the user dynamically determine the set of recipients?
  • Example: Alice wants to add EasyCredit to set of recipients in EquiRate’s database.
  • Digital signatures.
Limited Retention
  • Completely forgetting some information is non-trivial.
  • How do we delete a record from the logs and checkpoints, without affecting recovery?
  • How do we continue to support historical analysis and statistical queries without incurring privacy breaches?
Safety
  • Encryption provides additional layer of security.
  • How do we index encrypted data?
  • How do we run queries against encrypted data?
  • [SWP00], [HILM02]
Openness
  • A donor shall be able to access all information about the donor stored in the database.
  • How does the database check Alice is really Alice and not somebody else?
    • Princeton admissions office broke into Yale’s admissions site using applicants’ Social Security numbers and birth dates.
  • How does Alice find out what databases have information about her?
    • Symmetrically private information retrieval [GIKM98].
Compliance
  • Universal Logging
    • Can we provide each user whose data is accessed with a log of that access, along with the query reading the data?
    • Use intermediaries who aggregate and analyze logs for many users.
  • Tracking Privacy Breaches
    • Insert “fingerprint” records with emails, telephone numbers, and credit card numbers.
    • Some data may be more valuable for spammers or credit card theft. How do we identify categories to do stratified fingerprinting rather than randomly inserting records?