Hippocratic data management

Hippocratic Data Management. Rakesh Agrawal IBM Almaden Research Center. Thesis. We need information systems that respect the privacy of data they manage AND do not impede the useful flow of information. It is feasible to reconcile the apparent contradiction.

Hippocratic Data Management

Rakesh Agrawal

IBM Almaden Research Center


Thesis

  • We need information systems that

    • respect the privacy of data they manage

      AND

    • do not impede the useful flow of information.

  • It is feasible to reconcile the apparent contradiction


Outline

  • Why Privacy in Data Systems

  • Some Technology Directions

  • Some Challenging Problems


Drivers for Privacy

  • Privacy Surveys:

    • 17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99)

    • 83% would stop doing business with a company if it misused customer information (Privacy on and off the Internet: What consumers want, Nov. 2001)

  • Government legislation & guidelines:

    • Fair Information Practices Act (US, 1974)

    • OECD Guidelines (Europe, 1980)

    • Canadian Standards Association’s Model Code (1995)

    • Australian Privacy Amendment (2000)

    • Japan: proposed legislation (2003)

    • HIPAA, GLB, Recent U.S. Federal & State Initiatives


Privacy Violations

  • Accidents:

    • Kaiser, GlobalHealthtrax

  • Lax security:

    • Massachusetts govt.

  • Ethically questionable behavior:

    • Lotus & Equifax, Lexis-Nexis, Medical Marketing Service, Boston University, CVS & Giant Food

  • Illegal:

    • Toysmart


Assertion

  • Enterprises lack tools and technologies for managing private data and enforcing privacy policies.


Founding Tenets of Current Database Systems

  • Ullman, “Principles of Database and Knowledgebase Systems”

  • Fundamental:

    • Manage persistent data.

    • Access a large amount of data efficiently.

  • Desirable:

    • Support for data model, high-level languages, transaction management, access control, and resiliency.

  • Similar list in other database textbooks.


Statistical & Secure Databases

  • Statistical Databases

    • Provide statistical information (sum, count, etc.) without compromising sensitive information about individuals [AW89].

  • Multilevel Secure Databases

    • Multilevel relations, e.g., records tagged “secret”, “confidential”, or “unclassified” [JS91].

  • Need to protect privacy in transactional databases that support daily operations.

    • Cannot restrict queries to statistical queries.

    • Cannot tag all the records “top secret”.


Our Research Directions

  • Privacy Preserving Data Mining

  • Hippocratic Databases


Data Mining and Privacy

  • The primary task in data mining: development of models about aggregated data.

  • Can we develop accurate models without access to precise information in individual data records?

R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. on Management of Data (SIGMOD), May 2000.


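Privacy Preserving Data Mining

[Diagram: individual records (e.g., 30 | 25K | … and 50 | 40K | …) each pass through a Randomizer, so that only perturbed records (e.g., 65 | 50K | … and 35 | 60K | …) are collected. The age distribution and the salary distribution are reconstructed from the perturbed values and fed to the data mining algorithm, which produces the model.]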


Reconstruction Problem

  • Original values x1, x2, ..., xn

    • from probability distribution X

  • To hide these values, we use y1, y2, ..., yn

    • from probability distribution Y

  • Given

    • x1+y1, x2+y2, ..., xn+yn

    • the probability distribution of Y

  • Estimate the probability distribution of X.


Intuition (Reconstruct single point)

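  • Use Bayes' rule for density functions: given one observed value w = x + y, the posterior density of x is f_X(a | w) = f_Y(w − a) f_X(a) / ∫ f_Y(w − z) f_X(z) dz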




Reconstruction: Intuition

  • Combine estimates of where a point came from for all the points:

    • yields estimate of original distribution.


Reconstruction Algorithm

  • f_X^0 := uniform distribution

  • j := 0

  • repeat

    • f_X^(j+1)(a) := Bayes’ rule applied to all randomized values, with f_X^j as the prior

    • j := j+1

  • until (stopping criterion met)

  • Converges to maximum likelihood estimate.

    • D. Agrawal & C.C. Aggarwal, PODS 2001.
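
The iteration above can be sketched on a discretized domain. This is only an illustrative sketch, not the authors' implementation: it assumes the original values fall on a small set of bin midpoints, that the noise Y is uniform on [-half_width, +half_width], and that a fixed number of iterations serves as the stopping criterion. All function and variable names are hypothetical.

```python
import random

def uniform_noise_density(y, half_width):
    """Density of noise Y ~ Uniform[-half_width, +half_width] (assumed noise model)."""
    return 1.0 / (2 * half_width) if -half_width <= y <= half_width else 0.0

def reconstruct(randomized, bins, half_width, iterations=50):
    """Estimate Pr[X = bin] from values w_i = x_i + y_i by iterating Bayes' rule."""
    estimate = [1.0 / len(bins)] * len(bins)          # f_X^0 := uniform
    for _ in range(iterations):                       # assumed stopping criterion
        updated = [0.0] * len(bins)
        for w in randomized:
            # Posterior over bins for this one observation, using the current estimate.
            weights = [uniform_noise_density(w - a, half_width) * p
                       for a, p in zip(bins, estimate)]
            total = sum(weights)
            if total == 0.0:
                continue
            for k, weight in enumerate(weights):
                updated[k] += weight / total
        norm = sum(updated)
        estimate = [u / norm for u in updated]        # average of the posteriors
    return estimate

rng = random.Random(0)
bins = [20, 30, 40, 50, 60]                           # hypothetical age midpoints
true_probs = [0.1, 0.4, 0.1, 0.3, 0.1]                # hypothetical original distribution
originals = rng.choices(bins, weights=true_probs, k=5000)
randomized = [x + rng.uniform(-25, 25) for x in originals]
estimate = reconstruct(randomized, bins, half_width=25)
```

Only `randomized` and the noise model are passed to `reconstruct`; the original values never reach it. How closely `estimate` tracks `true_probs` depends on the noise width and the number of records.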



Classification

  • Naïve Bayes

    • Assumes independence between attributes.

  • Decision Tree

    • Correlations are weakened by randomization.


Experimental Methodology

  • Compare accuracy against

    • Original: unperturbed data without randomization.

    • Randomized: perturbed data but without making any corrections for randomization.

  • Test data not randomized.

  • Synthetic data benchmark from [AGI+92].

  • Training set of 100,000 records, split equally between the two classes.




So far

  • Question: Can we develop accurate models without access to precise information in individual data records?

  • Answer: yes, by randomization.

    • for numerical attributes, classification

  • How about Association Rules?


Associations Recap

  • A transaction t is a set of items (e.g. books)

  • All transactions form a set T of transactions

  • Any itemset A has support s in T if s = |{t ∈ T : A ⊆ t}| / |T|

  • Itemset A is frequent if s ≥ s_min

  • Task: Find all frequent itemsets
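
The definitions above can be made concrete with a small brute-force sketch. It enumerates every candidate itemset up to a size limit, which is fine for a toy example but is not how a production miner would work; the transactions and the `max_size` parameter are made up for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, s_min, max_size=3):
    """Return {itemset: support} for all itemsets with support >= s_min."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= s_min:
                result[frozenset(candidate)] = s
    return result

# Hypothetical toy data: five transactions over the items x, y, z.
transactions = [frozenset(t) for t in (
    {"x", "y", "z"}, {"x", "y"}, {"x", "z"}, {"y"}, {"x", "y", "z"})]
frequent = frequent_itemsets(transactions, s_min=0.6)
```

With these five transactions, {y, z} and {x, y, z} each appear in only two of five transactions, so they fall below the 60% threshold.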


The Problem

  • How to randomize transactions so that

    • we can find frequent itemsets

    • while preserving privacy at transaction level?

A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), July 2002.


Randomization Overview

[Diagram, built up over three slides. Alice (J.S. Bach, painting, nasa.gov), Bob (B. Spears, baseball, cnn.com), and Chris (B. Marley, camping, linux.org) each send their items to a Recommendation Service, which mines associations and returns recommendations. In the final build the transactions are randomized before they are sent: Alice's becomes Metallica, painting, nasa.gov; Bob's becomes B. Spears, soccer, bbc.co.uk; Chris's becomes B. Marley, camping, ibm.com. The service then performs Support Recovery before mining associations.]


Uniform Randomization

  • Given a transaction,

    • keep each item with, say, 20% probability,

    • replace it with a new random item with 80% probability.
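
A minimal sketch of this operator follows. It assumes the replacement item is drawn uniformly from the whole item catalog (the slide does not say how the "new random item" is chosen); the function name and parameters are hypothetical.

```python
import random

def uniform_randomize(transaction, all_items, keep_prob=0.2, rng=random):
    """Keep each item with probability keep_prob, else swap in a random catalog item."""
    randomized = set()
    for item in sorted(transaction):          # sorted only to make runs reproducible
        if rng.random() < keep_prob:
            randomized.add(item)              # item survives unchanged
        else:
            randomized.add(rng.choice(all_items))  # assumed: uniform draw from catalog
    return randomized

catalog = list("abcdefghij")                  # hypothetical 10-item catalog
out = uniform_randomize({"a", "b", "c"}, catalog, rng=random.Random(0))
```

Because the result is a set, it can be smaller than the input when two replacements coincide.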


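Example: {x, y, z}

10 M transactions of size 10 with 10 K items, each item kept with probability 0.2:

  • 1% have {x, y, z}

    • all three items are kept with probability 0.2^3, giving 0.008% of all transactions (800 transactions), i.e. 97.8% of the randomized transactions that contain {x, y, z}

  • 5% have {x, y}, {x, z}, or {y, z} only

    • {x, y, z} appears with probability 0.2^2 • 8/10,000, giving 0.00016% (16 transactions), i.e. 1.9%

  • 94% have one or zero items of {x, y, z}

    • {x, y, z} appears with probability at most 0.2 • (9/10,000)^2, giving less than 0.00002% (2 transactions), i.e. 0.3%

Privacy Breach: Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one.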
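
The arithmetic behind this example can be checked directly. The sketch below simply multiplies out the slide's own figures (shares of 1%, 5%, 94% and the three retention probabilities); the variable names are illustrative.

```python
TRANSACTIONS = 10_000_000
ITEMS = 10_000
KEEP = 0.2

# name -> (share of original transactions, chance {x, y, z} shows up after randomization)
cases = {
    "all three": (0.01, KEEP ** 3),
    "exactly two": (0.05, KEEP ** 2 * 8 / ITEMS),
    "one or zero": (0.94, KEEP * (9 / ITEMS) ** 2),   # upper bound from the slide
}

# Expected number of randomized transactions containing {x, y, z}, per case.
expected = {name: share * chance * TRANSACTIONS
            for name, (share, chance) in cases.items()}

# Posterior probability that the original transaction really held {x, y, z}.
posterior = expected["all three"] / sum(expected.values())
```

Rounded, the counts are 800, 16, and about 2 transactions, and `posterior` comes out at roughly 98%, matching the breach stated above.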


Privacy Breach

  • Suppose:

    • t is an original transaction;

    • t’ is the corresponding randomized transaction;

    • A is a (frequent) itemset.

  • Definition: Itemset A causes a privacy breach of level ρ if, for some item z ∈ A, Pr[z ∈ t | A ⊆ t’] ≥ ρ.


Our Solution

“Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?” “He grows a forest to hide it in.”

G.K. Chesterton

  • Insert many false items into each transaction

  • Hide true itemsets among false ones

Can we still find frequent itemsets while having sufficient privacy?


Cut and Paste Randomization

  • Given transaction t of size m, construct t’:

    • Choose a number j between 0 and K_m (cutoff);

    • Include j items of t into t’;

    • Each other item is included into t’ with probability p_m.

      The choice of K_m and p_m is based on the desired level of privacy.

  • Example with j = 4:

    • t = a, b, c, u, v, w, x, y, z

    • t’ = b, v, x, z, œ, å, ß, ξ, ψ, €, א, ъ, ђ, …
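
The three steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: j is drawn uniformly, j is clamped to the transaction size when the cutoff exceeds it, and "each other item" is read as every catalog item not already chosen. The names `cutoff` and `insert_prob` stand in for K_m and p_m.

```python
import random

def cut_and_paste(transaction, all_items, cutoff, insert_prob, rng=random):
    """Randomize one transaction: keep j of its items, then add others at random."""
    items = sorted(transaction)                        # sorted for reproducibility
    j = min(rng.randint(0, cutoff), len(items))        # assumed: uniform j, clamped to m
    kept = set(rng.sample(items, j))                   # "cut": j true items survive
    randomized = set(kept)
    for item in all_items:                             # "paste": each other item
        if item not in kept and rng.random() < insert_prob:
            randomized.add(item)
    return randomized

rng = random.Random(7)
catalog = [f"item{i}" for i in range(50)]              # hypothetical item catalog
t = {"item1", "item2", "item3", "item4", "item5"}
only_cut = cut_and_paste(t, catalog, cutoff=3, insert_prob=0.0, rng=rng)
noisy = cut_and_paste(t, catalog, cutoff=3, insert_prob=0.3, rng=rng)
```

With `insert_prob=0.0` only the cut step is visible; raising it hides the surviving true items among false ones.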


Partial Supports

To recover the original support of an itemset, we need randomized supports of its subsets.

  • Given an itemset A of size k and transaction size m,

  • A vector of partial supports of A is s = (s_0, s_1, ..., s_k), where s_l = |{t ∈ T : |t ∩ A| = l}| / |T|

    • Here s_k is the same as the support of A.

    • Randomized partial supports are denoted by s’ = (s’_0, s’_1, ..., s’_k).


Transition Matrix

  • Let k = |A|, m = |t|.

  • Transition matrix P = P(k, m) connects randomized partial supports with original ones: E[s’] = P · s, where entry P_{l’l} = Pr[|t’ ∩ A| = l’ given |t ∩ A| = l]

  • Randomized supports are distributed as a sum of multinomial distributions.


The Unbiased Estimators

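  • Given randomized partial supports, we can estimate original partial supports: s_est = Q · s’, where Q = P^(-1)

  • Covariance matrix for this estimator: Cov(s_est) = (1/|T|) · Σ_{l=0..k} s_l · Q · D[l] · Q^T, where D[l]_{ij} = P_{il} · δ_{ij} − P_{il} · P_{jl}

  • To estimate it, substitute s_l with (s_est)_l.

    • Special case: estimators for support and its variance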
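
For a single-item itemset (k = 1) the transition matrix is 2 × 2, which makes the estimator easy to try out. The sketch below is a simplified illustration, not the paper's procedure: it assumes an item that is present survives with probability `keep_prob` and an absent item is inserted with probability `insert_prob`, and it simulates the randomization instead of using real data.

```python
import random

def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def estimate_partial_supports(randomized_supports, transition):
    """s_est = P^(-1) . s' for the k = 1 case."""
    q = invert_2x2(transition)
    return [sum(q[i][j] * randomized_supports[j] for j in range(2)) for i in range(2)]

keep_prob, insert_prob = 0.5, 0.1                 # hypothetical randomization parameters
# Column l holds Pr[item absent / present in t' | item present in l of 1 positions].
transition = [[1 - insert_prob, 1 - keep_prob],
              [insert_prob, keep_prob]]

rng = random.Random(42)
n, true_support = 200_000, 0.30                   # hypothetical data set
hits = 0
for i in range(n):
    has_item = i < n * true_support
    hits += rng.random() < (keep_prob if has_item else insert_prob)

randomized = [1 - hits / n, hits / n]             # s' = (s'_0, s'_1)
estimated = estimate_partial_supports(randomized, transition)
```

Here the randomized support sits near 22%, and inverting the transition matrix moves the estimate back toward the true 30%.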


Privacy Breach Analysis

  • How many added items are enough to protect privacy?

    • Have to satisfy Pr[z ∈ t | A ⊆ t’] < ρ (no privacy breaches)

    • Select parameters so that it holds for all itemsets.

    • Use formula ( ):

  • Parameters are to be selected in advance!

    • Construct a privacy-challenging test: an itemset all of whose subsets have the maximum possible support.

    • Enough to know maximal support of an itemset for each size.


Lowest Discoverable Support

  • LDS is the support such that, when predicted, it is 4σ away from zero.

  • Roughly, LDS is proportional to 1/√|T|

|t| = 5, ρ = 50%


LDS vs. Breach Level

|t| = 5, |T| = 5 M

  • Reminder: breach level is the limit on Pr[z ∈ t | A ⊆ t’]


Real Datasets: soccer, mailorder

  • Soccer is the clickstream log of WorldCup’98 web site, split into sessions of HTML requests.

    • 11 K items (HTMLs), 6.5 M transactions

  • Mailorder is a purchase dataset from a certain on-line store

    • Products are replaced with their categories

    • 96 items (categories), 2.9 M transactions


Results

Breach level = 50%.

Soccer:

smin = 0.2%

 0.07% for 3-itemsets

Mailorder:

smin = 0.2%

  0.05% for 3-itemsets


Summary

  • Can have our cake and mine it too!

  • Randomization is an interesting approach for building data mining models while preserving user privacy!!!

  • Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000.

  • S. Rizvi, J. Haritsa. Privacy-Preserving Association Rule Mining. VLDB 2002.

  • J. Vaidya, C.W. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD 2002.


The Hippocratic Oath

“What I may see or hear in the course of treatment or even outside of the treatment in regard to the life of men, which on no account [ought to be] spread abroad, I will keep to myself, holding such things shameful to be spoken about.”

– Hippocratic Oath, 8 (circa 400 BC)


Hippocratic Databases

Founding tenet:

Responsibility for the privacy of data they manage.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), August 2002.


Approach

  • Derive founding principles from current privacy legislation.

  • Strawman Design


Ten Principles of Hippocratic Databases

  • Collection Group

    • Purpose Specification, Consent, Limited Collection

  • Use Group

    • Limited Use, Limited Disclosure, Limited Retention, Accuracy

  • Security & Openness Group

    • Safety, Openness, Compliance


Collection Group

  • Purpose Specification

    • For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information.

  • Consent

    • The purposes associated with personal information shall have consent of the donor (person whose information is being stored).

  • Limited Collection

    • The information collected shall be limited to the minimum necessary for accomplishing the specified purposes.


Use Group

  • Limited Use

    • The database shall run only those queries that are consistent with the purposes for which the information has been collected.

  • Limited Disclosure

    • Personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information.


Use Group (2)

  • Limited Retention

    • Personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected.

  • Accuracy

    • Personal information stored in the database shall be accurate and up-to-date.


Security & Openness Group

  • Safety

    • Personal information shall be protected by security safeguards against theft and other misappropriations.

  • Openness

    • A donor shall be able to access all information about the donor stored in the database.

  • Compliance

    • A donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance.


Strawman Architecture

[Diagram: four functional areas (Privacy Policy, Data Collection, Queries, Other) together with the Store.]


Architecture: Policy

  • Privacy Metadata Creator (Limited Disclosure, Limited Retention)

    • Converts privacy policy into privacy metadata tables.

    • For each purpose & piece of information (attribute):

      • External recipients

      • Retention period

      • Authorized users

    • Different designs possible.

  • Store: Privacy Metadata
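
One possible shape for such privacy metadata is a lookup keyed by (purpose, attribute). The sketch below is only one of the "different designs possible"; the purposes, attributes, recipients, and retention periods are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyEntry:
    """Privacy metadata for one (purpose, attribute) pair."""
    external_recipients: frozenset
    retention_days: int
    authorized_users: frozenset

# Hypothetical policy entries; a real system would derive these from the privacy policy.
privacy_metadata = {
    ("charge", "credit_card"): PolicyEntry(
        frozenset({"credit-card-company"}), 90, frozenset({"billing"})),
    ("telemarketing", "phone"): PolicyEntry(
        frozenset(), 365, frozenset({"telemarketing"})),
}

def lookup(purpose, attribute):
    """Return the policy entry, or None if the attribute is not covered by the purpose."""
    return privacy_metadata.get((purpose, attribute))
```

A missing entry, such as `lookup("telemarketing", "credit_card")`, signals that the attribute was never collected for that purpose.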



Architecture: Data Collection

  • Privacy Constraint Validator (Consent)

    • Is the privacy policy compatible with the user’s privacy preference?

  • Audit Info (Compliance)

    • Audit trail for compliance.

  • Data Accuracy Analyzer (Accuracy)

    • Data cleansing, e.g., errors in address.

  • Record Access Control (Purpose Specification)

    • Associate set of purposes with each record.

  • Store: Privacy Metadata, Audit Trail


Architecture: Queries

  • Attribute Access Control and Record Access Control (Safety, Limited Use)

    1. Telemarketing cannot issue query tagged “charge”.

    2. Query tagged “telemarketing” cannot see credit card info.

    3. Telemarketing query only sees records that include “telemarketing” in set of purposes.

  • Query Intrusion Detector (Safety)

    • Example: a telemarketing query that asks for all phone numbers.

  • Audit Info (Compliance)

    • Compliance

    • Training data for query intrusion detector

  • Store: Record Access Control, Privacy Metadata, Audit Trail
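
The three numbered checks can be sketched as a single query gate. This is a toy illustration under assumed data: the users, purposes, attribute lists, and records are hypothetical, and a real system would enforce these checks inside the database engine.

```python
# Hypothetical privacy metadata: purposes each user may invoke,
# and attributes each purpose may read.
USER_PURPOSES = {"telemarketer": {"telemarketing"}, "billing_clerk": {"charge"}}
PURPOSE_ATTRIBUTES = {
    "telemarketing": {"name", "phone"},
    "charge": {"name", "credit_card"},
}

# Each record carries the set of purposes its donor consented to.
RECORDS = [
    {"name": "Alice", "phone": "555-0100", "credit_card": "0000-0000-0000-0001",
     "purposes": {"charge", "telemarketing"}},
    {"name": "Bob", "phone": "555-0101", "credit_card": "0000-0000-0000-0002",
     "purposes": {"charge"}},
]

def run_query(user, purpose, attributes):
    """Apply the three checks, then project the requested attributes."""
    # 1. The user must be authorized to issue queries tagged with this purpose.
    if purpose not in USER_PURPOSES.get(user, set()):
        raise PermissionError(f"{user} may not issue queries tagged {purpose!r}")
    # 2. Attribute access control: the purpose must cover every requested attribute.
    denied = set(attributes) - PURPOSE_ATTRIBUTES.get(purpose, set())
    if denied:
        raise PermissionError(f"{purpose!r} queries cannot see {sorted(denied)}")
    # 3. Record access control: only records whose purposes include this one.
    return [{a: r[a] for a in attributes} for r in RECORDS if purpose in r["purposes"]]

visible = run_query("telemarketer", "telemarketing", ["name", "phone"])
```

In this toy data Bob consented only to "charge", so the telemarketing query returns Alice's row alone.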


Architecture: Other

  • Data Collection Analyzer (Limited Collection)

    • Analyze queries to identify unnecessary collection, retention & authorizations.

  • Data Retention Manager (Limited Retention)

    • Delete items in accordance with privacy policy.

  • Encryption Support (Safety)

    • Additional security for sensitive data.

  • Store: Encryption Support, Privacy Metadata
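
As an illustration of the Data Retention Manager's job, the sketch below drops each purpose from a record once its retention period has passed and removes the record when no purpose remains. The retention periods and record layout are hypothetical, and the sketch ignores copies held in logs or backups.

```python
from datetime import date, timedelta

# Hypothetical retention periods per purpose, in days.
RETENTION_DAYS = {"charge": 90, "telemarketing": 365}

def purge_expired(records, today):
    """Keep only the purposes still within retention; drop records with none left."""
    kept = []
    for record in records:
        age = today - record["collected"]
        live = {p for p in record["purposes"]
                if age <= timedelta(days=RETENTION_DAYS[p])}
        if live:
            kept.append({**record, "purposes": live})
    return kept

records = [
    {"name": "Alice", "collected": date(2003, 1, 1),
     "purposes": {"charge", "telemarketing"}},
    {"name": "Bob", "collected": date(2001, 1, 1), "purposes": {"charge"}},
]
remaining = purge_expired(records, today=date(2003, 6, 1))
```

Alice's record survives with only the "telemarketing" purpose, while Bob's record is removed entirely.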


Strawman Architecture

  • Privacy Policy: Privacy Metadata Creator

  • Data Collection: Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info

  • Queries: Attribute Access Control, Query Intrusion Detector, Audit Info

  • Other: Data Collection Analyzer, Data Retention Manager

  • Store: Record Access Control, Encryption Support, Privacy Metadata, Audit Trail


Status

  • Prototyping core functionality of the design

  • Nibbling at some of the open problems (see VLDB-2002 paper)


Privacy-Preserving Synthetic Datasets for Data Mining Research

  • How to randomize to be able to build multiple types of models

  • How to handle combination of data types

  • How to handle rare events


Network is the Database

  • What if private data never leaves a person’s data store?

    • Computations travel to data

[Diagram: a credit application sends its approval function to Jane’s data store; the function runs against Jane’s data there, and only the result (the decision) comes back.]


Decision-Making Across Private Data Repositories

Minimal Necessary Sharing

  • Separate databases due to statutory, competitive, or security reasons.

    • Selective, minimal sharing on need-to-know basis.

  • Example: Among those who took a particular drug, how many had an adverse reaction and have DNA containing a specific sequence?

    • Researchers must not learn anything beyond counts.

  • Computing R ∩ S:

    • R must not know that S has b & y

    • S must not know that R has a & x

  • Computing Count (R ∩ S):

    • R & S do not learn anything except that the result is 2.
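
One known way to compute an intersection size without revealing the non-shared values is commutative encryption: each party encrypts its own hashed values, the other party encrypts them again, and only doubly encrypted values are compared. The sketch below is a toy illustration of that idea and makes no security claim: it runs both parties in one process, uses modular exponentiation over the Mersenne prime 2^127 − 1 purely for convenience, and omits the shuffling and careful group selection a real protocol needs. The shared values u and v are assumed for the example; the slide names only a, x, b, and y.

```python
import hashlib
import math
import secrets

P = 2**127 - 1   # a prime modulus chosen for convenience (toy parameter)

def hash_to_group(value):
    """Map a string to an integer in [1, P-1]."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def new_key():
    """Pick a secret exponent coprime to P-1, so exponentiation is a bijection."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if math.gcd(e, P - 1) == 1:
            return e

def encrypt(x, key):
    """Commutative: encrypt(encrypt(x, a), b) == encrypt(encrypt(x, b), a)."""
    return pow(x, key, P)

def intersection_size(r_values, s_values):
    r_key, s_key = new_key(), new_key()           # each key stays with its owner
    r_once = [encrypt(hash_to_group(v), r_key) for v in set(r_values)]
    s_once = [encrypt(hash_to_group(v), s_key) for v in set(s_values)]
    r_twice = {encrypt(c, s_key) for c in r_once}  # S re-encrypts R's ciphertexts
    s_twice = {encrypt(c, r_key) for c in s_once}  # R re-encrypts S's ciphertexts
    return len(r_twice & s_twice)

count = intersection_size(["a", "u", "v", "x"], ["b", "u", "v", "y"])
```

Because both encryption orders give the same value for equal inputs, matching doubly encrypted values identify the shared items without exposing a, x, b, or y in the clear.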


Closing Thoughts

  • The right to privacy: the most cherished of human freedoms

    -- Warren & Brandeis, 1890

  • Code is law … it is all a matter of code: the software and hardware that now rule

    -- L. Lessig

  • We can architect computing systems to protect values we believe are fundamental, or we can architect them to allow those values to disappear.

  • What do we want to do as computer scientists?


References

  • R. Agrawal, R. Srikant. Information Integration Across Autonomous Enterprises. ACM Int’l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.

  • R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.

  • R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.

  • R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.

  • R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.

  • R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.

  • A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002.

  • R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int’l Conf. on Management of Data (SIGMOD), Dallas, Texas, May 2000.


New Challenges

  • General

    • Language

    • Efficiency

  • Use

    • Limited Collection

    • Limited Disclosure

    • Limited Retention

  • Security and Openness

    • Safety

    • Openness

    • Compliance


Language

  • Need a language for privacy policies & user preferences.

  • P3P can be used as starting point.

    • Developed primarily for web shopping.

    • What about richer domains?

  • How do we balance expressibility and usability?

  • Arrange concepts in hierarchy or subsumption relationship.

    • Purpose: [diagram: a hierarchy rooted at contact, with email, phone, home, and work below it]

    • P3P recipients: Ours, Same, Delivery, Unrelated, Public


Language (2)

  • How do we accommodate user negotiation models?

    • User willing to disclose information only if fairly compensated.

    • Value of privacy as coalitional game [KPR2001]


Efficiency

  • How do we minimize the cost of privacy checking?

  • How do we incorporate purpose into database design and query optimization?

  • Tradeoffs between space & running time.

    • Only tag records in customer table with purpose, not all records. But now need to do a join when scanning records in order table.

  • How does the secure-database work on decomposing multilevel relations into single-level relations [JS91] apply here?


Limited Collection

  • How do we identify attributes that are collected but not used?

    • Assets are only needed for mortgage when salary is below some threshold.

  • What’s the needed granularity for numeric attributes?

    • Queries only ask “Salary > threshold” for rent application.

  • How do we generate minimal queries?

    • Redundancy may be hidden in application code.


Limited Disclosure

  • Can the user dynamically determine the set of recipients?

  • Example: Alice wants to add EasyCredit to set of recipients in EquiRate’s database.

  • Digital signatures.


Limited Retention

  • Completely forgetting some information is non-trivial.

  • How do we delete a record from the logs and checkpoints, without affecting recovery?

  • How do we continue to support historical analysis and statistical queries without incurring privacy breaches?


Safety

  • Encryption provides additional layer of security.

  • How do we index encrypted data?

  • How do we run queries against encrypted data?

  • [SWP00], [HILM02]


Openness

  • A donor shall be able to access all information about the donor stored in the database.

  • How does the database check Alice is really Alice and not somebody else?

    • Princeton admissions office broke into Yale’s admissions site using applicants’ Social Security numbers and birth dates.

  • How does Alice find out what databases have information about her?

    • Symmetrically private information retrieval [GIKM98].


Compliance

  • Universal Logging

    • Can we provide each user whose data is accessed with a log of that access, along with the query reading the data?

    • Use intermediaries who aggregate and analyze logs for many users.

  • Tracking Privacy Breaches

    • Insert “fingerprint” records with emails, telephone numbers, and credit card numbers.

    • Some data may be more valuable for spammers or credit card theft. How do we identify categories to do stratified fingerprinting rather than randomly inserting records?
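
The fingerprinting idea can be illustrated with email addresses alone. The sketch below assumes fingerprints are random addresses under a domain the database owner controls, so any message arriving at one of them indicates a leak; the naming scheme and helper functions are hypothetical, and it inserts fingerprints at random rather than by stratified category.

```python
import secrets

def make_fingerprints(count, domain="example.com"):
    """Generate unguessable decoy addresses to plant among real records."""
    return {f"fp-{secrets.token_hex(8)}@{domain}" for _ in range(count)}

def leaked_fingerprints(observed_addresses, fingerprints):
    """Fingerprint addresses seen in the wild indicate the data escaped."""
    return fingerprints & set(observed_addresses)

fingerprints = make_fingerprints(3)
# Hypothetical spam run that reached one planted address.
spam_targets = ["someone@example.org", next(iter(fingerprints))]
hits = leaked_fingerprints(spam_targets, fingerprints)
```

Planting distinct fingerprints per recipient would also indicate which copy of the data leaked.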