
Privacy and Data Mining: Friends or Foes?

Rakesh Agrawal

IBM Almaden Research Center

Theme

DILEMMA

  • Applications abound where data mining can do enormous good, but is vulnerable to misuse in misguided hands

GOAL

  • Understand the concerns with data mining and identify research directions that may address those concerns

QUESTIONS

  • Perceived concerns with data mining
  • How real are those concerns
  • What the data mining community is doing to address the concerns
  • What more needs to be done
Panelists
  • James Dempsey, Center for Democracy & Technology
  • Daniel Gallington, Potomac Institute
  • Lawrence Cox, National Center for Health Statistics
  • Bhavani Thuraisingham, National Science Foundation
  • Latanya Sweeney, Carnegie Mellon University
  • Christopher Clifton, Purdue University
  • Jeff Ullman, Stanford University
Plan
  • Position statements -- 6 minutes each
  • Rejoinders -- 2 minutes each
  • Questions and observations from the floor
  • Closing statements -- 1 minute each

The Potomac Institute for Policy Studies

Privacy and Data Mining

KDD 2003

August 25, 2003

Daniel J. Gallington


New Information Technology and Privacy: Status of the Debate

  • Demonization of Science
  • Technology development vs. policy/legal “envelope”
  • Rules vs. Process
  • Enablement vs. Disablement
  • Secrecy
  • When the dust settles, what could work?

Data Mining and Privacy: Friends or Foes?

Dr. Bhavani Thuraisingham

The National Science Foundation

August 2003

Definitions
  • Data Mining
    • Data mining is the process of analyzing large amounts of data, using techniques from statistical reasoning and machine learning, to discover information that was often previously unknown
  • Data fusion
    • The process of associating records from two (or more) databases, e.g., Medical Records and Grocery Store purchases
  • Privacy Problem
    • A user U poses queries and, from the responses U is authorized to see, deduces information about an individual or a group of individuals G that is deemed private by G or by some authority and that U is not authorized to see
Some Data Mining Applications
  • Medical and Healthcare
    • Mining genetic and medical databases and finding links between genetic composition and diseases
  • Security
    • Analyzing travel records, spending patterns, and associations between people to identify potential terrorists
    • Examining audit data to detect unauthorized network intrusions
    • Mining credit card transactions, telephone calls, and other related data to detect fraud and identity theft
  • Marketing, Sales, and Finance
    • Understanding preferences of groups of consumers
Some Privacy Concerns
  • Medical and Healthcare
    • Employers, marketers, or others knowing of private medical concerns
  • Security
    • Allowing access to an individual’s travel and spending data
    • Allowing access to web surfing behavior
  • Marketing, Sales, and Finance
    • Allowing access to an individual’s purchases
Data Mining as a Threat to Privacy
  • Data mining gives us “facts” that are not obvious to human analysts of the data
  • Can general trends across individuals be determined without revealing information about individuals?
  • Possible threats:
    • Combine collections of data and infer information that is private
      • Disease information from prescription data
      • Military action from pizza deliveries to the Pentagon
  • Need to protect the associations and correlations between the data that are sensitive or private
Some Privacy Problems and Potential Solutions
  • Problem: Privacy violations that result from data mining
    • Potential solution: Privacy-preserving data mining
  • Problem: Privacy violations that result from the inference problem
    • Inference is the process of deducing sensitive information from the legitimate responses received to user queries
    • Potential solution: Privacy constraint processing
  • Problem: Privacy violations due to unencrypted data
    • Potential solution: Encryption at different levels
  • Problem: Privacy violations due to poor system design
    • Potential solution: Develop a methodology for designing privacy-enhanced systems
Some Research Directions: Privacy-Preserving Data Mining
  • Prevent useful results from mining
    • Introduce “cover stories” to give “false” results
    • Only make a sample of data available so that an adversary is unable to come up with useful rules and predictive functions
  • Randomization
    • Introduce random values into the data and/or results
    • Challenge is to introduce random values without significantly affecting the data mining results
    • Give range of values for results instead of exact values
  • Secure Multi-party Computation
    • Each party knows only its own inputs; encryption techniques are used to compute the final results (a minimal secure-sum sketch follows this list)
    • Rules, predictive functions
  • Approach: Only make a sample of data available
    • Limits the ability to learn a good classifier
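To make the secure multi-party computation bullet concrete, here is a minimal sketch of the classic secure-sum idea in Python. The protocol shape (a masked running total passed among honest-but-curious parties) is standard; all names and values here are illustrative, not from the talk.

```python
import random

def secure_sum(private_inputs, modulus=10**9):
    """Toy secure-sum: the initiating party masks its input with a random
    value, each party adds its own input to the running total as it is
    passed along, and the initiator removes the mask at the end. No party
    sees another party's individual input, only masked partial sums."""
    mask = random.randrange(modulus)
    running = (mask + private_inputs[0]) % modulus
    for value in private_inputs[1:]:
        running = (running + value) % modulus  # next party adds its input
    return (running - mask) % modulus          # initiator removes the mask

# Three parties jointly compute a total without revealing their inputs.
print(secure_sum([12, 30, 7]))  # -> 49
```

Real protocols for rules and predictive functions are more involved, but they share this structure: local inputs stay local, and only blinded intermediate values travel between parties.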
Some Research Directions: Privacy Constraint Processing
  • Privacy constraints processing
    • Based on prior research in security constraint processing
    • Simple constraint: an attribute of a document is private
    • Content-based constraint: if a document contains information about X, then it is private
    • Association-based constraint: two or more documents taken together are private, even though each document individually is public
    • Release constraint: after X is released, Y becomes private
  • Augment a database system with a privacy controller for constraint processing
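As an illustration of the four constraint types above, here is a hypothetical sketch of such a privacy controller checking constraints before releasing a document. The constraint classes come from the slide; the data structures, names, and semantics are invented for illustration.

```python
# Hypothetical encoding of the four constraint types on the slide.
released = set()  # names of documents already released

def simple(attr):               # Simple: a given attribute is private
    return lambda doc: attr in doc

def content_based(term):        # Content-based: mentions of X are private
    return lambda doc: term in doc.get("text", "")

def association_based(other):   # Association-based: private once `other` is out
    return lambda doc: other in released

def release(trigger):           # Release: Y becomes private after X is released
    return lambda doc: trigger in released

constraints = {
    "doc_Y": [release("doc_X"), content_based("diagnosis")],
    "doc_B": [association_based("doc_A")],
    "doc_Z": [simple("ssn")],
}

def can_release(name, doc):
    """The 'privacy controller': suppress a document if any constraint fires."""
    if any(check(doc) for check in constraints.get(name, [])):
        return False
    released.add(name)
    return True

print(can_release("doc_Z", {"ssn": "123-45-6789"}))     # False: simple
print(can_release("doc_A", {}))                          # True: unconstrained
print(can_release("doc_B", {}))                          # False: association with doc_A
print(can_release("doc_X", {}))                          # True
print(can_release("doc_Y", {"text": "routine note"}))    # False: X was released
```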
Some Research Directions: Encryption for Privacy
  • Encryption at various levels
    • Encrypting the data as well as the results of data mining
    • Encryption for multi-party computation
  • Encryption for untrusted third party publishing
    • Owner enforces privacy policies
    • Publisher gives the user only those portions of the document he/she is authorized to access
    • A combination of digital signatures and Merkle hashing is used to ensure privacy (a minimal sketch follows)
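A minimal sketch of the Merkle-hash side of this idea, assuming a simple binary hash tree: the owner signs only the root, so an untrusted publisher can prove that the authorized portions it serves are authentic without the owner re-signing each release. The tree shape and block names are illustrative, not from the talk.

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Pairwise-hash leaf blocks up to a single root, duplicating the
    last node when a level has an odd number of entries."""
    level = [sha(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# The owner signs only the root. The publisher can later hand a user just
# the authorized blocks plus the sibling hashes along their paths; the user
# recomputes the root and checks it against the owner's signature, so the
# publisher can neither forge content nor be forced to reveal the rest.
document_blocks = [b"public summary", b"billing", b"diagnosis", b"notes"]
print(merkle_root(document_blocks).hex())
```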
Some Research Directions: Methodology for Designing Privacy Systems
  • Jointly develop privacy policies with policy specialists
  • Specification language for privacy policies
  • Generate privacy constraints from the policy and check for consistency of constraints
  • Develop a privacy model
  • Privacy architecture that identifies privacy critical components
  • Design and develop privacy enforcement algorithms
  • Verification and validation
Data Mining and Privacy: Friends or Foes?
  • They are neither friends nor foes
  • Need advances in both data mining and privacy
  • Need to design flexible systems
    • For some applications one may have to focus entirely on “pure” data mining while for some others there may be a need for “privacy-preserving” data mining
    • Need flexible data mining techniques that can adapt to the changing environments
  • Technologists, legal specialists, social scientists, policy makers and privacy advocates MUST work together
Some NSF Projects addressing Privacy
  • Privacy-preserving data mining
    • Distributed data mining techniques to replicate or approximate the results of centralized data mining, with quantifiable limits on the disclosure of each party’s data
  • Privacy for Supply Chain Management
    • Secure Supply-Chain Collaboration protocols to enable supply-chain partners to cooperatively achieve desired system-wide goals without revealing any private information, even though the jointly-computed decisions may depend on the private information of all the parties
  • Privacy Model
    • Model for privacy based on secure query protocol, encryption and database organization with little trust on the client or server
Other Ideas and Directions?
  • Please contact
    • Dr. Bhavani Thuraisingham
      The National Science Foundation
      Suite 1115, 4201 Wilson Blvd
      Arlington, VA 22230
      Phone: 703-292-8930, Fax: 703-292-9037
      Email: bthurais@nsf.gov

Technologies for Privacy

Latanya Sweeney, Ph.D.
Assistant Professor of Computer Science, Technology and Policy
School of Computer Science
Carnegie Mellon University
latanya @ privacy.cs.cmu.edu
http://privacy.cs.cmu.edu/people/sweeney/index.html


Address 4 Questions
  • Concerns with data mining
  • How real are those concerns
  • What the data mining community is doing to address those concerns
  • What more needs to be done

L. Sweeney. Navigating Computer Science Research Through Waves of Privacy Concerns. 2003. http://privacy.cs.cmu.edu/index.html

Address 4 Questions
  • Concerns with data mining: demand for person-specific data
  • How real are those concerns: explosion in collected information; the individual bears the risks and harms
  • What the data mining community is doing: privacy-preserving data mining, which is too limited
  • What more needs to be done: construct privacy technology with provable guarantees of privacy protection
Privacy Technology Center Core People

Anastassia Ailamaki, Chris Atkeson, Guy Blelloch, Manuel Blum, Jamie Callan, Jamie Carbonell, Kathleen Carley, Robert Collins, Lorrie Cranor, Samuel Edoho-Eket, Maxine Eskenazi, Scott Fahlman, David Farber, David Garlan, Ralph Gross, Alex Hauptmann, Takeo Kanade, Bradley Malin, Bruce Maggs, Tom Mitchell, Norman Sadeh, William Scherlis, Jeff Schneider, Henry Schneiderman, Michael Shamos, Mel Siegel, Daniel Siewiorek, Asim Smailagic, Peter Steenkiste, Scott Stevens, Latanya Sweeney, Katia Sycara, Robert Thibedeau, Howard Wactlar, Alex Waibel

Emerging Technologies with Privacy Concerns

1. Face recognition, Biometrics (DNA, fingerprints, iris, gait)

2. Video Surveillance, Ubiquitous Networks (Sensors)

3. Semantic Web, “Data Mining,” Bio-Terrorism Surveillance

4. Professional Assistants (email and scheduling), Lifelog recording

5. E911 Cell Phones, IR Tags, GPS

6. Personal Robots, Intelligent Spaces, CareMedia

7. Peer to peer Sharing, Spam Blockers, Instant Messaging

8. Tutoring Systems, Classroom Recording, Cheating Detectors

9. DNA sequences, Genomic data, Pharmaco-genomics

Ubiquitous Data Sharing: Benefits and Concerns

3. Semantic Web, “Data Mining,” Bio-Terrorism Surveillance

Benefits:

  • Counter-terrorism surveillance may improve safety.
  • Bio-terrorism surveillance can save lives by early detection of a biological agent and of naturally occurring outbreaks.
  • The Semantic Web enables more powerful computer uses.

Privacy concerns:

  • Erosion of civil liberties
  • Illegal search from law-enforcement “mining” cases
  • Patient privacy may render healthcare less effective
  • Access to uncontrolled and unprecedented amounts of data
  • Collected data can be used for other government purposes
1. Concerns with Data Mining

A. Video, wiretapping and surveillance

B. Civil liberties, illegal search

C. Medical privacy

D. Employment, workplace privacy

E. Educational records privacy

F. Copyright law

“data mining” ubiquitous data sharing, increased demand for person-specific data to realize potential benefits from algorithms

Definition. Privacy

Privacy reflects the ability of a person, organization, government, or entity to control its own space, where the concept of space (or “privacy space”) takes on different contexts:

  • Physical space, against invasion
  • Bodily space, medical consent
  • Computer space, spam
  • Web browsing space, Internet privacy
Definition. Data Privacy

When privacy space refers to the fragments of data one leaves behind as a person moves through daily life, the notion of privacy is called data privacy.

  • No control or ownership
  • Historically dictated by policy and laws
  • Today’s technically empowered society overtaxes this past approach
Address 4 Questions
  • Concerns with data mining
  • How real are those concerns
  • What the data mining community is doing to address those concerns
  • What more needs to be done

L. Sweeney. Navigating Computer Science Research Through Waves of Privacy Concerns. 2003. http://privacy.cs.cmu.edu/index.html


Exponential Growth in Data Collected

[Chart: growth in active web servers and in available disk storage, 1991–2001; 1993 marks the first WWW conference.]


Linking to Re-identify Data

[Diagram: a voter list (name, address, date registered, party affiliation, date last voted) and medical data (visit date, diagnosis, procedure, medication, total charge) share ZIP, birth date, and sex, so the two sources can be linked to re-identify patients.]

L. Sweeney. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics. 1997, 25:98-110.
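A hedged sketch of the linkage in this figure, using two tiny invented tables: the “de-identified” medical rows re-acquire names by joining with the public voter list on the shared quasi-identifiers (ZIP, birth date, sex).

```python
import pandas as pd

# Hypothetical micro-examples of the two sources in the figure.
voters = pd.DataFrame({
    "name":       ["Ann Smith", "Bob Jones"],
    "zip":        ["37213", "37214"],
    "birth_date": ["1960-09-01", "1971-02-12"],
    "sex":        ["F", "M"],
    "party":      ["D", "R"],
})
medical = pd.DataFrame({
    "zip":        ["37213", "37214"],
    "birth_date": ["1960-09-01", "1971-02-12"],
    "sex":        ["F", "M"],
    "diagnosis":  ["hypertension", "diabetes"],
})

# The de-identified medical rows re-acquire names via the public voter list.
linked = medical.merge(voters, on=["zip", "birth_date", "sex"])
print(linked[["name", "diagnosis"]])
```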

Address 4 Questions
  • Concerns with data mining
  • How real are those concerns
  • What the data mining community is doing to address those concerns
  • What more needs to be done

L. Sweeney. Navigating Computer Science Research Through Waves of Privacy Concerns. 2003. http://privacy.cs.cmu.edu/index.html


What More Needs to Be Done

Our approach: the Privacy Technology Center proactively constructs privacy technology with provable guarantees of privacy protection while allowing society to collect and share person-specific information for many worthy purposes.

Some Privacy Technology Solutions

  • Face de-identification
  • Self-controlling data
  • Video abstraction
  • CertBox (“privacy appliance”)
  • Reasonable cause (“selective revelation”)
  • Distributed surveillance
  • Privacy and context awareness (“eWallet”)
  • Data valuation by simulation
  • Roster collocation networks
  • Video and sound opt-out
  • Text anonymizer
  • Privacy agent
  • Blocking devices
  • Point location query restriction

k-Same Face De-identification

Privacy Compliance: No matter how good face recognition software may become, it will not be able to reliably re-identify k-Same’d faces.

Warranty: The resulting data remain useful for identifying suspicious behavior and identifying basic characteristics.

E. Newton, L. Sweeney, and B. Malin. Preserving Privacy by De-identifying Facial Images. Carnegie Mellon University, School of Computer Science, Technical Report CMU-CS-03-119. Pittsburgh: 2003. http://privacy.cs.cmu.edu/people/sweeney/video.html
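A toy sketch of the k-Same idea, assuming faces are already aligned feature vectors: each output face is the average of a cluster of k originals, so a recognizer can do no better than matching the surrogate to any of k people (the 1/k bound on a later slide). The published algorithm differs in details (pixel vs. eigen variants, remainder handling); everything here is illustrative.

```python
import numpy as np

def k_same(faces: np.ndarray, k: int) -> np.ndarray:
    """Toy k-Same: greedily cluster each unprocessed face with its k-1
    nearest unprocessed neighbors and replace all of them with the
    cluster average, so every released face stands in for k originals."""
    faces = faces.astype(float)
    out = np.empty_like(faces)
    remaining = list(range(len(faces)))
    while remaining:
        probe = remaining[0]
        # distances from the probe to all not-yet-processed faces
        d = np.linalg.norm(faces[remaining] - faces[probe], axis=1)
        take = min(k, len(remaining))
        cluster = [remaining[i] for i in np.argsort(d)[:take]]
        out[cluster] = faces[cluster].mean(axis=0)  # one surrogate for all k
        remaining = [i for i in remaining if i not in cluster]
    return out

rng = np.random.default_rng(0)
gallery = rng.random((10, 64))       # 10 hypothetical face vectors
print(k_same(gallery, k=5).shape)    # de-identified gallery, same shape
```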


Example of k-Same Faces for Varying k

[Figure: pixel and eigen variants of k-Same’d faces for k = 2, 3, 5, 10, 50, 100.]


Performance of k-Same Algorithm for varying values of k

Upper bound on recognition performance = 1/k

Some Attempts that Don’t Work!

[Figure: ad hoc masking attempts that fail, including a single bar mask, T-mask, black blob, mouth-only masking, thresholding, pixelation, negatives, and random noise (in grayscale, black & white, and ordinal-data variants), and a “Mr. Potato Head” replacement.]

Legal Flow of Medical Data for Surveillance

[Diagram: hospitals, labs, and physician offices release data explicitly identified by name to public health under public health law, and scientifically de-identified data, treated under HIPAA as “no” risk, to surveillance systems.]

De-identified Data through a “Privacy Wall” Generated in Real-Time by a “CertBox”

[Diagram: data explicitly identified by name, etc. passes through a CertBox and reaches public health scientifically de-identified.]

Data are de-identified automatically by a tamper-resistant system specific to the data and the task, called a “CertBox.”

Risk of Re-identification

[Diagram: a record “9/1960 F 37213” in the datastream sent to public health is linked back to “Ann,” born 9/1960.]

A re-identification results when a record in a sample from the Bio-Surveillance Datastream can reasonably be related to the patient who is the subject of the record in such a way that direct and rather specific communication with the patient is possible.

Measuring Identifiability

[Illustration: in a population of six people, only one person is green with a given head shape, so that released record has a binsize of 1; two people are gray with another head shape, giving a binsize of 2.]

Identifiability estimates, in groupings of graduated size, the number of people to whom a released record is apt to refer. These groupings are called binsizes (a minimal sketch follows).
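A minimal sketch of the binsize computation on a toy population matching the illustration above: count how many people share each released combination of characteristics.

```python
from collections import Counter

# Hypothetical population records: (color, head_shape) per person.
population = {
    "Hal": ("green", "round"), "Jim": ("gray", "square"),
    "Gil": ("gray", "square"), "Ken": ("blue", "round"),
    "Len": ("blue", "round"),  "Mel": ("blue", "square"),
}

counts = Counter(population.values())

def binsize(record):
    """How many people in the population a released record could refer to."""
    return counts[record]

print(binsize(("green", "round")))  # 1 -> uniquely identifying
print(binsize(("gray", "square")))  # 2 -> hides among two people
```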

Risk Assessment Server

[Diagram: a sample from the bio-surveillance datastream feeds an assessment engine, which draws on population models, computational models, and a profile of databases to produce inferences.]

The Risk Assessment Server identifies which fields and/or records in the Bio-surveillance Datastream are vulnerable to known re-identification inference strategies. The output of the assessment server is a report on the identifiability of the Bio-surveillance Datastream (not just the sample) with respect to those inference strategies.

The Risk Assessment Server is licensed to Computer Information Technology Corp. (CIT). Diagram is courtesy of CIT. All rights reserved.

CertBox Contains PrivaCert™

[Diagram: raw data passes through PrivaCert™, a rule-based system custom to the data and assessment, and emerges scientifically de-identified.]

Reasonable Cause (“Selective Revelation”)

As the detection status escalates, progressively more identifiable data is revealed (Datafly identifiability and detection status each scale 0..1):

  • Normal operation: sufficiently anonymous (gross overview)
  • Unusual activity: sufficiently de-identified
  • Suspicious activity: identifiable
  • Outbreak suspected: readily identifiable
  • Outbreak detected: explicitly identified
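A hypothetical encoding of this sliding scale as a lookup, just to make the pairing explicit; the level names come from the slide, the code structure is invented.

```python
# Each detection status maps to the most identifiable data tier it unlocks.
REVELATION_LEVELS = {
    "normal operation":    "sufficiently anonymous (gross overview)",
    "unusual activity":    "sufficiently de-identified",
    "suspicious activity": "identifiable",
    "outbreak suspected":  "readily identifiable",
    "outbreak detected":   "explicitly identified",
}

def data_release_for(status: str) -> str:
    try:
        return REVELATION_LEVELS[status]
    except KeyError:
        raise ValueError(f"unknown detection status: {status}")

print(data_release_for("unusual activity"))  # sufficiently de-identified
```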

Address 4 Questions
  • Concerns with data mining: demand for person-specific data
  • How real are those concerns: explosion in collected information; the individual bears the risks and harms
  • What the data mining community is doing: privacy-preserving data mining, which is too limited
  • What more needs to be done: construct privacy technology with provable guarantees of privacy protection
Perceived Concerns
  • Data mining lets you find out about my private life
    • I don’t want (you, my insurance company, the government) knowing everything
  • Data mining doesn’t always get it right
    • I don’t want to be put in jail because data mining said so
    • I don’t want to be denied (credit, a job, insurance) because data mining said so

Perceived Concerns: Real
  • Data mining lets you find out about my private life
    • Learned models allow conjectures
    • Learning the model requires collecting data
  • Data mining doesn’t always get it right
    • Our legal system is supposed to ensure due process
    • Data mining typically allows businesses to take risks they otherwise wouldn’t

Perceived Concerns and Solutions

  • Data mining lets you find out about my private life
    • Privacy-preserving data mining
  • Data mining doesn’t always get it right
    • We know it
      • Educate the user
    • We’re working on it
Privacy-Preserving Data Mining: Data Perturbation
  • Construct a data set with noise added
    • Can be released without revealing private data
  • Miners given the perturbed data set
    • Reconstruct distribution to improve results
  • Solutions out there
    • Decision trees, association rules
  • Debate: Does it really preserve privacy?
    • Can we prove impossibility of noise removal?
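A small numeric sketch of value-distortion perturbation, under the usual assumption that the noise distribution (but not the individual noise values) is public: the miner sees only perturbed values, yet aggregate statistics remain estimable. Reconstructing the full distribution, which the decision-tree and association-rule results above require, is the harder step being debated; all numbers here are invented.

```python
import numpy as np

rng = np.random.default_rng(42)

ages = rng.normal(38, 10, 50_000)      # true values (never disclosed)
noise_sd = 15.0
perturbed = ages + rng.normal(0, noise_sd, ages.size)  # what the miner sees

# Aggregates are recoverable because the noise distribution is public:
est_mean = perturbed.mean()                              # noise has zero mean
est_var = max(perturbed.var() - noise_sd**2, 0.0)        # variances add; clamp
print(f"estimated mean {est_mean:.1f}, sd {est_var**0.5:.1f}")
# -> close to the true mean 38 and sd 10, without seeing any true age
```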
Privacy-Preserving Data Mining: Distributed Data Mining
  • Data owners keep their data
    • Collaborate to get data mining results
  • Encryption techniques to preserve privacy
    • Proofs that private data is not disclosed
  • Solutions for Decision Trees, Association Rules, Clustering
    • Different solutions are needed depending on how the data is distributed and on the privacy constraints
What Next?
  • Data mining lets you find out about my private life
    • Constraints that allow us to restrict what models can be learned
  • Data mining doesn’t always get it right
    • Educate the public
      • What data mining does (and doesn’t do)
    • And of course, more research

Some Thoughts About Privacy

Jeffrey D. Ullman

KDD, Aug. 25, 2003

Our Treatment of Privacy is Pretty Weird
  • We allow spammers and cold-callers to intrude without mercy.
  • Yet Amazon wouldn’t tell me the status of my son’s order.
  • And Congress killed the only system that has a hope of protecting us against mass murder by terrorists.
TIA: City Walls of Today
  • 5000 years ago, stone walls protected advanced civilizations from marauders.
  • I doubt the first attempts were perfect (did they forget doors?), and there was a downside, e.g., restricted movement.
  • Likewise, TIA may be the only way to keep terrorists at bay.
What The “Antis” Forget
  • There is a great difference between an inanimate machine knowing your secrets and a person knowing the same.
  • Political solutions can control how and why information goes from the machine to trusted analysts who can act on the knowledge.
Analogy
  • From 200 years of tradition, it has become safe to put M16s in the hands of soldiers who do not use them to rob liquor stores.
  • Likewise, we need a cadre of trusted analysts whose job is to protect, not to intrude on the innocent.
Technology Thoughts
  • TIA is not about machine learning --- we don’t have positive examples.
  • TIA is an advanced form of data-mining, where long connections are sought in massive data.
    • e.g., multiple connections between “Al Qaida” and “flight schools.”
Technology Thoughts --- (2)
  • Possible boost: “Locality-Sensitive Hashing” (Gionis, Indyk, & Motwani).
  • A powerful technique for focusing on low-frequency, high-correlation events.
  • Needs generalization to graphs that represent various forms of connection.
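A hedged sketch of the minhash flavor of locality-sensitive hashing: entities whose (possibly low-frequency) connection sets overlap heavily receive near-identical signatures, so banding the signatures would bucket them together for closer inspection. The Gionis–Indyk–Motwani scheme itself targets high-dimensional nearest-neighbor search; this toy version and all of its data are illustrative.

```python
import random

def make_hashers(n, seed=0):
    """n independent salted hash functions over hashable items."""
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(32) for _ in range(n)]
    return [lambda x, s=s: hash((s, x)) for s in salts]

def minhash_signature(items, hashers):
    """Signature positions agree with probability equal to the Jaccard
    similarity of the underlying sets."""
    return tuple(min(h(x) for x in items) for h in hashers)

hashers = make_hashers(20)

# Two entities with highly overlapping connection sets, plus an unrelated one.
a = {"flight school", "wire transfer", "visa overstay"}
b = {"flight school", "wire transfer", "visa overstay", "rental"}
c = {"grocery", "pharmacy", "gym"}

sig = {name: minhash_signature(s, hashers)
       for name, s in [("a", a), ("b", b), ("c", c)]}

def agreement(x, y):
    return sum(i == j for i, j in zip(sig[x], sig[y]))

print(agreement("a", "b"), agreement("a", "c"))
# a and b agree on most of the 20 positions; a and c on almost none.
```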