Differential Privacy on Linked Data: Theory and Implementation

Yotam Aron

Presentation Transcript

Table of Contents

  • Introduction

  • Differential Privacy for Linked Data

  • SPIM implementation

  • Evaluation


Contributions

  • Theory: how to apply differential privacy to linked data.

  • Implementation: privacy module for SPARQL queries.

  • Experimental evaluation: differential privacy on linked data.


Introduction


Overview: Privacy Risk

  • Statistical data releases can leak private information.

  • Mosaic Theory: data sources that are harmless separately can be harmful when combined.

  • Examples:

    • Netflix Prize Data set

    • GIC Medical Data set

    • AOL Data logs

  • Linked data adds ontologies and metadata, making it even more vulnerable.


Current Solutions

  • Accountability:

    • Privacy Ontologies

    • Privacy Policies and Laws

  • Problems:

    • Requires agreement among parties.

    • Does not actually prevent breaches; it is only a deterrent.


Current Solutions (Cont’d)

  • Anonymization

    • Delete “private” data

    • k-anonymity (strong privacy guarantee)

  • Problems

    • Deletion provides no strong guarantees

    • Must be carried out for every data set

    • What data should be anonymized?

    • High computational cost (optimal k-anonymity is NP-hard)


Differential Privacy

  • Definition for relational databases (from the PINQ paper):

    A randomized function K gives ε-differential privacy if for all data sets D1 and D2 differing on at most one record, and all S ⊆ Range(K),

    Pr[K(D1) ∈ S] ≤ e^ε · Pr[K(D2) ∈ S]


Differential Privacy (Cont’d)

  • What does this mean?

    • Adversaries get roughly the same results from D1 and D2, meaning a single individual’s data will not greatly affect the knowledge acquired from either data set.


How Is This Achieved?

  • Add noise to the result.

  • Simplest: add Laplace noise.


Laplace Noise Parameters

  • Mean = 0 (so no bias is added)

  • Variance = 2b², with noise scale b = Δf/ε, where the sensitivity Δf is defined, for a record j, as

    Δf = max_j |Q(D) − Q(D − j)|

  • Theorem: For a query Q with result R, the output R + Laplace(0, Δf/ε) is ε-differentially private.
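As a minimal sketch (not the thesis code), the Laplace mechanism is only a few lines of Python; numpy’s random.laplace takes exactly the mean and scale used in the theorem above:

    import numpy as np

    def laplace_mechanism(true_result, sensitivity, epsilon):
        # Draw noise from Laplace(0, b) with scale b = sensitivity / epsilon,
        # matching the theorem above, and add it to the true answer.
        scale = sensitivity / epsilon
        return true_result + np.random.laplace(loc=0.0, scale=scale)

    # Example: a COUNT query returned 42, and one person changes it by at most 1.
    noisy_count = laplace_mechanism(42, sensitivity=1, epsilon=0.1)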


Other Benefit of Laplace Noise

  • A set of queries, each satisfying εᵢ-differential privacy, will together satisfy (Σᵢ εᵢ)-differential privacy (sequential composition).

  • Implementation-wise, one can allocate an overall “budget” Ɛ for a client, and for each query the client specifies how much εᵢ to use.
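Enforcing the budget is straightforward bookkeeping. A minimal sketch, assuming simple sequential composition (the class and method names are hypothetical, not SPIM’s):

    class EpsilonBudget:
        # Track a client's remaining privacy budget under sequential composition.
        def __init__(self, total_epsilon):
            self.remaining = total_epsilon

        def spend(self, epsilon):
            # Reject the query rather than exceed the overall budget.
            if epsilon <= 0:
                raise ValueError("epsilon must be positive")
            if epsilon > self.remaining:
                raise RuntimeError("privacy budget exhausted")
            self.remaining -= epsilon

    budget = EpsilonBudget(total_epsilon=1.0)
    budget.spend(0.1)   # first query
    budget.spend(0.25)  # second query; 0.65 of the budget remains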


Benefits of Differential Privacy

  • Strong Privacy Guarantee

  • Mechanism-based, so the underlying data does not have to be modified.

  • Independent of data set’s structure.

  • Works well with statistical analysis algorithms.


Problems with Differential Privacy

  • Potentially poor performance

    • Complexity

    • Noise

  • Only works with statistical data (though this has fixes)

  • How can the sensitivity of an arbitrary query be calculated without brute force?



Differential Privacy and Linked Data

  • Want the same privacy guarantees for linked data, but linked data has no “records.”

  • What should be “unit of difference”?

    • One triple

    • All URIs related to person’s URI

    • All links going out from person’s URI



“Records” for Linked Data

  • Reduce links in graph to attributes

  • Idea:

    • Identify the individual contributions from a single person to the total answer.

    • Find the contribution that affects the answer most.


“Records” for Linked Data (Cont’d)

  • Reduce links in the graph to attributes, turning each node into a record.

[Diagram: P1 connected to P2 by a Knows link]


“Records” for Linked Data (Cont’d)

  • Repeated attributes and null values are allowed.

[Diagram: persons P1 through P4 connected by Knows and Loves links]


“Records” for Linked Data (Cont’d)

  • Repeated attributes and null values are allowed (not good RDBMS form, but it makes the definitions easier).


Query Sensitivity in Practice

  • Need to find the triples that “belong” to a person.

  • Idea:

    • Identify the individual contributions from a single person to the total answer.

    • Find the contribution that affects the answer most.

  • Done using sorting and limiting functions in SPARQL


Example

  • COUNT of places visited.

[Diagram: persons P1 and P2, with State of Residence MA and Visited links to states S1, S2, S3]

Answer: Sensitivity of 2


Using SPARQL

  • Query:

    SELECT (COUNT(?s) as ?num_places_visited) WHERE{

    ?p :visited ?s }


Using SPARQL (Cont’d)

  • Sensitivity Calculation Query (Ideally):

    SELECT ?p (COUNT(?s) as ?num_places_visited) WHERE{

    ?p :visited ?s ;

    foaf:name ?n }

    GROUP BY ?p ORDER BY DESC(?num_places_visited) LIMIT 1


In Reality…

  • LIMIT, ORDER BY, and GROUP BY do not work together in 4store…

  • For now: don’t use LIMIT; fetch the grouped results and pick the top answers manually.

    • I.e., simulate these keywords in Python (sketched below).

    • This will affect results, so better testing should be carried out in the future.

  • Ideally this would stay on the SPARQL side so that less data is transmitted (e.g. on large data sets).
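A sketch of that client-side simulation, assuming the plain query’s (?p, ?s) rows have already been fetched with whatever SPARQL client is available (the fetching helper is omitted as hypothetical):

    from collections import Counter

    def max_contribution(bindings):
        # Simulate GROUP BY ?p / ORDER BY DESC(COUNT) / LIMIT 1 in Python:
        # count rows per person and return the person contributing the most.
        counts = Counter(p for p, _s in bindings)
        return counts.most_common(1)[0]

    rows = [("P1", "S1"), ("P1", "S2"), ("P2", "S3")]
    print(max_contribution(rows))  # ('P1', 2): a sensitivity of 2 for COUNT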


(Side Rant) 4store Limitations

  • Many operations are not supported in unison.

  • E.g. cannot always FILTER and use ORDER BY together, for some reason.

  • Severely limits the types of queries that could be used for testing.

  • May be desirable to work with a more up-to-date triplestore (e.g. ARQ).

    • Did not switch: wanted to keep the code in Python, and all the 4store code was already written.


Problems with this Approach

  • Need to identify “people” in the graph.

    • Assume, for example, that a URI with a foaf:name is a person, and use its triples in the privacy calculations.

    • This imposes some constraints on the linked data format.

    • For future work, look at whether private data can be identified automatically, perhaps by using ontologies.

  • Complexity is tied to the speed of performing the query over a large data set.

  • Still not generalizable to all functions.


…and on the Plus Side

  • Model for sensitivity calculation can be expanded to arbitrary statistical functions.

    • e.g. dot products, distance functions, variance, etc.

  • Relatively simple to implement using SPARQL 1.1



SPARQL Privacy Insurance Module

  • i.e. SPIM

  • Use authentication, AIR, and differential privacy in one system.

    • Authentication to manage Ɛ-budgets.

    • AIR to control flow of information and non-statistical data.

    • Differential privacy for statistics.

  • Goal: Provide a module that can integrate into SPARQL 1.1 endpoints and provide privacy.


Design

[Architecture diagram: HTTP Server, OpenID Authentication, AIR Reasoner, Differential Privacy Module, and the SPIM Main Process, backed by a Triplestore, Privacy Policies, and User Data]


HTTP Server and Authentication

  • HTTP Server: Django server that handles HTTP requests.

  • OpenID Authentication: Django module.


SPIM Main Process

  • Controls the flow of information.

  • First checks the user’s budget, then applies AIR policies, then performs the final differentially private query.

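A sketch of this control flow (check_budget, air_check, and dp_query are hypothetical stand-ins for the authentication, AIR, and differential privacy components, not SPIM’s actual function names):

    def handle_query(user, query, epsilon):
        # 1. Does the user have enough epsilon budget left?
        if not check_budget(user, epsilon):
            raise PermissionError("epsilon budget exhausted")
        # 2. Does the AIR policy allow this query for this user?
        if not air_check(user, query):
            raise PermissionError("query violates privacy policy")
        # 3. Run the final differentially private query.
        return dp_query(query, epsilon)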


AIR Reasoner

  • Performs access control by translating SPARQL queries to N3 and checking them against policies.

  • Can potentially perform more complicated operations (e.g. check user credentials)



Differential Privacy Protocol

[Diagram: Client ↔ Differential Privacy Module ↔ SPARQL Endpoint]

Scenario: The client wishes to make a standard SPARQL 1.1 statistical query, and has an overall accuracy “budget” Ɛ for all queries.


Differential Privacy Protocol (Cont’d)

[Diagram: Client sends (Query, Ɛ > 0) toward the SPARQL Endpoint]

Step 1: The query and epsilon value are sent to the endpoint and intercepted by the enforcement module.


Differential Privacy Protocol (Cont’d)

[Diagram: Differential Privacy Module sends a Sens Query to the SPARQL Endpoint]

Step 2: The sensitivity of the query is calculated using a re-written, related query.


Differential Privacy Protocol (Cont’d)

[Diagram: Differential Privacy Module sends the actual Query to the SPARQL Endpoint]

Step 3: The actual query is sent.


Differential Privacy Protocol (Cont’d)

[Diagram: Result and Noise returned to the Client]

Step 4: The result, with Laplace noise added, is sent back to the client.
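Putting the four steps together, the module’s job can be summarized in a short sketch (rewrite_for_sensitivity and run_query are hypothetical helpers; this is not the thesis code):

    import numpy as np

    def private_query(client_query, epsilon, endpoint):
        sens_query = rewrite_for_sensitivity(client_query)   # step 2
        sensitivity = run_query(endpoint, sens_query)
        true_result = run_query(endpoint, client_query)      # step 3
        noise = np.random.laplace(0.0, sensitivity / epsilon)
        return true_result + noise                           # step 4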


Experimental Evaluation


Evaluation

  • Three things to evaluate:

    • Correctness of operation

    • Correctness of differential privacy

    • Runtime

  • Used an anonymized clinical database as the test data and added fake names, social security numbers, and addresses.


Correctness of Operation

  • Can the system do what we want?

    • Authentication provides access control

    • AIR restricts information and types of queries

    • Differential privacy gives strong privacy guarantees.

  • Can we do better?


Use Case Used in Thesis

  • Clinical database data protection

  • HIPAA: Federal protection of private information fields, such as name and social security number, for patients.

  • 3 users

    • Alice: works at the CDC, needs unhindered access.

    • Bob: a researcher who needs access to private fields (e.g. addresses).

    • Charlie: an amateur researcher to whom HIPAA should apply.

  • Assumptions:

    • Django is secure enough to handle “clever attacks”

    • Users do not collude, so individual epsilon values can be allocated.


Use Case Solution Overview

  • What should happen:

    • Dynamically apply different AIR policies at runtime.

    • Give different epsilon-budgets.

  • How allocated:

    • Alice: No AIR Policy, no noise.

    • Bob: Give access to addresses but hide all other private information fields.

      • Epsilon budget: E1

    • Charlie: Hide all private information fields in accordance with HIPAA

      • Epsilon budget: E2



Example: A Clinical Database

  • The client accesses the triplestore via the HTTP server.

  • OpenID Authentication verifies the user has access to the data and finds the user’s epsilon value.


Example: A Clinical Database (Cont’d)

  • AIR reasoner checks incoming queries for HIPAA violations.

  • Privacy policies contain HIPAA rules.



Example: A Clinical Database (Cont’d)

  • Differential Privacy applied to statistical queries.

  • Statistical result + noise returned to client.



Correctness of Differential Privacy

  • Need to test how much noise is added.

    • Too much noise = poor results.

    • Too little noise = no guarantee.

  • Test: run queries and compare the sensitivity calculated against the actual sensitivity.


How to Test Sensitivity?

  • Ideally:

    • Test that the noise calculation is correct.

    • Test that the noised data is still useful (e.g. by applying machine learning algorithms).

  • For this project, only the former was tested:

    • Machine learning APIs are not as prevalent for linked data.

    • What results would we compare to?


Test Suite

  • 10 queries for each operation (COUNT, SUM, AVG, MIN, MAX).

  • 10 different WHERE clauses.

  • Test:

    • Sensitivity calculated from the original query.

    • Remove each personal URI using the MINUS keyword and see which removal is most sensitive.


Example for Sensitivity Test

  • Query:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    PREFIX foaf: <http://xmlns.com/foaf/0.1#>

    PREFIX mimic: <http://air.csail.mit.edu/spim_ontologies/mimicOntology#>

    SELECT (SUM(?o) as ?aggr) WHERE{

    ?s foaf:name ?n.

    ?s mimic:event ?e.

    ?e mimic:m1 "Insulin".

    ?e mimic:v1 ?o.

    FILTER(isNumeric(?o))

    }


Example for Sensitivity Test (Cont’d)

  • Sensitivity query:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    PREFIX foaf: <http://xmlns.com/foaf/0.1#>

    PREFIX mimic: <http://air.csail.mit.edu/spim_ontologies/mimicOntology#>

    SELECT (SUM(?o) as ?aggr) WHERE{

    ?s foaf:name ?n.

    ?s mimic:event ?e.

    ?e mimic:m1 "Insulin".

    ?e mimic:v1 ?o.

    FILTER(isNumeric(?o))

    MINUS {?s foaf:name "%s"}

    } % (name)
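The "%s" / % (name) above is Python string substitution: the sensitivity query is instantiated once per personal URI. A sketch of the surrounding loop (run_query is a hypothetical helper that executes a SPARQL SELECT and returns the numeric aggregate):

    def empirical_sensitivity(names, base_result, sens_template, endpoint):
        # Re-run the aggregate with each person removed via MINUS and record
        # the largest change in the answer, i.e. the empirical sensitivity.
        worst = 0.0
        for name in names:
            result = run_query(endpoint, sens_template % name)
            worst = max(worst, abs(base_result - result))
        return worst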


Results: Query 6 Error

[Chart: error results for Query 6 (not reproduced)]


Runtime

  • Queries were also tested for runtime:

    • Bigger WHERE clauses.

    • More keywords.

    • The extra overhead of doing the sensitivity calculations.



Interpretation

  • Sensitivity calculation time is on par with query time.

    • Might not be good for big data.

    • Find ways to reduce sensitivity calculation time?

  • AVG does not do so well…

    • Approximation yields too much noise vs. trying all possibilities.

    • Runs ~4x slower than simple querying.

    • Solution 1: look at all the data manually (large data transfer).

    • Solution 2: can we use NOISY_SUM / NOISY_COUNT instead? (See the sketch below.)
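Solution 2 would compute the average as the ratio of two separately noised aggregates, splitting the ε budget between them; a standard construction, sketched here (not the thesis implementation):

    import numpy as np

    def noisy_avg(values, epsilon, max_value):
        # NOISY_SUM / NOISY_COUNT: each aggregate gets half the budget.
        # max_value bounds one record's contribution to the sum.
        eps_half = epsilon / 2.0
        noisy_sum = sum(values) + np.random.laplace(0.0, max_value / eps_half)
        noisy_count = len(values) + np.random.laplace(0.0, 1.0 / eps_half)
        return noisy_sum / max(noisy_count, 1.0)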


Conclusion


Contributions

  • Theory on how to apply differential privacy to linked data.

  • Overall privacy module for SPARQL queries.

    • Limited, but a good start.

  • Experimental implementation of differential privacy.

    • Verification that it is applied correctly.

  • Other:

    • Updated the SPARQL-to-N3 translation to SPARQL 1.1.

    • Expanded upon the IARPA project to create policies against statistical queries.


Shortcomings and Future Work

  • Triplestores need some structure for this to work:

    • Personal information must be explicitly defined in triples.

    • Is there a way to automatically detect which triples would constitute private information?

  • Complexity

    • Lots of noise for sparse data.

    • Can divide data into disjoint sets to reduce noise, as PINQ does.

    • Use localized sensitivity measures?

  • Third party software problems

    • Would this work better with a different triplestore implementation?


Diff. Privacy and an Open Web

  • How applicable is this to an open web?

    • High sample numbers, but potentially high data variance.

    • Sensitivity calculation might take too long, need to approximate.

  • Can use disjoint subsets of the web to increase the number of queries possible within ɛ budgets.


Demo

  • air.csail.mit.edu:8800/spim_module/


References

  • Differential Privacy Implementations:

    • “Privacy Integrated Queries (PINQ)” by Frank McSherry: http://research.microsoft.com/pubs/80218/sigmod115-mcsherry.pdf

    • “Airavat: Security and Privacy for MapReduce” by Roy, Indrajit; Setty, Srinath T. V. ; Kilzer, Ann; Shmatikov, Vitaly; and Witchel, Emmet: http://www.cs.utexas.edu/~shmat/shmat_nsdi10.pdf

    • “Towards Statistical Queries over Distributed Private User Data” by Chen, Ruichuan; Reznichenko, Alexey; Francis, Paul; Gehrke, Johannes: https://www.usenix.org/conference/nsdi12/towards-statistical-queries-over-distributed-private-user-data


References (Cont’d)

  • Theoretical Work

    • “Differential Privacy” by Cynthia Dwork: http://research.microsoft.com/pubs/64346/dwork.pdf

    • “Mechanism Design via Differential Privacy” by McSherry, Frank; and Talwar, Kunal: http://research.microsoft.com/pubs/65075/mdviadp.pdf

    • “Calibrating Noise to Sensitivity in Private Data Analysis” by Dwork, Cynthia; McSherry, Frank; Nissim, Kobbi; and Smith, Adam: http://people.csail.mit.edu/asmith/PS/sensitivity-tcc-final.pdf

    • “Differential Privacy for Clinical Trial Data: Preliminary Evaluations”, by Vu, Duy; and Slavković, Aleksandra: http://sites.stat.psu.edu/~sesa/Research/Papers/padm09sesaSep24.pdf


References (Cont’d)

  • Other

    • “Privacy Concerns of FOAF-Based Linked Data” by Nasirifard, Peyman; Hausenblas, Michael; and Decker, Stefan: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153.5772

    • “The Mosaic Theory, National Security, and the Freedom of Information Act”, by David E. Pozen: http://www.yalelawjournal.org/pdf/115-3/Pozen.pdf

    • “A Privacy Preference Ontology (PPO) for Linked Data”, by Sacco, Owen; and Passant, Alexandre: http://ceur-ws.org/Vol-813/ldow2011-paper01.pdf

    • “k-Anonymity: A Model for Protecting Privacy”, by Latanya Sweeney: http://arbor.ee.ntu.edu.tw/archive/ppdm/Anonymity/SweeneyKA02.pdf


References (Cont’d)

  • Other

    • “Approximation Algorithms for k-Anonymity”, by Aggarwal, Gagan; Feder, Tomás; Kenthapadi, Krishnaram; Motwani, Rajeev; Panigrahy, Rina; Thomas, Dilys; and Zhu, An: http://research.microsoft.com/pubs/77537/k-anonymity-jopt.pdf


Appendix: Results Q1–Q8

[Charts of results for queries Q1–Q8 (not reproduced)]
