Secure data outsourcing
PowerPoint Slideshow about 'Secure Data Outsourcing' - maegan


Presentation Transcript

Outline

  • Motivation

  • Background

  • Research issues

  • Summary


Motivation

  • Cost of maintaining/mining large data

    • 4-5 times the cost of data acquisition

    • DBAs are paid well

  • More and more data service providers

    • Low cost – cloud computing

      • Maintain one database for one user → multiple users

    • Examples:

      • Alentus.com

      • Datapipe.com

      • Discountasp.net

  • Concerns about data security and privacy

    • Untrusted service provider


Un-trusted service provider

  • Lazy: incentives to perform less

  • Curious: incentives to acquire information

  • Malicious:

    • Denial of service

    • Incorrect results

    • Possibly compromised


Challenges

  • Data confidentiality

    • Data need to be encrypted (?)

    • Utility of protected data?

      • Query utility

      • Mining utility

  • Access pattern privacy

  • Integrity

    • Data integrity

    • Query integrity

      • Correct

      • Complete

      • Fresh


Why is it hard for query services?

  • Arbitrary expressivity

    • SQL statements

    • Often restricted to certain types of query for simplicity (e.g., range query, kNN query)

  • Cost

    • Communication

    • Computation (server side vs client side)


Why is it hard for mining services?

  • Many data mining models

    • Different utilities to preserve

    • No one-size-fits-all solution


Data confidentiality

  • Bucketization method (crypto-index)

  • Order preserving encryption

  • Perturbations


Bucketization method

  • Hacigumus (SIGMOD02)



  • Main steps

    • Partition sensitive attributes

      • Order preserving: supports comparison

      • Random: query rewriting becomes hard

    • Build index on the partitions

    • Rewrite queries to target partitions

      • ‘john doe’ → 105

      • SELECT * FROM T’ WHERE name = 105

    • Execute queries and return results

    • Prune/post-process results on client
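The main steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scheme from the cited paper: the attribute name `salary`, the bucket boundaries, the bucket ids, and the `encrypt`/`decrypt` callables are all hypothetical, and the partition is order preserving so that range queries can be rewritten.

```python
import bisect

# Hypothetical bucket boundaries for a numeric attribute (an assumption for
# illustration); bucket i covers [BOUNDS[i], BOUNDS[i + 1]).
BOUNDS = [0, 25, 50, 75, 100]
# Secret mapping from bucket position to the opaque id stored on the server.
BUCKET_IDS = [105, 31, 77, 12]


def bucket_of(value):
    """Return the opaque bucket id for a plaintext value (client side)."""
    pos = bisect.bisect_right(BOUNDS, value) - 1
    if pos < 0 or pos >= len(BUCKET_IDS):
        raise ValueError("value outside the partitioned domain")
    return BUCKET_IDS[pos]


def outsource(rows, encrypt):
    """Build the server-side table: (encrypted row, bucket id)."""
    return [(encrypt(r), bucket_of(r["salary"])) for r in rows]


def rewrite_range(lo, hi):
    """Rewrite lo <= salary <= hi into the set of bucket ids to fetch."""
    first = max(bisect.bisect_right(BOUNDS, lo) - 1, 0)
    last = min(bisect.bisect_right(BOUNDS, hi) - 1, len(BUCKET_IDS) - 1)
    return {BUCKET_IDS[p] for p in range(first, last + 1)}


def server_execute(table, bucket_ids):
    """The server only compares bucket ids; it never sees plaintext."""
    return [etuple for etuple, bid in table if bid in bucket_ids]


def client_query(table, lo, hi, decrypt):
    """Run the rewritten query, then decrypt and prune on the client."""
    candidates = server_execute(table, rewrite_range(lo, hi))
    # Partially covered buckets return false positives that must be dropped.
    return [r for r in map(decrypt, candidates) if lo <= r["salary"] <= hi]
```

With these boundaries, a query for 30 <= salary <= 55 is rewritten to buckets {31, 77}; the server returns every row in those two buckets and the client discards rows such as salary 60.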




Order preserving encryption

  • Agrawal2004, Boldyreva2009

  • The set of data is securely transformed so that the order is preserved but the distribution and domain are changed

  • Benefits: indexing/searching on OPE encrypted data

  • Weakness: once the original distribution is known, OPE is broken
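The weakness can be illustrated with a toy example. The mapping below is a hypothetical stand-in (a secret strictly increasing table), not the Agrawal or Boldyreva construction, and the attack assumes the attacker knows the plaintext distribution well enough to hold a representative sample.

```python
import random


def make_toy_ope(domain_size, seed):
    """Toy order-preserving mapping: a secret strictly increasing table.

    Illustrative stand-in only; table[x] is the ciphertext of plaintext x.
    """
    rng = random.Random(seed)
    table, current = [], 0
    for _ in range(domain_size):
        current += rng.randint(1, 1000)  # random positive gaps keep the order
        table.append(current)
    return table


def rank_matching_attack(ciphertexts, known_plain_sample):
    """Guess plaintexts by aligning sorted ciphertexts with sorted known values."""
    sorted_c = sorted(ciphertexts)
    sorted_p = sorted(known_plain_sample)
    guess = {}
    for rank, c in enumerate(sorted_c):
        # Map the ciphertext's rank to the same quantile of the known sample.
        idx = rank * len(sorted_p) // len(sorted_c)
        guess[c] = sorted_p[idx]
    return guess
```

Because order is preserved, the i-th smallest ciphertext must correspond to the i-th smallest plaintext, so knowing the distribution is enough to estimate every value without the key.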


[Figure. Labels: bucket based, estimation, OPE; original Xi distribution (known); transformed Xi’ distribution.]


Data perturbation

  • Definition

    1. randomly change the original data

    2. the attacker cannot effectively recover the original data

    3. the desired properties are preserved

  • Techniques

    • Single dimension: noise addition

    • Multidimensional

      • Geometric perturbation

      • Random projection

      • RASP (random space perturbation)


Noise addition

  • Y = X + R

    • X: original data column, R: random noise (distribution published), Y: published data

  • Applications in data mining

    • Reconstructing column distribution

      • Rakesh Agrawal SIGMOD 2000

      • Applied to privacy-preserving decision trees and naïve Bayes classifiers

  • Attacks

    • Spectral filtering (Kargupta ICDM 2004)

    • PCA reconstruction (Huang SIGMOD2005)
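A minimal sketch of why a published noise distribution lets a miner recover properties of the column. It only estimates the mean and variance of X from Y; it is not the full distribution-reconstruction procedure of the cited SIGMOD 2000 paper, and the Gaussian noise is an assumption chosen for illustration.

```python
import random
import statistics


def perturb(column, noise_sigma, rng):
    """Publish Y = X + R with R ~ N(0, noise_sigma^2); the noise law is public."""
    return [x + rng.gauss(0.0, noise_sigma) for x in column]


def estimate_moments(published, noise_sigma):
    """Estimate the mean and variance of X from Y and the noise distribution.

    X and R are independent, so E[Y] = E[X] and Var(Y) = Var(X) + Var(R).
    """
    mean_x = statistics.fmean(published)
    var_x = statistics.pvariance(published) - noise_sigma ** 2
    return mean_x, var_x
```

Individual values stay hidden behind the noise, while aggregate statistics needed by a classifier remain estimable; the attacks listed above exploit correlations across columns to remove part of that noise.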



  • Multiplicative perturbations

    • Geometric data perturbation for outsourced data mining

    • Random Projection

    • RASP perturbation for query services (range query, kNN query).



Geometric data perturbation

  • Y = RX + T + D

    • R: secret rotation matrix (preserves Euclidean distances)

    • T: secret random translation matrix, D: secret random noise matrix

    • Distances are approximately preserved (because of the noise D)

    • Resilient to most attacks on rotation perturbation

  • Applications

    • Outsourced privacy preserving data mining, applicable for many classification and clustering algorithms

  • Attacks

    • Population based attacks (when covariance matrix is revealed)
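The distance-preservation property can be checked with a small two-dimensional sketch. The function names, the angle, and the Gaussian noise are assumptions made for illustration; a real deployment would use a random orthogonal matrix in k dimensions.

```python
import math
import random


def geometric_perturb(points, theta, translation, noise_sigma, rng):
    """Apply Y = R X + T + D to 2-D records.

    R rotates by the secret angle theta, T is a secret translation added to
    every record, and D is per-record Gaussian noise.
    """
    c, s = math.cos(theta), math.sin(theta)
    tx, ty = translation
    out = []
    for x, y in points:
        out.append((c * x - s * y + tx + rng.gauss(0.0, noise_sigma),
                    s * x + c * y + ty + rng.gauss(0.0, noise_sigma)))
    return out


def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])
```

With `noise_sigma = 0` the pairwise distances are preserved exactly, which is what distance-based classifiers and clustering algorithms rely on; a small non-zero `noise_sigma` perturbs them slightly in exchange for resilience.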


Random Projection

  • Y = AX + D

    • A: random projection, e.g., entries from N(0,1)

    • Distances are approximately preserved

  • Applications

    • Many classification and clustering algorithms

      • Worse accuracy than geometric perturbation

    • Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record)

  • Attacks

    • Possibly more resilient than the other two perturbation methods

    • But utility (distance) is not well preserved


RASP perturbation

For k-dimensional numeric data with n records, represented as a k x n matrix (x: a record):

(1) Extend x to k+2 dimensions

  • The (k+1)th dimension is always 1 (the homogeneous dimension)

  • The (k+2)th dimension v is a random real number drawn from

(2) Encryption

  • A is a (k+2) x (k+2) invertible real-valued matrix with at least two non-zero values in each row; the last column of A has all non-zero values

  • A is shared by all records
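A minimal sketch of the two steps, under assumptions that are not stated in the transcript: the encryption is taken to be y = A (x_1, …, x_k, 1, v)^T, the random dimension v is drawn uniformly from (1, 2), and the example matrix `A` is a hypothetical key for k = 2.

```python
import random

# Example secret matrix for k = 2 (an assumption chosen for illustration).
# It is invertible (determinant -2), every row has at least two non-zero
# entries, and the last column has no zero entry.
A = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]


def check_key_shape(a):
    """Check the two structural conditions listed for A (not invertibility)."""
    rows_ok = all(sum(1 for v in row if v != 0) >= 2 for row in a)
    last_col_ok = all(row[-1] != 0 for row in a)
    return rows_ok and last_col_ok


def rasp_encrypt(x, a, rng):
    """Extend x to k+2 dimensions and multiply by the shared matrix A."""
    v = rng.uniform(1.0, 2.0)        # assumed distribution for the random dimension
    extended = list(x) + [1.0, v]    # (x_1..x_k, 1, v)
    return [sum(a_ij * e_j for a_ij, e_j in zip(row, extended)) for row in a]
```

Because v is fresh for every record and the last column of A is non-zero everywhere, encrypting the same record twice gives two different vectors in every coordinate.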


  • Properties

    • Not an OPE

    • Preserves convexity of the dataset

      • Convex dataset in R^k → another convex dataset in R^(k+2)

    • Good for range query

      • Each range query in R^k → hyperplane-based query → range query in R^(k+2)


RASP properties

  • Convexity preserving

    • Queried range (hypercube) is convex

    • RASP transforms the range to another convex set (polyhedron)

    • Half space: w^T x <= a, bounded by the hyperplane w^T x = a

    • The intersection of convex sets is also convex


Illustration of convexity preserving

[Figure: original space and encrypted space]


Secure query transformation

  • A naïve solution

    • Based on the convexity preserving property

    • Problems:

      (1) A^-1 can be probed

      (2) … If a is known, the whole dimension i is breached.


Secure query transformation

  • Enhanced solution

    • X_(k+2) is always positive

    • (X_i - a) ≤ 0 ⇔ (X_i - a) X_(k+2) ≤ 0

    • Correspondingly, in the encrypted space, y^T Θ y ≤ 0

  • Problems addressed:

    (1) A^-1 cannot be derived from Θ

    (2) (X_i - a) X_(k+2) ≤ 0 contains the random component X_(k+2), which protects the condition (X_i - a) ≤ 0
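The enhanced condition can be checked with exact arithmetic. The sketch below makes several assumptions that the transcript does not confirm: encryption is y = A (x, 1, v)^T, the query matrix is Θ = (A^-1)^T w z^T A^-1 with w selecting X_i - a and z selecting the random dimension, and `A` is a hypothetical example key for k = 2. Dimension indices in the code are zero-based.

```python
from fractions import Fraction

A = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1], [1, 1, 1, 1]]  # example key, k = 2


def invert(m):
    """Gauss-Jordan inversion with exact fractions."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matvec(m, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in m]


def encrypt(x, v):
    """y = A (x_1..x_k, 1, v)^T with v > 0 supplied by the caller."""
    return matvec(A, [Fraction(c) for c in x] + [Fraction(1), Fraction(v)])


def query_matrix(i, a, k):
    """Theta = (A^-1)^T w z^T A^-1, so that y^T Theta y = (x_i - a) * v."""
    a_inv_t = [list(col) for col in zip(*invert(A))]
    w = [Fraction(0)] * (k + 2)
    w[i], w[k] = Fraction(1), -Fraction(a)   # w^T (x, 1, v) = x_i - a
    z = [Fraction(0)] * (k + 2)
    z[k + 1] = Fraction(1)                   # z^T (x, 1, v) = v
    u = matvec(a_inv_t, w)                   # (A^-1)^T w
    t = matvec(a_inv_t, z)                   # (A^-1)^T z
    return [[ui * tj for tj in t] for ui in u]   # outer product u t^T


def server_test(theta, y):
    """Server-side check y^T Theta y <= 0, evaluated on ciphertext only."""
    return sum(yi * tij * yj
               for yi, row in zip(y, theta)
               for tij, yj in zip(row, y)) <= 0
```

The server receives only Θ, a product in which A^-1 appears on both sides, and the value it computes is (x_i - a) scaled by the per-record random v, so the plain difference x_i - a is not exposed.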


Efficient two-stage query processing

[Figure: original space and transformed space. Stage 1: query the bounding box. Stage 2: filter out the junk records.]

A multidimensional tree index is built on the encrypted data (in the transformed space) on the server.


Stage 1:

The client calculates the large bounding box;

the server uses the index to find the results.

Stage 2:

Filter the initial results with the conditions y^T Θ_i y ≤ 0 for i = 1…2k

Note: the two-stage strategy works if the output of stage 1 is significantly smaller than the original database and fits into memory.

Otherwise, use a linear scan with stage 2 filtering.
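The two stages can be sketched with a simplified stand-in. Instead of RASP, the secret transformation here is assumed to be a plain 2-D rotation, a linear scan stands in for the multidimensional tree index, and stage 2 uses linear half-plane conditions rather than the quadratic conditions above. All function names are hypothetical.

```python
import math


def rotate(p, theta):
    """Stand-in secret transformation: rotate a 2-D point by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def bounding_box(corners):
    xs, ys = zip(*corners)
    return min(xs), max(xs), min(ys), max(ys)


def half_planes(lo, hi, theta):
    """Four conditions n . y <= b in the transformed space for lo <= x <= hi."""
    planes = []
    for axis in (0, 1):
        e = (1.0, 0.0) if axis == 0 else (0.0, 1.0)
        n = rotate(e, theta)                          # normal of x_axis <= hi
        planes.append((n, hi[axis]))
        planes.append(((-n[0], -n[1]), -lo[axis]))    # x_axis >= lo
    return planes


def two_stage(transformed_points, lo, hi, theta):
    """Return (stage 1 candidates, stage 2 results) for the range [lo, hi]."""
    corners = [rotate((x, y), theta) for x in (lo[0], hi[0]) for y in (lo[1], hi[1])]
    xmin, xmax, ymin, ymax = bounding_box(corners)
    # Stage 1: bounding-box query (a linear scan stands in for the tree index).
    stage1 = [p for p in transformed_points
              if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax]
    # Stage 2: drop records inside the box but outside the transformed range.
    eps = 1e-9
    planes = half_planes(lo, hi, theta)
    stage2 = [p for p in stage1
              if all(n[0] * p[0] + n[1] * p[1] <= b + eps for n, b in planes)]
    return stage1, stage2
```

The bounding box is axis-aligned in the transformed space, so an ordinary index can answer it; the price is the junk records in the corners of the box, which stage 2 removes.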


RASP-based data mining

  • Preserving range query → linear classifier

  • Use the boosting framework to get strong classifiers (PerturBoost, in ICDM 2013)


Access pattern privacy

  • On database queries

    • Problem is the same as PIR

    • Attackers may use the access pattern to breach data confidentiality

  • Each of the previous approaches should handle this problem!


PIR is impractical

  • Solutions based on private information retrieval (PIR)

    • PIR is still impractical


For the bucketization approach

  • Based on the architecture of Hacigumus (SIGMOD02)

  • Hore VLDB04

    • For range query

    • Privacy concern: reveal the distribution of value in each bucket

    • “Diffusion”: split buckets and combine parts of different buckets

    • Trade-off: the server now needs to return more noisy results → larger result size


For OPE

  • Use queries to find out the distributions, then break the encryption


For RASP

  • Secure query transformation

  • Attacks on transformed queries


Oblivious RAM

  • Access pattern: read/write data items

  • Setting:

    • Client has a small secure memory

    • Server has large insecure storage, semi-honest

    • Data items are encrypted

    • Client cannot hide the accessed locations

  • An active area


Existing Approaches

  • Inside a level

    • Some real blocks

      • Useful data

    • Some dummy blocks

      • Random data

    • Randomly permuted

      • Only the client knows the permutation


Existing Approaches

  • Reading

    • Read a block from each level

    • One real block

    • The remaining are dummy blocks

[Figure: the client reads one block per level from the server; one block is real, the rest are dummy.]
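A toy model of the read path only, to show what the server observes. It is a sketch under simplifying assumptions: the client keeps an explicit position map, payloads are left unencrypted, every level gets as many dummies as real blocks, and the reshuffling and write-back steps of a real oblivious RAM are omitted. The class and attribute names are hypothetical.

```python
import random


class ToyLevelStore:
    """Each level holds real and dummy blocks in an order only the client knows."""

    def __init__(self, levels, rng):
        # levels: list of dicts {block_id: value}, one dict per level.
        self.server = []     # what the server stores: a list of slots per level
        self.position = []   # client secret: block_id -> slot, per level
        self.dummies = []    # client secret: unused dummy slots, per level
        for real in levels:
            blocks = [("real", b, v) for b, v in real.items()]
            blocks += [("dummy", None, rng.random()) for _ in real]
            rng.shuffle(blocks)
            self.server.append([payload for _, _, payload in blocks])
            self.position.append(
                {b: i for i, (kind, b, _) in enumerate(blocks) if kind == "real"})
            self.dummies.append(
                [i for i, (kind, _, _) in enumerate(blocks) if kind == "dummy"])
        self.trace = []      # (level, slot) pairs observed by the server

    def read(self, block_id):
        """Touch exactly one slot per level: the real one once, dummies elsewhere."""
        result, found = None, False
        for level, slots in enumerate(self.server):
            if not found and block_id in self.position[level]:
                slot = self.position[level][block_id]
                result, found = slots[slot], True
            else:
                # Each dummy is used once; a full scheme reshuffles before
                # a level runs out of unused dummies.
                slot = self.dummies[level].pop()
            self.trace.append((level, slot))
        return result
```

Whichever block is requested, the trace contains one access per level, so the level that held the real block is not distinguishable from the access pattern alone.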


Existing Approaches

  • Writing

    • Shuffle consecutively filled levels

    • Write into the next unfilled level

    • Clear the source levels

[Figure: the client shuffles blocks; server levels before and after.]


Continuous Shuffling

To write:



Integrity guarantee

  • Merkle hash tree

    • Parent hash: H(H(x1)+H(x2)), where + is string concatenation

    • Can be stored with a tree-like structure: index, XML
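A minimal sketch of the hash tree and of verifying one item against the root. SHA-256 as H and a power-of-two number of leaves are assumptions made to keep the example short; the function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    """H, instantiated here with SHA-256."""
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return all levels, from the leaf hashes up to the root.

    Assumes the number of leaves is a power of two.
    """
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        # Parent = H(left child + right child), + being concatenation.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def proof_for(levels, index):
    """Sibling hashes needed to recompute the root for leaf `index`."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify(leaf, path, root):
    """Recompute the root from a leaf and its sibling path."""
    digest = h(leaf)
    for sibling, sibling_is_left in path:
        digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
    return digest == root
```

The data owner keeps or signs only the root. The server returns each result item with its sibling path, and the client accepts the item only if the recomputed root matches.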




Using a Merkle tree

Example:

5<=q<=10

LUB(q) = 4

GLB(q) = 11



  • Operations:

    • Selections, projections, equijoins, set ops

  • Issues

    • Works only on data with verification objects

    • Query expressiveness

    • Expensive

  • Related work

    • Pang et al. (ICDE04, SIGMOD05), using ElGamal function

    • Sion VLDB05: challenge token

    • F.Li SIGMOD06: freshness


Secure keyword search

  • Simple information retrieval

    • For a keyword, find the documents containing the keyword

  • What if the documents are encrypted word by word

  • and the keyword is also encrypted?


Secure keyword search

  • Song 2000

  • Seed is random, different for each Wi

  • Key idea: Li and Ri are self-verifiable

  • Advantage of XOR



  • Setting of ki

    • ki = Fk’(Wi), k’ is secret

    • User publishes W and k = Fk’(W)

    • Server computes Ci ⊕ W and checks whether Ci ⊕ W == <Li, Fk(Li)>

    • It reveals nothing if Ci is not the ciphertext for W.

    • Li is random for different Wi, so the server cannot find any information from Li.
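The check above can be sketched as follows. This is a simplified illustration, not the full scheme: F is instantiated with truncated HMAC-SHA256 (an assumption), words are zero-padded to a fixed length, and Li comes from `os.urandom` rather than a seeded stream, so decryption by the data owner is not covered.

```python
import hashlib
import hmac
import os

HALF = 16  # bytes in each half; words are padded to 2 * HALF bytes


def prf(key: bytes, msg: bytes) -> bytes:
    """Keyed pseudorandom function F (HMAC-SHA256, truncated to HALF bytes)."""
    return hmac.new(key, msg, hashlib.sha256).digest()[:HALF]


def pad(word: str) -> bytes:
    raw = word.encode()
    if len(raw) > 2 * HALF:
        raise ValueError("word too long for this sketch")
    return raw.ljust(2 * HALF, b"\0")


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_word(word: str, k_prime: bytes) -> bytes:
    w = pad(word)
    k_i = prf(k_prime, w)                 # ki = Fk'(Wi)
    l_i = os.urandom(HALF)                # random Li, fresh for every position
    return xor(w, l_i + prf(k_i, l_i))    # Ci = Wi xor <Li, Fki(Li)>


def trapdoor(word: str, k_prime: bytes):
    """What the user releases for one search: W and k = Fk'(W)."""
    w = pad(word)
    return w, prf(k_prime, w)


def server_match(c_i: bytes, w: bytes, k: bytes) -> bool:
    t = xor(c_i, w)                       # <Li, Fk(Li)> only if Ci encrypts W
    left, right = t[:HALF], t[HALF:]
    return hmac.compare_digest(right, prf(k, left))
```

The pair <Li, Fk(Li)> is self-verifiable: the server can confirm a match without learning Li in advance, and two occurrences of the same word still produce different ciphertexts.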


Hidden search

  • In previous schemes, W is revealed

    • Weakness: each search will have to release k for W

    • Easy to collect information

  • Solution: encrypt Wi with a private key, then XOR with <Li, Fk(Li)>


Recent developments

  • Reza 2006

    • “Searchable symmetric encryption: improved definitions and efficient constructions”

    • Completely solved this problem, with a solution that achieves indistinguishability under chosen ciphertext attack (IND-CCA)


Discussion

  • Data confidentiality/access pattern

    • Strict cryptographic definition (keyword search) or

    • Relaxed definition (perturbation, bucketization, OPE, etc.)

  • It is very difficult to formulate and prove the security of non-traditional approaches

    • Do we need to reformulate the security model? And how?