Secure Data Outsourcing

Presentation Transcript
Outline
  • Motivation
  • Background
  • Research issues
  • Summary
Motivation
  • Cost of maintaining/mining large data
    • 4-5 times the cost of data acquisition
    • DBAs are paid well
  • More and more data service providers
    • Low cost – cloud computing
      • Maintain one database for one user → multiple users
    • Examples:
      • Alentus.com
      • Datapipe.com
      • Discountasp.net
  • Concerns about data security and privacy
    • Untrusted service provider
Untrusted service provider
  • Lazy: incentives to perform less
  • Curious: incentives to acquire information
  • Malicious:
    • Denial of service
    • Incorrect results
    • Possibly compromised
Challenges
  • Data confidentiality
    • Data need to be encrypted (?)
    • Utility of protected data?
      • Query utility
      • Mining utility
  • Access pattern privacy
  • Integrity
    • Data integrity
    • Query integrity
      • Correct
      • Complete
      • Fresh
Why is it hard for query services?
  • Arbitrary expressivity
    • SQL statements
    • Often restricted to certain query types for simplicity (e.g., range query, kNN query)
  • Cost
    • Communication
    • Computation (server side vs client side)
Why is it hard for mining services?
  • Many data mining models
    • Different utilities to preserve
    • No one-size-fits-all solution
Data confidentiality
  • Bucketization method (crypto-index)
  • Order preserving encryption
  • Perturbations
Bucketization method
  • Hacigumus (SIGMOD02)
Main steps
    • Partition sensitive attributes
      • Order preserving: supports comparison
      • Random: query rewriting becomes hard
    • Build index on the partitions
    • Rewrite queries to target partitions
      • ‘john doe’ → 105
      • SELECT * FROM T’ WHERE name = 105
    • Execute queries and return results
    • Prune/post-process results on client
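
The main steps above can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the attribute (`salary`), the bucket boundaries, the bucket identifiers, and all function names are hypothetical, and the payload is wrapped rather than actually encrypted.

```python
import bisect

# Hypothetical, client-secret partition of a numeric "salary" attribute.
BOUNDARIES = [0, 30_000, 60_000, 90_000, 120_000]
# Hypothetical random-looking identifiers, one per partition.
BUCKET_IDS = [17, 4, 42, 8]


def _bucket_index(value):
    """Index of the partition containing value, clamped to the valid range."""
    idx = bisect.bisect_right(BOUNDARIES, value) - 1
    return min(max(idx, 0), len(BUCKET_IDS) - 1)


def outsource(rows):
    """Client: build the server table of (bucket id, opaque payload)."""
    # A real deployment would encrypt the payload; here it is only tagged.
    return [(BUCKET_IDS[_bucket_index(salary)], ("ciphertext", name, salary))
            for name, salary in rows]


def rewrite_range(lo, hi):
    """Client: rewrite lo <= salary <= hi into the set of bucket ids to fetch."""
    first, last = _bucket_index(lo), _bucket_index(hi)
    return {BUCKET_IDS[i] for i in range(first, last + 1)}


def server_execute(table, bucket_ids):
    """Server: sees only bucket ids, returns a superset of the true answer."""
    return [payload for bucket, payload in table if bucket in bucket_ids]


def client_postprocess(payloads, lo, hi):
    """Client: 'decrypt' and prune the false positives."""
    return [(name, salary) for _, name, salary in payloads if lo <= salary <= hi]


rows = [("ann", 25_000), ("bob", 45_000), ("cy", 58_000), ("dee", 95_000)]
table = outsource(rows)
candidates = server_execute(table, rewrite_range(50_000, 59_000))
answer = client_postprocess(candidates, 50_000, 59_000)
```

In this run the server returns both `bob` and `cy` (same bucket), and the client prunes `bob`; wider buckets hide more but return more junk.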
Trade-off between confidentiality and overhead
    • Larger partition → increased privacy → increased overhead
Order preserving encryption
  • Agrawal2004, Boldyreva2009
  • The set of data is securely transformed so that the order is preserved but the distribution and domain are changed
  • Benefits: indexing/searching on OPE encrypted data
  • Weakness: once the original distribution is known, OPE is broken
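
The weakness can be illustrated with a toy rank-matching attack. The sketch below is not the scheme of Agrawal2004 or Boldyreva2009; `make_ope_table` is a hypothetical stand-in (a random strictly increasing map), and the attacker is assumed to know the exact multiset of plaintext values.

```python
import random


def make_ope_table(domain, rng):
    """Toy order-preserving map: a random strictly increasing function."""
    values = sorted(domain)
    codes = sorted(rng.sample(range(100 * len(values)), len(values)))
    return dict(zip(values, codes))


def rank_matching_attack(ciphertexts, known_plaintexts):
    """Align sorted ciphertexts with sorted known values, rank by rank."""
    order = sorted(range(len(ciphertexts)), key=lambda i: ciphertexts[i])
    reference = sorted(known_plaintexts)
    guesses = [None] * len(ciphertexts)
    for rank, i in enumerate(order):
        # Quantile position in the reference distribution.
        guesses[i] = reference[rank * len(reference) // len(ciphertexts)]
    return guesses


rng = random.Random(0)
table = make_ope_table(range(100), rng)
plaintexts = [rng.randrange(100) for _ in range(500)]
ciphertexts = [table[p] for p in plaintexts]

# The attacker never sees `table`, only the ciphertexts and the distribution.
leaked = rng.sample(plaintexts, len(plaintexts))  # same values, shuffled
recovered = rank_matching_attack(ciphertexts, leaked)
```

With an exact copy of the distribution every value is recovered; with only an approximate distribution the guesses degrade gradually rather than failing outright.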
Not attribute-wise order preserving

    • Order preserving encryption (OPE, Agrawal et al. 2004) is not resilient to distribution-based attacks

[Figure: bucket-based estimation against OPE. Labels: original Xi distribution (known), transformed Xi’ distribution]

Data perturbation
  • Definition
    1. Randomly change the original data
    2. The attacker cannot effectively recover the original data
    3. The desired properties are preserved


  • Techniques
    • Single dimension: noise addition
    • Multidimensional
      • Geometric perturbation
      • Random projection
      • RASP (random space perturbation)
Noise addition
  • Y = X + R
    • X: original data column, R: random noise (distribution published), Y: published data
  • Applications in data mining
    • Reconstructing column distribution
      • Agrawal and Srikant (SIGMOD 2000)
      • Applied to privacy-preserving decision tree and naïve Bayes classifiers
  • Attacks
    • Spectral filtering (Kargupta ICDM 2004)
    • PCA reconstruction (Huang SIGMOD2005)
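
A small sketch of why publishing the noise distribution keeps aggregate utility: if R is independent of X, then E[X] = E[Y] - E[R] and Var(X) = Var(Y) - Var(R). This only recovers two moments, not the full distribution reconstruction of the SIGMOD 2000 work; the data range, noise level, and seed are arbitrary assumptions.

```python
import random
import statistics

NOISE_MEAN = 0.0   # published parameters of R (assumed Gaussian here)
NOISE_SD = 10.0

rng = random.Random(42)
x = [rng.uniform(20, 60) for _ in range(20_000)]            # private column X
y = [v + rng.gauss(NOISE_MEAN, NOISE_SD) for v in x]        # published column Y

# The miner sees only y and the noise parameters.
est_mean = statistics.fmean(y) - NOISE_MEAN
est_var = statistics.pvariance(y) - NOISE_SD ** 2

true_mean = statistics.fmean(x)
true_var = statistics.pvariance(x)
```

Individual values of `y` are far from the corresponding `x`, yet the estimated mean and variance land close to the true ones.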

Multiplicative perturbations

    • Geometric data perturbation for outsourced data mining
    • Random Projection
    • RASP perturbation for query services (range query, kNN query).
Geometric data perturbation
  • Y = RX + T + D
    • R: secret rotation matrix (preserves Euclidean distances)
    • T: secret random translation matrix, D: secret random noise matrix
    • Distances are only approximately preserved (because of D)
    • Resilient to most attacks on rotation perturbation
  • Applications
    • Outsourced privacy preserving data mining, applicable for many classification and clustering algorithms
  • Attacks
    • Population based attacks (when covariance matrix is revealed)
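
A two-dimensional sketch of Y = RX + T + D, assuming R is a plain rotation by a secret angle, T a single translation vector applied to every record, and D independent Gaussian noise. The function name, angle, and noise level are illustrative choices, not taken from the slides.

```python
import math
import random


def geometric_perturb(points, theta, t, noise_sd, rng):
    """Rotate each 2-D record by theta, translate by t, add Gaussian noise."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for x0, x1 in points:
        y0 = c * x0 - s * x1 + t[0] + rng.gauss(0.0, noise_sd)
        y1 = s * x0 + c * x1 + t[1] + rng.gauss(0.0, noise_sd)
        out.append((y0, y1))
    return out


rng = random.Random(3)
points = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]
theta, t = 1.234, (10.0, -7.0)   # hypothetical secret parameters

exact = geometric_perturb(points, theta, t, 0.0, rng)    # D = 0
noisy = geometric_perturb(points, theta, t, 0.01, rng)   # small D

d_orig = math.dist(points[0], points[1])
d_exact = math.dist(exact[0], exact[1])
d_noisy = math.dist(noisy[0], noisy[1])
```

Without noise the pairwise distance is unchanged (rotation and translation are isometries); with noise it moves slightly, which is the utility cost paid for resilience.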
Random Projection
  • Y=AX+D
    • A: random projection, e.g., entries from N(0,1)
    • Distances are approximately preserved
  • Applications
    • Many classification and clustering algorithms
      • Worse accuracy than geometric perturbation
    • Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record)
  • Attacks
    • Possibly more resilient than the other two perturbation methods
    • But utility (distance) is not well preserved
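
The approximate distance preservation can be checked numerically. The sketch assumes entries of A drawn from N(0,1) and scaled by 1/sqrt(k), which is a common normalization but is not stated on the slide; the noise term D is omitted, and the dimensions and seed are arbitrary.

```python
import math
import random


def random_projection_matrix(k, d, rng):
    """k x d matrix with N(0,1) entries, scaled so norms are kept on average."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Matrix-vector product A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in matrix]


rng = random.Random(7)
d, k = 1000, 400                      # original and projected dimensions
x1 = [rng.uniform(-1, 1) for _ in range(d)]
x2 = [rng.uniform(-1, 1) for _ in range(d)]
A = random_projection_matrix(k, d, rng)

original = math.dist(x1, x2)
projected = math.dist(project(A, x1), project(A, x2))
ratio = projected / original          # close to 1, but not exactly 1
```

The ratio fluctuates around 1 with a spread that shrinks as k grows, which matches the slide's point that distance utility is only loosely preserved.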
RASP perturbation

k-dimensional numeric data with n records, represented as a k x n matrix; x denotes one record.

(1) Extend x to k+2 dimensions

  • The (k+1)-th dimension is always 1 (the homogeneous dimension)
  • The (k+2)-th dimension v is a real random number drawn from

(2) Encryption

  • A is a (k+2) x (k+2) invertible real-valued matrix with at least two non-zero values in each row; the last column of A has all non-zero values
  • A is shared by all records


Properties

    • Not an OPE
    • Preserves convexity of the dataset
      • Convex dataset in R^k → another convex dataset in R^(k+2)
    • Good for range query
      • Each range query in R^k → hyperplane-based query → range query in R^(k+2)

RASP properties
  • Convexity preserving
    • Queried range (hypercube) is convex
    • RASP transforms the range into another convex set (a polyhedron)

Half-space: w^T x <= a; bounding hyperplane: w^T x = a

The intersection of convex sets is also convex.

Illustration of convexity preserving

[Figure: a query range in the original space and its image in the encrypted space]

Secure query transformation
  • A naïve solution
    • Based on the convexity preserving property

Problems: (1) A^-1 can be probed; (2) if a is known, the whole dimension i is breached.

Secure query transformation
  • Enhanced solution
    • X_(k+2) is always positive
    • (X_i - a) <= 0 ⇔ (X_i - a) X_(k+2) <= 0
    • Correspondingly, in the encrypted space: y^T Θ y <= 0, where Θ is the transformed query matrix

Problems addressed:

(1) A^-1 cannot be derived from Θ

(2) (X_i - a) X_(k+2) <= 0 contains the random component X_(k+2), which protects the condition (X_i - a) <= 0
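
The equivalence between the plaintext condition and a quadratic condition on the encrypted record can be checked on a toy instance. The sketch assumes the encryption is y = A (x, 1, v)^T with v > 0, and assumes the query matrix has the form (A^-1)^T u w^T A^-1, where u selects (X_i - a) and w selects X_(k+2); both the explicit form and the example matrix are assumptions for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def inverse(m):
    """Gauss-Jordan inverse over exact fractions."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def encrypt(A, x, v):
    """y = A (x_1..x_k, 1, v)^T; v must be positive."""
    ext = list(x) + [1, v]
    return [sum(a * b for a, b in zip(row, ext)) for row in A]


def query_matrix(A_inv, i, a):
    """Rank-one matrix such that y^T M y = (x_i - a) * v."""
    n = len(A_inv)
    u = [0] * n
    u[i], u[n - 2] = 1, -a            # u^T x_ext = x_i - a
    w = [0] * n
    w[n - 1] = 1                      # w^T x_ext = v
    left = [sum(u[r] * A_inv[r][c] for r in range(n)) for c in range(n)]
    right = [sum(w[r] * A_inv[r][c] for r in range(n)) for c in range(n)]
    return [[l * r for r in right] for l in left]


def satisfies(theta, y):
    """Server-side test y^T theta y <= 0."""
    n = len(y)
    return sum(y[r] * theta[r][c] * y[c] for r in range(n) for c in range(n)) <= 0


# Hypothetical 4 x 4 key for k = 2 (determinant 20, last column non-zero).
A = [[2, 1, 0, 1], [1, 3, 1, 2], [0, 1, 2, 1], [1, 0, 1, 3]]
theta = query_matrix(inverse(A), i=0, a=5)     # encodes X_1 <= 5

inside = satisfies(theta, encrypt(A, (3, 7), Fraction(5, 2)))
outside = satisfies(theta, encrypt(A, (6, 7), Fraction(1, 3)))
```

The server evaluates only `satisfies(theta, y)` and never sees `A`, `a`, or the random component `v`.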

Efficient two-stage query processing

[Figure: the query range in the original space and in the transformed space. Stage 1: query the bounding box. Stage 2: filter out the junk records.]

A multidimensional tree index is built on the encrypted data (in the transformed space) on the server.


Stage 1:

The client calculates the large bounding box; the server uses the index to find the results.

Stage 2:

Filter the initial results with the conditions y^T Θ_i y <= 0 for i = 1…2k

Note: the two-stage strategy works if the output of stage 1 is significantly smaller than the original database and fits in memory.

Otherwise, use a linear scan with stage 2 filtering.
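
The two stages can be sketched in a deliberately simplified setting: a plain 2 x 2 integer linear map stands in for the RASP transformation, a linear scan stands in for the tree index, and the stage 2 filter uses the inverse map directly rather than the quadratic conditions. All names and numbers are hypothetical.

```python
from itertools import product


def transform(x):
    """Hypothetical secret linear map y = A x with A = [[2, 1], [1, 3]]."""
    return (2 * x[0] + x[1], x[0] + 3 * x[1])


def bounding_box(lo, hi):
    """Client: axis-aligned box around the transformed corners of the range."""
    corners = [transform(c) for c in product(*zip(lo, hi))]
    low = tuple(min(c[d] for c in corners) for d in range(2))
    high = tuple(max(c[d] for c in corners) for d in range(2))
    return low, high


def stage1(server_rows, box):
    """Server: box query (an index lookup in practice, a scan here)."""
    low, high = box
    return [y for y in server_rows
            if all(low[d] <= y[d] <= high[d] for d in range(2))]


def stage2(candidates, lo, hi):
    """Exact filter; uses 5 * A^-1 = [[3, -1], [-1, 2]] to stay in integers."""
    kept = []
    for y in candidates:
        x5 = (3 * y[0] - y[1], -y[0] + 2 * y[1])
        if all(5 * lo[d] <= x5[d] <= 5 * hi[d] for d in range(2)):
            kept.append(y)
    return kept


records = [(1, 1), (2, 2), (3, 1), (0, 4), (9, 9), (4, 0)]
server_rows = [transform(r) for r in records]

lo, hi = (1, 1), (3, 2)
candidates = stage1(server_rows, bounding_box(lo, hi))
results = stage2(candidates, lo, hi)
```

Here stage 1 returns four candidates, including the junk record (4, 0), whose image falls inside the bounding box but outside the transformed range; stage 2 removes it.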

RASP-based data mining
  • Preserving range query → linear classifier
  • Use the boosting framework to get strong classifiers (PerturBoost, in ICDM 2013)
Access pattern privacy
  • On database queries
    • Problem is the same as PIR
    • Attackers may use the access pattern to breach data confidentiality
  • Each of previous approaches should handle this problem!
PIR is impractical
  • Solutions based on private information retrieval (PIR)
    • PIR is still impractical
For the bucketization approach
  • Based on the architecture of Hacigumus (SIGMOD02)
  • Hore VLDB04
    • For range query
    • Privacy concern: reveals the distribution of values in each bucket
    • “Diffusion”: split buckets and combine parts of different buckets
    • Trade-off: the server now needs to return more noisy results → larger result size
For OPE
  • Use queries to find out the distributions, then break the encryption
For RASP
  • Secure query transformation
  • Attacks on transformed queries
Oblivious RAM
  • Access pattern: read/write data items
  • Setting:
    • Client has a small secure memory
    • Server has large insecure storage, semi-honest
    • Data items are encrypted
    • Client cannot hide the accessed locations
  • An active area
Existing Approaches
  • Inside a level
    • Some real blocks
      • Useful data
    • Some dummy blocks
      • Random data
    • Randomly permuted
      • Only the client knows the permutation
Existing Approaches
  • Reading
    • Read a block from each level
    • One real block
    • The remaining are dummy blocks

[Figure: the client reads one block from each level on the server; one block is real and the others are dummies]

Existing Approaches
  • Writing
    • Shuffle consecutively filled levels.
    • Write into next unfilled level.
    • Clear the source levels

[Figure: server levels before and after a write, with the blocks shuffled by the client]
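
The read step of such a level-based layout can be sketched as follows. This is a heavily simplified illustration, not a complete Oblivious RAM: the `LevelledStore` class is hypothetical, blocks are not actually encrypted, and the write-back and reshuffle steps described above are omitted, so each dummy slot can be consumed only once.

```python
import random


class LevelledStore:
    """Toy levelled store: real and dummy blocks, secretly permuted per level."""

    def __init__(self, level_items, dummies_per_level, rng):
        self.server = []       # per level: list of blocks held by the server
        self.position = []     # client secret: per level, key -> slot
        self.dummy_slots = []  # client secret: per level, unused dummy slots
        self.trace = []        # what the server observes: (level, slot)
        for items in level_items:
            blocks = [("real", key, value) for key, value in items.items()]
            blocks += [("dummy", None, rng.random())
                       for _ in range(dummies_per_level)]
            rng.shuffle(blocks)  # the permutation is known only to the client
            self.server.append(blocks)
            self.position.append(
                {b[1]: slot for slot, b in enumerate(blocks) if b[0] == "real"})
            self.dummy_slots.append(
                [slot for slot, b in enumerate(blocks) if b[0] == "dummy"])

    def read(self, key):
        """Touch exactly one slot per level: real once, dummies elsewhere."""
        found, hit = None, False
        for level, blocks in enumerate(self.server):
            if not hit and key in self.position[level]:
                slot = self.position[level][key]
                found, hit = blocks[slot][2], True
            else:
                slot = self.dummy_slots[level].pop()
            self.trace.append((level, slot))
        return found


rng = random.Random(1)
store = LevelledStore([{"a": 1}, {"b": 2, "c": 3}, {"d": 4}], 4, rng)
value = store.read("c")
levels_touched = [level for level, _ in store.trace]
```

Whatever key is requested, the server sees one slot accessed per level, so the trace alone does not show which level held the real block (assuming real and dummy blocks are indistinguishable ciphertexts).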

Integrity guarantee
  • Merkle hash tree

H(H(x1) + H(x2)), where + is string concatenation

Can be stored with tree-like structures: indexes, XML
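
A minimal Merkle hash tree sketch, assuming SHA-256 as H and duplication of the last node on odd-sized levels (one common convention; the slides do not specify either). The function names are illustrative.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level):
    """Pair up nodes: parent = H(left + right), + is concatenation."""
    if len(level) % 2:
        level = level + [level[-1]]   # duplicate the last node if odd
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves):
    level = [h(x) for x in leaves]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves, index):
    """Sibling hashes from leaf to root, each with a 'sibling is left' flag."""
    level = [h(x) for x in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Client: recompute the path to the root from one leaf and its proof."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


leaves = [b"x1", b"x2", b"x3", b"x4", b"x5"]
root = merkle_root(leaves)            # the owner signs and publishes this
proof = merkle_proof(leaves, 2)       # the server returns this with x3
ok = verify(b"x3", proof, root)
```

The client needs only the signed root; the proof size grows with the tree height rather than with the number of records.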

Using a Merkle tree

Example:

5<=q<=10

LUB(q) = 4

GLB(q) = 11
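
The role of the two boundary values in the example can be sketched as below: the server returns the matching values together with the nearest value on each side, so the client can see that nothing at the edges was dropped. The function names and sample data are hypothetical, and the check that the returned run is contiguous in the signed tree is not shown.

```python
import bisect


def range_with_boundaries(sorted_values, lo, hi):
    """Server: matches for lo <= q <= hi plus the neighbour on each side."""
    start = bisect.bisect_left(sorted_values, lo)
    end = bisect.bisect_right(sorted_values, hi)
    below = sorted_values[start - 1] if start > 0 else None
    above = sorted_values[end] if end < len(sorted_values) else None
    return below, sorted_values[start:end], above


def boundaries_consistent(below, results, above, lo, hi):
    """Client: neighbours lie outside the range, results inside and sorted."""
    if below is not None and below >= lo:
        return False
    if above is not None and above <= hi:
        return False
    return results == sorted(results) and all(lo <= v <= hi for v in results)


values = [1, 4, 5, 7, 10, 11, 15]
below, results, above = range_with_boundaries(values, 5, 10)
```

For the query 5 <= q <= 10 this yields the neighbours 4 and 11 around the results, the same two boundary values as in the example.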

Operations:
    • Selections, projections, equijoins, set ops
  • Issues
    • Works only on data with verification objects
    • Query expressiveness
    • Expensive
  • Related work
    • Pang et al. (ICDE04, SIGMOD05), using ElGamal function
    • Sion VLDB05: challenge token
    • F.Li SIGMOD06: freshness
Secure keyword search
  • Simple information retrieval
    • For a keyword, find the documents containing the keyword
  • What if the documents are encrypted word by word,
  • and the keyword is also encrypted?
Secure keyword search
  • Song 2000
  • Seed is random, different for each Wi
  • Key idea: Li and Ri are self-verifiable
  • Advantage of XOR
Setting of ki
    • ki = Fk’(Wi), where k’ is secret
    • User publishes W and k = Fk’(W)
    • Server checks Ci ⊕ W: whether Ci ⊕ W == <Li, Fk(Li)>

It reveals nothing if Ci is not the ciphertext for W.

Li is random and different for each Wi, so the server cannot learn anything from Li.
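
The check above can be sketched with HMAC-SHA256 standing in for the pseudorandom function F. The word size, the split between Li and the check part, and all function names are assumptions for illustration; decryption by the client is not covered.

```python
import hashlib
import hmac
import secrets

WORD_LEN = 32   # hypothetical fixed word size in bytes
SEED_LEN = 16   # length of the random part Li; the check part fills the rest


def prf(key: bytes, msg: bytes, n: int) -> bytes:
    """F_key(msg), truncated to n bytes (HMAC-SHA256 as a stand-in PRF)."""
    return hmac.new(key, msg, hashlib.sha256).digest()[:n]


def pad(word: str) -> bytes:
    raw = word.encode()
    if len(raw) > WORD_LEN:
        raise ValueError("word too long for this sketch")
    return raw.ljust(WORD_LEN, b"\0")


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_word(master_key: bytes, word: str) -> bytes:
    """Ci = Wi XOR <Li, F_ki(Li)> with ki = F_k'(Wi) and a fresh random Li."""
    w = pad(word)
    k_i = prf(master_key, w, 32)
    left = secrets.token_bytes(SEED_LEN)
    right = prf(k_i, left, WORD_LEN - SEED_LEN)
    return xor(w, left + right)


def trapdoor(master_key: bytes, word: str):
    """User: release W and k = F_k'(W) for one search."""
    w = pad(word)
    return w, prf(master_key, w, 32)


def server_match(cipher: bytes, w: bytes, k: bytes) -> bool:
    """Server: does Ci XOR W have the self-verifying form <L, F_k(L)>?"""
    t = xor(cipher, w)
    left, right = t[:SEED_LEN], t[SEED_LEN:]
    return hmac.compare_digest(right, prf(k, left, WORD_LEN - SEED_LEN))


master_key = secrets.token_bytes(32)              # k', kept by the user
document = ["secure", "data", "outsourcing", "data"]
ciphertexts = [encrypt_word(master_key, word) for word in document]

w, k = trapdoor(master_key, "data")
hits = [i for i, c in enumerate(ciphertexts) if server_match(c, w, k)]
```

The two occurrences of the same word encrypt to different ciphertexts because Li is fresh each time, yet both positions match the trapdoor.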

Hidden search
  • In previous schemes, W is revealed
      • Weakness: each search has to release k for W
      • Easy to collect information
  • Solution: encrypt Wi with a private key, then XOR with <Li, Fk(Li)>
Recent developments
  • Curtmola et al. 2006
    • “Searchable symmetric encryption: improved definitions and efficient constructions”
    • Completely solved this problem, with a solution that achieves indistinguishability under chosen ciphertext attack (IND-CCA)
Discussion
  • Data confidentiality/access pattern
    • Strict cryptographic definition (keyword search), or
    • Relaxed definition (perturbation, bucketization, OPE, etc.)
  • It is very difficult to formulate and prove the security of non-traditional approaches
    • Do we need to reformulate the security model? and how?