A Privacy Preserving Index for Range Queries

A Privacy Preserving Index for Range Queries. Bijit Hore, Sharad Mehrotra, Gene Tsudik. Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002]. A client wants to store data on a remote server and run queries on it, but does not trust the server.


Presentation Transcript


  1. A Privacy Preserving Index for Range Queries Bijit Hore, Sharad Mehrotra, Gene Tsudik

  2. Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002]
     • A client wants to store data on a remote server & run queries on it
     • BUT the client does not trust the server
     • Solution: encrypt the data & store it
     • How do you query the encrypted data?
     [Figure: DAS architecture. On the trusted client side, the user's original query goes through a query translator, which sends a query over encrypted data to the untrusted service provider holding the encrypted & indexed client data. The server returns encrypted results, and a client-side query post-processor produces the true results.]

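  3. Data storage in DAS
     [Figure: The client stores metadata describing the buckets. The value axis is marked 0, 200, 450, 600, 650, 700 and split into server-side data buckets Z0 to Z4. The original plain-text table R is stored on the server as an encrypted + indexed table R^A whose rows carry bucket tags.]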

  4. Querying in DAS
     • Client-side query: Select * from R where R.sal ∈ [400K, 600K]
     • Server-side query: Select etuple from R^A where R^A.sal^A = z1 ∨ z2
     [Figure: client-side table R (plain text) and server-side table R^A (encrypted + indexed) with bucket tags.]
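The query translation on this slide can be sketched as a lookup over the client-side bucket metadata. This is a minimal sketch, not the authors' implementation: the inclusive bucket boundaries, the lowercase tag names, and the helper names `translate_range` and `server_query` are assumptions chosen so that the range [400, 600] maps to z1 ∨ z2 as on the slide.

```python
from typing import List, Tuple

# Hypothetical client-side metadata: (tag, low, high) with inclusive bounds.
# The exact boundaries are illustrative assumptions.
BUCKETS: List[Tuple[str, int, int]] = [
    ("z0", 0, 200),
    ("z1", 201, 450),
    ("z2", 451, 600),
    ("z3", 601, 650),
    ("z4", 651, 700),
]


def translate_range(lo: int, hi: int) -> List[str]:
    """Return the tags of every bucket that overlaps the range [lo, hi]."""
    return [tag for tag, b_lo, b_hi in BUCKETS if b_lo <= hi and lo <= b_hi]


def server_query(lo: int, hi: int) -> str:
    """Build the server-side predicate as a disjunction over bucket tags."""
    tags = translate_range(lo, hi)
    # Fall back to an always-false predicate when no bucket overlaps.
    predicate = " OR ".join(f"salA = '{t}'" for t in tags) or "1 = 0"
    return f"SELECT etuple FROM RA WHERE {predicate}"


print(server_query(400, 600))
# SELECT etuple FROM RA WHERE salA = 'z1' OR salA = 'z2'
```

The server only ever sees bucket tags, so it returns every encrypted tuple in z1 and z2; the client decrypts them and filters out the false positives.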

  5. Issues in partitioning
     • How many buckets should one use?
     • How to partition the data?

  6. Data Privacy in DAS
     • Adversary (A): access to server-side data + malicious intentions
     • Privacy issue in partitioned data: small range of a bucket B + 1 sample value from B ⇒ “almost total” disclosure of all elements in B
     • Privacy goal of client: to hide all useful information from A
       • Put all values of an attribute in a single bucket!

  7. Research challenges & our contributions
     • Precision: how to partition data
       • Definition
       • Optimal partitioning to maximize precision
     • Privacy: quantifying disclosure
       • Adversary’s goals
       • Measures of information disclosure
     • Privacy-Precision trade-off
       • Controlled diffusion algorithm
     • Experiments & Conclusion

  8. Precision of range queries
     • Given a partition of data into M parts
     • Precision(q) = 1 – (# false positives / # tuples returned for q)
     • Recall = 1
     • Workload: all O(N^2) range queries are equiprobable (uniform)
     • # false positives ∝ ∑_B N_B*F_B = 5*32 + 5*18 = 250
     • For the marked query q: Precision = 1 – 20/50 = 0.6
     [Figure: histogram of frequency vs. salary (100K’s) over a domain of size N = 10, split into M = 2 buckets. The bars are labelled with two 10s, one 6, five 4s and two 2s; one bucket is annotated N_B = 5, F_B = 18.]
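The two numbers on this slide can be reproduced with a short sketch. Several things here are assumptions rather than facts from the slides: the left-to-right order of the histogram bars (`FREQ`), the reading of N_B as the number of domain values in a bucket and F_B as its total frequency, and the choice of q = [4, 7] as a query that yields 20 false positives out of 50 returned tuples.

```python
from typing import List, Tuple

# Assumed histogram: FREQ[v - 1] is the number of tuples with salary value v (1..10).
FREQ = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]


def precision(buckets: List[Tuple[int, int]], q_lo: int, q_hi: int) -> float:
    """Precision of the range query [q_lo, q_hi] when whole buckets are returned."""
    true_hits = sum(FREQ[v - 1] for v in range(q_lo, q_hi + 1))
    returned = sum(
        sum(FREQ[lo - 1:hi])
        for lo, hi in buckets
        if lo <= q_hi and q_lo <= hi  # bucket overlaps the query range
    )
    return 1 - (returned - true_hits) / returned


def total_cost(buckets: List[Tuple[int, int]]) -> int:
    """Sum of N_B * F_B: bucket width times total frequency, over all buckets."""
    return sum((hi - lo + 1) * sum(FREQ[lo - 1:hi]) for lo, hi in buckets)


two_buckets = [(1, 5), (6, 10)]  # inclusive value ranges, M = 2
print(total_cost(two_buckets))                # 250
print(round(precision(two_buckets, 4, 7), 3))  # 0.6
```

The query [4, 7] touches both buckets, so all 50 tuples come back although only 30 match.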

  9. Query optimal buckets (QOB)
     • Optimization problem: for the uniform workload, find a partition of the data into M buckets that minimizes the total # false positives, i.e. minimize ∑_{B=1}^{4} N_B*F_B
     • QOB(1,10,4) = QOB(1,7,3) + Cost(8,10), i.e. the optimal solution to a sub-problem plus the cost of the rightmost bucket
     [Figure: the same histogram of frequency vs. salary (100K’s), N = 10 (domain size); the rightmost bucket is annotated N_B*F_B = 24.]

  10. QOB (cont.)
     • Optimal cost = ∑_{B=1}^{4} N_B*F_B = 12*3 + 20*2 + 10*2 + 8*3 = 120
     • Time complexity = O(n^2 M), space = O(nM), where n = # distinct values in the dataset and M = # buckets
     [Figure: the histogram of frequency vs. salary (100K’s) partitioned into the optimal buckets B1 to B4.]
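The recurrence QOB(1,10,4) = QOB(1,7,3) + Cost(8,10) suggests a standard dynamic program over (number of buckets, right end point). The following is a minimal sketch under that reading, not the authors' code; the function name `qob` and the bar order in `freq` are assumptions. The four products listed on the slide (12*3, 20*2, 10*2, 8*3) sum to 120, which is what the sketch returns.

```python
from itertools import accumulate
from typing import List, Tuple


def qob(freq: List[int], m: int) -> Tuple[int, List[Tuple[int, int]]]:
    """Split values 1..n into m contiguous buckets minimizing the sum of N_B * F_B."""
    n = len(freq)
    prefix = [0] + list(accumulate(freq))

    def cost(i: int, j: int) -> int:
        # Bucket covering values i..j (1-based, inclusive): width times total frequency.
        return (j - i + 1) * (prefix[j] - prefix[i - 1])

    inf = float("inf")
    # best[k][j]: minimum cost of splitting values 1..j into k buckets.
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k, j + 1):  # i is the first value of the rightmost bucket
                c = best[k - 1][i - 1] + cost(i, j)
                if c < best[k][j]:
                    best[k][j], cut[k][j] = c, i
    # Walk the recorded cut points back to recover the bucket boundaries.
    buckets, j = [], n
    for k in range(m, 0, -1):
        i = cut[k][j]
        buckets.append((i, j))
        j = i - 1
    return best[m][n], buckets[::-1]


freq = [4, 4, 4, 10, 10, 6, 4, 4, 2, 2]  # assumed left-to-right bar order
print(qob(freq, 4))  # (120, [(1, 3), (4, 5), (6, 7), (8, 10)])
```

The three nested loops give the O(n^2 M) running time and the two tables the O(nM) space quoted on the slide.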

  11. Outline
     • Optimal data partitioning for range queries
     • Adversarial goals & privacy measures
     • Balancing privacy and precision
     • Experiments & conclusion

  12. Adversary’s learning model
     • A needs to learn bucket properties to estimate sensitive values
     • Model: A’s domain knowledge + sample values from buckets ⇒ A learns the distribution of values in buckets
     • Worst-case assumption for privacy analysis: A knows the exact value distribution for every bucket

  13. Adversarial Goal (I)
     • Individual-centric information, e.g. “What is the salary of an individual I?”
     • Value Estimation Power (VEP) of A
     • Variance of the bucket distribution is an inverse measure of VEP
     [Figure: average error of value estimation for the adversary, plotted over the bucket range. It is large for a bucket with large variance (preferred) and small for a bucket with small variance.]

  14. Adversarial Goal (II)
     • Query-centric information, e.g. “Which individuals have salary ∈ [100k, 150k]?”
     • Set Estimation Power (SEP) of A
     • Entropy of the bucket distribution is an inverse measure of SEP*: H(X) = – ∑_i p_i log p_i
     • Best case: high entropy + large variance
     [Figure: average error of query-set estimation for the adversary over the bucket range, for the query range 100k to 150k. It is large for a bucket with high entropy + large variance and small for a bucket with low entropy + large variance.]
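Both privacy measures, variance against value estimation and entropy against set estimation, can be computed from the value distribution of a single bucket. The sketch below is illustrative only: the function name `bucket_privacy` is hypothetical, the distribution is taken to be the empirical one, and the logarithm base (2, giving bits) is an assumption since the slides do not state it.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def bucket_privacy(values: Iterable[float]) -> Tuple[float, float]:
    """Return (variance, entropy in bits) of a bucket's empirical value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    probs = {v: c / total for v, c in counts.items()}
    mean = sum(v * p for v, p in probs.items())
    variance = sum(p * (v - mean) ** 2 for v, p in probs.items())
    entropy = -sum(p * math.log2(p) for p in probs.values())
    return variance, entropy


# Same value range, very different entropy: the second bucket is dominated by one value.
spread = bucket_privacy(range(1, 11))        # every value 1..10 once
skewed = bucket_privacy([1] * 9 + [10])      # nine 1s and a single 10
print(spread, skewed)
```

The skewed bucket still has a large variance, but its low entropy means the adversary can guess the value of most elements correctly, which is the "low entropy + large variance" case on the slide.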

  15. Outline
     • Optimal data partitioning for range queries
     • Adversarial goals & privacy measures
     • Balancing privacy and precision
     • Experiments & conclusion

  16. Privacy-Precision Trade-off
     • Optimal buckets might offer less privacy than desired
       • Small variance ⇒ partial disclosure of numeric value
       • Small entropy ⇒ total disclosure with high probability (e.g. categorical data); partial detection of query sets (for all cases)
     • Objective: an algorithm that allows trading off a bounded amount of query precision for greater variance and entropy

  17. The controlled diffusion algorithm: a simple observation
     • Let a query Q overlap only with B0
     • If the elements of B0 are distributed into CB1, CB2 & CB3 randomly, Q now overlaps with CB1, CB2 & CB3
     • With the new buckets, the precision for Q drops by a factor of (|CB1|+|CB2|+|CB3|) / |B0|
     • For any re-distribution scheme where this ratio is ≤ K for every Bi, the precision degradation is bounded above by K
     [Figure: bucket B0, overlapped by query Q, diffused into composite buckets CB1, CB2 and CB3.]

  18. Controlled diffusion algorithm
     • Compute optimal buckets B1 … BM on data set D
     • Fix max degradation factor = K
     • Initialize M empty composite buckets CB1 … CBM
     • Set target size of each CB to f_CB = |D|/M (equidepth)
     • For every Bi:
       • select d_i CBs at random, where d_i = K*|Bi|/f_CB
       • diffuse the elements of Bi into these uniformly at random
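The steps on this slide can be sketched directly. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `controlled_diffusion` and the `seed` parameter are hypothetical, d_i is rounded and clamped to [1, M] because the slide does not say how a fractional value is handled, and the target size f_CB is used only to compute d_i rather than enforced as a cap, so this sketch does not by itself guarantee the bound K.

```python
import random
from typing import List, Sequence


def controlled_diffusion(
    optimal_buckets: Sequence[Sequence[int]], k: float, seed: int = 0
) -> List[List[int]]:
    """Diffuse each optimal bucket B_i into d_i randomly chosen composite buckets."""
    rng = random.Random(seed)
    m = len(optimal_buckets)
    total = sum(len(b) for b in optimal_buckets)
    f_cb = total / m  # equidepth target size |D| / M
    composite: List[List[int]] = [[] for _ in range(m)]
    for bucket in optimal_buckets:
        # d_i = K * |B_i| / f_CB, rounded and clamped to [1, M] (an assumption).
        d_i = max(1, min(m, round(k * len(bucket) / f_cb)))
        targets = rng.sample(range(m), d_i)
        for element in bucket:
            # Each element lands in one of the chosen CBs, uniformly at random.
            composite[rng.choice(targets)].append(element)
    return composite


# Optimal buckets built from an assumed histogram over values 1..10.
buckets = [
    [1] * 4 + [2] * 4 + [3] * 4,
    [4] * 10 + [5] * 10,
    [6] * 6 + [7] * 4,
    [8] * 4 + [9] * 2 + [10] * 2,
]
for cb in controlled_diffusion(buckets, k=2):
    print(sorted(cb))
```

Because equal values can land in different composite buckets, the client metadata must record, per composite bucket, which value ranges it may contain.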

  19. Controlled Diffusion (Example)
     • Degradation factor K = 2
     • Metadata size increases from O(M) to O(KM)
     [Figure: top, the query-optimal buckets B1 to B4 over the frequency histogram of values 1 to 10; bottom, the final set of composite buckets CB1 to CB4 stored on the server, each holding elements drawn from several optimal buckets.]

  20. Some features of the diffusion algorithm
     • Many consecutive optimal buckets might get diffused into a common set of CBs ⇒ observed precision degradation < K
     • Elements with the same value can go to multiple buckets
       • Gives it an extra degree of freedom compared to hashing
       • Not best for point queries
     • Random choice in the algorithm
       • Each bucket distribution approaches the data distribution as K increases ⇒ reduces the information the adversary gains by learning buckets

  21. Outline
     • Optimal data partitioning for range queries
     • Adversarial goals & privacy measures
     • Balancing privacy and precision
     • Experiments & conclusion

  22. Experiments
     • Data sets
       • Synthetic data: 10^5 integers in [0, 999], chosen uniformly at random
       • Real data: 10^4 real values in [-0.8, 8.0] from the “Corel Image” dataset (UCI KDD archive)
     • Query workloads (2, of size 10^4 each)
       • End points chosen uniformly at random from the respective ranges

  23. [Plots, one per measure, for composite buckets]
     • Relative decrease in precision
     • Relative increase in standard deviation
     • Relative increase in entropy

  24. Composite buckets (sample)
     [Figure: two sample panels, one for K = 6, M = 350 and one for K = 10, M = 250.]

  25. Visualizing trade-offs for various bucketization parameters
     • E.g. the marked points show the average entropy & precision we get for 100 buckets & a degradation factor of 2
     • The same point in the precision vs. standard deviation trade-off space
     • Provides an easy way to visualize the design space and choose parameters of interest

  26. Summary
     • An optimal algorithm for partitioning data for range queries
     • Statistical measures of data privacy
       • Variance
       • Entropy
     • Fast & simple algorithm for re-bucketizing data
       • Bounded amount of precision degradation
       • Substantial increase in privacy level

  27. Related work
     • Hacigumus et al., SIGMOD 2002, “Executing SQL over Encrypted Data in the Database-Service-Provider Model”.
     • Damiani et al., ACM CCS 2003, “Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs”.
     • Bouganim et al., VLDB 2002, “Chip-Secured Data Access: Confidential Data on Untrusted Servers”.

  28. THANK YOU ! Questions ?

  29. Privacy in DAS
     • Here the goal of “data privacy” is not just ensuring “non-disclosure of identity”. It is more general!
     • DAS
       • Privacy criteria: hide as much information as possible (even at the aggregate level)
       • Utility criteria: maintain only the necessary information required for server-side query evaluation (at the desired degree of accuracy)
     • Privacy-preserving DM & statistical DB
       • Privacy criteria: protect against disclosure of identity
       • Utility criteria: minimize information loss, i.e. maximize utility for data miners and retain as much aggregate-level information as possible

  30. Individual Privacy Measure
     • Average Squared Error of Estimation (ASEE): error in approximating the true value of a r.v. X_B by another r.v. X_B’ (learned by A)
     • ASEE(X_B, X_B’) = Var(X_B) + Var(X_B’) + (E(X_B) – E(X_B’))^2
     • Variance of the bucket distribution, Var(X_B), is our measure of individual privacy (lower bound)
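The ASEE identity follows from expanding the squared difference. The derivation below is a sketch under two assumptions that the slide does not state explicitly: ASEE is the expected squared difference E[(X_B − X_B')^2], and X_B and X_B' are independent.

```latex
\begin{aligned}
\mathrm{ASEE}(X_B, X_B')
  &= E\big[(X_B - X_B')^2\big] \\
  &= E[X_B^2] - 2\,E[X_B]\,E[X_B'] + E[X_B'^2]
     && \text{(independence)} \\
  &= \mathrm{Var}(X_B) + E[X_B]^2 - 2\,E[X_B]\,E[X_B']
     + \mathrm{Var}(X_B') + E[X_B']^2 \\
  &= \mathrm{Var}(X_B) + \mathrm{Var}(X_B') + \big(E[X_B] - E[X_B']\big)^2 .
\end{aligned}
```

Since Var(X_B') and the squared difference of means are both non-negative, ASEE(X_B, X_B') ≥ Var(X_B), which is why the bucket variance acts as a lower bound on the adversary's estimation error.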

  31. Set-oriented Privacy Measure
     • Entropy of the bucket distribution is our measure for query-centric privacy: H(X) = – ∑_i p_i log p_i
     • Measures the uncertainty associated with a r.v. (e.g. the true class of an element for categorical data)
     • An inverse measure of the quality of partial solution sets* that A can derive for a query

  32. Metadata size increase in diffusion
     • The metadata increases from O(M) to
       K*|B1|/f_CB + K*|B2|/f_CB + … + K*|BM|/f_CB = (K/f_CB) * (|B1| + |B2| + … + |BM|) = (KM/|D|) * |D| = O(KM)
