
Privacy Preservation of Aggregates in Hidden Databases: Why and How?

This paper explores strategies for protecting sensitive aggregates in hidden databases, using a novel approach of inserting dummy tuples with CAPTCHA flags. The objective is to make it difficult for adversaries to obtain accurate aggregates while minimizing the reduction of service quality for normal users.

bchristie

Presentation Transcript


  1. Privacy Preservation of Aggregates in Hidden Databases: Why and How? Zhang Kun 2009-12-04

  2. Paper Introduction • Arjun Dasgupta, Nan Zhang, Gautam Das, Surajit Chaudhuri: Privacy preservation of aggregates in hidden databases: why and how? SIGMOD 2009: 153-164 • Arjun Dasgupta • UT Arlington (Computer Science and Engineering Department, The University of Texas at Arlington, http://cse.uta.edu/) • Nan Zhang • George Washington University • Gautam Das • UT Arlington, Arjun Dasgupta’s advisor • Surajit Chaudhuri • Microsoft Research

  3. Outline • Introduction • Strategies for Protecting Sensitive Aggregates • A Generic Defense Model • Experimental Results • Conclusion

  4. Hidden Databases • Form-like interface • Return top-k tuples

  5. Search Queries vs. Aggregate Queries • Search Queries • SELECT * FROM D WHERE ac1=vc1&…&acu=vcu • Answered by the hidden database under a top-k restriction • A query is overflowing (more than k matching tuples), valid (1 to k matching tuples), or underflowing (0 matching tuples) • Aggregate Queries • SELECT AGGR(*) FROM D WHERE ac1=vc1&…&acu=vcu • e.g., How many copies of “Thinking in Java” are left at DangDang? • Cannot be answered through the public web interface
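
The three query outcomes on this slide can be illustrated with a minimal sketch. The function name `run_search`, the dictionary-based row layout, and the toy database are assumptions made for illustration; no ranking function is modelled, so the first k matches simply stand in for the top-k result.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def run_search(db: List[Row], predicates: Dict[str, int], k: int) -> Tuple[List[Row], str]:
    """Answer a conjunctive search query under a top-k restriction.

    Returns the (at most k) tuples shown to the user and the query status.
    """
    matches = [row for row in db if all(row[a] == v for a, v in predicates.items())]
    if not matches:
        status = "underflowing"   # 0 matching tuples
    elif len(matches) <= k:
        status = "valid"          # 1..k matching tuples, all of them returned
    else:
        status = "overflowing"    # more than k matches, only k are returned
    # Hypothetical simplification: the first k matches stand in for the ranked top-k.
    return matches[:k], status


# Toy database (hypothetical values) queried with k = 1.
toy_db = [
    {"a1": 0, "a2": 0, "a3": 1},
    {"a1": 0, "a2": 1, "a3": 1},
    {"a1": 1, "a2": 1, "a3": 0},
]
print(run_search(toy_db, {"a3": 1}, k=1)[1])
print(run_search(toy_db, {"a1": 1}, k=1)[1])
print(run_search(toy_db, {"a1": 1, "a3": 1}, k=1)[1])
```

An aggregate such as COUNT is never returned directly; an adversary can only infer it from the statuses and tuples of many such search queries.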

  6. Sampling Techniques for Discovering Aggregates • HIDDEN-DB-SAMPLER and HYBRID-SAMPLER, proposed in our previous work [DDM07, DZD09], provide efficient sampling with very small bias. The retrieved samples can then be used for aggregate estimation. • Liu Wei, Meng Xiaofeng, and Ling Yanyan proposed a graph-based approach for web database sampling (Journal of Software, Vol. 19, No. 2, 2008).

  7. WDB-Sampler: Web Database Sampling Process

  8. Privacy Concerns over Aggregate Information • Book dealership (commercial competition) • Which books have a low inventory of fewer than 30 copies? • Reason: If a competitor knows such aggregate information, it can take advantage of the low inventory through a multitude of tactics (e.g., stocking those books or adjusting prices) • Flight occupancy (homeland security) • Which flights, on what dates, are likely to be relatively empty? • Reason: In the 9/11 attacks, the terrorists’ tactic is believed to have been to hijack relatively empty flights because there would be less resistance from occupants.

  9. Protecting Sensitive Aggregates Over Hidden Databases • A novel problem of protecting sensitive aggregates over hidden web databases • Reveal individual tuples truthfully and efficiently • Tuplewise privacy preservation techniques such as encryption, data perturbation, and generalization methods cannot be used in this scenario (as they are in SaaS) • But hide aggregated views of the data • Comparison with existing work • Most existing work focuses on protecting individual tuples while disclosing aggregates • The inverse of our problem • There exists work that recognizes the possible sensitivity of aggregate information • However, its objective is not to reveal individual tuples truthfully while protecting the sensitive aggregates

  10. Problem Statement • The objective of aggregate suppression is to • Make it very difficult for any adversary to obtain an accurate estimate of aggregates via the search interface, and • Minimize the reduction of service quality for search queries issued by normal users. • Normal users may include • Human end-users • Third-party web mashup applications

  11. Outline • Introduction • Strategies for Protecting Sensitive Aggregates • A Generic Defense Model • Experimental Results • Conclusion

  12. Naïve Strategies (I) • Audit query history for each user, IP address, etc • Problem: distributed attacks

  13. Naïve Strategies (II) • CAPTCHA challenge before submitting or answering a query • Problem: significant inconvenience to end users; completely disables third-party applications that use the public interface

  14. Naïve Strategies (III) • Use machine-incomprehensible responses (e.g., an image instead of text for attribute values) • Problem: overhead on the server side; disables third-party applications that use the public interface

  15. Our Strategy: Insertion of Dummy Tuples with CAPTCHA Flags • Insert dummy tuples, and associate each tuple with a CAPTCHA flag indicating whether it is real or dummy

  16. Justification: Why Dummy Tuples? • Requirements of truthfully revealing individual tuples • No change of existing tuple values • No removal of existing tuples • Only choice left: insert dummy tuples • Key observation: All existing sampling attacks retrieve samples from valid queries only. • Overflowing queries are usually overlooked because their results (top-k tuples) are preferentially selected by a ranking function, and hence cannot be assumed to be random • How about inserting dummy tuples to make valid queries overflow?

  17. Dummy Oracle • DUMMY ORACLE (m0, C) • Generates dummy tuples that • Satisfy the search condition C • Cannot be identified as dummies by the adversary • Terminates when • Either no more tuples satisfying the above conditions can be generated, • Or m0 dummy tuples have been generated • Whichever occurs first • Constructing the dummy oracle requires proper modeling of external knowledge and is left as an open problem
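
The termination behaviour of the oracle can be sketched as follows. This is only an illustration of the interface, not a real oracle: the requirement that a dummy cannot be identified by the adversary is crudely replaced by the assumption that any Boolean tuple not already in the database is acceptable. The function name `dummy_oracle`, the index-based condition format, and the Boolean domain are all assumptions.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

BoolTuple = Tuple[int, ...]


def dummy_oracle(m0: int, condition: Dict[int, int], n: int,
                 existing: Set[BoolTuple]) -> List[BoolTuple]:
    """Return up to m0 Boolean tuples over n attributes that satisfy `condition`.

    `condition` maps attribute index -> required value.
    ASSUMPTION: "not already in the database" stands in for
    "cannot be identified as a dummy"; the slides leave that part open.
    """
    free = [i for i in range(n) if i not in condition]
    dummies: List[BoolTuple] = []
    for values in product((0, 1), repeat=len(free)):
        if len(dummies) >= m0:          # stop: m0 dummies generated
            break
        candidate = [0] * n
        for i, v in condition.items():  # fixed by the search condition C
            candidate[i] = v
        for i, v in zip(free, values):  # remaining attributes enumerated
            candidate[i] = v
        tup = tuple(candidate)
        if tup not in existing:
            dummies.append(tup)
    return dummies                      # may be shorter than m0 if candidates ran out


print(dummy_oracle(2, {0: 0}, n=3, existing={(0, 0, 0)}))
```

Asking for more tuples than exist under the condition returns only the available ones, which mirrors the "whichever occurs first" rule.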

  18. Justification: Why CAPTCHA Flags? • A flag is necessary because individual tuples must be disclosed truthfully • CAPTCHA flags exploit the different requirements of a search user, a mashup application, and an adversary • A bona fide search user issues a small number of search queries, so manual interpretation of CAPTCHA flags is tolerable • A web mashup application can directly push the flag to its end users, thereby maintaining its usability • However, an adversary requires a relatively large number of queries for aggregate estimation, so CAPTCHA flags become a major obstacle for the adversary

  19. Technical Problem Definition • Problem definition: Given the attacking-cost limit $u_{\max}$ and a set of sensitive aggregate queries, the objective of dummy insertion is to achieve an (ε,δ,p)-privacy guarantee for each aggregate query while minimizing the number of inserted dummy tuples. • Attacking-cost limit $u_{\max}$: the maximum number of search queries that an adversary can issue. • Why minimize the number of inserted dummy tuples? • Dummy tuples naturally lead to degradation of service quality

  20. (ε,δ)-privacy game • Played between a user and the hidden database owner for a sensitive aggregate query QA • The owner chooses its defensive scheme • The user issues search queries and analyzes their results to try to estimate Res(QA) (the result of the aggregate query QA) • The user wins if ∃x such that the user has confidence > δ that Res(QA) ∈ [x, x+ε]. Otherwise the user loses.

  21. (ε,δ,p)-privacy guarantee • We say that a defensive scheme achieves an (ε,δ,p)-privacy guarantee for QA iff for any user A, Pr{A wins the (ε,δ)-privacy game for QA} ≤ p

  22. Outline • Introduction • Strategies for Protecting Sensitive Aggregates • A Generic Defense Model • Experimental Results • Conclusion

  23. Key Observations for Single Sample Tuple Attack • Any smart sampler should issue queries that maximize the shrinkage of the search space for valid queries • Valid queries as well as long overflowing queries contribute the most to shrinking the space • Fortunately, long queries (both overflowing and valid) are difficult to find before dummy insertion. • For a Boolean database, among the $2^c \binom{n}{c}$ possible c-predicate queries, the total number of valid and overflowing queries is at most $m \binom{n}{c}$; thus the probability of choosing one at random is no more than $m/2^c$, which is extremely small when c is large. • Thus, short valid queries become the main threat for the defense.
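
The counting argument on this slide can be checked numerically with a short sketch. The function name `long_query_hit_bound` and the sample values of m, n, and c are assumptions chosen for illustration; only the two counts stated on the slide are used.

```python
from math import comb


def long_query_hit_bound(m: int, n: int, c: int) -> float:
    """Upper bound on the chance that a random c-predicate Boolean query
    is valid or overflowing, for m tuples over n attributes."""
    total_queries = 2 ** c * comb(n, c)   # choose c attributes, then a 0/1 value for each
    non_empty_max = m * comb(n, c)        # each tuple satisfies one query per attribute set
    return min(1.0, non_empty_max / total_queries)


# Illustrative values: 100,000 tuples over 30 attributes.
for c in (10, 20, 25):
    print(c, long_query_hit_bound(100_000, 30, c))
```

The binomial factor cancels, so the bound reduces to m / 2^c: with 100,000 tuples it is vacuous for c = 10 (capped at 1.0) but already below 0.3% for c = 25.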

  24. Neighbor Insertion • b-Neighbor Insertion: add dummy tuples such that all valid queries with fewer than b predicates will overflow • Approach: insert dummy tuples into the “neighboring zone” of real tuples (i.e., sharing the same values on a large number of attributes)

  25. Example Neighbor Insertion • A Boolean database with m (m=7) unique tuples on n attributes. Only one tuple, t1=<0,0,1,0,1>, satisfies a1=a2=0.

  26. Before and After Dummy Tuple Insertion • Dummy tuple <0,0,0,0,0> in the neighboring zone of t1=<0,0,0,0,1> • b-neighbor insertion (b=n=5) • Short valid queries -> short overflowing queries
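
A brute-force sketch of the b-neighbor idea follows. It is not the paper's algorithm: it simply enumerates every query with 1 to b-1 predicates, and for each one that is valid on the real data it adds dummies taken from the neighboring zone of a real matching tuple (fewest attribute flips first) until the query overflows. The function names, the index-based query format, and the toy database are assumptions, and every unused Boolean tuple is assumed to be an acceptable dummy.

```python
from itertools import combinations, product
from typing import Dict, Iterator, List, Tuple

BoolTuple = Tuple[int, ...]
Query = Dict[int, int]  # attribute index -> required value


def matches(db: List[BoolTuple], query: Query) -> List[BoolTuple]:
    return [t for t in db if all(t[i] == v for i, v in query.items())]


def short_queries(n: int, max_predicates: int) -> Iterator[Query]:
    """Yield every conjunctive query with 1..max_predicates predicates."""
    for c in range(1, max_predicates + 1):
        for attrs in combinations(range(n), c):
            for vals in product((0, 1), repeat=c):
                yield dict(zip(attrs, vals))


def neighbors(base: BoolTuple, free: List[int]) -> Iterator[BoolTuple]:
    """Tuples that differ from `base` only on `free` attributes, closest first."""
    for flips in range(1, len(free) + 1):
        for idxs in combinations(free, flips):
            cand = list(base)
            for i in idxs:
                cand[i] ^= 1
            yield tuple(cand)


def neighbor_insertion(real: List[BoolTuple], b: int, k: int) -> List[BoolTuple]:
    """Return dummies that make every query valid on `real` with < b predicates overflow."""
    n = len(real[0])
    db, present = list(real), set(real)
    for q in short_queries(n, b - 1):
        real_hits = matches(real, q)
        if not 1 <= len(real_hits) <= k:
            continue                      # only queries that are valid on the real data
        needed = k + 1 - len(matches(db, q))
        free = [i for i in range(n) if i not in q]
        for cand in neighbors(real_hits[0], free):
            if needed <= 0:
                break
            if cand not in present:       # ASSUMPTION: any unused tuple is a usable dummy
                present.add(cand)
                db.append(cand)
                needed -= 1
    return db[len(real):]


# Hypothetical database in which only the first tuple satisfies a1 = a2 = 0.
toy = [(0, 0, 1, 0, 1), (1, 0, 1, 1, 0), (1, 1, 0, 1, 0), (0, 1, 1, 1, 1)]
print(neighbor_insertion(toy, b=3, k=1))
```

Because dummy insertion only ever increases match counts, a query that overflows after one step stays overflowing for the rest of the pass.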

  27. Key Observations for Multi Sample Tuples Attack • Key Observations: • Shrinkage by underflow is permanent • Shrinkage by overflow is mostly temporary • Thus, short underflowing queries become a very dangerous threat

  28. High-Level Packing • d-Level Packing: add dummy tuples such that all underflowing queries with fewer than d predicates will overflow • Approach: “pack” short underflowing queries with dummy tuples

  29. Example High-Level Packing • A Boolean database with $m=2^l$ unique tuples on n attributes. All tuples satisfy $a_1=a_2=\dots=a_{n-l}=0$ • In this case, for any $b \in [1,l]$, b-neighbor insertion will not insert any dummy tuples. • Concrete instance: n=5, l=3, m=8 • A Boolean database with $m=2^l=8$ unique tuples on n=5 attributes. All tuples satisfy $a_1=a_2=0$ • In this case, for any $b \in [1,3]$, b-neighbor insertion will not insert any dummy tuples.

  30. Before and After Dummy Tuple Insertion • Insert two dummy tuples, <1,0,0,0,0> and <1,1,0,0,1> • Short underflowing queries -> short overflowing queries
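
The packing step can be sketched in the same brute-force spirit. Again, this is an illustration under stated assumptions rather than the paper's procedure: every query with 1 to d-1 predicates that underflows on the real data is filled with unused Boolean tuples until it returns more than k matches. The function names, the choice of k, and the order in which candidates are tried are hypothetical, so the dummies it picks differ from the two shown on the slide.

```python
from itertools import combinations, product
from typing import List, Sequence, Tuple

BoolTuple = Tuple[int, ...]


def count(db: Sequence[BoolTuple], query: Sequence[Tuple[int, int]]) -> int:
    """Number of tuples in db satisfying every (attribute index, value) predicate."""
    return sum(all(t[i] == v for i, v in query) for t in db)


def level_packing(real: List[BoolTuple], d: int, k: int) -> List[BoolTuple]:
    """Return dummies so that queries underflowing on `real` with < d predicates overflow."""
    n = len(real[0])
    db = list(real)
    for c in range(1, d):
        for attrs in combinations(range(n), c):
            for vals in product((0, 1), repeat=c):
                query = list(zip(attrs, vals))
                if count(real, query) > 0:
                    continue                      # not underflowing on the real data
                free = [i for i in range(n) if i not in attrs]
                for fill in product((0, 1), repeat=len(free)):
                    if count(db, query) > k:
                        break                     # query now overflows
                    cand = [0] * n
                    for i, v in query:
                        cand[i] = v
                    for i, v in zip(free, fill):
                        cand[i] = v
                    if tuple(cand) not in db:     # ASSUMPTION: any unused tuple is a usable dummy
                        db.append(tuple(cand))
    return db[len(real):]


# The setting of the example: 8 unique tuples on 5 attributes, all with a1 = a2 = 0.
packed_example = [(0, 0) + rest for rest in product((0, 1), repeat=3)]
print(level_packing(packed_example, d=2, k=1))
```

With d=2 and k=1, only the single-predicate queries a1=1 and a2=1 underflow, and each receives two dummies.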

  31. Algorithm COUNTER-SAMPLER • Usually, d<b • Valid queries are more dangerous than underflowing queries! • Step 1. d-level packing • Short underflowing queries -> short overflowing queries • Step 2. b-neighbor insertion • Short valid queries -> short overflowing queries • Time Complexity • $O\left(\binom{n}{d-1}\cdot\max(2^d, m) + \binom{n}{b-1}\cdot m\right)$

  32. Privacy Guarantee • For a Boolean hidden database with m tuples, when all samplers have an attacking-cost limit $u_{\max}$, for any COUNT query with answer in [x,y], the hidden database owner achieves an (ε,δ,50%)-privacy guarantee if COUNTER-SAMPLER has been executed with parameters b and d that satisfy • $d \ge \log_2 m + 1$ and $3^{d-1}/(d+1) \ge u_{\max}$ • $b \ge d + \dfrac{3\varepsilon^2 u_{\max}}{32\,\min(x(m-x),\,y(m-y))\,(\operatorname{erf}^{-1}(\delta))^2}$ • where $\operatorname{erf}^{-1}(\cdot)$ is the inverse error function
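
To get a feel for the parameter sizes these conditions imply, the sketch below finds the smallest integers d and b that satisfy them. It assumes the conditions read d ≥ log2(m) + 1, 3^(d-1)/(d+1) ≥ u_max, and b ≥ d + 3ε²·u_max / (32·min(x(m-x), y(m-y))·erfinv(δ)²); the superscripts are flattened in the transcript, so this reading is itself an assumption. The function names are hypothetical, and the inverse error function is derived from the standard normal quantile in Python's `statistics` module.

```python
from math import ceil, log2, sqrt
from statistics import NormalDist


def erfinv(x: float) -> float:
    """Inverse error function for 0 < x < 1, via the standard normal quantile."""
    return NormalDist().inv_cdf((x + 1) / 2) / sqrt(2)


def min_d(m: int, u_max: int) -> int:
    """Smallest integer d with d >= log2(m) + 1 and 3^(d-1)/(d+1) >= u_max."""
    d = max(1, ceil(log2(m) + 1))
    while 3 ** (d - 1) / (d + 1) < u_max:
        d += 1
    return d


def min_b(d: int, eps: float, delta: float, u_max: int, m: int, x: int, y: int) -> int:
    """Smallest integer b satisfying the (assumed) second condition.

    Requires 0 < x, y < m so that the denominator is positive.
    """
    spread = min(x * (m - x), y * (m - y))
    return ceil(d + 3 * eps ** 2 * u_max / (32 * spread * erfinv(delta) ** 2))


# Illustrative values only: 100,000 tuples, at most 1,000 adversarial queries.
d = min_d(100_000, 1_000)
print(d, min_b(d, eps=1_000, delta=0.9, u_max=1_000, m=100_000, x=40_000, y=60_000))
```

For a large m the logarithmic condition dominates d, while for a small database the exponential condition on u_max takes over.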

  33. Outline • Introduction • Strategies for Protecting Sensitive Aggregates • A Generic Defense Model • Experimental Results • Conclusion

  34. Experimental Setup • Datasets • Boolean 0.3: 100,000 tuples and 30 attributes. Each attribute has probability p=0.3 of being 1 • Boolean Mixed: 30 independent attributes, 5 have probability p=0.5 of being 1, 10 have p=0.3, the other 10 have p=0.1 • Categorical Census: 1990 US Census Adult data published on the UCI Data Mining archive [HB99]. Highest domain size: 92 categories; lowest: Boolean • Attacking Techniques • HIDDEN-DB-SAMPLER [DDM07] • HYBRID-SAMPLER [DZD09]
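
A dataset in the style of Boolean 0.3 can be generated with a few lines. This is a sketch of the description above, not the authors' generator: the function name `boolean_dataset` and the seed are assumptions, attributes are drawn independently, and duplicate tuples are not filtered out.

```python
import random
from typing import List, Tuple


def boolean_dataset(m: int, n: int, p: float, seed: int = 0) -> List[Tuple[int, ...]]:
    """Generate m Boolean tuples over n attributes, each attribute being 1 with probability p."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [tuple(int(rng.random() < p) for _ in range(n)) for _ in range(m)]


# A small instance; the slide's Boolean 0.3 setting would use m=100_000, n=30, p=0.3.
sample = boolean_dataset(m=1_000, n=30, p=0.3)
ones = sum(map(sum, sample)) / (1_000 * 30)
print(len(sample), len(sample[0]), round(ones, 3))
```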

  35. HIDDEN-DB-SAMPLER

  36. HYBRID-SAMPLER

  37. Outline • Introduction • Strategies for Protecting Sensitive Aggregates • A Generic Defense Model • Experimental Results • Conclusion

  38. Conclusion • Main Contribution 1: A Novel Problem • Reveal individual tuples truthfully and efficiently • But hide aggregated views of the data • An urgent challenge for hidden database owners • Main Contribution 2: COUNTER-SAMPLER • The insertion of dummy tuples with CAPTCHA flags • Minimum disruption to end users and third-party web mashup applications • Provides a privacy guarantee against sampling attacks. Increases by up to an order of magnitude the number of queries required by existing sampling techniques.

  39. A Broader Picture • Solution space for privacy-preserving strategies • Back-end hidden databases: this paper (dummy tuple insertion) • Query processing module – Future work • Front-end interface – Future work

  40. Some Drawbacks • Lack of support for dynamic data operations • COUNTER-SAMPLER needs to be executed once as a pre-processing step for a static database • Lack of support for relatively large databases • It assumes that the actual number of tuples m is much smaller than the space of possible tuples, which is at least $2^n$ in Boolean databases with n attributes. • For a Boolean hidden database, how can dummy tuples be inserted when m is close to $2^n$?
