Privacy Preservation of Aggregates in Hidden Databases: Why and How?

Privacy Preservation of Aggregates in Hidden Databases: Why and How? Zhang Kun 2009-12-04

Paper Introduction • Arjun Dasgupta, Nan Zhang, Gautam Das, Surajit Chaudhuri: Privacy preservation of aggregates in hidden databases: why and how? SIGMOD 2009: 153-164 • Arjun Dasgupta • UT Arlington (Computer Science and Engineering Department, The University of Texas at Arlington, http://cse.uta.edu/ ) • Nan Zhang • George Washington University • Gautam Das • UT Arlington, Arjun Dasgupta’s advisor • Surajit Chaudhuri • Microsoft Research

Outline • Introduction • Strategies for Protecting Sensitive Aggregates • A Generic Defense Model • Experimental Results • Conclusion

Hidden Databases • Form-like interface • Return top-k tuples

Search Queries vs. Aggregate Queries • Search Queries • SELECT * FROM D WHERE ac1=vc1 AND … AND acu=vcu • Answered by the hidden database under a top-k restriction • Overflowing (more than k matching tuples), valid (1 to k tuples), and underflowing (0 tuples) queries • Aggregate Queries • SELECT AGGR(*) FROM D WHERE ac1=vc1 AND … AND acu=vcu • e.g., How many copies of “Thinking in Java” are left at DangDang? • Cannot be answered through the public web interface
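The three query outcomes can be illustrated with a minimal sketch of a top-k search interface. The function name `search`, the 0-based attribute indices, and the toy table are assumptions made for illustration; a real site would choose the k returned tuples with a ranking function, which this sketch replaces with storage order.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, ...]

def search(db: List[Row], cond: Dict[int, int], k: int) -> Tuple[str, List[Row]]:
    """Answer a conjunctive query {attribute index: value}; return (status, at most k tuples)."""
    matches = [t for t in db if all(t[a] == v for a, v in cond.items())]
    if not matches:
        return "underflow", []
    if len(matches) > k:
        # Assumption: storage order stands in for the site's ranking function.
        return "overflow", matches[:k]
    return "valid", matches

# Hypothetical three-tuple Boolean table.
db = [(0, 0, 1), (0, 1, 1), (1, 1, 0)]
for cond in ({0: 0}, {0: 1}, {0: 1, 1: 0}):
    print(cond, search(db, cond, k=1))
```

Only the valid case returns the complete answer set, which is why the sampling attacks described later rely on valid queries.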

Sampling Techniques for Discovering Aggregates • HIDDEN-DB-SAMPLER and HYBRID-SAMPLER, proposed in the authors' previous work [DDM07, DZD09], provide efficient sampling with very small bias. The retrieved samples can then be used for aggregate estimation. • Liu Wei, Meng Xiaofeng, and Ling Yanyan proposed a graph-based approach for web database sampling (Journal of Software, Vol. 19, No. 2, 2008).

WDB-Sampler: Web Database Sampling Process

Privacy Concerns over Aggregate Information • Book dealership (commercial competition) • Which books have a low inventory of fewer than 30 copies? • Reason: If a competitor knows such aggregate information, it can take advantage of the low inventory through a multitude of tactics (e.g., stocking those books, adjusting prices) • Flight Occupancy (homeland security) • Which flights, on what dates, are likely to be relatively empty? • Reason: In the 9/11 attacks, the terrorists’ tactic is believed to have been to hijack relatively empty flights because there would be less resistance from occupants.

Protecting Sensitive Aggregates Over Hidden Databases • A novel problem of protecting sensitive aggregates over hidden web databases • Reveal individual tuples truthfully and efficiently • Tuplewise privacy preservation techniques such as encryption, data perturbation, and generalization methods cannot be used in this scenario (unlike in SaaS) • But hide aggregated views of the data • Comparison with existing work • Most existing work focuses on protecting individual tuples while disclosing aggregates • The inverse of our problem • There exists work that recognizes the possible sensitivity of aggregate information • However, its objective is not to reveal individual tuples truthfully while protecting the sensitive aggregates

Problem Statement • The objective of aggregate suppression is to • Make it very difficult for any adversary to obtain an accurate estimate of aggregates via the search interface, and • Minimize the reduction of service quality for search queries issued by normal users. • Normal users may include • Human end-users • Third-party web mashup applications

Naïve Strategies (I) • Audit query history for each user, IP address, etc • Problem: distributed attacks

Naïve Strategies (II) • CAPTCHA challenge before submitting or answering a query • Problem: significant inconvenience to end users; completely disables third-party applications that use the public interface

Naïve Strategies (III) • Use machine-incomprehensible responses (e.g., images instead of text for attribute values) • Problem: overhead on the server side; disables third-party applications that use the public interface

Our Strategy: Insertion of Dummy Tuples with CAPTCHA flags • Insert dummy tuples, and associate each tuple with a CAPTCHA flag indicating whether it is real or dummy

Justification: Why Dummy Tuples? • Requirements of truthfully revealing individual tuples • No change of existing tuple values • No removal of existing tuples • Only choice left: insert dummy tuples • Key observation: All existing sampling attacks retrieve samples from valid queries only. • Overflowing queries are usually overlooked because their results (top-k tuples) are preferentially selected by a ranking function, and hence cannot be assumed to be random • How about inserting dummy tuples to make valid queries overflow?

Dummy Oracle • DUMMY ORACLE (m0, C) • Generates dummy tuples that • Satisfy the search condition C • Cannot be identified as dummies by the adversary • Terminates when • Either no more tuples satisfying the above conditions can be generated, • Or m0 dummy tuples have been generated • Whichever occurs first • How to construct the dummy oracle requires proper modeling of external knowledge and is left as an open problem
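The termination behavior of DUMMY ORACLE (m0, C) can be sketched for a small Boolean database as follows. This is only an illustration under strong assumptions: the function name `dummy_oracle` and its extra parameters `db` and `n` are hypothetical, candidates are enumerated exhaustively (feasible only for tiny n), and the requirement that a dummy be indistinguishable from a real tuple is not modeled at all, since the slides leave it as an open problem.

```python
from itertools import product

def dummy_oracle(m0, cond, db, n):
    """Return up to m0 Boolean tuples over n attributes that satisfy cond
    ({attribute index: value}) and do not already appear in db."""
    existing = set(db)
    dummies = []
    for t in product((0, 1), repeat=n):
        if len(dummies) == m0:
            break  # stop: m0 dummy tuples have been generated
        if t not in existing and all(t[a] == v for a, v in cond.items()):
            dummies.append(t)
    return dummies  # shorter than m0 if the candidates ran out first

# Hypothetical table: only two unused tuples satisfy a1 = 0 (index 0).
db = [(0, 0, 1), (0, 1, 1)]
print(dummy_oracle(5, {0: 0}, db, n=3))
print(dummy_oracle(1, {0: 0}, db, n=3))
```

The first call stops because no further candidates exist, the second because the m0 limit is reached, matching the "whichever occurs first" rule.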

Justification: Why CAPTCHA Flags? • Flags are necessary because individual tuples must be disclosed truthfully • CAPTCHA flags exploit the different requirements of a search user, a mashup application, and an adversary • A bona fide search user issues a small number of search queries – manual interpretation of CAPTCHA flags is tolerable • A web mashup application can directly push the flag to its end users – thereby maintaining its usability • However, an adversary requires a relatively large number of queries for aggregate estimation – CAPTCHA flags become a major obstacle for the adversary

Technical Problem Definition • Problem definition: Given the attacking-cost limit u_max and a set of sensitive aggregate queries, the objective of dummy insertion is to achieve an (ε,δ,p)-privacy guarantee for each aggregate query while minimizing the number of inserted dummy tuples. • Attacking-cost limit u_max: the maximum number of search queries that an adversary can issue. • Why minimize the number of inserted dummy tuples? • Dummy tuples naturally lead to degradation of service quality

(ε,δ)-privacy game • (ε,δ)-privacy game • Between a user and the hidden database owner for a sensitive aggregate query QA • The owner chooses its defensive scheme • The user issues search queries and analyzes their results to try to estimate Res(QA) (the result of the aggregate query QA) • The user wins if ∃x such that the user has confidence > δ that Res(QA) ∈ [x, x+ε]. Otherwise the user loses.

(ε,δ,p)-privacy guarantee • We say that a defensive scheme achieves an (ε,δ,p)-privacy guarantee for QA iff for any user A, Pr{A wins the (ε,δ)-privacy game for QA} ≤ p

Key Observations for Single Sample Tuple Attack • Any smart sampler should issue queries that maximize the shrinkage of the search space for valid queries • Valid queries as well as long overflowing queries contribute the most to shrinking the space • Fortunately, long queries (both overflowing and valid) are difficult to find before dummy insertion. • For a Boolean database, among the 2^c · C(n,c) c-predicate queries, the total number of valid and overflowing queries is at most m · C(n,c); thus the probability of choosing one is no more than m/2^c, which is extremely small when c is large. • Thus, short valid queries become the main threat for the defense.
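The m/2^c bound follows from a counting argument: each tuple satisfies exactly one value assignment per choice of c attributes, so it matches exactly C(n,c) of the c-predicate queries. A short worked calculation makes the scale concrete. The function name is hypothetical, and the values m = 100,000 and n = 30 are borrowed from the Boolean 0.3 dataset in the experimental setup purely as an example.

```python
from math import comb

def nonempty_query_fraction_bound(m: int, n: int, c: int) -> float:
    """Upper bound on the fraction of c-predicate queries that are valid or overflowing."""
    total_queries = 2**c * comb(n, c)   # choose c attributes, then one Boolean value for each
    max_nonempty = m * comb(n, c)       # each tuple matches exactly C(n,c) such queries
    return min(1.0, max_nonempty / total_queries)

for c in (10, 20, 30):
    print(c, nonempty_query_fraction_bound(100_000, 30, c))
```

The C(n,c) factor cancels, so the bound depends only on m and c and halves with every additional predicate.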

Neighbor Insertion • b-Neighbor Insertion: add dummy tuples such that all valid queries with fewer than b predicates will overflow • Approach: insert dummy tuples into the “neighboring zone” of real tuples (i.e., sharing the same values on a large number of attributes)

Example Neighbor Insertion • A Boolean database with m (m=7) unique tuples on n attributes. Only one tuple, t1=<0,0,1,0,1>, satisfies a1=a2=0.

Before dummy tuple After dummy tuple • Dummy tuple <0,0,0,0,0> in the neighboring zone of t1=<0,0,0,0,1> • b-Neighbor Insertion (b=n=5) • Short valid queries -> short overflowing queries
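The effect of a single neighboring dummy tuple can be sketched as follows. This is not the paper's insertion algorithm: the table, the value k = 1, and the helper names are assumptions, and "neighboring zone" is simplified to tuples that differ from a real tuple in exactly one attribute.

```python
def count_matches(db, cond):
    """Number of tuples in db satisfying the conjunctive condition {attribute index: value}."""
    return sum(all(t[a] == v for a, v in cond.items()) for t in db)

def neighbors(t, db):
    """Tuples that differ from t in exactly one attribute and are not already in db."""
    existing = set(db)
    out = []
    for i in range(len(t)):
        cand = t[:i] + (1 - t[i],) + t[i + 1:]
        if cand not in existing:
            out.append(cand)
    return out

k = 1  # assumed top-k limit
# Hypothetical table in which only the first tuple satisfies a1 = a2 = 0.
db = [(0, 0, 1, 0, 1), (1, 0, 0, 1, 0), (0, 1, 1, 0, 0), (1, 1, 0, 0, 1)]
cond = {0: 0, 1: 0}  # a1 = 0 AND a2 = 0, using 0-based indices

before = count_matches(db, cond)
# Pick the first neighbor of the real tuple that still satisfies the query.
dummy = next(c for c in neighbors(db[0], db) if count_matches([c], cond) == 1)
after = count_matches(db + [dummy], cond)
print(before, dummy, after, "overflow" if after > k else "valid")
```

Because the dummy shares most attribute values with the real tuple, it also lands in the result of other short queries that the real tuple satisfies, which is what pushes them past k.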

Key Observations for Multi Sample Tuples Attack • Key Observations: • Shrinkage by underflow is permanent • Shrinkage by overflow is mostly temporary • Thus, short underflowing queries become a very dangerous threat

High-Level Packing • d-Level Packing: add dummy tuples such that all underflowing queries with fewer than d predicates will overflow • Approach: “pack” short underflowing queries with dummy tuples

Example High-Level Packing • A Boolean database with m=2^l unique tuples on n attributes. All tuples satisfy a1=a2=…=a(n-l)=0 • In this case, for any b ∈ [1,l], b-neighbor insertion will not insert any dummy tuples. • n=5, l=3, m=8 • A Boolean database with m=2^l=8 unique tuples on n=5 attributes. All tuples satisfy a1=a2=0 • In this case, for any b ∈ [1,3], b-neighbor insertion will not insert any dummy tuples.

Before dummy tuple After dummy tuple • Insert two dummy tuples <1,0,0,0,0> and <1,1,0,0,1> • Short underflowing queries -> short overflowing queries
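The claim in the n=5, l=3, m=8 example can be checked by brute force. The sketch below enumerates every query with fewer than l predicates and classifies it. The top-k limit k = 1 is an assumption (the slides do not state k), as are the function names and the 0-based attribute indices.

```python
from itertools import combinations, product

def classify(db, cond, k):
    """cond is a tuple of (attribute index, value) pairs."""
    hits = sum(all(t[a] == v for a, v in cond) for t in db)
    return "underflow" if hits == 0 else "valid" if hits <= k else "overflow"

def short_queries(db, n, k, max_preds):
    """Group all queries with 1..max_preds predicates by their outcome."""
    out = {"underflow": [], "valid": [], "overflow": []}
    for c in range(1, max_preds + 1):
        for attrs in combinations(range(n), c):
            for vals in product((0, 1), repeat=c):
                cond = tuple(zip(attrs, vals))
                out[classify(db, cond, k)].append(cond)
    return out

n, l, k = 5, 3, 1  # k = 1 is an assumption
# All 2^l tuples with a1 = a2 = 0 and every combination of the last l attributes.
db = [(0, 0) + rest for rest in product((0, 1), repeat=l)]
report = short_queries(db, n, k, max_preds=l - 1)

print("short valid queries:", report["valid"])
print("one-predicate underflows:", [q for q in report["underflow"] if len(q) == 1])
```

Under these assumptions there are no short valid queries, so neighbor insertion has nothing to do, while the short underflowing queries on the first two attributes are exactly what packing has to cover.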

Algorithm COUNTER-SAMPLER • Usually, d<b • Valid queries are more dangerous than underflows! • Step 1. d-level packing • Short underflowing queries -> short overflowing queries • Step 2. b-neighbor insertion • Short valid queries -> short overflowing queries • Time Complexity • O(C(n,d-1) · max(2^d, m) + C(n,b-1) · m)

Privacy Guarantee • For a Boolean hidden database with m tuples, when all samplers have an attacking-cost limit u_max, for any COUNT query with answer in [x,y], the hidden database owner achieves an (ε,δ,50%)-privacy guarantee if COUNTER-SAMPLER has been executed with parameters b and d which satisfy • d ≥ log_2(m) + 1 and 3^(d-1)/(d+1) ≥ u_max • b ≥ d + 3ε^2 · u_max / (32 · min(x(m-x), y(m-y)) · (erf^-1(δ))^2) • where erf^-1(·) is the inverse error function
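As a rough illustration, the two conditions can be turned into a small parameter calculator. The sketch rests on one reading of the flattened slide formulas, namely 3^(d-1)/(d+1) ≥ u_max and b ≥ d + 3ε²·u_max / (32·min(x(m−x), y(m−y))·(erf⁻¹(δ))²); that reading, the function names, and the example numbers are all assumptions and should be checked against the paper. The inverse error function is derived from the standard normal quantile function.

```python
from math import ceil, erf, log2, sqrt
from statistics import NormalDist

def erfinv(y: float) -> float:
    """Inverse error function for 0 < y < 1, via erf(z) = 2*Phi(z*sqrt(2)) - 1."""
    return NormalDist().inv_cdf((y + 1) / 2) / sqrt(2)

def min_d(m: int, u_max: int) -> int:
    """Smallest integer d with d >= log2(m) + 1 and 3^(d-1)/(d+1) >= u_max (assumed reading)."""
    d = ceil(log2(m) + 1)
    while 3 ** (d - 1) / (d + 1) < u_max:
        d += 1
    return d

def min_b(d, eps, delta, u_max, m, x, y) -> int:
    """Smallest integer b satisfying the assumed bound; requires 0 < x <= y < m."""
    spread = min(x * (m - x), y * (m - y))
    return ceil(d + 3 * eps**2 * u_max / (32 * spread * erfinv(delta) ** 2))

# Sanity check that erfinv round-trips through math.erf.
assert abs(erf(erfinv(0.9)) - 0.9) < 1e-9

# Hypothetical setting: 100,000 tuples, at most 1,000 adversary queries.
d = min_d(100_000, 1_000)
b = min_b(d, eps=1_000, delta=0.9, u_max=1_000, m=100_000, x=50_000, y=50_000)
print(d, b)
```

Under this reading, a wider interval ε or a larger query budget u_max pushes b up, while a stricter confidence requirement δ pulls it down.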

Experimental Setup • Datasets • Boolean 0.3: 100,000 tuples and 30 attributes. Each attribute is 1 with probability p=0.3 • Boolean Mixed: 30 independent attributes, 5 have probability of p=0.5 to be 1, 10 have p=0.3, the other 10 have p=0.1 • Categorical Census: 1990 US Census Adult data published on the UCI Data Mining archive [HB99]. Highest domain size: 92 categories, lowest: Boolean • Attacking Techniques • HIDDEN-DB-SAMPLER [DDM07] • HYBRID-SAMPLER [DZD09]
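A synthetic table like Boolean 0.3 can be generated in a few lines. This is a sketch of how such data might be produced, not the authors' generator: the function name, the fixed seed, and the decision to allow duplicate tuples are assumptions.

```python
import random

def boolean_dataset(num_tuples, probs, seed=0):
    """Independent Boolean attributes; attribute i is 1 with probability probs[i]."""
    rng = random.Random(seed)  # fixed seed so the table is reproducible
    return [tuple(int(rng.random() < p) for p in probs) for _ in range(num_tuples)]

# Boolean 0.3: 100,000 tuples, 30 attributes, each 1 with probability 0.3.
boolean_03 = boolean_dataset(100_000, [0.3] * 30)
ones_in_first_attribute = sum(t[0] for t in boolean_03) / len(boolean_03)
print(len(boolean_03), len(boolean_03[0]), round(ones_in_first_attribute, 3))
```

Passing a list with different probabilities per attribute would produce a mixed table in the same way.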

HIDDEN-DB-SAMPLER

HYBRID-SAMPLER

Conclusion • Main Contribution 1: A Novel Problem • Reveal individual tuples truthfully and efficiently • But hide aggregated views of the data • An urgent challenge for hidden database owners • Main Contribution 2: COUNTER-SAMPLER • The insertion of dummy tuples with CAPTCHA flags • Minimum disruption to end users and third-party web mashup applications • Provides a privacy guarantee against sampling attacks and increases, by up to an order of magnitude, the number of queries required by existing sampling techniques.

A Broader Picture • Solution space for privacy-preserving strategies • Back-end hidden databases: this paper (dummy tuple insertion) • Query processing module – Future work • Front-end interface – Future work

Some drawbacks • Lacks support for dynamic data operations • COUNTER-SAMPLER needs to be executed once as a pre-processing step for a static database • Lacks support for relatively large databases • Assumes the actual number of tuples m is much smaller than the space of possible tuples, which is at least 2^n in Boolean databases with n attributes. • For a Boolean hidden database, how can dummy tuples be inserted when m is close to 2^n?
