
P4P: A Practical Framework for Privacy-Preserving Distributed Computation


Presentation Transcript


  1. P4P: A Practical Framework for Privacy-Preserving Distributed Computation Yitao Duan (Advisor Prof. John Canny) http://www.cs.berkeley.edu/~duan Berkeley Institute of Design Computer Science Division University of California, Berkeley 11/27/2007

  2. Research Goal • To provide practical solutions with provable privacy and adequate efficiency in a realistic adversary model at reasonably large scale


  4. Challenge: standard cryptographic tools not feasible at large scale [Diagram: users u1, u2, …, un submit data d1, d2, …, dn, which must be obfuscated, to compute a model f]

  5. A Practical Solution • Provable privacy: Cryptography • Efficiency: Minimize the number of expensive primitives and rely on probabilistic guarantee • Realistic adversary model: Must handle malicious users who may try to bias the computation by inputting invalid data

  6. Basic Approach [Diagram: users u1, u2, …, un hold data d1, d2, …, dn; each submission has cryptographic privacy, and for many algorithms there is no leakage beyond the final result] Each step aggregates sums over the user data: f = Σ di∈D gj(di), j = 1, 2, …, m

  7. The Power of Addition • A large number of popular algorithms can be run with addition-only steps • Linear algorithms: voting and summation; nonlinear algorithms: regression, classification, SVD, PCA, k-means, ID3, EM, etc. • All algorithms in the statistical query model [Kearns 93] • Many other gradient-based numerical algorithms • The addition-only framework has a very efficient private cryptographic implementation and admits efficient ZKPs

  8. Peers for Privacy: The Nomenclature • Privacy is a right that one must fight for. Some agents must act on behalf of users' privacy in the computation. We call them privacy peers • Our method aggregates across many users' data. We can prove that the aggregation provides privacy: the users' data protect each other

  9. Private Addition – P4P Style • The computation: secret sharing over small field • Malicious users: efficient zero-knowledge proof to bound the L2-norm of the user vector

  10. Big Integers vs. Small Ones • Most applications work with “regular-sized” integers (e.g. 32- or 64-bit). Arithmetic operations are very fast when each operand fits into a single memory cell (~10^-9 sec) • Public-key operations (e.g. in encryption and verification) must use keys of sufficient length (e.g. 1024-bit) for security. Existing private computation solutions must work with such large integers extensively (~10^-3 sec) • A 6-orders-of-magnitude difference!

  11. Private Arithmetic: Two Paradigms • Homomorphism: User data is encrypted with a public key cryptosystem. Arithmetic on this data mirrors arithmetic on the original data, but the server cannot decrypt partial results. • Secret-sharing: User sends shares of their data to several servers, so that no small group of servers gains any information about it.
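The homomorphic paradigm can be made concrete with a toy Paillier-style cryptosystem, a standard example of an additively homomorphic scheme (the slides do not commit to a particular one). This is a sketch with deliberately tiny, insecure parameters; the class and method names are illustrative, not part of any P4P code:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Toy Paillier cryptosystem (hypothetical parameters, NOT the actual P4P code):
// Enc(m1) * Enc(m2) mod n^2 decrypts to m1 + m2, so a server can add
// user inputs without ever seeing them.
public class PaillierDemo {
    static final BigInteger p = BigInteger.valueOf(499);   // toy primes; real keys are ~1024-bit
    static final BigInteger q = BigInteger.valueOf(547);
    static final BigInteger n = p.multiply(q);
    static final BigInteger n2 = n.multiply(n);
    static final BigInteger lambda =                       // lambda = lcm(p-1, q-1)
        lcm(p.subtract(BigInteger.ONE), q.subtract(BigInteger.ONE));
    static final SecureRandom rnd = new SecureRandom();

    static BigInteger lcm(BigInteger a, BigInteger b) {
        return a.divide(a.gcd(b)).multiply(b);
    }

    // Enc(m, r) = (1 + m*n) * r^n mod n^2, using the standard choice g = n + 1
    static BigInteger encrypt(long m) {
        BigInteger r;
        do { r = new BigInteger(n.bitLength(), rnd).mod(n); }
        while (r.signum() == 0 || !r.gcd(n).equals(BigInteger.ONE));
        return BigInteger.ONE.add(BigInteger.valueOf(m).multiply(n))
                .multiply(r.modPow(n, n2)).mod(n2);
    }

    // Dec(c) = L(c^lambda mod n^2) / lambda mod n, where L(x) = (x-1)/n
    static long decrypt(BigInteger c) {
        BigInteger u = c.modPow(lambda, n2).subtract(BigInteger.ONE).divide(n);
        return u.multiply(lambda.modInverse(n)).mod(n).longValueExact();
    }

    public static void main(String[] args) {
        BigInteger c1 = encrypt(1234), c2 = encrypt(4321);
        // Homomorphic addition: multiply the ciphertexts, then decrypt the product
        System.out.println(decrypt(c1.multiply(c2).mod(n2))); // 5555
    }
}
```

The large modular exponentiations in `encrypt`/`decrypt` are exactly the big-integer operations the previous slide prices at ~10^-3 sec each, which is why the next slides favor secret sharing for the bulk arithmetic.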

  12. Arithmetic: Homomorphism vs. VSS • Homomorphism + Can tolerate t < n corrupted players as far as privacy is concerned - Uses public-key crypto and works with large fields (e.g. 1024-bit): 10,000x more expensive than normal arithmetic (even for addition) • Secret sharing + Addition is essentially free; any field size can be used - Can't do two-party multiplication - Most schemes also use public-key crypto for verification - Doesn't fit well into existing service architectures

  13. P4P: Peers for Privacy [Diagram: server S, users U, and a privacy peer P forming a peer group] • Some parties, called privacy peers, actively participate in the computation, working for users' privacy • Privacy peers provide privacy when they are available, but can't access data themselves

  14. P4P [Diagram: server S, users U, and a privacy peer P forming a peer group] • The server provides data archival and synchronizes the protocol • The server communicates with privacy peers only occasionally (e.g. at 2 AM)

  15. Privacy Peers • Roles of privacy peers: • Anonymizing communication • Sharing information • Participating in computation • Other infrastructure support • They work on behalf of users' privacy • But we need a higher level of trust in privacy peers

  16. Candidates for Privacy Peers • Some players are more trustworthy than others • In a workplace, a union representative • In a community, a few members with good reputation • Or a third-party commercial provider • A very important source of security and efficiency • The key is that privacy peers should have incentives different from the server's, with mutual distrust between them

  17. Security from Heterogeneity • The server is secure against outside attacks and won't actively cheat • Companies spend $$$ to protect their servers • The server often holds much more valuable info than what the protocol reveals • The server benefits from accurate computation • Privacy peers won't collude with the server • Conflicts of interest, mutual distrust, laws • The server can't trust that clients will keep a conspiracy secret • Users can actively cheat • Rely on the server for protection against outside attacks, and on privacy peers for defense against a curious server

  18. Private Addition di: user i's private vector. ui, vi and di are all in a small integer field, with ui + vi = di

  19. Private Addition μ = Σui ν = Σvi ui + vi = di

  20. Private Addition μ ν μ = Σui ν = Σvi ui + vi = di

  21. Private Addition μ + ν

  22. P4P’s Private Addition • Provable privacy • Computation on both the server and the privacy peer is over small field: same cost as non-private implementation • Fits existing server-based schemes • Server is always online. Users and privacy peers can be on and off. • Only two parties performing the computation, users just submit their data (and provide a ZK proof, see later) • Extra communication for the server is only with the privacy peer, independent of n
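The four preceding slides can be sketched in a few lines: each user splits di into two random shares over a small field, the server and the privacy peer each sum the shares they receive, and only the two partial sums μ and ν are combined at the end. A minimal simulation (class and field names are illustrative, not the toolkit API):

```java
import java.util.Random;

// Minimal sketch of P4P-style private addition over a small field. Each user
// splits d into two random shares: u goes to the server, v to the privacy
// peer. Either share alone is uniformly random, yet the two partial sums
// recover the exact total.
public class PrivateAddition {
    static final long FIELD = (1L << 32) - 5;  // a small prime field (2^32 - 5)
    static final Random rnd = new Random();

    // Split d into shares (u, v) with u + v = d (mod FIELD)
    static long[] share(long d) {
        long u = Math.floorMod(rnd.nextLong(), FIELD);
        long v = Math.floorMod(d - u, FIELD);
        return new long[]{u, v};
    }

    // Sum of all users' data, computed from the two partial sums only
    static long aggregate(long[] data) {
        long mu = 0, nu = 0;                       // server's and peer's accumulators
        for (long d : data) {
            long[] s = share(d);
            mu = Math.floorMod(mu + s[0], FIELD);  // server adds its share
            nu = Math.floorMod(nu + s[1], FIELD);  // privacy peer adds its share
        }
        return Math.floorMod(mu + nu, FIELD);      // mu + nu = total
    }

    public static void main(String[] args) {
        long[] votes = {1, 0, 1, 1, 0, 1};
        System.out.println(aggregate(votes)); // 4
    }
}
```

All arithmetic here is ordinary 64-bit modular addition, which is the point of the slide: the private version costs the same as a non-private sum.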

  23. The Need for Verification [Example tally: Bush 100,000, Gore -100,000] • This scheme has a glaring weakness: users can submit any number in the small field as their data • Think of a voting scheme: “Please place your vote, 0 or 1, in the envelope”

  24. Zero Knowledge Proofs • I can prove that I know X without disclosing what X is. • I can prove that a given encrypted number is a 0. Or I can prove that an encrypted number is a 1. • I can prove that an encrypted number is a ZERO OR ONE, i.e. a bit. (6 extra numbers needed) • I can prove that an encrypted number is a k-bit integer. I need 6k extra numbers to do this (!!!)

  25. An Efficient ZKP of Boundedness • Luckily, we don't need to prove that every number in a user's vector is small, only that the vector as a whole is small • The server asks for some random projections of the user's vector, and the user proves that the square sum of the projections is small • O(log m) public-key crypto operations (instead of O(m)) to prove that the L2-norm of an m-dim vector is smaller than L • Running time reduced from hours to seconds

  26. Bounding the L2-Norm • A natural and effective way to restrict a cheating user’s malicious influence • You must have a big vector to produce large influence on the sum • Perturbation theory bounds system change with norms: |σi(A) - σi(B)| ≤ ||A-B||2 [Weyl] • Can be the basis for other checks • Setting L = 1 forces each user to have only 1 vote

  27. Random Projection-based L2-Norm ZKP • The server generates N random m-vectors in {-1, 0, +1}^m • The user projects his data onto the N directions and provides a ZKP that the square sum of the projections is < NL²/2 • Expensive public-key operations are performed only on the projections and the square sum
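A non-cryptographic simulation of the check itself (the zero-knowledge layer is omitted, and all names are mine): the verifier draws N vectors with entries in {-1, 0, +1}, and accepts iff the square sum of the projections falls below N·L²/2. Vectors well under the bound pass with high probability; vectors far above it fail:

```java
import java.util.Random;

// Simulation of the random-projection L2-norm check (arithmetic only; in the
// real protocol the projections and their square sum are proved in ZK rather
// than revealed).
public class NormCheck {
    // Probabilistic test of |d|_2 < L using N random {-1,0,+1} projections
    static boolean passes(double[] d, double L, int N, Random rnd) {
        double sumSq = 0;
        for (int k = 0; k < N; k++) {
            double x = 0;
            for (double v : d) x += (rnd.nextInt(3) - 1) * v; // entry in {-1,0,1}
            sumSq += x * x;                                   // squared projection
        }
        return sumSq < N * L * L / 2;                         // accept iff below N*L^2/2
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int m = 64;
        double L = 100;
        double[] small = new double[m], big = new double[m];
        for (int j = 0; j < m; j++) {
            small[j] = 0.1 * L / Math.sqrt(m);  // |small| = 0.1 * L
            big[j]   = 10 * L / Math.sqrt(m);   // |big|   = 10 * L
        }
        System.out.println(passes(small, L, 50, rnd)); // honest user: accepted
        System.out.println(passes(big, L, 50, rnd));   // cheater: rejected
    }
}
```

The acceptance curve is probabilistic, which is exactly what the plots on slide 29 measure as a function of |d|/L.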

  28. Effectiveness

  29. Acceptance/rejection probabilities (a) Linear and (b) log plots of probability of user input acceptance as a function of |d|/L for N = 50. (b) also includes probability of rejection. In each case, the steepest (jagged curve) is the single-value vector (case 3), the middle curve is Zipf vector (case 2) and the shallow curve is uniform vector (case 1)

  30. Performance Evaluation (a) Verifier and (b) prover times in seconds for the validation protocol where (from top to bottom) L (the required bound) has 40, 20, or 10 bits. The x-axis is the vector length.

  31. SVD • Singular value decomposition is an extremely useful tool for a lot of IR and data mining tasks (CF, clustering, …) • The SVD of a matrix A is a factorization A = UDV^T • If A encodes users × items, then V^T gives us the best least-squares approximations to the rows of A in a user-independent way • A^T A V = V D² ⇒ SVD is an eigenproblem
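Since A^T A v = Σi di(di·v), each power-iteration step on A^T A is a sum of per-user vectors, which is exactly the shape of computation P4P can aggregate privately. A sketch of the numerics with the sum done in the clear (illustrative code, not the toolkit's API):

```java
// Power iteration on A^T A, written as an aggregation of user contributions:
// each user holds one row d_i of A and contributes d_i * (d_i . v) per step.
// In P4P, the loop marked below would be the private small-field addition.
public class PowerIterationSVD {
    // One iteration: aggregate user contributions, then normalize
    static double[] step(double[][] users, double[] v) {
        int m = v.length;
        double[] sum = new double[m];
        for (double[] d : users) {                  // <-- the private sum in P4P
            double dot = 0;
            for (int j = 0; j < m; j++) dot += d[j] * v[j];
            for (int j = 0; j < m; j++) sum[j] += d[j] * dot;
        }
        double norm = 0;
        for (double x : sum) norm += x * x;
        norm = Math.sqrt(norm);
        for (int j = 0; j < m; j++) sum[j] /= norm;
        return sum;
    }

    // Top singular value of A (rows = user vectors)
    static double topSingularValue(double[][] users, int iters) {
        int m = users[0].length;
        double[] v = new double[m];
        for (int j = 0; j < m; j++) v[j] = 1;       // generic start vector
        for (int t = 0; t < iters; t++) v = step(users, v);
        double s = 0;                               // sigma_1^2 = sum_i (d_i . v)^2
        for (double[] d : users) {
            double dot = 0;
            for (int j = 0; j < m; j++) dot += d[j] * v[j];
            s += dot * dot;
        }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        // rows are user vectors; A^T A = diag(9, 25), so sigma_1 = 5
        double[][] users = {{3, 0}, {0, 4}, {0, 3}};
        System.out.println(topSingularValue(users, 30)); // ~5.0
    }
}
```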

  32. SVD: P4P Style

  33. Experiments: SVD Datasets

  34. Results N: number of iterations. k: number of singular values. ε: relative residual error

  35. Distributed Association Rule Mining • n users, m items. User i has dataset Di • Horizontally partitioned: each Di contains the same attributes [Diagram: each Di is a binary vector over the m items, e.g. D1 = (1 0 0 0 1 … 0 0), Dn = (0 0 1 0 0 … 1 0)]

  36. The Market-Basket Model • A large set of items, e.g., things sold in a supermarket. • A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.

  37. Support • Simplest question: find sets of items that appear “frequently” in the baskets • Support for itemset I = the number of baskets containing all items in I • Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets

  38. Example • Items={milk, coke, pepsi, beer, juice}. • Support = 3 baskets. B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} • Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
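The example can be checked mechanically; a small support counter (illustrative code) reproduces the counts above with threshold s = 3:

```java
import java.util.*;

// Support counting for the basket example: with threshold s = 3, the frequent
// itemsets come out as {m},{c},{b},{j},{m,b},{c,b},{j,c}, matching the slide.
public class SupportCount {
    // Number of baskets containing every item of the candidate itemset
    static int support(List<Set<String>> baskets, Set<String> itemset) {
        int count = 0;
        for (Set<String> b : baskets) if (b.containsAll(itemset)) count++;
        return count;
    }

    public static void main(String[] args) {
        // B1..B8 from the slide
        List<Set<String>> baskets = new ArrayList<>();
        for (String s : new String[]{"m c b", "m p j", "m b", "c j",
                                     "m p b", "m c b j", "c b j", "b c"})
            baskets.add(new HashSet<>(Arrays.asList(s.split(" "))));

        System.out.println(support(baskets, Set.of("m")));      // 5  -> frequent
        System.out.println(support(baskets, Set.of("m", "b"))); // 4  -> frequent
        System.out.println(support(baskets, Set.of("c", "j"))); // 3  -> frequent
        System.out.println(support(baskets, Set.of("b", "j"))); // 2  -> not frequent
    }
}
```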

  39. Association Rules • If-then rules about the contents of baskets • {i1, i2, …, ik} → j means: “if a basket contains all of i1, …, ik then it is likely to contain j” • The confidence of this association rule is the probability of j given i1, …, ik

  40. Step k of apriori-gen in P4P • User i constructs an mk-dimensional vector in the small field (mk: number of candidate itemsets at step k) • Use P4P to compute the aggregate (with verification) • The result encodes the supports of all candidate itemsets
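A sketch of the step-k encoding (names are mine, not the toolkit API): user i builds a 0/1 vector indexed by the candidate itemsets, and the vector sum, which P4P would compute privately and with verification, is exactly the support vector:

```java
import java.util.*;

// Step k of apriori-gen, P4P style: entry j of user i's vector indicates
// whether D_i contains all items of the jth candidate itemset c_j. The sum
// of these indicator vectors gives the support of every candidate at once.
public class AprioriStep {
    static long[] encode(Set<String> Di, List<Set<String>> candidates) {
        long[] v = new long[candidates.size()];
        for (int j = 0; j < v.length; j++)
            v[j] = Di.containsAll(candidates.get(j)) ? 1 : 0;
        return v;
    }

    // Stand-in for the P4P private addition of the users' vectors
    static long[] supports(List<Set<String>> users, List<Set<String>> candidates) {
        long[] total = new long[candidates.size()];
        for (Set<String> Di : users) {
            long[] v = encode(Di, candidates);
            for (int j = 0; j < total.length; j++) total[j] += v[j];
        }
        return total;
    }

    public static void main(String[] args) {
        List<Set<String>> users = List.of(
            Set.of("m", "c", "b"), Set.of("m", "b"), Set.of("c", "b"));
        List<Set<String>> candidates = List.of(Set.of("m", "b"), Set.of("c", "b"));
        System.out.println(Arrays.toString(supports(users, candidates))); // [2, 2]
    }
}
```

Note that the indicator entries are 0/1, so the L2-norm ZKP with L set near 1 limits how much any single user can inflate the supports.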

  41. Step k of apriori-gen in P4P [Diagram: each user i computes di[j] = 1 if Di contains the jth candidate itemset cj, else 0; P4P sums these vectors, and entry j of the sum is the support for cj]

  42. Analysis • Privacy guaranteed by P4P • Near-optimal efficiency: cost comparable to that of a direct implementation of the algorithms • Main aggregation is in the small field • Only a small number of large-field operations • Cheating users are handled by P4P's built-in ZK user-data verification

  43. Privacy • SVD: the intermediate sums are implied by the final results • A^T A = V D² V^T • ARM: the sums are treated as public by the applications • Guaranteed privacy regardless of data distribution or size

  44. Infrastructure Support • Multicast encryption [RSA 06] • Scalable secure bidirectional communication [Infocom 07] • Data protection scheme [PET 04]

  45. P4P: Current Status • P4P has been implemented • In Java, using native code for big-integer arithmetic • Runs on Linux • Will be released as an open-source toolkit for building privacy-preserving real-world applications

  46. Conclusion • We can provide strong privacy protection with little or no cost to a service provider for a broad class of problems in e-commerce and knowledge work. • Responsibility for privacy protection shifts to privacy peers • Within the P4P framework, private computation and many zero-knowledge verifications can be done with great efficiency

  47. More info • duan@cs.berkeley.edu • http://www.cs.berkeley.edu/~duan • Thank You!
