Privacy-preserving Data Mining for the Internet of Things: State of the Art


Presentation Transcript


  1. Privacy-preserving Data Mining for the Internet of Things: State of the Art Yee Wei Law (罗裔纬) wsnlabs.com

  2. Speaker’s brief bio • Ph.D. from University of Twente for research on security of wireless sensor networks (WSNs) in EU project EYES • Research Fellowship on WSNs from The University of Melbourne • ARC projects “Trustworthy sensor networks: theory and implementation”, “BigNet” • EU FP7 projects “SENSEI”, “SmartSantander”, “IoT-i”, “SocIoTal” • IBES seed projects on participatory sensing, smart grids • Taught Master’s course “Sensor Systems” • Professional membership: • Associate of (ISC)2 (junior CISSP) • Smart Grid Australia Research Working Group • Current research interests: • Privacy-preserving data mining • Secure/resilient control • Applications of the above to the IoT and smart grid • Current research orientation: • Mixed basic/applied research in data science or network science • Research involving probabilistic/statistical, combinatorial, and matrix analysis

  3. Agenda • The IoT and its research priorities • Participatory sensing (PS) • Collaborative learning (CL) • Introduction to privacy-preserving data mining • Schemes suitable for PS and CL • Research opportunities and challenges • If time permits, SocIoTal

  4. A dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols where physical and virtual “things” have identities, physical attributes, and virtual personalities and use intelligent interfaces, and are seamlessly integrated into the information network. H. Sundmaeker et al., “Vision and Challenges for Realising the Internet of Things,” Cluster of European Research Projects on the Internet of Things, Mar. 2010.

  5. Evidence of the Internet of Things Nissan EPORO robot cars Smart grid

  6. Research priorities Smart transport Smart grid Smart water Smart whatever • Among research priorities: • Mathematical models and algorithms for inventory management, production scheduling, and data mining • Privacy aware data processing ITU-T: “Through the exploitation of identification, data capture, processing and communication capabilities, the IoT makes full use of things to offer services to all kinds of applications, whilst maintaining the required privacy.”

  7. Shifting priorities ARPAnet Machine-to-machine communications • We have enough technology to hook things up; now we should find better ways of capturing and analyzing data. • Introducing participatory sensing and collaborative learning... Some graphics from Sabina Jeschke

  8. Participatory sensing A process whereby individuals and communities use ever-more-capable mobile phones and cloud services to collect and analyze systematic data for use in discovery. • Citizen-provided data can improve governance with benefits including: • Increased public safety • Increased social inclusion and awareness • Increased resource efficiency for sustainable communities • Increased public accountability Source: Estrin et al.

  9. Data sharing scenarios Lindell and Pinkas [2000]: “privacy-preserving data mining” refers to privacy-preserving distributed data mining

  10. Data sharing scenarios (cont’d) • Collaborative learning: Multiple data owners collaboratively analyze the union of their data with the involvement of a third-party data miner. • Agrawal and Srikant [2000] coined the term “privacy-preserving data mining” to refer to privacy-preserving collaborative learning. • Encrypting the data for the data miner is inadequate; the data should be masked, at a balanced point between accuracy and privacy.

  11. Privacy-preserving collaborative learning
  • Requirement imposed by participatory sensing: online data submission, offline data processing
  • Design space:
  • Adversarial model
  • Data type: continuous or categorical; voice, images, videos, etc.
  • Data structure: relational or time series; for relational data, horizontally or vertically partitioned
  • Data mining operation
  • Privacy criterion: semantic (SMC, differential privacy) or syntactic; randomization: linear (additive or multiplicative) or nonlinear; proposed criterion

  12. Adversarial models
  • Semi-honest (honest but curious): a passive attacker tries to learn the private states of other parties, without deviating from the protocol; by definition, semi-honest parties do not collude
  • Malicious: an active attacker tries to learn the private states of other parties, and deviates arbitrarily from the protocol
  • Common approach: design in the semi-honest model, then enhance for the malicious model
  • General method: zero-knowledge proofs, often not practical
  • The semi-honest model is often realistic enough

  13. Semantic vs syntactic privacy criteria
  • Semantic privacy criteria: minimize the difference between adversarial prior knowledge and adversarial posterior knowledge about individuals represented in the database
  • General enough for most data types, relational or time series
  • Examples: cryptographic privacy (secure multiparty computation), differential privacy (randomization)
  • Syntactic privacy criteria: prevent syntactic attacks, e.g., table linkage:
  • Attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table
  • Attacker can infer the presence of its target’s record in the anonymous table from the target’s record in the nonanonymous table
  • Relevant for relational data, not time series data
  • Example: k-anonymity

  14. Secure multiparty computation
  • Goal: parties jointly compute f(x1, x2) on their private inputs x1, x2 without revealing them
  • Metaphor: Yao’s millionaires’ problem [1982]
  • Building block: oblivious transfer
  • In 1-out-of-n oblivious transfer, the sender holds n values; the receiver chooses one value, and the sender doesn’t learn which
  • Introduced by Rabin [1981]
  • Kilian [1988] showed oblivious transfer is sufficient for secure two-party computation
  • Naor et al. [2001] reduced the amortized overhead of oblivious transfer to roughly one exponentiation per logarithmic number of oblivious transfers
  • Garbled circuits evaluate arbitrary functions [Beaver et al. 1990]
  • Homomorphic encryption can be used in the semi-honest model (a simpler SMC flavour is sketched below)
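Oblivious transfer and garbled circuits do not condense well into a few lines, but the flavour of SMC can be conveyed with a simpler building block. Below is a minimal sketch, not from the talk, of secure summation via additive secret sharing over a prime field; the modulus and all names are illustrative, and the "parties" are simulated in one process.

```python
import secrets

P = 2**61 - 1  # illustrative public prime modulus; all arithmetic is mod P

def share(value, n_parties):
    """Split `value` into n_parties additive shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def secure_sum(private_inputs):
    """Simulate the protocol: each party shares its input, each party
    locally adds the one share it received from every input, and only
    the total of the partial sums is ever reconstructed."""
    n = len(private_inputs)
    all_shares = [share(x, n) for x in private_inputs]
    partial_sums = [sum(s[j] for s in all_shares) % P for j in range(n)]
    return sum(partial_sums) % P

print(secure_sum([10, 20, 30]))  # 60, with no party seeing another's input
```

No individual share reveals anything about its input (each is uniform mod P); only the final sum is disclosed, which is the point of the technique.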

  15. Differential privacy • In cryptography, semantic security: whatever is computable about the cleartext given the ciphertext is also efficiently computable without the ciphertext • This notion is useless for PPDM: a database satisfying the above has no utility • Dwork [2006] proposed “differential privacy” for statistical disclosure control: add noise to query results

  16. Differential privacy (cont’d)
  • A randomized mechanism K is ε-differentially private if for all databases D1, D2 differing in one row and all output sets S: Pr[K(D1) ∈ S] ≤ e^ε · Pr[K(D2) ∈ S]
  • Sensitivity of a query f: Δf = max over neighboring D1, D2 of ‖f(D1) − f(D2)‖1; adding Laplace noise with scale Δf/ε to the query answer achieves ε-differential privacy (noisy sum queries)
  • Theoretical basis for answering “sum queries”
  • Sum queries can be used for histogram, mean, covariance, correlation, SVD, PCA, k-means, decision tree, etc.
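As a concrete illustration, here is a minimal sketch of the Laplace mechanism for a single sum query; the data, the per-record bound, and ε are made-up assumptions.

```python
import numpy as np

def noisy_sum(values, epsilon, sensitivity):
    """Laplace mechanism: answer a sum query with Laplace noise of scale
    sensitivity/epsilon, achieving epsilon-differential privacy."""
    return float(np.sum(values)) + np.random.laplace(scale=sensitivity / epsilon)

# Illustrative: ages assumed bounded in [0, 100], so adding or removing
# one person changes the sum by at most 100 (the query's sensitivity).
ages = [23, 35, 41, 58, 62]
print(noisy_sum(ages, epsilon=0.5, sensitivity=100))
```

Note the accuracy/privacy trade-off: smaller ε means stronger privacy but a noisier answer.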

  17. Taxonomy of attacks against randomization-based approaches
  • Known input-output attack: the attacker has some input samples and all output samples, and knows which input sample corresponds to which output sample
  • Known input/sample attack: the attacker has some input samples and all output samples, but does not know which input sample corresponds to which output sample; such an attack typically begins with establishing correspondences between the input samples and the output samples
  • Proposed privacy criterion: the distance between f(X) and the estimated f(X) is kept above a specified threshold under known attacks

  18. Randomization
  • Randomized distortion or perturbation of data: linear (additive or multiplicative) or nonlinear, for relational or time series data
  • Additive perturbation: adds noise to the data
  • i.i.d. noise is susceptible to:
  • Spectral filtering attack by Kargupta et al. [2003]
  • PCA attack by Huang et al. [2005]: estimate the covariance matrix of the original data, then find the eigenvalues and eigenvectors of the covariance matrix through PCA (a sketch follows below)
  • Bayesian estimation may not have an analytic form
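A minimal sketch of the PCA-filtering idea behind these attacks, on synthetic data; the dimensions, noise level, and number of retained components are illustrative assumptions, not parameters from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10

# Synthetic correlated data: 2 latent factors drive all 10 attributes.
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d))
Y = X + rng.normal(scale=0.5, size=(n, d))  # i.i.d. additive perturbation

# PCA filtering: the correlated signal concentrates in the top principal
# components of the perturbed data, while i.i.d. noise spreads evenly,
# so projecting onto the top components strips most of the noise.
Yc = Y - Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Yc, full_matrices=False)
X_hat = Yc @ Vt[:2].T @ Vt[:2] + Y.mean(axis=0)

# Ratio < 1 means the attacker's estimate beats the raw perturbed data.
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X - Y))
```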

  19. Collaborative learning using additive perturbation
  • Compared to multiplicative perturbation, it is easier to recover the source data distribution fX(x) from the perturbed data distribution (estimated with kernel density estimation) and the noise distribution, by solving a deconvolution problem
  • To resist such attacks, the noise should be correlated with the data and participant-specific
  • PoolView [Ganti et al. 2008] builds a model of the data, then generates noise from the model
  • Attack: with a common noise model, a participant i can reconstruct another participant j’s data from the perturbed data
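A minimal sketch of why additive perturbation leaks the source distribution: since f_Y = f_X ∗ f_N, the attacker can divide characteristic functions. The gamma-distributed data, Gaussian noise, and grid sizes are illustrative assumptions; naive deconvolution is ill-posed, so the truncation below is a crude regularizer and real attacks are more careful.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0, size=20_000)  # private data (assumed)
sigma = 0.3
y = x + rng.normal(scale=sigma, size=x.size)      # what the miner sees

# f_Y = f_X * f_N, so in the Fourier domain phi_X(t) = phi_Y(t) / phi_N(t).
# Truncating |t| keeps the division well conditioned.
t = np.linspace(-8, 8, 161)
phi_y = np.exp(1j * np.outer(t, y)).mean(axis=1)  # empirical char. fn of Y
phi_x = phi_y / np.exp(-0.5 * (sigma * t) ** 2)   # divide out the Gaussian CF

# Invert the characteristic function on a grid to estimate f_X.
grid = np.linspace(0.0, 8.0, 81)
dt = t[1] - t[0]
f_x = (phi_x * np.exp(-1j * np.outer(grid, t))).sum(axis=1).real * dt / (2 * np.pi)
print(f_x[:5])  # rough estimate of the source density near zero
```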

  20. Collaborative learning using additive perturbation (cont’d)
  • Zhang et al. [2012] make the noise both data-dependent and participant-dependent; the data miner reconstructs the PDF of the original data from the PDF of the perturbed data y and the noise model
  • Catches:
  • The data miner has to know the participants’ parameters, so the system is not resilient to collusion
  • Data correlation between participants exposes them to attacks (recall the PCA-based attack?)

  21. Multiplicative perturbation
  • Multiplies the data with noise: the perturbation maps input x to output y
  • Rotation perturbation [Chen et al. 2005]: the noise matrix is an orthogonal matrix, with orthonormal rows and columns (a sketch follows below); an enhanced version is geometric perturbation
  • Giannella et al.’s [2013] attack can estimate the original data using all perturbed data and a small amount of original data:
  • Attack stage 1: find the maximally unique map β consistent with the known samples; then we know which xi is mapped to which yi
  • Attack stage 2: find the perturbation-matrix estimate that maximizes the likelihood of that mapping
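A minimal sketch of rotation perturbation: draw a random orthogonal matrix via QR decomposition and check that norms, and hence Euclidean distances, are preserved. Dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_rotation(d):
    """Draw a random orthogonal matrix via QR decomposition of a Gaussian
    matrix (the sign fix makes the distribution uniform/Haar)."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

X = rng.normal(size=(100, 5))  # rows are records (synthetic)
R = random_rotation(5)
Y = X @ R.T                    # rotation perturbation: y = R x

# Orthogonal R preserves norms, distances and inner products, which is
# why distance-based miners (kNN, k-means, SVM) still work on Y.
assert np.allclose(np.linalg.norm(X, axis=1), np.linalg.norm(Y, axis=1))
```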

  22. Multiplicative perturbation: random projection
  • Projects data from d dimensions down to k dimensions by a Gaussian random matrix
  • The rows are statistically orthogonal, so this is essentially a Johnson-Lindenstrauss transform: inter-point distances change by a factor of (1 ± ε) as long as k ≥ O(ε⁻² log n)
  • Other Johnson-Lindenstrauss transforms exist
  • Open question: is the attack against orthogonal transforms adaptable for this?
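A minimal sketch of Gaussian random projection; n, d, k are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 500, 1000, 200
X = rng.normal(size=(n, d))  # synthetic high-dimensional records

# Gaussian random projection: scaling by 1/sqrt(k) makes inter-point
# distances concentrate within (1 +/- eps) of their originals for
# k = O(eps^-2 log n), per Johnson-Lindenstrauss.
R = rng.normal(size=(k, d)) / np.sqrt(k)
Y = X @ R.T

orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(proj / orig)  # close to 1
```

Unlike a rotation, the map is dimension-reducing and hard to invert without R, which is the source of the privacy claim.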

  23. Collaborative learning using multiplicative perturbation • Goal: use a different perturbation matrix for each participant • Liu et al. [2012]: the data miner synthesizes a data matrix Z with matching mean and covariance, learns approximate inverses of the participants’ perturbation matrices Ru and Rv, and thereby obtains estimates of the original data Xu and Xv! • What about the privacy criterion?

  24. Nonlinear perturbation
  • Nonlinear + linear perturbation: relies on linear perturbation (random matrices) to achieve projection
  • A nonlinear function such as tanh(x) is applied to normalized values, so extreme values (potential outliers) are “squashed”
  • The near-many-to-one mapping provides the privacy property (a sketch follows below)
  • Open question: can the many-to-one mapping be extended to the “normal” part of the curve?
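A minimal sketch of nonlinear + linear perturbation under assumed heavy-tailed data; the distribution, dimensions, and choice of tanh as the squashing function are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_t(df=3, size=(200, 4))  # heavy-tailed data with outliers

# Normalize, squash with tanh (extreme values land near +/-1, so the
# mapping is nearly many-to-one at the tails), then randomly project.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
R = rng.normal(size=(3, 4)) / np.sqrt(3)
Y = np.tanh(Xn) @ R.T

print(np.abs(Xn).max(), np.abs(np.tanh(Xn)).max())  # outliers squashed below 1
```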

  25. Bayesian estimation attacks against multiplicative perturbation
  • Solve the underdetermined system Y = RX for X
  • Maximum a posteriori (MAP) estimation (why? the prior on X regularizes the underdetermined problem)
  • If R is known: Gaussian original data obviously simplifies the attacker’s problem
  • If R is not known: a difficult optimization problem, although Gaussian data simplifies it
  • The choice of the prior p(R) matters
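A minimal sketch of the known-R case under an assumed zero-mean Gaussian prior on the record, where the MAP estimate reduces to the minimum-norm solution of Y = RX; sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 10, 6                  # m < n: the system y = R x is underdetermined
R = rng.normal(size=(m, n))   # perturbation matrix, known to the attacker here
x = rng.normal(size=n)        # private record drawn from the N(0, I) prior
y = R @ x

# With a zero-mean Gaussian prior on x and noiseless observations, the
# MAP estimate is the minimum-norm solution of y = R x.
x_map = R.T @ np.linalg.solve(R @ R.T, y)
print(np.linalg.norm(x - x_map) / np.linalg.norm(x))  # < 1: partial recovery
```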

  26. Independent component analysis against multiplicative perturbation
  • The perturbation matrix is treated as a mixing matrix and the attack as blind source separation
  • Prerequisites for the attacker: independence of the components, at most one Gaussian component, sparseness (e.g., Laplace-distributed data), m ≥ (n+1)/2
  • Steps: estimate R, estimate X, then resolve the permutation and scaling ambiguity
  • Cases: m = n (standard ICA); m < n (overcomplete/underdetermined ICA, sparse representation, nonnegative matrix factorization); m > n
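A minimal sketch of the m = n case using scikit-learn's FastICA on synthetic Laplace-distributed attributes; recovery is only up to permutation and scaling, as the slide notes, and all sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
n_samples, n_sources = 5000, 3

X = rng.laplace(size=(n_samples, n_sources))  # independent, non-Gaussian attributes
R = rng.normal(size=(n_sources, n_sources))   # secret square perturbation matrix
Y = X @ R.T                                   # perturbed data (the m = n case)

# Treat R as an unknown mixing matrix and run blind source separation;
# the recovered components match X only up to permutation and scaling.
ica = FastICA(n_components=n_sources, random_state=0)
X_hat = ica.fit_transform(Y)

# Correlate recovered components with the true ones to see the match.
corr = np.corrcoef(X.T, X_hat.T)[:n_sources, n_sources:]
print(np.round(np.abs(corr), 2))  # near-permutation matrix of ~1s and ~0s
```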

  27. Research opportunities and challenges
  • Commercial interest?
  • Large design space: effectiveness depends as much on the nature of the data as on the data mining algorithms
  • Challenging multidisciplinary problems necessitate a broad range of tools: statistical analysis, Bayesian analysis, matrix analysis, time series analysis, optimization, signal processing
  • Scenario-dependent privacy criteria
  • Defenses and attacks evolve side by side
  • Role of dimensionality reduction?
  • Steganography for “traitor tracing”?
  • Many more from syntactic privacy, SMC, etc.
  [Slide diagram: participants’ data, multiplicatively or nonlinearly perturbed, flows to the data mining algorithms and is attacked by Bayesian estimation and ICA]

  28. What is Big Data? • Unsupervised learning of Big Data, e.g., Deep Learning

  29. • Vision: from a business-centric Internet of Things to a citizen-centric Internet of Things • Main non-technical aim: create trust and confidence in Internet of Things systems, while providing user-friendly ways to contribute to and use the system, thus encouraging the creation of services of high socio-economic value • Main technical aims: • Reliable and secure communications • Trustworthy data collection • Privacy-preserving data mining • Motivating use cases: Alice’s sensor network monitoring her house; Alice’s friend Bob granted access to Alice’s network while Alice is on vacation; a sensor network monitoring a community microgrid and feeding data to stakeholders

  30. Duration: Sep 2013 – Aug 2016 • Funding scheme: STREP • Total cost: €3.69m • EC contribution: €2.81m • Contract number: CNECT-ICT-609112

  31. Conclusion • Looking back: the 1970s gave us statistical disclosure control; the 2000s gave us PPDM • Technological development expands the design space and invites multidisciplinary input • Socio-economic development plays a critical role [Slide diagram: the design-space taxonomy from slide 11; source: Cisco IBSG, April 2011]

  32. Syntactic privacy criteria/definitions To prevent syntactic attacks:
  • Table linkage:
  • Attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table
  • Attacker can infer the presence of its target’s record in the anonymous table from the target’s record in the nonanonymous table
  • Record linkage:
  • Attacker has access to an anonymous table and a nonanonymous table, and the knowledge that its target is represented in both tables
  • Attacker can uniquely identify the target’s record in the anonymous table from the target’s record in the nonanonymous table
  • Attribute linkage:
  • Attacker has access to an anonymous table and the knowledge that its target is represented in the table
  • Attacker can infer the value(s) of its target’s sensitive attribute(s) from the group (e.g., 30–40 year-old females) the target belongs to
  • Example: k-anonymity (see the sketch below)
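A minimal sketch of checking k-anonymity with pandas; the toy table, the choice of quasi-identifiers, and the helper name are illustrative.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """A released table is k-anonymous w.r.t. the quasi-identifiers if
    every combination of their values appears in at least k records."""
    return bool(df.groupby(quasi_identifiers).size().min() >= k)

# Toy released table: age band and sex are the quasi-identifiers.
table = pd.DataFrame({
    "age_band": ["30-40", "30-40", "30-40", "20-30", "20-30"],
    "sex":      ["F",     "F",     "F",     "M",     "M"],
    "disease":  ["flu",   "asthma", "flu",  "flu",   "ulcer"],
})
print(is_k_anonymous(table, ["age_band", "sex"], k=2))  # True
```

With k = 2, every quasi-identifier group hides its members among at least two records, which blunts record linkage but, as the attribute-linkage bullet above shows, not necessarily inference of sensitive values.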
