A Tutorial of Privacy-Preservation of Graphs and Social Networks

Xintao Wu, Xiaowei Ying

University of North Carolina at Charlotte

National Laws
  • USA
    • HIPAA for health care
      • Passed August 21, 96
      • sets the lowest bar; the States are welcome to enact more stringent rules
        • California State Bill 1386
    • Gramm-Leach-Bliley Act of 1999 for financial institutions
    • COPPA for children’s online privacy
    • etc.
  • Canada
    • PIPEDA 2000
      • Personal Information Protection and Electronic Documents Act
      • Effective from Jan 2004
  • European Union (Directive 95/46/EC)
    • Passed by the European Parliament Oct 95 and effective from Oct 98.
    • Provides guidelines for member state legislation
    • Forbids sharing data with states that do not protect privacy
Privacy Breach
  • AOL's publication of the search histories of more than 650,000 of its users has yielded more than just one of the year's bigger privacy scandals. (Aug 6, 2006)

That database does not include names or user identities. Instead, it lists only a unique ID number for each user. AOL user 710794

    • an overweight golfer, owner of a 1986 Porsche 944 and 1998 Cadillac SLS, and a fan of the University of Tennessee Volunteers Men's Basketball team.
    • interested in the Cherokee County School District in Canton, Ga., and has looked up the Suwanee Sports Academy in Suwanee, Ga., which caters to local youth, and the Youth Basketball of America's Georgia affiliate.
    • regularly searches for "lolitas," a term commonly used to describe photographs and videos of minors who are nude or engaged in sexual acts.

Source: AOL's disturbing glimpse into users' lives, by Declan McCullagh, CNET News.com,

August 7, 2006, 8:05 PM PDT

Privacy Preserving Data Mining
  • Data mining
    • The goal of data mining is summary results (e.g., classification, cluster, association rules etc.) from the data (distribution)
  • Individual Privacy
    • Individual values in the database must not be disclosed; at a minimum, attackers must not be able to closely estimate them
    • Contractual limitations: privacy policies, corporate agreements
  • Privacy Preserving Data Mining
    • How to transform data such that
      • we can build a good data mining model (data utility)
      • while preserving privacy at the record level (privacy)?
PPDM on Tabular Data

69% unique on zip and birth date

87% with zip, birth date and gender

Generalization (k-anonymity, L-diversity, t-closeness etc.) and Randomization

Refer to a survey book [Aggarwal, 08]

PPDM Tutorials on Tabular Data
  • Privacy in data systems, Rakesh Agrawal, PODS03
  • Privacy preserving data mining, Chris Clifton, PKDD02, KDD03
  • Models and methods for privacy preserving data publishing and analysis, Johannes Gehrke, ICDM05, ICDE06, KDD06
  • Cryptographic techniques in privacy preserving data mining, Helger Lipmaa, PKDD06
  • Randomization based privacy preserving data mining, Xintao Wu, PKDD06
  • Privacy in data publishing, Johannes Gehrke & Ashwin Machanavajjhala, S&P09
  • Anonymized data: generation, models, usage, Graham Cormode & Divesh Srivastava, SIGMOD09
Social Network

Network of US political books

(105 nodes, 441 edges)

Books about US politics sold by Amazon.com. Edges represent frequent co-purchasing of books by the same buyers. Nodes have been given colors of blue, white, or red to indicate whether they are "liberal", "neutral", or "conservative".

Social Network
  • Network of the political blogs on the 2004 U.S. election (polblogs, 1,222 nodes and 16,714 edges)
Social Network
  • Collaboration network of scientists [Newman, PRE06]
More Social Network Data
  • Newman’s collection
    • http://www-personal.umich.edu/~mejn/netdata/
  • Enron data
    • http://www.cs.cmu.edu/~enron/
  • Stanford large network dataset collection
    • http://snap.stanford.edu/data/index.html
Graph Mining
  • A very hot research area
    • Graph properties such as degree distribution
    • Motif analysis
    • Community partition and outlier detection
    • Information spreading
    • Resiliency/robustness, e.g., against virus propagation
    • Spectral analysis
  • Research development
    • “Managing and mining graph data” by Aggarwal and Wang, Springer 2010.
    • “Large graph-mining: power tools and a practitioner’s guide” by Faloutsos et al. KDD09
Network Science and Privacy

Source: Jeannette Wing, Computing research: a view from DC, SNOWBIRD, 2008

Outline
  • Attacks on Naively Anonymized Graph
  • Privacy Preserving Social Network Publishing
    • K-anonymity
    • Generalization
    • Randomization
    • Other Works
  • Output Perturbation
    • Background on differential privacy
    • Accurate analysis of private network data
Social Network Data Publishing

The data owner releases the (anonymized) network data to the data miner.
Threat of Re-identification
The attacker attacks the released data, and Ada’s sensitive information is disclosed.

  • Privacy breaches
    • Identity disclosure
    • Link disclosure
    • Attribute disclosure
Deriving Personal Identifying Information [Gross WPES05]
  • User profiles (e.g., photo, birth date, residence, interests, friend links) can be used to estimate personal identifying information such as SSN.
  • SSN format ###-##-####: the first three digits (the area number) are determined by the zip code, the middle two are the group number, and the last four are a sequential serial number.
  • Users should pay attention to (default) privacy preference settings of online social networks.

https://secure.ssa.gov/apps10/poms.nsf/lnx/0100201030

Active and Passive Attacks [Backstrom WWW07]
  • Active attack outline
    • Join the network by creating some new user accounts;
    • Establish a highly distinguishable subgraph H among the attacking nodes;
    • Send links to targeted individuals from the attacking nodes;
    • In the released graph, identify the subgraph H among the attacking nodes;
    • The targeted individuals and their links are then identified.
Active and Passive Attacks [Backstrom WWW07]
  • Active attacks & subgraph H

The active attack is based on the subgraph H among the attackers:

    • No other subgraph of G is isomorphic to H;
    • H has no non-trivial automorphism;
    • H can be identified efficiently regardless of G.
Active and Passive Attacks [Backstrom WWW07]
  • Passive attacks outline
    • Observation: most nodes in the network already form a uniquely identifiable subgraph.
    • One adversary recruits k-1 of his neighbors to form the subgraph H of size k.
    • Work similarly to active attacks.

Drawback: Uniqueness of H is not guaranteed.

Attacks by Structural Queries [Hay VLDB08]
  • Structural queries:

A structural query Q represents complete or partial structural information about a targeted individual that may be available to adversaries.

  • Structural queries and identity privacy:
Attacks by Structural Queries [Hay VLDB08]
  • Degree sequence refinement queries
Attacks by Structural Queries [Hay VLDB08]
  • Subgraph queries

The adversary is capable of gathering a fixed number of edges around the targeted individual.

  • Hub fingerprint queries

A hub is a central node in a network. A hub fingerprint of node v is the node's connections to a set of designated hubs within a certain distance.

Attacks by Combining Multiple Graphs [Narayanan ISSP09]
  • Attack outline:
    • The attacker has two types of auxiliary information:
      • Aggregate: an auxiliary graph whose members overlap with the anonymized target graph
      • Individual: the detailed information on a very small number of individuals (called seeds) in both the auxiliary graph and the target graph.
    • Identify seeds in the target graph.
    • Identify more nodes by comparing the neighborhoods of the de-anonymized nodes in the auxiliary graph and the target graph (propagation).
Deriving Link Structure of Entire Network [Korolova ICDE08]
  • A different threat in which
    • An adversary subverts user accounts to get local neighborhoods and pieces them together to build the entire network.
    • No underlying network is released.
  • A registered user often can see all the links and nodes incident to him within distance d from him.
    • d=0 if a user can see who he links to.
    • d=1 if a user can also see who links to all his friends.
  • Analysis showed that the number of local neighborhoods needed to cover a fraction of the entire network drops exponentially as the lookahead parameter d increases.
Outline
  • Attacks on Naively Anonymized Graph
  • Privacy Preserving Social Network Publishing
    • K-anonymity
    • Generalization
    • Randomization
    • Other Works
  • Output Perturbation
    • Background on differential privacy
    • Accurate analysis of private network data
Privacy Preserving Social Network Publishing
  • Naïve anonymization is not sufficient to prevent privacy breaches, mainly due to link structure based attacks.
  • Graph topology has to be modified via
    • Adding/deleting edges/nodes
    • Grouping nodes/edges into super-nodes and super-edges
  • How to quantify utility loss and privacy preservation in the perturbed (and anonymized) graph?
Graph Utility
  • Utility heavily depends on mining tasks.
  • It is challenging to quantify the information loss in the perturbed graph data.
    • Unlike tabular data, we cannot use the sum of the information loss of each individual record.
    • We cannot use histograms to approximate the distribution of graph topology.
  • It is more challenging when considering both structure change and node attribute change.
Graph Utility
  • Topological features:
    • Structural characteristics of the graph.
    • Various measures from different perspectives.
    • Commonly used.
  • Spectral features:
    • Defined as eigenvalues of the graph's adjacency matrix or other derived matrices.
    • Closely related to many topological features.
    • Can provide global graph measures.
  • Aggregate queries:
    • Calculate the aggregate on some paths or subgraphs satisfying the query condition.
    • E.g.: the average distance from a medical doctor vertex to a teacher vertex in a network.
Topological Features
  • Topological features of networks
      • Harmonic mean of shortest distance
      • Transitivity (clustering coefficient)
      • Subgraph centrality
      • Modularity (community structure)
      • And many others (refer to: F. Costa et al., Characterization of Complex Networks: A Survey of measurements, 2006)
Graph and Matrix
  • Adjacency matrix
    • For an undirected graph, A is symmetric;
    • No self-links: the diagonal entries of A are all 0;
    • For an unweighted graph, A is a 0-1 matrix.
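A minimal sketch of these properties (NumPy assumed; the 4-cycle is illustrative):

```python
import numpy as np

def adjacency_matrix(n, edges):
    """Adjacency matrix of a simple undirected, unweighted graph:
    a symmetric 0-1 matrix with an all-zero diagonal (no self-links)."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1   # undirected: set both entries
    return A

# A 4-cycle: 0-1-2-3-0
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
```

The row sums of A give the node degrees, which is why so many topological features can be read off this matrix.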
Spectral Features
  • Spectral features of networks
    • Adjacency spectrum: the eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_n of the adjacency matrix A.
    • Laplacian spectrum: the eigenvalues 0 = μ_1 ≤ μ_2 ≤ … ≤ μ_n of L = D − A, where D is the diagonal degree matrix.
Topological vs. Spectral Features
  • Adjacency and Laplacian spectrum:
    • The maximum degree, chromatic number, clique number, etc. are related to λ_1, the largest adjacency eigenvalue;
    • The epidemic threshold for virus propagation in the network is related to 1/λ_1;
    • The Laplacian spectrum indicates the community structure:
      • k disconnected communities: the k smallest Laplacian eigenvalues equal 0;
      • k loosely connected communities: the k smallest Laplacian eigenvalues are close to 0.
Topological vs. Spectral Features
  • Laplacian spectrum & communities
    • Disconnected communities: k connected components give exactly k zero Laplacian eigenvalues.
    • Loosely connected communities: k communities give k Laplacian eigenvalues close to 0.

Topological vs. Spectral Features
  • Eigenspace [Ying SDM09]
Topological vs. Spectral Features
  • Topological & spectral features are related
    • No. of triangles: equal to (1/6)·Σ_i λ_i³ = trace(A³)/6;
    • Subgraph centrality: SC = (1/n)·Σ_i e^{λ_i};
    • Graph diameter: for a connected graph, the diameter is less than the number of distinct adjacency eigenvalues.
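A quick numerical check of the triangle count via the adjacency spectrum, on a hypothetical 4-node graph:

```python
import numpy as np

# Hypothetical graph: one triangle (0,1,2) plus a pendant edge (2,3).
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

lam = np.linalg.eigvalsh(A)                  # adjacency spectrum
triangles_spectral = (lam ** 3).sum() / 6    # (1/6) * sum of lambda_i^3
triangles_trace = np.trace(A @ A @ A) / 6    # trace(A^3)/6 counts triangles
```

Both quantities agree because trace(A³) equals the sum of the cubed eigenvalues, and each triangle contributes six closed walks of length 3.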
Outline
  • Attacks on Naively Anonymized Graph
  • Privacy Preserving Social Network Publishing
    • K-anonymity
    • Generalization
    • Randomization
    • Other Works
  • Output Perturbation
    • Background on differential privacy
    • Accurate analysis of private network data
K-anonymity Privacy Preservation
  • K-anonymity (Sweeney)
    • Each individual is identical with at least K-1 other individuals
  • A general definition for network data [Hay, VLDB08]
K-anonymity
  • Each node is identical with at least K-1 other nodes under topology-based attacks.
  • The adversary is assumed to have some knowledge of the target user:
    • node degree (K-degree)
    • (immediate) neighborhood (K-neighborhood)
    • arbitrary subgraph (K-automorphism etc.)
  • K-anonymity approach guarantees that no node in the released graph can be linked to a target individual with success prob. greater than 1/K.
K-degree Anonymity [Liu SIGMOD08]
  • Attacking model:

The attackers know the degree of the targeted individual

K-degree Anonymity [Liu SIGMOD08]
  • K-degree anonymous: every node has the same degree as at least K-1 other nodes.
  • Optimize utility: minimize the number of added edges.
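The degree-sequence step can be sketched with a simplified greedy grouping (Liu and Terzi use an optimal dynamic program; the grouping below is a plausible approximation, not their algorithm):

```python
def anonymize_degree_sequence(degrees, k):
    """Return a k-anonymous degree sequence obtained by only increasing
    degrees: sort descending, cut into consecutive groups of size >= k,
    and raise every degree in a group to the group's maximum."""
    d = sorted(degrees, reverse=True)
    n, out, i = len(d), [], 0
    while i < n:
        # Take k degrees; absorb the tail if fewer than k would remain.
        j = i + k if n - (i + k) >= k else n
        out.extend([d[i]] * (j - i))   # d[i] is the group maximum
        i = j
    return out
```

For [5, 4, 3, 2, 2, 1] and k = 2 this yields [5, 5, 3, 3, 2, 2], in which every degree value occurs at least twice; the total degree increase lower-bounds the number of edge additions needed.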
K-neighborhood Anonymity [Zhou ICDE08]
  • Attacking model

The attackers know the immediate neighborhood of a targeted individual.

K-neighborhood Anonymity [Zhou ICDE08]
  • Algorithm outline
    • Extract the neighborhoods of all vertices in the network.
    • Compare and test all neighborhoods by neighborhood component coding
    • Organize vertices into groups and anonymize the neighborhoods of vertices in the same group until the graph satisfies K-neighborhood anonymity.

(Figure: the 1-neighborhood of Ada, the naively anonymized graph, and its K-neighborhood anonymous version.)

K-neighborhood Anonymity [Zhou ICDE08]
  • Graph utility:
      • The nodes have hierarchical label information.
      • Two ways to anonymize the neighborhoods: generalizing labels and adding edges.
      • Answer aggregate network queries as accurately as possible.
K-automorphism Anonymity [Zou VLDB09]
  • Attacking model:

The attackers can know any subgraph that contains the targeted individual.

    • Graph automorphism
K-automorphism Anonymity [Zou VLDB09]
  • Algorithm outline
    • Partition graph G into several groups of subgraphs; each group contains at least K subgraphs, and no two subgraphs share a node.
    • Block Alignment: make subgraphs within each group isomorphic to each other.
    • Edge Copy: copy the edges across the subgraphs properly.
K-symmetry Model [Wu EDBT10]
  • Attacking model:

The attackers can know any subgraph that contains the targeted individual.

    • K-symmetry approach:
    • A concept similar to K-automorphism (equivalent?)
    • Make the graph K-symmetric by adding fake nodes
K-isomorphism Model [Cheng SIGMOD10]
  • Attacking model:

The attackers can know any subgraph that contains the targeted individual.

  • Insufficient protection on link privacy by K-automorphism approach

Example: the adversary cannot identify Alice or Bob, but there must be a link between them.

K-anonymity

Privacy

protection

K-security

K-automorphism

K-symmetry

K-neighborhood

K-degree

Utility preservation

K-obfuscation [Bonchi ICDE11]

Both cases respect 2-candidate anonymity.

  • K-candidate aims at guaranteeing a lower bound on the amount of uncertainty.
  • K-obfuscation measures the uncertainty.
  • The obfuscation level quantified by entropy is always at least the level computed from a-posteriori belief probabilities.
Outline
  • Attacks on Naively Anonymized Graph
  • Privacy Preserving Social Network Publishing
    • K-anonymity
    • Generalization
    • Randomization
    • Other Works
  • Output Perturbation
    • Background on differential privacy
    • Accurate analysis of private network data
Generalization Approach [Hay VLDB08]
  • Generalize nodes into super nodes and edges into super edges
Generalization Approach [Hay VLDB08]
  • The size of possible graph world:
  • Maximize the graph likelihood function

Simulated annealing algorithm

    • Start with a single partition containing all nodes;
    • Update the state by splitting/merging partitions or moving a node to another partition.
Anonymizing Rich Graphs [Bhagat VLDB09]
  • Hyper-graph

G(V,I,E) represents multiple types of interactions between entities

  • Attacking model

The attackers know part of the links and nodes in the graph

Anonymizing Rich Graphs [Bhagat VLDB09]
  • Algorithm outline
    • Sort the nodes according to the attributes;
    • Group nodes into super nodes satisfying the class safety property, with each super node of size at least K.

Class safety: each node cannot have interactions with two or more nodes from the same group

    • Replace the node identifiers by the label list.
Outline
  • Attacks on Naively Anonymized Graph
  • Privacy Preserving Social Network Publishing
    • K-anonymity
    • Generalization
    • Randomization
    • Other Works
  • Output Perturbation
    • Background on differential privacy
    • Accurate analysis of private network data
Basic Graph Randomization Operations
  • Rand Add/Del: randomly add k false edges and delete k true edges (no. of edges unchanged)
  • Rand Switch: randomly switch a pair of edges, and repeat it for k times (nodes’ degree unchanged)
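Minimal sketches of the two operations on an undirected edge set (the function names are hypothetical):

```python
import random

def rand_add_del(edges, n, k, seed=None):
    """Rand Add/Del: delete k true edges, add k false edges; |E| unchanged."""
    rng = random.Random(seed)
    original = {tuple(sorted(e)) for e in edges}
    kept = set(rng.sample(sorted(original), len(original) - k))  # delete k
    while len(kept) < len(original):                             # add k false edges
        i, j = rng.sample(range(n), 2)
        e = (min(i, j), max(i, j))
        if e not in original and e not in kept:
            kept.add(e)
    return kept

def rand_switch(edges, k, seed=None):
    """Rand Switch: repeat k times -- pick edges (t,w),(u,v) and rewire to
    (t,v),(u,w) when both are absent; every node's degree is preserved."""
    rng = random.Random(seed)
    es = {tuple(sorted(e)) for e in edges}
    for _ in range(k):
        (t, w), (u, v) = rng.sample(sorted(es), 2)
        e1, e2 = tuple(sorted((t, v))), tuple(sorted((u, w)))
        if len({t, w, u, v}) == 4 and e1 not in es and e2 not in es:
            es -= {(t, w), (u, v)}
            es |= {e1, e2}
    return es
```

Rand Switch skips a proposed switch whenever it would create a duplicate edge or a self-link, which is exactly what keeps the degree sequence intact.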
Randomization
  • Randomized response model [Warner 1965]
    • A: the set of respondents who cheated in the exam; Ā: those who didn’t.
    • Purpose: estimate the proportion π of population members who cheated in the exam.
  • Procedure:
    • A randomization device asks “Do you belong to A?” with probability p and “Do you belong to Ā?” with probability 1-p; only the respondent sees which question was chosen, and answers “Yes” or “No” truthfully.
    • As: P(Yes) = p·π + (1-p)·(1-π)
    • Unbiased estimate: π̂ = (P̂(Yes) - (1-p)) / (2p-1), for p ≠ 1/2.
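The device and estimator can be simulated as follows (the population proportion π = 0.3 and p = 0.75 are hypothetical):

```python
import random

def randomized_response(is_member, p, rng):
    """Warner's device: with prob. p ask 'Do you belong to A?', with
    prob. 1-p ask 'Do you belong to the complement of A?'; only the
    respondent sees which question was chosen. Returns True for 'Yes'."""
    asked_direct = rng.random() < p
    return is_member if asked_direct else not is_member

def estimate_pi(answers, p):
    """P(Yes) = p*pi + (1-p)*(1-pi), so the unbiased estimate is
    pi_hat = (P_hat(Yes) - (1-p)) / (2p - 1), valid for p != 1/2."""
    p_yes = sum(answers) / len(answers)
    return (p_yes - (1 - p)) / (2 * p - 1)

rng = random.Random(0)
true_pi, p, n = 0.3, 0.75, 20000
answers = [randomized_response(rng.random() < true_pi, p, rng)
           for _ in range(n)]
pi_hat = estimate_pi(answers, p)  # close to 0.3 for large n
```

No individual answer reveals the respondent's true status, yet the aggregate estimator is unbiased.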

Randomization [Agrawal SIGMOD00]

  • Each user randomizes sensitive values locally before release, e.g., by adding a random number to Age: Alice’s age 30 becomes 65 (30+35), so “30 | 70K | ...” is released as “65 | 20K | ...” and “50 | 40K | ...” as “25 | 60K | ...”.
  • The miner reconstructs the distribution of Age, the distribution of Salary, and so on from the randomized values, and feeds the reconstructed distributions to a classification algorithm to build a model.

Reconstruction

[Agrawal SIGMOD00]

  • Given
    • x1+y1, x2+y2, ..., xn+yn, where the xi are the original values
    • the probability distribution of the noise Y
  • Estimate the probability distribution of X.
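A discretized sketch of this iterative Bayesian reconstruction (the age data and the uniform noise below are hypothetical; the update rule follows Agrawal and Srikant's formula):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: ages clustered at 25/45/65, noise uniform on [-30, 30].
n = 5000
x = rng.choice([25, 45, 65], size=n, p=[0.5, 0.3, 0.2]) + rng.normal(0, 3, n)
half_width = 30.0
w = x + rng.uniform(-half_width, half_width, n)   # released values x + y

# Discretize the support of X into unit-width bins on [0, 100].
bins = np.linspace(0, 100, 101)
mids = (bins[:-1] + bins[1:]) / 2
width = bins[1] - bins[0]

def f_noise(z):
    """Density of the uniform noise Y on [-30, 30]."""
    return np.where(np.abs(z) <= half_width, 1 / (2 * half_width), 0.0)

# Iterative update: f^{j+1}(a) = (1/n) * sum_i f_Y(w_i - a) f^j(a)
#                                / sum_z f_Y(w_i - z) f^j(z) dz
fx = np.full(len(mids), 1.0 / (len(mids) * width))   # uniform starting prior
K = f_noise(w[:, None] - mids[None, :])              # n x bins kernel matrix
for _ in range(30):
    denom = (K * fx * width).sum(axis=1, keepdims=True)  # per-record normalizer
    denom[denom == 0] = 1e-12
    fx = (K / denom).mean(axis=0) * fx
    fx /= (fx * width).sum()                             # renormalize density

# fx now approximates the density of X on the bin centers `mids`.
```

Despite the heavy uniform noise, the reconstructed density recovers the cluster structure of X, which is what makes aggregate mining on randomized data possible.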
Randomization on Graph
  • Link privacy
    • the probability that a link (i,j) exists, given the perturbed graph
  • Feature preservation randomization
    • Spectrum preserving randomization
    • Markov chain based feature preserving randomization
  • Reconstruction from randomized graph
Link Privacy: Posterior Beliefs [Ying PAKDD09]
  • Prior probability:
  • Posterior probabilities
Link Privacy: Posterior Beliefs [Ying PAKDD09]
  • Posterior probability and similarity measures

Similarity and proportion of existing edges – before randomization

Similarity and proportion of existing edges – after randomization

Link Privacy: Posterior Beliefs [Ying PAKDD09]
  • Posterior probability and similarity measures
  • How to calculate posterior probability for general cases?

(Figure: the prior probability and the two posterior probabilities.)

Link Privacy: Graph Space [Ying SDM09]
  • Exploit graph space to breach link privacy
Link Privacy: Graph Space [Ying SDM09]
  • Sample the graph space when the space is large
        • Start with the randomized graph, construct a Markov chain, and uniformly sample the graph space.
        • Generate N uniform graph samples

Empirical evaluations show that node pairs with highest probabilities have serious link disclosure risk (as high as 90%).

Graph Features Under Pure Randomization
  • Topological and spectral features change significantly along the randomization.

Can we better preserve the network structure?

(Network of US political books, 105 nodes and 441 edges)

Spectrum Preserving Randomization [Ying SDM08]
  • Graph spectrum is related to many real graph features.
  • Preserve graph features by preserving some eigenvalues.
Spectrum Preserving Randomization [Ying SDM08]
  • Spectral Switch (apply to adjacency matrix):

Up-switch to increase the eigenvalue:

Down-switch to decrease the eigenvalue:

Spectrum Preserving Randomization [Ying SDM08]
  • Spectral Switch (apply to Laplacian matrix):

Down-switch to decrease the eigenvalue:

Up-switch to increase the eigenvalue:

Markov Chain Based Feature Preserving Randomization [Ying SDM09]
  • Preserve any graph feature S(G) within a small range
  • Feature range constraint specified by the user
  • Markov chain with feature range constraint

(uniformity on accessible graphs)

Markov Chain Based Feature Preserving Randomization [Ying SDM09]
  • Feature constraint can be used to breach link privacy

(Figure: the original graph and the released graph.)

Reconstruction from Randomized Graph [Wu SDM10]
  • Motivation

From the randomized graph, can we reconstruct a graph whose features are closer to the true features?

Reconstruction from Randomized Graph [Wu SDM10]
  • Low rank approximation approach
    • Best rank r approximation by eigen-decomposition:
    • Discretize the low rank matrix
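A sketch of both steps, with one plausible discretization rule (keep the m largest reconstructed entries, m = original edge count; the paper's exact rule may differ):

```python
import numpy as np

def low_rank_reconstruct(A, r):
    """Rank-r eigen-approximation of adjacency matrix A, discretized back
    to a symmetric 0-1 matrix by keeping the m largest reconstructed
    upper-triangle entries, where m is the original edge count."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(vals))[::-1][:r]        # r largest |eigenvalues|
    Ar = (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T
    m = int(A.sum()) // 2                           # original number of edges
    iu = np.triu_indices_from(A, k=1)
    top = np.argsort(Ar[iu])[::-1][:m]              # top-m candidate edges
    B = np.zeros_like(A)
    B[iu[0][top], iu[1][top]] = 1
    return B + B.T

# Two disjoint triangles: a rank-2 structure, recovered exactly at r = 2.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
B = low_rank_reconstruct(A, 2)
```

Keeping eigenvalues largest in magnitude (not just the positive ones) is what allows significant negative eigenvalues to contribute, as discussed on the following slides.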
Reconstruction from Randomized Graph [Wu SDM10]
  • Effect of including significant negative eigenvalues

(Figure: the original graph and its reconstructions for r = 1, 2, 4.)

Reconstruction from Randomized Graph [Wu SDM10]
  • Feature value of the original graph, randomized graph, and the reconstructed graph


Reconstruction from Randomized Graph [Wu SDM10]
  • Reconstructed graphs do not jeopardize link privacy for real-world networks.

-- Privacy measured by the proportion of different edges:


Reconstruction from Randomized Graph [Wu SDM10]
  • Graphs with low rank may have privacy breached by reconstruction:
  • Reconstruction on synthetic low rank graphs


Reconstructing Randomized Social Networks & Features [Vuokko SDM10]
  • Graph and feature data
    • Original Graph
    • Binary feature matrix
    • Two individuals with higher similarity in features are more likely to be connected in the graph.
  • Reconstruction problem
    • The maximum likelihood estimation is adopted to reconstruct the original graph and features.


Random Sparsification [Bonchi ICDE11]
  • Only removes edges from the graph, without adding new edges.
  • Outperforms Rand Add/Del in terms of utility preservation, partly due to the small-world phenomenon:
    • adding random long-haul edges brings nodes close together,
    • while removing an edge does not push nodes much farther apart, since alternative paths exist.
  • Studies the utility vs. privacy trade-off for various randomization strategies.
Outline
  • Attacks on Naively Anonymized Graph
  • Privacy Preserving Social Network Publishing
    • K-anonymity
    • Generalization
    • Randomization
    • Other Works
  • Output Perturbation
    • Background on differential privacy
    • Accurate analysis of private network data
Edge-weighted Graph
  • Edge weights could be sensitive, e.g., trustworthiness of user A according to user B, or transaction amount between two accounts.
Anonymizing Edge-weighted Graph [Das TR09]
  • Some properties of edge weights in terms of some functions are preserved.
    • Relative distances between nodes for shortest paths or kNN queries.
  • A framework for edge weight anonymization of graph data that preserves linear properties.
    • A linear property can be expressed by a specific set of linear inequalities of edge weights.
    • Finding new weights for each edge is a linear programming problem.
Gaussian Randomization [Liu SDM09]
  • Perturb edge weights while preserving global and local utilities. Graph structure is unchanged.
  • Gaussian randomization multiplication
    • The original weight of each edge is multiplied by a random Gaussian noise with mean 1 and some variance.
    • In the original graph, if the shortest distance d(A,B) is much smaller than d(C,D), the order is highly likely to be preserved.
  • Greedy perturbation
    • Preserve a set of shortest distances.
Anonymizing Multi-graphs [Li SDM11]
  • How to generate an anonymized collection of graphs where each graph corresponds to an individual’s behavior.
    • XML representation of attributes about an individual
    • Click-graph in a user-session
    • Route for a given individual in a time-period
  • Condensation based approach
    • create constrained clusters of size at least K
    • construct a super-template to represent properties of the group
    • generate anonymized graphs from super-template
Computing Privacy Scores [Liu ICDM09]
  • The privacy score measures the user’s potential privacy risk due to her online information sharing behaviors. It increases with
    • sensitivity of the information being shared
    • visibility of the revealed information in the network
Computing Privacy Scores [Liu ICDM09]
  • Item Response Theory based model
    • Used to measure the abilities of examinees, the difficulty of the questions, and the prob. of an examinee to answer a question correctly.
    • Each examinee is mapped to a user, and each question is mapped to a profile item. The difficulty parameter is to quantify the sensitivity of a profile item.
    • The true visibility is estimated from observed profiles.
Outline
  • Attacks on Naively Anonymized Graph
  • Privacy Preserving Social Network Publishing
    • K-anonymity
    • Generalization
    • Randomization
    • Other Works
  • Output Perturbation
    • Background on differential privacy
    • Accurate analysis of private network data
Output Perturbation

The data miner submits a query f to the data owner, who returns the query result plus noise.

Differential Guarantee [Dwork, TCC06]

f = count(#cancer)

On database x the curator K releases f(x) + noise = 3 + noise; on database x' it releases f(x') + noise = 2 + noise. The two databases (x, x') differ in only one row.

Differential Guarantee
  • Require that the prob. distribution is essentially the same independent of whether any individual opts in to, or opts out of the database.
  • Anything that can be learned about a respondent from a statistical database should be learnable without access to the database.
  • Independent of adversary knowledge.
  • Different from prior work on comparing an adversary's prior and posterior views of an individual.
Outline
  • Attacks on Naively Anonymized Graph
  • Privacy Preserving Social Network Publishing
    • K-anonymity
    • Generalization
    • Randomization
    • Other Works
  • Output Perturbation
    • Background on differential privacy
    • Accurate analysis of private network data
Differential Privacy

[Dwork TCC06 & Dwork CACM11]

  • ε-differential privacy: a randomized mechanism K gives ε-differential privacy if, for all neighboring datasets x and x' and all output sets S, Pr[K(x) ∈ S] ≤ e^ε · Pr[K(x') ∈ S].

ε is a privacy parameter: smaller ε = stronger privacy

  • Two neighboring datasets are defined in terms of
    • Hamming distance: x' is obtained from x by modifying one row;
    • Symmetric difference: |(x − x') ∪ (x' − x)| = 1, i.e., one row is added or removed.
Calibrating Noise
  • Laplace distribution: density p(z) ∝ exp(−|z|/b) with scale parameter b.
  • Sensitivity of a function f:
    • global sensitivity: the maximum of ||f(x) − f(x')||_1 over all neighboring datasets x, x'
    • local sensitivity: the same maximum with x fixed at the actual input
  • Releasing f(x) + Lap(Δf/ε), with Δf the global sensitivity, gives ε-differential privacy; for multiple queries the privacy parameters add up (sequential composition).
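A minimal sketch of the resulting Laplace mechanism (the count value 3 and the ε are illustrative):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Release f(x) + Lap(sensitivity/epsilon); this satisfies
    epsilon-differential privacy when `sensitivity` is the query's
    global sensitivity."""
    return true_answer + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(0)
# A count query has global sensitivity 1: one row changes the count by <= 1.
noisy_count = laplace_mechanism(3, sensitivity=1, epsilon=0.5, rng=rng)
```

The noise is unbiased, so repeated releases of the same query average back to the true count; this is why ε must be budgeted across all queries combined.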
Sensitivity
  • L-1 distance for vector output
  • Complex functions or data mining tasks can be decomposed to a sequence of simple functions.
Differential Guarantee

f = count(#cancer)

The curator K releases f(x) + Lap(1/ε). (Figure: the difference in output probabilities for ε = 2 vs. ε = 1.)
Neighboring Datasets
  • Two neighboring datasets are defined in terms of
    • Hamming distance: one row is modified;
    • Symmetric difference: |(x − x') ∪ (x' − x)| = 1, one row added or removed.
  • How about two datasets differing by k rows?
Histogram Query

SELECT count(*)
FROM table
GROUP BY disease

For the groups (cancer, heart, flu) the true answer is [3, 2, 1]; the released answer is [3 + Lap(1/ε), 2 + Lap(1/ε), 1 + Lap(1/ε)].

Recent Development
  • [Xiao ICDE10] Dependencies among the queries can be exploited to improve the accuracy of responses.
  • [Li PODS10] Matrix mechanism for answering a workload of predicate counting queries
  • [Kifer SIGMOD11] Misconceptions of differential privacy.
Outline
  • Attacks on Naively Anonymized Graph
  • Privacy Preserving Social Network Publishing
    • K-anonymity
    • Generalization
    • Randomization
    • Other Works
  • Output Perturbation
    • Background on differential privacy
    • Accurate analysis of private network data
Private Query Answering on Networks [Hay ICDM09]
  • Two neighboring graphs can be defined to differ by a single edge, K edges, or a single node.
  • edge -differential privacy
    • query output is indistinguishable whether any single edge is present or absent.
  • K-edge -differential privacy
    • query output is indistinguishable whether any set of k edges is present or absent.
  • node -differential privacy
    • query output is indistinguishable whether any single node (and all its edges) is present or absent.

Lap(f/)

Lap(f K/)

Degree Sequence
  • The list of degrees of each node in a graph
  • The degree sequence of a network may be sensitive: combined with other graph statistics, it can be used to determine the graph structure
Two Equivalent Queries

Degree sequence D(G)=[1,1,3,3,3,3,2]

D(G’)=[1,1,3,3,2,2,2]

ΔD = 2; add Lap(2/ε) to each component

F: number of nodes with degree i, for i = 0, …, n−1

F(G)=[0,2,1,4,0,0,0]

F(G’)=[0,2,3,2,0,0,0]

ΔF = 4; add Lap(4/ε) to each component
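Both sensitivities can be verified directly on the slide's example, where G' is G with one edge removed (a sketch; the brute-force helpers below are mine):

```python
from collections import Counter

def degree_histogram(degrees, n):
    """F[i] = number of nodes with degree i, for i = 0, ..., n-1."""
    c = Counter(degrees)
    return [c.get(i, 0) for i in range(n)]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

D_G  = [1, 1, 3, 3, 3, 3, 2]   # degree sequence of G
D_Gp = [1, 1, 3, 3, 2, 2, 2]   # G': one edge removed, two degrees drop by 1

F_G, F_Gp = degree_histogram(D_G, 7), degree_histogram(D_Gp, 7)

print(l1(D_G, D_Gp))   # 2: one edge touches exactly two endpoints
print(F_G, F_Gp)       # [0, 2, 1, 4, 0, 0, 0] vs [0, 2, 3, 2, 0, 0, 0]
print(l1(F_G, F_Gp))   # 4: each endpoint moves between two histogram buckets
```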

Boosting Accuracy [Hay ICDM09]

  • Rewrite query D to get a query S with constraint set Cs
  • Submit S to the private mechanism K
  • Receive the perturbed answer A(S)
  • Perform inference on A(S) with constraints Cs to derive a better estimation

Formulating Query D

Degree sequence D(G)=[1,1,3,3,3,3,2]

D(G’)=[1,1,3,3,2,2,2]

ΔD = 2; add Lap(2/ε) to each component

S: return the i-th smallest degree (the degree sequence in sorted order)

S(G)=[1,1,2,3,3,3,3]

Perturbed answer could be [3, 2, …], i.e., each component plus Lap(2/ε) noise

A new (and more accurate) sequence can be derived by computing the closest non-decreasing sequence.
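The closest non-decreasing sequence (in least squares) can be computed with the pool-adjacent-violators algorithm. A sketch; the noisy input values below are made up for illustration:

```python
def closest_nondecreasing(seq):
    """Pool-adjacent-violators: L2-closest non-decreasing sequence."""
    blocks = []  # each block is (sum of values, count); block means must be sorted
    for v in seq:
        total, count = float(v), 1
        # Merge backwards while the previous block's mean exceeds ours.
        while blocks and blocks[-1][0] / blocks[-1][1] > total / count:
            t, c = blocks.pop()
            total += t
            count += c
        blocks.append((total, count))
    out = []
    for total, count in blocks:
        out.extend([total / count] * count)
    return out

# Noisy answer to the sorted-degree query S (true S(G) = [1,1,2,3,3,3,3]):
noisy = [1.4, 0.6, 2.3, 3.5, 2.8, 3.1, 3.3]
print(closest_nondecreasing(noisy))
```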

Accurate Motif Analysis
  • Measures the frequency of occurrence of small subgraphs in a network, e.g., # of triangles

[Figure] Two graphs that differ in a single edge: one has n−2 triangles, the other has 0. One edge can change the triangle count by n−2: high sensitivity!
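This worst case can be checked by brute force. The two-hub construction below is my own assumption matching the slide's numbers: nodes 0 and 1 are each connected to all n−2 other nodes, and the single edge (0, 1) decides whether the graph has 0 or n−2 triangles.

```python
from itertools import combinations

def triangles(n, edges):
    """Count triangles by brute force over all node triples."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return sum(1 for a, b, c in combinations(range(n), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

n = 8
# Hubs 0 and 1 are each linked to every node 2..n-1, but not to each other.
base = [(0, k) for k in range(2, n)] + [(1, k) for k in range(2, n)]

print(triangles(n, base))             # 0: no triple is fully connected
print(triangles(n, base + [(0, 1)]))  # 6 = n-2: one edge creates n-2 triangles
```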

Weakening Privacy
  • Statistics such as transitivity, clustering coefficient, centrality, and path-lengths have high sensitivity values.
  • Possible techniques
    • [Nissim STOC07] Smooth sensitivity, adopting local sensitivity, i.e., the max. change between Q(I) and Q(I’) for any I’ in neighbor(I).
    • [Rastogi PODS09] Adversarial privacy, limiting assumptions about the prior knowledge of the adversary
  • More exploration is needed on robust statistics and differential privacy.
Model based Data Publishing

Pipeline from data owner to data miner:

  • Build models (e.g., contingency table, power-law graph)
  • Release differentially private model parameters (through the mechanism K)
  • Generate synthetic data using the models with perturbed parameters
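A minimal end-to-end sketch of this pipeline, assuming the model is a one-attribute contingency table (the disease counts from the earlier histogram slide) and synthetic rows are sampled from the perturbed, renormalized counts:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_model_release(counts, epsilon):
    # Step 2: release differentially private model parameters.
    # Histogram counts have sensitivity 1 under add/remove-one-row neighbors.
    noisy = [max(c + laplace_noise(1.0 / epsilon), 0.0) for c in counts]
    total = sum(noisy)
    if total == 0.0:                       # degenerate case: fall back to uniform
        return [1.0 / len(counts)] * len(counts)
    return [x / total for x in noisy]      # renormalize into a distribution

def synthesize(probs, n_rows, categories):
    # Step 3: generate synthetic rows from the perturbed model.
    return random.choices(categories, weights=probs, k=n_rows)

# Step 1 (data owner): fit the model, e.g. disease counts [cancer, heart, flu].
probs = private_model_release([3, 2, 1], epsilon=1.0)
synthetic = synthesize(probs, n_rows=100, categories=["cancer", "heart", "flu"])
print(probs)
```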

Outline
  • Attacks on Naively Anonymized Graph
  • Privacy Preserving Social Network Publishing
    • K-anonymity
    • Generalization
    • Randomization
    • Other Works
  • Output Perturbation
    • Background on differential privacy
    • Accurate analysis of private network data
References
  • [Agrawal, SIGMOD00] R. Agrawal and R. Srikant. Privacy-preserving data mining. SIGMOD, 2000.
  • [Aggarwal, 08] C. C. Aggarwal and P. S. Yu. Privacy-preserving data mining: models and algorithms. Springer, 2008.
  • [Backstrom, WWW07] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns and structural steganography. WWW, 2007.
  • [Bhagat, VLDB09] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava. Class-based graph anonymization for social network data. VLDB, 2009.
  • [Bonchi, ICDE11] F. Bonchi, A. Gionis, and T. Tassa. Identity Obfuscation in Graphs Through the Information Theoretic Lens. ICDE, 2011.
  • [Campan, PinKDD08] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social network data. PinKDD, 2008.
  • [Cheng, SIGMOD10] J. Cheng, A. Fu, and J. Liu. K-isomorphism: privacy preserving network publication against structural attacks. SIGMOD, 2010.
  • [Cormode, VLDB08] G. Cormode, D. Srivastava, T. Yu, and Q. Zhang. Anonymizing bipartite graph data using safe groupings. VLDB, 2008.
References
  • [Das, TR09] S. Das, O. Egecioglu, and A. El Abbadi. Anonymizing edge-weighted social network graphs. Technical report, 2009.
  • [Dwork, CACM11] C. Dwork. A firm foundation for private data analysis. CACM, 2011.
  • [Dwork, TCC06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. TCC, 2006.
  • [Gross, WPES05] R. Gross and A. Acquisti. Information revelation and privacy in online social networks (the Facebook case). WPES, 2005.
  • [Hanhijarvi, SDM09] S. Hanhijarvi, G. C. Garriga, and K. Puolamaki. Randomization techniques for graphs. SDM, 2009.
  • [Hay, VLDB08] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. VLDB, 2008.
  • [Hay, 07] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava. Anonymizing social networks. 2007.
References
  • [Hay, ICDM09] M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of private networks. ICDM, 2009.
  • [Kifer, SIGMOD11] D. Kifer and A. Machanavajjhala. No Free Lunch in Data Privacy. SIGMOD, 2011.
  • [Korolova, ICDE08] A. Korolova, R. Motwani, S. Nabar, and Y. Xu. Link privacy in social networks. ICDE, 2008.
  • [Li, PODS10] C. Li, M. Hay, V. Rastogi, G. Miklau, A. McGregor. Optimizing linear counting queries under differential privacy. PODS, 2010.
  • [Liu, SIGMOD08] K. Liu and E. Terzi. Towards identity anonymization on graphs. SIGMOD, 2008.
  • [Liu, ICDM09] K. Liu and E. Terzi. A framework for computing the privacy scores of users in online social networks. ICDM, 2009.
  • [Liu, SDM09] L. Liu, J. Wang, J. Liu, and J. Zhang. Privacy preserving in social networks against sensitive edge disclosure. SDM, 2009.
  • [McSherry, FOCS07] F. McSherry and K. Talwar. Mechanism design via differential privacy. FOCS, 2007.
References
  • [Narayanan, 09] A. Narayanan and V. Shmatikov. De-anonymizing social networks. IEEE Symposium on Security and Privacy, 2009.
  • [Newman, PRE06] M. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 2006.
  • [Vuokko, SDM10] N. Vuokko and E. Terzi. Reconstructing randomized social networks. SDM, 2010.
  • [Wu, SDM10] L. Wu, X. Ying, and X. Wu. Reconstruction of randomized graph via low rank approximation. SDM, 2010.
  • [Wu, EDBT11] W. Wu, Y. Xiao, W. Wang, Z. He, and Z. Wang. k-symmetry model for identity anonymization in social networks. EDBT, 2011.
  • [Wu, 09] X. Wu, X. Ying, K. Liu, and L. Chen. A Survey of Algorithms for Privacy-Preservation of Graphs and Social Networks. 2009.
  • [Ying, SDM08] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. SDM, 2008.
  • [Ying, SDM09] X. Ying and X. Wu. Graph generation with prescribed feature constraints. SDM, 2009.
References
  • [Ying, SDM09-2] X. Ying and X. Wu. On randomness measures for social networks. SDM, 2009.
  • [Ying, PAKDD09] X. Ying and X. Wu. On link privacy in randomizing social networks. PAKDD, 2009.
  • [Xiao, ICDE10] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transformation. ICDE, 2010.
  • [Zheleva, PinKDD07] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in graph data. PinKDD, 2007.
  • [Zhou, ICDE08] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. ICDE, 2008.
  • [Zou, VLDB09] L. Zou, L. Chen, and M. T. Ozsu. K-automorphism: A general framework for privacy preserving network publication. VLDB, 2009.

Thank You!

Questions?

Acknowledgments

This work was supported in part by U.S. National Science Foundation grants IIS-0546027, CNS-0831204, and CCF-1047621.

Update version: http://dpl.sis.uncc.edu/ppsn-tut.PDF