Approximation Algorithms for k-Anonymity

Approximation Algorithms for k-Anonymity • Authors: Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, An Zhu • Presented by Paul Yelton

Outline • Review of k-Anonymity • NP-hardness of k-Anonymity with Suppression • Algorithm for general k-Anonymity • Improved Algorithm for 2-Anonymity • Improved Algorithm for 3-Anonymity • Conclusion

Review of k-Anonymity

Review of k-Anonymity • Suppress or generalize some entries so that, for each tuple, “... there are at least k-1 other tuples in the modified table that are identical to it along the quasi-identifying attributes” • The objective is to minimize the extent of suppression and generalization.

Review of k-Anonymity • Example: the previous table anonymized with k=2 • Values are suppressed with *s to provide anonymization • k must be chosen according to the application to ensure the required level of privacy.

Review of k-Anonymity • The input is a table T with n rows and m quasi-identifying attributes • T is treated as a set of n m-dimensional vectors $x_1,\dots,x_n$ • The focus is on a special case, k-Anonymity with Suppression, where only suppression is performed • A suppression function t maps each $x_i$ to $t(x_i)$ by replacing some of its quasi-identifier values with *s • t is k-Anonymous if every $t(x_i)$ is identical to at least k-1 other modified rows, so t partitions the n row vectors into clusters of size ≥ k

Review of k-Anonymity • Formal Definition: k-Anonymity with Suppression • Given: $x_1,x_2,\dots,x_n \in \Sigma^m$ and an anonymity parameter k • Find a suppression function t such that, for every $x_i$, there are at least k-1 other vectors $x_p$ with $t(x_p) = t(x_i)$ • Minimize the cost c(t) • c(t) is the total number of *s over all modified rows $t(x_i)$
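The definition can be made concrete with a minimal Python sketch. It assumes rows are equal-length strings and that a clustering is already given as lists of row indices; the names `suppress`, `cost`, and `is_k_anonymous` are illustrative and do not come from the paper. Within each cluster, every column on which the rows disagree is replaced by *.

```python
def suppress(rows, clusters):
    """Return t(x_1), ..., t(x_n): star out columns where a cluster's rows differ."""
    out = [None] * len(rows)
    for cluster in clusters:
        members = [rows[i] for i in cluster]
        # Keep a column only if every member of the cluster agrees on it.
        pattern = "".join(
            col[0] if len(set(col)) == 1 else "*" for col in zip(*members)
        )
        for i in cluster:
            out[i] = pattern
    return out


def cost(suppressed):
    """c(t): total number of *s over all modified rows."""
    return sum(row.count("*") for row in suppressed)


def is_k_anonymous(suppressed, k):
    """Every modified row must occur at least k times in the modified table."""
    return all(suppressed.count(row) >= k for row in suppressed)


rows = ["0011", "0010", "1100", "1101"]
t = suppress(rows, [[0, 1], [2, 3]])
print(t, cost(t), is_k_anonymous(t, 2))
```

The optimization problem is to choose the clustering that minimizes `cost`; the sketch only evaluates one given clustering.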

Review of k-Anonymity • The next problem is k-Anonymity with Generalization, where suppression is also used • Example: an attribute named “Quality”.

Review of k-Anonymity • Formal Definition: k-Anonymity with Generalization • Given: $x_1,x_2,\dots,x_n \in \Sigma^m$ and k, create a generalization function h that maps each entry to a level of its attribute's generalization hierarchy • The generalization hierarchy of attribute j is $D_j^h$ for $1 \le h \le l_j$, with $D_j^0 = D_j$ • h(i,j) is the level to which entry j of $x_i$ is generalized • For every i, $h(x_i) = h(x_{i'})$ for at least k-1 indices $i' \ne i$ • $c(h) = \sum_i \sum_j h(i,j)/l_j$ • Note: k-Anonymity with Suppression is the special case $l_j = 1$ for all j
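A small sketch of the cost formula, under the assumption that the chosen levels h(i,j) are stored as a list of lists and the hierarchy depths $l_j$ as a list `levels` (both names are illustrative). Exact fractions avoid rounding.

```python
from fractions import Fraction


def generalization_cost(h, levels):
    """c(h) = sum over i, j of h[i][j] / l_j, with l_j = levels[j]."""
    return sum(
        Fraction(h_ij, levels[j])
        for row in h
        for j, h_ij in enumerate(row)
    )


# Attribute 0 has l_0 = 2 levels; attribute 1 has l_1 = 1,
# i.e. pure suppression (level 0 = original value, level 1 = *).
levels = [2, 1]
h = [[1, 0], [1, 0], [2, 1], [2, 1]]
print(generalization_cost(h, levels))
```

With every $l_j = 1$ the cost reduces to the number of suppressed entries, which matches the note that suppression is a special case.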

NP-hardness of k-Anonymity with Suppression

NP-hardness of k-Anonymity with Suppression • Present the proof of NP-hardness • Equivalence to edge partition into triangles and 4-stars • Equivalence to edge partition into triangles

NP-hardness of k-Anonymity with Suppression • Theorem 1: k-Anonymity with Suppression is NP-hard even for a ternary alphabet, for example Σ = {0,1,2} • Proof: by reduction from Edge Partition into Triangles: given a graph G = (V,E) with |E| = 3m for an integer m, can the edges of G be partitioned into m edge-disjoint triangles?

NP-hardness of k-Anonymity with Suppression • Construct a table T with 3m rows, one per edge of G • For each of the n vertices y1,...,yn of G, create an attribute/column • The row for an edge e = (a,b) has 1s in columns a and b and 0s elsewhere • The optimal 3-Anonymity cost for T is ≤ 9m if and only if E can be partitioned into m edge-disjoint triangles and 4-stars • Example:

T =               y1   y2   y3   ...  yn
e1  = (y1,y2)      1    1    0   ...   0
e2  = (y2,y3)      0    1    1   ...   0
...               ...  ...  ...  ...  ...
e3m = (y1,yn)      1    0    0   ...   1
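The construction of T can be sketched in a few lines of Python. Vertices are assumed to be numbered 1..n, and `edge_table` is an illustrative name rather than anything from the paper.

```python
def edge_table(n, edges):
    """One row per edge (a, b) over vertices 1..n: 1s in columns a and b, 0s elsewhere."""
    table = []
    for a, b in edges:
        row = [0] * n
        row[a - 1] = 1
        row[b - 1] = 1
        table.append(row)
    return table


# K4 has 6 = 3m edges, so m = 2.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
T = edge_table(4, edges)
for e, row in zip(edges, T):
    print(e, row)
```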

NP-hardness of k-Anonymity with Suppression • Consider a triangle with vertices y1, y2, y3. Suppressing the columns of those vertices yields a cluster of 3 rows with three *s in each modified row.

Original rows:                 Modified rows:
           y1   y2   y3                   y1   y2   y3
(y1,y2)     1    1    0        (y1,y2)     *    *    *
(y2,y3)     0    1    1        (y2,y3)     *    *    *
(y3,y1)     1    0    1        (y3,y1)     *    *    *

NP-hardness of k-Anonymity with Suppression • Consider a 4-star with vertices y1, y2, y3, y4, where y4 is the center vertex. Suppressing the columns of y1, y2, y3 yields a cluster of 3 rows with three *s in each modified row.

Original rows:                      Modified rows:
           y1   y2   y3   y4                   y1   y2   y3   y4
(y1,y4)     1    0    0    1        (y1,y4)     *    *    *    1
(y2,y4)     0    1    0    1        (y2,y4)     *    *    *    1
(y3,y4)     0    0    1    1        (y3,y4)     *    *    *    1

NP-hardness of k-Anonymity with Suppression • If E can be partitioned into triangles and 4-stars, the two constructions above give a solution of cost 9m • Conversely, because G is a simple graph, any three rows are distinct and differ in at least three positions • So there are at least three *s in each modified row, and the cost is ≥ 9m • A solution of cost exactly 9m has exactly three *s in every modified row, which forces clusters of exactly 3 rows • Two possibilities for such a cluster: • - The edges form either a triangle or a 4-star • - Modified rows in a triangle have three *s and 0's elsewhere, while modified rows in a 4-star have three *s, a single 1, and 0's elsewhere • Such a solution corresponds to a partition of the edges of the graph into triangles and 4-stars.
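The counting argument can be checked on a small case. The sketch below (helper names are illustrative, vertices assumed numbered 1..4) suppresses three-row clusters whose edges form a triangle, a 4-star, and, for contrast, a path that is neither; only the first two get away with three *s per row.

```python
def row_for(edge, n):
    """Row of T for an edge (a, b): 1 in columns a and b, 0 elsewhere."""
    a, b = edge
    return [1 if v in (a, b) else 0 for v in range(1, n + 1)]


def suppress_cluster(rows):
    """Star out every column on which the rows of the cluster disagree."""
    return ["*" if len(set(col)) > 1 else col[0] for col in zip(*rows)]


n = 4
triangle = [(1, 2), (2, 3), (3, 1)]
star = [(1, 4), (2, 4), (3, 4)]   # 4-star centred at vertex 4
path = [(1, 2), (2, 3), (3, 4)]   # neither a triangle nor a 4-star

for name, cluster in [("triangle", triangle), ("4-star", star), ("path", path)]:
    modified = suppress_cluster([row_for(e, n) for e in cluster])
    print(name, modified, modified.count("*"))
```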

NP-hardness of k-Anonymity with Suppression • Equivalence to edge partition into triangles • Create a table T' as a replication of T so that 4-stars are forced to pay more *s • Number of replicas: $t = \lceil \log_2(3m + 1) \rceil$ • Each row of T' has t blocks of n columns each • As before, an edge is e = (a,b) • Fix an arbitrary ordering of the edges in E and write the rank of e in binary notation as $e_1 e_2 \dots e_t$ • Each block of the row for e has 0's everywhere except in positions a and b • A block can be in 1 of 2 configurations: • - conf0 has a 1 in position a, and a 2 in position b • - conf1 has a 2 in position a, and a 1 in position b • The i-th block of the row for e is in configuration $\mathrm{conf}_{e_i}$
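A sketch of how a row of T' could be assembled. Two details are assumptions made for illustration and are not stated on the slide: ranks start at 1, and the binary rank is read most significant bit first. The function name `replicated_row` is also hypothetical.

```python
def replicated_row(edge, rank, n, t):
    """Row of T' for edge (a, b): t blocks of n columns, block i in configuration conf_{e_i}."""
    a, b = edge
    bits = format(rank, f"0{t}b")     # e_1 ... e_t, most significant bit first (assumed)
    row = []
    for bit in bits:
        block = [0] * n
        if bit == "0":                # conf0: 1 at position a, 2 at position b
            block[a - 1], block[b - 1] = 1, 2
        else:                         # conf1: 2 at position a, 1 at position b
            block[a - 1], block[b - 1] = 2, 1
        row.extend(block)
    return row


edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]   # 3m = 6 edges
n = 4
# For a positive integer N, N.bit_length() equals ceil(log2(N + 1)).
t = len(edges).bit_length()
table = [replicated_row(e, rank, n, t) for rank, e in enumerate(edges, start=1)]
for e, row in zip(edges, table):
    print(e, row)
```

Because every edge has a distinct rank, any two rows differ in the configuration of at least one block.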

NP-hardness of k-Anonymity with Suppression • [Figure: example table T' for a graph on vertices 1 to 4, with one row per edge, including (3,4), (1,4), (1,2), (1,3), (2,3), and three blocks (Block 1, Block 2, Block 3) of four columns each; entries are 0, 1, or 2.]

NP-hardness of k-Anonymity with Suppression • The optimal cost for T' is ≤ 9mt if and only if E can be partitioned into m edge-disjoint triangles • Every triangle in such a partition corresponds to a cluster with 3t *s in each modified row • The proof above shows that k-Anonymity is NP-hard with a ternary alphabet for k=3 • It can be extended to $k = \binom{r}{2}$ for r ≥ 3 • Replicating the graph in the reduction also allows an extension to $k = \alpha\binom{r}{2}$ for any integer α and r ≥ 3

Algorithm for general k-Anonymity

Algorithm for general k-Anonymity • Show an O(k)-approximation for the problem of k-Anonymity with Generalization • Create a graph • - edge-weighted complete graph G = (V,E) • - the vertex set V contains one vertex for each vector in the k-Anonymity instance • - $h_{a,b}(j)$ is the lowest level of generalization of attribute j at which vectors a and b agree, i.e. $h(a)_j = h(b)_j$ • - weight function $w(e) = \sum_j h_{a,b}(j)/l_j$ • Recall: attribute j has $l_j$ levels of generalization, and e refers to an edge e=(a,b)
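For the suppression-only case (every $l_j = 1$), $h_{a,b}(j)$ is 1 where the two vectors differ and 0 where they agree, so w(e) is simply the Hamming distance. A minimal sketch of the graph construction under that assumption, with illustrative names:

```python
from itertools import combinations


def edge_weight(a, b):
    """w(e) when every l_j = 1: the number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(rows):
    """Complete graph on row indices; maps each edge (i, j), i < j, to its weight."""
    return {
        (i, j): edge_weight(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }


rows = ["0011", "0010", "1100", "1101"]
weights = build_graph(rows)
print(weights)
```

A general hierarchy would need a per-attribute lookup of the lowest common level in place of the `x != y` test.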

Algorithm for general k-Anonymity • Limitations of the Graph Representation • - Some information about the structure of the problem is lost. • - Cannot achieve better than a Θ(k) approximation factor • Charge of a vertex: the total generalization cost of the vector it represents • OPT denotes the cost of an optimal k-Anonymity solution; for example, in the previous proof, OPT=9m • $F = \{T_1,T_2,\dots,T_s\}$ is a spanning forest containing all the vertices • Each $T_i$ is a tree with ≥ k vertices, and is a subgraph of G • The weight of $T_i$ is $W(T_i) = \sum_{e \in E(T_i)} w(e)$ • $c(F) = \sum_i |V(T_i)| \, W(T_i)$ • L is the size of the largest component

Algorithm for general k-Anonymity • Outline of the Algorithm • 1. Create a forest F with cost ≤ OPT, in which every tree has at least k vertices • 2. Compute a decomposition of the forest, where deleting edges is allowed, so that every component $T_i$ satisfies $k \le |V(T_i)| \le \max\{2k-1, 3k-5\}$

Algorithm for general k-Anonymity • The FOREST algorithm yields a forest in which every tree has at least k vertices and whose cost is at most OPT
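The FOREST listing itself did not survive on this slide, so the following is only a sketch of one greedy way to build such a forest, not a reproduction of the authors' pseudocode. Assumptions: each vertex gets at most one outgoing edge, a component with fewer than k vertices is extended from its one vertex that has no outgoing edge yet, and that vertex is linked to its nearest vertex outside the component. The name `build_forest` is hypothetical.

```python
def build_forest(weights, n, k):
    """Greedy sketch: merge components until each has at least k vertices.

    weights[(i, j)] with i < j is the edge weight. Returns directed edges
    (u, v); every vertex has at most one outgoing edge.
    """
    assert n >= k

    def w(u, v):
        return weights[(min(u, v), max(u, v))]

    comp = list(range(n))        # component label of each vertex
    out_edge = [None] * n        # the single outgoing edge of each vertex
    while True:
        sizes = {}
        for label in comp:
            sizes[label] = sizes.get(label, 0) + 1
        small = [label for label, size in sizes.items() if size < k]
        if not small:
            break
        c = small[0]
        # A tree on s vertices has s - 1 edges, so one vertex has no outgoing edge.
        u = next(v for v in range(n) if comp[v] == c and out_edge[v] is None)
        # Nearest vertex outside the component; the component has fewer than
        # k vertices, so this is one of u's k - 1 nearest neighbours.
        v = min((x for x in range(n) if comp[x] != c), key=lambda x: w(u, x))
        out_edge[u] = v
        target = comp[v]
        comp = [target if label == c else label for label in comp]
    return [(u, v) for u, v in enumerate(out_edge) if v is not None]


rows = ["0000", "0001", "0011", "1111", "1110", "1100"]
n, k = len(rows), 3
weights = {(i, j): sum(a != b for a, b in zip(rows[i], rows[j]))
           for i in range(n) for j in range(i + 1, n)}
forest = build_forest(weights, n, k)
print(forest)
```

Each added edge joins two different components, so no cycle is created and the loop stops once every component has at least k vertices.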

Algorithm for general k-Anonymity • The DECOMPOSE-COMPONENT algorithm breaks any component with more than max{ 2k-1, 3k-5 } vertices into components of size at least k.

Algorithm for general k-Anonymity • Theorem 5: There is a polynomial-time algorithm that achieves an approximation ratio of max{ 2k-1, 3k-5 } • Proof: Create a forest with the FOREST algorithm, then repeatedly apply DECOMPOSE-COMPONENT to any component with more than max{ 2k-1, 3k-5 } vertices • Note: Both algorithms terminate in $O(kn^2)$ time.

Algorithm for general k-Anonymity • This algorithm can also be used when attributes are assigned weights and the goal is to minimize the weighted generalization cost • It can also be extended to allow deleting an entire row instead of forcing it to pair with k-1 other rows • - in that setting, the distance between any two vertices is no more than the cost of deleting a vertex

Improved Algorithm for 2-Anonymity

Improved Algorithm for 2-Anonymity • The previous section gives a 3-approximation algorithm for this case (max{ 2k-1, 3k-5 } = 3 for k=2); the authors improve on this with a polynomial-time 1.5-approximation • Use a minimum-weight [1,2]-factor of the graph G, i.e. a spanning subgraph F of G in which every vertex has degree 1 or 2 • $W(F) = \sum_{e \in F} w(e)$ • F is a vertex-disjoint collection of edges and pairs of adjacent edges • Each component of F is treated as a cluster: keep the bits on which all its vectors agree and replace all other bits with *s
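The clustering step can be tried on a tiny binary instance. The sketch below is a brute-force stand-in, not the polynomial-time minimum-weight [1,2]-factor computation: it enumerates every partition of the rows into pairs and triples, which is exponential and only usable for a handful of rows. All function names are illustrative. It also brute-forces the optimal 2-Anonymity cost so the two can be compared.

```python
from itertools import combinations


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def groupings(items):
    """Yield every partition of `items` into groups of size 2 or 3."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for size in (1, 2):                       # number of companions for `first`
        for mates in combinations(rest, size):
            remaining = [i for i in rest if i not in mates]
            for tail in groupings(remaining):
                yield [(first, *mates)] + tail


def factor_weight(rows, group):
    """Cheapest edge (pair) or cheapest two-edge path (triple) on `group`."""
    d = sorted(hamming(rows[a], rows[b]) for a, b in combinations(group, 2))
    return d[0] if len(group) == 2 else d[0] + d[1]


def cluster_cost(rows, group):
    """Number of *s introduced when `group` becomes one cluster."""
    width = len(rows[0])
    differing = sum(len({rows[i][j] for i in group}) > 1 for j in range(width))
    return len(group) * differing


rows = ["0000", "0001", "0111", "1111", "1100", "1010", "0110"]
idx = list(range(len(rows)))
best_factor = min(groupings(idx),
                  key=lambda g: sum(factor_weight(rows, c) for c in g))
c_ofac = sum(factor_weight(rows, c) for c in best_factor)
c_alg = sum(cluster_cost(rows, c) for c in best_factor)
opt = min(sum(cluster_cost(rows, c) for c in g) for g in groupings(idx))
print(best_factor, c_ofac, c_alg, opt)
```

Restricting the optimum to pairs and triples loses nothing, because any larger cluster can be split into pairs and triples without adding *s.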

Improved Algorithm for 2-Anonymity • Theorem 6 • The number of *s introduced by the above algorithm is at most 1.5 times the number of *s in an optimal 2-Anonymity solution • Observation 7 • If vertices $x_1,x_2,x_3$ form a cluster in a k-Anonymity solution, the number of *s in each modified vector is $p + q + r = \tfrac{1}{2}(\alpha + \beta + \gamma)$, where α, β, γ are the pairwise distances between the three vertices and p, q, r are their distances to the median vertex • $x_{med}$ is the median vertex; the number of *s in each modified vector is at least p + q + r • $c_{OFAC}$ is the weight of an optimal [1,2]-factor • $c_{ALG}$ is the cost of the 2-Anonymity solution produced by the algorithm • $c_{FAC}$ is the weight of the [1,2]-factor

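Improved Algorithm for 2-Anonymity • Lemma: $c_{ALG} \le 3 \cdot c_{OFAC}$ • Proof: • - For a cluster of size 2 joined by an edge of weight α, each row gets α *s, so the total 2α is at most 3α • - For a cluster of size 3, the number of *s in each row is (α + β + γ)/2 • - Total number of *s = $\tfrac{3}{2}(\alpha + \beta + \gamma) \le 3(\alpha + \beta)$, by the triangle inequality γ ≤ α + β • - The optimal [1,2]-factor contains the 2 lighter edges of the triangle, at a cost of (α + β) for this cluster • - Summing over all clusters gives $c_{ALG} \le 3 \cdot c_{OFAC}$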

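Improved Algorithm for 2-Anonymity • Lemma: $c_{FAC} \le \tfrac{1}{2} OPT$ • Proof: • - Cluster of size 2: the cost of the [1,2]-factor FAC is ½ of the cost of OPT on that cluster • - Cluster of size 3: cost of FAC = $\alpha + \beta \le \tfrac{2}{3}(\alpha + \beta + \gamma) = \tfrac{4}{3}(p+q+r)$; the inequality follows from γ ≥ α, β • - The cost of OPT on this cluster is 3(p+q+r), so the cost of FAC is at most $\tfrac{4}{9}$ of it, which is below ½ • - Summing over all clusters gives $c_{FAC} \le \tfrac{1}{2} OPT$ • Since $c_{OFAC} \le c_{FAC}$, the two lemmas give $c_{ALG} \le 3 \cdot c_{OFAC} \le \tfrac{3}{2} OPT$, which is Theorem 6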

Improved Algorithm for 3-Anonymity

Improved Algorithm for 3-Anonymity • Produce a polynomial-time 2-approximation • The idea is very similar to the 1.5-approximation above, using a 2-factor (every vertex has degree 2) in place of a [1,2]-factor • Lemma • The cost $c_{OFAC}$ of the optimal 2-factor on the graph G corresponding to the vectors in the 3-Anonymity instance is at most ⅔ OPT: $c_{OFAC} \le \tfrac{2}{3} OPT$ • Proof • - Clusters have 3, 4, or 5 vertices; a cluster with more than 5 vertices can be broken into smaller clusters of at least 3 vertices each • - For every cluster, pick the min-weight cycle involving the vertices of the cluster

Improved Algorithm for 3-Anonymity • Consider the following 3 cases: • - Cluster i has size 3: a triangle is present • - a, b, and c are the lengths of its edges • - the total cost of OPT is $OPT_i = \tfrac{3}{2}(a + b + c)$ • - FAC has a total cost of $c_{FAC,i} = a + b + c = \tfrac{2}{3} OPT_i$ • - Cluster i has size 4: τ = sum of the weights of all $\binom{4}{2} = 6$ edges • - FAC pays $c_{FAC,i} \le \tfrac{2}{3}\tau$ • - $OPT_i \ge 4 \cdot \tfrac{1}{2} \cdot \tfrac{2}{4} \cdot \tau = \tau$ • - $c_{FAC,i} \le \tfrac{2}{3} OPT_i$ • - Cluster i has size 5: τ = sum of the weights of all $\binom{5}{2} = 10$ edges • - similar to size 4, FAC pays $c_{FAC,i} \le \tfrac{5}{10}\tau$ • - $OPT_i \ge 5 \cdot \tfrac{1}{2} \cdot \tfrac{3}{10} \cdot \tau = \tfrac{3}{4}\tau$ • - $c_{FAC,i} \le \tfrac{2}{3} OPT_i$
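The bounds ⅔τ and (5/10)τ on what FAC pays are consistent with an averaging argument: the minimum-weight cycle through all vertices of a cluster weighs no more than the average such cycle. The sketch below, with illustrative names and arbitrary test weights, enumerates the Hamiltonian cycles of the complete graph on 4 and 5 vertices and computes the average cycle weight as a fraction of τ.

```python
from fractions import Fraction
from itertools import combinations, permutations


def hamiltonian_cycles(n):
    """Each Hamiltonian cycle of K_n exactly once, as a vertex tuple starting at 0."""
    for perm in permutations(range(1, n)):
        if perm[0] < perm[-1]:            # skip the reversed copy of each cycle
            yield (0, *perm)


def cycle_weight(cycle, w):
    return sum(w[frozenset((cycle[i], cycle[(i + 1) % len(cycle)]))]
               for i in range(len(cycle)))


def average_cycle_over_tau(n, w):
    """Average Hamiltonian-cycle weight divided by tau, the total edge weight."""
    cycles = list(hamiltonian_cycles(n))
    tau = sum(w.values())
    return Fraction(sum(cycle_weight(c, w) for c in cycles), len(cycles) * tau)


for n in (4, 5):
    # Arbitrary positive integer weights; the ratio does not depend on them
    # because every edge lies on the same number of Hamiltonian cycles.
    w = {frozenset(e): 1 + i * i for i, e in enumerate(combinations(range(n), 2))}
    print(n, average_cycle_over_tau(n, w))
```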

Improved Algorithm for 3-Anonymity • Lemma: Given a 2-factor F with cost $c_F$, we obtain a solution for 3-Anonymity with cost $c_{ALG} \le 3 \cdot c_F$ • Proof • - Every cycle in F of size 3, 4, or 5 becomes a cluster • - Depending on the size of the cycle C, ALG pays as follows • - For a triangle, ALG pays $3 \cdot \tfrac{1}{2}\,\mathrm{len}(C) \le 3 \cdot \mathrm{len}(C)$ • - For a 4-cycle, ALG pays at most $4 \cdot \tfrac{1}{2}\,\mathrm{len}(C) \le 3 \cdot \mathrm{len}(C)$ • - For a 5-cycle, ALG pays at most $5 \cdot \tfrac{1}{2}\,\mathrm{len}(C) \le 3 \cdot \mathrm{len}(C)$ • - For a (3x+1)-cycle, ALG pays at most $\frac{6(x-1)+12}{3x+1} \cdot \mathrm{len}(C) \le 3 \cdot \mathrm{len}(C)$ • - For a (3x+2)-cycle with x ≥ 2, ALG pays at most $\frac{6(x-2)+24}{3x+2} \cdot \mathrm{len}(C) \le 3 \cdot \mathrm{len}(C)$
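The two fractional bounds are easy to sanity-check numerically. This sketch only verifies the arithmetic of the stated ratios with exact fractions; the function names are illustrative, and the (3x+2) case is checked from x = 2 because a 5-cycle is handled as its own case.

```python
from fractions import Fraction


def ratio_3x_plus_1(x):
    """Coefficient of len(C) for a (3x+1)-cycle."""
    return Fraction(6 * (x - 1) + 12, 3 * x + 1)


def ratio_3x_plus_2(x):
    """Coefficient of len(C) for a (3x+2)-cycle."""
    return Fraction(6 * (x - 2) + 24, 3 * x + 2)


# Cycles of size 3, 4, 5 become clusters directly: size * 1/2 * len(C).
small_ok = all(Fraction(size, 2) <= 3 for size in (3, 4, 5))
long_ok_1 = all(ratio_3x_plus_1(x) <= 3 for x in range(1, 1000))
long_ok_2 = all(ratio_3x_plus_2(x) <= 3 for x in range(2, 1000))
print(small_ok, long_ok_1, long_ok_2)
```

Algebraically, (6x+6)/(3x+1) ≤ 3 holds exactly when x ≥ 1, and (6x+12)/(3x+2) ≤ 3 holds exactly when x ≥ 2.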

Conclusion • Demonstrated that k-Anonymity is NP-hard even with ternary attribute values and when only suppression is allowed • Gave an O(k)-approximation algorithm for arbitrary k and alphabet size • Showed improved approximations for k=2 (factor 1.5) and k=3 (factor 2) • It is not possible to achieve an approximation factor better than k/4 using the graph representation • It would be interesting to determine the hardness of approximation for k-Anonymity without using the graph representation • It would be useful to extend the k-Anonymity framework to handle inserts, deletes, and updates to the database.
