Approximation Algorithms for k-Anonymity

Approximation Algorithms for k-Anonymity. Authors: Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, An Zhu. Presented by Paul Yelton. Topics: Review of k-Anonymity; NP-hardness of k-Anonymity with Suppression; Algorithm for general k-Anonymity


Presentation Transcript


  1. Approximation Algorithms for k-Anonymity • Authors: Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, An Zhu • Presented by Paul Yelton

  2. Outline • Review of k-Anonymity • NP-hardness of k-Anonymity with Suppression • Algorithm for general k-Anonymity • Improved Algorithm for 2-Anonymity • Improved Algorithm for 3-Anonymity • Conclusion

  3. Review of k-Anonymity

  4. Review of k-Anonymity • Suppress/Generalize some entries to ensure that “... there are at least k-1 other tuples in the modified table that are identical to it along the quasi-identifying attributes” • Objective is to minimize the extent of suppression and generalization.

  5. Review of k-Anonymity • Example of the previous table anonymized with k=2 • Values are suppressed with *s to provide anonymization • k must be chosen according to the application to ensure the required level of privacy.

  6. Review of k-Anonymity • The input is a table T with n rows and m quasi-identifying attributes • T is treated as a set of n m-dimensional vectors x_1,...,x_n • The focus is on a special case, k-Anonymity with Suppression, in which only suppression is performed • A k-Anonymous suppression function t maps each x_i to t(x_i) by replacing some of its quasi-identifier values with *s, so that every t(x_i) is identical to at least k-1 other modified rows • t therefore partitions the n row vectors into clusters of size ≥ k

  7. Review of k-Anonymity • Formal Definition: k-Anonymity with Suppression • Given: x_1,x_2,...,x_n ∈ Σ^m and an anonymity parameter k, find a k-Anonymous suppression function t, i.e. one where for every i, t(x_i) = t(x_q) for at least k-1 indices q ≠ i • Objective: minimize the cost c(t) • c(t) is the total number of *s over all modified rows t(x_i)
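
The definition above can be checked mechanically on a small table. The following is a minimal sketch, assuming rows are stored as Python lists and suppressed entries as the string "*"; the function names and the example table are illustrative and not taken from the slides.

```python
from collections import Counter

def suppression_cost(table):
    """Total number of '*' entries in the table, i.e. the cost c(t)."""
    return sum(row.count("*") for row in table)

def is_k_anonymous(table, k):
    """True if every row is identical to at least k-1 other rows."""
    counts = Counter(tuple(row) for row in table)
    return all(c >= k for c in counts.values())

# Hypothetical suppressed table with two clusters of two rows each.
suppressed = [
    ["*", "1", "0"],
    ["*", "1", "0"],
    ["0", "*", "*"],
    ["0", "*", "*"],
]
print(is_k_anonymous(suppressed, 2), suppression_cost(suppressed))
```

The table is 2-Anonymous but not 3-Anonymous, since each distinct modified row occurs exactly twice.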

  8. Review of k-Anonymity • The next problem is k-Anonymity with Generalization, in which suppression is also allowed • Example: an attribute named “Quality”.

  9. Review of k-Anonymity • Formal Definition: k-Anonymity with Generalization • Given: x_1,x_2,...,x_n ∈ Σ^m and k, find a generalization function h that assigns every entry (i,j) a level h(i,j) ≤ l_j in the generalization hierarchy of attribute j • The generalization hierarchy of attribute j consists of the domains D_j^h with 1 ≤ h ≤ l_j, and D_j^0 = D_j • h is k-Anonymous if, for every i, h(x_i) = h(x_q) for at least k-1 indices q ≠ i • Cost: c(h) = Σ_i Σ_j h(i,j)/l_j • Note: k-Anonymity with Suppression is the special case where l_j = 1 for all j
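
As a worked instance of the cost formula c(h) = Σ_i Σ_j h(i,j)/l_j, here is a small sketch. It assumes the generalization levels are given as a list of rows (`levels[i][j]` = h(i,j)) and the hierarchy depths as a list `max_levels[j]` = l_j; both names and the numbers are hypothetical.

```python
from fractions import Fraction

def generalization_cost(levels, max_levels):
    """c(h): sum over all entries (i, j) of h(i, j) / l_j, kept exact."""
    return sum(
        Fraction(h_ij, l_j)
        for row in levels
        for h_ij, l_j in zip(row, max_levels)
    )

# Two rows, two attributes; attribute 0 has l_0 = 2 levels, attribute 1 has l_1 = 1.
levels = [[0, 1],
          [2, 1]]
max_levels = [2, 1]
print(generalization_cost(levels, max_levels))  # 0/2 + 1/1 + 2/2 + 1/1
```

With l_j = 1 for every attribute, each level is either 0 (kept) or 1 (suppressed), so the same function counts *s.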

  10. NP-hardness of k-Anonymity with Suppression

  11. NP-hardness of k-Anonymity with Suppression • Present the proof in two steps: • Equivalence to edge partition into triangles and 4-stars • Equivalence to edge partition into triangles

  12. NP-hardness of k-Anonymity with Suppression • Theorem 1: k-Anonymity with Suppression is NP-hard even for a ternary alphabet, for example Σ = {0,1,2} • Proof: by reduction from the edge-partition-into-triangles problem: given a graph G = (V,E) with |E| = 3m for an integer m, can the edges of G be partitioned into m edge-disjoint triangles?

  13. NP-hardness of k-Anonymity with Suppression • Construct a table T with 3m rows, one for each edge of G • For each of the n vertices y_1,...,y_n of G create an attribute/column • The row for an edge e = (a,b) has 1s in columns a and b and 0s elsewhere • The optimal 3-Anonymity cost for T is ≤ 9m if and only if E can be partitioned into m disjoint triangles and 4-stars • Example of T:

                             y_1  y_2  y_3  ...  y_n
      e_1    = (y_1,y_2)      1    1    0   ...   0
      e_2    = (y_2,y_3)      0    1    1   ...   0
      ...                    ...  ...  ...  ...  ...
      e_{3m} = (y_1,y_n)      1    0    0   ...   1
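
The construction of T from G can be sketched in a few lines. This is an illustration only: vertices are assumed to be numbered from 0, and `edge_table` is a hypothetical helper name.

```python
def edge_table(num_vertices, edges):
    """One row per edge, one column per vertex: 1 at both endpoints, 0 elsewhere."""
    table = []
    for a, b in edges:
        row = [0] * num_vertices
        row[a] = 1
        row[b] = 1
        table.append(row)
    return table

# A triangle on vertices 0, 1, 2 inside a 4-vertex graph (m = 1, so 3 rows).
T = edge_table(4, [(0, 1), (1, 2), (2, 0)])
for row in T:
    print(row)
```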

  14. NP-hardness of k-Anonymity with Suppression • Consider a triangle with vertices y_1, y_2, y_3. Applying suppression to the columns of those vertices gives a cluster of 3 rows with three *s in each modified row.

                        before               after
                    y_1  y_2  y_3       y_1  y_2  y_3
      (y_1,y_2)      1    1    0         *    *    *
      (y_2,y_3)      0    1    1         *    *    *
      (y_3,y_1)      1    0    1         *    *    *

  15. NP-hardness of k-Anonymity with Suppression • Consider a 4-star with vertices y_1, y_2, y_3, y_4, where y_4 is the center vertex. Applying suppression to the columns of y_1, y_2, y_3 gives a cluster of 3 rows with three *s in each modified row.

                           before                    after
                    y_1  y_2  y_3  y_4       y_1  y_2  y_3  y_4
      (y_1,y_4)      1    0    0    1         *    *    *    1
      (y_2,y_4)      0    1    0    1         *    *    *    1
      (y_3,y_4)      0    0    1    1         *    *    *    1
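
The triangle and 4-star clusters can be reproduced with a short sketch that suppresses every column on which the rows of a cluster disagree. The helper name `suppress_cluster` is an assumption made for illustration; a fourth all-zero column is added to the triangle rows to show that columns outside the triangle stay 0.

```python
def suppress_cluster(rows):
    """Replace with '*' every column on which the rows of the cluster disagree."""
    agree = [len(set(column)) == 1 for column in zip(*rows)]
    return [[v if ok else "*" for v, ok in zip(row, agree)] for row in rows]

triangle = [[1, 1, 0, 0],   # (y_1, y_2)
            [0, 1, 1, 0],   # (y_2, y_3)
            [1, 0, 1, 0]]   # (y_3, y_1)
star = [[1, 0, 0, 1],       # (y_1, y_4)
        [0, 1, 0, 1],       # (y_2, y_4)
        [0, 0, 1, 1]]       # (y_3, y_4)

print(suppress_cluster(triangle)[0])
print(suppress_cluster(star)[0])
```

Both clusters pay three *s per row; they differ only in the surviving entry (0 for the triangle, 1 at the star's center).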

  16. NP-hardness of k-Anonymity with Suppression • From the above constructions, a partition into triangles and 4-stars gives a solution of cost 9m • Because G is a simple graph, any three rows are distinct and differ in at least three positions • So there are at least three *s in each modified row, and the cost is ≥ 9m • A solution of cost 9m therefore has exactly three *s in every row and clusters of exactly 3 rows, which leaves two possibilities: • - The edges of a cluster form either a triangle or a 4-star • - Modified rows in a triangle have three *s and 0's elsewhere, while modified rows in a 4-star have three *s, a single 1 and 0's elsewhere • Such a solution corresponds to a partition of the edges of the graph into triangles and 4-stars.

  17. NP-hardness of k-Anonymity with Suppression • Equivalence to edge partition into triangles • Create a table T' as a replication of T so that 4-stars are forced to pay more *s • Number of blocks: t = ⌈log2(3m + 1)⌉ • Each row of T' has t blocks, each with n columns • As before, each row corresponds to an edge e = (a,b) • Fix an arbitrary ordering of the edges in E and write the rank of e in binary notation as e_1 e_2 ... e_t • Every block in the row for e has 0's in all positions except a and b • A block can be in 1 of 2 configurations: • - conf0 has a 1 in position a, and a 2 in position b • - conf1 has a 2 in position a, and a 1 in position b • The i-th block in the row for e is in configuration conf0 if e_i = 0 and conf1 if e_i = 1
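
The row layout of T' can be sketched as follows. The sketch assumes that block i is in conf0 when bit e_i of the edge's rank is 0 and in conf1 when it is 1, that ranks start at 1, that the bits are read most significant first, and that vertices are numbered from 0. The names `num_blocks` and `replicated_row` are hypothetical.

```python
def num_blocks(m):
    """t = ceil(log2(3m + 1)), computed with integers: (3m).bit_length()."""
    return (3 * m).bit_length()

def replicated_row(n, a, b, rank, t):
    """Row of T' for edge (a, b) whose rank in the edge ordering is `rank`."""
    bits = format(rank, "b").zfill(t)      # e_1 e_2 ... e_t
    row = []
    for bit in bits:
        block = [0] * n
        if bit == "0":
            block[a], block[b] = 1, 2      # conf0
        else:
            block[a], block[b] = 2, 1      # conf1
        row.extend(block)
    return row

# Edge (0, 3) with rank 2 in a 4-vertex graph, using t = 3 blocks (rank 2 = 010).
print(replicated_row(4, 0, 3, 2, 3))
```

Because distinct edges have distinct ranks, any two rows sharing an endpoint disagree on the configuration of at least one block.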

  18. NP-hardness of k-Anonymity with Suppression • [Figure: example table T' for a graph on vertices 1, 2, 3, 4 with edges (3,4), (1,4), (1,2), (1,3), (2,3). Each row consists of Block 1, Block 2 and Block 3, each with columns 1 2 3 4; the entries are 1 and 2 in the two endpoint columns of the edge and 0 elsewhere.]

  19. NP-hardness of k-Anonymity with Suppression • The optimal cost of T' is ≤ 9mt if and only if E can be partitioned into m disjoint triangles • Every triangle in such a partition corresponds to a cluster with 3t *s in each modified row • This proves that k-Anonymity with Suppression is NP-hard with a ternary alphabet for k=3 • The proof can be extended to k = C(r,2) (r choose 2) with r ≥ 3 • Replicating the graph in the reduction also allows an extension to k = α·C(r,2) for any integer α and r ≥ 3

  20. Algorithm for general k-Anonymity

  21. Algorithm for general k-Anonymity • Show an O(k)-approximation for the problem of k-Anonymity with Generalization • Create a graph • - edge-weighted complete graph G = (V,E) • - vertex set V contains one vertex for each vector of the k-Anonymity instance • - h_{a,b}(j) is the lowest level of generalization for attribute j at which vectors a and b have the same generalized value • - weight function w(e) = Σ_j h_{a,b}(j)/l_j • Recall: attribute j has l_j levels of generalization, and e refers to an edge e = (a,b)
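
In the suppression-only special case (l_j = 1 for every attribute), h_{a,b}(j) is 0 when vectors a and b agree on attribute j and 1 otherwise, so w(e) reduces to the Hamming distance. The sketch below builds the edge weights of the complete graph under that assumption; the function names and sample vectors are illustrative.

```python
from itertools import combinations

def hamming(u, v):
    """Number of attributes on which u and v differ (w(e) when every l_j = 1)."""
    return sum(1 for x, y in zip(u, v) if x != y)

def complete_graph_weights(vectors):
    """Map each unordered vertex pair (a, b), a < b, to its edge weight."""
    return {
        (a, b): hamming(vectors[a], vectors[b])
        for a, b in combinations(range(len(vectors)), 2)
    }

vectors = ["0000", "0001", "1110", "1111"]
weights = complete_graph_weights(vectors)
print(weights[(0, 1)], weights[(0, 3)])
```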

  22. Algorithm for general k-Anonymity • Limitations of the Graph Representation • - Some information about the structure of the problem is lost. • - Cannot achieve better than a Θ(k) approximation factor • Charge of a vertex: the total generalization cost of the vector it represents • OPT denotes the cost of an optimal k-Anonymity solution; for example, in the previous reduction OPT = 9m when the edge partition exists • F = { T_1,T_2,...,T_s } is a spanning forest containing all the vertices • Each T_i is a tree with ≥ k vertices, which is a subgraph of G • Weight of T_i is W(T_i) = Σ_{e∈E(T_i)} w(e) • c(F) = Σ_i |V(T_i)|·W(T_i) • L is the size of the largest component
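
The quantity c(F) = Σ_i |V(T_i)|·W(T_i) can be evaluated directly. The following sketch assumes each tree is given as a list of edges (vertex pairs) and the weights as a dictionary keyed by those pairs; the data layout and numbers are hypothetical.

```python
def forest_cost(trees, w):
    """c(F): for each tree, (number of vertices) * (sum of its edge weights)."""
    total = 0
    for edges in trees:
        vertices = {v for edge in edges for v in edge}
        total += len(vertices) * sum(w[edge] for edge in edges)
    return total

w = {(0, 1): 1, (1, 2): 2, (3, 4): 3}
trees = [[(0, 1), (1, 2)],   # tree on vertices {0, 1, 2}, weight 3
         [(3, 4)]]           # tree on vertices {3, 4}, weight 3
print(forest_cost(trees, w))  # 3*3 + 2*3
```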

  23. Algorithm for general k-Anonymity • Outline of the algorithm: 1. Create a forest F with cost ≤ OPT in which every tree has at least k vertices 2. Compute a decomposition of the forest (deleting edges is allowed) so that every component T_i has k ≤ |V(T_i)| ≤ max{ 2k–1, 3k–5 } vertices

  24. Algorithm for general k-Anonymity • This algorithm (FOREST) yields a forest F in which every tree has at least k vertices and whose cost is ≤ OPT
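
The transcript does not reproduce the algorithm itself, so the following is only a rough sketch of one greedy way to grow a forest in which every tree has at least k vertices: repeatedly take a component that is still too small and attach it by its cheapest outgoing edge. This merging rule, the Hamming-distance weights, and the name `grow_forest` are assumptions for illustration; the sketch makes no claim about the OPT bound stated on the slide.

```python
from collections import Counter

def grow_forest(vectors, k):
    """Greedy sketch: merge small components until all have >= k vertices."""
    n = len(vectors)
    if n < k:
        raise ValueError("need at least k vectors")

    def w(a, b):
        # Hamming distance, i.e. the edge weight in the suppression-only case.
        return sum(1 for x, y in zip(vectors[a], vectors[b]) if x != y)

    comp = list(range(n))      # comp[v] = label of the component containing v
    edges = []
    while True:
        sizes = Counter(comp)
        small = [c for c in sizes if sizes[c] < k]
        if not small:
            return edges, comp
        c = small[0]
        # Cheapest edge from a vertex inside c to a vertex outside c.
        a, b = min(
            ((u, v) for u in range(n) if comp[u] == c
                    for v in range(n) if comp[v] != c),
            key=lambda e: w(*e),
        )
        edges.append((a, b))
        target = comp[b]
        comp = [target if label == c else label for label in comp]

edges, comp = grow_forest(["0000", "0001", "1110", "1111", "0011"], 2)
print(edges, comp)
```

Each added edge joins two different components, so the chosen edges never form a cycle and the result is a forest.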

  25. Algorithm for general k-Anonymity • The algorithm below (DECOMPOSE-COMPONENT) breaks any component with more than max{ 2k-1, 3k-5 } vertices into components of size at least k.

  26. Algorithm for general k-Anonymity

  27. Algorithm for general k-Anonymity • Theorem 5: There is a polynomial-time algorithm that achieves an approximation ratio of max{ 2k–1, 3k–5 } • Proof: Create a forest with the above FOREST algorithm, then repeatedly apply DECOMPOSE-COMPONENT to any component with more than max{ 2k–1, 3k–5 } vertices • Note: Both algorithms terminate in O(kn^2) time.

  28. Algorithm for general k-Anonymity • This algorithm can also be used when attributes are assigned weights and the goal is to minimize the weighted generalization cost • It can also be extended to allow deleting an entire row instead of forcing it to pair with k-1 other rows • - the distance between any two vertices is then no more than the cost of deleting a vertex

  29. Improved Algorithm for 2-Anonymity

  30. Improved Algorithm for 2-Anonymity • The previous section gives a 3-approximation algorithm for this case (max{ 2k–1, 3k–5 } = 3 for k = 2); the authors improve on this result with a polynomial-time 1.5-approximation • Use a minimum-weight [1,2]-factor of the graph G: a spanning subgraph F in which every vertex has degree 1 or 2 • W(F) = Σ_{e∈F} w(e) • F is a vertex-disjoint collection of edges and pairs of adjacent edges • Each component of F is treated as a cluster: keep the bits on which all vectors of the cluster agree and replace all other bits with *s

  31. Improved Algorithm for 2-Anonymity • Theorem 6: The number of *s introduced by the above algorithm is at most 1.5 times the number of *s in an optimal 2-Anonymity solution • Observation 7: Let x_1,x_2,x_3 be vertices with pairwise distances α, β, γ, let x_med be their median vector, and let p, q, r be the numbers of bits in which x_1, x_2, x_3 differ from x_med. Then p + q + r = ½(α + β + γ), and if x_1,x_2,x_3 lie in one cluster of a k-Anonymity solution, the number of *s in each modified vector is at least p + q + r • c_OFAC is the weight of an optimal [1,2]-factor • c_ALG is the cost of the 2-Anonymity solution produced by the algorithm • c_FAC is the weight of the [1,2]-factor FAC built from the clusters of an optimal solution
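
The identity in Observation 7 can be checked numerically for binary vectors: a position on which three bit vectors do not all agree has exactly one odd vector out, so it contributes to exactly two of the three pairwise distances. The sketch below assumes a binary alphabet and uses hypothetical helper names.

```python
import random

def hamming(u, v):
    """Number of positions on which u and v differ."""
    return sum(1 for x, y in zip(u, v) if x != y)

def stars_per_row(x1, x2, x3):
    """Positions that must be suppressed to make the three vectors identical."""
    return sum(1 for a, b, c in zip(x1, x2, x3) if not (a == b == c))

random.seed(0)
for _ in range(1000):
    x1, x2, x3 = ([random.randint(0, 1) for _ in range(8)] for _ in range(3))
    alpha, beta, gamma = hamming(x1, x2), hamming(x2, x3), hamming(x1, x3)
    # Number of *s per row equals half the sum of the pairwise distances.
    assert 2 * stars_per_row(x1, x2, x3) == alpha + beta + gamma
print("identity holds on all sampled triples")
```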

  32. Improved Algorithm for 2-Anonymity • Lemma: c_ALG ≤ 3 ∙ c_OFAC • Proof: • - For a cluster of size 3 with edge weights α, β, γ, the number of *s in each row is (α + β + γ)/2 • - Total number of *s = 3/2(α + β + γ) ≤ 3(α + β), by the triangle inequality γ ≤ α + β • - The optimal [1,2]-factor contains the 2 lighter edges of the triangle, at a cost of (α + β) for this cluster • - Summing over all clusters: c_ALG ≤ 3 ∙ c_OFAC

  33. Improved Algorithm for 2-Anonymity • Lemma: c_FAC ≤ ½OPT • Proof: • - Cluster of size 2: the cost of the [1,2]-factor FAC is ½ of what OPT pays for the cluster • - Cluster of size 3: the cost of FAC is α + β ≤ ⅔(α + β + γ) = 4/3(p+q+r), where the inequality follows from γ ≥ α,β • - OPT pays 3(p+q+r) for this cluster, so the cost of FAC is at most ½ of that (4/3 ≤ 3/2) • - Summing over all clusters: c_FAC ≤ ½OPT

  34. Improved Algorithm for 3-Anonymity

  35. Improved Algorithm for 3-Anonymity • Produce a polynomial-time 2-approximation • The idea is very similar to the 1.5-approximation algorithm above • Lemma: The cost c_OFAC of the optimal 2-factor on the graph G corresponding to the vectors of the 3-Anonymity instance is at most ⅔OPT, i.e. c_OFAC ≤ ⅔OPT • Proof: • - Clusters of the optimal solution have 3, 4 or 5 vertices; any cluster with more than 5 vertices can be broken into smaller clusters of at least 3 vertices • - For every cluster pick the min-weight cycle through the vertices of the cluster

  36. Improved Algorithm for 3-Anonymity • Consider the following 3 cases: • - Cluster i has size 3: the cycle is a triangle • - a, b and c are the lengths of its edges • - the total cost of OPT is OPT_i = 3/2(a + b + c) • - FAC has a total cost of c_FAC,i = a + b + c = ⅔OPT_i • - Cluster i has size 4: let τ be the sum of the weights of all C(4,2) = 6 edges • - FAC pays c_FAC,i ≤ ⅔τ • - OPT_i ≥ 4 ∙ ½ ∙ 2/4 ∙ τ = τ • - c_FAC,i ≤ ⅔OPT_i • - Cluster i has size 5: let τ be the sum of the weights of all C(5,2) = 10 edges • - as in the size-4 case, FAC pays c_FAC,i ≤ 5/10 τ • - OPT_i ≥ 5 ∙ ½ ∙ 3/10 ∙ τ = ¾τ • - c_FAC,i ≤ ⅔OPT_i
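
The two bounds on what FAC pays (at most ⅔τ on 4 vertices and at most 5/10 τ on 5 vertices) follow from averaging over all cycles through the cluster, and they can be checked by brute force. The sketch below tries random non-negative integer weights and compares the cheapest cycle through all vertices with τ; the helper names are hypothetical and comparisons are done in integers to avoid rounding.

```python
import random
from itertools import combinations, permutations

def min_cycle(n, w):
    """Weight of the cheapest cycle visiting all n vertices (brute force)."""
    best = None
    for perm in permutations(range(1, n)):
        order = (0,) + perm
        length = sum(w[frozenset((order[i], order[(i + 1) % n]))] for i in range(n))
        if best is None or length < best:
            best = length
    return best

def check(n, num, den, trials=200):
    """True if min cycle <= (num/den) * tau held on every random trial."""
    for _ in range(trials):
        w = {frozenset(e): random.randint(0, 20) for e in combinations(range(n), 2)}
        tau = sum(w.values())
        if den * min_cycle(n, w) > num * tau:
            return False
    return True

random.seed(1)
print(check(4, 2, 3), check(5, 1, 2))
```

On 4 vertices there are 3 such cycles and every edge lies in 2 of them, so the average cycle weighs ⅔τ; on 5 vertices there are 12 cycles and every edge lies in 6, so the average is ½τ. The minimum cannot exceed the average.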

  37. Improved Algorithm for 3-Anonymity • Lemma: Given a 2-factor F with cost c_F, we obtain a solution for 3-Anonymity with cost c_ALG ≤ 3 ∙ c_F • Proof: • - Every cycle in F of size 3, 4 or 5 becomes a cluster; longer cycles are broken into smaller clusters • - Depending on the size of the cycle C, ALG pays as follows: • - For a triangle, ALG pays 3∙½∙len(C) ≤ 3∙len(C) • - For a 4-cycle, ALG pays at most 4∙½∙len(C) ≤ 3∙len(C) • - For a 5-cycle, ALG pays at most 5∙½∙len(C) ≤ 3∙len(C) • - For a (3x+1)-cycle, ALG pays at most (6(x-1)+12)/(3x+1) ∙ len(C) ≤ 3∙len(C) • - For a (3x+2)-cycle, ALG pays at most (6(x-2)+24)/(3x+2) ∙ len(C) ≤ 3∙len(C)
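
The last two bounds are plain arithmetic and can be checked exactly. The sketch assumes the slide's flattened expressions are the fractions (6(x-1)+12)/(3x+1) and (6(x-2)+24)/(3x+2), and that the (3x+2) case starts at x = 2 because the 5-cycle is treated separately; the function names are hypothetical.

```python
from fractions import Fraction

def factor_3x_plus_1(x):
    """Multiplier of len(C) paid on a cycle with 3x+1 vertices."""
    return Fraction(6 * (x - 1) + 12, 3 * x + 1)

def factor_3x_plus_2(x):
    """Multiplier of len(C) paid on a cycle with 3x+2 vertices."""
    return Fraction(6 * (x - 2) + 24, 3 * x + 2)

# 6x + 6 <= 9x + 3 holds for x >= 1, and 6x + 12 <= 9x + 6 holds for x >= 2.
assert all(factor_3x_plus_1(x) <= 3 for x in range(1, 1000))
assert all(factor_3x_plus_2(x) <= 3 for x in range(2, 1000))
print(factor_3x_plus_1(1), factor_3x_plus_2(2), float(factor_3x_plus_1(999)))
```

Both multipliers equal 3 at the smallest admissible x and decrease toward 2 as the cycle grows.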

  38. Conclusion • Demonstrated that k-Anonymity is NP-hard even when attribute values are ternary and only suppression is allowed • Gave an O(k)-approximation algorithm for arbitrary k and alphabet size • Showed improved approximations for k=2 (factor 1.5) and k=3 (factor 2) • It is not possible to achieve an approximation factor better than k/4 using the graph representation • It would be interesting to study the hardness of approximation for k-Anonymity without the graph representation • It would be useful to extend the k-Anonymity framework to handle inserts, deletes and updates to the database.
