
Privacy protection in published data using an efficient clustering method



  1. Privacy protection in published data using an efficient clustering method. ICT 2010 presentation. Presented by: Md. Manzoor Murshed. Thursday, August 14, 2014.

  2. Overview of the Presentation • Introduction • Re-identification of Data • k-anonymity Model • MOKA Algorithm • Experimental Results • Future Work • Conclusion • Questions

  3. An Abundance of Data • Supermarket scanners • Credit card transactions • Call center records • ATMs • Web server logs • Customer web site trails • Podcasts • Blogs • Closed captions • Scientific experiments • Sensors, cameras • Hospital visits • Social networks: Facebook, Myspace, Twitter • Speech-to-text translation • Email • Educational institutions • Travel records. Print, film, optical, and magnetic storage produced 5 exabytes (EB) of new information in 2002, roughly double the amount of three years earlier [How Much Information? 2003, UC Berkeley].

  4. Data Holders Publish Sensitive Information to Facilitate Research. They publish information that: • discloses as much statistical information as possible, so that valid, novel, potentially useful, and ultimately understandable patterns can be discovered in the data; • preserves the privacy of the individuals contributing the data.

  5. How do you publicly release a database without compromising individual privacy? The wrong approach: just leave out unique identifiers such as name and SSN and hope that this works. Why is this wrong? The triple (DOB, gender, ZIP code) suffices to uniquely identify at least 87% of US citizens in publicly available databases (1990 U.S. Census summary data). Moral: any real privacy guarantee must be proved and established mathematically.

  6. Examples of Re-identification Attempts

  7. AOL Data Release • AOL “anonymously” released a list of 21 million web search queries. • UserIDs were replaced by random numbers.

  8. A Face Is Exposed for AOL Searcher No. 4417749 [New York Times, August 9, 2006] … No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men” to “dog that urinates on everything.” And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.” It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. “Those are my searches,” she said, after a reporter read part of the list to her. …

  9. Re-identification of AOL data release Ms. Arnold says she loves online research, but the disclosure of her searches has left her disillusioned. In response, she plans to drop her AOL subscription. “We all have a right to privacy,” she said. “Nobody should have found this all out.” Source: http://data.aolsearchlogs.com

  10. Re-identification by linking • NAHDO reported that 37 states have legislative mandates to collect hospital-level data. • GIC (the Massachusetts Group Insurance Commission) is responsible for purchasing health insurance for state employees. • GIC's medical data was considered anonymous, since identifying attributes had been removed. • Yet the Governor of Massachusetts was uniquely identified by the attributes {ZIP, birth date, sex}, and hence his private medical records were out in the open.

  11. Re-identification by linking (Example) [Figure: a “de-identified” hospital patient table joined with a public voter registration list on their shared attributes.] • Andre has heart disease!
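The linking step itself is a simple join. Below is a minimal sketch in Python, with made-up records and hypothetical column names (zip, dob, sex), of how a "de-identified" patient table can be linked to a public voter list on the shared quasi-identifiers:

import pandas as pd

# Made-up tables for illustration only.
medical = pd.DataFrame({
    "zip": ["53715", "53703"],
    "dob": ["1965-02-13", "1972-07-30"],
    "sex": ["M", "F"],
    "diagnosis": ["heart disease", "flu"],
})
voters = pd.DataFrame({
    "name": ["Andre", "Beth"],
    "zip": ["53715", "53703"],
    "dob": ["1965-02-13", "1972-07-30"],
    "sex": ["M", "F"],
})

# The join re-attaches names to diagnoses even though the medical
# table contains no direct identifiers.
linked = medical.merge(voters, on=["zip", "dob", "sex"])
print(linked[["name", "diagnosis"]])   # Andre | heart disease, ...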

  12. Data Publishing and Data Privacy • Society is experiencing exponential growth in the number and variety of data collections containing person-specific information. • The collected information is valuable in both research and business, and data sharing is common. • Publishing the data, however, may put the respondents' privacy at risk. • Objective: maximize data utility while limiting disclosure risk to an acceptable level.

  13. What is Privacy? “The claim of individuals, groups, or institutions to determine for themselves when, how and to what extent information about them is communicated to others” (Westin, Privacy and Freedom, 1967). But we need quantifiable notions of privacy: “... nothing about an individual should be learnable from the database that cannot be learned without access to the database ...” (T. Dalenius, 1977).

  14. Quality versus anonymity

  15. Related Work • Statistical databases: add noise while maintaining some statistical invariant. Disadvantage: the noise can destroy the integrity of individual records. • Multi-level databases: data is stored at different security classifications and users hold different security clearances; the release of lower-classified information is restricted to eliminate precise inference. Disadvantages: it is impossible to anticipate every possible attack, and suppression can drastically reduce the quality of the data.

  16. k-Anonymity Sweeney proposed a formal protection model named k-anonymity. • What is k-anonymity? A release satisfies it if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release. • Example: suppose you try to identify a man in a release, but the only information you have is his birth date and gender. If at least k people in the release share that birth date and gender, he is hidden among them; this is k-anonymity.

  17. Example of suppression and generalization [tables on the slide: an original database and its 2-anonymized version]. In the anonymized version, rows 1 and 3 are identical, rows 2 and 4 are identical, and rows 4 and 5 are identical. • Suppression can replace individual attribute values with a *. • Generalization replaces individual attribute values with a broader category. A sketch of both operations follows.
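A minimal sketch of the two operations, using hypothetical attribute names (zip, age) rather than the slide's own table:

def suppress(value):
    # Suppression: replace an attribute value with "*".
    return "*"

def generalize_zip(zip_code, keep=3):
    # Generalization: keep a prefix, blur the rest ("53715" -> "537**").
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def generalize_age(age, width=10):
    # Generalization: map an exact age to a broader range (34 -> "30-39").
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

print(suppress("M"), generalize_zip("53715"), generalize_age(34))
# -> * 537** 30-39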

  18. k-Anonymity Protection Model Definition 1 (Quasi-identifier): a set of non-sensitive attributes {Q1, . . . , Qw} of a table that can be linked with external data to uniquely identify at least one individual in the general population. Definition 2 (k-anonymity requirement): each release of data must be such that every combination of quasi-identifier values can be indistinctly matched to at least k respondents. Definition 3 (k-anonymity): a table T satisfies k-anonymity if for every tuple t in T there exist k-1 other tuples t_i1, t_i2, . . . , t_ik-1 in T such that t[C] = t_i1[C] = t_i2[C] = · · · = t_ik-1[C] for every attribute C in the quasi-identifier QI.
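Definition 3 reduces to a frequency count over quasi-identifier combinations, so it is easy to check mechanically. A minimal sketch with a made-up table:

from collections import Counter

def is_k_anonymous(table, quasi_identifiers, k):
    # Count how often each combination of quasi-identifier values occurs;
    # the table is k-anonymous iff every combination occurs at least k times.
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in table)
    return all(c >= k for c in counts.values())

rows = [
    {"zip": "537**", "age": "30-39", "disease": "flu"},
    {"zip": "537**", "age": "30-39", "disease": "cancer"},
    {"zip": "906**", "age": "40-49", "disease": "flu"},
    {"zip": "906**", "age": "40-49", "disease": "flu"},
]
print(is_k_anonymous(rows, ["zip", "age"], k=2))   # True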

  19. Metrics used for the Algorithm [Figure: a generalization hierarchy for a categorical attribute, with Person at the root and Asian / Non-Asian as children.] The table has m numeric quasi-identifiers N1, N2, … Nm and q categorical quasi-identifiers C1, C2, … Cq. Information loss of a cluster P_i: L(P_i) = |P_i| * D(P_i), where D(P_i) measures the spread (diameter) of the cluster over its quasi-identifiers.
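Since the slide does not spell out D(P_i), the sketch below approximates it with a normalized max-min spread for numeric attributes and a 0/1 spread for categorical ones (the paper presumably uses a taxonomy-tree distance for the categorical case, matching the hierarchy in the figure):

def diameter(cluster, numeric_attrs, numeric_ranges, categorical_attrs):
    # Spread of a cluster over its quasi-identifiers: normalized
    # max-min for numeric attributes, 0/1 for categorical ones.
    d = 0.0
    for a in numeric_attrs:
        vals = [row[a] for row in cluster]
        d += (max(vals) - min(vals)) / numeric_ranges[a]
    for a in categorical_attrs:
        d += 0.0 if len({row[a] for row in cluster}) == 1 else 1.0
    return d

def information_loss(cluster, numeric_attrs, numeric_ranges, categorical_attrs):
    # L(P_i) = |P_i| * D(P_i)
    return len(cluster) * diameter(cluster, numeric_attrs,
                                   numeric_ranges, categorical_attrs)

cluster = [{"age": 32, "sex": "M"}, {"age": 38, "sex": "M"}]
print(information_loss(cluster, ["age"], {"age": 100}, ["sex"]))   # 0.12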

  20. Pseudocode of the MOKA Algorithm
// clustering stage
Sort all the records in the table by the non-sensitive attributes
Set the number of clusters K = (number of records in the table) / (the value k of k-anonymity)
Remove every k-th record from the table and set it as the starting record of one cluster
Assign each of the remaining records to its nearest cluster
// adjusting stage
Find each cluster G that holds more than k records
Sort the records of G
Remove the records of G beyond the k-th position and place them in a set R
Assign every record of R to the closest cluster of size less than k
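The pseudocode translates into a short, runnable sketch. The version below handles numeric quasi-identifiers only, and its record-to-cluster distance (Manhattan distance to the cluster's first record) is a stand-in, not necessarily the paper's exact definition:

def dist(a, b):
    # Manhattan distance between two numeric records (tuples).
    return sum(abs(x - y) for x, y in zip(a, b))

def moka(records, k):
    # Clustering stage: sort, seed one cluster from every k-th record,
    # then attach each remaining record to its nearest seed.
    records = sorted(records)
    n_clusters = max(1, len(records) // k)
    seed_idx = {i * k for i in range(n_clusters)}
    clusters = [[records[i]] for i in sorted(seed_idx)]
    for i, r in enumerate(records):
        if i in seed_idx:
            continue
        min(clusters, key=lambda c: dist(r, c[0])).append(r)
    # Adjusting stage: clusters larger than k give up the records beyond
    # their k-th position; each goes to the closest under-full cluster
    # (falling back to any cluster when none is under-full).
    overflow = []
    for c in clusters:
        c.sort()
        overflow.extend(c[k:])
        del c[k:]
    for r in overflow:
        open_clusters = [c for c in clusters if len(c) < k] or clusters
        min(open_clusters, key=lambda c: dist(r, c[0])).append(r)
    return clusters

print(moka([(25,), (27,), (30,), (52,), (55,), (58,), (60,)], k=3))
# -> [[(25,), (27,), (30,)], [(52,), (55,), (58,), (60,)]]

Note that when the record count is not divisible by k, some leftover records fall back to the nearest cluster even though it already holds k records; clusters of size greater than k still satisfy k-anonymity.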

  21. The MOKA algorithm

  22. Experimental results

  23. Future Work • ℓ-diversity, which addresses the homogeneity attack and the background-knowledge attack • t-closeness, which addresses the skewness attack. A sketch of the simplest ℓ-diversity check follows.
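As a pointer for that future work, here is a minimal sketch of the simplest (distinct) form of ℓ-diversity, which blocks the homogeneity attack by requiring at least ℓ distinct sensitive values in every quasi-identifier group:

from collections import defaultdict

def is_l_diverse(table, quasi_identifiers, sensitive, l):
    # Each quasi-identifier group must hold at least l distinct
    # sensitive values; otherwise the group is open to the
    # homogeneity attack.
    groups = defaultdict(set)
    for row in table:
        groups[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return all(len(v) >= l for v in groups.values())

rows = [
    {"zip": "537**", "disease": "flu"},
    {"zip": "537**", "disease": "flu"},      # homogeneous group
    {"zip": "906**", "disease": "flu"},
    {"zip": "906**", "disease": "cancer"},
]
print(is_l_diverse(rows, ["zip"], "disease", l=2))   # False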

  24. Conclusion • The k-anonymity protection model can prevent identity disclosure, but a lack of diversity in the values of the sensitive attribute breaks the protection mechanism. • Clustering similar records together before anonymization can lower the information loss due to generalization. • In this research we propose a modified clustering method for k-anonymization. • We compared our algorithm with the k-means algorithm and obtained lower information loss in some cases. • We plan to vary some parameters of our algorithm and to compare its performance with other similar algorithms.

  25. Questions? Thank you!

  26. References • Sweeney, “k-Anonymity: A Model for Protecting Privacy”, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002. • Jun-Lin Lin, Meng-Cheng Wei, “An Efficient Clustering Method for k-Anonymization”, Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society (PAIS 2008), ACM. • Jun-Lin Lin, Meng-Cheng Wei, Chih-Wen Li, Kuo-Chiang Hsieh, “A Hybrid Method for k-Anonymization”, Proceedings of the 2008 IEEE Asia-Pacific Services Computing Conference (APSCC 2008), pp. 385-390. • Khaled El Emam, Fida Kamal Dankar, “Protecting Privacy Using k-Anonymity”, Journal of the American Medical Informatics Association, Volume 15, Number 5, September/October 2008. • Sweeney, “Computational Disclosure Control: A Primer on Data Privacy Protection”, PhD thesis, Massachusetts Institute of Technology, 2001. • Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, “Incognito: Efficient Full-Domain k-Anonymity”, Proceedings of SIGMOD 2005, June 14-16, 2005, Baltimore, Maryland, USA, pp. 49-60. • Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, “Mondrian Multidimensional k-Anonymity”, technical report, University of Wisconsin, Madison. • Robert Lemos, “Researchers Reverse Netflix Anonymization”, SecurityFocus, 2007-12-04 (http://www.privacyanalytics.ca/news/netflix.pdf). • S. Hettich and S. D. Bay, The UCI KDD Archive, 1999, http://kdd.ics.uci.edu • Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan Venkitasubramaniam, “ℓ-Diversity: Privacy Beyond k-Anonymity”, IEEE International Conference on Data Engineering, 2006. • Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian, “t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity”, Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE), 2007.
