
Privacy and Anonymity in Text

Privacy and Anonymity in Text. Chris Clifton 12 November, 2009. Plausibly Deniable Search. This is joint work with Mummoorthy Murugesan. 2009 SIAM International Conference on Data Mining (SDM09), Sparks, Nevada, April 30-May 2, 2009. The AOL Awakening.


Presentation Transcript


  1. Privacy and Anonymity in Text Chris Clifton 12 November, 2009

  2. Plausibly Deniable Search • This is joint work with Mummoorthy Murugesan • 2009 SIAM International Conference on Data Mining (SDM09), Sparks, Nevada, April 30-May 2, 2009

  3. The AOL Awakening • In Aug 2006, AOL released its customers' web searches for research studies • 20 million unique queries from 650K unique users • <user-id> was replaced with a <random-number> • A NY Times reporter successfully identified an individual from the queries • Queries included “60 single men” and “landscapers in Lilburn, Ga” • Many more queries contained enough information to uniquely identify the person • AOL fired its CTO over this issue; two researchers were forced out

  4. Privacy in Web Search • Server-Controlled Privacy • Deletion of queries after a few months • Anonymization of query logs before backup • Some of these methods have been shown to be inadequate • Private Information Retrieval • Affects the advertising business model • Not practical with the current solutions

  5. Lessons Learned • Content of user queries reveals a lot • Ego surfing: searching for own name, SSN, credit card • Identifiable • Location, type of work, age, medical condition • Sensitive • Car they own, restaurants in a zip code • Query transformation alone is not enough • Submitting Q' instead of Q to retrieve the same set of documents • User intent is still revealed

  6. User-Controlled Privacy • Hide identifying metadata • Private Web Search (PWS) – Firefox plugin (Yale Univ.) • Removes metadata • Hides user IP Address (via TOR)

  7. Private Web Search (Felipe Saint-Jean, Johnson, Boneh, Feigenbaum) • Tor: Hides IP addresses • Routes request, response through multiple servers • Each knows only preceding server • HTTP filter normalizes search queries • Browser, OS, etc. • HTML filter removes active components

  8. User-Controlled Privacy • Hide identifying metadata • Private Web Search (PWS) – Firefox plugin (Yale Univ.) • Removes metadata • Hides user IP Address (via TOR) • Protect against disclosure through query terms • TrackMeNot – Firefox plugin (NYU) • Periodically issues randomized queries from a list of “seeds” • Uses search results for 'logical' future query terms • Drawbacks: • Actual user query (user intent) is revealed • Timing attacks, load on server • Query semantics attacks: 'logical' generated terms

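  9. Plausibly Deniable Search • [Figure: data flow between the user, PDS, and the search engine] • (1) User query q goes to PDS • (2) PDS sends {q1,...,qk} to the search engine • (3) The search engine returns {R(q1),...,R(qk)} • (4) Filter R(qi) using the original q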
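
The flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fake_search`, `FAKE_INDEX`, and the term-overlap filter are hypothetical stand-ins, since the transcript does not say how R(qi) is filtered against the original q.

```python
from typing import Callable, Dict, List

def plausibly_deniable_search(
    q: str,
    queryset: List[str],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Issue every query in the set, then keep only results relevant to q."""
    # The engine sees all k queries and returns R(q1)..R(qk).
    results: Dict[str, List[str]] = {qi: search(qi) for qi in queryset}
    # Local filtering with the original q (assumed here: naive term overlap).
    terms = set(q.lower().split())
    kept: List[str] = []
    for docs in results.values():
        for doc in docs:
            if terms & set(doc.lower().split()) and doc not in kept:
                kept.append(doc)
    return kept

# Hypothetical stand-in for a search engine.
FAKE_INDEX = {
    "java compiler": ["open source java compiler", "java bytecode tools"],
    "newton apple": ["newton and the apple story"],
}

def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])

hits = plausibly_deniable_search(
    "java compiler download", ["java compiler", "newton apple"], fake_search
)
```

The search engine only ever sees the k queries in the set; the original q and the filtering step stay on the client side.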

  10. Plausibly Deniable Search: Key Concepts • Browser submits more than one query {q1,…,qk} • Deniability • Reversible: any of the k queries would have produced the same set • The additional “cover queries” are of diverse topics • Plausibility • All queries are equally plausible • Implausible queries would weaken the deniability argument • Example: {“java compiler”, “newton apple”} vs. {“java compiler”, “motorola table”}

  11. Plausibly Deniable Search: Theory • Assume the following: • User queries follow a distribution Pu • Cover queries are generated through a distribution Pc • Given a set of two queries S={q1,q2}, there are two possible events • E1: q1 is user query & q2 is cover query • E2: q2 is user query & q1 is cover query

  12. Plausibly Deniable Search: Theory • To achieve deniability for either of these queries, we require the following condition: • Two of many possible solutions • queries have equal probability of being user queries, and equal probability of being cover queries • queries have the same probability of being user query or cover query
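
The condition itself was not captured in the transcript. One formulation consistent with the two listed solutions, offered here as an assumption rather than the slide's exact formula, treats the user query and the cover query as independent draws from Pu and Pc, and asks that the event "q1 is the user query" (E1) be exactly as likely as "q2 is the user query" (E2):

```latex
\begin{align*}
P(E_1 \mid S) &= P(E_2 \mid S) \\
\Longleftrightarrow\quad P_u(q_1)\,P_c(q_2) &= P_u(q_2)\,P_c(q_1)
\end{align*}
```

Under this reading, the first solution sets $P_u(q_1) = P_u(q_2)$ and $P_c(q_1) = P_c(q_2)$, and the second sets $P_u(q_i) = P_c(q_i)$ for $i = 1, 2$; either choice makes both sides equal.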

  13. Creating Plausibly Deniable Cover Queries • Create Canonical Queries • Standard queries • Creating PD-Querysets • Plausibly deniable querysets with k queries • Issuing query • Find and issue the PD-Queryset for the given user query • Done in advance (Server / Third Party)

  14. Step One: Creating Canonical Queries • [Figure: Seed Documents → FP Mining → Seed Queries → use LSI to combine semantically similar seed queries → Canonical Queries] • Semantically similar surrogate queries for user queries • Supports the “deniability” argument since all queries could be generated by the system

  15. Step Two: Creating PD-Querysets • Dissimilarity between two queries is based on 3 measures: • Euclidean distance: semantically similar queries are closer in the semantic space • Magnitude: queries that are equally strong in their respective topics have similar magnitude • Neighborhood count: equally plausible queries have a similar number of log (already issued) queries in their neighborhood • [Figure: Canonical Queries → Agglomerative Clustering → PD-Querysets]
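
A rough sketch of this step follows. The transcript names the three measures but not how they are combined, so the `dissimilarity` formula, its weights, the average-linkage merge rule, and the toy vectors and neighborhood counts are all assumptions made for illustration. The score is small when two queries are topically far apart yet alike in magnitude and neighborhood count, so merging low-score pairs yields diverse but equally plausible sets.

```python
import math
from itertools import combinations

def euclid(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def norm(u):
    return math.sqrt(sum(x * x for x in u))

def dissimilarity(a, b, w_dist=1.0, w_mag=1.0, w_nbr=1.0):
    """Hypothetical combination of the three measures (not from the talk)."""
    closeness = 1.0 / (1.0 + euclid(a["vec"], b["vec"]))  # penalize same topic
    mag_gap = abs(norm(a["vec"]) - norm(b["vec"]))        # penalize unequal strength
    nbr_gap = abs(a["nbrs"] - b["nbrs"])                  # penalize unequal plausibility
    return w_dist * closeness + w_mag * mag_gap + w_nbr * nbr_gap

def build_querysets(queries, k):
    """Agglomerative merging with average linkage; clusters are capped at k."""
    clusters = [[q] for q in queries]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            if len(clusters[i]) + len(clusters[j]) > k:
                continue
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            score = sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)
            if best is None or score < best[0]:
                best = (score, i, j)
        if best is None:  # no merge keeps a cluster within size k
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j

# Toy canonical queries: semantic vector and neighborhood count are made up.
queries = [
    {"text": "java compiler", "vec": (1.0, 0.0), "nbrs": 10},
    {"text": "newton apple", "vec": (0.0, 1.0), "nbrs": 10},
    {"text": "linux kernel patch", "vec": (3.0, 0.0), "nbrs": 2},
    {"text": "world cup final", "vec": (0.0, 3.0), "nbrs": 2},
]
querysets = [[q["text"] for q in c] for c in build_querysets(queries, k=2)]
```

With these toy values, each query is paired with the one from a different topic that matches it in magnitude and neighborhood count.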

  16. Step 3: Issuing Query • User query is mapped to semantic space • $\mathrm{Vec}(q) = q^{T} U' S'^{-1}$ • Find canonical queries that have the maximum cosine similarity with q in the semantic space • The PD-Queryset of the selected canonical query is issued
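
The mapping and lookup on this slide can be illustrated with a toy example. The sketch below assumes a rank-2 LSI model over four terms; the matrices `U` and `S`, the vocabulary, and the helper names are invented for illustration and are not values from the talk.

```python
import math

# Toy rank-2 LSI model over 4 terms (illustrative values only).
TERMS = ["java", "compiler", "newton", "apple"]
U = [[0.5, 0.5], [0.5, 0.5], [0.5, -0.5], [0.5, -0.5]]  # U': terms x k
S = [2.0, 1.0]                                           # diagonal of S'

def term_vector(query):
    """Bag-of-words count vector over TERMS; unknown words are ignored."""
    words = query.lower().split()
    return [float(words.count(t)) for t in TERMS]

def fold_in(query):
    """Vec(q) = q^T U' S'^{-1}, computed column by column."""
    q = term_vector(query)
    return [
        sum(q[i] * U[i][j] for i in range(len(TERMS))) / S[j]
        for j in range(len(S))
    ]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_canonical(query, canonical):
    """Return the canonical query with maximum cosine similarity to the query."""
    qv = fold_in(query)
    return max(canonical, key=lambda c: cosine(qv, fold_in(c)))

best = nearest_canonical(
    "java compiler download", ["java compiler", "newton apple"]
)
```

The PD-Queryset stored for `best` would then be issued in place of the user's own query.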

  17. How Good is PDS? • Deniability: • Canonical query provides one level of anonymity • There exist many seed queries that map to a single canonical query • The reversible property provides deniability • Plausibility: • Based on the number of similar-topic queries issued by users • Measured as the perception of human subjects; difficult to quantify • How good are the canonical queries? Do they fetch what the users want?

  18. Results from Experiments • Document Collection • DMOZ categorized web documents • 314K documents and 1.28M unique terms • Three topics: Computers, Science, Sports • Number of Documents in Each Category • Computers 115k • Science 100k • Sports 99k • After performing SVD on the term-doc matrix, only 30 columns are kept in U

  19. Canonical Queries • 2.6 Million seed queries generated with ∆=500 • Produces 932K canonical queries • Average canonical query length 3.7

  20. Retrieval Performance • 5k queries from the alltheweb.com searches • 3.4k unique queries containing at least 75% terms from our collection • Six of top 20 in 69% of queries (500)

  21. Topic Diversity • DMOZ categories are used in comparing the topics of queries • 85% of PD-Querysets have queries with >50% topic diversity

  22. What is Next? • PDS can be used along with other approaches such as PWS, TOR, etc. • Canonical Queries • Efficient ways of creating canonical queries • Improving retrieval performance • Sequential Queries • How to handle queries that a user edits sequentially on the same topic? • Can an attacker figure out the user queries over a period of time?

  23. Query Sequences • Users issue a sequence of queries on a topic • Cover queries should be plausibly deniable sequences • Consider two sequences, S1={a1,b1} S2={a2,b2}, where <a1,a2> are issued together (first), <b1,b2> are issued second • There are two possible events: • E1: S1 is user sequence, S2 is cover sequence • E2: S1 is cover sequence, S2 is user sequence

  24. Query Sequences • Deniability is achieved when we satisfy the following constraint: • Given deniability for the first queries a1,b1, we get:

  25. Two (of many) Possible Solutions • b1 and b2 have the same conditional probability of being user-generated • Also the same conditional probability of being method-generated • a2 has equal conditional probability of being user-generated or method-generated; b2 has the same property • This is applicable to the (m+1)th query given a sequence of m queries

  26. Generating “user-like” Sequences • Idea: Inter-query time determines difference between queries • Learn distribution of changes to queries at time • Given time, generate query from previous cover query and appropriate distribution • P(qk | qk-1) same as a real user!

  27. Distribution of what changes? • Features “defining” a query are those useful in linking queries in a sequence • If a sequence can be discovered, it must be simulated • Features from “I know what you did last summer” (Jones et al.) • Term re-use, topic similarity used to link queries in a sequence • Learned distribution from large query log for ranges of inter-query times • Topic relation • Topic repetition • Number of term changes
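
As a loose illustration of sampling a cover query's edit from a time-dependent distribution, the sketch below handles only one feature, the number of term changes. The power-of-two time bins, the `TERM_CHANGE_DIST` table, and the random term replacement are all hypothetical; the talk's learned distributions and its handling of topic relation and repetition are not reproduced here.

```python
import math
import random

def time_bin(seconds):
    """Exponential binning: bin b covers [2**b, 2**(b+1)) seconds (base 2 assumed)."""
    return int(math.log2(max(seconds, 1.0)))

# Hypothetical learned table: bin -> {number of term changes: probability}.
TERM_CHANGE_DIST = {
    0: {0: 0.6, 1: 0.4},
    4: {1: 0.5, 2: 0.5},
}

def sample_term_changes(seconds, rng):
    """Draw a number of term changes for the given inter-query time."""
    b = time_bin(seconds)
    # Fall back to the nearest bin that has data.
    nearest = min(TERM_CHANGE_DIST, key=lambda known: abs(known - b))
    dist = TERM_CHANGE_DIST[nearest]
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

def next_cover_query(prev, seconds, vocabulary, rng):
    """Derive the next cover query from the previous one by replacing terms."""
    terms = prev.split()
    if not terms:
        return prev
    for _ in range(sample_term_changes(seconds, rng)):
        terms[rng.randrange(len(terms))] = rng.choice(vocabulary)
    return " ".join(terms)

rng = random.Random(0)
cover = next_cover_query("java compiler", 20, ["linux", "kernel", "python"], rng)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible for experiments.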

  28. Feature Distributions with respect to Inter-Query Time Term Changes Topic Changes “Bin number” is exponential grouping on time

  29. Effectiveness: Topic Change Distribution on DMOZ

  30. How well does it really work?

  31. Try again…

  32. Figure it out yet?

  33. Disclosure-free Discovery of Related Documents • Chris Clifton, Mummoorthy Murugesan, Wei Jiang, Luo Si, Jaideep Vaidya • 18 September, 2009 • Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), Cancun, Mexico, April 7-12, 2008

  34. Problem: Identifying Common Interests • [Figure: several document boxes repeating two example texts] • “We have evidence that Osama Bin Laden has financed the purchase of Stinger Missiles in Afghanistan. …” • “There have been reports from Kabul of financial transfers from Bin Laden, purportedly for the purchase of Stinger missiles …”

  35. Solution Overview • [Figure: Alice and Bob, each holding one document] • “We have evidence that Osama Bin Laden has financed the purchase of Stinger Missiles in Afghanistan. …” • “There have been reports from Kabul of financial transfers from Bin Laden, purportedly for the purchase of Stinger missiles …”

  36. Secure Product: Random Matrix • Vaidya and Clifton, Privacy Preserving Association Rule Mining in Vertically Partitioned Data, KDD02

  37. Secure Product: Homomorphic Encryption • Goethals, Laur, Lipmaa, and Mielikainen, On secure scalar product computation for privacy-preserving data mining, ICISC 2004

  38. Is Performance an Issue? • [Figure: Alice and Bob each hold many documents, drawn with the two example texts “We have evidence that Osama Bin Laden has financed the purchase of Stinger Missiles in Afghanistan. …” and “There have been reports from Kabul of financial transfers from Bin Laden, purportedly for the purchase of Stinger missiles …”]

  39. Running Time (journal articles)

  40. Faster: Local Clustering • Locally cluster similar documents • Secure protocol identifies similar clusters • Document comparison only within identified clusters • [Figure: clusters of documents, drawn with the two example texts “We have evidence that Osama Bin Laden has financed the purchase of Stinger Missiles in Afghanistan. …” and “There have been reports from Kabul of financial transfers from Bin Laden, purportedly for the purchase of Stinger missiles …”]

  41. Savings / Loss from Clustering

  42. Effectiveness: 40% Document Overlap

  43. t-Plausibility: Semantic Preserving Text Sanitization • Wei Jiang, Mummoorthy Murugesan, Chris Clifton, and Luo Si • 2009 IEEE International Conference on Privacy, Security, Risk and Trust (PASSAT-09), Vancouver, Canada, August 29-31, 2009

  44. Motivations • De-identification plays an important role in privacy (legislation) • Documents that do not contain personally identifiable information can be shared, e.g., pathology reports • De-identification tools remove “obvious” identifying information • Name, address, dates, … • Unfortunately, non-obvious information can identify • Pain vs. phantom pain • Alternative: suppress sensitive information • Uses marijuana for pain → Uses --- for --- • Our approach: information generalization • phantom pain → pain • tuberculosis → infectious disease

  45. Related Work • Data anonymization • k-Anonymity: sanitizing structured info, e.g., datasets with at least k records in relational format • Transforming a text into a dataset of k records is not well studied • Text sanitization • Most work focuses on identifying sensitive attributes • Then removing identified sensitive information

  46. Basic Idea: Generalization • Original: A Sacramento resident purchased marijuana for the lumbar pain caused by liver cancer • Sanitized: A ---------- resident purchased --------- for the ----------- caused by ------------ • Generalized: A state capital resident purchased drug for the pain caused by carcinoma • [Figure: ontology paths from general to specific, with the number shown beside each node] • Seat (50) > Capitol (32) > State_capitol (4) > Denver, Indianapolis, Phoenix, Sacramento • Agent (10) > Drug (6) > Controlled_substance (2) > Morphine, Marijuana, … • Malignant_tumor (7) > Cancer (5) > Carcinoma (2) > Liver_cancer, Lung_cancer, … • Evidence (20) > Symptom (10) > Pain (2) > Lumbar_pain, Migraine, …

  47. t-Plausible Anonymization: t-PAT • Given a document d and an ontology o, anonymized document d’ is t-plausible if at least t base texts can be generalized to d’ • Let D(d’,d,o) give the number of possible base texts that can be generalized to d’ • t-PAT: Find the generalization d’ that is t-plausible and for which D(d’,d,o) is minimal
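
A small sketch of the counting behind t-plausibility follows. It rests on two assumptions that the transcript does not confirm: that the numbers beside the ontology nodes in the generalization example are counts of base terms under each node, and that D(d’,d,o) is the product of those counts over the generalized words, with ungeneralized words contributing a factor of 1. The names `BASE_TERMS`, `count_base_texts`, and `is_t_plausible` are hypothetical.

```python
from math import prod

# Assumed meaning: number of base terms under each ontology node,
# copied from the node labels in the generalization example.
BASE_TERMS = {
    "State_capitol": 4, "Capitol": 32, "Seat": 50,
    "Controlled_substance": 2, "Drug": 6, "Agent": 10,
    "Carcinoma": 2, "Cancer": 5, "Malignant_tumor": 7,
    "Pain": 2, "Symptom": 10, "Evidence": 20,
}

def count_base_texts(words):
    """D(d', d, o) as a product of per-word base-term counts (assumption).

    Words that are not ontology nodes are treated as ungeneralized (factor 1).
    """
    return prod(BASE_TERMS.get(w, 1) for w in words)

def is_t_plausible(words, t):
    return count_base_texts(words) >= t

# Every sensitive word generalized one level: 4 * 2 * 2 * 2 = 32 base texts.
uniform = ["State_capitol", "Controlled_substance", "Pain", "Carcinoma"]
# Only the city generalized, but by two levels: 32 base texts as well.
skewed = ["Capitol", "marijuana", "lumbar_pain", "liver_cancer"]
d_uniform = count_base_texts(uniform)
d_skewed = count_base_texts(skewed)
```

Under these assumptions both documents satisfy t = 32, even though the second leaves three of the four sensitive terms untouched.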

  48. Uniform t-Plausibility • t-PAT is a start, but too raw to be useful in protecting privacy • Consider our example: • Original text: A Sacramento resident purchased marijuana for the lumbar pain caused by liver cancer • Sanitized text (t-PAT with t = 32): A capital resident purchased marijuana for the lumbar pain caused by liver cancer • Generalizing a single word may satisfy t-PAT

  49. Uniform t-PAT • Uniform t-PAT generalizes each word in an unbiased manner • We use entropy function H(w) to quantify generalization of each word • P(…) gives the probability of base term given a generalized word
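
The formula for H(w) is not in the transcript. If H is the usual Shannon entropy over the base terms a generalized word could stand for, it would read as follows, where $B(w)$ is notation introduced here (not from the slide) for the set of base terms under $w$:

```latex
H(w) = -\sum_{b \in B(w)} P(b \mid w)\,\log P(b \mid w)
```

Under that reading, if every base term is equally likely, then $P(b \mid w) = 1/|B(w)|$ and $H(w) = \log |B(w)|$, so a word with more possible base terms carries more uncertainty.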

  50. Cost function for Uniform t-PAT • We define the following cost function C(d’,t) and attempt to minimize it • α is the parameter that controls the balance between global optimality and uniform generalization • Formula annotations: uniform uncertainty introduced for each word; global generalization on d based on uncertainty t
