Privacy and Anonymity in Text

Chris Clifton, 12 November 2009

Plausibly Deniable Search • Joint work with Mummoorthy Murugesan • 2009 SIAM International Conference on Data Mining (SDM09), Sparks, Nevada, April 30-May 2, 2009

The AOL Awakening • In Aug 2006, AOL released its customers' web searches for research studies • 20 million unique queries from 650K unique users • <user-id> was replaced with a <random-number> • A NY Times reporter successfully found the identity of an individual from the queries • Queries included “60 single men” and “landscapers in Lilburn, Ga” • Many more queries contained enough information to uniquely identify the person • AOL fired its CTO over this issue; two researchers were forced out

Privacy in Web Search • Server-Controlled Privacy • Deletion of queries after a few months • Anonymization of query logs before backup • Some of these methods have been shown to be inadequate • Private Information Retrieval • Affects the advertising business model • Not practical with the current solutions

Lessons Learned • Content of user queries reveals a lot • Ego surfing: searching for own name, SSN, credit card • Identifiable • Location, type of work, age, medical condition • Sensitive • Car they own, restaurants in a zip code • Query transformation alone is not enough • Submitting Q’ instead of Q to retrieve the same set of documents • User intent is still revealed

User-Controlled Privacy • Hide identifying metadata • Private Web Search (PWS): Firefox plugin (Yale Univ.) • Removes metadata • Hides user IP address (via Tor)

Private Web Search (Felipe Saint-Jean, Johnson, Boneh, Feigenbaum) • Tor: Hides IP addresses • Routes request and response through multiple servers • Each server knows only the preceding server • HTTP filter normalizes search queries • Browser, OS, etc. • HTML filter removes active components

User-Controlled Privacy • Hide identifying metadata • Private Web Search (PWS): Firefox plugin (Yale Univ.) • Removes metadata • Hides user IP address (via Tor) • Protect against disclosure through query terms • TrackMeNot: Firefox plugin (NYU) • Periodically issues randomized queries from a list of “seeds” • Uses search results for “logical” future query terms • Actual user query (user intent) is revealed • Timing attacks, load on server • Query semantics attacks (“logical” generated terms)

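Plausibly Deniable Search • [Figure: user, PDS, and search engine, with four numbered steps] • 1: The user submits query q to PDS • 2: PDS sends {q1,...,qk} to the search engine • 3: The search engine returns {R(q1),...,R(qk)} • 4: PDS filters R(qi) using the original q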
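The four steps in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `pds_search`, the `search` callable, and the term-overlap filter are hypothetical names and choices, since the slide says only that R(qi) is filtered using the original q.

```python
from typing import Callable, List

def pds_search(user_query: str,
               queryset: List[str],
               search: Callable[[str], List[str]]) -> List[str]:
    """Issue every query in the queryset, then filter locally with the user's query."""
    # Steps 2-3: the engine sees all k queries and returns R(q1)..R(qk).
    results = {q: search(q) for q in queryset}
    # Step 4: filter on the client; the original q is never sent to the engine.
    # Assumption: a result is kept if it shares at least one term with q.
    terms = set(user_query.lower().split())
    kept: List[str] = []
    for q in queryset:
        for doc in results[q]:
            if terms & set(doc.lower().split()) and doc not in kept:
                kept.append(doc)
    return kept

# Hypothetical stand-in for a search engine, for illustration only.
_INDEX = {
    "java compiler": ["Java compiler tutorial", "Compiler design notes"],
    "newton apple": ["Newton and the apple story"],
}

def fake_search(query: str) -> List[str]:
    return _INDEX.get(query, [])

hits = pds_search("java compilers tutorial",
                  ["java compiler", "newton apple"], fake_search)
```

The engine only ever sees the canonical query and its cover query; the user's wording stays on the client.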

Plausibly Deniable Search: Key Concepts • Browser submits more than one query {q1,…,qk} • Deniability • Reversible: any of the k queries would have produced the same set • The additional “cover queries” are of diverse topics • Plausibility • All queries are equally plausible • Implausible queries would weaken the deniability argument • {“java compiler”, “newton apple”} vs. {“java compiler”, “motorola table”}

Plausibly Deniable Search: Theory • Assume the following: • User queries follow a distribution Pu • Cover queries are generated through a distribution Pc • Given a set of two queries S={q1,q2}, there are two possible events • E1: q1 is the user query & q2 is the cover query • E2: q1 is the cover query & q2 is the user query

Plausibly Deniable Search: Theory • To achieve deniability for either of these queries, we require the following condition: • Two of many possible solutions • queries have equal probability of being user queries, and equal probability of being cover queries • queries have the same probability of being user query or cover query
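The condition itself did not survive in this transcript. One reading that is consistent with the two solutions listed above, assuming the user query and the cover query are drawn independently from Pu and Pc, is that the two events are equally likely given S:

```latex
P(E_1 \mid S) = P(E_2 \mid S)
\quad\Longleftrightarrow\quad
P_u(q_1)\,P_c(q_2) = P_c(q_1)\,P_u(q_2)
```

Under this reading, the first solution sets $P_u(q_1) = P_u(q_2)$ and $P_c(q_1) = P_c(q_2)$, and the second sets $P_u(q_i) = P_c(q_i)$ for $i = 1, 2$; either choice makes both sides equal.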

Creating Plausibly Deniable Cover Queries • Create Canonical Queries • Standard queries • Creating PD-Querysets • Plausibly deniable querysets with k queries • Issuing query • Find and issue the PD-Queryset for the given user query Done in advance (Server / Third Party)

Step One: Creating Canonical Queries • Pipeline: Seed Documents → FP Mining → Seed Queries → LSI (combine semantically similar seed queries) → Canonical Queries • Semantically similar surrogate queries for user queries • Supports the “deniability” argument since all queries could be generated by the system

Step Two: Creating PD-Querysets • Pipeline: Canonical Queries → Agglomerative Clustering → PD-Querysets • Dissimilarity between two queries is based on 3 measures: • Euclidean distance: semantically similar queries are closer in the semantic space • Magnitude: queries that are equally strong in their respective topics have similar magnitude • Neighborhood count: equally plausible queries have a similar number of log (already issued) queries in their neighborhood

Step Three: Issuing Query • User query is mapped to semantic space • $\mathrm{Vec}(q) = q^{T} U' S'^{-1}$ • Find canonical queries that have the maximum cosine similarity with q in the semantic space • The PD-Queryset of the selected canonical query is issued
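A minimal NumPy sketch of this step follows, assuming U' and S' are the truncated left singular vectors and singular values of the term-document matrix. The tiny matrix, the vocabulary order, and the function names are made up for illustration.

```python
import numpy as np

def fold_in(q, U_k, S_k):
    """Map a term-count vector into the reduced space: Vec(q) = q^T U' S'^{-1}."""
    # S_k is a 1-D array of singular values, so dividing applies the inverse diagonal.
    return q @ U_k / S_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_canonical(q, canonical_vecs, U_k, S_k):
    """Index of the canonical query with maximum cosine similarity to q."""
    v = fold_in(q, U_k, S_k)
    return int(np.argmax([cosine(v, c) for c in canonical_vecs]))

# Toy term-document matrix (rows: java, compiler, newton, apple).
A = np.array([[3.0, 1.0, 0.0, 0.0],
              [1.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])
U, s, _ = np.linalg.svd(A, full_matrices=False)
k = 2                                   # the slides keep 30 columns of U; 2 suffices here
U_k, S_k = U[:, :k], s[:k]

# Two hypothetical canonical queries, already folded into the semantic space.
canonical = [fold_in(np.array([1.0, 1.0, 0.0, 0.0]), U_k, S_k),   # "java compiler"
             fold_in(np.array([0.0, 0.0, 1.0, 1.0]), U_k, S_k)]   # "newton apple"

chosen = nearest_canonical(np.array([1.0, 0.0, 0.0, 0.0]), canonical, U_k, S_k)
```

Here the one-word query "java" maps to the first canonical query, whose PD-Queryset would then be issued.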

How Good is PDS? • Deniability: • Canonical query provides one level of anonymity • There exist many seed queries that map to a single canonical query • The reversible property provides deniability • Plausibility: • Based on the number of similar-topic queries issued by users • Measured as the perception of human subjects; difficult to quantify • How good are the canonical queries? Do they fetch what the users want?

Results from Experiments • Document Collection • DMOZ categorized web documents • 314K documents and 1.28M unique terms • Three topics: Computers, Science, Sports • Number of Documents in Each Category • Computers 115k • Science 100k • Sports 99k • After performing SVD on the term-doc matrix, only 30 columns are kept in U

Canonical Queries • 2.6 Million seed queries generated with ∆=500 • Produces 932K canonical queries • Average canonical query length 3.7

Retrieval Performance • 5k queries from the alltheweb.com searches • 3.4k unique queries containing at least 75% terms from our collection • Six of top 20 in 69% of queries (500)

Topic Diversity • DMOZ categories are used in comparing the topics of queries • 85% of PD-Querysets have queries with >50% topic diversity

What is Next? • PDS can be used along with other approaches such as PWS, Tor, etc. • Canonical Queries • Efficient ways of creating canonical queries • Improving retrieval performance • Sequential Queries • How to handle queries that a user edits sequentially on the same topic? • Can an attacker figure out the user queries over a period of time?

Query Sequences • Users issue a sequence of queries on a topic • Cover queries should be plausibly deniable sequences • Consider two sequences, S1={a1,b1} S2={a2,b2}, where <a1,a2> are issued together (first), <b1,b2> are issued second • There are two possible events: • E1: S1 is user sequence, S2 is cover sequence • E2: S1 is cover sequence, S2 is user sequence

Query Sequences • Deniability is achieved when we satisfy the following constraint: • Given deniability for the first queries a1, a2, we get:

Two (of many) Possible Solutions • b1 and b2 have the same conditional probability of being user-generated • Also the same conditional probability of being method-generated • a2 has equal conditional probability of being user-generated or method-generated; b2 has the same property • This is applicable to the (m+1)th query given a sequence of m queries
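The constraint for sequences did not survive in this transcript. A reconstruction consistent with the first solution above, assuming the user sequence is drawn from Pu, the cover sequence from Pc, the two are independent, and all probabilities are non-zero, is:

```latex
\begin{align*}
P(E_1 \mid S_1, S_2) &= P(E_2 \mid S_1, S_2) \\
\Longleftrightarrow\quad
P_u(a_1)\,P_u(b_1 \mid a_1)\,P_c(a_2)\,P_c(b_2 \mid a_2)
  &= P_c(a_1)\,P_c(b_1 \mid a_1)\,P_u(a_2)\,P_u(b_2 \mid a_2)
\end{align*}
```

If the first queries are already deniable, so that $P_u(a_1)\,P_c(a_2) = P_c(a_1)\,P_u(a_2)$, those factors cancel and the remaining requirement is

```latex
P_u(b_1 \mid a_1)\,P_c(b_2 \mid a_2) = P_c(b_1 \mid a_1)\,P_u(b_2 \mid a_2)
```

which holds, for example, when $P_u(b_1 \mid a_1) = P_u(b_2 \mid a_2)$ and $P_c(b_1 \mid a_1) = P_c(b_2 \mid a_2)$.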

Generating “user-like” Sequences • Idea: Inter-query time determines the difference between queries • Learn the distribution of changes to queries as a function of inter-query time • Given a time gap, generate the next query from the previous cover query and the appropriate distribution • P(qk | qk-1) is the same as for a real user!

Distribution of what changes? • Features “defining” a query are those useful in linking queries in a sequence • If a sequence can be discovered, it must be simulated • Features from “I Know What You Did Last Summer” (Jones et al.) • Term re-use, topic similarity used to link queries in a sequence • Learned distribution from a large query log for ranges of inter-query times • Topic relation • Topic repetition • Number of term changes

Feature Distributions with respect to Inter-Query Time • [Figure panels: Term Changes; Topic Changes] • “Bin number” is an exponential grouping on time
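The idea of learning a change distribution per time bin and then sampling from it can be sketched as follows. This is a simplified illustration, not the authors' generator: it models only the number of term changes, the base-2 binning and the random term replacement are assumptions, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def time_bin(seconds):
    """Exponential grouping on time (assumed base 2): <2 s -> 0, 2-3 s -> 1, 4-7 s -> 2, ..."""
    return max(int(seconds), 1).bit_length() - 1

def learn_change_distribution(log):
    """log: iterable of (gap_seconds, n_term_changes) pairs from consecutive user queries."""
    dist = defaultdict(Counter)
    for gap, changes in log:
        dist[time_bin(gap)][changes] += 1
    return dist

def sample_changes(dist, gap, rng):
    """Draw a number of term changes for this gap from the learned distribution."""
    counts = dist.get(time_bin(gap))
    if not counts:
        return 0  # no observations for this bin: leave the query unchanged
    values = list(counts)
    weights = [counts[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

def next_cover_query(prev_cover, gap, dist, vocab, rng):
    """Edit the previous cover query so its change count follows the user distribution."""
    terms = prev_cover.split()
    for _ in range(sample_changes(dist, gap, rng)):
        terms[rng.randrange(len(terms))] = rng.choice(vocab)
    return " ".join(terms)

rng = random.Random(1)
dist = learn_change_distribution([(3, 1), (2, 1), (100, 2)])
follow_up = next_cover_query("java compiler", 3, dist, ["apple"], rng)
```

A fuller version would also condition on topic relation and topic repetition, the other two features listed on the slide.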

Effectiveness: Topic Change Distribution on DMOZ

How well does it really work?

Try again…

Figure it out yet?

Disclosure-free Discovery of Related Documents • Chris Clifton, Mummoorthy Murugesan, Wei Jiang, Luo Si, Jaideep Vaidya • 18 September, 2009 • Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), Cancun, Mexico, April 7-12, 2008

Problem: Identifying Common Interests • [Figure: several copies of two related example documents] • “We have evidence that Osama Bin Laden has financed the purchase of Stinger Missiles in Afghanistan. …” • “There have been reports from Kabul of financial transfers from Bin Laden, purportedly for the purchase of Stinger missiles …”

Solution Overview • Alice: “We have evidence that Osama Bin Laden has financed the purchase of Stinger Missiles in Afghanistan. …” • Bob: “There have been reports from Kabul of financial transfers from Bin Laden, purportedly for the purchase of Stinger missiles …”

Secure Product: Random Matrix • Vaidya and Clifton, Privacy Preserving Association Rule Mining in Vertically Partitioned Data, KDD02

Secure Product: Homomorphic Encryption • Goethals, Laur, Lipmaa, and Mielikainen, On secure scalar product computation for privacy-preserving data mining, ICISC 2004
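To make the homomorphic-encryption approach concrete, here is a toy scalar product using Paillier encryption. Paillier is chosen here only as one well-known additively homomorphic scheme; the slide does not name a cryptosystem, and this sketch is not the cited protocol verbatim. The primes are tiny and insecure, and all function names are hypothetical.

```python
import math
import random

def keygen(p=1789, q=1861):
    """Toy Paillier key from two small primes (insecure; for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)   # with g = n + 1, L(g^lam mod n^2) = lam mod n
    return n, (lam, mu)

def encrypt(n, m, rng):
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    # g^m = (n + 1)^m = 1 + m*n (mod n^2)
    return (1 + m * n) % n2 * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

def secure_dot(x, y, rng):
    """Alice holds x, Bob holds y (non-negative ints); they end with additive shares of x.y."""
    n, priv = keygen()                      # Alice's key pair
    n2 = n * n
    enc_x = [encrypt(n, xi, rng) for xi in x]          # Alice -> Bob
    s_bob = rng.randrange(n)                # Bob's random mask, kept as his share
    acc = encrypt(n, (-s_bob) % n, rng)
    for c, yi in zip(enc_x, y):
        acc = acc * pow(c, yi, n2) % n2     # Enc(a) * Enc(b)^k = Enc(a + k*b)
    s_alice = decrypt(n, priv, acc)         # Bob -> Alice; she decrypts x.y - s_bob
    return s_alice, s_bob, n

rng = random.Random(0)
s_alice, s_bob, n = secure_dot([1, 0, 2, 3], [2, 5, 1, 1], rng)
dot = (s_alice + s_bob) % n
```

Bob never sees x in the clear, and Alice only learns a masked value, so neither party alone learns the product unless the shares are combined. With term-frequency vectors as x and y, the same product gives a document similarity score.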

Is Performance an Issue? • [Figure: Alice and Bob each hold many documents like the two example reports]

Running Time (journal articles)

Faster: Local Clustering • Locally cluster similar documents • Secure protocol identifies similar clusters • Document comparison only within identified clusters • [Figure: the example reports grouped into local clusters]

Savings / Loss from Clustering

Effectiveness: 40% Document Overlap

t-Plausibility: Semantic Preserving Text Sanitization • Wei Jiang, Mummoorthy Murugesan, Chris Clifton, and Luo Si • 2009 IEEE International Conference on Privacy, Security, Risk and Trust (PASSAT-09), Vancouver, Canada, August 29-31, 2009

Motivations • De-identification plays an important role in privacy (legislation) • Documents that do not contain personally identifiable information can be shared, e.g., pathology reports • De-identification tools remove “obvious” identifying information • Name, address, dates, … • Unfortunately, non-obvious information can identify • Pain vs. phantom pain • Alternative: suppress sensitive information • Uses marijuana for pain → Uses --- for --- • Our approach: information generalization • phantom pain → pain • tuberculosis → infectious disease

Related Work • Data anonymization • k-Anonymity: sanitizing structured info, e.g., datasets with at least k records in relational format • Transforming a text into a dataset of k records is not well studied • Text sanitization • Most work focuses on identifying sensitive attributes • Then removing identified sensitive information

Basic Idea: Generalization • Original: A Sacramento resident purchased marijuana for the lumbar pain caused by liver cancer • Sanitized: A ---------- resident purchased --------- for the ----------- caused by ------------ • Generalized: A state capital resident purchased drug for the pain caused by carcinoma • Ontology (most general to most specific): • Seat (50) → Capitol (32) → State_capitol (4) → Denver, Indianapolis, Phoenix, Sacramento • Agent (10) → Drug (6) → Controlled_substance (2) → Morphine, Marijuana, … • Malignant_tumor (7) → Cancer (5) → Carcinoma (2) → Liver_cancer, Lung_cancer, … • Evidence (20) → Symptom (10) → Pain (2) → Lumbar_pain, Migraine, …

t-Plausible Anonymization: t-PAT • Given a document d and an ontology o, anonymized document d’ is t-plausible if at least t base texts can be generalized to d’ • Let D(d’,d,o) give the number of possible base texts that can be generalized to d’ • t-PAT: Find the generalization d’ that is t-plausible and for which D(d’,d,o) is minimal

Uniform t-Plausibility • t-PAT is a start, but too crude to be useful in protecting privacy • Consider our example: • Original text: A Sacramento resident purchased marijuana for the lumbar pain caused by liver cancer • Sanitized text (t-PAT with t = 32): A capital resident purchased marijuana for the lumbar pain caused by liver cancer • Generalizing a single word may satisfy t-PAT
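The t = 32 example can be checked with a short sketch. It rests on two assumptions that the slides suggest but do not state: the number in parentheses after each ontology node is the count of base terms beneath it, and the number of base texts is the product of those counts over the words of the sanitized text. The function names are hypothetical.

```python
from math import prod

# Base-term counts copied from the example ontology (assumed meaning of the numbers).
BASE_TERM_COUNTS = {
    "Capitol": 32, "State_capitol": 4,
    "Drug": 6, "Controlled_substance": 2,
    "Symptom": 10, "Pain": 2,
    "Cancer": 5, "Carcinoma": 2,
}

def plausible_base_texts(terms, counts=BASE_TERM_COUNTS):
    """D(d', d, o) under the product assumption; ungeneralized terms contribute 1."""
    return prod(counts.get(term, 1) for term in terms)

def is_t_plausible(terms, t, counts=BASE_TERM_COUNTS):
    return plausible_base_texts(terms, counts) >= t

# Only "Sacramento" generalized, as in the sanitized text above.
single_word = ["Capitol", "Marijuana", "Lumbar_pain", "Liver_cancer"]
# Every sensitive word generalized one or two levels.
spread_out = ["State_capitol", "Drug", "Pain", "Carcinoma"]

d_single = plausible_base_texts(single_word)
d_spread = plausible_base_texts(spread_out)
```

Both texts meet t = 32, but the first leaves marijuana, lumbar pain, and liver cancer fully exposed, which is the weakness that motivates the uniform variant.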

Uniform t-PAT • Uniform t-PAT generalizes each word in an unbiased manner • We use an entropy function H(w) to quantify the generalization of each word • P(…) gives the probability of a base term given a generalized word
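The formula for H(w) is not preserved here. A minimal sketch, assuming H(w) is the Shannon entropy (base 2) of the base-term distribution P(b | w), and assuming a uniform distribution when only the number of base terms is known:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(w) = -sum_b P(b|w) * log2 P(b|w)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def uniform_entropy(n_base_terms):
    """Entropy when every base term under w is equally likely (assumption)."""
    return entropy([1.0 / n_base_terms] * n_base_terms)

h_state_capitol = uniform_entropy(4)   # 4 base terms -> 2 bits of uncertainty
h_base_term = entropy([1.0])           # an ungeneralized word carries no uncertainty
```

Under this reading, a word left at its base term contributes zero entropy, so a sanitization that generalizes only one word scores as highly non-uniform.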

Cost function for Uniform t-PAT • We define the following cost function C(d’,t) and attempt to minimize it • α is the parameter to control global optimality and uniform generalization • Annotated terms: uniform uncertainty introduced for each word; global generalization on d based on uncertainty t
