
You Are What You Say: Privacy Risks of Public Mentions

Written By:

Dan Frankowski, Dan Cosley, Shilad Sen,

Loren Terveen, John Riedl

Presented by:

David Keppel, Jorly Metzger

Table of Contents
  • Introduction
  • Related Work
  • Experiment Setup
  • Evaluation
  • Algorithms
  • Altering The Dataset
  • Self Defense
  • Conclusion
Did you ever use the term…
  • “Did you ever use the term Long Dong Silver in conversation with Professor Hill?”
    • Clarence Thomas’ confirmation hearing for the U.S. Supreme Court
    • Video Rental History
    • Not permissible in court (1988 Video Privacy Protection Act)
I wish I could get…
  • Tom Owad downloaded 260,000 Amazon wish lists
    • Flagged several “dangerous” books
    • Amazon wish list contained name, city, and state
    • Using Yahoo! PeopleSearch, he found the complete address of one of four wish list owners
Some more examples
  • Blogs
  • Forums
  • Myspace
  • Netflix
  • AOL – oops
    • Released its users’ search logs
    • Users could be identified from this data
You Are What You Say
  • People have several online identities
  • A sparse relation space: the movies, journal articles, or authors you mention
  • Those mentions allow re-identification
  • Re-identification may also lead to other privacy violations:
    • discovery of name and address
    • other unforeseen consequences
A lot of information about you is out there.
  • Many people reveal their preferences to organizations
  • Organizations keep people’s preference, purchase, or usage data
  • People believe this information should be private
  • These datasets usually are private to the organization that collects them
Why do I get all that Spam?
  • Why doesn’t that information stay private?
    • Research groups demand data (AOL search logs!)
    • Pool or trade data for mutual benefit
    • Government agencies forced to release data
    • Sell data to other businesses.
    • Bankrupt businesses may be forced to sell data
Quasi-Identifiers
  • Even if obvious identifiers have been removed, a dataset may still contain a uniquely identifying quasi-identifier
    • 87% of the 248 million people in the 1990 U.S. census are likely to be uniquely identified based only on their 5-digit ZIP code, gender, and birth date
  • Quasi-identifiers can be linked to other databases
    • Sweeney re-identified the medical records of a former governor of Massachusetts by linking public voter registration data to a database of supposedly anonymous medical records sold to industry

[Diagram: Dataset 1 (sensitive information: medical records) and Dataset 2 (identifying information: voter registration data) share a Zip code field; joining them on that quasi-identifier yields Dataset 3: privacy compromised.]
The research paper proposes
  • Re-identification of users from a public web movie forum in a private movie ratings dataset
  • Three major results:
    • Algorithms for re-identifying users
    • An evaluation of whether private dataset owners can protect user privacy by hiding data
    • An evaluation of two methods by which users in a public forum can protect their own privacy
The usefulness of re-identification
  • Why re-identification matters:
    • The amount of data available electronically is increasing rapidly, and IR techniques can be applied to it
    • This creates serious privacy risks for users
  • Re-identification may also prove valuable:
    • Identifying shills
    • Even fighting terrorism!
Linking People in Sparse Relation Spaces
  • Sparse relation spaces
    • Examples: purchase data, online music players, Wikipedia
    • They differ from traditional databases
  • Identified vs. non-identified datasets
  • Accessible vs. inaccessible datasets
    • Example: Amazon might re-identify customers on competitors’ websites by comparing their purchase history against reviews written on those sites, and decide to market (or withhold) special offers from them
Burning Questions
  • RISKS OF DATASET RELEASE: What are the risks to user privacy when releasing a dataset?
  • ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy?
  • SELF DEFENSE: How can users protect their own privacy?
Related Work
  • Studies ([1], [18])
    • Show that a large majority of internet users are concerned about their privacy
  • Opinion mining
    • Novak et al. [10] investigated re-identifying multiple aliases of a user in a forum based on general properties of their post text
    • The authors argue that marrying their algorithms to opinion mining methods will improve their ability to re-identify people
Related Work (cont.)
  • Prior work identified a number of ways to modify data to preserve privacy
    • Perturbing attribute values by adding random noise (Agrawal et al. [2])
  • Techniques for preserving k-anonymity (Sweeney [17]):
    • Suppression (hiding data)
    • Generalization (reducing the fidelity of attribute values)
K-Identification
  • K-anonymity: "A [dataset] release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release."
  • K-identification: the analog of k-anonymity across two datasets
    • A measure of how well an algorithm can narrow each user in one dataset to one of k users in another dataset
    • If k is large, or if k is small and the k-identification rate is low, users can plausibly deny being identified
Experiment Setup
  • Offline experiments using two sparse relation spaces, both from a snapshot of the MovieLens database (January 2006)
  • Goal: show that re-identification of users is possible using information from both datasets
  • Information available:
    • a set of movie ratings
    • a forum for referencing movies
Experiment Setup (cont.)
  • MovieLens movie recommender:
    • a set of movie ratings – assigned by a user
    • a set of movie mentions – derived from the forum
Experiment Dataset
  • Drawn from posts in the MovieLens forums and from the MovieLens users’ movie ratings
  • Dataset includes:
    • 12,565,530 ratings
    • 140,132 users,
    • 8,957 items.
  • Users can make movie references while posting to the forum
    • Manual
    • Automated
The power law
  • A typical and important feature of real-world sparse relation spaces
  • (Review) A sparse relation space:
    • A) relates people to items;
    • B) is sparse, having relatively few relationships recorded per person;
    • C) has a large space of items
  • The data roughly follow a power law
    • In the ratings dataset: the distribution of ratings per user
    • Likewise in the mentions dataset
Binning Strategy
  • Users are grouped into bins that contain similar numbers of users and have intuitive meaning
  • Hypothesis: how identifiable a user is depends on their number of mentions
    • Users with more mentions disclose more information
Experiment Overview
  • Objective
  • Evaluation Criteria
  • Re-Identification Algorithms
  • Altering the Dataset
  • Self-Defense

Take a user from the public dataset of mentions and attempt to re-identify them within the private dataset of ratings.

[Diagram: the mentions user jmetzger (movie2, movie6, movie4) is compared against ratings users member43 (movie62, movie12), member65 (movie4, movie2, movie6, movie15), and member21 (movie4, movie95, movie6).]
Evaluation Criteria – Overview
  • 133 users selected from the public mentions dataset to be target users. Each target user has at least one mention.
  • Users to be re-identified will reside in the private ratings dataset.
  • K-identification will be evaluated for k = 1, 5, 10, and 100.
Evaluation Criteria – K-Identification
  • K-identification – measures how well an algorithm can narrow each user in a dataset to one of k users in another dataset
  • Let t be the target user and let j = r_t, the rank of t in the list returned by the re-identification algorithm. Then t is k-identified for k ≥ j.

Note: for ties involving t, j is the highest rank among tied users

  • K-identification rate is the fraction of k-identified users

Example: t = jmetzger

M_t = {movie4, movie10, movie12}

reIdentAlg(M_t) returns a list of ratings users, ordered by their likelihood of being t.

If jmetzger’s true account (say Member45) is returned at rank 4, then t is 4-identified and 5-identified.

If jmetzger’s true account ties with another user (say Member9 and Member94, tied at ranks 2–3), then j is the highest rank among the tied users, so t is 3-identified, 4-identified, and 5-identified.

Evaluation Criteria – K-Identification Rate

Evaluate for k = 1, 2, 4 with four target users selected from a public dataset: t1 = Member4, t2 = Member1, t3 = Member5, t4 = Member8.

[Figure: each target’s rank in the returned candidate list.]

k = 1: k-identification rate = 2 / 4 = 50%
k = 2: k-identification rate = 2 / 4 = 50%
k = 4: k-identification rate = 3 / 4 = 75%
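Here is a minimal Python sketch of k-identification and the k-identification rate, assuming each re-identification algorithm returns a likelihood score per candidate. The `results` toy data is invented to reproduce the rates above, and tie handling follows the slide’s rule (j is the highest rank among tied users).

```python
def rank_of(target, scores):
    """Rank of the target's true account among the candidates.
    For ties involving the target, use the highest (worst) tied rank."""
    s = scores[target]
    higher = sum(1 for v in scores.values() if v > s)
    tied = sum(1 for v in scores.values() if v == s)  # includes the target
    return higher + tied

def k_identification_rate(results, k):
    """Fraction of targets whose true account ranks within the top k.
    `results` maps each target user to a {candidate: likelihood} dict."""
    hits = sum(rank_of(t, scores) <= k for t, scores in results.items())
    return hits / len(results)

# Toy scores chosen so that the ranks are 1, 1, 3, and 5.
results = {
    "Member4": {"Member4": 0.9, "Member1": 0.3, "Member5": 0.2, "Member8": 0.2, "Member9": 0.1},
    "Member1": {"Member1": 0.8, "Member4": 0.4, "Member5": 0.3, "Member8": 0.2, "Member9": 0.1},
    "Member5": {"Member4": 0.7, "Member9": 0.6, "Member5": 0.5, "Member1": 0.2, "Member8": 0.1},
    "Member8": {"Member4": 0.9, "Member1": 0.8, "Member5": 0.7, "Member9": 0.6, "Member8": 0.1},
}
print(k_identification_rate(results, 1))  # 0.5
print(k_identification_rate(results, 2))  # 0.5
print(k_identification_rate(results, 4))  # 0.75
```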

Algorithms: Set Intersection – Basic Concept
  • Finds all users in the ratings dataset who rated every item mentioned by the target user t
  • Every returned user receives the same likelihood score
  • The actual rating values given by users are ignored
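A minimal sketch of the set-intersection idea, using toy data modeled on the earlier diagram (not the authors’ code):

```python
def set_intersection(target_mentions, ratings):
    """Return every ratings user who rated ALL items the target mentioned.
    All matches tie with the same likelihood; rating values are ignored."""
    return [u for u, rated in ratings.items() if target_mentions <= rated]

# Ratings users and the items they rated
ratings = {
    "member43": {"movie62", "movie12"},
    "member65": {"movie4", "movie2", "movie6", "movie15"},
    "member21": {"movie4", "movie95", "movie6"},
}
# jmetzger mentioned movie2, movie6, and movie4 in the forum
print(set_intersection({"movie2", "movie6", "movie4"}, ratings))  # ['member65']
```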
Algorithms: Set Intersection – Evaluation

[Chart: behavior at 1-identification]
  • Failure Scenarios
    • “Not Narrow” – more than k users matched
    • “No One Possible” – no users matched
    • “Misdirected” – users found, but none matched target user t
Algorithms: TF-IDF – Basic Concept
  • Can match users who mention items they have not rated (where Set Intersection fails)
  • Desired properties:
    • Users who rated more of the mentioned items score higher
      • Concerned with which ratings users cover the most mentions
    • Users who rated rare mentioned movies score higher than users who rated common mentioned movies
      • Concerned with the number of ratings users who share a particular mention


Algorithms: TF-IDF – Formula

Definitions: t = target user, u = a user, m = a movie, U = the set of all users.

Term frequency $tf_{um}$ is 1 if mentions-user u mentioned m (0 otherwise); for ratings users, $tf_{um}$ is 1 if u rated m (0 otherwise).

Weight: $w_{um} = tf_{um} \cdot \log_2\!\left(\frac{|U|}{|\{u' \in U \text{ who rated } m\}|}\right)$

Similarity (cosine): $sim(t,u) = \frac{w_t \cdot w_u}{\lVert w_t \rVert \, \lVert w_u \rVert}$
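A sketch of TF-IDF ranking under the formula above, with binary term frequencies and IDF computed over the ratings dataset. This is an illustrative reading of the slide, not the authors’ implementation.

```python
import math

def reident_tfidf(target_mentions, ratings):
    """Rank ratings users by cosine similarity between the target's
    TF-IDF mention vector and each user's TF-IDF rating vector."""
    n = len(ratings)
    df = {}  # how many ratings users rated each movie
    for rated in ratings.values():
        for m in rated:
            df[m] = df.get(m, 0) + 1

    def weight(m):  # w_um = tf_um * log2(|U| / |{u' who rated m}|)
        return math.log2(n / df[m]) if m in df else 0.0

    def vector(items):
        return {m: weight(m) for m in items}

    def cosine(a, b):
        dot = sum(w * b[m] for m, w in a.items() if m in b)
        norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
        na, nb = norm(a), norm(b)
        return dot / (na * nb) if na and nb else 0.0

    wt = vector(target_mentions)
    scored = {u: cosine(wt, vector(rated)) for u, rated in ratings.items()}
    return sorted(scored.items(), key=lambda kv: -kv[1])
```

The cosine normalization also explains the weakness noted on the next slide: a user who rated very few movies has a short vector, so a single rare match can dominate the similarity.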

Algorithms: TF-IDF – Evaluation
  • Better performance than Set Intersection
  • Over-weighted any mention for a ratings user who had rated few movies
Algorithms: Scoring – Basic Concept
  • Emphasizes mentions of rarely-rated movies
  • De-emphasizes the number of ratings a user has
  • Assuming that scores are separable, sub-scores are calculated for each mention then multiplied to get an overall score

Algorithms: Scoring – Formula

Sub-score, when u rated m: $ss(u,m) = 1 - \frac{|\{u' \in U \text{ who rated } m\}| - 1}{|U|}$

Overall score: $s(u,t) = \prod_{m_i \in T} ss(u, m_i)$, where T is the set of the target’s mentions.

  • The sub-score ss(u,m) gives more weight to rarely rated movies
  • Users who rated more than 1/3 of the movies were discarded (12 users total)
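A sketch of the scoring algorithm as reconstructed above. The slide only gives the sub-score for the case where u rated m; the `MISS` constant for unrated mentions is an assumption made so that the product stays well defined.

```python
MISS = 0.05  # assumed penalty for an unrated mention (not given on the slide)

def scoring(target_mentions, ratings):
    """s(u,t) = product of sub-scores over the target's mentions.
    Rarely rated movies yield sub-scores near 1; popular ones score lower."""
    n = len(ratings)
    pop = {}  # how many users rated each movie
    for rated in ratings.values():
        for m in rated:
            pop[m] = pop.get(m, 0) + 1

    def subscore(u, m):
        if m in ratings[u]:
            return 1.0 - (pop[m] - 1) / n  # ss(u,m) from the formula above
        return MISS

    scores = {}
    for u in ratings:
        s = 1.0
        for m in target_mentions:
            s *= subscore(u, m)
        scores[u] = s
    return sorted(scores.items(), key=lambda kv: -kv[1])
```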
Algorithms: Scoring – Evaluation
  • Outperformed TF-IDF
  • Including the heaviest-rating users reduced 1-identification performance
  • Using a flat sub-score of 1 for rated mentions reduced 1-identification performance

Algorithms: Scoring – With Ratings

Sub-score, when u rated m and $|r(u,m) - r(t,m)| \le \delta$: $ss(u,m) = 1 - \frac{|\{u' \in U \text{ who rated } m\}| - 1}{|U|}$

where r(t,m) = the rating mined from target t’s mention of m, and r(u,m) = the rating given by user u for m.

  • The mined rating value is used to restrict the scoring algorithm
  • Exact rating: δ = 0
  • Fuzzy rating: δ = 1
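The ratings-aware variant only changes the condition under which a mention earns the full sub-score. A sketch, where `ratings[u]` now maps movie to rating value, `mined[m]` is the rating mined from the target’s post, and the `miss` penalty is the same assumption as before:

```python
def subscore_with_rating(u, m, ratings, mined, pop, n, delta=0, miss=0.05):
    """ss(u,m) counts only if u rated m AND u's rating is within `delta`
    of the rating mined from the target's post (delta=0 exact, 1 fuzzy)."""
    r_um = ratings[u].get(m)  # None if u never rated m
    if r_um is not None and abs(r_um - mined[m]) <= delta:
        return 1.0 - (pop[m] - 1) / n
    return miss
```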
Altering the Dataset – Overview
  • Question: How can dataset owners alter the dataset they release to preserve user privacy?
  • Suggestions
    • Perturbation
    • Generalization
    • Suppression
Altering the Dataset – Suppression
  • Drop rarely rated movies
  • That is, drop the ratings of items rated by fewer users than a specified threshold (see the sketch below)
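An owner-side suppression sketch matching the bullets above; the threshold is the minimum number of ratings an item must have to survive the release:

```python
def suppress_rare_items(ratings, threshold):
    """Release only ratings of items rated by at least `threshold` users."""
    counts = {}
    for rated in ratings.values():
        for m in rated:
            counts[m] = counts.get(m, 0) + 1
    return {u: {m for m in rated if counts[m] >= threshold}
            for u, rated in ratings.items()}
```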
Self Defense – Overview
  • Question: How can users protect their own privacy?
  • Suggestions
    • Suppression
    • Misdirection
Self Defense – Suppression
  • The same behavior seen in Altering the Dataset holds here
  • Workaround:
    • List the items the user has both mentioned and rated
    • Order the items by how many times each has been rated (rarest first)
    • Suppress mentions of only the top (rarest) portion of the list, as sketched below
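A sketch of the workaround, assuming the user can look up each item’s global rating count (`popularity`); the fraction suppressed is a tunable assumption:

```python
def self_suppress(my_mentions, popularity, fraction=0.5):
    """Suppress mentions of the rarest-rated portion of the items the user
    has both mentioned and rated; keep the more popular remainder."""
    rarest_first = sorted(my_mentions, key=lambda m: popularity.get(m, 0))
    cut = int(len(rarest_first) * fraction)
    return set(rarest_first[cut:])  # mentions that remain safe to post
```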
Self Defense – Misdirection
  • The user intentionally mentions items they have not rated
  • Procedure (sketched below):
    • A misdirection item list is created
      • Choose items rated by at least a threshold number of users, ordered by increasing popularity
      • Or choose such items ordered by decreasing popularity
      • Thresholds vary from 1 to 8192, in powers of 2
    • Each user mentions the first item on the list that they have not rated
    • K-identification is re-computed
    • This is repeated for each k-identification level
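A sketch of one misdirection step under the procedure above; `popularity` and the ordering direction follow the slide, and the decoy is simply the first list item the user has not rated:

```python
def misdirection_list(popularity, threshold, increasing=True):
    """Items rated by at least `threshold` users, ordered by popularity."""
    pool = [m for m, count in popularity.items() if count >= threshold]
    return sorted(pool, key=lambda m: popularity[m], reverse=not increasing)

def pick_decoy(user_rated, decoy_list):
    """Each user mentions the first list item they have NOT rated."""
    for m in decoy_list:
        if m not in user_rated:
            return m  # the decoy mention
    return None  # the user has rated everything on the list
```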
Conclusion
  • Re-identification in a sparse relation space can violate privacy
  • Relationships to items in a sparse relation space can be a quasi-identifier
  • As prevention, suppression of datasets is impractical
  • User-level misdirection does provide some anonymity at a fairly low cost
Future Work
  • How to determine that a user in one sparse dataset exists in another sparse dataset
  • Design a re-identifying algorithm that ignores the most popular mentions entirely
  • Construct an Intelligent Interface that helps people manage their privacy
  • If people were convinced to intentionally misdirect data, how would this change the nature of public discourse in sparse relation spaces?
Critique
  • The paper should explain how user matches were determined
  • In the TF-IDF algorithm, the notation did not clearly differentiate mentions users from ratings users in the weights
  • The graphs were very useful for understanding the behavior of the algorithms
References
  • [1] Ackerman, M. S., Cranor, L. F., and Reagle, J. 1999. Privacy in e-commerce: examining user scenarios and privacy preferences. In Proc. EC99, pp. 1-8.
  • [2] Agrawal, R. and Srikant, R. 2000. Privacy-preserving data mining. In Proc. SIGMOD00, pp. 439-450.
  • [3] Berkovsky, S., Eytani, Y., Kuflik, T., and Ricci, F. 2005. Privacy-Enhanced Collaborative Filtering. In Proc. User Modeling Workshop on Privacy-Enhanced Personalization.
  • [4] Canny, J. 2002. Collaborative filtering with privacy via factor analysis. In Proc. SIGIR02, pp. 238-245.
  • [5] Dave, K., Lawrence, S., and Pennock, D. M. 2003. Mining the peanut gallery: opinion extraction and semantic classi-fication of product reviews. In Proc. WWW03, pp. 519-528.
  • [6] Drenner, S., Harper, M., Frankowski, D., Terveen, L., and Riedl, J. 2006. Insert Movie Reference Here: A System to Bridge Conversation and Item-Oriented Web Sites. Accepted for Proc. CHI06.
  • [7] Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. 2002. Privacy preserving mining of association rules. In Proc. KDD02, pp. 217-228.
  • [8] Hong, J. I. and Landay, J. A. 2004. An Architecture for Privacy-Sensitive Ubiquitous Computing. In Proc. MobiSys04, pp. 177-.
  • [9] Lam, S.K. and Riedl, J. 2004. Shilling recommender systems for fun and profit. In Proc. WWW04, pp. 393-402.
References (cont.)
  • [10] Novak, J., Raghavan, P., and Tomkins, A. 2004. Anti-aliasing on the Web. In Proc. WWW04, pp. 30-39.
  • [11] Pang, B., Lee, L., and Vaithyanathan, S. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proc. Empirical Methods in NLP, pp. 79-86.
  • [12] Polat, H., Du, W. 2003. Privacy-Preserving Collaborative Filtering Using Randomized Perturbation Techniques. ICDM03, p. 625.
  • [13] Ramakrishnan, N., Keller, B. J., Mirza, B. J., Grama, A. and Karypis, G. 2001. Privacy Risks in Recommender Systems. IEEE Internet Computing 5(6):54-62.
  • [14] Rizvi, S., and Haritsa, J. 2002. Maintaining Privacy in Association Rule Mining. In Proc. VLDB02, pp. 682-
  • [15] Sarwar, B. M., Karypis, G., Konstan, J. A., and Riedl, J. 2001. Item-based collaborative filtering recommendation algorithms. In Proc. WWW01.
  • [16] Sweeney, L. 2002. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5):571-588.
  • [17] Sweeney, L. 2002. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5):557-570.
  • [18] Taylor, H. 2003. Most People Are “Privacy Pragmatists.” Harris Poll #17. Harris Interactive (March 19, 2003).
  • [19] Terveen, L., et al. 1997. PHOAKS: a system for sharing recommendations. CACM 40(3):59-62.
  • [20] Verykios, V. S., et al. 2004. State-of-the-art in privacy preserving data mining. SIGMOD Rec. 33(1):50-57.