You Are What You Say: Privacy Risks of Public Mentions

Written by: Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl

Presented by: David Keppel, Jorly Metzger

Table of Contents
  • Introduction
  • Related Work
  • Experiment Setup
  • Evaluation
  • Algorithms
  • Altering The Dataset
  • Self Defense
  • Conclusion
Did you ever use the term…
  • “Did you ever use the term Long Dong Silver in conversation with Professor Hill?”
    • Asked during Clarence Thomas’ confirmation hearing for the U.S. Supreme Court
    • His video rental history could have answered the question
    • Such records are not permissible in court (1988 Video Privacy Protection Act)
I wish I could get…
  • Tom Owad downloaded 260,000 Amazon wish lists
    • Flagged several “dangerous” books
    • Wish lists contained the owner’s name, city, and state
    • Using Yahoo! PeopleSearch, he found the complete address of about one in four wish-list owners
Some more examples
  • Blogs
  • Forums
  • Myspace
  • Netflix
  • AOL – oops
    • Released search logs of its users
    • Users could be identified from this data
You Are What You Say
  • People have several online identities
  • A sparse relation space: the movies, journal articles, or authors you mention
  • Those mentions can allow re-identification
  • Re-identification may also lead to other privacy violations
    • Discovery of name and address
    • Other unforeseen consequences
A lot of information about you is out there
  • Many people reveal their preferences to organizations
  • Organizations keep people’s preference, purchase, or usage data
  • People believe this information should be private
  • These datasets usually are private to the organization that collects them
Why do I get all that Spam?
  • Why doesn’t that information stay private?
    • Research groups demand data (AOL search logs!)
    • Pool or trade data for mutual benefit
    • Government agencies forced to release data
    • Sell data to other businesses.
    • Bankrupt businesses may be forced to sell data
Quasi-identifier
  • Even when obvious identifiers have been removed, a dataset may still contain a uniquely identifying quasi-identifier
    • 87% of the 248 million people in the 1990 U.S. census are likely to be uniquely identified based only on their 5-digit ZIP code, gender, and birth date
  • Quasi-identifiers can be linked to other databases
    • Example: the medical records of a former governor of Massachusetts were re-identified by linking public voter registration data to a database of supposedly anonymous medical records sold to industry
Basically…

DataSet 1 – sensitive information (medical history): Zip code, Gender, Birthday, Medical Record
DataSet 2 – identifying information (voter registration data): Name, Zip code, Gender, Birthday
DataSet 3 – privacy compromised (the linked result): Name, Zip code, Gender, Birthday, Medical Record

Joining DataSet 1 and DataSet 2 on the shared quasi-identifier (Zip code, Gender, Birthday) attaches names to medical records (see the sketch below).
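To make the linking concrete, here is a minimal, self-contained Python sketch of the join; all records and field names are invented toy data, not drawn from any real dataset:

```python
# Toy illustration of a quasi-identifier linking attack (invented records).

medical = [  # "anonymous" release: no names, but quasi-identifiers remain
    {"zip": "55455", "gender": "F", "birthday": "1971-07-31", "diagnosis": "asthma"},
    {"zip": "55414", "gender": "M", "birthday": "1965-02-13", "diagnosis": "diabetes"},
]

voters = [  # public voter registration: names plus the same quasi-identifiers
    {"name": "Alice Smith", "zip": "55455", "gender": "F", "birthday": "1971-07-31"},
    {"name": "Bob Jones",   "zip": "55414", "gender": "M", "birthday": "1965-02-13"},
]

def quasi_id(record):
    """The quasi-identifier shared by both datasets."""
    return (record["zip"], record["gender"], record["birthday"])

# Index the voter roll by quasi-identifier, then attach a name to each medical record.
names_by_qid = {quasi_id(v): v["name"] for v in voters}
linked = [
    {"name": names_by_qid[quasi_id(m)], "diagnosis": m["diagnosis"]}
    for m in medical
    if quasi_id(m) in names_by_qid
]
print(linked)  # names are now joined to diagnoses: privacy compromised
```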

What the paper does
  • Re-identifies users from a public web movie forum in a private movie-ratings dataset
  • Three major results:
    • Algorithms for re-identifying users
    • An evaluation of whether private dataset owners can protect user privacy by hiding data
    • An evaluation of two methods by which users in a public forum can protect their own privacy
The usefulness of re-identification
  • Why re-identification matters
    • The amount of data available electronically is increasing rapidly, and IR techniques can be applied to it
    • This creates serious privacy risks for users
  • Re-identification may also prove valuable
    • Identifying shills
    • Even fighting terrorism!
Linking People in Sparse Relation Spaces
  • Sparse relation spaces
    • Examples: purchase data, online music players, Wikipedia
    • They differ from traditional databases
  • Identified vs. non-identified datasets
  • Accessible vs. inaccessible datasets
    • Example: Amazon might re-identify customers on competitors’ websites by comparing their purchase history against reviews written on those sites, and decide to market (or withhold) special offers from them
Burning Questions
  • RISKS OF DATASET RELEASE: What are the risks to user privacy when releasing a dataset?
  • ALTERING THE DATASET: How can dataset owners alter the dataset they release to preserve user privacy?
  • SELF DEFENSE: How can users protect their own privacy?
Related Work
  • Studies [1][18]
    • Show that a large majority of Internet users are concerned about their privacy
  • Opinion mining
    • Novak et al. [10] investigated re-identifying multiple aliases of a user in a forum based on general properties of their post text
    • Marrying the paper’s re-identification algorithms to opinion-mining methods could improve their ability to re-identify people
Related Work (cont.)
  • Prior work identifies a number of ways to modify data to preserve privacy
    • Perturbing attribute values by adding random noise (Agrawal et al. [2])
  • Techniques for preserving k-anonymity (Sweeney [17])
    • Suppression (hiding data)
    • Generalization (reducing the fidelity of attribute values)
K-identification
  • K-anonymity: “A [dataset] release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release.”
  • K-identification: k-anonymity applied across two datasets
    • A measure of how well an algorithm can narrow each user in one dataset to one of k users in another dataset
    • If k is large, or if k is small and the k-identification rate is low, users can plausibly deny being identified
Experiment Setup
  • Offline experiments using two sparse relation spaces, both drawn from a snapshot of the MovieLens database (January 2006)
  • Goal: show that re-identification of users is possible using information from both datasets
  • Information available:
    • A set of movie ratings
    • A forum in which users reference movies
Experiment Setup (cont.)
  • MovieLens movie recommender:
    • a set of movie ratings – assigned by a user
    • a set of movie mentions – derived from the forum
Experiment Dataset
  • Drawn from posts in the MovieLens forums and from the MovieLens dataset of movie ratings
  • The ratings dataset includes:
    • 12,565,530 ratings
    • 140,132 users
    • 8,957 items
  • Users can make movie references while posting to the forum
    • Manually
    • Automatically
The Power Law
  • A typical and important feature of real-world sparse relation spaces
  • (Review) A sparse relation space:
    • A) relates people to items;
    • B) is sparse, having relatively few relationships recorded per person;
    • C) has a large space of items.
  • The data roughly follows a power law
    • In the ratings dataset: the distribution of ratings per user
    • In the mentions dataset: the distribution of mentions per user
Binning Strategy
  • Users are binned by their number of mentions; the bins contain similar numbers of users and have intuitive meaning (see the sketch below)
  • Hypothesis: how easily a user can be identified depends on the number of mentions
    • Users with more mentions disclose more information
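As a rough illustration of such a binning step, here is a small Python sketch that bins users by mention count into power-of-two ranges; the boundaries and data are assumptions for illustration, not the paper's actual bins:

```python
from collections import Counter, defaultdict

# mentions: (user_id, movie_id) pairs from the public forum (toy data)
mentions = [("u1", "m1"), ("u2", "m1"), ("u2", "m2"), ("u2", "m3"),
            ("u3", "m4"), ("u3", "m5"), ("u3", "m6"), ("u3", "m7"), ("u3", "m8")]

mention_counts = Counter(user for user, _ in mentions)

def bin_label(n):
    """Hypothetical power-of-two bins: 1, 2-3, 4-7, 8-15, ..."""
    low = 1
    while n >= 2 * low:
        low *= 2
    return "1" if low == 1 else f"{low}-{2 * low - 1}"

bins = defaultdict(list)
for user, n in mention_counts.items():
    bins[bin_label(n)].append(user)

print(dict(bins))  # e.g. {'1': ['u1'], '2-3': ['u2'], '4-7': ['u3']}
```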
Experiment Overview
  • Objective
  • Evaluation Criteria
  • Re-Identification Algorithms
  • Altering the Dataset
  • Self-Defense
Objective

Take a user from the public dataset of mentions and attempt to re-identify them within the private dataset of ratings.

PUBLIC (mentions)
  • jmetzger: movie2, movie6, movie4

PRIVATE (ratings)
  • member43: movie62, movie12
  • member65: movie4, movie2, movie6, movie15
  • member21: movie4, movie95, movie6

Which ratings user is jmetzger?
Evaluation Criteria - Overview
  • 133 users selected from the public mentions dataset to be target users. Each target user has at least one mention.
  • Users to be re-identified will reside in the private ratings dataset.
  • K-identification will be evaluated for k = 1, 5, 10, and 100.
Evaluation Criteria - K-Identification
  • K-identification measures how well an algorithm can narrow each user in a dataset to one of k users in another dataset
  • Let t be the target user and let j be the rank of t in the returned list. Then t is k-identified for every k ≥ j.
  • Note: for ties involving t, j is the highest (worst) rank among the tied users
  • The k-identification rate is the fraction of target users that are k-identified

Example:
  t = jmetzger
  M_t = {movie4, movie10, movie12}
  reIdentAlg(M_t) returns a list of ratings users ordered by likelihood of being t
  • If jmetzger's ratings identity, Member45, is returned at rank 4, then t is 4-identified and 5-identified
  • If jmetzger's ratings identity is tied with Member9 and Member94 at rank 3, then t is 3-identified, 4-identified, and 5-identified
Evaluation Criteria - K-Identification Rate

Example: evaluate for k = 1, 2, and 4, with four target users selected from the public dataset (t1 = Member4, t2 = Member1, t3 = Member5, t4 = Member8). Given each target user's rank in the returned list:

  • k = 1: k-identification rate = 2 / 4 = 50%
  • k = 2: k-identification rate = 2 / 4 = 50%
  • k = 4: k-identification rate = 3 / 4 = 75%

(A small sketch of this computation follows.)
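As a rough sketch of how the rate is computed, here is a self-contained Python example; the candidate lists, scores, and identities below are toy values chosen to reproduce the rates on this slide, not the paper's data:

```python
def rank_of_target(ranked_candidates, true_identity):
    """Rank j of the target's true ratings identity. For ties involving the
    target, j is the highest (worst) rank among the tied candidates."""
    scores = dict(ranked_candidates)
    if true_identity not in scores:
        return None                       # target not in the returned list
    tied_score = scores[true_identity]
    return sum(1 for _, s in ranked_candidates if s >= tied_score)

def k_identification_rate(results, k):
    """Fraction of target users whose rank j satisfies j <= k."""
    ranks = [rank_of_target(cands, truth) for cands, truth in results]
    return sum(1 for j in ranks if j is not None and j <= k) / len(results)

# Each entry: (ranked list of (candidate, score), the target's true identity).
results = [
    ([("Member4", 0.9), ("Member7", 0.4)], "Member4"),                    # j = 1
    ([("Member1", 0.8), ("Member3", 0.6)], "Member1"),                    # j = 1
    ([("Member9", 0.7), ("Member2", 0.5), ("Member5", 0.2)], "Member5"),  # j = 3
    ([("Member6", 0.6)], "Member8"),                                      # never found
]

for k in (1, 2, 4):
    print(k, k_identification_rate(results, k))   # 0.5, 0.5, 0.75
```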

Algorithms: Set Intersection - Basic Concept
  • Finds all users in the ratings dataset who rated every item mentioned by the target user t (see the sketch below)
  • Each returned user receives the same likeness score
  • The actual rating value given by a user is ignored
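A minimal Python sketch of this idea, assuming the ratings dataset is represented as a dict from ratings user to the set of movies they rated (toy data mirroring the Objective slide, not the paper's implementation):

```python
def set_intersection_candidates(target_mentions, ratings):
    """All ratings users who rated every movie the target mentioned.
    Every match gets the same likeness score; rating values are ignored."""
    wanted = set(target_mentions)
    return [user for user, rated in ratings.items() if wanted <= rated]

ratings = {
    "member43": {"movie62", "movie12"},
    "member65": {"movie4", "movie2", "movie6", "movie15"},
    "member21": {"movie4", "movie95", "movie6"},
}

print(set_intersection_candidates(["movie2", "movie6", "movie4"], ratings))
# ['member65'] -- the only user who rated every mentioned movie
```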
Algorithms: Set Intersection - Evaluation

(Figure: behavior at 1-identification)
  • Failure Scenarios
    • “Not Narrow” – more than k users matched
    • “No One Possible” – no users matched
    • “Misdirected” – users found, but none matched target user t
Algorithms: TF-IDF - Basic Concept
  • Unlike Set Intersection, can still match a target user who has mentioned items they did not rate
  • Desired properties
    • Users who rate more of the mentioned items score higher
      • Concerned with the ratings users whose ratings cover the most mentions
    • Users who rate rarely-rated mentioned movies score higher than users who rate commonly-rated mentioned movies
      • Concerned with the number of ratings users who rated a particular mentioned item
Algorithms: TF-IDF - Formula
  • Weight of movie m for user u:  w_um = tf_um × log2( |U| / |{u' ∈ U who rated m}| )
  • Similarity between target t and ratings user u (cosine similarity):  sim(t, u) = (w_t · w_u) / (|w_t| × |w_u|)
  • Term frequency tf_um for mentions users: 1 if the user mentioned m, 0 otherwise
  • Term frequency tf_um for ratings users: 1 if the user rated m, 0 otherwise
  • Notation: t = target user, u = user, m = movie, U = set of all users

(See the sketch below.)
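A small Python sketch of the formula above: binary term frequencies, an IDF weight from the number of ratings users who rated each movie, and cosine similarity between the target's mention vector and each ratings user's vector. The data and details (for example, how unrated mentions are handled) are assumptions for illustration:

```python
import math

ratings = {                      # ratings user -> set of rated movies (toy data)
    "member65": {"movie4", "movie2", "movie6", "movie15"},
    "member21": {"movie4", "movie95", "movie6"},
    "member43": {"movie62", "movie12"},
}
num_users = len(ratings)

df = {}                          # movie -> number of ratings users who rated it
for rated in ratings.values():
    for movie in rated:
        df[movie] = df.get(movie, 0) + 1

def idf(movie):
    """Movies rated by fewer users carry more weight."""
    return math.log2(num_users / df.get(movie, 1))

def tfidf_vector(movies):
    """Binary term frequency times IDF, over the movies a user mentioned or rated."""
    return {m: idf(m) for m in movies if m in df}

def cosine(a, b):
    dot = sum(w * b[m] for m, w in a.items() if m in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

target_mentions = {"movie2", "movie6", "movie4"}
t_vec = tfidf_vector(target_mentions)
scores = {u: cosine(t_vec, tfidf_vector(rated)) for u, rated in ratings.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))   # most likely identity first
```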

Algorithms: TF-IDF - Evaluation
  • Better performance than Set Intersection
  • Over-weighted any mention for a ratings user who had rated few movies
Algorithms: Scoring - Basic Concept
  • Emphasizes mentions of rarely-rated movies
  • De-emphasizes the number of ratings a user has
  • Assuming that scores are separable, sub-scores are calculated for each mention then multiplied to get an overall score
Algorithms: Scoring - Formula

  Sub-score:
    ss(u, m) = 1 - ( |{u' ∈ U who rated m}| - 1 ) / |U|   if u rated m
    ss(u, m) = 0.05                                       otherwise

  Score:
    s(u, t) = ∏ over m_i ∈ T of ss(u, m_i), where T is the set of movies mentioned by target t

  • The sub-score ss(u, m) gives more weight to rarely-rated movies
  • Users who rated more than 1/3 of the movies were discarded (12 users total)

(A minimal sketch follows.)
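A minimal Python sketch of this scoring scheme, using the sub-score and product form above; the toy data is invented, and the rule discarding the heaviest raters is noted but not applied:

```python
ratings = {                      # ratings user -> set of rated movies (toy data)
    "member65": {"movie4", "movie2", "movie6", "movie15"},
    "member21": {"movie4", "movie95", "movie6"},
    "member43": {"movie62", "movie12"},
}
num_users = len(ratings)
raters = {}                      # movie -> number of ratings users who rated it
for rated in ratings.values():
    for movie in rated:
        raters[movie] = raters.get(movie, 0) + 1

def sub_score(user, movie):
    """More weight for rarely-rated movies; a flat 0.05 for unrated mentions."""
    if movie in ratings[user]:
        return 1.0 - (raters.get(movie, 0) - 1) / num_users
    return 0.05

def score(user, target_mentions):
    s = 1.0
    for movie in target_mentions:
        s *= sub_score(user, movie)
    return s

# (The paper also discards ratings users who rated more than 1/3 of all movies.)
target_mentions = ["movie2", "movie6", "movie4"]
ranked = sorted(ratings, key=lambda u: -score(u, target_mentions))
print([(u, round(score(u, target_mentions), 4)) for u in ranked])
```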
Algorithms: Scoring - Evaluation
  • Outperformed TF-IDF
  • Including the heaviest-rating users reduced 1-identification performance
  • Using a flat sub-score of 1 for rated mentions reduced 1-identification performance
Algorithms: Scoring - With Ratings

  Sub-score:
    ss(u, m) = 1 - ( |{u' ∈ U who rated m}| - 1 ) / |U|   if u rated m and |r(u, m) - r(t, m)| ≤ δ
    ss(u, m) = 0.05                                       otherwise

  r(t, m) = rating given by target user t for mention m (mined from the post)
  r(u, m) = rating given by ratings user u for movie m

  • The mined rating value is used to restrict the scoring algorithm
  • Exact rating: δ = 0
  • Fuzzy rating: δ = 1

(A small sketch of the modified sub-score follows.)
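A hedged sketch of how the rating constraint changes only the condition in the sub-score; the function and argument names below are hypothetical, chosen for illustration rather than taken from the paper's code:

```python
def sub_score_with_ratings(user_rating, mined_rating, num_raters, num_users, delta):
    """Rating-constrained sub-score.
    user_rating:  this ratings user's rating of the mentioned movie (None if unrated)
    mined_rating: rating mined from the target's forum post (None if none was given)
    delta = 0 requires an exact rating match; delta = 1 allows a fuzzy match."""
    if (user_rating is not None and mined_rating is not None
            and abs(user_rating - mined_rating) <= delta):
        return 1.0 - (num_raters - 1) / num_users
    return 0.05

# Example: a movie rated by 120 of 140,132 users; the post implies a 4-star rating.
print(sub_score_with_ratings(4.0, 4.0, 120, 140132, delta=0))   # credited
print(sub_score_with_ratings(3.0, 4.0, 120, 140132, delta=0))   # 0.05 (no exact match)
print(sub_score_with_ratings(3.0, 4.0, 120, 140132, delta=1))   # credited (fuzzy match)
```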
Altering the Dataset - Overview
  • Question: How can dataset owners alter the dataset they release to preserve user privacy?
  • Suggestions
    • Perturbation
    • Generalization
    • Suppression
Altering the Dataset - Suppression
  • Drop rarely-rated movies
  • That is, drop all ratings of items whose number of ratings falls below a specified threshold (see the sketch below)
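A small sketch of what dataset-owner suppression might look like, assuming the released dataset is a list of (user, movie, rating) tuples and the threshold is a minimum number of ratings per movie (toy data):

```python
from collections import Counter

def suppress_rare_items(ratings, min_ratings):
    """Drop every rating of a movie that was rated fewer than min_ratings times."""
    counts = Counter(movie for _, movie, _ in ratings)
    return [(u, m, r) for (u, m, r) in ratings if counts[m] >= min_ratings]

ratings = [  # (user, movie, rating) toy data
    ("u1", "movie4", 4.0), ("u2", "movie4", 3.5), ("u3", "movie4", 5.0),
    ("u1", "movie62", 2.0),                       # rarely-rated movie
]
print(suppress_rare_items(ratings, min_ratings=2))
# movie62 disappears; only the widely-rated movie4 ratings remain
```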
Self Defense - Overview
  • Question: How can users protect their own privacy?
  • Suggestions
    • Suppression
    • Misdirection
Self Defense - Suppression
  • The same behavior seen in Altering the Dataset holds here
  • Workaround (sketched below):
    • Line up the items the user has both mentioned and rated
    • Order them by how many times each item has been rated overall (rarest first)
    • Suppress mentions of only the top portion of the list
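A hedged sketch of this workaround; the 25% cutoff and the data are illustrative assumptions, not values from the paper:

```python
def mentions_to_suppress(user_mentions, user_rated, rating_counts, fraction=0.25):
    """Return the mentions the user should withhold: the rarest-rated
    `fraction` of the movies they both mentioned and rated."""
    candidates = [m for m in user_mentions if m in user_rated]
    candidates.sort(key=lambda m: rating_counts.get(m, 0))   # rarest first
    cutoff = int(round(len(candidates) * fraction))
    return candidates[:cutoff]

# rating_counts: how many users rated each movie overall (toy data).
rating_counts = {"movie62": 3, "movie2": 40, "movie6": 1200, "movie4": 5000}
user_mentions = ["movie62", "movie2", "movie6", "movie4"]
user_rated = {"movie62", "movie2", "movie6", "movie4"}
print(mentions_to_suppress(user_mentions, user_rated, rating_counts))
# ['movie62'] -- withholding the rarest mentions hides the most identifying signal
```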
Self Defense - Misdirection
  • The user intentionally mentions items they have not rated (see the sketch below)
  • Procedure
    • A misdirection item list is created
      • Choose items rated more often than a threshold, ordered by increasing popularity, or
      • Choose items rated more often than a threshold, ordered by decreasing popularity
      • Thresholds vary from 1 to 8192, in powers of 2
    • Each user takes the first item from the list that they have not rated and mentions it
    • K-identification is re-computed
    • This is repeated for each k-identification level
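A minimal sketch of one misdirection step under these rules; the threshold, ordering, and data are toy values for illustration:

```python
def misdirection_list(rating_counts, threshold, increasing=True):
    """Movies with at least `threshold` ratings, ordered by popularity."""
    eligible = [m for m, n in rating_counts.items() if n >= threshold]
    return sorted(eligible, key=lambda m: rating_counts[m], reverse=not increasing)

def misdirect(user_rated, mis_list):
    """The user mentions the first list item they have NOT rated."""
    for movie in mis_list:
        if movie not in user_rated:
            return movie
    return None

rating_counts = {"movie62": 3, "movie2": 40, "movie6": 1200, "movie4": 5000}
mis_list = misdirection_list(rating_counts, threshold=32, increasing=True)
print(mis_list)                                   # ['movie2', 'movie6', 'movie4']
print(misdirect({"movie2", "movie4"}, mis_list))  # 'movie6' -- the decoy mention
```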
Conclusions
  • Re-identification in a sparse relation space can violate privacy
  • Relationships to items in a sparse relation space can be a quasi-identifier
  • As prevention, suppression of datasets is impractical
  • User-level misdirection does provide some anonymity at a fairly low cost
Future Work
  • How to determine that a user in one sparse dataset exists in another sparse dataset
  • Design a re-identifying algorithm that ignores the most popular mentions entirely
  • Construct an Intelligent Interface that helps people manage their privacy
  • If people were convinced to intentionally misdirect data, how would this change the nature of public discourse in sparse relation spaces?
Critiques
  • The paper should explain how user matches were determined
  • In the TF-IDF algorithm, the notation did not clearly distinguish which users the weights referred to
  • The graphs were very useful for understanding the behavior of the algorithms
References
  • [1] Ackerman, M. S., Cranor, L. F., and Reagle, J. 1999. Privacy in e-commerce: examining user scenarios and privacy preferences. In Proc. EC99, pp. 1-8.
  • [2] Agrawal, R. and Srikant, R. 2000. Privacy-preserving data mining. In Proc. SIGMOD00, pp. 439-450.
  • [3] Berkovsky, S., Eytani, Y., Kuflik, T., and Ricci, R. 2005. Privacy-Enhanced Collaborative Filtering. In Proc. User Modeling Workshop on Privacy-Enhanced Personalization.
  • [4] Canny, J. 2002. Collaborative filtering with privacy via factor analysis. In Proc. SIGIR02, pp. 238-245.
  • [5] Dave, K., Lawrence, S., and Pennock, D. M. 2003. Mining the peanut gallery: opinion extraction and semantic classi-fication of product reviews. In Proc. WWW03, pp. 519-528.
  • [6] Drenner, S., Harper, M., Frankowski, D., Terveen, L., and Riedl, J. 2006. Insert Movie Reference Here: A System to Bridge Conversation and Item-Oriented Web Sites. Accepted for Proc. CHI06.
  • [7] Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. 2002. Privacy preserving mining of association rules. In Proc. KDD02, pp. 217-228.
  • [8] Hong, J.I. and J.A. Landay. An Architecture for Privacy- Sensitive Ubiquitous Computing. In Mobisys04 pp. 177-
  • [9] Lam, S.K. and Riedl, J. 2004. Shilling recommender systems for fun and profit. In Proc. WWW04, pp. 393-402.
References (cont.)
  • [10] Novak, J., Raghavan, P., and Tomkins, A. 2004. Anti-aliasing on the Web. In Proc. WWW04, pp. 30-39.
  • [11] Pang, B., Lee, L., and Vaithyanathan, S. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proc. Empirical Methods in NLP, pp. 79-86.
  • [12] Polat, H., Du, W. 2003. Privacy-Preserving Collaborative Filtering Using Randomized Perturbation Techniques. ICDM03, p. 625.
  • [13] Ramakrishnan, N., Keller, B. J., Mirza, B. J., Grama, A. and Karypis, G. 2001. Privacy Risks in Recommender Systems. IEEE Internet Computing 5(6):54-62.
  • [14] Rizvi, S., and Haritsa, J. 2002. Maintaining Privacy in Association Rule Mining. In Proc. VLDB02, pp. 682-
  • [15] Sarwar, B. M., Karypis, G., Konstan, J. A., and Riedl, J. 2001. Item-based collaborative filtering recommendation algorithms. In Proc. WWW01.
  • [16] Sweeney, L. 2002. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5):571-588.
  • [17] Sweeney, L. 2002. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5):557-570.
  • [18] Taylor, H. 2003. Most People Are “Privacy Pragmatists.” Harris Poll #17. Harris Interactive (March 19, 2003).
  • [19] Terveen, L., et al. 1997. PHOAKS: a system for sharing recommendations. CACM 40(3):59-62.
  • [20] Verykios, V. S., et al. 2004. State-of-the-art in privacy preserving data mining. SIGMOD Rec. 33(1):50-57.