Privacy Preserving Market Basket Data Analysis

Ling Guo, Songtao Guo, Xintao Wu

University of North Carolina at Charlotte

Market Basket Data

1: presence, 0: absence

  • Association rule $X \Rightarrow Y$ (R. Agrawal, SIGMOD 1993)
    • with support $s = P(XY)$ and confidence $c = P(XY)/P(X)$
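
As a minimal sketch of the two definitions above, the snippet below computes support and confidence from a 0/1 transaction matrix. The function names and the toy data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def support(data, items):
    """Fraction of transactions (rows) that contain every item in `items` (column indices)."""
    return float(np.all(data[:, items] == 1, axis=1).mean())

def confidence(data, antecedent, consequent):
    """Estimate of P(consequent | antecedent) = supp(A and B) / supp(A)."""
    supp_a = support(data, antecedent)
    if supp_a == 0:
        raise ValueError("antecedent never occurs in the data")
    return support(data, antecedent + consequent) / supp_a

# Toy data (made up): rows are transactions, columns are items A and B.
baskets = np.array([[1, 1], [1, 0], [1, 1], [0, 1], [0, 0]])
s = support(baskets, [0, 1])        # 2 of 5 baskets contain both items
c = confidence(baskets, [0], [1])   # 2 of the 3 baskets with A also contain B
```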
Other measures

2 x 2 contingency table

Objective measures for A=>B
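
The slide's table of measures did not survive extraction. As a hedged illustration only, the sketch below builds the 2 x 2 contingency table for two binary items and computes two commonly used objective measures, the chi-square statistic and lift. The choice of these two measures and all names are assumptions.

```python
import numpy as np

def contingency_2x2(a, b):
    """Counts [[n11, n10], [n01, n00]] for two binary item vectors a and b."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return np.array([[np.sum(a & b), np.sum(a & ~b)],
                     [np.sum(~a & b), np.sum(~a & ~b)]])

def chi_square(table):
    """Pearson chi-square statistic of a 2 x 2 table of counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return float(((table - expected) ** 2 / expected).sum())

def lift(table):
    """P(AB) / (P(A) P(B)); a value of 1 indicates independence."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    return float(table[0, 0] * n / (table[0].sum() * table[:, 0].sum()))

# Toy item vectors (made up) for items A and B.
table = contingency_2x2([1, 1, 1, 0, 0], [1, 0, 1, 1, 0])
```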

Related Work
  • Privacy preserving association rule mining
    • Data swapping
    • Frequent itemset or rule hiding
    • Inverse frequent itemset mining
    • Item randomization
Item Randomization

Original Data → Randomized Data

  • To what extent does randomization affect mining results? (Focus)
  • To what extent does it protect privacy?
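
One way to picture item randomization is the sketch below, which assumes each 0/1 entry is kept with a per-item probability and flipped otherwise. The function name and the per-item probabilities are hypothetical; the slides do not fix a particular implementation.

```python
import numpy as np

def randomize_items(data, p, rng):
    """Keep each 0/1 entry of column j with probability p[j]; flip it otherwise."""
    data = np.asarray(data)
    # One uniform draw per cell; p broadcasts across the item columns.
    keep = rng.random(data.shape) < np.asarray(p)
    return np.where(keep, data, 1 - data)

rng = np.random.default_rng(0)                      # seed chosen arbitrarily
original = np.array([[1, 0], [0, 1], [1, 1]])
randomized = randomize_items(original, [0.9, 0.8], rng)
```

The data owner would release only `randomized`; the closer each probability is to 0.5, the less each released value reveals about the original.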
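Randomized Response (Stanley Warner, JASA 1965)

$A$: Cheated in the exam; $\bar{A}$: Didn't cheat in the exam

Purpose: Get the proportion ($\pi_A$) of population members that cheated in the exam.

  • Procedure: a randomization device asks each respondent, whether they cheated or not, one of two questions, and the respondent answers "Yes" or "No":
    • Do you belong to $A$? (with probability $p$)
    • Do you belong to $\bar{A}$? (with probability $1-p$)

Unbiased estimate of $\pi_A$ is: $\hat{\pi}_A = \dfrac{\hat{\lambda} - (1-p)}{2p-1}$, where $\hat{\lambda}$ is the observed proportion of "Yes" answers and $p \neq 1/2$.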
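
The procedure can be simulated in a few lines. This is a sketch under the assumption that respondents answer the selected question truthfully; the function names and parameter values are illustrative.

```python
import random

def warner_estimate(answers, p):
    """Warner's estimate of the proportion in A from randomized answers (True = "Yes")."""
    if p == 0.5:
        raise ValueError("p = 0.5 makes the proportion unidentifiable")
    lam_hat = sum(answers) / len(answers)   # observed share of "Yes" answers
    return (lam_hat - (1 - p)) / (2 * p - 1)

def simulate_answers(true_pi, p, n, seed=0):
    """Simulate n truthful respondents facing the randomization device."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n):
        in_a = rng.random() < true_pi       # does this respondent belong to A?
        asked_a = rng.random() < p          # which question did the device pick?
        answers.append(in_a if asked_a else not in_a)
    return answers

estimate = warner_estimate(simulate_answers(true_pi=0.3, p=0.8, n=20000), p=0.8)
```

With these settings the estimate should land close to the true proportion of 0.3, even though no individual answer reveals whether that respondent cheated.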

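Application of RR in MBD
  • RR can be expressed in matrix form (0: No, 1: Yes): $\lambda = P\pi$ with $P = \begin{pmatrix} p & 1-p \\ 1-p & p \end{pmatrix}$, where $\pi = (\pi_0, \pi_1)'$ holds the original proportions and $\lambda = (\lambda_0, \lambda_1)'$ the proportions of randomized responses.

  • Extension to multiple variables

e.g., for 2 variables: $\lambda = (P_1 \otimes P_2)\,\pi$

  • Unbiased estimate of $\pi$ is: $\hat{\pi} = P^{-1}\hat{\lambda}$, with estimated covariance $(n-1)^{-1} P^{-1} (\hat{\lambda}^{\delta} - \hat{\lambda}\hat{\lambda}') (P^{-1})'$, where $P = P_1 \otimes P_2$ and $n$ is the number of transactions.

$\otimes$ stands for Kronecker product

$\hat{\lambda}^{\delta}$: diagonal matrix with elements $\hat{\lambda}$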
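
A sketch of the two-variable case follows. It assumes each item is randomized independently with a symmetric matrix [[p, 1-p], [1-p, p]] and that cells are ordered (00, 01, 10, 11); both conventions and all names are assumptions made for illustration.

```python
import numpy as np

def distortion(p):
    """Symmetric 2 x 2 distortion matrix: keep a value with probability p."""
    return np.array([[p, 1 - p], [1 - p, p]])

def estimate_joint(lam_hat, p1, p2):
    """Estimate original cell proportions from randomized ones by inverting P1 (x) P2."""
    P = np.kron(distortion(p1), distortion(p2))
    return np.linalg.solve(P, lam_hat)

def covariance_estimate(lam_hat, p1, p2, n):
    """Estimated covariance of the inverted proportions, based on n randomized records."""
    lam_hat = np.asarray(lam_hat, dtype=float)
    P_inv = np.linalg.inv(np.kron(distortion(p1), distortion(p2)))
    middle = np.diag(lam_hat) - np.outer(lam_hat, lam_hat)
    return P_inv @ middle @ P_inv.T / (n - 1)

# Made-up original proportions, pushed through the randomization and recovered.
pi = np.array([0.4, 0.1, 0.2, 0.3])
lam = np.kron(distortion(0.9), distortion(0.8)) @ pi
pi_hat = estimate_joint(lam, 0.9, 0.8)
```

In practice `lam` would be replaced by the cell proportions observed in the randomized data, so `pi_hat` would carry sampling error described by `covariance_estimate`.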

Randomization example

A: Milk, B: Cereals

Data owners: Original Data → Randomized Data → Data miners

We can get the estimate, but how accurate is it?

[Table: original values vs. estimated values. Labels: "Frequent set", "Not frequent set", "Both are frequent set"]

Rule 6 is falsely recognized from its estimated value!

[Figure: lower and upper bounds. Labels: "Frequent set with high confidence", "Frequent set without confidence"]


Accuracy on Support S
  • Estimate of support
  • Variance of support
  • Interquantile range (normal dist.)
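
The slide's formulas are not recoverable, so the following is only a sketch for the simplest case: the support of a single item randomized with a symmetric keep-probability p. It assumes a normal approximation for the range; the function name and the numbers are illustrative.

```python
from statistics import NormalDist

def support_interval(lam_hat, p, n, alpha=0.05):
    """Return (estimate, variance, (low, high)) for the support of one item.

    lam_hat: proportion of 1s observed in the randomized column
    p:       probability that a value was kept (p != 0.5)
    n:       number of randomized transactions (n > 1)
    """
    scale = 2 * p - 1
    s_hat = (lam_hat - (1 - p)) / scale
    var_hat = lam_hat * (1 - lam_hat) / ((n - 1) * scale ** 2)
    z = NormalDist().inv_cdf(1 - alpha / 2)     # two-sided normal quantile
    half = z * var_hat ** 0.5
    return s_hat, var_hat, (s_hat - half, s_hat + half)

s_hat, var_hat, (low, high) = support_interval(lam_hat=0.38, p=0.8, n=10001)
```

The division by (2p - 1) squared shows why the range widens as p approaches 0.5: stronger randomization buys privacy at the cost of accuracy.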




Accuracy on Confidence C
  • Estimate of confidence A =>B
  • Variance of confidence
  • Interquantile range (ratio dist. is F(w))
    • Loose range derived from Chebyshev's theorem

Chebyshev's theorem: Let $X$ be a random variable with expected value $\mu$ and finite variance $\sigma^2$. Then for any real $k > 0$, $\Pr(|X - \mu| \ge k\sigma) \le 1/k^2$.
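
To see how loose the Chebyshev range is, the sketch below sets 1/k^2 equal to a tolerance alpha and compares the resulting range with a normal-based one. The function names and example numbers are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def chebyshev_interval(estimate, variance, alpha=0.05):
    """Range holding with probability at least 1 - alpha for any distribution."""
    k = 1 / sqrt(alpha)                 # from Pr(|X - mu| >= k*sigma) <= 1/k^2 = alpha
    half = k * sqrt(variance)
    return estimate - half, estimate + half

def normal_interval(estimate, variance, alpha=0.05):
    """Range under a normality assumption, shown for comparison."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(variance)
    return estimate - half, estimate + half

cheb = chebyshev_interval(0.6, 0.0004)
norm = normal_interval(0.6, 0.0004)
```

At alpha = 0.05 the Chebyshev multiplier is about 4.47 standard deviations versus about 1.96 for the normal range, which is the price of making no distributional assumption.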

General Framework
  • Step 1: Estimation
    • Express the measure as a function of the observed variables ($\pi_{ij}$, or their marginal totals $\pi_{i+}$, $\pi_{+j}$).
    • Compute the estimated measure value.
  • Step 2: Variance of the estimated measure
    • Get the variance of the estimated measure (a function of multiple known variables) through Taylor approximation.
  • Step 3: Derive the interquantile range through Chebyshev's theorem.
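
The three steps can be sketched generically as follows. This is an illustration under assumptions: the gradient is taken numerically with central differences (a first-order Taylor approximation), the cell ordering is (00, 01, 10, 11), and the function names are hypothetical.

```python
import numpy as np

def measure_range(f, pi_hat, cov_hat, alpha=0.05, eps=1e-6):
    """Steps 1-3: estimate f, approximate its variance, derive a Chebyshev range."""
    pi_hat = np.asarray(pi_hat, dtype=float)
    value = f(pi_hat)                                   # Step 1: plug-in estimate
    grad = np.empty_like(pi_hat)
    for i in range(pi_hat.size):                        # numerical gradient of f
        step = np.zeros_like(pi_hat)
        step[i] = eps
        grad[i] = (f(pi_hat + step) - f(pi_hat - step)) / (2 * eps)
    variance = float(grad @ np.asarray(cov_hat) @ grad)  # Step 2: first-order Taylor
    half = np.sqrt(variance / alpha)                    # Step 3: k*sigma with k = alpha**-0.5
    return value, variance, (value - half, value + half)

def confidence_a_to_b(pi):
    """Confidence of A => B from cells (pi00, pi01, pi10, pi11)."""
    return pi[3] / (pi[2] + pi[3])

# Made-up estimates and a made-up diagonal covariance, for illustration only.
value, variance, bounds = measure_range(
    confidence_a_to_b, [0.4, 0.1, 0.2, 0.3], np.diag([1e-4] * 4))
```

Any measure expressible as a function of the cell proportions can be passed as `f`; only the covariance of the estimated proportions needs to be supplied.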
Example for a measure with two variables
  • Step 1: Get the estimate of the measure.
  • Step 2: Get the variance of the estimated measure.
  • Step 3: Derive the interquantile range through Chebyshev's theorem.


Accuracy Bounds
  • With an unknown distribution, Chebyshev's theorem only gives loose bounds.

[Figure: bounds of the support vs. varying p]

  • All the above discussions assume the distortion matrices P are known to data miners
    • P could be exploited by attackers to improve the posterior probability of their predictions on sensitive items
  • How about not releasing P?
    • Disclosure risk is decreased
    • What happens to the data mining results?
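
The disclosure concern can be made concrete with Bayes' rule. The sketch assumes an attacker who knows the keep-probability p of a symmetric randomization and holds a prior belief that an item is present; the function name and numbers are illustrative.

```python
def posterior_present(prior, p, observed):
    """P(original item = 1 | randomized value = observed), given keep-probability p."""
    like_1 = p if observed == 1 else 1 - p      # P(observed | original = 1)
    like_0 = 1 - p if observed == 1 else p      # P(observed | original = 0)
    return like_1 * prior / (like_1 * prior + like_0 * (1 - prior))

# Attacker with prior 0.3 who sees a randomized 1.
with_p = posterior_present(prior=0.3, p=0.9, observed=1)
no_info = posterior_present(prior=0.3, p=0.5, observed=1)
```

Knowing p = 0.9 lifts the attacker's belief from 0.3 to roughly 0.79, whereas p = 0.5 leaves the prior unchanged. Without knowledge of p the attacker cannot carry out this update.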
Unknown distortion P
  • Some measures have monotonic properties
  • Other measures don’t have such properties
Applications: hypothesis test
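  • From the randomized data, if we discover an itemset whose dependence test statistic (e.g., $\chi^2$) exceeds the critical value, we can guarantee that dependence exists among the original itemset, since the statistic on the original data is no smaller than the one on the randomized data.

We are still able to derive the strongly dependent itemsets from the randomized data.

No false positives.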
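
As a hedged illustration of the monotonic property, the sketch below compares the chi-square statistic of made-up original cell proportions with that of their expected randomized counterparts. It assumes independent symmetric randomization per item and works with expected proportions rather than sampled data, so it shows the population-level effect only.

```python
import numpy as np

def chi_square(cells, n):
    """Chi-square statistic for 2 x 2 cell proportions (00, 01, 10, 11) and n records."""
    t = np.asarray(cells, dtype=float).reshape(2, 2)
    expected = np.outer(t.sum(axis=1), t.sum(axis=0))   # product of the marginals
    return float(n * ((t - expected) ** 2 / expected).sum())

def randomized_cells(cells, p1, p2):
    """Expected cell proportions after keeping item 1 with prob. p1 and item 2 with prob. p2."""
    def distortion(p):
        return np.array([[p, 1 - p], [1 - p, p]])
    return np.kron(distortion(p1), distortion(p2)) @ np.asarray(cells, dtype=float)

original = [0.4, 0.1, 0.2, 0.3]                         # made-up proportions
chi_ori = chi_square(original, n=1000)
chi_ran = chi_square(randomized_cells(original, 0.9, 0.8), n=1000)
```

Here `chi_ran` comes out below `chi_ori`, so an itemset that passes the test on the randomized proportions would also pass on the original ones.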

  • Proposed a general approach to deriving accuracy bounds of various measures adopted in MBD analysis.
  • Proved that some measures have a monotonic property, so some data mining tasks can be conducted directly on randomized data (without knowing the distortion). No false positive pattern exists in the mining result.
Future Work
  • Which measures are more sensitive to randomization?
  • The tradeoff between the privacy of individual data and the accuracy of data mining results
  • Accuracy vs. disclosure analysis for general categorical data
  • NSF IIS-0546027
  • Ph.D. students: Ling Guo, Songtao Guo