Sampling Large Databases for Association Rules

Presented by Amitai Irron as part of the seminar “Data Structures and Algorithms for Massive Data Sets”, directed by Yossi Matias. Original paper by Hannu Toivonen.

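Basic Definitions
• $R = \{I_1, I_2, \ldots, I_m\}$ - the set of items sold in a supermarket.
• $r = \{t_1, t_2, \ldots, t_n\}$ - the collection of all the “baskets”; each row $t_i$ is a binary vector over $R$ (equivalently, a subset of $R$).
• Given an item set $X$, define its frequency in $r$ as: $\mathrm{fr}(X, r) = \frac{|\{t \in r : X \subseteq t\}|}{|r|}$
• Given a frequency threshold min_fr, define the collection of frequent sets in $r$ as: $F(r, \mathit{min\_fr}) = \{X \subseteq R : \mathrm{fr}(X, r) \ge \mathit{min\_fr}\}$
• Given two item sets $X$ and $Y$, define the confidence in the association $X \Rightarrow Y$ as: $\mathrm{conf}(X \Rightarrow Y, r) = \frac{\mathrm{fr}(X \cup Y, r)}{\mathrm{fr}(X, r)}$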
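The definitions above can be made concrete with a minimal sketch. The function names and the tiny basket data are illustrative assumptions, not part of the original slides; each basket is represented as a set of items rather than a binary vector.

```python
def frequency(itemset, rows):
    """fr(X, r): fraction of rows (baskets) that contain every item of X."""
    itemset = frozenset(itemset)
    return sum(1 for t in rows if itemset <= t) / len(rows)


def confidence(x, y, rows):
    """conf(X => Y, r) = fr(X ∪ Y, r) / fr(X, r)."""
    return frequency(set(x) | set(y), rows) / frequency(x, rows)


# Hypothetical baskets, used only for illustration.
rows = [
    frozenset({"beer", "chips"}),
    frozenset({"beer", "chips", "salsa"}),
    frozenset({"chips", "salsa"}),
    frozenset({"beer"}),
]

fr_beer_chips = frequency({"beer", "chips"}, rows)      # 2 of 4 baskets
conf_beer_chips = confidence({"beer"}, {"chips"}, rows)  # (2/4) / (3/4)
```

Here {beer, chips} occurs in two of the four baskets, and two of the three baskets containing beer also contain chips.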

Typical Algorithm Structure
• An algorithm for discovering association rules is usually divided into two phases:
  • Discovery of frequent sets.
  • Calculation of the confidence of each association rule $X \setminus Y \Rightarrow Y$, for every nonempty subset $Y$ of every frequent set $X$.
• Two widespread classes of algorithms, which differ in the way the first phase is performed, are:
  • Level-Wise Algorithms - Use K passes over the database (where K is the size of the largest frequent set we seek).
  • Partition Algorithms - Use 2 database passes. In the first pass, locally frequent sets are discovered in partitions of the database (in main memory). These are then used as candidates for the second pass.

Sampling for Frequent Sets
• As in the partition algorithm, we discover frequent sets in main memory.
• However, we use a single random sample, $s$, to compute a first approximation of the collection of frequent sets, so this step does not require a pass over the entire database.
• Sampling costs accuracy, so as a first precaution we use a lowered frequency threshold: low_fr < min_fr.
• Using a technique described later, we usually need only one full database pass to discover all frequent sets.
• During the full pass, it might become apparent that the approximation was not good enough, in which case a second full database pass is required.

The Negative Border
• Given a collection $S \subseteq 2^R$ that is closed with respect to set inclusion (i.e., for every $A$ in $S$, every subset $B \subseteq A$ is also in $S$), the negative border $Bd^-(S)$ consists of all the minimal item sets $X \subseteq R$ that are not in $S$.
• Example: If $R = \{A, B, C, D, E, F\}$ and $F(r, \mathit{min\_fr}) = \{\{A\}, \{B\}, \{C\}, \{F\}, \{A,B\}, \{A,C\}, \{A,F\}, \{C,F\}, \{A,C,F\}\}$, then the negative border is: $Bd^-(F(r, \mathit{min\_fr})) = \{\{B,C\}, \{B,F\}, \{D\}, \{E\}\}$. Consider, e.g., $\{B,C\}$: it is not in $F(r, \mathit{min\_fr})$, but all its subsets are, so it belongs to the negative border.
• The candidates in the level-wise algorithm are, in fact, the negative border of the frequent sets found so far.
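A small sketch of how the negative border could be computed directly from its definition. The function name and the input collection are assumptions made for illustration; the empty set is treated as always belonging to $S$.

```python
def negative_border(S, items):
    """Minimal item sets over `items` that are not in the closed collection S.

    Assumes S is closed under taking subsets; the empty set is added implicitly.
    """
    S = {frozenset(x) for x in S} | {frozenset()}
    border = set()
    for x in S:
        for item in items:
            if item in x:
                continue
            cand = x | {item}
            if cand in S:
                continue
            # cand is minimal iff every subset obtained by dropping one item is in S.
            if all(cand - {a} in S for a in cand):
                border.add(cand)
    return border


# Illustrative closed collection over R = {A, ..., F}.
S = ["A", "B", "C", "F", "AB", "AC", "AF", "CF", "ACF"]
border = negative_border(S, "ABCDEF")
```

Every minimal set outside $S$ is obtained by adding one item to some member of $S$, which is why extending the members of $S$ by a single item is enough to find the whole border.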

Algorithm 1 for Discovering Frequent Sets
• Algorithm 1 accepts a relation $r$ (“baskets”) over $R$ (supermarket items), a sample size ss, and two frequency thresholds: min_fr and low_fr. It draws a random sample $s$ of size ss, computes $S = F(s, \mathit{low\_fr})$ in main memory, and then evaluates the sets in $S \cup Bd^-(S)$ against $r$ in a single database pass.
• A miss is a set in $Bd^-(S)$ that is found to be frequent in $r$. A failure is when there is a frequent set that is not in $S \cup Bd^-(S)$.
• Failures are only possible when misses are present, so a second database pass will be required if and only if there are misses.
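The sampling algorithm described on this slide might be sketched as follows. This is an illustrative reading, not the original pseudocode: the helper names, the naive in-memory level-wise search, and the `rng` parameter are all assumptions.

```python
import random


def fr(x, rows):
    """Frequency of item set x in rows."""
    return sum(1 for t in rows if x <= t) / len(rows)


def frequent_sets(rows, threshold, items):
    """Naive level-wise search for all item sets with frequency >= threshold."""
    found = set()
    level = [frozenset([i]) for i in items]
    while level:
        level = [x for x in level if fr(x, rows) >= threshold]
        found.update(level)
        larger = {x | {i} for x in level for i in items if i not in x}
        level = [c for c in larger if all(c - {a} in found for a in c)]
    return found


def negative_border(S, items):
    """Minimal item sets not in the closed collection S (empty set implicit)."""
    S = set(S) | {frozenset()}
    cands = {x | {i} for x in S for i in items if i not in x} - S
    return {c for c in cands if all(c - {a} in S for a in c)}


def algorithm1(r, ss, min_fr, low_fr, rng):
    """Return (frequent sets found, misses); nonempty misses signal a possible failure."""
    items = sorted(set().union(*r))
    s = [rng.choice(r) for _ in range(ss)]   # sample with replacement
    S = frequent_sets(s, low_fr, items)      # mined in main memory
    border = negative_border(S, items)
    candidates = S | border
    counts = dict.fromkeys(candidates, 0)
    for t in r:                              # the single full database pass
        for x in candidates:
            if x <= t:
                counts[x] += 1
    F = {x for x, c in counts.items() if c / len(r) >= min_fr}
    return F, F & border


# Hypothetical data: rows are sets of single-letter items.
r = [frozenset(t) for t in ("AB", "AB", "ABC", "AC", "A", "BC", "B", "ABC")]
F, misses = algorithm1(r, ss=50, min_fr=0.4, low_fr=0.3, rng=random.Random(1))
```

If `misses` is empty, `F` is exactly the collection of frequent sets of `r`; otherwise some frequent sets may lie outside the evaluated candidates.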

A Second Database Pass
• We would like to be able to calculate $S$ such that, with high probability, all the sets that are frequent in $r$ are present in $S$.
• We will present an approach that guarantees that in a $1 - \delta$ fraction of the cases, a second database pass will not be required.
• When a second pass is needed, Algorithm 2 is used. It accepts a relation $r$ (“baskets”) over $R$ (supermarket items), a frequency threshold min_fr, and a subset $S$ of $F(r, \mathit{min\_fr})$ (typically generated by Algorithm 1).

Analysis of Sampling
• We consider sampling with replacement, to avoid referring to the database size.
• We denote the absolute error in the estimated frequency by: $e(X, s) = |\mathrm{fr}(X, r) - \mathrm{fr}(X, s)|$.
• For every item set $X$ we denote by $m(X, s)$ the number of rows containing $X$ in the sample $s$. $m(X, s)$ is a binomial random variable, and its distribution is defined by: $\Pr[m(X, s) = c] = \binom{|s|}{c} \, \mathrm{fr}(X, r)^{c} \, (1 - \mathrm{fr}(X, r))^{|s| - c}$
• We consider the minimum sample size that satisfies the requirements on the size of the error.
• Theorem 1: Given an item set $X$ and a random sample $s$ of size $|s| \ge \frac{1}{2\varepsilon^2} \ln \frac{2}{\delta}$, the probability that $e(X, s) > \varepsilon$ is at most $\delta$.
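A short derivation of the sample-size bound in Theorem 1, written here as a sketch that assumes the standard two-sided Hoeffding (Chernoff-type) inequality for the mean of $|s|$ independent 0/1 trials; the slides do not show the proof, so the steps below are a reconstruction.

```latex
% fr(X, s) = m(X, s) / |s| is the mean of |s| independent indicator variables,
% each equal to 1 with probability fr(X, r). Hoeffding's inequality gives
\Pr\bigl[\, |\mathrm{fr}(X, s) - \mathrm{fr}(X, r)| > \varepsilon \,\bigr]
  \;\le\; 2 e^{-2 \varepsilon^{2} |s|} .
% Requiring the right-hand side to be at most \delta and solving for |s|:
2 e^{-2 \varepsilon^{2} |s|} \le \delta
  \;\Longleftrightarrow\;
  |s| \;\ge\; \frac{1}{2 \varepsilon^{2}} \ln \frac{2}{\delta} .
```

The bound does not depend on the database size or on $\mathrm{fr}(X, r)$, which is why sampling with replacement is convenient for the analysis.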

Required Sample Size
• Theorem 1 can be tabulated as required sample sizes $|s|$ for chosen values of $\varepsilon$ and $\delta$. What a row of such a table means is that if a chance of $\delta$ for an error of more than $\varepsilon$ is acceptable, then a sample of size $|s|$ is large enough.
• This result applies to a given item set $X$. For the result to apply to all item sets in a collection, we use the following corollary:
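A minimal sketch that evaluates the Theorem 1 bound $|s| \ge \frac{1}{2\varepsilon^2} \ln \frac{2}{\delta}$ for a few parameter choices. The function name and the particular values of $\varepsilon$ and $\delta$ are illustrative assumptions, not values taken from the slides.

```python
import math


def sample_size(eps, delta):
    """Smallest integer |s| with |s| >= ln(2 / delta) / (2 * eps**2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


# Tabulate the bound for some illustrative error / probability pairs.
for eps in (0.01, 0.001):
    for delta in (0.01, 0.001, 0.0001):
        print(f"eps={eps:<6} delta={delta:<7} |s| >= {sample_size(eps, delta)}")
```

The required size grows quadratically as $\varepsilon$ shrinks but only logarithmically as $\delta$ shrinks, so tightening the error bound is far more expensive than tightening the probability.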

The Probability of a Miss
• Corollary 2: Given a collection $S$ of item sets and a random sample $s$ of size $|s| \ge \frac{1}{2\varepsilon^2} \ln \frac{2|S|}{\delta}$, the probability that there is an item set $X \in S$ such that $e(X, s) > \varepsilon$ is at most $\delta$. Proof: By Theorem 1, for each single $X$ the probability that $e(X, s) > \varepsilon$ is at most $\delta / |S|$. Since there are $|S|$ such sets, the probability in question is at most $\delta$.
• This means that when the previous table is used for a whole collection $S$, the per-set probability must be taken as $\delta / |S|$.

Bounding the Probability of a Failure
• The probability of a failure is estimated by computing an upper bound for the probability of a miss.
• Theorem 3: Given a frequent set $X$, a random sample $s$ and a probability parameter $\delta$, the probability that $X$ is a miss is at most $\delta$ when: $\mathit{low\_fr} < \mathit{min\_fr} - \sqrt{\frac{1}{2|s|} \ln \frac{1}{\delta}}$
• The probability of $l$ independent misses is at most $\delta^l$.
• A significant set of dependent misses, by its nature, is not likely to occur in the supermarket domain.
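A minimal sketch of how the lowered threshold could be chosen from the Theorem 3 bound, $\mathit{low\_fr} < \mathit{min\_fr} - \sqrt{\frac{1}{2|s|} \ln \frac{1}{\delta}}$. The function names and the numbers in the example are illustrative assumptions.

```python
import math


def threshold_margin(sample_size, delta):
    """The amount sqrt(ln(1 / delta) / (2 |s|)) by which min_fr must be lowered."""
    return math.sqrt(math.log(1 / delta) / (2 * sample_size))


def lowered_threshold(min_fr, sample_size, delta):
    """Upper limit for low_fr; low_fr should be chosen strictly below this value."""
    return min_fr - threshold_margin(sample_size, delta)


# Hypothetical setting: 20,000 sampled rows, miss probability 0.001 per frequent set.
margin = threshold_margin(20_000, 0.001)
limit = lowered_threshold(0.02, 20_000, 0.001)
```

A larger sample shrinks the margin, so low_fr can stay closer to min_fr and fewer spurious candidates have to be evaluated in the full pass.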

Open Issues
• The sample has to be big enough for good results, but we don’t want to read most of the database from disk in order to sample (this depends on the physical storage).
• Is there a method for discovering exact association rules using at most one pass over the database?
• How else can sampling be used in Data Mining?
