A Novel Two-Phase Sampling Algorithm for Efficient Association Rule Mining

Implementation of “A New Two-Phase Sampling Based Algorithm for Discovering Association Rules” CSCI 6405 Data Warehousing and Data Mining Tokunbo Makanju Adan Cosgaya Faculty of Computer Science Dalhousie University Fall 2005

Overview • Introduction • Algorithm • Data Preparation • Experimental Results • Conclusions • References

Introduction • Datasets keep growing in size • The time required to mine information from a dataset grows with its size • This creates demand for faster rule mining • Solution: mine a sample of the original dataset

Algorithm • FAST (Finding Associations from Sampled Transactions) • 2 versions • FAST-Trim • FAST-Grow • FAST outline: • Obtain a simple random sample S • Compute the support of each 1-itemset in S • Obtain a reduced sample S0 from S by either trimming S or growing S0 • Run a standard association-rule algorithm against S0
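The first two steps of the outline can be sketched in Python as follows. This is a minimal illustration, not the implementation presented in these slides: the function names `simple_random_sample` and `item_supports`, the `seed` parameter, and the toy database `D` are all assumptions made for the example.

```python
import random
from collections import Counter

def simple_random_sample(transactions, ratio, seed=0):
    """Draw a simple random sample (without replacement) of the given ratio."""
    rng = random.Random(seed)
    size = max(1, round(len(transactions) * ratio))
    return rng.sample(transactions, size)

def item_supports(transactions):
    """Return f(A; T) for each item A: the fraction of transactions containing A."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # count each item at most once per transaction
    return {item: c / n for item, c in counts.items()}

# Hypothetical toy database D of six transactions.
D = [frozenset(t) for t in (["a", "b"], ["a", "c"], ["a", "b", "c"],
                            ["b"], ["a"], ["c", "d"])]
S = simple_random_sample(D, ratio=0.5)  # step 1: simple random sample S
supports_S = item_supports(S)           # step 2: 1-itemset supports in S
```

The reduced sample S0 would then be derived from `S` using these supports, and a standard algorithm such as Apriori would be run on S0.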

Algorithm • Distance Functions • I1(T) = set of all 1-itemsets in transaction set T • L1(T) = set of frequent 1-itemsets in transaction set T • f(A;T) = support of itemset A in transaction set T
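The slide defines the notation but the distance formulas themselves are not shown in this text. The Python sketch below illustrates two distances that can be built from I1, L1 and f. Both are assumptions for illustration: `dist_symmetric` is a normalized symmetric difference of the frequent 1-itemset sets, and `dist_l2` is a sum of squared support differences. The function names and the `min_support` parameter are hypothetical.

```python
from collections import Counter

def item_supports(transactions):
    """f(A; T) for each item A in I1(T)."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return {item: c / n for item, c in counts.items()}

def frequent_items(transactions, min_support):
    """L1(T): items whose support in T is at least min_support."""
    return {a for a, f in item_supports(transactions).items() if f >= min_support}

def dist_symmetric(S0, S, min_support):
    """Assumed distance: |L1(S0) symmetric-difference L1(S)| / (|L1(S0)| + |L1(S)|)."""
    l0, l1 = frequent_items(S0, min_support), frequent_items(S, min_support)
    denom = len(l0) + len(l1)
    return len(l0 ^ l1) / denom if denom else 0.0

def dist_l2(S0, S):
    """Assumed distance: sum over items A in I1(S) of (f(A;S0) - f(A;S))^2."""
    f0, f1 = item_supports(S0), item_supports(S)
    return sum((f0.get(a, 0.0) - f1[a]) ** 2 for a in f1)

# Hypothetical toy data: S0 is a candidate reduced sample of S.
S = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b"}]
S0 = [{"a", "b"}, {"a", "c"}]
d1 = dist_symmetric(S0, S, min_support=0.5)  # 0.2: only item c differs
d2 = dist_l2(S0, S)                          # 0.1875
```

A smaller value means the item supports in S0 track those in S more closely, which is the property the reduced sample is meant to preserve.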

Algorithm • FAST-Grow Algorithm
obtain a simple random sample S from D;
compute f(A;S) for each item A in S;
set i = 0, S0(i) = ∅, minDist = ∞, and minStage = -1;
while (|S0(i)| < n) {
    divide S - S0(i) into disjoint groups of min(k, |S - S0(i)|) transactions each;
    for each group G {
        set S0(i) = S0(i) ∪ {t*}, where Dist(S0(i) ∪ {t*}, S) = min over t ∈ G of Dist(S0(i) ∪ {t}, S);
    }
    compute f(A; S0(i)) for each item A in S0(i);
    if (Dist(S0(i), S) < minDist) {
        set minDist := Dist(S0(i), S) and minStage := i;
    }
    set S0(i+1) := S0(i) and i := i + 1;
}
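The growing loop can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the implementation evaluated in these slides: it uses the fixed-size stopping criterion only (it stops at n transactions and omits the minDist/minStage bookkeeping), it assumes a squared-difference distance over item supports, and the name `fast_grow` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter

def fast_grow(S, n, k=10, seed=0):
    """Grow S0 to n transactions: from each group of up to k remaining
    transactions, add the one that keeps S0's item supports closest to S's."""
    rng = random.Random(seed)
    totals = Counter()
    for t in S:
        totals.update(t)
    target = {a: c / len(S) for a, c in totals.items()}  # f(A; S)

    def dist(counts, size):
        # Assumed distance: squared support differences over the items of S.
        return sum((counts.get(a, 0) / size - f) ** 2 for a, f in target.items())

    remaining = list(range(len(S)))  # indices of S - S0
    rng.shuffle(remaining)
    chosen, counts = [], Counter()   # S0 (as indices) and its item counts
    while len(chosen) < n and remaining:
        next_remaining = []
        for start in range(0, len(remaining), k):
            group = remaining[start:start + k]
            if len(chosen) >= n:     # S0 is full: keep the rest untouched
                next_remaining.extend(group)
                continue

            def score(idx):
                trial = counts.copy()
                trial.update(S[idx])
                return dist(trial, len(chosen) + 1)

            best = min(group, key=score)  # t* for this group
            chosen.append(best)
            counts.update(S[best])
            next_remaining.extend(i for i in group if i != best)
        remaining = next_remaining
    return [S[i] for i in chosen]

# Hypothetical toy sample S with 40 transactions.
S = [frozenset("ab")] * 20 + [frozenset("c")] * 20
S0 = fast_grow(S, n=4, k=10)
```

Keeping running item counts for S0 means each candidate is scored by updating a copy of the counts rather than rescanning S0, which is the main cost saving over recomputing supports from scratch for every candidate.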

Data Preparation • Downloaded from fimi.cs.helsinki.fi/data/accidents.pdf • The data source for this dataset is the National Institute of Statistics from the region of Flanders in Belgium. • In total 572 unique attribute values can be found in the dataset and an average of 45 attribute values are recorded for each accident.

Experimental Results • Dataset with 340,183 transactions • Initial sample S: 30% of the dataset • Final sample ratios of 2.5%, 5%, 7.5% and 10% • Parameters: • Minimum Support = 0.77% • Size of group k = 10

Conclusions • Mining a sample avoids processing the full input dataset • FAST-Grow can achieve high accuracy even with a small sampling ratio of 5-10% • The algorithm performs better with the fixed-size stopping criterion

References • [1] B. Chen, P. Haas, and P. Scheuermann. A new two-phase sampling based algorithm for discovering association rules. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. • [2] H. Brönnimann, B. Chen, P. Haas, M. Dash, Y. Qiao, and P. Scheuermann. Efficient Data-Reduction Methods for On-Line Association Rule Discovery. Presented at NSF Workshop on Next-Generation Data Mining (NGDM02), November 2002. • [3] K. Geurts. Traffic Accidents Data Set. fimi.cs.helsinki.fi/data/accidents.pdf. Last Access: 17/11/2005. • [4] C. Borgelt. GNU publicly available implementation of the Apriori algorithm. http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html. Last Access: 24/11/2005.

Thank you! Questions?
