This paper presents the FAST algorithm, a new two-phase sampling approach designed to enhance the efficiency of association rule mining from increasingly large datasets. By mining a representative sample of the original dataset, the algorithm significantly reduces processing time while maintaining accuracy. The study includes data preparation from a traffic accidents dataset, detailed algorithmic procedures, and experimental results showing successful performance with reduced sample sizes. Our findings indicate that even with small sampling ratios, the FAST-Grow variant can yield high accuracy and efficiency.
Implementation of “A New Two-Phase Sampling Based Algorithm for Discovering Association Rules” CSCI 6405 Data Warehousing and Data Mining Tokunbo Makanju Adan Cosgaya Faculty of Computer Science Dalhousie University Fall 2005
Overview • Introduction • Algorithm • Data Preparation • Experimental Results • Conclusions • References
Introduction • Datasets are getting larger • The time required to mine information from a dataset increases with its size • Demand for faster rule mining • Solution: mine a sample of the original dataset
Algorithm • FAST (Finding Associations from Sampled Transactions) • 2 versions • FAST-Trim • FAST-Grow • FAST outline: • Obtain a simple random sample S from the dataset • Compute the frequency of each 1-itemset in S • Obtain a reduced sample S0 from S, either by trimming S or by growing S0 • Run a standard association-rule algorithm against S0
Algorithm • Distance Functions • I1(T) = set of all 1-itemsets in transaction set T • L1(T) = set of frequent 1-itemsets in transaction set T • f(A;T) = support of itemset A in transaction set T
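The three definitions above can be sketched in a few lines of Python. The distance formulas themselves did not survive in the slide text, so the `dist_l2` function below is an assumption made for illustration: a sum of squared differences in 1-itemset support, which is one plausible way to compare a candidate sample with S. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set

Transaction = FrozenSet[int]


def support(transactions: List[Transaction]) -> Dict[int, float]:
    """f(A;T) for every 1-itemset A: fraction of transactions in T that contain A."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter(item for t in transactions for item in t)
    return {item: c / n for item, c in counts.items()}


def all_items(transactions: List[Transaction]) -> Set[int]:
    """I1(T): every item that occurs in at least one transaction of T."""
    return set(support(transactions))


def frequent_items(transactions: List[Transaction], min_support: float) -> Set[int]:
    """L1(T): items whose support in T reaches min_support."""
    return {a for a, f in support(transactions).items() if f >= min_support}


def dist_l2(s0: List[Transaction], s: List[Transaction]) -> float:
    """ASSUMED distance (not from the slide): squared support gap summed over I1(S)."""
    f0, f = support(s0), support(s)
    return sum((f0.get(a, 0.0) - fa) ** 2 for a, fa in f.items())
```

A distance of zero means the candidate sample reproduces every 1-itemset support of S exactly; larger values mean the sample is less representative.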
Algorithm • FAST-Grow Algorithm
  obtain a simple random sample S from D;
  compute f(A;S) for each 1-itemset A ∈ I1(S);
  set i = 0, S0(i) = ∅, minDist = ∞, and minStage = -1;
  while (|S0(i)| < n) {
    divide S − S0(i) into disjoint groups of min(k, |S − S0(i)|) transactions each;
    for each group G {
      set S0(i) = S0(i) ∪ {t*}, where Dist(S0(i) ∪ {t*}, S) = min over t ∈ G of Dist(S0(i) ∪ {t}, S);
    }
    compute f(A; S0(i)) for each 1-itemset A ∈ I1(S0(i));
    if (Dist(S0(i), S) < minDist) {
      set minDist := Dist(S0(i), S) and minStage := i;
    }
    set S0(i+1) := S0(i) and i := i + 1;
  }
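A minimal Python sketch of the growing phase may help make the loop concrete. It is not the authors' implementation. It assumes the fixed-size stopping criterion (stop once S0 holds n transactions), omits the minDist/minStage bookkeeping, and uses a squared-difference distance over 1-itemset supports because the slide text does not preserve the Dist definition. The names `fast_grow`, `item_counts` and `dist_if_added` are hypothetical.

```python
import random
from collections import Counter
from typing import FrozenSet, List

Transaction = FrozenSet[int]


def item_counts(transactions: List[Transaction]) -> Counter:
    """Number of transactions containing each item."""
    counts: Counter = Counter()
    for t in transactions:
        counts.update(t)
    return counts


def fast_grow(sample: List[Transaction], n: int, k: int, seed: int = 0) -> List[Transaction]:
    """Grow S0 out of `sample` (S) until it holds n transactions or S is exhausted."""
    rng = random.Random(seed)
    # f(A;S) for every 1-itemset A in S: the supports S0 should reproduce.
    target = {a: c / len(sample) for a, c in item_counts(sample).items()}
    remaining = list(range(len(sample)))  # indices of S - S0
    rng.shuffle(remaining)
    chosen: List[int] = []                # indices of S0
    counts: Counter = Counter()           # item counts inside S0

    def dist_if_added(t: Transaction) -> float:
        # ASSUMED distance: squared support gap if t were added to S0.
        size = len(chosen) + 1
        return sum(((counts[a] + (a in t)) / size - fa) ** 2 for a, fa in target.items())

    while len(chosen) < n and remaining:
        # One stage: split S - S0 into groups of at most k, take the best t* from each.
        groups = [remaining[i:i + k] for i in range(0, len(remaining), k)]
        for group in groups:
            if len(chosen) >= n:
                break
            best = min(group, key=lambda idx: dist_if_added(sample[idx]))
            chosen.append(best)
            counts.update(sample[best])
        picked = set(chosen)
        remaining = [i for i in remaining if i not in picked]
    return [sample[i] for i in chosen]
```

Each candidate is scored by recomputing the distance over all items, which is simple but slow; an implementation meant for the full dataset would update the distance incrementally as transactions are added.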
Data Preparation • Downloaded from fimi.cs.helsinki.fi/data/accidents.pdf • The data source for this dataset is the National Institute of Statistics from the region of Flanders in Belgium. • In total 572 unique attribute values can be found in the dataset and an average of 45 attribute values are recorded for each accident.
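Datasets in the FIMI repository are distributed as plain text, one transaction per line, with items written as space-separated integers. A small loader along the following lines would turn such a file into transactions; the function names and the file name in the usage note are hypothetical, and the slides do not describe the authors' own preprocessing.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[int]


def parse_transactions(lines: Iterable[str]) -> List[Transaction]:
    """Parse FIMI-style lines ('12 45 301 ...') into sets of item ids; skip blank lines."""
    transactions: List[Transaction] = []
    for line in lines:
        items = line.split()
        if items:
            transactions.append(frozenset(int(x) for x in items))
    return transactions


def average_length(transactions: List[Transaction]) -> float:
    """Mean number of items per transaction (0.0 for an empty list)."""
    if not transactions:
        return 0.0
    return sum(len(t) for t in transactions) / len(transactions)
```

Assuming the data were saved locally as `accidents.dat` (a hypothetical file name), `parse_transactions(open("accidents.dat"))` would load them, and `average_length` gives a quick check against the reported average of about 45 attribute values per accident.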
Experimental Results • Dataset with 340,183 transactions • Obtained a reduced sample of 30% • Final sample ratios of 2.5%, 5%, 7.5% and 10% • Parameters: • Minimum Support = 0.77% • Size of group k = 10
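To give a feel for the scale of these settings, the ratios can be converted into transaction counts. The sketch below assumes that every ratio is taken relative to the full 340,183-transaction dataset and that sizes are rounded to the nearest transaction; the slide states neither, so the numbers are indicative only.

```python
from fractions import Fraction
from math import ceil

DATASET_SIZE = 340_183            # transactions, from the slide
MIN_SUPPORT = Fraction("0.0077")  # 0.77%, from the slide


def sample_size(total: int, ratio: Fraction) -> int:
    """Transactions in a sample of the given ratio (ASSUMED: nearest integer)."""
    return round(total * ratio)


def min_support_count(n_transactions: int, min_support: Fraction = MIN_SUPPORT) -> int:
    """Smallest transaction count that meets the minimum support in a sample."""
    return ceil(n_transactions * min_support)


initial = sample_size(DATASET_SIZE, Fraction("0.30"))
finals = {r: sample_size(DATASET_SIZE, Fraction(r))
          for r in ("0.025", "0.05", "0.075", "0.10")}

if __name__ == "__main__":
    print("initial sample:", initial)
    for ratio, size in finals.items():
        print(ratio, size, "min support count:", min_support_count(size))
```

Under these assumptions the 5% final sample holds about 17,000 transactions, so an itemset needs to appear in roughly 131 of them to reach the 0.77% minimum support.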
Experimental Results • Results [results figure from the original slide not reproduced]
Conclusions • No need to process the entire input dataset • FAST-Grow can achieve high accuracy even with a small sampling ratio of 5-10% • The algorithm performs better when using the fixed-size stopping criterion
References • [1] B. Chen, P. Haas, and P. Scheuermann. A new two-phase sampling based algorithm for discovering association rules. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. • [2] H. Bronnimann, B. Chen, P. Haas, M. Dash, Y. Qiao, and P. Scheuermann. Efficient data-reduction methods for on-line association rule discovery. Presented at NSF Workshop on Next-Generation Data Mining (NGDM02), November 2002. • [3] K. Geurts. Traffic Accidents Data Set. fimi.cs.helsinki.fi/data/accidents.pdf. Last access: 17/11/2005. • [4] C. Borgelt. GNU publicly available implementation of the Apriori algorithm. http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html. Last access: 24/11/2005.
Thank you! Questions?