Title

Pincer-Search*: An Efficient Algorithm for Discovering the Maximum Frequent Set • Dao-I Lin and Zvi M. Kedem • *Appeared in Advances in Database Technology - EDBT’98, Proceedings, LNCS Vol. 1377, Springer, pp. 105-119, March 1998 • Title • Name • Department of Computer Science, Courant Institute of Mathematical Sciences, New York University • http://www/?

Applications • Association rule applications: • Based on supermarket databases, one might be interested to know that “95% of the customers who bought pasta and ground meat also bought spaghetti sauce” • Based on the alarm signals in telecommunication databases, one might be interested to know that “one can have 90% confidence that alarm C will occur within some interval of time if alarm A and alarm B have occurred in that interval of time” • Based on the stock market trading databases, one might be interested to know that “90% of the time during the last month when the prices of stock A and stock B went up, the price of stock C also went up”

Setting • Basic terms: • {1, 2, …, n}: The set of all items • e.g. items in supermarkets, alarm signals in telecommunication networks, or stocks in stock markets • Transaction: A set of items • e.g. items purchased in a supermarket, alarm signals occurring within an interval of time, or stocks whose prices went up during the last hour • Database: A set of transactions • User-defined threshold (min-support): A number in [0,1] • Frequent itemset: A collection of items (an itemset) occurring in at least a min-support fraction of the transactions in the database • The problem: • Given a large database of sets of items and a user-defined min-support threshold, what are the frequent itemsets?
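The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `support` and `is_frequent` and the toy supermarket transactions are assumptions made for the example, not part of the original slides.

```python
def support(itemset, database):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for transaction in database if itemset <= transaction) / len(database)


def is_frequent(itemset, database, min_support):
    """An itemset is frequent if its support reaches the min-support threshold."""
    return support(itemset, database) >= min_support


# Hypothetical supermarket database: each transaction is a set of items.
toy_db = [
    frozenset({"pasta", "ground meat", "spaghetti sauce"}),
    frozenset({"pasta", "ground meat", "spaghetti sauce"}),
    frozenset({"pasta", "bread"}),
    frozenset({"milk"}),
]

pair_support = support({"pasta", "ground meat"}, toy_db)
pair_is_frequent = is_frequent({"pasta", "ground meat"}, toy_db, min_support=0.5)
print(pair_support, pair_is_frequent)
```

Representing transactions as frozensets keeps the containment test a single subset comparison (`<=`).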

The Importance of the Maximum Frequent Set • Maximal frequent itemsets: • The frequent itemsets such that no proper superset of them is frequent • Maximum frequent set: • The set of all maximal frequent itemsets • Fact: • An itemset is frequent if and only if it is a subset of a maximal frequent itemset • The maximum frequent set uniquely determines the entire frequent set, since the frequent set consists of exactly the subsets of the maximal frequent itemsets • Discovering the maximum frequent set is a key problem in many data mining applications: • Such as the discovery of association rules, theories, strong rules, episodes, and minimal keys
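The fact above can be made concrete with a small sketch that expands a maximum frequent set into the full frequent set. The helper name `frequent_set_from_maximal` is hypothetical; the input {{1,2,3},{1,5}} is the maximum frequent set of the deck's worked example.

```python
from itertools import combinations


def frequent_set_from_maximal(maximal_itemsets):
    """Return every non-empty subset of at least one maximal frequent itemset."""
    frequent = set()
    for maximal in maximal_itemsets:
        members = sorted(maximal)
        for size in range(1, len(members) + 1):
            frequent.update(frozenset(c) for c in combinations(members, size))
    return frequent


mfs = [{1, 2, 3}, {1, 5}]
frequent = frequent_set_from_maximal(mfs)
print(sorted(sorted(s) for s in frequent))
```

The expansion yields 9 itemsets here: the 7 non-empty subsets of {1,2,3} plus {5} and {1,5}, with {1} shared by both maximal itemsets.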

An Example • Database: transaction 1 = {1,2,3,5}, transaction 2 = {1,5}, transaction 3 = {1,2}, transaction 4 = {1,2,3} • Min-support is 0.5 • Frequent itemsets are {1}, {2}, {3}, {5}, {1,2}, {1,3}, {1,5}, {2,3}, and {1,2,3}, since they occur in at least 2 out of 4 transactions • Maximum frequent set is {{1,2,3},{1,5}} • (Figure: lattice of these itemsets, from the single items up to {1,2,3,4,5})
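The example can be checked by brute force. The sketch below enumerates every itemset over items 1 to 5, which is only practical for a toy database like this one; the variable and function names are illustrative.

```python
from itertools import combinations

database = [{1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3}]
items = [1, 2, 3, 4, 5]
min_support = 0.5


def support(itemset, db):
    """Fraction of transactions in `db` that contain `itemset`."""
    return sum(1 for transaction in db if set(itemset) <= transaction) / len(db)


# Every non-empty itemset whose support reaches the threshold.
frequent = [
    frozenset(candidate)
    for size in range(1, len(items) + 1)
    for candidate in combinations(items, size)
    if support(candidate, database) >= min_support
]

# A frequent itemset is maximal if no proper superset is frequent.
maximal = [s for s in frequent if not any(s < other for other in frequent)]

print(sorted(sorted(s) for s in frequent))
print(sorted(sorted(s) for s in maximal))  # [[1, 2, 3], [1, 5]]
```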

Two Closure Properties • Let A and B be two itemsets with A ⊆ B • Property 1: A infrequent ⇒ B infrequent (if a transaction does not contain A, it cannot contain B) • Property 2: B frequent ⇒ A frequent (if a transaction contains B, it must contain A) • (Figure: two itemset lattices, one illustrating Property 1 with A = {5} and its supersets, the other illustrating Property 2 with B = {1,2,3,4} and its subsets)
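Both properties follow from one observation: for A ⊆ B, the support of B can never exceed the support of A. The sketch below checks this exhaustively on the four-transaction database from the deck's example slide; the helper names are illustrative.

```python
from itertools import combinations

database = [{1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3}]
items = [1, 2, 3, 4, 5]


def support_count(itemset):
    """Number of transactions that contain `itemset`."""
    return sum(1 for transaction in database if itemset <= transaction)


all_itemsets = [
    frozenset(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
]

# Collect every pair A ⊆ B where B would occur more often than A.
violations = [
    (a, b)
    for a in all_itemsets
    for b in all_itemsets
    if a <= b and support_count(b) > support_count(a)
]
print(len(all_itemsets), len(violations))
```

Since no pair violates the inequality, an infrequent A forces every superset B to be infrequent (Property 1), and a frequent B forces every subset A to be frequent (Property 2).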

Traditional One-Way Search Approaches • The traditional approach to discovering the maximum frequent set uses either a bottom-up search or a top-down search • Bottom-up search is good when ALL maximal frequent itemsets are short • Top-down search is good when ALL maximal frequent itemsets are long • One-way search can only make use of ONE of the two closure properties to prune candidates

One-Way Search Algorithms • Property 1 leads to bottom-up search algorithms, such as AIS (AIS93), Apriori (AS94), OCD (MTV94), SETM (HS95), DHP (PCY95), Partition (SON95), ML-T2+ (HF95), Sampling (T96), DIC (BMUT97), Clique (ZPOL97) • Property 2 leads to top-down search algorithms, such as TopDown (ZPOL97), guess-and-correct (MT97) • (Figure: itemset lattices over items 1 to 5 for the two search directions; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets)

Complexity of One-Way Searches • For bottom-up search, every frequent itemset is explicitly examined (in the example, until {1,2,3,4} is examined) • For top-down search, every infrequent itemset is explicitly examined (in the example, until {5} is examined) • (Figure: itemset lattices over items 1 to 5 for the two search directions; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets)
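The two costs can be counted on a small instance of the slide's example. The sketch below is a simplified level-wise search in each direction, not any of the cited algorithms; the four-transaction database is an assumption chosen so that {1,2,3,4} is the only maximal frequent itemset and {5} is infrequent.

```python
from itertools import combinations

# Hypothetical database: {1,2,3,4} has support 0.75, {5} has support 0.25.
database = [frozenset({1, 2, 3, 4})] * 3 + [frozenset({5})]
items = [1, 2, 3, 4, 5]
min_support = 0.5


def is_frequent(itemset):
    return sum(1 for t in database if itemset <= t) / len(database) >= min_support


def bottom_up():
    """Level-wise search from single items upward, pruning with Property 1."""
    frequent, examined = [], 0
    level = [frozenset([i]) for i in items]
    while level:
        examined += len(level)
        current = {c for c in level if is_frequent(c)}
        frequent.extend(current)
        size = len(level[0]) + 1
        pool = sorted(set().union(*current)) if current else []
        # Keep a candidate only if all of its subsets one item smaller are frequent.
        level = [frozenset(c) for c in combinations(pool, size)
                 if all(frozenset(s) in current for s in combinations(c, size - 1))]
    maximal = [s for s in frequent if not any(s < other for other in frequent)]
    return maximal, examined


def top_down():
    """Level-wise search from the full itemset downward, pruning with Property 2."""
    maximal, examined = [], 0
    level = {frozenset(items)}
    while level:
        examined += len(level)
        infrequent = set()
        for c in level:
            if is_frequent(c):
                maximal.append(c)
            else:
                infrequent.add(c)
        # Descend only from infrequent sets; skip subsets of known frequent sets.
        level = {c - {e} for c in infrequent for e in c
                 if len(c) > 1 and not any(c - {e} <= m for m in maximal)}
    return maximal, examined


print(bottom_up())  # 16 itemsets examined: the 15 frequent ones plus {5}
print(top_down())   # 17 itemsets examined: the 16 infrequent ones plus {1,2,3,4}
```

Both directions find the same maximal frequent itemset, but each has to walk through nearly half of the 31 non-empty itemsets to get there.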

Our Two-Way Search Approach: Pincer-Search • Run both bottom-up search and top-down search at the same time • Use information gathered in the bottom-up search to help prune candidates in the top-down search • Use Property 1 to eliminate candidates in the top-down search • Use information gathered in the top-down search to help prune candidates in the bottom-up search • Use Property 2 to eliminate candidates in the bottom-up search • Can efficiently discover both long and short maximal frequent itemsets

Pincer-Search: Combining Top-down and Bottom-up Searches • Some itemsets are eliminated in the top-down search by using Property 1 • Others are eliminated in the bottom-up search by using Property 2 • This example shows how combining both searches could dramatically reduce • the number of candidates examined • the number of passes over the database • (Figure: itemset lattice over items 1 to 5; blue: frequent itemsets, red: maximal frequent itemsets, black: infrequent itemsets, green: itemsets not examined)
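The interplay of the two directions can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' exact procedure: the set `mfcs` plays the role of the top-down candidate set, candidate generation enumerates all itemsets of the next size rather than using a join step, and the database is a hypothetical one in which {1,2,3,4} is the only maximal frequent itemset.

```python
from itertools import combinations


def pincer_search_sketch(database, items, min_support):
    n = len(database)

    def is_frequent(itemset):
        return sum(1 for t in database if itemset <= t) / n >= min_support

    mfs = set()                                   # maximal frequent itemsets found so far
    mfcs = {frozenset(items)}                     # top-down candidates
    candidates = [frozenset([i]) for i in items]  # bottom-up candidates of size k
    k, passes = 1, 0

    while mfcs - mfs:                             # a top-down candidate is still unconfirmed
        passes += 1
        # Top-down half of the pass: confirm candidates that turn out to be frequent.
        mfs |= {m for m in mfcs if m not in mfs and is_frequent(m)}
        # Bottom-up half of the pass: count the current level.
        frequent_k = {c for c in candidates if is_frequent(c)}
        # Property 1: each infrequent itemset splits the top-down candidates containing it.
        for s in (c for c in candidates if c not in frequent_k):
            for m in [m for m in mfcs if s <= m]:
                mfcs.remove(m)
                for e in s:
                    smaller = m - {e}
                    if smaller and not any(smaller <= other for other in mfcs):
                        mfcs.add(smaller)

        # Property 2: subsets of a confirmed maximal itemset are frequent without counting.
        def known_frequent(itemset):
            return itemset in frequent_k or any(itemset <= m for m in mfs)

        k += 1
        candidates = [
            c for c in map(frozenset, combinations(items, k))
            if not any(c <= m for m in mfs)
            and all(known_frequent(c - {e}) for e in c)
        ]

    maximal = [m for m in mfs if not any(m < other for other in mfs)]
    return maximal, passes


# Hypothetical database: {1,2,3,4} has support 0.75, {5} has support 0.25.
db = [frozenset({1, 2, 3, 4})] * 3 + [frozenset({5})]
result, passes = pincer_search_sketch(db, [1, 2, 3, 4, 5], 0.5)
print(sorted(sorted(m) for m in result), passes)
```

On this database the sketch stops after two passes: the first pass finds {5} infrequent, which shrinks the top-down candidate to {1,2,3,4}, and the second pass confirms it, so no itemset of size 3 is ever counted. A purely level-wise bottom-up search would need four passes, one for each itemset size from 1 to 4.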

Performance: Observations and Experiments • Non-monotone property of the maximum frequent set • Both the number of candidates and the number of frequent itemsets increase as the min-support decreases • This is NOT true for the number of maximal frequent itemsets • For example, suppose the maximum frequent set (MFS) is {{1,2},{1,3},{2,3}} when min-support is 9% • If min-support decreases to 6%, then the MFS could become {{1,2,3}} • This property will NOT help bottom-up search algorithms • However, this property may help the Pincer-Search algorithm • Concentrated and scattered distributions • Concentrated: For the same number of frequent itemsets, the frequent itemsets are grouped in a NARROW and TALL shape; a few LONG maximal frequent itemsets • Scattered: For the same number of frequent itemsets, the frequent itemsets are grouped in a WIDE and SHORT shape; many SHORT maximal frequent itemsets
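The non-monotone property can be reproduced on a small constructed database. The 100 transactions below are hypothetical, chosen so that each pair of items from {1,2,3} has support 9% while the triple {1,2,3} has support 6%; the function name `mine` is illustrative.

```python
from itertools import combinations

# Hypothetical database of 100 transactions over items 1, 2, 3.
database = (
    [frozenset({1, 2, 3})] * 6
    + [frozenset({1, 2})] * 3
    + [frozenset({1, 3})] * 3
    + [frozenset({2, 3})] * 3
    + [frozenset()] * 85      # transactions with none of the three items
)
items = [1, 2, 3]


def mine(min_support):
    """Brute-force the frequent itemsets and the maximal frequent itemsets."""
    n = len(database)
    frequent = [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
        if sum(1 for t in database if frozenset(c) <= t) / n >= min_support
    ]
    maximal = [s for s in frequent if not any(s < other for other in frequent)]
    return frequent, maximal


frequent_9, maximal_9 = mine(0.09)
frequent_6, maximal_6 = mine(0.06)
print(len(frequent_9), sorted(sorted(s) for s in maximal_9))
print(len(frequent_6), sorted(sorted(s) for s in maximal_6))
```

Lowering the threshold from 9% to 6% raises the number of frequent itemsets from 6 to 7 but lowers the number of maximal frequent itemsets from 3 to 1.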

Scattered Distributions

Experiments on Scattered Distributions • The benchmark databases are generated by a well-known synthetic data generation program from the IBM Quest project • |T| is the average transaction size, |I| is the average size of the maximal frequent itemsets, |D| is the number of transactions, and |L| is the number of maximal frequent itemsets • The experiment on T5.I2.D100K shows that although the Pincer-Search algorithm used more candidates than the Apriori algorithm (due to the candidates considered in the MFCS, the maximum frequent candidate set maintained by the top-down search), Pincer-Search still performed better, since the I/O time saved more than compensated for the extra cost • The experiment on T10.I4.D100K shows that it is also possible for the Pincer-Search algorithm to spend effort on maintaining the MFCS without pruning enough candidates to cover the extra cost • For instance, the Pincer-Search algorithm performed slightly worse than the Apriori algorithm when min-support is 0.75%

Concentrated Distributions

Experiments on Concentrated Distributions • These experiments show that the Pincer-Search algorithm is good at discovering the maximum frequent set in databases with concentrated distributions • The improvements can be up to several orders of magnitude • For instance, the improvements are more than 2 orders of magnitude in the experiment on the T20.I15.D100K database when min-support is 7% and 6% • One can expect even greater improvements when some maximal frequent itemsets are longer

Census Data

Experiments on Real-Life Databases and Conclusions • The Pincer-Search algorithm performed quite well in the experiments on the PUMS database, which contains Public Use Microdata Samples • Some preliminary experiments on NYSE stock market databases also show promising results • Conclusions: • Pincer-Search is good for concentrated distributions • In general, one can use Adaptive Pincer-Search • It delays the use of the two-way search approach until a later pass • More experiments on real-life databases are in progress
