Classification Using Statistically Significant Rules

Classification Using Statistically Significant Rules Sanjay Chawla School of IT University of Sydney (joint work with Florian Verhein and Bavani Arunasalam)

Overview • Data Mining Tasks • Associative Classifiers • Support and Confidence for Imbalanced Datasets • The use of exact tests to mine rules • Experiments • Conclusion

Associative Classifier Data Mining • Data Mining research has settled into an equilibrium involving four tasks Pattern Mining (Association Rules) Classification DB Clustering Anomaly or Outlier Detection ML

Outlier Detection • Outlier Detection (Anomaly Detection) can be studied from two aspects • Unsupervised • Nearest Neighbor or K-Nearest Neighbor Problem • Supervised • Classification for imbalanced data set • Fraud Detection • Medical Diagnosis

Example: Association Rules (Agrawal, Imielinksi and Swami, 93 SIGMOD) • An implication expression of the form X  Y, where X and Y are itemsets • Example: {Milk, Diaper}  {Beer} • Rule Evaluation Metrics • Support (s) • Fraction of transactions that contain both X and Y • Confidence (c) • Measures how often items in Y appear in transactions thatcontain X From “Introduction to Data Mining”, Tan,Steinbach and Kumar

Association Rule Mining (ARM) Task • Given a set of transactions T, the goal of association rule mining is to find all rules having • support ≥ minsup threshold • confidence ≥ minconf threshold • Brute-force approach: • List all possible association rules • Compute the support and confidence for each rule • Prune rules that fail the minsup and minconf thresholds  Computationally prohibitive!

Mining Association Rules • Two-step approach: • Frequent Itemset Generation • Generate all itemsets whose support  minsup • Rule Generation • Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset • Frequent itemset generation is still computationally expensive

Frequent Itemset Generation Given d items, there are 2d possible candidate itemsets From “Introduction to Data Mining”, Tan,Steinbach and Kumar

Reducing Number of Candidates • Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent • Apriori principle holds due to the following property of the support measure: • Support of an itemset never exceeds the support of its subsets • This is known as the anti-monotone property of support

Illustrating Apriori Principle Found to be Infrequent Pruned supersets From “Introduction to Data Mining”, Tan,Steinbach and Kumar

Classification using ARM In a Classification task we want to predict the class label (Gender) using the attributes A good (albeit stereotypical) rule is {Beer,Diaper}  Male whose support is 60% and confidence is 100%

Classification Based on Association (CBA): Liu, Hsu and Ma (KDD 98) • Mine association rules of the form Ac on the training data • Prune or Select rules using a heuristic • Rank rules • Higher confidence; higher support; smallest antecedent • New data is passed through the ordered rule set • Apply first matching rule or variation thereof

Several Variations

Downsides of Support (3) • Support is biased towards the majority class • Eg: classes = {yes, no}, sup({yes})=90% • minSup > 10% wipes out any rule predicting “no” • Suppose X  no has confidence 1 and support 3%. Rule discarded if minSup > 3% even though it perfectly predicts 30% of the instances in the minority class! • In summary, support has many downsides –especially for classification.

Downside of Confidence(1) Conf(A C) = 20/25 = 0.8 Support(AC) = 20/100 = 0.2 Correlation between A and C: Thus, when the data set is imbalanced a high support and high confidence rule may not necessarily imply that the antecedent and the consequent are positively correlated.

Downside of Confidence (2) • Reasonable to expect that for “good rules” the antecedent and consequent are not independent! • Suppose • P(Class=Yes) = 0.9 • P(Class=Yes|X) = 0.9

Complement Class Support(CCS) • The following are equivalent for a rule A  C • A and C are positively correlated • The support of the antecedent(A) is less than CCS(AC) • Conf(AC) is greater than the support of Consequent(C)

Downsides of Confidence (3) Another useful observation • Higher confidence (support) for a rule in the minority class implies higher correlation, and lower correlation in the minority class implies lower confidence, but neither of these apply for the majority class. • Confidence (support) tends to bias the majority class.

Statistical Significant Rules • Support is a computationally efficient measure (anti-monotonic) • Tendency to “force” a statistical interpretation on support • Lets start with a statistically correct approach • And “force” it to be computationally efficient

Exact Tests • Let the class variable be {0,1}. • Suppose we have two rules X 1 and Y1 • We want to determine if the two rules are different, i.e., they have different effect on “causing” or ‘associating” with 1 or 0 • e.g., medicine and placebo

Exact Tests Table[a,b;c,d] We assume X and Y are binomial random variables with the same parameter p We want to determine, the probability that a specific table instance occurs “purely by chance”

Exact Tests We can calculate the exact probability of a specific table instance without resorting to asymptotic approximations. This can be used to calculate the p-value of [a,b;c,d]

Fisher Exact Test • Given a table, [a,b;c,d], Fisher Exact Test will find the probability (p-value) of obtaining the given table or a more positively associated table under the assumption that X and Y come from the same distribution.

Forcing “anti-monotonic” • We test a rule X 1 against all its immediate generalizations {X-z  1; z in X} • The • The rule is significant if • Pvalue < significance level (typically 0.05) • Use a bottom up approach and only test rules whose immediate generalizations are significant • Webb[06] has used Fisher Exact tests for generic association rule mining

Example • Suppose we have already determined that the rules (A = a1)  1 and (A = a2)  1 are significant. • Now we want to test if • X=(A =a1) ^ (A=a2)  1 is significant • Then we carry out a FET on X and X –{A=a1} and X and X-{A=a2}. • If the minimum of their p-value is less than the significance level we keep the X 1 rule, otherwise we discard it.

Ranking Rules • We have already observed that • high confidence rule for the majority class may be “more” negatively correlated than the same rule predicting the other class • A high positively correlated rule that predicts the minority class may have a lower confidence than the same rule predicting the other class

Experiments: Random Dataset • Attributes independent and uniformly distributed. • Makes no sense to find any rules – other than by chance • However minSup=1% and minConf=0.5 mines 4149/13092 -- over 31% of all possible rules. • Using our FET technique with standard significance level we find only 11 (0.08%)

Experiments: Balanced Dataset • Similar performance (within 1%)

Experiments: Balanced Dataset • But mines only 0.06% the number of rules • By searching only 0.5% of the search space • And using only 0.4% of the time.

Experiments: Imbalanced Dataset • Higher performance than support-confidence techniques • Using 0.07% of the search space and time, 0.7% the number of rules.

Contributions • Strong evidence and arguments against the use of support and confidence for imbalanced classification • Simple technique for using Fisher’s Exact test for finding positively associated and statistically significant rules. • Uses on average 0.4% of the time, searches only 0.5% of the search space, finds only 0.06% of the rules as support-confidence techniques. • Similar performance on balanced datasets, higher on imbalanced datasets. • Parameter free (except for significance level)

References • Verhein and Chawla, Classification Using Statistically Significant Rules http://www.it.usyd.edu.au/~chawla • Arunasalam and Chawla, CCCS: A Top-Down Associative Classifier for Imbalanced Class Distribution [ACM SIGKDD 2006; pp 517-522]

Classification Using Statistically Significant Rules