
Discriminative Pattern Mining

Discriminative Pattern Mining. By Mohammad Hossain. Based on the paper "Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data" by Gang Fang, Gaurav Pandey, Wen Wang, Manish Gupta, Michael Steinbach, and Vipin Kumar.


Presentation Transcript


  1. Discriminative Pattern Mining By Mohammad Hossain

  2. Based on the paper "Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data" by 1. Gang Fang 2. Gaurav Pandey 3. Wen Wang 4. Manish Gupta 5. Michael Steinbach 6. Vipin Kumar

  3. What is a Discriminative Pattern? • A pattern is said to be discriminative when its occurrence in two data sets (or in two different classes of a single data set) is significantly different. • One way to measure the discriminative power of a pattern is the difference between the supports of the pattern in the two data sets. • When this support difference (DiffSup) is greater than a threshold, the pattern is called discriminative.

  4. An example If we set the DiffSup threshold to 2, then the patterns C and ABC become interesting patterns.

  5. Importance • Discriminative patterns have been shown to be useful for improving classification performance on data sets where combinations of features have better discriminative power than the individual features. • For example, for biomarker discovery from case-control data (e.g. disease vs. normal samples), it is important to identify groups of biological entities, such as genes and single-nucleotide polymorphisms (SNPs), that are collectively associated with a certain disease or other phenotype.

  6. Example patterns: P1 = {i1, i2, i3}, P2 = {i5, i6, i7}, P3 = {i9, i10}, P4 = {i12, i13, i14}. DiffSup is NOT anti-monotonic. As a result, it will not work in an Apriori-like framework.

  7. Apriori: A Candidate Generation-and-Test Approach • Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested! • Method: • Initially, scan the DB once to get the frequent 1-itemsets • Generate length-(k+1) candidate itemsets from length-k frequent itemsets • Test the candidates against the DB • Terminate when no frequent or candidate set can be generated
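The generation-and-test loop described on this slide can be sketched as follows. This is a minimal illustrative sketch, not code from the paper; all function and variable names are my own.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal candidate generation-and-test loop for frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    # One initial scan yields the frequent 1-itemsets.
    level = [frozenset([i]) for i in items
             if sum(1 for t in transactions if i in t) >= min_sup]
    frequent = {}
    k = 1
    while level:
        for c in level:
            frequent[c] = sum(1 for t in transactions if c <= t)
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        level_set = set(level)
        candidates = [c for c in candidates
                      if all(frozenset(s) in level_set
                             for s in combinations(c, k))]
        # Test step: scan the DB and keep candidates meeting min_sup.
        level = [c for c in candidates
                 if sum(1 for t in transactions if c <= t) >= min_sup]
        k += 1
    return frequent
```

For example, on the transactions [{A,B}, {A,B,C}, {A,C}] with min_sup = 2, this finds {A}, {B}, {C}, {A,B}, and {A,C} as frequent, while {A,B,C} (support 1) is never even generated as a surviving candidate.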

  8. The Apriori Algorithm: An Example (Supmin = 2). [Figure: the first scan of database TDB produces candidate set C1 and frequent set L1; the second scan produces C2 and L2; the third scan produces C3 and L3.]

  9. But here we see that although the patterns AB and AC both have DiffSup < the threshold (2), their superset ABC has DiffSup = 2, which equals the threshold and thus becomes interesting. So AB and AC cannot be pruned.
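A toy two-class dataset makes this failure of anti-monotonicity concrete. The data below is my own construction (not from the paper), chosen so that, with a threshold of 2 on absolute support counts, AB and AC fall below the threshold while their superset ABC reaches it:

```python
def supp(pattern, transactions):
    """Number of transactions that contain every item in the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def diff_sup(pattern, d1, d2):
    """Absolute difference of the pattern's support counts in the two classes."""
    return abs(supp(pattern, d1) - supp(pattern, d2))

# Hypothetical two-class data mirroring the slide's AB/AC/ABC example.
D1 = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}]
D2 = [{"A", "B"}, {"A", "C"}, {"A", "C"}]

print(diff_sup({"A", "B"}, D1, D2))       # 1: below the threshold of 2
print(diff_sup({"A", "C"}, D1, D2))       # 1: below the threshold of 2
print(diff_sup({"A", "B", "C"}, D1, D2))  # 2: meets the threshold
```

Apriori-style pruning of the subpatterns AB and AC would therefore wrongly discard the discriminative superset ABC.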

  10. BASIC TERMINOLOGY AND PROBLEM DEFINITION • Let D be a dataset with a set of m items, I = {i1, i2, ..., im}, and two class labels S1 and S2. The instances of classes S1 and S2 are denoted by D1 and D2, so |D| = |D1| + |D2|. • For a pattern (itemset) α = {α1, α2, ..., αl}, the sets of instances in D1 and D2 that contain α are denoted by Dα1 and Dα2. • The relative supports of α in classes S1 and S2 are RelSup1(α) = |Dα1|/|D1| and RelSup2(α) = |Dα2|/|D2| • The absolute difference of the relative supports of α in D1 and D2 is denoted DiffSup(α) = |RelSup1(α) − RelSup2(α)|

  11. New function • Some new functions are proposed that have the anti-monotonic property and can be used for pruning in an Apriori-like framework. • One of them is BiggerSup, defined as: BiggerSup(α) = max(RelSup1(α), RelSup2(α)). • BiggerSup is anti-monotonic and an upper bound of DiffSup, so we may use it for pruning in the Apriori-like framework.
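BiggerSup and its relation to DiffSup can be sketched directly from the definitions above (a minimal sketch with my own names; the example data is hypothetical):

```python
def rel_sup(pattern, instances):
    """RelSup: fraction of class instances containing the pattern."""
    return sum(1 for t in instances if pattern <= t) / len(instances)

def diff_sup(pattern, d1, d2):
    """DiffSup(a) = |RelSup1(a) - RelSup2(a)|."""
    return abs(rel_sup(pattern, d1) - rel_sup(pattern, d2))

def bigger_sup(pattern, d1, d2):
    """BiggerSup(a) = max(RelSup1(a), RelSup2(a)): anti-monotonic, and an
    upper bound on DiffSup, so low-BiggerSup branches can be pruned."""
    return max(rel_sup(pattern, d1), rel_sup(pattern, d2))

# Illustrative data (my own): AB is frequent in both classes but not
# discriminative, so BiggerSup-based pruning cannot eliminate it.
D1 = [{"A", "B"}, {"A", "B"}, {"A"}]
D2 = [{"A", "B"}, {"B"}, {"A", "B"}]

p = {"A", "B"}
print(diff_sup(p, D1, D2))    # 0.0: not discriminative at all
print(bigger_sup(p, D1, D2))  # 2/3: survives any threshold below 2/3
```

This is exactly the weakness the next slide describes: the bound prunes infrequent non-discriminative patterns, but a pattern that is equally frequent in both classes keeps a high BiggerSup and is never pruned.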

  12. BiggerSup is a weak upper bound of DiffSup. • For instance, in the previous example, if we want to use it to find discriminative patterns with threshold 4: • P3 can be pruned, because it has a BiggerSup of 3. • P2 can not be pruned (BiggerSup(P2) = 6), even though it is not discriminative (DiffSup(P2) = 0). • More generally, BiggerSup-based pruning can only prune infrequent non-discriminative patterns with relatively low support, but not frequent non-discriminative patterns.

  13. A new measure: SupMaxK • The SupMaxK of an itemset α in D1 and D2 is defined as SupMaxK(α) = RelSup1(α) − maxβ⊆α(RelSup2(β)), where |β| = K. • If K = 1 it is called SupMax1, defined as SupMax1(α) = RelSup1(α) − maxa∈α(RelSup2({a})). • Similarly, with K = 2 we can define SupMax2, which is also called SupMaxPair.
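Under the definition above, the SupMaxK family can be sketched as follows (a minimal sketch assuming |α| ≥ K; function names and the example data are mine, not the paper's):

```python
from itertools import combinations

def rel_sup(pattern, instances):
    """Fraction of class instances containing every item of the pattern."""
    return sum(1 for t in instances if pattern <= t) / len(instances)

def sup_max_k(alpha, d1, d2, k):
    """SupMaxK(a) = RelSup1(a) - max over size-K subsets b of a of RelSup2(b)."""
    return rel_sup(set(alpha), d1) - max(
        rel_sup(set(beta), d2) for beta in combinations(sorted(alpha), k))

def sup_max1(alpha, d1, d2):
    return sup_max_k(alpha, d1, d2, 1)

def sup_max_pair(alpha, d1, d2):  # SupMax2
    return sup_max_k(alpha, d1, d2, 2)

# Hypothetical two-class data for illustration.
D1 = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}]
D2 = [{"A", "B"}, {"A", "C"}, {"A", "C"}]

print(sup_max_pair({"A", "B", "C"}, D1, D2))  # 2/3 - max(1/3, 2/3, 0) = 0.0
```

Note that the maximization runs over all size-K subsets of α, which is what makes larger K more expensive to evaluate.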

  14. Properties of the SupMaxK Family

  15. Relationship between DiffSup, BiggerSup and the SupMaxK Family

  16. SupMaxPair: A Special Member Suitable for High-Dimensional Data • In SupMaxK, as K increases we get a more complete set of discriminative patterns. • But as K increases, the cost of computing SupMaxK also increases; in fact, the complexity is O(m^K). • So for high-dimensional data (where m is large), a high value of K (K > 2) makes the computation infeasible. • In that case SupMaxPair (K = 2) can be used.
