
Mining Sequence Classifiers for Early Prediction



  1. Mining Sequence Classifiers for Early Prediction

    Zhengzheng Xing, Jian Pei, Guozhu Dong, Philip S. Yu. SIAM International Conference on Data Mining (SDM), 2008: 644-655. Reporters: 張漢斌, 朱政豪, 李佳樺, 李培福. 2013/11/28
  2. Outline Introduction Problem Definition Sequential Classification Rules (SCR) The Generalized Sequential Decision Tree Method (GSDT) Empirical Evaluation Conclusion Applications Future Work
  3. Abstract What problem does this work address? Making a precise prediction as early as possible. (Figure: a sequence is still arriving symbol by symbol; as soon as the classifier, which knows the rule aacc → Z, has seen a, a, c, c in the prefix, it can announce "Z will happen eventually" without waiting for more data. Figure labels: Classifier, Unknown, Happening, Happened.)
  4. Introduction Why is this work valuable? Disease diagnosis: the earlier the diagnosis, the higher the survival rate. Disaster prediction: the earlier the prediction, the lower the cost.
  5. Introduction (cont.) Why is this work distinctive? Early prediction had not been well studied before: existing methods extract features from the whole sequence, and none of them explore the utility of features for early prediction (e.g. SVM and ANN are whole-sequence based). Contributions: first to identify the problem of sequence classification for early prediction, and to propose two algorithms that achieve high accuracy while using only a very short prefix: Sequential Classification Rule (SCR), a rule-based method, and Generalized Sequential Decision Tree (GSDT), a decision-tree-based method.
  6. Problem Definition Sequence classifier C: makes the prediction once it is confident. Stable prediction: C(s[1, l0]) = Z = C(s[1, l0+1]) = C(s[1, l0+2]) = … = C(s). Cost: Cost(C, s) = length of minprefix(s, f) = l0. Sequence classification for early prediction: construct a sequence classifier C such that C has an expected accuracy of at least p0 (a user-specified parameter) and minimizes the expected prediction cost. e.g. sequence s = abcdaccd, classifier C: ac → Z. Stable prediction: C(abc) = Z = C(abcd) = C(abcda) = … = C(abcdaccd). Cost(C, s) = 3 = length of minprefix(abcdaccd, ac) = length of abc.
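  A minimal sketch of the prefix-cost idea, assuming a feature matches a prefix as soon as all of its symbols have appeared in order (subsequence matching), which is consistent with the abcdaccd / ac → Z example above; the function name is ours, not the paper's.

      # Shortest prefix of s that contains the feature f in order (subsequence match).
      # If f never matches, the whole sequence is the minimal prefix: minprefix(s, f) = s.
      def minprefix_len(s: str, f: str) -> int:
          i = 0                                  # how much of f has been seen so far
          for pos, ch in enumerate(s, start=1):
              if i < len(f) and ch == f[i]:
                  i += 1
              if i == len(f):
                  return pos                     # prefix s[1..pos] already contains f
          return len(s)

      # Examples from the slides:
      print(minprefix_len("abcdaccd", "ac"))        # 3 -> Cost(C, s) = 3 for rule ac -> Z
      print(minprefix_len("acbbdadbbdca", "bbd"))   # 5 -> prefix "acbbd"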
  7. Sequential Classification Rules (SCR)

  8. Sequential Classification Rules A feature is a short sequence, e.g. f = bbd, s = acbbdadbbdca, minprefix(s, f) = acbbd. When f does not appear in s, minprefix(s, f) = s. A sequential classification rule is of the form R: f1 → f2 → … → fm ⇒ c, i.e. if the features f1, …, fm appear in a sequence one after another, the sequence is predicted to belong to class c. minprefix(s, R) is the shortest prefix of s matched by all the features of R in order; when R does not match s, minprefix(s, R) = s.
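  Building on the sketch above, here is a hedged extension to a whole rule: we assume the rule's features must match one after another in the given order, as the R = f3 → f4 → f5 example on the next slide suggests; the function name is ours.

      # Shortest prefix of s matched by a whole rule R = [f1, f2, ..., fm], where the
      # features must appear one after another. Returns None if R does not match s.
      def rule_minprefix_len(s, features):
          pos = 0                                # length of the prefix consumed so far
          for f in features:
              i = 0                              # position inside the current feature
              while pos < len(s) and i < len(f):
                  if s[pos] == f[i]:
                      i += 1
                  pos += 1
              if i < len(f):
                  return None                    # this feature never completed: no match
          return pos

      print(rule_minprefix_len("acbbdadbbdca", ["bbd", "ca"]))   # 12 (the whole sequence)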
  9. Sequential Classification Rules (cont.) Support of R in SDB. Confidence of rule R on SDB. Ex. R = f3 → f4 → f5 for class label c1.
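  The support and confidence formulas on this slide are images in the original; the standard definitions (the paper's exact normalization may differ) read:

      Sup(R)  = |{ s in SDB : R matches s }| / |SDB|
      Conf(R) = |{ s in SDB : R matches s and class(s) = c }| / |{ s in SDB : R matches s }|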
  10. Feature Selection What is a good feature? 1. Frequency 2. Discriminativeness 3. Earliness. The utility measure U(f) is defined to combine these three criteria (formula shown on the slide).
  11. Feature Selection (cont.) Example feature: f = ab. The value is 0 when the feature is distributed the same across the class labels, i.e. it carries no information about the class label.
  12. Feature Selection (cont.) To measure discriminativeness, the entropy of SDB is given by E(SDB) = -Σ_i p_i log p_i, where p_i is the fraction of sequences in SDB belonging to class c_i. To measure frequency and earliness, the weighted support of f is used: a sequence that matches f within a short prefix contributes more than one that matches it only late. How should the utility value threshold be chosen?
  13. Feature Selection (cont.) To avoid choosing a utility threshold, first find the top-k features by utility and mine a set of rules using them; if those rules are insufficient for classification, mine the next k features. How can the top-k features be mined efficiently? (A sketch of this outer loop follows below.)
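  A minimal sketch of that outer loop; mine_top_k_features and mine_rules are placeholders for the procedures detailed on the following slides, and the (feature list, class label) rule representation is our own simplification.

      # Outer loop described on this slide: pull k features at a time, mine rules with
      # them, and only mine the next k features if the rules still leave sequences uncovered.
      def train_scr(sdb, k, p0, min_sup):
          rules, uncovered = [], list(sdb)
          while uncovered:
              features = mine_top_k_features(uncovered, k)
              if not features:
                  break                                    # nothing left to mine
              new_rules = mine_rules(uncovered, features, p0, min_sup)
              if not new_rules:
                  break                                    # no further progress possible
              rules.extend(new_rules)
              uncovered = [s for s in uncovered
                           if not any(rule_minprefix_len(s, r) is not None for r, _ in new_rules)]
          return rules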
  14. Mining Top-k Features Candidate features are organized in a sequence enumeration tree (level 1: length-1 sequences, level 2: length-2 sequences, level 3: length-3 sequences, and so on). The candidates with the top-3 utility values form the seed set; in the example, out of the candidates {a, b, c, aa, ab, …}, the top-3 by utility give the seed set {a, ab, aaa}.
  15. Mining Top-k Features (cont.) Seed set = {a, ab, aaa}. Once we obtain a set of k seed features, we can use the seeds to prune the search space: a subtree is pruned when no extension in it can beat the seeds.
  16. Mining Top-k Features (cont.) Theorem 3.1 (Utility bound). Let SDB be the training data set. For features f and f' such that f is a prefix of f', the utility U(f') is bounded from above by a quantity that can be computed from f alone. Because of this, if the bound for f is smaller than the lowest utility in the current seed set, every extension of f can be skipped and the subtree rooted at f is pruned.
  17. Mining Top-k Features (cont.) For example, if the greatest possible utility of any extension of f = aaa is smaller than U(f = aac), the subtree rooted at aaa is pruned. The seed set is maintained together with its lowest utility: e.g. seed set {a, aa, aac} with Ulb = U(f = aac); when a better feature such as abb is found, it replaces the weakest seed, giving {a, aa, abb} with Ulb = U(f = abb).
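  A compact sketch of this pruning, assuming an upper_bound(sdb, f) helper that implements the Theorem 3.1 bound (its exact formula is not reproduced on the slide, so it is left abstract); this is a more explicit variant of the mine_top_k_features placeholder used earlier.

      import heapq

      # Depth-first walk over the sequence enumeration tree, keeping the k features with
      # the highest utility and pruning any subtree whose utility bound cannot beat the
      # weakest current seed (the Ulb of the seed set).
      def mine_top_k_features(sdb, k, alphabet, max_len, utility, upper_bound):
          seeds = []                                   # min-heap of (utility, feature)
          frontier = [(c,) for c in alphabet]          # length-1 candidates
          while frontier:
              f = frontier.pop()
              u = utility(sdb, f)
              if len(seeds) < k:
                  heapq.heappush(seeds, (u, f))
              elif u > seeds[0][0]:
                  heapq.heapreplace(seeds, (u, f))     # a new seed replaces the weakest
              ulb = seeds[0][0] if len(seeds) == k else float("-inf")
              if len(f) < max_len and upper_bound(sdb, f) >= ulb:
                  frontier.extend(f + (c,) for c in alphabet)   # otherwise: pruned
          return ["".join(f) for _, f in seeds]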
  18. Mining Sequential Classification Rules Given a set of features F, all possible rules using the features in F can be enumerated using a rule enumeration tree rooted at the empty rule ∅.
  19. Mining Sequential Classification Rules (cont.) Each node has one of five statuses: inactive, active, chosen, pruned, or processed. Initially every node of the rule enumeration tree is inactive and the rule set R = {} is empty.
  20. Mining Sequential Classification Rules (cont.) Start with the node that has the lowest prediction cost. If its confidence is high enough (at least p0), the node is marked chosen and its rule is added to R.
  21. Mining Sequential Classification Rules (cont.) After a node is examined, it is marked processed; a child node is set to active if its support is at least min_sup.
  22. Mining Sequential Classification Rules (cont.) Otherwise (support below min_sup), the child node and its descendants are set to pruned.
  23. Mining Sequential Classification Rules (cont.) If a sequence s matches both R1 and R2, then the rule with the lower prediction cost (the one matched within the shorter prefix) gives the prediction for s. The search continues, choosing and pruning nodes, until the chosen rules in R cover the training sequences.
  24. Mining Sequential Classification Rules (cont.) Some sequences in SDB may not match any rule in R. These form a subset SDB'; we then select features and mine rules on SDB', repeating until the training sequences are covered.
  25. Mining Sequential Classification Rules (cont.) Once a set of rules has been mined, classification works as follows: given an input sequence, the earliest matched rule gives the prediction. In the slide's example, the first rule matches earliest and predicts class c1.
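  A small sketch of that prediction step, reusing rule_minprefix_len from the earlier sketch; the (features, label) rule representation is our own simplification.

      # Classification with a mined rule set: every rule is (features, class label);
      # among the rules that already match, the one matched by the shortest prefix wins.
      def predict(rules, s):
          best = None                              # (prefix length, class label)
          for features, label in rules:
              cost = rule_minprefix_len(s, features)
              if cost is not None and (best is None or cost < best[0]):
                  best = (cost, label)
          return best[1] if best else None         # None: no rule matches yet, keep waiting

      rules = [(["ac"], "Z")]                      # the toy rule ac -> Z from slide 6
      print(predict(rules, "abc"))                 # 'Z' after seeing only three symbols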
  26. SCR Summary The major idea is to mine a set of sequential classification rules as the classifier. The method considers three characteristics of features: frequency, discriminativeness, and earliness. Best-first search is conducted for the sake of efficiency.
  27. Advantages and Disadvantages Advantages: It considers frequency, discriminativeness and earliness, which raises accuracy and lowers the prediction cost. Without losing candidates, it enumerates all possible combinations of features and rules. Disadvantages: The enumeration tree grows quickly as the feature set grows. It does not consider two or more features that occur simultaneously.
  28. The Generalized Sequential Decision Tree Method (GSDT)

  29. Decision Tree Method
  30. Challenges The classical decision tree construction framework cannot be applied straightforwardly to sequence data for early prediction, because there are no natural attributes in sequences. Challenge: how can "attributes" be constructed from features in sequences?
  31. The GSDT Framework The key idea: use a set of features as an attribute in GSDT (illustrated on the slide).
  32. An Example of GSDT An attribute is a set of features, e.g. {f1, f2, f3, f4}. How is a set of features selected as an attribute? The attribute splits the training data into 4 subsets, one per feature (the sequences matching f1, f2, f3, or f4, respectively), and each subset is split recursively. Stop growing a branch: if the branch is pure enough, that is, the majority class has a population of at least p0; or if the training subset in the branch has fewer than min_sup sequences (minimum support threshold). (A small sketch of these stopping tests follows below.)
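  A minimal sketch of the two stopping tests named above, assuming p0 and min_sup are the only inputs needed.

      from collections import Counter

      # Stop growing a GSDT branch when it is pure enough (majority class >= p0)
      # or too small (fewer than min_sup training sequences), per the slide above.
      def should_stop(labels, p0, min_sup):
          if len(labels) < min_sup:
              return True
          majority = Counter(labels).most_common(1)[0][1]
          return majority / len(labels) >= p0

      print(should_stop(["+"] * 19 + ["-"], p0=0.9, min_sup=5))    # True: 95% pure
      print(should_stop(["+", "-", "+", "-"], p0=0.9, min_sup=5))  # True: too small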
  33. Attribute Composition Top-k feature selection method + greedy approach.
  34. Attribute Composition (cont.) At the beginning, A = {} and SDB' = SDB. Find the top-k features, then iteratively add to A the feature that has the largest support in SDB' (so A grows as {f1}, {f1, f2}, {f1, f2, f3}, {f1, f2, f3, f4}, …). Whenever a sequence in SDB' matches a feature in A, the sequence is removed from SDB'. If the k features are used up but SDB' is not yet empty, another k features are extracted. (A sketch of this greedy loop follows below.)
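  A minimal sketch of that greedy loop, with mine_top_k_features and a matches(s, f) predicate passed in as placeholders; the names are illustrative, not the paper's.

      # Greedy attribute composition: pick the candidate feature covering the most
      # remaining sequences, remove the sequences it covers, and repeat.
      def compose_attribute(sdb, k, mine_top_k_features, matches):
          attribute, remaining = [], list(sdb)
          candidates = list(mine_top_k_features(remaining, k))
          while remaining and candidates:
              # feature with the largest support among the still-uncovered sequences
              best = max(candidates, key=lambda f: sum(matches(s, f) for s in remaining))
              if not any(matches(s, best) for s in remaining):
                  break                                 # no candidate covers anything new
              candidates.remove(best)
              attribute.append(best)
              remaining = [s for s in remaining if not matches(s, best)]
              if remaining and not candidates:          # k features used up: mine k more
                  candidates = list(mine_top_k_features(remaining, k))
          return attribute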
  35. GSDT Summary A generalized sequential decision tree method for early prediction. Redundancy is allowed in covering the sequences in the training set to improve the robustness of GSDTs. A GSDT is easy to construct. Importantly, a GSDT has good understandability; this structural information is not captured by the sequential classification rule (SCR) approach.
  36. Advantages and Disadvantages Advantages: Easy to construct, and it can achieve accuracy comparable to the state-of-the-art methods. Because each attribute is a set of features, even an unseen sequence that does not contain one particular feature at the root node may still match another feature there. Disadvantages: Redundancy (a training sequence may be covered by several features and branches).
  37. Empirical Evaluation
  38. Empirical Evaluation Data sets: described on the following slides. Compared algorithms: BP, GSDT, ID3, KBANN, O'Neil, SCR, 1-NN, 3-NN. Measures: error rate, accuracy, average prefix length, run time.
  39. E. Coli Promoter Data Set 106 real instances: 53 promoter sequences and 53 non-promoter sequences. Instance content: class ("+" promoter, "-" non-promoter), instance name, and 57 nucleotides (a, t, c, g), e.g. +, S10, tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt
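  A small sketch of reading instances in this comma-separated form; the exact file layout is an assumption based on the example line above.

      # Parse lines like "+, S10, tactagcaat..." into (label, name, sequence) triples.
      def load_promoters(path):
          instances = []
          with open(path) as fh:
              for line in fh:
                  if not line.strip():
                      continue
                  label, name, seq = [part.strip() for part in line.split(",", 2)]
                  instances.append((label, name, seq.lower()))
          return instances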
  40. E. Coli Promoter Data Set (Cont.) Experimental methodology: leave-one-out. Feature extraction in SCR and GSDT: maximal feature length = 20, top-30 features. Parameter settings: shown in the slide's table.
  41. E. Coli Promoter Data Set (Cont.) Experimental results: meaningful rules were found both by SCR and by GSDT, e.g. TATAA → promoter and ATAAT → promoter.
  42. Drosophila Promoter Data Set 854 real instances: 327 promoter sequences and 527 non-promoter sequences. Instance content: class encoded in the instance name ("AF" promoter, "RH" non-promoter) and 300 nucleotides (A, T, C, G), e.g. AF035546.substring, ATGTCAGT .... TCGCTG. Experimental methodology: 10-fold cross validation.
  43. Drosophila Promoter Data Set (Cont.) Parameter settings and experimental results: shown in the slide's tables.
  44. Control Chart Time Series Data Set 600 discretized synthetic instances of 60 time points each, in six classes: normal, cyclic, increasing trend, decreasing trend, upward shift, and downward shift. Experimental methodology: randomly split into 300 training and 300 testing instances. Feature extraction in SCR and GSDT: maximal feature length = 20, top-30 features.
  45. Control Chart Time Series Data Set (Cont.) Experimental results: accuracy with respect to each class (figures on the slide).
  46. Conclusion The results show that SCR and GSDT obtain competitive prediction accuracy while using only a short prefix on average. A comparison between SCR and GSDT is given in the slide's table.
  47. Applications Early prediction of movie box office success: Mestyan, M., T. Yasseri and J. Kertesz (2013). "Early prediction of movie box office success based on Wikipedia activity big data." PLoS ONE 8(8): e71226. Early prediction of human motion: Jim Mainprice et al., "Human-Robot Collaborative Manipulation Planning Using Early Prediction of Human Motion." IEEE/RSJ, Nov. 2013.
  48. Applications (cont.) Personalized recommendation: Good mood:
  49. Applications (cont.) Personalized recommendation: Bad mood:
  50. Future Work Limitation of the work: it tells us what will happen, but not when. The time gap between the moment we know what will happen and the moment it really happens is overlooked. Sometimes, the time gap is really important: earthquake prediction, stock prediction.
  51. Future Work (cont.) Proposed method: treat the class label as an event with a time stamp, just like the features. Record the time gap between the class label (what eventually happened) and the last feature of the feature series (the first moment we are sure about what will happen). Mine deeper patterns.
  52. Thank you!
