
Sequence Data Mining: Techniques and Applications


Presentation Transcript


  1. Sequence Data Mining: Techniques and Applications Sunita Sarawagi IIT Bombay http://www.it.iitb.ac.in/~sunita Mark Craven University of Wisconsin http://www.biostat.wisc.edu/~craven

  2. What is a sequence? • Ordered set of elements: s = a1, a2, ..., an • Each element ai could be • Numerical • Categorical: the domain is a finite set of symbols S, |S| = m • Multiple attributes • The length n of a sequence is not fixed • Order is determined by time or position and could be regular or irregular

  3. Real-life sequences • Classical applications • Speech: sequence of phonemes • Language: sequence of words and delimiters • Handwriting: sequence of strokes • Newer applications • Bioinformatics: • Genes: sequence of 4 possible nucleotides, |S| = 4 • Example: AACTGACCTGGGCCCAATCC • Proteins: sequence of 20 possible amino acids, |S| = 20 • Telecommunications: alarms, data packets • Retail data mining: customer behavior, sessions in an e-store (e.g., Amazon) • Intrusion detection

  4. Intrusion detection • Intrusions could be detected at • Network level (denial-of-service attacks, port scans, etc.) [KDD Cup 99] • Sequence of TCP dumps • Host level (attacks on privileged programs like lpr, sendmail) • Sequence of system calls • |S| = set of all possible system calls, ~100 • Example trace: open lseek lstat mmap execve ioctl ioctl close execve close unlink

  5. Outline • Traditional mining operations on sequences • Classification • Clustering • Finding repeated patterns • Primitives for handling sequence data • Sequence-specific mining operations • Partial sequence classification (Tagging) • Segmenting a sequence • Predicting next symbol of a sequence • Applications in biological sequence mining

  6. Classification of whole sequences Given: • a set of classes C and • a number of example sequences in each class, train a model so that for an unseen sequence we can say to which class it belongs Example: • Given a set of protein families, find family of a new protein • Given a sequence of packets, label session as intrusion or normal • Given several utterances of a set of words, classify a new utterance to the right word

  7. Conventional classification methods • Conventional classification methods assume • record data: fixed number of attributes • single record per instance to be classified (no order) • Sequences: • variable length, order important. • Adapting conventional classifiers to sequences • Generative classifiers • Boundary-based classifiers • Distance based classifiers • Kernel-based classifiers

  8. Generative methods • For each class i, train a generative model Mi to maximize the likelihood over all training sequences in class i • Estimate Pr(ci) as the fraction of training instances in class i • For a new sequence x, • find Pr(x|ci)*Pr(ci) for each i using Mi • choose the i with the largest value of Pr(x|ci)*Pr(ci) • [Figure: x scored against Pr(x|c1)*Pr(c1), Pr(x|c2)*Pr(c2), Pr(x|c3)*Pr(c3)] • Need a generative model for sequence data
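A minimal sketch of this decision rule in Python, assuming each class model is an object exposing a log_likelihood(seq) method; the model_factory and method names are illustrative, not from the slides:

    # Generative classification: pick the class maximizing Pr(x|c) * Pr(c).
    import math

    def train_generative_classifier(train_seqs_by_class, model_factory):
        """train_seqs_by_class: dict mapping class -> list of training sequences."""
        total = sum(len(seqs) for seqs in train_seqs_by_class.values())
        models, priors = {}, {}
        for c, seqs in train_seqs_by_class.items():
            models[c] = model_factory(seqs)   # trained to maximize likelihood within class c
            priors[c] = len(seqs) / total     # Pr(c) = fraction of training instances in c
        return models, priors

    def classify(x, models, priors):
        # equivalent to argmax_c Pr(x|c) * Pr(c), computed in log space
        return max(models, key=lambda c: models[c].log_likelihood(x) + math.log(priors[c]))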

  9. Boundary-based methods • Data: points in a fixed multi-dimensional space • Output of training: boundaries that define regions within which the same class is predicted • Application: test against the boundaries to find the region • Examples: decision trees, neural networks, linear discriminants • Need to embed sequence data in a fixed coordinate space

  10. Kernel-based classifiers • Define a function K(xi, x) that intuitively captures the similarity between two sequences and satisfies two properties • K is symmetric, i.e., K(xi, x) = K(x, xi) • K is positive definite • Training: learn for each class c • a weight wic for each training sequence xi • a bias bc • Application: to predict the class of x • for each class c, find f(x, c) = Σi wic K(xi, x) + bc • the predicted class is the c with the highest value of f(x, c) • Well-known kernel classifiers • Nearest neighbor classifier • Support Vector Machines • Radial basis functions • Need a kernel/similarity function
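A minimal sketch of this decision function, assuming the weights wic and biases bc come from some external trainer such as an SVM; all names here are illustrative:

    # Kernel decision rule: f(x, c) = sum_i w_ic * K(x_i, x) + b_c; predict argmax_c f(x, c).
    def kernel_predict(x, train_seqs, weights, biases, K):
        """weights[c][i] and biases[c] are learned; K is a symmetric, positive-definite kernel."""
        scores = {c: sum(weights[c][i] * K(xi, x) for i, xi in enumerate(train_seqs)) + biases[c]
                  for c in biases}
        return max(scores, key=scores.get)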

  11. Sequence clustering • Given a set of sequences, create groups such that similar sequences are in the same group • Three kinds of clustering algorithms • Distance-based: K-means, various hierarchical algorithms (need a similarity function) • Model-based algorithms: Expectation Maximization algorithm (need generative models) • Density-based algorithms: CLIQUE (need a dimensional embedding)

  12. Outline • Traditional mining on sequences • Primitives for handling sequence data • Embed sequence in a fixed dimensional space • All conventional record mining techniques will apply • Generative models for sequence • Sequence classification: generative methods • Clustering sequences: model-based approach • Distance between two sequences • Sequence classification: SVM and NN • Clustering sequences: distance-based approach • Sequence-specific mining operations • Applications

  13. Embedding sequences in a fixed dimensional space • Ignore order: each symbol maps to a dimension • extensively used in text classification and clustering • Extract aggregate features • Real-valued elements: Fourier coefficients, wavelet coefficients, auto-regressive coefficients • Categorical data: number of symbol changes • Sliding window techniques (k: window size) • Define a coordinate for each possible k-gram α (m^k coordinates) • the α-th coordinate is the number of times α occurs in the sequence • (k, b) mismatch score: the α-th coordinate is the number of k-grams in the sequence within b mismatches of α • Define a coordinate for each of the k positions • Multiple rows per sequence
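A small sketch of the basic k-gram embedding (exact counts, no mismatches); the function and variable names are illustrative:

    # One coordinate per possible k-gram over the alphabet (m^k coordinates);
    # the value of each coordinate is how often that k-gram occurs in the sequence.
    from collections import Counter
    from itertools import product

    def kgram_embedding(seq, alphabet, k):
        counts = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
        return [counts[g] for g in product(alphabet, repeat=k)]

    # e.g. kgram_embedding("AACTGACC", "ACGT", k=2) yields a 4^2 = 16 dimensional vector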

  14. Sliding window examples (window size 3) • Example trace: open lseek ioctl mmap execve ioctl ioctl open execve close mmap • [Figure: the trace encoded three ways: one symbol per column with one row per trace; multiple rows per trace, one per window position; and (k, b) mismatch scores with b = 1]

  15. Detecting attacks on privileged programs • Short sequences of system calls made during normal executions of a program are very consistent, yet different from the sequences of its abnormal executions • Two approaches • STIDE (Warrender 1999) • Create a dictionary of the unique k-windows in normal traces, count what fraction of the windows in new traces are absent from it, and threshold • RIPPER-based (Lee 1998) • next...
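A minimal sketch of the STIDE idea under that reading: a dictionary of normal k-windows, with a trace scored by the fraction of unseen windows; the threshold value mentioned below is purely illustrative:

    # Build the dictionary of unique k-windows from normal traces, then score a new
    # trace by the fraction of its k-windows that do not appear in the dictionary.
    def build_window_dictionary(normal_traces, k):
        return {tuple(t[i:i + k]) for t in normal_traces for i in range(len(t) - k + 1)}

    def mismatch_fraction(trace, dictionary, k):
        windows = [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]
        if not windows:
            return 0.0
        return sum(w not in dictionary for w in windows) / len(windows)

    # flag the trace as an intrusion if mismatch_fraction(trace, dictionary, k) > threshold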

  16. Classification models on k-gram trace data • When both normal and abnormal data are available • class label = normal/abnormal • When only normal traces are available • class label = the k-th system call of each window • Learn rules to predict the class label [RIPPER]

  17. Examples of output RIPPER rules • Both-traces: • if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is “normal” • if the 6th system call is lseek and the 7th is sigvec, then the sequence is “normal” • … • if none of the above, then the sequence is “abnormal” • Only-normal: • if the 3rd system call is lstat and the 4th is write, then the 7th is stat • if the 1st system call is sigblock and the 4th is bind, then the 7th is setsockopt • … • if none of the above, then the 7th is open

  18. Experimental results on sendmail • The output rule sets contain ~250 rules, each with 2 or 3 attribute tests • Score each trace by counting the fraction of mismatches and thresholding • [Chart: percent of mismatching traces] • Summary: only normal traces are sufficient to detect intrusions

  19. More realistic experiments • Different programs need different thresholds • Simple methods (e.g. STIDE) work as well • Results sensitive to window size • Is it possible to do better with sequence specific methods? [from Warrender 99]

  20. Outline • Traditional mining on sequences • Primitives for handling sequence data • Embed sequence in a fixed dimensional space • All conventional record mining techniques will apply • Distance between two sequences • Sequence classification: SVM and NN • Clustering sequences: distance-based approach • Generative models for sequences • Sequence classification: whole and partial • Clustering sequences: model-based approach • Sequence-specific mining operations • Applications in biological sequence mining

  21. Probabilistic models for sequences • Independent model • One-level dependence (Markov chains) • Fixed memory (Order-l Markov chains) • Variable memory models • More complicated models • Hidden Markov Models

  22. Independent model • Model structure • A parameter for each symbol in S (example: Pr(A) = 0.1, Pr(C) = 0.9) • Probability of a sequence s being generated from the model • example: Pr(AACA) = Pr(A) Pr(A) Pr(C) Pr(A) = Pr(A)^3 Pr(C) = 0.1^3 × 0.9 • Training the symbol probabilities • Data T: set of sequences • Count(s ∈ T): total number of times s appears in the training data T • Pr(s) = Count(s ∈ T) / length(T)

  23. First-order Markov chains • Model structure • A state for each symbol in S • Edges between states with probabilities • Probability of a sequence s being generated from the model • Example: Pr(AACA) = Pr(A) Pr(A|A) Pr(C|A) Pr(A|C) = 0.5 × 0.1 × 0.9 × 0.4 • Training the transition probabilities between states: Pr(s|b) = Count(bs ∈ T) / Count(b ∈ T) • [Figure: two states A and C, start probabilities 0.5/0.5, with the transition probabilities used in the example]
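A small sketch of training and scoring a first-order Markov chain by counting, following the formula on the slide (Pr(s|b) = Count(bs ∈ T) / Count(b ∈ T)); smoothing is omitted, and the 1e-12 floor for unseen events is an illustrative choice:

    import math
    from collections import Counter

    def train_markov_chain(sequences):
        start, trans, prev = Counter(), Counter(), Counter()
        for s in sequences:
            start[s[0]] += 1
            for a, b in zip(s, s[1:]):
                trans[(a, b)] += 1     # Count(ab in T)
                prev[a] += 1           # Count(a in T) as a predecessor
        n = len(sequences)
        start_p = {a: c / n for a, c in start.items()}
        trans_p = {(a, b): c / prev[a] for (a, b), c in trans.items()}
        return start_p, trans_p

    def log_prob(s, start_p, trans_p):
        lp = math.log(start_p.get(s[0], 1e-12))
        for a, b in zip(s, s[1:]):
            lp += math.log(trans_p.get((a, b), 1e-12))
        return lp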

  24. Higher-order Markov chains • l = memory of the sequence • Model • A state for each possible suffix of length l: |S|^l states • Edges between states with probabilities • Example (l = 2): Pr(AACA) = Pr(AA) Pr(C|AA) Pr(A|AC) = 0.5 × 0.9 × 0.7 • Training the model: Pr(s|σ) = Count(σs ∈ T) / Count(σ ∈ T), where σ is a length-l suffix • [Figure: l = 2 chain with start probability 0.5 and states AA, AC, CA, CC, with the transition probabilities used in the example]

  25. Variable memory models • Probabilistic Suffix Automata (PSA) • Model • States: substrings of size no greater than l, where no string is a suffix of another • Calculating Pr(AACA): Pr(A) Pr(A|A) Pr(C|A) Pr(A|AC) = 0.5 × 0.3 × 0.7 × 0.1 • Training: not straightforward • Eased by Prediction Suffix Trees • PSTs can be converted to PSAs after training • [Figure: example PSA with states including A, AC, and CC and the transition probabilities used in the example]

  26. Prediction Suffix Trees (PSTs) • Suffix trees with emission probabilities of the observations attached to each tree node • Linear-time algorithms exist for constructing such PSTs from training data [Apostolico 2000] • Example: Pr(AACA) = 0.28 × 0.3 × 0.7 × 0.1 • [Figure: example PST with root ε (emission probabilities 0.28, 0.72) and nodes A, C, AC, CC with their emission probabilities]

  27. Hidden Markov Models • Doubly stochastic models • Efficient dynamic programming algorithms exist for • finding Pr(S) • finding the highest-probability state path P for S (Viterbi) • training the model (Baum-Welch algorithm) • [Figure: a four-state HMM (S1–S4) with transition probabilities and per-state emission probabilities over A and C]
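A minimal sketch of the forward algorithm for Pr(S), with the model given as plain dictionaries (start_p[s], trans_p[r][s], emit_p[s][symbol]); Viterbi is the same recursion with max in place of the sum:

    def forward_probability(seq, states, start_p, trans_p, emit_p):
        # alpha[s] = probability of the prefix seen so far, ending in state s
        alpha = {s: start_p[s] * emit_p[s][seq[0]] for s in states}
        for sym in seq[1:]:
            alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][sym]
                     for s in states}
        return sum(alpha.values())   # Pr(S), summed over all hidden state paths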

  28. Discriminative training of HMMs • Models trained to maximize the likelihood of the data might perform badly when • the model is not representative of the data • the training data is insufficient • Alternatives to maximum likelihood / EM • Objective functions: • minimum classification error • maximum posterior probability of the actual label, Pr(c|x) • maximum mutual information with the class • These objectives are harder to train; a number of alternatives to EM have been proposed • Generalized probabilistic descent [Katagiri 98] • Deterministic annealing [Rao 01]

  29. HMMs for profiling system calls • Training: • initial number of states = 40 (roughly equals the number of distinct system calls) • train on normal traces • Testing: • need to handle variable-length and online data • for each call, find the probability of the HMM emitting it given all the calls before it (sketched below) • if that probability is below a threshold, flag the call as abnormal • the trace is abnormal if the fraction of abnormal calls is high
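A sketch of that online scoring under one natural reading: after each call, Pr(call | history) is obtained as the normalizer of the forward recursion; both thresholds below are illustrative parameters, not values from the slides:

    def trace_is_abnormal(trace, states, start_p, trans_p, emit_p,
                          call_threshold=1e-4, trace_threshold=0.1):
        alpha = dict(start_p)            # belief over states before seeing the next call
        abnormal = 0
        for t, call in enumerate(trace):
            if t > 0:                    # propagate the state distribution one step
                alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) for s in states}
            alpha = {s: alpha[s] * emit_p[s].get(call, 1e-12) for s in states}
            p_call = sum(alpha.values())               # Pr(call | all calls before it)
            if p_call < call_threshold:
                abnormal += 1
            alpha = {s: a / p_call for s, a in alpha.items()}   # renormalize (avoids underflow)
        return abnormal / len(trace) > trace_threshold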

  30. More realistic experiments • HMMs • take longer to train • are less sensitive to thresholds and have no window parameter • give the best overall performance • VMMs and Sparse Markov Transducers have also been shown to perform significantly better than fixed-window methods [Eskin 01] [from Warrender 99]

  31. ROC curves comparing different methods [from Warrender 99]

  32. Outline • Traditional mining on sequences • Primitives for handling sequence data • Sequence-specific mining operations • Partial sequence classification (Tagging) • Segmenting a sequence • Predicting next symbol of a sequence • Applications in biological sequence mining

  33. Partial sequence classification (Tagging) • The tagging problem: • Given: • a set of tags L • training examples of sequences showing the breakup of each sequence into the set of tags • Learn to break up a sequence into tags (classification of parts of sequences) • Examples: • Text segmentation: break a sequence of words forming an address string into subparts like Road, City name, etc. • Continuous speech recognition: identify words in continuous speech

  34. Text sequence segmentation • Examples: addresses, bibliographic records • Address example (tags: House number, Building, Road, City, State, Zip): 4089 Whispering Pines Nobel Drive San Diego CA 92122 • Bibliographic example (tags: Author, Year, Title, Journal, Volume, Page): P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237

  35. Approaches used for tagging • Learn to identify the start and end boundaries of each label • Method: for each label, build two classifiers accepting its two boundaries • any conventional classifier would do; rule-based is most common • K-windows approach: • for each label, train a classifier on k-windows • during testing, classify each k-window (sketched below) • Adapt state-based generative models like HMMs
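A small sketch of the k-windows idea: one training example per position, built from a window of surrounding tokens and that position's tag. The feature encoding and the downstream classifier (anything with a conventional fit/predict interface) are left abstract, and all names are illustrative:

    def window_features(tokens, i, k):
        # k tokens centered on position i, padded at the sequence boundaries
        half = k // 2
        return [tokens[j] if 0 <= j < len(tokens) else "<PAD>"
                for j in range(i - half, i + half + 1)]

    def make_training_data(tagged_sequences, k):
        """tagged_sequences: list of (tokens, tags) pairs with tags[i] labeling tokens[i]."""
        X, y = [], []
        for tokens, tags in tagged_sequences:
            for i in range(len(tokens)):
                X.append(window_features(tokens, i, k))
                y.append(tags[i])
        return X, y

    # at test time, predict the tag of position i from window_features(tokens, i, k)
    # (the token windows would still need to be encoded, e.g. one-hot, for most classifiers)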

  36. State-based models for sequence tagging • Two approaches: • Separate model per tag, with special prefix and suffix states to capture the start and end of the tag • [Figure: one small state model per tag (e.g. Road name, Building name), each with Prefix, S1–S4, and Suffix states]

  37. Combined model over all tags • Naïve model: one state per element • Nested model: each element is itself another HMM • Example: information extraction (IE) • [Figure: tagging of the text '… Mahatma Gandhi Road Near Parkland …', with a segment labeled Landmark]

  38. Other approaches • Disadvantages of generative models (HMMs) • Maximizing the joint probability of the sequence and labels may not maximize accuracy • The conditional independence of features is a restrictive assumption • Alternative: Conditional Random Fields • Maximize the conditional probability of labels given the sequence • Arbitrary overlapping features are allowed

  39. Outline • Traditional mining on sequences • Primitives for handling sequence data • Sequence-specific mining operations • Partial sequence classification (Tagging) • Segmenting a sequence • Predicting next symbol of a sequence • Applications in biological sequence mining

  40. Segmenting sequences • Basic premise: the sequence is assumed to be generated from multiple models • Goal: discover the boundaries of model change • Applications: • bioinformatics • better models by breaking up heterogeneous sequences

  41. A simpler problem: segmenting a 0/1 sequence • Players A and B • A has a set of coins with different biases • A repeatedly • picks an arbitrary coin • tosses it an arbitrary number of times • B observes the H/T outcomes and • guesses the transition points and biases • [Figure: A picks a coin, tosses it, and returns it; B observes the outcomes]

  42. An MDL-based approach • Given n head/tail observations • we can assume n different coins with bias 0 or 1 • the data fits perfectly (with probability one) • but many coins are needed • or we can assume one coin • which may fit the data poorly • The "best segmentation" is a compromise between model length and data fit • [Figure: an example segmentation with per-segment head fractions 5/7, 1/3, 1/4]

  43. MDL • For each segment i: • L(Mi): cost of the model parameters (encoding Pr(head)) plus the segment boundary: log(sequence length) • L(Di|Mi): cost of describing the data in segment Si given model Mi: −log( h^H (1−h)^T ), where h = Pr(head), H = #heads, T = #tails • Goal: find the segmentation with the smallest total cost Σ_segments i [ L(Mi) + L(Di|Mi) ]
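A small sketch of this per-segment cost for a 0/1 segment, with h estimated as the segment's head fraction; the exact number of bits charged for encoding the bias parameter is a modelling choice, and the 0.5·log2(n) term here is illustrative:

    import math

    def segment_cost(heads, tails, sequence_length):
        n = heads + tails
        h = heads / n                                    # estimated Pr(head) for this segment
        model_cost = 0.5 * math.log2(n) + math.log2(sequence_length)   # parameter + boundary
        data_cost = 0.0
        if 0 < h < 1:
            data_cost = -(heads * math.log2(h) + tails * math.log2(1 - h))  # -log2 h^H (1-h)^T
        return model_cost + data_cost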

  44. How to find the optimal segments: shortest path • Example: a sequence of 17 tosses • Derived graph: 18 nodes (one per boundary position) and all possible edges • Edge cost = model cost + data cost of the segment the edge spans • The optimal segmentation is the shortest (cheapest) path from the first node to the last
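The shortest path in this graph can be found by dynamic programming over prefixes; a sketch, reusing the illustrative segment_cost from the previous block as the edge cost:

    def best_segmentation(bits, cost):
        """bits: 0/1 sequence; cost(i, j): cost of one segment covering bits[i:j]."""
        n = len(bits)
        best = [0.0] + [float("inf")] * n    # best[j] = cheapest cost of segmenting bits[:j]
        back = [0] * (n + 1)
        for j in range(1, n + 1):
            for i in range(j):
                c = best[i] + cost(i, j)
                if c < best[j]:
                    best[j], back[j] = c, i
        segments, j = [], n
        while j > 0:
            segments.append((back[j], j))
            j = back[j]
        return best[n], list(reversed(segments))

    # e.g. cost = lambda i, j: segment_cost(sum(bits[i:j]), (j - i) - sum(bits[i:j]), len(bits))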

  45. Non-independent models • In the previous method each segment is assumed to be independent of the others; this does not allow reuse of the same model across segments • The (k, h) segmentation problem: • assume a fixed number h of models is to be used for segmenting an n-element sequence into k parts (k > h) • (k, k) segmentation is solvable by dynamic programming • the general (k, h) problem is NP-hard

  46. Approximations for (k, h) segmentation • 1. Solve (k, k) to get k segments • 2. Solve (n, h) to get h models (example: 2-medians) • 3. Assign each segment to the best of the h models • A second variant (replace step 2 above with): find summary statistics for each of the k segments and cluster them into h clusters • [Figure: segments S1–S4 assigned to models M1 and M2]

  47. Results of segmenting genetic sequences • From: Gionis & Mannila 2003

  48. Outline • Traditional mining operations on sequences • Primitives for handling sequence data • Sequence-specific mining operations • Applications in biological sequence mining • Classifying proteins according to family • Finding genes in DNA sequences BREAK

  49. Two sequence mining applications

  50. The protein classification task Given: amino-acid sequence of a protein Do: predict the family to which it belongs GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVCVLAHHFGKEFTPPVQAAYAKVVAGVANALAHKYH
