
We have seen the Classic Time Series Data Mining Tasks…



  1. We have seen the Classic Time Series Data Mining Tasks… Clustering, Classification, Query by Content

  2. Let's look at some problems which are more interesting… Novelty Detection, Rule Discovery, Motif Discovery. [Figure: an example rule with s = 0.5, c = 0.3]

  3. Novelty Detection: Fault detection, Interestingness detection, Anomaly detection, Surprisingness detection

  4. …note that this problem should not be confused with the relatively simple problem of outlier detection. Remember Hawkins' famous definition of an outlier: “...an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated from a different mechanism...” Thanks Doug, the check is in the mail. We are not interested in finding individually surprising datapoints; we are interested in finding surprising patterns. Douglas M. Hawkins

  5. Lots of good folks have worked on this, and closely related problems. It is referred to as the detection of “Aberrant Behavior [1]”, “Novelties [2]”, “Anomalies [3]”, “Faults [4]”, “Surprises [5]”, “Deviants [6]”, “Temporal Change [7]”, and “Outliers [8]”. 1. Brutlag, Kotsakis et al. 2. Dasgupta et al., Borisyuk et al. 3. Whitehead et al., Decoste. 4. Yairi et al. 5. Shahabi, Chakrabarti. 6. Jagadish et al. 7. Blockeel et al., Fawcett et al. 8. Hawkins.

  6. Arrr... what be wrong with current approaches? The blue time series at the top is a normal healthy human electrocardiogram with an artificial “flatline” added. The sequence in red at the bottom indicates how surprising local subsections of the time series are under the measure introduced in Shahabi et al.

  7. Simple Approaches I: Limit Checking. [Figure: an example time series, values roughly −10 to 35 over 1,000 points.]

  8. Simple Approaches II: Discrepancy Checking. [Figure: an example time series, values roughly −10 to 35 over 1,000 points.]

  9. Our Solution is based on the following intuition: a pattern is surprising if its frequency of occurrence is greatly different from that which we expected, given previous experience… This is a nice intuition, but it is useless unless we can define it more formally and calculate it efficiently.

  10. Note that, unlike all previous attempts to solve this problem, our notion of the surprisingness of a pattern is not tied exclusively to its shape. Instead, it depends on the difference between the shape's expected frequency and its observed frequency. For example, consider the familiar head-and-shoulders pattern shown below... The existence of this pattern in a stock market time series should not be considered surprising, since such patterns are known to occur (even if only by chance). However, if it occurred ten times this year, as opposed to occurring an average of twice a year in previous years, our measure of surprise will flag the shape as being surprising. Cool eh? The pattern would also be surprising if its frequency of occurrence is less than expected. Once again, our definition would flag such patterns.

  11. We call our algorithm… Tarzan! Tarzan (R) is a registered trademark of Edgar Rice Burroughs, Inc. “Tarzan” is not an acronym. It is a pun on the fact that the heart of the algorithm relies on comparing two suffix trees, “tree to tree”! Homer, I hate to be a fuddy-duddy, but could you put on some pants?

  12. We begin by defining some terms… Professor Frink? Definition 1: A time series pattern P, extracted from database X, is surprising relative to a database R if the probability of its occurrence is greatly different from that expected by chance, assuming that R and X are created by the same underlying process.

  13. Definition 1: A time series pattern P, extracted from database X, is surprising relative to a database R if the probability of its occurrence is greatly different from that expected by chance, assuming that R and X are created by the same underlying process. But you can never know the probability of a pattern you have never seen! And probability isn’t even defined for real-valued time series!

  14. We need to discretize the time series into symbolic strings… [Figure: two subsequences C and Q, about 120 points each, discretized as Ĉ = baabccbc and Q̂ = babcacca; another example string: UUFDUUFDDUUD.]
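
For concreteness, here is a minimal Python sketch of one possible discretization scheme: z-normalize the series, then map each value to a letter by equal-width binning. The function name, alphabet, and binning rule are illustrative assumptions, not necessarily the exact scheme used in this work.

import numpy as np

def discretize(ts, alphabet="abc"):
    # A minimal sketch of real-valued-to-symbolic discretization:
    # z-normalize, then assign each point to one of |alphabet| equal-width bins.
    ts = np.asarray(ts, dtype=float)
    ts = (ts - ts.mean()) / (ts.std() + 1e-12)
    edges = np.linspace(ts.min(), ts.max(), len(alphabet) + 1)[1:-1]
    return "".join(alphabet[i] for i in np.digitize(ts, edges))

# Example: a noisy sine wave becomes a short string over {a, b, c}
t = np.linspace(0, 4 * np.pi, 32)
print(discretize(np.sin(t)))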

  15. I’ve prepared a little math background... • We use Σ to denote a nonempty alphabet of symbols. • A string over Σ is an ordered sequence of symbols from the alphabet. • Given a string x, the number of symbols in x defines the length |x| of x. We assume |x| = n. • Let us decompose a text x into uvw, i.e., x = uvw, where u, v, and w are strings over Σ. • Strings u, v, and w are substrings, or words, of x. • u is called a prefix of x. • w is called a suffix of x.

  16. We write x[i], 1 ≤ i ≤ |x|, to indicate the ith symbol in x. • We use x[i,j] as shorthand for the substring x[i] x[i+1] … x[j], where 1 ≤ i ≤ j ≤ n, with the convention that x[i,i] = x[i]. • Substrings of the form x[1,j] correspond to the prefixes of x. • Substrings of the form x[i,n] correspond to the suffixes of x. • We say that a string y has an occurrence at position i of a text x if y[1] = x[i], y[2] = x[i+1], …, y[m] = x[i+m-1], where m = |y|. • For any substring y of x, we denote by fx(y) the number of occurrences of y in x.

  17. If x = principalskinner • Σ is {a,c,e,i,k,l,n,p,r,s} • |x| is 16 • skin is a substring of x • prin is a prefix of x • ner is a suffix of x • If y = in, then fx(y) = 2 • If y = pal, then fx(y) = 1
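
These definitions are easy to check mechanically. The snippet below reproduces the facts above in Python; the helper f is just an overlapping-occurrence counter written for illustration, not part of any library.

x = "principalskinner"

print(len(x))                 # |x| = 16
print("skin" in x)            # skin is a substring of x  -> True
print(x.startswith("prin"))   # prin is a prefix of x     -> True
print(x.endswith("ner"))      # ner is a suffix of x      -> True

def f(x, y):
    # f_x(y): number of (possibly overlapping) occurrences of y in x
    return sum(1 for i in range(len(x) - len(y) + 1) if x[i:i + len(y)] == y)

print(f(x, "in"))             # 2
print(f(x, "pal"))            # 1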

  18. We consider a string generated by a stationary Markov chain of order M ≥ 1 on the finite alphabet Σ. Let x = x[1]x[2] … x[n] be an observation of the random process and y = y[1]y[2] … y[m] an arbitrary but fixed pattern over Σ, with m < n. • The stationary Markov chain is completely determined by its transition matrix Π = (π(y[1,M], c)), with y[1], …, y[M], c ∈ Σ, where π(y[1,M], c) = P(Xi+1 = c | X[i−M+1, i] = y[1,M]) are called transition probabilities, with y[1], …, y[M], c ∈ Σ and M ≤ i ≤ n−1. The vector μ of the stationary probabilities of a stationary Markov chain with transition matrix Π is defined as the solution of μ = μΠ. • We now introduce the random variable that describes the occurrences of the word y. We define Zi, 1 ≤ i ≤ n−m+1, to be 1 if y occurs in x starting at position i, and 0 otherwise. We set Zy = Z1 + Z2 + … + Zn−m+1, so that Zy is the random variable for the total number of occurrences fx(y).

  19. We can now estimate the probability of seeing a pattern, even if we have never seen it before! John Smith, Cen Lee, John Jones, Mike Lee, John Smith, Bill Chu, Selina Chu, Bill Smith, David Gu, Peggy Gu, John Lee, Susan Chu. E(John Smith) = 2/12. E(Selina Chu) = 1/12. E(John Chu) = 0? No: since f(John) = 4/12 and f(Chu) = 3/12, we estimate E(John Chu) = (4/12) * (3/12).
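
The same estimation trick in Python, using the name list above. The helper names are made up for illustration; the key step is the last line, where an unseen pattern gets a non-zero estimate by multiplying the frequencies of its parts, exactly the independence assumption the slide makes.

names = [
    "John Smith", "Cen Lee", "John Jones", "Mike Lee",
    "John Smith", "Bill Chu", "Selina Chu", "Bill Smith",
    "David Gu", "Peggy Gu", "John Lee", "Susan Chu",
]

def marginal(word, position):
    # Relative frequency of `word` as a first name (0) or last name (1)
    return sum(1 for n in names if n.split()[position] == word) / len(names)

# Patterns we have seen can be estimated directly from their counts...
print(names.count("John Smith") / len(names))    # E(John Smith) = 2/12
print(names.count("Selina Chu") / len(names))    # E(Selina Chu) = 1/12

# ...while an unseen pattern is estimated from the frequencies of its parts
print(marginal("John", 0) * marginal("Chu", 1))  # E(John Chu) = (4/12) * (3/12)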

  20. suffix_tree Preprocess(string r, string x)
        let Tr = Suffix_tree(r)
        let Tx = Suffix_tree(x)
        Annotate_f(w)(Tr)
        Annotate_f(w)(Tx)
        visit Tx in breadth-first traversal; for each node u do
            let w = L(u), m = |w|
            if w occurs in Tr then
                let Ê(w) = α fr(w)
            else
                find the largest 1 < l < m−1 such that … (using the suffix tree Tr)
                if such an l exists then let Ê(w) = …
                else let Ê(w) = …
            let z(w) = fx(w) − Ê(w)
            store z(w) in the node u
        return Tx

  21. void Tarzan(time_series R, time_series X, int l1, int a, int l2, real c)
        let x = Discretize_time_series(X, l1, a)
        let r = Discretize_time_series(R, l1, a)
        let Tx = Preprocess(r, x)
        for i = 1 to |x| − l2 + 1
            let w = x[i, i + l2 − 1]
            retrieve z(w) from Tx
            if |z(w)| > c then print i, z(w)
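
Putting the last two slides together, here is a rough, suffix-tree-free Python sketch of the scoring loop. It uses plain hash maps and only the simplest expectation Ê(w) = α·fr(w), so it is an approximation for illustration, not the actual Tarzan algorithm (which falls back to a Markov estimate for words that never occur in the reference string).

from collections import Counter

def word_counts(s, length):
    # Count all (overlapping) words of the given length in string s
    return Counter(s[i:i + length] for i in range(len(s) - length + 1))

def tarzan_like(r, x, l2, c):
    # r, x: already-discretized reference and test strings
    fr = word_counts(r, l2)
    fx = word_counts(x, l2)
    alpha = (len(x) - l2 + 1) / (len(r) - l2 + 1)  # scale reference counts to x's length
    surprises = []
    for i in range(len(x) - l2 + 1):
        w = x[i:i + l2]
        z = fx[w] - alpha * fr.get(w, 0)           # z(w) = f_x(w) - Ê(w)
        if abs(z) > c:
            surprises.append((i, w, z))
    return surprises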

  22. Experimental Evaluation. Sensitive and Selective, just like me. We would like to demonstrate two features of our proposed approach: • Sensitivity (High True Positive Rate): the algorithm can find truly surprising patterns in a time series. • Selectivity (Low False Positive Rate): the algorithm will not find spurious “surprising” patterns in a time series.

  23. We compare our work to two obvious rivals… • TSA-tree: A wavelet-based approach by Shahabi et al. The authors suggest a method to find both trends and “surprises” in large time series datasets. They achieve this using a wavelet-based tree structure that can represent the data at different scales. They define “surprise” in time series as “…sudden changes in the original time series data, which are captured by local maximums of the absolute values of (wavelet detail coefficients)”. • IMM: An immunology-inspired approach by Dasgupta et al. This work is inspired by the negative selection mechanism of the immune system, which discriminates between self and non-self. In this case self is the model of the time series learned from the reference dataset, and non-self is any observed pattern in the new dataset that does not conform to the model within some tolerance.
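
To make the TSA-tree notion of surprise concrete, here is a rough single-level Haar sketch in Python. The real TSA-tree keeps a multi-scale wavelet tree, so this only illustrates “local maxima of the absolute detail coefficients”; it is not their implementation.

import numpy as np

def haar_detail_surprise(ts):
    # One level of the Haar transform: detail coefficients are scaled
    # differences of adjacent pairs; "surprises" are local maxima of |detail|.
    ts = np.asarray(ts, dtype=float)
    if len(ts) % 2:
        ts = ts[:-1]                              # Haar pairs need an even length
    detail = np.abs((ts[0::2] - ts[1::2]) / np.sqrt(2))
    is_peak = (detail[1:-1] > detail[:-2]) & (detail[1:-1] > detail[2:])
    return np.where(is_peak)[0] + 1               # indices of surprising coefficients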

  24. Experiment 1. I created an anomaly by halving the period of the sine wave in the green section here. [Figure: Training data R and Test data X.] We constructed a reference dataset R by creating a sine wave with 800 datapoints and adding some Gaussian noise. We then built a test dataset X using the same parameters as the reference set; however, we also inserted an artificial anomaly by halving the period of the sine wave in the green section.
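
A small sketch of how such data could be generated; the period, noise level, and anomaly location below are illustrative guesses, not the exact values used in the experiment.

import numpy as np

rng = np.random.default_rng(0)
t = np.arange(800)

# Reference data R: a noisy sine wave with 800 datapoints
R = np.sin(2 * np.pi * t / 100) + rng.normal(0, 0.1, 800)

# Test data X: same process, but with the period halved in one section (the anomaly)
X = np.sin(2 * np.pi * t / 100) + rng.normal(0, 0.1, 800)
X[400:500] = np.sin(2 * np.pi * t[400:500] / 50) + rng.normal(0, 0.1, 100)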

  25. The training data is just a noisy sine wave. The test data has an anomaly, a subsection where the period was halved. The IMM algorithm failed to find the anomaly. But Tarzan detects the anomaly! So did the TSA Wavelet tree!

  26. Experiment 2. We consider a dataset that contains the power demand for a Dutch research facility for the entire year of 1997. The data is sampled as 15-minute averages, and thus contains 35,040 points. Demand for Power? Excellent! [Figure: the first 3 weeks of the power demand dataset, demand roughly 0 to 2,500.] Note the repeating pattern of a strong peak for each of the five weekdays, followed by relatively quiet weekends.

  27. [Figure: the most surprising weeks according to Tarzan, TSA Tree, and IMM.] We used the period from Monday January 6th to Sunday March 23rd as reference data. This time period is devoid of national holidays. We tested on the remainder of the year. We will just show the 3 most surprising subsequences found by each algorithm. For each of the 3 approaches we show the entire week (beginning Monday) in which the 3 largest values of surprise fell. Both TSA-tree and IMM returned sequences that appear to be normal workweeks; however, Tarzan returned 3 sequences that correspond to the weeks that contain national holidays in the Netherlands. In particular, from top to bottom, the week spanning both December 25th and 26th, and the weeks containing Wednesday April 30th (Koninginnedag, “Queen's Day”) and May 19th (Whit Monday). Mmm.. anomalous..

  28. Experiment 3. The previous experiments demonstrate the ability of Tarzan to find surprising patterns; however, we also need to consider Tarzan's selectivity. If even a small fraction of patterns flagged by our approach are false alarms, then, as we attempt to scale to massive datasets, we can expect to be overwhelmed by innumerable spurious “surprising” patterns. Shut up Flanders!! In designing an experiment to show selectivity we are faced with the problem of finding a database guaranteed to be free of surprising patterns. Because using a real data set for this task would always be open to subjective post-hoc explanations of results, we will conduct the experiment on random walk data. Och aye! By definition, random walk data can contain any possible pattern. In fact, as the size of a random walk dataset goes to infinity, we should expect to see every pattern repeated an infinite number of times. We can exploit this property to test our wee algorithm.

  29. If we train Tarzan on another short random walk dataset, we should expect that the test data would be found surprising, since the chance that similar patterns exist in the short training database is very small. However, as we increase the size of the training data, the surprisingness of the test data should decrease, since it is more likely that similar data was encountered. To restate, the intuition is this: the more experience our algorithm has seeing random walk data, the less surprising our particular section of random walk XRW should appear.
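
A minimal sketch of this setup, assuming random walks are generated as cumulative sums of Gaussian steps (the lengths below are illustrative):

import numpy as np

rng = np.random.default_rng(1)

def random_walk(n):
    # Random walk data: the cumulative sum of i.i.d. steps
    return np.cumsum(rng.normal(0, 1, n))

# A fixed test section, and reference datasets of increasing size.  As the
# reference grows, patterns like those in X_RW are more likely to have been
# seen before, so the measured surprise should decrease.
X_RW = random_walk(1000)
references = [random_walk(n) for n in (1_000, 10_000, 100_000)]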

  30. Future Work, or Tarzan's Revenge! Although we see Tarzan's ability to find surprising patterns without user intervention as a great advantage, we intend to investigate the possibility of incorporating user feedback and domain-based constraints. We have concentrated solely on the intricacies of finding the surprising patterns, without addressing the many meta-questions that arise. For example, the possible asymmetric costs of false alarms and false dismissals, and the actionability of discovered knowledge. We intend to address these issues in future work.

  31. Motif Discovery. [Figure: Winding Dataset (the angular speed of reel 2), about 2,500 points.] Informally, motifs are recurring patterns…

  32. Motif Discovery. To find these 3 motifs would require about 6,250,000 calls to the Euclidean distance function (roughly 2,500 × 2,500 pairwise subsequence comparisons).

  33. Why Find Motifs? · Mining association rules in time series requires the discovery of motifs. These are referred to as primitive shapes and frequent patterns. · Several time series classification algorithms work by constructing typical prototypes of each class. These prototypes may be considered motifs. · Many time series anomaly/interestingness detection algorithms essentially consist of modeling normal behavior with a set of typical shapes (which we see as motifs), and detecting future patterns that are dissimilar to all typical shapes. · In robotics, Oates et al. have introduced a method to allow an autonomous agent to generalize from a set of qualitatively different experiences gleaned from sensors. We see these “experiences” as motifs. · In medical data mining, Caraca-Valente and Lopez-Chavarrias have introduced a method for characterizing a physiotherapy patient's recovery based on the discovery of similar patterns. Once again, we see these “similar patterns” as motifs.

  34. Trivial Matches. [Figure: a time series T, Space Shuttle STS-57 telemetry (inertial sensor), about 1,000 points, with a subsequence C highlighted.] Definition 1. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) ≤ R, then M is called a matching subsequence of C. Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M' beginning at q' such that D(C, M') > R, and either q < q' < p or p < q' < q. Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has the highest count of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R)) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 ≤ i < K.
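
As a reference point, a brute-force 1-motif search that follows these definitions might look like the sketch below. It simplifies Definition 2 by requiring candidate start positions to be at least n apart, and it skips the variance tie-break; it is exactly the kind of quadratic search the next slide calls too slow.

import numpy as np

def brute_force_1_motif(T, n, R):
    # For every subsequence of length n, count matches within distance R,
    # excluding trivial matches; return the start index with the highest count.
    T = np.asarray(T, dtype=float)
    starts = range(len(T) - n + 1)
    subs = [T[p:p + n] for p in starts]
    best_p, best_count = 0, -1
    for p in starts:
        count = sum(
            1 for q in starts
            if abs(p - q) >= n                              # skip trivial matches
            and np.linalg.norm(subs[p] - subs[q]) <= R      # Definition 1: a match
        )
        if count > best_count:
            best_p, best_count = p, count
    return best_p, best_count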

  35. OK, we can define motifs, but how do we find them? The obvious brute force search algorithm is just too slow… Our algorithm is based on a “hot” idea from bioinformatics, random projection*. * J. Buhler and M. Tompa. Finding Motifs Using Random Projections. In RECOMB'01, 2001.

  36. A simple worked example of our motif discovery algorithm (the next 4 slides). Assume that we have a time series T of length m = 1,000, and a motif of length n = 16, which occurs twice, at time T1 and time T58. Each length-16 subsequence is discretized into a word of length w = 4 over the alphabet {a, b, c} (so a = 3), giving a matrix Ŝ of words indexed by starting position: row 1 = acba (= Ĉ1), row 2 = bcab, …, row 58 = acca, …, row 985 = bccc.

  37. A mask {1,2} was randomly chosen, so the values in columns {1,2} were used to project the matrix into buckets. Collisions are recorded by incrementing the appropriate location in the collision matrix.

  38. A mask {2,4} was randomly chosen, so the values in columns {2,4} were used to project the matrix into buckets. Once again, collisions are recorded by incrementing the appropriate location in the collision matrix.
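
The projection step itself is easy to sketch in Python. Below, the word "matrix" holds just the four rows shown in the worked example, the mask is fixed at two columns, and a dictionary stands in for the (sparse) collision matrix; all names are illustrative.

from collections import defaultdict
from itertools import combinations
import random

def random_projection_round(words, n_cols, collisions):
    # One iteration: pick two word positions at random, bucket the words that
    # agree on those positions, and increment the collision count for every
    # pair of rows that shares a bucket.
    mask = sorted(random.sample(range(n_cols), 2))
    buckets = defaultdict(list)
    for idx, w in words.items():
        buckets[tuple(w[c] for c in mask)].append(idx)
    for members in buckets.values():
        for i, j in combinations(sorted(members), 2):
            collisions[(j, i)] += 1
    return mask

# The rows from the worked example; after many rounds, cells whose counts far
# exceed the expected value (like the (58, 1) cell) point to candidate motifs.
words = {1: "acba", 2: "bcab", 58: "acca", 985: "bccc"}
collisions = defaultdict(int)
for _ in range(10):
    random_projection_round(words, n_cols=4, collisions=collisions)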

  39. We can calculate the expected values in the collision matrix, assuming there are NO patterns… Suppose E(k, a, w, d, t) = 2. [Figure: the collision matrix, with rows and columns indexed 1, 2, …, 58, …, 985; most cells hold small counts, but the cell for rows 1 and 58 holds 27, far above the expected value.]

  40. A Simple Experiment. Let's embed two motifs into a random walk time series, and see if we can recover them. [Figure: four subsequences labeled C, A, D, B, about 120 points each.]

  41. Planted Motifs C A B D

  42. “Real” Motifs. [Figure: two motifs, about 120 points each.]

  43. Some Examples of Real Motifs. Astrophysics (Photon Count). [Figure: photon-count time series.]

  44. How Fast Can We Find Motifs? [Plot: running time in seconds (up to 10k) versus length of time series (up to 5,000), with curves for Brute Force, TS-P i = 100, and TS-P Exp.]

  45. Future Work on Motif Detection • A more detailed theoretical analysis. • Significance testing. • Discovery of motifs in multidimensional time series • Discovery of motifs under different distance measures such as Dynamic Time Warping

  46. Rule Finding in Time Series. [Figure: two rule patterns, plotted over about 30 and 20 time units.] Support = 9.1, Confidence = 68.1. Das, G., Lin, K., Mannila, H., Renganathan, G. & Smyth, P. (1998). Rule Discovery from Time Series. In Proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining, New York, NY, Aug 27-31, pp. 16-22.

  47. Papers Based on Rule Discovery from Time Series • Mori, T. & Uehara, K. (2001). Extraction of Primitive Motion and Discovery of Association Rules from Human Motion. • Cotofrei, P. & Stoffel, K (2002). Classification Rules + Time = Temporal Rules. • Fu, T. C., Chung, F. L., Ng, V. & Luk, R. (2001). Pattern Discovery from Stock Time Series Using Self-Organizing Maps. • Harms, S. K., Deogun, J. & Tadesse, T. (2002). Discovering Sequential Association Rules with Constraints and Time Lags in Multiple Sequences. • Hetland, M. L. & Sætrom, P. (2002). Temporal Rules Discovery Using Genetic Programming and Specialized Hardware. • Jin, X., Lu, Y. & Shi, C. (2002). Distribution Discovery: Local Analysis of Temporal Rules. • Yairi, T., Kato, Y. & Hori, K. (2001). Fault Detection by Mining Association Rules in House-keeping Data. • Tino, P., Schittenkopf, C. & Dorffner, G. (2000). Temporal Pattern Recognition in Noisy Non-stationary Time Series Based on Quantization into Symbolic Streams. and many more…

  48. All these people are fooling themselves! They are not finding rules in time series, and it is easy to prove this!

  49. A Simple Experiment... “if a stock rises then falls greatly, followed by a smaller rise, then we can expect to see, within 20 time units, a pattern of rapid decrease followed by a leveling out.” The punch line is…

  50. If the top data miners in the world can fool themselves into finding patterns that don’t exist, what does that say about the field?
