DISCOVERING MOTIFS IN TIME SERIES
Duong Tuan Anh
Faculty of Computer Science and Technology
Ho Chi Minh City University of Technology
Tutorial MIWAI December 2012
OUTLINE
Introduction
Definitions of time series motifs
Applications of time series motifs
A time series is a collection of observations made sequentially in time
[Figure: plot of a time series of about 500 points with values between 23 and 29, shown next to an excerpt of its raw values: 25.1750, 25.2250, 25.2500, 25.2500, 25.2750, ..., 24.6250, 24.6750, 24.7500]
Examples: Financial time series, scientific time series
Q. Yang & X. Wu, “10 Challenging Problems in Data Mining Research”, Int. Journal on Information Technology and Decision Making, Vol. 5, No. 4 (2006), 597-604
3. Mining sequence data and time series data
Time series data mining is the field of data mining that deals with the challenges arising from the characteristics of time series data.
Time series data have the following characteristics:
Very large datasets (terabyte-sized)
Subjectivity (The definition of similarity depends on the user)
Different sampling rates
Noise, missing data, etc.
Classification
Clustering
Query by Content
Rule Discovery
Motif Discovery
Visualization
Novelty Detection
Problem Description
Unsupervised detection and modeling of previously unknown recurring patterns in real-valued time series
Discovery due to unknowns
J. Lin, E. Keogh, Patel, P. and Lonardi, S., Finding Motifs in Time Series, The 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 2002.
Given a time series T, a subsequence length n and a range R, the most significant motif in T (called the 1-Motif) is the subsequence C1 that has the highest count of non-trivial matches.
The Kth most significant motif in T (called the K-Motif) is the subsequence CK that has the highest count of non-trivial matches and satisfies D(CK, Ci) > 2R for all 1 ≤ i < K.
If the motifs are only required to be R distance apart, as in A, then the two motifs may share the majority of their elements. In contrast, B illustrates that requiring the centers to be at least 2R apart ensures that the motifs are unique.
best_motif_count_so_far = 0;
best_motif_location_so_far = null;
for i = 1 to length(T) - n + 1
    count = 0; pointers = null;
    for j = 1 to length(T) - n + 1
        if Non_Trivial_Match(C[i: i + n - 1], C[j: j + n - 1], R) then
            count = count + 1;
            pointers = append(pointers, j);
        end
    end
    if count > best_motif_count_so_far then
        best_motif_count_so_far = count;
        best_motif_location_so_far = i;
        motif_matches = pointers;
    end
end
The algorithm requires O(m^2) calls to the distance function, where m is the length of T.
The Non_Trivial_Match procedure calls the distance function.
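The brute-force search above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance is plain Euclidean distance without normalization, and a match is treated as trivial whenever the two windows overlap (|i - j| < n), which is a simplification of the non-trivial match test. The names `euclidean` and `find_1_motif_brute_force` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_1_motif_brute_force(T, n, R):
    """Return (location, matches) of the subsequence of length n with the
    most matches within range R. Indices are 0-based."""
    best_count = 0
    best_location = None
    best_matches = []
    last_start = len(T) - n  # last valid window start
    for i in range(last_start + 1):
        matches = []
        for j in range(last_start + 1):
            # Assumption: overlapping windows count as trivial matches.
            if abs(i - j) < n:
                continue
            if euclidean(T[i:i + n], T[j:j + n]) <= R:
                matches.append(j)
        if len(matches) > best_count:
            best_count = len(matches)
            best_location = i
            best_matches = matches
    return best_location, best_matches


if __name__ == "__main__":
    series = [0, 1, 0, 5, 5, 5, 0, 1, 0, 9, 9, 9, 0, 1, 0]
    print(find_1_motif_brute_force(series, n=3, R=0.5))
```

The two nested loops over all window positions make the quadratic number of distance calls visible; the methods discussed later in the tutorial aim to avoid most of these calls.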
Motifs can be used for time series classification. This can be done in two steps:
Buza, K. and Thieme, L. S.: Motif-based Classification of Time Series with Bayesian Networks and SVMs. In: A. Fink et al. (eds.) Advances in Data Analysis, Data Handling and Business Intelligences, Studies in Classification, Data Analysis, Knowledge Organization. Springer-Verlag, pp. 105-114 (2010).
Motif information is used to initialize k-means clustering of time series:
Phu, L. and Anh, D. T., Motif-based Method for Initialization k-Means Clustering of Time Series Data, Proc. of 24th Australasian Joint Conference (AI 2011), Perth, Australia, Dec. 5-8. Dianhui Wang, Mark Reynolds (Eds.), LNAI 7106, Springer-Verlag, 2011, pp. 11-20.
Jiang, Y., Li, C., Han, J.: Stock temporal prediction based on time series motifs. In: Proc. of 8th Int. Conf. on Machine Learning and Cybernetics, Baoding, China, July 12-15 (2009).
The process consists of 4 steps:
Gruber, C., Coduro, M., Sick, B.: Signature Verification with Dynamic RBF Networks and Time Series Motifs. In: Proc. of 10th Int. Workshop on Frontiers in Handwriting Recognition (2006).
Xi, X., Keogh, E., Wei, L., Mafra-Neto, A., Finding Motifs in a Database of Shapes, Proc. of SIAM 2007, pp. 249-270.
Random Projection
Mueen-Keogh Algorithm
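where $B = \beta_1, \ldots, \beta_{a-1}$ are called breakpoints ($\beta_0$ and $\beta_a$ are defined as $-\infty$ and $+\infty$).
Using the breakpoints, the time series $\bar{C} = \bar{c}_1, \ldots, \bar{c}_w$ will be discretized into the symbolic string $C = c_1 c_2 \ldots c_w$. Each segment will be coded as a symbol $c_i$ using the formula:
$$c_i = \begin{cases} \alpha_1 & \text{if } \bar{c}_i < \beta_1 \\ \alpha_k & \text{if } \beta_{k-1} \le \bar{c}_i < \beta_k \\ \alpha_a & \text{if } \bar{c}_i \ge \beta_{a-1} \end{cases}$$
where $\alpha_k$ indicates the kth symbol in the alphabet, $\alpha_1$ the 1st symbol in the alphabet and $\alpha_a$ the ath symbol in the alphabet.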
Table 1: A lookup table that contains the breakpoints that divide a Gaussian distribution in an arbitrary number (from 3 to 10) of equiprobable regions.
[Figure: a time series C (time axis 0 to 120) and its reduced representation, with each segment mapped to one of the symbols a, b or c]
Note we made two parameter choices:
The word size (w), in this case 8.
The alphabet size (cardinality, a), in this case 3.
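The discretization described above can be sketched as follows. This is a minimal illustration, assuming the series is z-normalized first, that its length is divisible by w, and that a value lying exactly on a breakpoint is assigned to the upper region. The function names `gaussian_breakpoints`, `paa` and `sax` are hypothetical. The breakpoints are computed from the inverse CDF of the standard normal distribution rather than read from a lookup table.

```python
from statistics import NormalDist, mean, pstdev


def gaussian_breakpoints(a):
    """Breakpoints beta_1..beta_{a-1} that cut N(0, 1) into a equiprobable regions."""
    return [NormalDist().inv_cdf(k / a) for k in range(1, a)]


def paa(series, w):
    """Mean of each of w equal-length segments (assumes len(series) % w == 0)."""
    seg = len(series) // w
    return [mean(series[i * seg:(i + 1) * seg]) for i in range(w)]


def sax(series, w, a):
    """Convert a series into a symbolic word of length w over an alphabet of size a."""
    mu, sigma = mean(series), pstdev(series)
    normalized = [(x - mu) / sigma for x in series]  # assumes sigma > 0
    betas = gaussian_breakpoints(a)
    alphabet = "abcdefghij"[:a]
    word = ""
    for c_bar in paa(normalized, w):
        # Count the breakpoints at or below the segment mean; that count
        # is the 0-based index of the symbol for this segment.
        k = sum(1 for beta in betas if c_bar >= beta)
        word += alphabet[k]
    return word


if __name__ == "__main__":
    print(gaussian_breakpoints(3))
    print(sax([0, 0, 0, 0, 10, 10, 10, 10, 5, 5, 5, 5], w=3, a=3))
```

With a = 3 the two breakpoints are close to -0.43 and 0.43, so segment means below -0.43 map to a, those between the breakpoints map to b, and the rest map to c.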
D(Q, Ca) ≤ D(Q, Cb) + D(Ca, Cb).
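One way the triangle inequality can reduce the number of distance function calls is sketched below. Rearranging the inequality gives the lower bound D(Q, Cb) ≥ |D(Q, Ca) - D(Ca, Cb)|, so if that bound already exceeds R, the exact distance D(Q, Cb) need not be computed. This is a minimal sketch under assumptions: the first candidate plays the role of the reference Ca, the distance is Euclidean, and `matches_within_range` is a hypothetical name, not a procedure from the tutorial.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matches_within_range(Q, candidates, R):
    """Return (hits, calls): indices of candidates within R of Q, and the
    number of distance calls that involve Q."""
    ref = candidates[0]  # assumption: first candidate is the reference Ca
    # Distances from the reference to every candidate; these do not depend
    # on Q, so in practice they could be computed once and reused.
    ref_to_all = [euclidean(ref, c) for c in candidates]
    d_q_ref = euclidean(Q, ref)
    calls = 1
    hits = []
    for idx, c in enumerate(candidates):
        # Lower bound on D(Q, c) from the triangle inequality.
        if abs(d_q_ref - ref_to_all[idx]) > R:
            continue  # cannot be within R, skip the exact distance
        calls += 1
        if euclidean(Q, c) <= R:
            hits.append(idx)
    return hits, calls


if __name__ == "__main__":
    cands = [[0, 1], [0, 0.5], [10, 10], [20, 20]]
    print(matches_within_range([0, 0], cands, R=1))
```

In the small example, the two distant candidates are rejected by the lower bound alone, so only three distance calls involve the query instead of four.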
Table 1: Experiments on the number of distance function calls (Stock dataset)
K. B. Pratt and E. Fink, “Search for patterns in compressed time series”, International Journal of Image and Graphics, vol. 2, no. 1, pp. 89-106, 2002.
Function getMotifCandidateSequence(T)
    N = length(T);
    EP = findSignificantExtremePoints(T, R);
    maxLength = MAX_MOTIF_LENGTH;
    for i = 1 to (length(EP) - 2) do
        motifCandidate = getSubsequence(T, EP[i], EP[i+2])
        if length(motifCandidate) > maxLength then
            addMotifCandidate(resample(motifCandidate, maxLength))
        else
            addMotifCandidate(motifCandidate)
        end if
    end for
end
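The candidate extraction above can be sketched in Python as follows. Two parts are stand-ins and should be read as assumptions: `find_local_extrema` simply returns strict local minima and maxima instead of applying the significant-extreme-point test with parameter R, and `resample` uses linear interpolation rather than spline interpolation or homothety. All function names are hypothetical.

```python
def find_local_extrema(T):
    """Stand-in for findSignificantExtremePoints: indices of strict local
    minima and maxima (no significance test is applied)."""
    return [i for i in range(1, len(T) - 1)
            if (T[i] > T[i - 1] and T[i] > T[i + 1])
            or (T[i] < T[i - 1] and T[i] < T[i + 1])]


def resample(seq, new_len):
    """Resample seq to new_len points by linear interpolation (new_len >= 2)."""
    scale = (len(seq) - 1) / (new_len - 1)
    out = []
    for k in range(new_len):
        x = k * scale
        lo = int(x)
        hi = min(lo + 1, len(seq) - 1)
        frac = x - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out


def get_motif_candidates(T, max_length):
    """Each candidate spans from one extreme point to the second one after it."""
    ep = find_local_extrema(T)
    candidates = []
    for i in range(len(ep) - 2):
        cand = T[ep[i]:ep[i + 2] + 1]
        if len(cand) > max_length:
            cand = resample(cand, max_length)
        candidates.append(cand)
    return candidates


if __name__ == "__main__":
    series = [0, 1, 2, 1, 0, 1, 2, 3, 2, 1, 0, 3]
    for c in get_motif_candidates(series, max_length=6):
        print(c)
```

Because every candidate runs between extreme points, the number of candidates grows with the number of extreme points rather than with the number of sliding-window positions.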
Spline Interpolation or homothety
Homothety is a transformation in affine space. Given a point O and a value k ≠ 0, a homothety with center O and ratio k transforms M to M’ such that $\overrightarrow{OM'} = k\,\overrightarrow{OM}$.
The Figure shows a homothety with center O and ratio k = ½ which transforms the triangle MNP to the triangle M’N’P’.
The algorithm that performs homothety to transform a motif candidate T with length N (T = {Y1,…,YN}) to motif candidate of length N’ is given as follows.
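A minimal sketch of how a homothety could rescale a motif candidate from length N to length N' is shown below. It is an illustration under assumptions, not the tutorial's algorithm: the center is taken as (N/2, (Ymax + Ymin)/2), the ratio as k = N'/N, and the N' output values are read off at evenly spaced positions along the transformed curve by linear interpolation. The names `homothety` and `rescale_by_homothety` are hypothetical.

```python
def homothety(points, center, k):
    """Apply a homothety with the given center and ratio k to (x, y) points."""
    ox, oy = center
    return [(ox + k * (x - ox), oy + k * (y - oy)) for x, y in points]


def rescale_by_homothety(Y, new_len):
    """Rescale the series Y (length N) to new_len points (new_len >= 2)."""
    n = len(Y)
    k = new_len / n
    # Assumed center: middle of the time axis and middle of the value range.
    center = (n / 2, (max(Y) + min(Y)) / 2)
    transformed = homothety([(i + 1, y) for i, y in enumerate(Y)], center, k)
    ys = [p[1] for p in transformed]
    out = []
    for j in range(new_len):
        # Transformed points stay evenly spaced along x, so an evenly spaced
        # sample position maps to a fractional index t into ys.
        t = (n - 1) * j / (new_len - 1)
        lo = int(t)
        hi = min(lo + 1, n - 1)
        frac = t - lo
        out.append(ys[lo] * (1 - frac) + ys[hi] * frac)
    return out


if __name__ == "__main__":
    print(homothety([(2, 2)], (0, 0), 0.5))
    print(rescale_by_homothety([0, 2, 4, 6], 2))
```

Note that a homothety scales the values as well as the time axis, so the amplitude of the rescaled candidate changes by the same ratio k.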
a) Uniform scaling
b) Shifting along the vertical axis
where b is the shifting parameter. (2)
From Eq. (2), we can derive a suitable value for the shifting parameter b such that we can find the best match between the two motif candidates Q’ and T’ as follows:
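As a worked illustration, assume that Eq. (2) is the Euclidean distance between Q’ and T’ minimized over the vertical shift b; this form is an assumption, as are the symbols $q'_i$, $t'_i$ for the points of the two candidates and N’ for their common length. Setting the derivative of the squared distance with respect to b to zero gives the best shift as the mean difference between the two candidates:

```latex
D(Q', T') = \min_{b} \sqrt{\sum_{i=1}^{N'} \left(q'_i - t'_i - b\right)^2}

\frac{\partial}{\partial b} \sum_{i=1}^{N'} \left(q'_i - t'_i - b\right)^2
  = -2 \sum_{i=1}^{N'} \left(q'_i - t'_i - b\right) = 0
\quad\Longrightarrow\quad
b = \frac{1}{N'} \sum_{i=1}^{N'} \left(q'_i - t'_i\right)
```

Under this assumption the best match is found in closed form, without searching over values of b.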
function getHierarchicalClustering(MCS, u, d)
    C = getInitialClustering(MCS)
    while size(C) > u do
        [Ci, Cj] = getMostSimilarClusters()
        addCluster(C, mergeClusters(Ci, Cj))
        removeCluster(C, Ci);
        removeCluster(C, Cj);
    endwhile
    return C
end
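The clustering loop above can be sketched in Python as follows. This is a minimal sketch under assumptions: every motif candidate starts in its own cluster, all candidates have the same length, and cluster similarity is measured by average-linkage Euclidean distance. The parameter d of the pseudocode is omitted because its role is not described here. The names `cluster_distance` and `hierarchical_clustering` are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(ci, cj):
    """Average-linkage distance between two clusters (an assumed choice)."""
    total = sum(euclidean(a, b) for a in ci for b in cj)
    return total / (len(ci) * len(cj))


def hierarchical_clustering(candidates, u):
    """Merge the two most similar clusters until at most u clusters remain."""
    clusters = [[c] for c in candidates]  # initial clustering: singletons
    while len(clusters) > max(u, 1):
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    cands = [[0, 0], [0, 1], [10, 10], [10, 11]]
    print(hierarchical_clustering(cands, u=2))
```

Each pass recomputes all pairwise cluster distances, which is simple but costly; it is workable here only because the number of motif candidates is much smaller than the number of points in the series.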
From the experimental results, we can see that:
Left) ECG dataset. Right) A motif was discovered in the dataset.
Left) Power dataset. Right) A motif was discovered in the dataset.
Dataset      ECG         Memory      Power       ECG
             7900        6800        35000       140000
Efficiency   0.0000638   0.0000920   0.0001304   0.0000635
This motif discovery method (EP_C) is very efficient, especially for large time series. It is much more efficient than Random Projection.
For example, experiment on Koski_ECG dataset from UCR Archive: http://www.cs.ucr.edu/~eamonn/SAX/koski_ecg.dat
This time series has 144,002 points; the run time for motif detection is 3 seconds.
6. Jiang, Y., Li, C., Han, J.: Stock temporal prediction based on time series motifs. In: Proc. of 8th Int. Conf. on Machine Learning and Cybernetics, Baoding, China, July 12-15 (2009).
7. Li, Q., Lopez, I.F.V. and Moon, B.: Skyline Index for time series data, IEEE Trans. on Knowledge and Data Engineering, Vol. 16, No. 4 (2004)
8. Phu, L. and Anh, D. T., Motif-based Method for Initialization k-Means Clustering of Time Series Data, Proc. of 24th Australasian Joint Conference (AI 2011), Perth, Australia, Dec. 5-8. Dianhui Wang, Mark Reynolds (Eds.), LNAI 7106, Springer-Verlag, 2011, pp. 11-20.
9. Son, N.T., Anh, D.T., Discovering Approximate Time Series Motif based on MP_C Method with the Support of Skyline Index, Proc. of 4th Int. Conf. on Knowledge and Systems Engineering (KSE’2012), Aug. 17-19, Da Nang, Vietnam (to appear).
10. Nguyen Thanh Son, Duong Tuan Anh, Time Series Similarity Search based on Middle Points and Clipping, Proc. of 3rd Conference on Data Mining and Optimization (DMO), 27-29 June, 2011, Putrajaya, Malaysia, pp. 13-19.
11. Xi, X., Keogh, E., Wei, L., Mafra-Neto, A., Finding Motifs in a Database of Shapes, Proc. of SIAM 2007, pp. 249-270.