SAX: A Novel Symbolic Representation of Time Series. Authors: Jessica Lin, Eamonn Keogh, Li Wei, Stefano Lonardi. Presenter: Arif Bin Hossain. Slides incorporate materials kindly provided by Prof. Eamonn Keogh.
Join: Given two data collections, link items occurring in each
Annotation: obtain additional information from given data
Query by content: Given a large data collection, find the k most similar objects to an object of interest.
Clustering: Given an unlabeled dataset, arrange its objects into groups by their mutual similarity
Classification: Given a labeled training set, classify future unlabeled examples
Anomaly Detection: Given a large collection of objects, find the one that is most different to all the rest.
Motif Finding: Given a large collection of objects, find the pair that is most similar.
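Three of the tasks above (query by content, anomaly detection, motif finding) can be sketched with brute-force search. This is only an illustration: the function names are hypothetical, Euclidean distance is an assumed similarity measure, and "most different to all the rest" is interpreted here as "largest distance to its nearest neighbor", which is one possible reading.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_most_similar(collection, query, k):
    """Query by content: indices of the k objects closest to the query."""
    order = sorted(range(len(collection)),
                   key=lambda i: euclidean(collection[i], query))
    return order[:k]

def find_motif(collection):
    """Motif finding: index pair (i, j) of the two most similar objects."""
    return min(combinations(range(len(collection)), 2),
               key=lambda ij: euclidean(collection[ij[0]], collection[ij[1]]))

def find_anomaly(collection):
    """Anomaly detection (assumed definition): index of the object whose
    nearest neighbor is farthest away."""
    def nearest_neighbor_distance(i):
        return min(euclidean(collection[i], collection[j])
                   for j in range(len(collection)) if j != i)
    return max(range(len(collection)), key=nearest_neighbor_distance)

data = [[0, 0], [0, 1], [5, 5], [5, 5.1], [20, 20]]
nearest = k_most_similar(data, [0, 0.2], 2)
motif = find_motif(data)
anomaly = find_anomaly(data)
```

All three functions compare every object with every other, so the cost grows quadratically with the size of the collection.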
For example, suppose you have one gig of main memory and want to do K-means clustering…
Clustering ¼ gig of data, 100 sec
Clustering ½ gig of data, 200 sec
Clustering 1 gig of data, 400 sec
Clustering 1.1 gigs of data, few hours
Bradley, Fayyad, & Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9-15
SAX allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w << n)
The data is divided into w equal-sized frames.
The mean value of the data falling within each frame is calculated.
The vector of these mean values becomes the PAA (Piecewise Aggregate Approximation) representation.
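The frame-averaging steps above, followed by the conversion to a string, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, n is assumed to be a multiple of w, and the symbol step assumes a z-normalized series cut at breakpoints that split a standard normal distribution into equiprobable regions (the slides here do not list the breakpoints).

```python
from statistics import NormalDist, mean, pstdev

def znorm(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]

def paa(series, w):
    """Mean of each of w equal-sized frames.
    Assumption: len(series) is a multiple of w."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes n is a multiple of w")
    size = n // w
    return [mean(series[i * size:(i + 1) * size]) for i in range(w)]

def to_symbols(paa_values, alphabet="abcd"):
    """Map each PAA value to a letter using equiprobable breakpoints
    of the standard normal distribution (assumed discretization)."""
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    # The symbol index is the number of breakpoints lying below the value.
    return "".join(alphabet[sum(v > b for b in breakpoints)]
                   for v in paa_values)

series = [1, 2, 3, 4, 5, 6, 7, 8]
frames = paa(series, 4)
word = to_symbols(paa(znorm(series), 4))
```

Here a series of length n = 8 is reduced to w = 4 frame means and then to a 4-letter word.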
The sample dataset consists of three series each from the decreasing trend, upward shift, and normal classes
Assign each point to the one of the k clusters whose center is nearest
Each iteration tries to minimize the sum of squared intra-cluster error
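The two k-means steps above can be sketched in plain Python. This is an illustrative sketch rather than the implementation used in the experiments: the function names, the random initialization from the data points, and the fixed iteration count are all assumptions.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centers = rng.sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(dim) / len(members)
                              for dim in zip(*members)]
    labels = assign(points, centers)
    # Objective: sum of squared distances from points to their centers.
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, sse

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers, sse = kmeans(points, k=2)
```

Each pass reassigns points and then recomputes the centers, and neither step increases the sum of squared error.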
SAX beats Euclidean distance due to the smoothing effect of dimensionality reduction
Since decision trees are expensive to use with high-dimensional datasets, the Regression Tree [Geurts.2001] is a better approach for data mining on time series
Use binary numbers for labeling the words
Different alphabet sizes (cardinalities) within a word
Comparison of words with different cardinalities
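A small sketch of binary labels and mixed-cardinality comparison follows. The slides do not spell out the comparison rule, so this rests on assumptions: cardinalities are powers of two, symbols are numbered from the lowest region upward, and breakpoints are nested so that a symbol's label at a lower cardinality is a prefix of its label at a higher one. The helper names and the (symbol, bits) word layout are hypothetical.

```python
def to_binary(symbol, bits):
    """Binary label of a symbol at cardinality 2**bits."""
    return format(symbol, "0{}b".format(bits))

def demote(symbol, from_bits, to_bits):
    """Reduce a symbol to a lower cardinality by dropping low-order bits."""
    return symbol >> (from_bits - to_bits)

def compatible(word_a, word_b):
    """Words are lists of (symbol, bits) pairs. Two words are treated as
    compatible if, position by position, the symbols agree once both are
    reduced to the lower of the two cardinalities (assumed rule)."""
    for (sym_a, bits_a), (sym_b, bits_b) in zip(word_a, word_b):
        low = min(bits_a, bits_b)
        if demote(sym_a, bits_a, low) != demote(sym_b, bits_b, low):
            return False
    return True

label = to_binary(6, 3)                      # symbol 6 at cardinality 8
coarse = to_binary(demote(6, 3, 2), 2)       # same symbol at cardinality 4
same = compatible([(6, 3), (1, 1)], [(3, 2), (7, 3)])
different = compatible([(6, 3)], [(2, 2)])
```

Reducing both symbols to the lower cardinality gives a coarser comparison than raising the lower one, but it needs no knowledge of the underlying values.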