SAX: a Novel Symbolic Representation of Time Series

Authors: Jessica Lin, Eamonn Keogh, Li Wei, Stefano Lonardi

Presenter: Arif Bin Hossain

Slides incorporate materials kindly provided by Prof. Eamonn Keogh

Time Series
  • A time series is a sequence of data points, typically measured at successive times spaced at uniform intervals. [Wiki]
  • Examples:
    • Economic, sales, and stock market forecasting
    • EEG, ECG, BCI analysis

[Figure: an example time series plotted over roughly 8,000 samples]

Problems

Join: Given two data collections, link the items occurring in each

Annotation: Given data, obtain additional information about it

Query by content: Given a large data collection, find the k objects most similar to an object of interest

Clustering: Given an unlabeled dataset, arrange its items into groups by their mutual similarity

Problems (Cont.)

Classification: Given a labeled training set, classify future unlabeled examples

Anomaly Detection: Given a large collection of objects, find the one that is most different from all the rest.

Motif Finding: Given a large collection of objects, find the pair that is most similar.

Data Mining Constraints

For example, suppose you have one gig of main memory and want to do K-means clustering…

Clustering ¼ gig of data: 100 sec

Clustering ½ gig of data: 200 sec

Clustering 1 gig of data: 400 sec

Clustering 1.1 gigs of data: a few hours

Once the data no longer fits in main memory, runtime is dominated by disk access rather than computation.

P. Bradley, U. Fayyad & C. Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9–15

Generic Data Mining
  • Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest
  • Approximately solve the problem at hand in main memory
  • Make (hopefully very few) accesses to the original data on disk to confirm the solution
Why Symbolic Representation?
  • Dimensionality reduction
  • Numerosity reduction
  • Hashing
  • Suffix trees
  • Markov models
  • Stealing ideas from the text processing / bioinformatics community
Symbolic Aggregate ApproXimation (SAX)
  • Lower bounding of Euclidean distance
  • Lower bounding of the DTW distance
  • Dimensionality Reduction
  • Numerosity Reduction

Example SAX word: baabccbc

SAX

Allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w << n)

Notation:
  • C — a time series C = c1,…,cn of length n
  • C̄ — the PAA representation of C: C̄ = c̄1,…,c̄w
  • Ĉ — the SAX (symbolic) representation of C: Ĉ = ĉ1,…,ĉw
  • w — the number of PAA segments (the word length)
  • a — the alphabet size (e.g. a = 3 for the alphabet {a, b, c})

How to obtain SAX?
  • Step 1: Reduce dimension by PAA (Piecewise Aggregate Approximation)
    • A time series C of length n can be represented in a w-dimensional space by the vector C̄ = c̄1,…,c̄w
    • The i-th element is calculated by

      \bar{c}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1) + 1}^{\frac{n}{w} i} c_j

    • Example: reducing the dimension from 20 to 5, the 2nd element is the mean of points c5,…,c8:

      \bar{c}_2 = \frac{5}{20} \sum_{j=5}^{8} c_j = \frac{1}{4}(c_5 + c_6 + c_7 + c_8)

      (a code sketch of this step follows the PAA figure below)
How to obtain SAX?

The data is divided into w equal-sized frames.

The mean value of the data falling within each frame is calculated.

The vector of these means is the PAA representation.

[Figure: a time series C (plotted over 0–120 samples) and its PAA approximation C̄]
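To make the PAA step concrete, here is a minimal sketch in Python/NumPy. It assumes n is divisible by w; the function name paa and the example values are illustrative, not taken from the paper.

```python
import numpy as np

def paa(series, w):
    """Piecewise Aggregate Approximation (sketch): reduce a length-n series
    to the w means of its equal-sized frames. Assumes n is divisible by w."""
    c = np.asarray(series, dtype=float)
    n = len(c)
    assert n % w == 0, "this sketch assumes n is a multiple of w"
    # split into w frames of n/w points each and average every frame
    return c.reshape(w, n // w).mean(axis=1)

# Example: reduce a length-20 series to w = 5 values; the 2nd element is the
# mean of points c_5..c_8 (1-based indexing), matching the formula above.
print(paa(np.arange(20), 5))   # -> [ 1.5  5.5  9.5 13.5 17.5]
```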

[Figure: the PAA segments of C mapped to the symbols a, b, c using breakpoints under the Gaussian curve]

How to obtain SAX?
  • Step 2: Discretization
    • Z-normalize the time series; normalized (sub)sequences are approximately Gaussian distributed
    • Determine breakpoints that divide the area under the Gaussian curve into a equal-sized regions, where a is the alphabet size

Example: word length w = 8 and alphabet size a = 3 give the SAX word baabccbc (a code sketch of the full conversion follows).
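A minimal sketch of the whole conversion (normalize, PAA, discretize), assuming the equiprobable Gaussian breakpoints described above; the helper names breakpoints_for and sax_word are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

def breakpoints_for(a):
    """Breakpoints that cut the standard Gaussian into a equal-area regions."""
    return norm.ppf(np.arange(1, a) / a)               # a - 1 breakpoints

def sax_word(series, w, a, alphabet="abcdefghij"):
    """Sketch: z-normalize, apply PAA, then map each PAA value to a symbol."""
    c = np.asarray(series, dtype=float)
    c = (c - c.mean()) / c.std()                        # z-normalize
    paa_vals = c.reshape(w, len(c) // w).mean(axis=1)   # PAA (n divisible by w)
    # index of the Gaussian region each PAA value falls in: 0 -> 'a', 1 -> 'b', ...
    regions = np.digitize(paa_vals, breakpoints_for(a))
    return "".join(alphabet[r] for r in regions)

# e.g. an 8-symbol word over a 3-letter alphabet, as in the baabccbc example:
# word = sax_word(my_series, w=8, a=3)
```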

Distance Measure
  • Given two time series Q and C of length n
    • Euclidean distance:

      D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}

    • Distance after transforming the subsequences to PAA (this lower bounds the Euclidean distance):

      DR(\bar{Q}, \bar{C}) = \sqrt{\tfrac{n}{w}} \sqrt{\sum_{i=1}^{w} (\bar{q}_i - \bar{c}_i)^2}
Distance Measure
  • Define MINDIST, the distance after transforming to the symbolic representation:

      MINDIST(\hat{Q}, \hat{C}) = \sqrt{\tfrac{n}{w}} \sqrt{\sum_{i=1}^{w} \big( dist(\hat{q}_i, \hat{c}_i) \big)^2}

    where dist() is looked up in a table built from the breakpoints: equal or adjacent symbols contribute 0, otherwise the gap between the relevant breakpoints.
  • MINDIST lower bounds the true Euclidean distance between the original time series (a code sketch follows)
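A sketch of MINDIST under the same assumptions (equiprobable Gaussian breakpoints, symbols 'a', 'b', … in increasing order); the function name and argument layout are illustrative.

```python
import numpy as np
from scipy.stats import norm

def mindist(word_q, word_c, n, a, alphabet="abcdefghij"):
    """Sketch of MINDIST between two SAX words built from length-n series."""
    beta = norm.ppf(np.arange(1, a) / a)          # Gaussian breakpoints
    q = [alphabet.index(s) for s in word_q]
    c = [alphabet.index(s) for s in word_c]

    def cell(r, s):
        # lookup-table rule: equal or adjacent symbols contribute 0,
        # otherwise the gap between the relevant breakpoints
        if abs(r - s) <= 1:
            return 0.0
        return beta[max(r, s) - 1] - beta[min(r, s)]

    w = len(word_q)
    return np.sqrt(n / w) * np.sqrt(sum(cell(r, s) ** 2 for r, s in zip(q, c)))

# e.g. mindist("baabccbc", "babcacca", n=128, a=3)
```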
Numerosity Reduction
  • Subsequences are extracted with a sliding window and converted to SAX words
  • Consecutive subsequences are often identical once converted
    • If the sliding window yields aabbcc
    • and the next subsequence is also aabbcc, only its position is stored
  • The savings depend on the data, but a reduction factor of 2 or 3 is typical
    • e.g. space shuttle telemetry with subsequence length 32 (a sketch of the idea follows this list)
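A toy sketch of the numerosity-reduction idea: keep a sliding-window word only when it differs from the last word stored. The function name and the (offset, word) input format are assumptions.

```python
def numerosity_reduce(window_words):
    """Sketch: drop consecutive duplicate SAX words, keeping their offsets.
    `window_words` is an iterable of (offset, sax_word) pairs."""
    kept, last = [], None
    for offset, word in window_words:
        if word != last:              # a new word starts a new run
            kept.append((offset, word))
            last = word
    return kept

# Consecutive windows that all map to "aabbcc" collapse to a single entry
# that remembers where the run started.
```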
Experimental Validation
  • Clustering
    • Hierarchical
    • Partitional
  • Classification
    • Nearest neighbor
    • Decision tree
  • Motif discovery
Hierarchical Clustering

The sample dataset consists of three decreasing-trend, three upward-shift, and three normal-class time series.

Partitional Clustering (k-means)

Assign each point to the cluster whose center is nearest

Each iteration tries to minimize the sum of squared intra-cluster error
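For reference, a bare-bones k-means in NumPy operating on fixed-length vectors (e.g. the PAA/SAX-derived representations used in the experiments); this is a generic sketch, not the authors' experimental code.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Sketch of k-means: `points` is an (m, d) array of feature vectors."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: each point joins the cluster with the nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its members
        centers = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers
```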

Nearest Neighbor Classification

SAX beats Euclidean distance due to the smoothing effect of dimensionality reduction
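For illustration, 1-nearest-neighbour classification in SAX space, reusing the mindist sketch above; the (sax_word, label) training-set format is an assumption.

```python
def nn_classify(query_word, train, n, a):
    """Sketch: return the label of the training word closest to `query_word`
    under MINDIST. `train` is a list of (sax_word, label) pairs."""
    word, label = min(train, key=lambda wl: mindist(query_word, wl[0], n, a))
    return label
```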

Decision Tree Classification

Since decision trees are expensive to use with high-dimensional data, the Regression Tree approach [Geurts, 2001] is better suited to data mining on time series

Motif Discovery
  • Implemented the random projection algorithm of Buhler and Tompa [RECOMB 2001]
    • Hash subsequences into buckets, using a random subset of their features (symbol positions) as the key (see the sketch below)
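A sketch of one pass of the random-projection idea over SAX words: choose a random subset of the word positions and bucket words by the symbols at those positions; pairs that collide across many passes are motif candidates. Names and the pass structure are illustrative.

```python
import random
from collections import defaultdict

def random_projection_pass(sax_words, mask_size, seed=0):
    """Sketch: bucket equal-length SAX words by a random subset of positions."""
    rng = random.Random(seed)
    w = len(sax_words[0])
    mask = sorted(rng.sample(range(w), mask_size))    # random columns to keep
    buckets = defaultdict(list)
    for i, word in enumerate(sax_words):
        key = "".join(word[p] for p in mask)          # projected word
        buckets[key].append(i)
    return buckets

# Run several passes with different seeds and count how often each pair of
# words shares a bucket; frequently colliding pairs are likely motifs.
```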
New Version: iSAX

Words are labeled with binary numbers

Different alphabet sizes (cardinalities) can be used within a single word

Words with different cardinalities can still be compared
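A toy illustration of the cardinality-comparison idea, assuming the usual iSAX convention that each symbol is written as a binary string and that a lower-cardinality symbol is the bit-prefix of the higher-cardinality symbols it covers.

```python
def covers(sym_a, sym_b):
    """Sketch: two iSAX symbols of (possibly) different cardinality refer to
    overlapping regions exactly when the shorter bit string is a prefix of
    the longer one."""
    short, long_ = sorted((sym_a, sym_b), key=len)
    return long_.startswith(short)

# '11' at cardinality 4 covers '110' and '111' at cardinality 8
print(covers("11", "110"), covers("10", "110"))   # True False
```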

Thank you

Questions?