Overview of Anomaly Detection in Time Series Data


LÊ VĂN QUỐC ANH

Outline
  • Introduction
  • Anomaly detection approaches
    • Classification based
    • Nearest Neighbor Based
    • Predictive
    • Window-Based
    • Disk Aware Discord Discovery
    • Other approaches
  • Comments
  • Conclusion
  • References
  • Time series data problems:
    • Similarity search
    • Classification
    • Clustering
    • Motif discovery
    • Anomaly/novelty detection
    • Visualization

* [Keogh]

Problem Definition
  • Anomaly/novelty detection refers to the problem of finding patterns in data that do not conform to expected behavior
Problem Definition (cont.)
  • Finding discords in large scale time series

[V. Chandola]

  • Intrusion detection for cyber-security
  • Fraud detection for credit cards
  • Fault detection in safety critical systems
  • Industrial damage detection
  • Medical and public health anomaly detection
  • Stock market analysis
Existing anomaly detection techniques
  • Classification based
  • Nearest Neighbor Based
  • Predictive
  • Window-Based
  • Disk Aware Discord Discovery
  • Other techniques
Classification based approaches
  • Learn a model from a set of labeled data instances, then classify a test instance into one of the classes using the learned model
  • Operate in two phases:
    • training phase: learn a model from the training data
    • testing phase: classify a test instance as normal or anomalous
  • Assumption: A classifier that can distinguish between normal and anomalous classes can be learnt in the given feature space.
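
The two phases can be sketched as follows. This is a minimal illustration, not one of the techniques listed in the slides: it assumes a nearest-centroid classifier and a hypothetical feature map (`features`, the mean and standard deviation of a window) in place of a neural network, Bayesian network, SVM, or rule-based model.

```python
from statistics import mean, pstdev

def features(window):
    """Hypothetical feature map (an assumption): mean and standard deviation."""
    return (mean(window), pstdev(window))

def train(labeled_windows):
    """Training phase: compute one centroid per class in feature space."""
    sums = {}
    for window, label in labeled_windows:
        f = features(window)
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((total[0] + f[0], total[1] + f[1]), count + 1)
    return {label: (t[0] / n, t[1] / n) for label, (t, n) in sums.items()}

def classify(model, window):
    """Testing phase: assign the label of the nearest class centroid."""
    f = features(window)
    return min(
        model,
        key=lambda lab: (f[0] - model[lab][0]) ** 2 + (f[1] - model[lab][1]) ** 2,
    )

# Toy labeled training set (made up for illustration).
training = [
    ([1.0, 1.1, 0.9, 1.0], "normal"),
    ([0.9, 1.0, 1.1, 1.0], "normal"),
    ([1.0, 5.0, -3.0, 1.0], "anomalous"),
    ([0.5, 6.0, -4.0, 1.5], "anomalous"),
]
model = train(training)
print(classify(model, [1.0, 0.95, 1.05, 1.0]))
print(classify(model, [1.0, 5.5, -3.5, 1.0]))
```

Training is done once; each test instance then costs only one distance computation per class, which is why the testing phase is fast.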
Classification based approaches (cont.)
  • Some techniques:
    • Neural Networks based
    • Bayesian Networks based
    • Support Vector Machines based
    • Rule based
Classification based approaches (cont.)
  • Advantages:
    • can distinguish between instances belonging to different classes
    • testing phase is fast
  • Disadvantages:
    • have to assign a label to each test instance
    • rely on availability of accurate labels for various normal classes
Nearest Neighbor Based
  • Assumption: Normal data instances occur in dense neighborhoods, while anomalies occur far from their closest neighbors.
  • Requires a distance measure defined between two data instances
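
A minimal sketch of this idea, assuming Euclidean distance and using the distance to the k-th nearest neighbor as the anomaly score (one common choice; the slides do not fix a particular score). The names `knn_scores` and `points` are illustrative.

```python
import math

def knn_scores(instances, k=2, dist=math.dist):
    """Anomaly score of each instance: distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(instances):
        others = sorted(dist(p, q) for j, q in enumerate(instances) if j != i)
        scores.append(others[k - 1])
    return scores

# Four points in a dense neighborhood and one point far from all of them.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_scores(points, k=2)
outlier = max(range(len(points)), key=scores.__getitem__)
print(outlier, scores[outlier])
```

The `dist` argument is the only place where domain knowledge enters, so swapping in a different measure changes the result without touching the rest of the code.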
Nearest Neighbor Based (cont.)
  • Advantages:
    • purely data driven
  • Disadvantages:
    • if the data has normal instances that do not have enough close neighbors or if the data has anomalies that have enough close neighbors, the technique fails to label them correctly
    • performance greatly relies on a distance measure
    • defining distance measures between instances can be challenging when the data is complex
Predictive techniques
  • Forecast the next observation in the time series, using the statistical model and the time series observed so far, and compare the forecasted observation with the actual observation to determine if an anomaly has occurred.
  • Some techniques: regression, autoregression (AR), ARMA, ARIMA, SVR (Support Vector Regression)
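
The forecast-and-compare loop can be sketched with the simplest of these models. The example assumes an AR(1) model fitted by least squares and a three-sigma threshold on the forecast error; both are illustrative choices rather than anything prescribed by the slides.

```python
import math
from statistics import mean, pstdev

def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    x, y = series[:-1], series[1:]
    mx, my = mean(x), mean(y)
    phi = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my - phi * mx, phi

def residuals(series, c, phi):
    """Forecast error (actual minus forecast) for t = 1 .. len(series) - 1."""
    return [series[t] - (c + phi * series[t - 1]) for t in range(1, len(series))]

def detect(series, c, phi, sigma, n_sigmas=3.0):
    """Indices whose forecast error exceeds n_sigmas * sigma."""
    return [
        t + 1
        for t, r in enumerate(residuals(series, c, phi))
        if abs(r) > n_sigmas * sigma
    ]

# Fit on an anomaly-free series, then test on a copy with one injected spike.
train = [math.sin(0.3 * t) for t in range(100)]
c, phi = fit_ar1(train)
sigma = pstdev(residuals(train, c, phi))

test = [math.sin(0.3 * t) for t in range(40)]
test[20] += 3.0
flagged = detect(test, c, phi, sigma)
print(flagged)
```

The spike at index 20 is flagged, and so is index 21, because its forecast was computed from the spiked value.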
Predictive techniques (cont.)
  • Advantages:
    • provide a statistically justifiable solution for anomaly detection if the assumptions regarding the underlying data distribution hold true
  • Disadvantages:
    • rely on the assumption that the data is generated from a particular distribution
Window-Based
  • Extract fixed length (w) windows from a test time series, and assign an anomaly score to each window. The per-window scores are then aggregated to obtain the anomaly score for the test time series.
  • Some proposed techniques:
    • HOT SAX
    • AWDD
    • WAT
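
A generic sketch of the window-based scheme (not HOT SAX, AWDD, or WAT specifically). It assumes that each test window is scored by its Euclidean distance to the closest window of an anomaly-free reference series, and that the per-window scores are aggregated by their mean; both choices are assumptions made for illustration.

```python
import math

def windows(series, w):
    """All sliding windows of fixed length w."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

def window_scores(test, reference, w):
    """Score each test window by its distance to the closest reference window."""
    ref_windows = windows(reference, w)
    return [min(math.dist(tw, rw) for rw in ref_windows) for tw in windows(test, w)]

def series_score(test, reference, w):
    """Aggregate per-window scores into one score (here: the mean)."""
    scores = window_scores(test, reference, w)
    return sum(scores) / len(scores)

reference = [float(t % 4) for t in range(40)]   # anomaly-free sawtooth
clean = [float(t % 4) for t in range(20)]
odd = list(clean)
odd[10] = 9.0                                   # one corrupted observation

print(series_score(clean, reference, 4))
print(series_score(odd, reference, 4))
```

Only the windows that overlap the corrupted observation receive a non-zero score, so `window_scores` also localizes the anomaly inside the series.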
HOT SAX
  • [Eamonn Keogh, Jessica Lin, Ada Fu]
  • Finding the most unusual time series subsequence
    • discord
  • Improves the BFDD (Brute Force Discord Discovery) algorithm with heuristic ordering
  • Uses SAX for discretization
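
A minimal sketch of the SAX discretization step only: z-normalize a subsequence, reduce it with Piecewise Aggregate Approximation (PAA), and map each segment mean to a symbol using equiprobable breakpoints of the standard normal distribution. The function names and the defaults (4 segments, alphabet `abcd`) are assumptions for illustration.

```python
from statistics import NormalDist, mean, pstdev

def znorm(seq):
    """Z-normalize; a constant sequence maps to all zeros."""
    mu, sd = mean(seq), pstdev(seq)
    return [0.0] * len(seq) if sd == 0 else [(x - mu) / sd for x in seq]

def paa(seq, n_segments):
    """Piecewise Aggregate Approximation; assumes len(seq) % n_segments == 0."""
    size = len(seq) // n_segments
    return [mean(seq[i * size:(i + 1) * size]) for i in range(n_segments)]

def sax_word(seq, n_segments=4, alphabet="abcd"):
    """Map a subsequence to a SAX word over the given alphabet."""
    a = len(alphabet)
    # Breakpoints split the standard normal curve into a equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    return "".join(
        alphabet[sum(v > b for b in breakpoints)]
        for v in paa(znorm(seq), n_segments)
    )

print(sax_word([1, 2, 3, 4, 5, 6, 7, 8]))   # abcd
print(sax_word([8, 7, 6, 5, 4, 3, 2, 1]))   # dcba
```

Subsequences that share a word are likely to be close to each other, which is what makes the words usable for ordering the discord search.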








AWDD technique
  • M. Chuah, F. Fu (2006)
  • AWDD - Adaptive Window Based Discord Discovery
  • Applied to ECG time series
AWDD technique (cont.)
  • Advantages:
    • use adaptive rather than fixed windows
  • Disadvantages:
    • deal only with ECG datasets
WAT technique
  • Y. Bu et al. (2007)
  • WAT - Wavelet and Augmented Trie
  • Applies the Haar wavelet transform and then symbol-word mapping to the raw time series, building a prefix tree (trie) that drives the inner and outer loop heuristics
  • can view a subsequence in different resolutions
    • the first symbol of each word gives us the lowest resolution for each subsequence
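
A minimal sketch of the Haar transform step, using the unnormalized pairwise averaging and differencing form. It does not cover the symbol mapping or the trie construction of WAT; the function name `haar` is illustrative.

```python
def haar(seq):
    """Haar transform by repeated pairwise averaging and differencing.

    Returns [overall average, coarsest detail, ..., finest details].
    Assumes len(seq) is a power of two.
    """
    details = []
    current = list(seq)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = diffs + details      # coarser details go in front
        current = averages
    return current + details

print(haar([9, 7, 3, 5]))   # [6.0, 2.0, 1.0, -1.0]
```

The leading coefficient alone describes the subsequence at the lowest resolution, and each further coefficient refines that view. This is the property behind viewing one subsequence at several resolutions.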
WAT technique (cont.)
  • Advantages:
    • requires 2 parameters (1 of them intuitive)
    • better performance than HOT SAX
  • Disadvantages:
    • assumes the wavelet coefficients follow a Gaussian distribution
    • assumes that the data reside in main memory
DADD technique
  • DADD - Disk Aware Discord Discovery (2008)

[Yankov, Keogh and Rebbapragada]

  • Finding unusual time series in terabyte sized datasets on secondary memory
  • Algorithm has two phases:
    • Phase 1: a candidate selection phase
      • given a threshold r, find a candidate set containing all discords whose distance to their nearest neighbor is at least r
    • Phase 2: a discord refinement phase
      • remove all false discords from the candidate set
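
The two phases can be sketched as below. This is a simplified in-memory illustration in which a Python list stands in for the sequential disk scan. Euclidean distance and the names `select_candidates` and `refine` are assumptions.

```python
import math

def select_candidates(data, r):
    """Phase 1 (first scan): keep indices that may be discords.

    A candidate is dropped as soon as some other item lies closer than r.
    """
    candidates = []
    for i, s in enumerate(data):
        is_candidate = True
        survivors = []
        for j in candidates:
            if math.dist(s, data[j]) < r:
                is_candidate = False   # s and data[j] are within r of each other
            else:
                survivors.append(j)
        candidates = survivors
        if is_candidate:
            candidates.append(i)
    return candidates

def refine(data, candidates, r):
    """Phase 2 (second scan): remove false discords, record true NN distances."""
    nn = {j: math.inf for j in candidates}
    for i, s in enumerate(data):
        for j in list(nn):
            if i == j:
                continue
            d = math.dist(s, data[j])
            if d < r:
                del nn[j]              # a neighbor closer than r: not a discord
            else:
                nn[j] = min(nn[j], d)
    return nn

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
candidates = select_candidates(data, r=1.0)
discords = refine(data, candidates, r=1.0)
print(candidates, discords)
```

In this toy run the first phase keeps index 2 as a false candidate, because the items close to it were scanned before it joined the candidate set. The second scan removes it and leaves index 3.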
DADD technique (cont.)
  • Advantages:
    • requires only two linear scans of the disk with a tiny buffer of main memory
    • very simple to implement
  • Disadvantages:
    • depend on threshold r
Proposed approach
  • Using Vector Quantization for discretization
  • Improve BFDD algorithm with ordering heuristic
Using histogram model

[Figure: series transformation with a codebook of size s = 16; each time series is encoded as a string of codeword symbols]
Similarity measure

[Figure: histogram of codeword frequencies over codewords 1, 2, ..., s]
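
A minimal sketch of the histogram model, under loud assumptions. The codebook is a hand-made toy with s = 4 rather than one trained by vector quantization. The series is cut into non-overlapping windows. The comparison uses an L1 distance between normalized histograms as a stand-in, because the similarity formula from the slide is not available in this transcript.

```python
import math
from collections import Counter

def encode(series, codebook):
    """Replace each non-overlapping window by the index of its nearest codeword."""
    w = len(codebook[0])
    return [
        min(range(len(codebook)), key=lambda k: math.dist(series[i:i + w], codebook[k]))
        for i in range(0, len(series) - w + 1, w)
    ]

def histogram(codes, s):
    """Relative frequency of each of the s codewords."""
    counts = Counter(codes)
    return [counts[k] / len(codes) for k in range(s)]

def histogram_distance(h1, h2):
    """Stand-in measure (an assumption): L1 distance between histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]   # toy codebook, s = 4
a = [0, 0, 0, 1, 1, 1, 1, 0]
b = [0, 0.1, 0.1, 0.9, 1, 1, 0.9, 0]      # noisy copy of a
c = [1, 1, 1, 1, 1, 1, 1, 1]              # very different shape

ha, hb, hc = (histogram(encode(x, codebook), len(codebook)) for x in (a, b, c))
print(histogram_distance(ha, hb), histogram_distance(ha, hc))
```

The noisy copy maps to the same codewords as the original and so has the same histogram, while the flat series concentrates all its mass on a single codeword.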

Using multiple resolutions
  • Codebook (6,60)
  • Codebook (16,30)
For each resolution
  • Start with lowest resolution and a group of all subsequences
  • For each resolution
    • groups that contain more than one subsequence are split based on a threshold r
  • Stop when every group contains a single subsequence or the highest resolution is reached
Improve BFDD
  • Outer Loop Heuristic:
    • groups with the fewest subsequences are considered first
  • Inner Loop Heuristic:
    • when the i-th subsequence is considered in the outer loop, all subsequences in its group are considered first in the inner loop
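
A sketch of brute-force discord discovery with these two orderings and early abandoning. The grouping function is passed in as `group_of`; the `coarse_key` used in the example is a hypothetical stand-in for the codeword-based grouping, and Euclidean distance is assumed.

```python
import math
from collections import defaultdict

def find_discord(series, n, group_of):
    """Return (start index, nearest-neighbor distance) of the top discord."""
    subs = [series[i:i + n] for i in range(len(series) - n + 1)]
    keys = [group_of(s) for s in subs]
    groups = defaultdict(list)
    for i, key in enumerate(keys):
        groups[key].append(i)

    # Outer loop heuristic: subsequences from the smallest groups come first.
    outer = sorted(range(len(subs)), key=lambda i: len(groups[keys[i]]))

    best_dist, best_loc = 0.0, None
    for i in outer:
        same = groups[keys[i]]
        same_set = set(same)
        # Inner loop heuristic: members of the same group are visited first.
        inner = same + [j for j in range(len(subs)) if j not in same_set]
        nearest = math.inf
        for j in inner:
            if abs(i - j) < n:          # skip trivial (overlapping) matches
                continue
            d = math.dist(subs[i], subs[j])
            if d < best_dist:           # early abandon: i cannot be the discord
                break
            nearest = min(nearest, d)
        else:
            if best_dist < nearest < math.inf:
                best_dist, best_loc = nearest, i
    return best_loc, best_dist

def coarse_key(sub):
    """Hypothetical grouping: which values exceed 2."""
    return tuple(int(x > 2) for x in sub)

series = [0, 1, 0, 1, 0, 1, 0, 1, 5, 1, 0, 1, 0, 1, 0, 1]
loc, nn_dist = find_discord(series, 4, coarse_key)
print(loc, nn_dist)
```

The ordering changes only how quickly the inner loop can be abandoned, not the discord distance that is found: passing a `group_of` that puts every subsequence in one group reduces this to plain brute force with the same answer.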
References
  • [1] E. Keogh, J. Lin, A. Fu. HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence. In Proc. of the 5th IEEE International Conference on Data Mining (ICDM 2005), November 27-30, 2005, pp. 226-233.
  • [2] D. Yankov, E. Keogh, U. Rebbapragada. Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets, 2008.
  • [3] E. Keogh. Mining Shape and Time Series Databases with Symbolic Representations. Tutorial at the 13th ACM International Conference on Knowledge Discovery and Data Mining (KDD 2007), August 12-15, 2007.
  • [4] J. Lin, E. Keogh, A. Fu, H. Van Herle. Approximations to Magic: Finding Unusual Medical Time Series. In Proc. of the 18th IEEE International Symposium on Computer-Based Medical Systems, 2005, pp. 329-334.
  • [5] M. Chuah, F. Fu. ECG Anomaly Detection via Time Series Analysis. Technical Report LU-CSE-07-001, 2007.
References (cont.)
  • [6] V. Megalooikonomou, Q. Wang, G. Li, C. Faloutsos. A Multiresolution Symbolic Representation of Time Series. In Proc. of the 21st International Conference on Data Engineering (ICDE 2005), April 5-8, 2005, pp. 668-679.
  • [7] V. Chandola, D. Cheboli, V. Kumar. Detecting Anomalies in a Time Series Database. Technical Report TR 09-004, 2009.
  • [8] Y. Bu, T-W Leung, A. Fu, E. Keogh, J. Pei, S. Meshkin. WAT: Finding Top-K Discords in Time Series Database. In Proc. of the 2007 SIAM International Conference on Data Mining (SDM'07), Minneapolis, MN, USA, April 26-28, 2007.
  • [9] Q. Wang, V. Megalooikonomou. A Dimensionality Reduction Technique for Efficient Time Series Similarity Analysis. Information Systems 33, pp. 115-132, 2008.
  • [10] H. B. Kekre, T. K. Sarode. Fast Codebook Search Algorithm for Vector Quantization using Sorting Technique. International Conference on Advances in Computing, Communication and Control (ICAC3'09), 2009.