Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets

Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets. Dragomir Yankov, Eamonn Keogh, Computer Science & Eng. Dept. University of California, Riverside. Umaa Rebbapragada Dept. of Computer Science Tufts University. Best paper winner: ICDM 2007. Outline.

Presentation Transcript
Disk Aware Discord Discovery:

Finding Unusual Time Series in Terabyte Sized Datasets

Dragomir Yankov, Eamonn Keogh,

Computer Science & Eng. Dept.

University of California, Riverside

Umaa Rebbapragada

Dept. of Computer Science

Tufts University

Best paper winner: ICDM 2007

Outline
  • What inspired the current work
  • The time series discord detection problem
  • An efficient algorithm for mining disk resident discords
    • Detecting range-based discords
    • Detecting the top k discords
  • Experimental results
    • Evaluating the effectiveness of the discord definition
    • Scalability of the discord detection algorithm
A motivating example
  • Myriads of telescopes around the world constantly record valuable astronomical data, e.g. star light-curves


  • A light-curve is a real-valued time series of light magnitude measurements derived from telescopic images

Eclipsed binary:

Sirius A&B

Movie: By kind permission of Prof. Richard W. Pogge, OSU

Image: Chandra X-ray observatory

A motivating example (cont)
  • The American Association of Variable Star Observers has a database of over 10.5 million variable star brightness measurements going back over ninety years
  • Over 400,000 new variable star brightness measurements are added to the database every year
  • Many of the observations are noisy or are preprocessed inaccurately prior to storing
  • Efficient, unsupervised methods for cleaning the data are required
A motivating example (cont)
  • Data are inherently non-convex and hard to model probabilistically.
  • Anomalies should be defined with respect to the non-linear manifolds defined by the light-curve time series (true for many time series datasets)
Definitions and assumptions
  • Notation
    • time series:
    • subsequence:
    • time series database:
  • A distance function (which may not be a metric) defines an ordering for the elements in the time series database

Nasdaq Composite (Oct06-Oct07)

Time series discords
  • Most-significant discord – the subsequence with maximal distance to its nearest neighbor
Generalized discord definitions
  • Most-significant k-th NN discord – the subsequence with maximal distance to its k-th nearest neighbor
Generalized discord definitions
  • Most-significant k-NN discord – the subsequence with maximal distance to its k nearest neighbors in the time series database

The algorithm utilizes the first of these discord definitions for its computational efficiency and intuitive interpretation
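
The first definition can be illustrated with a brute-force, in-memory sketch. It assumes the database is a small list of equal-length series and that the distance is Euclidean; the slides do not fix the distance function, and the function names here are hypothetical. It is only meant to make the definition concrete, not to scale.

```python
import math

def nn_distance(i, database, dist=math.dist):
    """Distance from series i to its nearest neighbor among the other series."""
    return min(dist(database[i], s) for j, s in enumerate(database) if j != i)

def most_significant_discord(database, dist=math.dist):
    """Return (index, NN distance) of the series farthest from its nearest neighbor."""
    scores = [nn_distance(i, database, dist) for i in range(len(database))]
    best = max(range(len(database)), key=scores.__getitem__)
    return best, scores[best]

# Toy data (assumption): three nearly identical series and one outlier.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
discord_index, discord_score = most_significant_discord(toy)
# discord_index is 3, the series far from the cluster
```

This quadratic scan is exactly what becomes infeasible once the data no longer fit in main memory.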

Disk aware discord detection
  • Detecting discords is harder than finding similar patterns
    • anytime algorithms can quickly detect similarities
    • anomalies require computation time
  • Indexing is not a solution
    • time series are high dimensional
    • dimensionality reduction is often inadequate
    • a linear scan is faster than randomly accessing even 10% of the data on disk

We are looking for an algorithm that performs two disk scans and an “approximately linear” number of computations

Discord detection algorithm
  • Phase 1 – candidates selection phase

r - discord range

Discord detection algorithm
  • Phase 2 – candidates refinement phase

r - discord range

Discord detection algorithm
  • Phase 2 – candidates refinement phase

Upon completion sort the candidates list C
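
The slide figures for the two phases did not survive extraction, so the following is a minimal sketch of how the two scans could work, assuming a Euclidean distance and a hypothetical `scan` callable that yields `(index, series)` pairs in disk order. Phase 1 keeps a series as a candidate only while nothing seen so far lies within r of it; phase 2 rescans the data to discard remaining false positives and to record exact nearest-neighbor distances.

```python
import math

def select_candidates(scan, r, dist=math.dist):
    """Phase 1: one sequential pass that builds the candidate set C."""
    candidates = {}                        # index -> series
    for i, s in scan():
        is_candidate = True
        for j in list(candidates):
            if dist(s, candidates[j]) < r:
                del candidates[j]          # j has a neighbor closer than r
                is_candidate = False       # and so does s
        if is_candidate:
            candidates[i] = s
    return candidates

def refine_candidates(scan, candidates, r, dist=math.dist):
    """Phase 2: second pass that prunes false positives and tracks NN distances."""
    nn = {j: math.inf for j in candidates}
    for i, s in scan():
        for j in list(nn):
            if i == j:
                continue                   # skip the candidate itself
            d = dist(s, candidates[j])
            if d < r:
                del nn[j]                  # not a discord for range r
            elif d < nn[j]:
                nn[j] = d
    # Sort the surviving candidates by NN distance, largest first.
    return sorted(nn.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (assumption): a tight cluster plus two isolated series.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 0.0)]

def scan():
    return enumerate(toy)

C = select_candidates(scan, r=3.0)
discords = refine_candidates(scan, C, r=3.0)
```

On this toy input the series at index 2 survives phase 1 only because its close neighbors had already been evicted, and phase 2 removes it; this is the kind of false positive the refinement pass exists to catch.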

Correctness of the algorithm
  • The candidates set C contains all discords at distance at least r from their NN, plus some other elements
  • The refinement phase removes from C all false positives, and no real discord is pruned
  • Correctness: the range discord algorithm detects all discords and only the discords with respect to the specified range r
Finding a good range parameter
  • Selecting too large an r may result in an empty discord set, while too small an r can render the algorithm inefficient
  • Computing the nearest neighbor distance distribution (NNDD) is expensive
  • NNDD depends on the number of examples in the data

Approximating NNDD
  • Intuition – though the relative volume in the upper tail decreases, the absolute number of discords cut by r remains sufficient when adding more data
  • Detecting the top k discords
    • Select a uniformly random sample
    • Compute the top k discords in the sample
    • Order their NN distances as:
    • Set
    • Run the disk aware algorithm with range parameter r
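
The sampling step can be sketched as below. The slide's exact rule for setting the range was lost in extraction, so this sketch assumes, as a labeled guess, that r is taken to be the k-th largest nearest-neighbor distance within the sample; `estimate_range` and its parameters are hypothetical names.

```python
import math
import random

def estimate_range(database, k, sample_size, seed=0, dist=math.dist):
    """Estimate a range r from the NN distances of a uniform random sample.

    Assumption: r is the k-th largest NN distance observed in the sample.
    """
    rng = random.Random(seed)
    sample = rng.sample(database, sample_size)
    nn = sorted(
        (min(dist(s, t) for j, t in enumerate(sample) if j != i)
         for i, s in enumerate(sample)),
        reverse=True,
    )
    return nn[k - 1]

# Toy data (assumption): a tight cluster plus two isolated series.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 0.0)]
r = estimate_range(toy, k=2, sample_size=len(toy))
```

A nearest-neighbor distance measured inside a sample can only stay the same or shrink once the rest of the data is added, so an r chosen this way may be too large and return fewer than k discords; lowering r and rerunning would be one possible remedy.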
Experimental evaluation

We performed two sets of experiments

  • Experiments showing the utility of the time series discord definition
  • Experiments showing the scalability of the disk aware discord detection algorithm
Experimental evaluation - utility of the discord definition
  • Star light-curve data from the Optical Gravitational Lensing Experiment (OGLE)
  • Three classes of light-curves
    • Eclipsed binaries
    • Cepheids
    • RR Lyrae variables

Figure: typical examples and the top two discords in each class

Experimental evaluation - utility of the discord definition
  • MSN web queries made in 2002
  • The most significant discord using rotation invariant Euclidean distance

Figure annotations: patterns dominated by a weekly cycle; anticipated bursts; periodicity of 29.5 days, the length of a synodic month
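
A rotation invariant Euclidean distance can be sketched as the smallest Euclidean distance over all circular shifts of one series against the other. This is a straightforward quadratic-time illustration with a hypothetical function name, not necessarily the implementation used for the experiment.

```python
import math

def rotation_invariant_distance(a, b):
    """Minimum Euclidean distance between a and every circular shift of b."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    a, b = list(a), list(b)
    return min(math.dist(a, b[s:] + b[:s]) for s in range(len(b)))

# The second series is the first one rotated by two positions.
same_shape = rotation_invariant_distance([1, 2, 3, 4], [3, 4, 1, 2])
```

Under this distance, two series with the same shape but a different starting phase are treated as identical, so only a genuine difference in shape can make a series a discord.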

Experimental evaluation - utility of the discord definition
  • Anomaly detection in video sequences (multivariate data)
  • Adapting the method as a data cleaning procedure

the top one discord shown with only one of the existing clusters

our method achieves 100% accuracy on the planted anomalous trajectories

Experimental evaluation - utility of the discord definition
  • Population growth data – we studied the growth rate of 206 countries for the last 25 years, looking for the most dramatic 5 year event

the top 2 discords with a set of 10 representative countries for contrast

Experimental evaluation – scalability of the disk aware algorithm
  • We generated 3 data sets of size up to 0.35 TB of random walk time series
  • Six non-random walk time series were planted, and we looked for the top 10 discords
  • Time efficiency on the three random walk data sets:

two of the planted series (top) were among the top 10 discords


Experimental evaluation – scalability of the disk aware algorithm
  • Time efficiency (Heterogeneous data):
  • Main memory requirement for different thresholds
Experimental evaluation – scalability of the disk aware algorithm
  • Parallelizing the algorithm (m computers):

Candidate selection phase

Candidate refinement phase

Experimental evaluation – scalability of the disk aware algorithm
  • Parallelizing the algorithm (dataset: one million random walks):

The runtime overhead for 8 computers is approximately 30%. This is due to the increased candidate set size |C| at the end of phase 1
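
The details of the parallel scheme were in figures that did not survive, so the following is one plausible reading, simulated sequentially: each of the m partitions runs candidate selection on its own data, the candidate sets are merged, every partition then refines the merged set against its own data, and a candidate is reported only if it survives on all partitions. All names are hypothetical and the distance is assumed to be Euclidean.

```python
import math

def select(part, r):
    """Candidate selection on one partition of (index, series) pairs."""
    cands = {}
    for i, s in part:
        keep = True
        for j in list(cands):
            if math.dist(s, cands[j]) < r:
                del cands[j]
                keep = False
        if keep:
            cands[i] = s
    return cands

def refine(part, cands, r):
    """Refine the merged candidates against one partition."""
    nn = {j: math.inf for j in cands}
    for i, s in part:
        for j in list(nn):
            if i != j:
                d = math.dist(s, cands[j])
                if d < r:
                    del nn[j]
                else:
                    nn[j] = min(nn[j], d)
    return nn

def parallel_discords(partitions, r):
    """Simulate m machines; each loop body would run on its own machine."""
    cands = {}
    for part in partitions:
        cands.update(select(part, r))      # union of the local candidate sets
    local = [refine(part, cands, r) for part in partitions]
    surviving = set(cands)
    for nn in local:
        surviving &= set(nn)               # must survive on every partition
    result = {j: min(nn[j] for nn in local) for j in surviving}
    return sorted(result.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (assumption), split across two "machines" with global indices.
parts = [
    [(0, (0.0, 0.0)), (1, (0.1, 0.0)), (2, (0.0, 0.1))],
    [(3, (5.0, 5.0)), (4, (9.0, 0.0))],
]
found = parallel_discords(parts, r=3.0)
```

Because each partition sees less data during selection, fewer candidates are evicted locally and the merged set is larger than in a single-machine run, which is consistent with the overhead attributed to |C| above.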

Conclusion
  • Discords provide for an effective definition of rare time series patterns.
  • The presented disk aware algorithm meets all the requirements of a good off-the-shelf data mining tool:
    • The results are interpretable
    • It is extremely efficient and largely scalable
    • Very easy to implement (“8 lines in Matlab”)
  • Allows for straightforward parallel and online extensions
Acknowledgements
  • We would like to thank:
    • Dr. Pavlos Protopapas (Harvard University) – light-curve dataset
    • Dr. Michail Vlachos (IBM Watson) – MSN web query data
    • Dr. Longin Jan Latecki (Temple University) – Trajectory dataset1
    • Dr. Andrew Naftel (University of Manchester) - Trajectory dataset2

also

    • Dr. Jessica Lin (George Mason University) and
    • Dr. Ada Fu (Chinese University of Hong Kong) – for useful discussions
All datasets and the code can be downloaded from: http://www.cs.ucr.edu/~dyankov/projects/

THANK YOU!