Fast Subsequence Matching in Time-Series Databases.

C. Faloutsos, M. Ranganathan,

Y. Manolopoulos

Presented by

George Liu / Luis L. Perez


Time series?

  • Definition

  • Applications

    • Financial markets

    • Weather forecasting

    • Healthcare


What kind of problem are we trying to solve?

  • Whole sequence matching

    • Given a database S with n sequences, all of them equally long, and a query sequence Q of the same length.

    • Find all sequences in S that match Q.

  • Subsequence matching

    • Given a database S with n sequences, with potentially different lengths, and a query sequence Q.

    • Find all sequences in S that contain a subsequence matching Q.


Useful notation

  • Given a sequence S

    • Len(S) denotes the length of the sequence

    • S[i] denotes the i-th element

    • S[i:j] denotes the subsequence from S[i] to S[j], inclusive

  • Given two sequences, S and Q

    • D(S,Q) denotes the distance between S and Q.

      • Euclidean

  • Distance bound: e

    • Max. distance for two sequences to be considered “equal”
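
The notation above maps directly onto a few lines of Python. This is a minimal sketch: the helper names `subsequence`, `euclidean` and `matches` are made up for illustration, positions are assumed to be 0-based, and "equal" is taken to mean a distance of at most e.

```python
import math

def subsequence(s, i, j):
    """S[i:j] in the slide notation: elements i through j, inclusive (0-based here)."""
    return s[i:j + 1]  # Python slices exclude the end index, hence j + 1

def euclidean(s, q):
    """Euclidean distance D(S, Q) between two equal-length sequences."""
    if len(s) != len(q):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

def matches(s, q, e):
    """True when S and Q are considered "equal" under the distance bound e."""
    return euclidean(s, q) <= e

S = [1.0, 2.0, 3.0, 4.0, 5.0]
Q = [2.0, 3.0, 5.0]
window = subsequence(S, 1, 3)      # the elements S[1], S[2], S[3]
print(window, euclidean(window, Q), matches(window, Q, 1.0))
```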


Naïve approaches

  • Sequential scanning

    • Clearly infeasible for large databases

  • R-tree

    • Might work, but dimensionality is extremely high (proportional to sequence length)

    • Poor performance

  • What can we do to improve performance?


Dimensionality reduction

  • Redundant data, lots of patterns

  • Feature extraction

  • Data transformation

    • Cosine

    • Wavelet

    • Fourier <-- we'll focus on this.


Discrete Fourier Transform

  • Map a sequence x in time-domain to a sequence X in frequency-domain

  • Reversible!

  • Fast and easy-to-implement algorithms

  • Energy preservation property

    • Key concept in dimensionality reduction.

    • Just keep the first 2 or 3 coefficients.
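
The energy-preservation property can be checked numerically. The sketch below assumes the unitary form of the DFT (scaled by 1/sqrt(n)) and uses a deliberately naive O(n^2) implementation rather than an FFT; the sample data and the names `dft` and `energy` are made up for illustration.

```python
import cmath
import math

def dft(x):
    """Unitary DFT: X[f] = (1/sqrt(n)) * sum_t x[t] * exp(-2*pi*j*f*t/n)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]

def energy(seq):
    """Sum of squared magnitudes of a real or complex sequence."""
    return sum(abs(v) ** 2 for v in seq)

x = [10.0, 11.0, 12.5, 12.0, 11.5, 13.0, 14.0, 13.5]  # made-up, slowly varying data
X = dft(x)
kept = X[:3]  # the first 3 coefficients serve as the feature vector

print(energy(x), energy(X))        # equal up to floating-point rounding
print(energy(kept) / energy(x))    # share of the energy held by the kept coefficients
```

For slowly varying data like this, most of the energy sits in the lowest frequencies, which is why truncating to a handful of coefficients loses little.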


Parseval's theorem

  • Let S and Q be the original sequences.

  • Let S' and Q' be the same sequences after applying the DFT. Then D(S,Q) = D(S',Q').

  • Why is this important?

  • With only the first few coefficients kept, D(S',Q') can only underestimate D(S,Q); remember the bound e.

    • D(S,Q) < e ---> D(S', Q') < e

    • We will get no false dismissals.
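
The no-false-dismissal argument can be written out in one chain. This is a sketch assuming a unitary DFT; the symbols n (sequence length), k (number of retained coefficients) and D_k (distance computed on the first k coefficients only) are introduced here for illustration.

```latex
\begin{align*}
D^2(S, Q) &= \sum_{t=0}^{n-1} \lvert S[t] - Q[t] \rvert^2
           = \sum_{f=0}^{n-1} \lvert S'[f] - Q'[f] \rvert^2
           && \text{(Parseval)} \\
          &\ge \sum_{f=0}^{k-1} \lvert S'[f] - Q'[f] \rvert^2
           = D_k^2(S', Q')
           && \text{(dropping non-negative terms)}
\end{align*}
```

So D_k(S',Q') <= D(S,Q), and any pair within e in the time domain is also within e in the truncated feature space.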


Subsequence Matching

  • The problem:

    • You are given a collection of n sequences of real numbers (S1, S2, ..., Sn), potentially of different lengths.

    • The user specifies a query subsequence Q and a tolerance e, the max. acceptable dissimilarity.

    • You want to return all the sequences, along with the correct offsets k, that contain a subsequence matching the query within tolerance e.

  • Solutions:

    • many!


Possible Solutions

  • 1) Brute Force method - Sequentially scan every possible subsequence of the data sequences for a match.

  • 2) I-Naive - Transform all subsequences to points in feature space and store those points into an R-tree.

  • 3) ST-Index - Transform all subsequences to points in feature space. Store MBRs of sub-trails into an R*-tree.

  • Note: I-Naive and ST-Index are similar in the initial steps.
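
As a baseline, the brute-force method can be sketched in a few lines. The function name `sequential_scan` and the sample data are made up for illustration; the result is a list of (sequence index, offset k) pairs.

```python
import math

def sequential_scan(database, q, e):
    """Return (sequence index, offset k) for every window within distance e of q."""
    hits = []
    w = len(q)
    for idx, s in enumerate(database):
        # Slide over every offset at which a window of length w still fits.
        for k in range(len(s) - w + 1):
            d = math.sqrt(sum((s[k + i] - q[i]) ** 2 for i in range(w)))
            if d <= e:
                hits.append((idx, k))
    return hits

database = [
    [1.0, 2.0, 3.0, 4.0, 3.0, 2.0],
    [5.0, 5.0, 2.0, 3.1, 4.0],
]
hits = sequential_scan(database, [2.0, 3.0, 4.0], 0.2)
print(hits)
```

Every offset of every sequence is compared against the full query, which is the cost the indexed methods try to avoid.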


Possible Solutions I-naive

  • *Assume that the min. query length is w. w changes according to the application (e.g., stock-market analysts interested in weekly/monthly patterns use a larger w).

  • Procedure:

    • 1) Use a "sliding window" of size w to visit every subsequence of length w in a sequence.

    • 2) Apply the DFT to each of those subsequences of size w to map it to a point in feature space.

    • 3) A trail is produced of Len(S)-w+1 points.
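
The three steps above can be sketched as follows. The sketch assumes a unitary DFT, keeps the first f coefficients, and flattens each complex coefficient into two real coordinates; the names `features` and `trail` and the sample data are made up for illustration, and every window is transformed from scratch for clarity.

```python
import cmath
import math

def features(window, f=2):
    """First f unitary-DFT coefficients of a window, flattened to 2*f real numbers."""
    n = len(window)
    point = []
    for k in range(f):
        c = sum(window[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        c /= math.sqrt(n)
        point.extend((c.real, c.imag))
    return tuple(point)

def trail(s, w, f=2):
    """One feature-space point per sliding-window position: Len(S) - w + 1 points."""
    return [features(s[k:k + w], f) for k in range(len(s) - w + 1)]

S = [3.0, 4.0, 6.0, 5.0, 7.0, 8.0, 6.0, 9.0, 10.0, 9.0]
w = 4
points = trail(S, w)
print(len(points), points[0])
```

Consecutive windows overlap in all but one element, so consecutive points tend to lie close together; that is what makes the sequence of points a "trail".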


Possible Solutions I-naive

  • Procedure cont:

    • 4) Store all the points of the trails in feature space in a spatial access method. (R*-tree)

    • 5) When presented with a query of length w and tolerance e, extract the features of the query and perform a range query with radius e on the spatial access method.

    • 6) Discard false alarms by retrieving all those subsequences and calculating their actual distance from the query.

  • Note: Very, very slow approach. Worse than sequential scan. You have a large R*-tree (tall and slow).
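
Steps 5 and 6 can be illustrated with a small filter-and-refine sketch. A linear scan over the feature points stands in for the R*-tree range query here (an assumption made to keep the example self-contained), and the names `features`, `dist` and `i_naive_search` are hypothetical.

```python
import cmath
import math

def features(window, f=2):
    """First f unitary-DFT coefficients of a window, flattened to 2*f real numbers."""
    n = len(window)
    point = []
    for k in range(f):
        c = sum(window[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        c /= math.sqrt(n)
        point.extend((c.real, c.imag))
    return tuple(point)

def dist(a, b):
    """Euclidean distance between two equal-length sequences or points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def i_naive_search(s, q, e, f=2):
    """Return (candidates, answers) as lists of offsets into s."""
    w = len(q)
    q_point = features(q, f)
    # Filter step: range query with radius e in feature space.
    # A linear scan stands in for the spatial access method.
    candidates = [k for k in range(len(s) - w + 1)
                  if dist(features(s[k:k + w], f), q_point) <= e]
    # Refine step: discard false alarms using the true distance.
    answers = [k for k in candidates if dist(s[k:k + w], q) <= e]
    return candidates, answers

s = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
q = [3.1, 4.0, 5.0, 5.9]
candidates, answers = i_naive_search(s, q, 0.3)
print(candidates, answers)
```

Because the truncated feature distance never exceeds the true distance, every true answer survives the filter step; only false alarms need to be removed afterwards.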


Possible Solutions ST-Index

  • *Assume that the min. query length is w. w changes according to the application (e.g., stock-market analysts interested in weekly/monthly patterns use a larger w).

  • Procedure:

    • 1) Use a "sliding window" of size w to visit every subsequence of length w in a sequence.

    • 2) Apply the DFT to each of those subsequences of size w to map it to a point in feature space.

    • 3) A trail is produced of Len(S)-w+1 points.


Possible Solutions ST-Index

  • Procedure cont.

    • 4) Divide the trail of points in feature space into sub-trails. (algorithm mentioned later)

    • 5) Represent each sub-trail by its MBR (minimum bounding rectangle).

    • 6) Store the MBRs in a spatial access method (e.g., an R*-tree).
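
Steps 4 to 6 can be sketched as below. Fixed-size grouping is used only as a placeholder for the splitting algorithm, and a plain Python list stands in for the R*-tree; the names `mbr` and `build_index`, the entry layout and the sample trail are all assumptions made for illustration.

```python
def mbr(points):
    """Minimum bounding rectangle of same-dimension points: (low corner, high corner)."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high

def build_index(trail, size):
    """Cut a trail into consecutive sub-trails of `size` points and keep one entry
    (low, high, start offset, point count) per sub-trail."""
    entries = []
    for start in range(0, len(trail), size):
        sub = trail[start:start + size]
        low, high = mbr(sub)
        entries.append((low, high, start, len(sub)))
    return entries

# A made-up 2-dimensional trail: three nearby points, then a jump.
trail = [(1.0, 2.0), (1.5, 2.2), (2.0, 2.1), (6.0, 7.0), (6.2, 7.5)]
index = build_index(trail, 3)
for entry in index:
    print(entry)
```

Each entry records the starting offset and point count of its sub-trail so that the matching raw subsequences can be fetched later; the index holds one rectangle per sub-trail instead of one point per window.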


MBRs in F-Dimension

  • [Figures: sub-trail MBRs in F-dimensional feature space; images not included in the transcript]

Insertions

  • Problem: How do we divide these trails into sub-trails?

    • Two simple heuristics:

      • 1) Every sub-trail has a predetermined, fixed number of points. (I-fixed)

      • 2) Every sub-trail has a predetermined, fixed length. (I-fixed)

  • Solution: Use an "adaptive heuristic." (I-adaptive)


I-adaptive Algorithm

  • Based on the idea of the marginal cost of a point in terms of disk accesses.

    Marginal cost (mc) = (expected disk accesses for a given MBR) / (number of points k in that MBR)

  • Algorithm

    Assign the first point of the trail to a sub-trail.

    FOR each successive point

    IF it increases the marginal cost of the current sub-trail

    THEN start another sub-trail

    ELSE include it in the current sub-trail
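
The greedy loop above can be sketched in Python. The slide does not say how the disk accesses of an MBR are estimated; this sketch assumes the estimate is the product of (side length + 0.5) over all dimensions, with the feature space normalized to the unit hypercube. That formula, the function names and the sample trail are assumptions made for illustration.

```python
from math import prod

def marginal_cost(points):
    """Assumed cost model: estimated disk accesses of the MBR divided by its point count."""
    dims = range(len(points[0]))
    sides = [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]
    disk_accesses = prod(side + 0.5 for side in sides)  # assumption, not from the slide
    return disk_accesses / len(points)

def adaptive_subtrails(trail):
    """Greedy split: open a new sub-trail whenever adding a point raises the marginal cost."""
    subtrails = [[trail[0]]]
    for point in trail[1:]:
        current = subtrails[-1]
        if marginal_cost(current + [point]) > marginal_cost(current):
            subtrails.append([point])   # THEN start another sub-trail
        else:
            current.append(point)       # ELSE include it in the current sub-trail
    return subtrails

# Made-up trail in the unit square: three close points, a jump, two close points.
trail = [(0.10, 0.10), (0.12, 0.11), (0.15, 0.13), (0.80, 0.85), (0.82, 0.86)]
subtrails = adaptive_subtrails(trail)
print([len(t) for t in subtrails])
```

Nearby points lower the cost per point because they barely enlarge the rectangle, while a distant point inflates it enough to trigger a new sub-trail.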


Searching

  • Consider the window length w used to build the index and the distance bound e.

  • Let Q be the query sequence

  • If Len(Q) = w, it's all good.

    • Algorithm Search_Short:

      • Use DFT to map Q to a point q in feature space. The query region is the sphere of radius e centered at q.

      • Retrieve all the sub-trails whose MBRs intersect the query region using our index.

      • Throw away false alarms.
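
The retrieval step of Search_Short reduces to a sphere-rectangle intersection test: an MBR intersects the query region exactly when the smallest distance from q to the rectangle is at most e. In the sketch below a list scan stands in for the index traversal, and the entry layout, names and numbers are assumptions made for illustration.

```python
import math

def min_dist_to_mbr(q, low, high):
    """Smallest Euclidean distance from point q to the rectangle [low, high]."""
    # Per dimension: distance to the nearest face, or 0 if q lies within the extent.
    return math.sqrt(sum(max(l - v, 0.0, v - h) ** 2 for v, l, h in zip(q, low, high)))

def candidate_subtrails(index, q, e):
    """Entries whose MBR intersects the sphere of radius e centered at q."""
    return [entry for entry in index if min_dist_to_mbr(q, entry[0], entry[1]) <= e]

# Entries are (low corner, high corner, start offset, point count), all made up.
index = [
    ((1.0, 2.0), (2.0, 2.2), 0, 3),
    ((6.0, 7.0), (6.2, 7.5), 3, 2),
]
q = (2.3, 2.6)  # the query mapped into feature space
print(candidate_subtrails(index, q, 0.6))
print(candidate_subtrails(index, q, 0.3))
```

A retrieved sub-trail is only a candidate: the raw subsequences it covers still have to be compared with Q to throw away false alarms.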


Searching

  • Now, what if Len(Q) > w?

  • Requires more analysis, but basically we assume that Len(Q) = p*w for some integer p.

  • So we can split Q into p subsequences of length w.

  • What about the radius? r = e/sqrt(p). If Q matches within e, at least one of the p pieces must match within e/sqrt(p).
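
The choice of radius follows from a short contradiction argument. In this sketch q_i denotes the i-th length-w piece of Q and s_i the aligned piece of a data subsequence S[k : k+pw-1]; these symbols are introduced here for illustration.

```latex
\begin{align*}
D^2\bigl(S[k : k + pw - 1],\, Q\bigr) &= \sum_{i=1}^{p} D^2(s_i, q_i) \\
D(s_i, q_i) > \frac{e}{\sqrt{p}} \ \text{for all } i
  &\implies D^2\bigl(S[k : k + pw - 1],\, Q\bigr) > p \cdot \frac{e^2}{p} = e^2
\end{align*}
```

So if the whole subsequence is within e of Q, at least one pair of aligned pieces is within e/sqrt(p), and searching each piece with that radius cannot cause a false dismissal.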


Searching

  • So we have...

    • Algorithm Search_Long:

      • Break sequence Q into p sub-queries of length w, each with radius e/sqrt(p)

      • Retrieve from the index all the sub-trails whose MBRs intersect at least one of the sub-query regions.

      • Examine the sub-sequences, discard false alarms.
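
The structure of Search_Long can be sketched as follows. To stay self-contained, the sketch compares raw windows directly instead of querying MBRs in feature space, so the inner loop is only a stand-in for the index lookup; it also assumes Len(Q) is an exact multiple of w. The names `dist` and `search_long` and the data are made up for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_long(s, q, w, e):
    """Offsets k in s where s[k:k+len(q)] is within distance e of q."""
    if len(q) % w != 0:
        raise ValueError("this sketch assumes Len(Q) = p*w")
    p = len(q) // w
    radius = e / math.sqrt(p)
    pieces = [q[i * w:(i + 1) * w] for i in range(p)]
    candidates = set()
    for i, piece in enumerate(pieces):
        for k in range(len(s) - w + 1):          # stand-in for the index lookup
            if dist(s[k:k + w], piece) <= radius:
                start = k - i * w                # where the full query would begin
                if 0 <= start <= len(s) - len(q):
                    candidates.add(start)
    # Examine the candidate subsequences and discard false alarms.
    return sorted(k for k in candidates if dist(s[k:k + len(q)], q) <= e)

s = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
q = [3.1, 4.0, 5.0, 5.9]
result = search_long(s, q, w=2, e=0.3)
print(result)
```

A hit for piece i at offset k implies that the full query would start at k - i*w, which is why each hit is shifted back before the final verification.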


Experimental results

  • Stock price database with ~300,000 points

  • 1 number = 4 bytes

  • DFT keeping first 3 coefficients (actually 6 features, since each coefficient has a real and an imaginary part)

  • w = 512 bytes

  • R*-tree


Experimental results

  • Space

    • Naïve methods: 24 MB

    • This method: 5 KB

  • Time - “short” queries (Len(Q) = w)

    • 3 to 100 times better response times

  • Time - “long” queries (Len(Q) > w)

    • 10 to 100 times better response times

