
# Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos



C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Presented by George Liu / Luis L. Perez

• Definition

• Applications

• Financial markets

• Weather forecasting

• Healthcare

• Whole sequence matching

• Given a database S with n sequences, all of them equally long, and a query sequence Q of the same length.

• Find all sequences in S that match Q.

• Subsequence matching

• Given a database S with n sequences, with potentially different lengths, and a query sequence Q.

• Find all sequences in S that contain a subsequence matching Q.

• Given a sequence S

• Len(S) denotes the length of the sequence

• S[i] denotes the ith element

• S[i:j] denotes the subsequence from S[i] to S[j], inclusive

• Given two sequences, S and Q

• D(S,Q) denotes the distance between S and Q.

• Euclidean

• Distance bound: e

• Max. distance for two sequences to be considered “equal”
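A minimal sketch of this notation in Python, assuming sequences are plain lists of floats. The helper names `dist` and `matches` are illustrative, not from the slides. Note that Python slices exclude the end index, so the slides' S[i:j] corresponds to `S[i:j + 1]`.

```python
import math

def dist(s, q):
    """Euclidean distance D(S, Q) between two equal-length sequences."""
    if len(s) != len(q):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

def matches(s, q, e):
    """True when S and Q are within the distance bound e."""
    return dist(s, q) <= e

S = [1.0, 2.0, 3.0, 4.0]
Q = [1.0, 2.0, 3.0, 6.0]
print(dist(S, Q))           # 2.0
print(matches(S, Q, 2.5))   # True
```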

• Sequential scanning

• Clearly unfeasible

• R-tree

• Might work, but dimensionality is extremely high (proportional to sequence length)

• Poor performance

• What can we do to improve performance?

• Redundant data, lots of patterns

• Feature extraction

• Data transformation

• Cosine

• Wavelet

• Fourier <-- we'll focus on this.

• Map a sequence x in time-domain to a sequence X in frequency-domain

• Reversible!

• Fast and easy-to-implement algorithms

• Energy preservation property

• Key concept in dimensionality reduction.

• Just keep the first 2 or 3 coefficients.

• Let S and Q be the original sequences.

• Let S' and Q' be their DFTs. Then D(S,Q) = D(S',Q').

• Why is this important?

• Keeping only the first few coefficients can only underestimate the distance; remember the bound e.

• With truncated S' and Q': D(S,Q) < e ---> D(S', Q') < e

• We will get no false dismissals (but possibly some false alarms).
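The energy-preservation argument can be checked numerically. The sketch below uses a naive O(n^2) DFT with 1/sqrt(n) scaling (an assumption made here so that distances are preserved exactly); a real system would use an FFT. The sample values and the choice `k = 3` are arbitrary.

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT (scaled by 1/sqrt(n)) so that energy is preserved."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]

def cdist(a, b):
    """Euclidean distance between two real or complex sequences."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

S = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
Q = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0]

full_time = cdist(S, Q)            # distance in the time domain
full_freq = cdist(dft(S), dft(Q))  # same distance in the frequency domain
k = 3                              # keep only the first k coefficients
feat = cdist(dft(S)[:k], dft(Q)[:k])

print(abs(full_time - full_freq) < 1e-9)  # True: energy preservation
print(feat <= full_time)                  # True: truncation only underestimates
```

Because the truncated distance never exceeds the true one, a range query of radius e in feature space cannot miss a true match; it can only return extra candidates.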

• The problem:

• You are given a collection of N sequences of real numbers (S1, S2, ..., SN), potentially of different lengths.

• The user specifies a query subsequence Q and the tolerance e, the maximum acceptable dissimilarity.

• You want to return all the subsequences that match the query within tolerance e, along with their offsets k.

• Solutions:

• many!

• 1) Brute Force method - Sequentially scan every possible subsequence of the data sequences for a match.

• 2) I-Naive - Transform all subsequences to points in feature space and store those points into an R-tree.

• 3) ST-Index - Transform all subsequences to points in feature space. Store MBRs of sub-trails into an R*-tree.

• Note: I-Naive and ST-Index are similar in the initial steps.
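For reference, the brute-force baseline (solution 1) could look like the sketch below. The function name `sequential_scan` and the toy data are assumptions for illustration; offsets are 0-based.

```python
import math

def sequential_scan(database, q, e):
    """Return (sequence index, offset k) for every window within distance e of q."""
    hits = []
    w = len(q)
    for idx, s in enumerate(database):
        # Try every offset at which a window of length w fits in s.
        for k in range(len(s) - w + 1):
            d = math.sqrt(sum((s[k + i] - q[i]) ** 2 for i in range(w)))
            if d <= e:
                hits.append((idx, k))
    return hits

database = [[0, 1, 2, 3, 2, 1, 0], [5, 5, 1, 2, 3, 5]]
print(sequential_scan(database, [1, 2, 3], 0.5))  # [(0, 1), (1, 2)]
```

Every window of every sequence is compared against the query, which is the cost the indexed methods try to avoid.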

Possible Solutions I-naive

• *Assume that the minimum query length is w. w depends on the application (e.g., stock market analysts interested in weekly/monthly patterns would use a larger w).

• Procedure:

• 1) Slide a window of length w over each sequence to extract every subsequence of length w.

• 2) Apply the DFT to each subsequence of size w to map it to a point in feature space.

• 3) Each sequence S thus produces a trail of Len(S)-w+1 points.
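Steps 1 to 3 can be sketched as follows. This is an illustration under assumptions: a naive orthonormal DFT, `n_coeffs=2` retained coefficients, and each complex coefficient flattened into two real coordinates. The names `features` and `trail` are hypothetical.

```python
import cmath
import math

def features(window, n_coeffs=2):
    """First n_coeffs orthonormal DFT coefficients, flattened to real values."""
    w = len(window)
    point = []
    for f in range(n_coeffs):
        c = sum(window[t] * cmath.exp(-2j * cmath.pi * f * t / w) for t in range(w)) / math.sqrt(w)
        point.extend((c.real, c.imag))
    return tuple(point)

def trail(s, w, n_coeffs=2):
    """One feature point per sliding-window position: Len(S)-w+1 points."""
    return [features(s[k:k + w], n_coeffs) for k in range(len(s) - w + 1)]

S = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 5.0, 7.0]
points = trail(S, w=4)
print(len(points))  # 5, i.e. Len(S) - w + 1
```

Consecutive windows overlap in all but one value, so consecutive points tend to lie close together in feature space, which is what makes grouping them worthwhile.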

Possible Solutions I-naive

• Procedure cont:

• 4) Store all the points of the trails in feature space in a spatial access method. (R*-tree)

• 5) When presented with a query of length w and tolerance e, extract the features of the query and perform the spatial access range query with radius e.

• 6) Discard false alarms by retrieving all those subsequences and calculating their actual distance from the query.

• Note: Very, very slow approach. Worse than Sequential Scan. You have a large R*-tree (tall and slow).

Possible Solutions ST-Index

• *Assume that the minimum query length is w. w depends on the application (e.g., stock market analysts interested in weekly/monthly patterns would use a larger w).

• Procedure:

• 1) Slide a window of length w over each sequence to extract every subsequence of length w.

• 2) Apply the DFT to each subsequence of size w to map it to a point in feature space.

• 3) Each sequence S thus produces a trail of Len(S)-w+1 points.

Possible Solutions ST-Index

• Procedure cont.

• 4) Divide the trail of points in feature space into sub-trails. (algorithm mentioned later)

• 5) Represent each sub-trail by its minimum bounding rectangle (MBR).

• 6) Store the MBRs in a spatial access method. (e.g., R*-tree)

• Problem: How do we divide these trails into sub-trails?

• Two heuristics:

• 1) Every sub-trail has a predetermined, fixed number of points. (I-fixed)

• 2) Sub-trails have variable length, chosen adaptively. (I-adaptive)

• - Based on the idea of the marginal cost of a point in terms of disk accesses.

Marginal cost (mc) = (estimated disk accesses for a given MBR) / (number of points k in that MBR)

• Algorithm

Assign the first point of the trail to a sub-trail.

FOR each successive point

IF it increases the marginal cost of the current sub-trail

THEN start another sub-trail

ELSE include it in the current sub-trail
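A runnable sketch of this greedy split. The cost model is an assumption: disk accesses for an MBR are estimated as the product of (side length + 0.5), which presumes feature coordinates normalized to a unit cube. The slides do not give the estimate, and the names `cost` and `split_trail` are illustrative.

```python
def cost(lo, hi):
    """Assumed estimate of disk accesses for the MBR [lo, hi]."""
    result = 1.0
    for a, b in zip(lo, hi):
        result *= (b - a) + 0.5
    return result

def split_trail(points):
    """Greedily cut a trail into sub-trails of consecutive points."""
    subtrails = []
    current = [points[0]]
    lo, hi = list(points[0]), list(points[0])
    for p in points[1:]:
        # MBR of the current sub-trail if p were added to it.
        new_lo = [min(a, x) for a, x in zip(lo, p)]
        new_hi = [max(b, x) for b, x in zip(hi, p)]
        mc_now = cost(lo, hi) / len(current)
        mc_new = cost(new_lo, new_hi) / (len(current) + 1)
        if mc_new > mc_now:
            # Adding p raises the marginal cost: start another sub-trail.
            subtrails.append(current)
            current, lo, hi = [p], list(p), list(p)
        else:
            current.append(p)
            lo, hi = new_lo, new_hi
    subtrails.append(current)
    return subtrails

example_trail = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
print([len(t) for t in split_trail(example_trail)])  # [3, 2]
```

In the example, the jump to (5.0, 5.0) would inflate the bounding rectangle far more than one extra point can amortize, so a new sub-trail starts there.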

• Consider the window length w and distance bound e.

• Let Q be the query sequence

• If Len(Q) = w, it's all good.

• Algorithm Search_Short:

• Use DFT to map Q to a point q in feature space. The query region is a sphere of radius e centered at q.

• Retrieve all the sub-trails whose MBRs intersect the query region using our index.

• Throw away false alarms by computing the actual distance of each candidate subsequence from Q.
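Search_Short can be sketched end to end as below. Two simplifications are assumed: the index is a plain list of MBRs scanned linearly (standing in for the R*-tree), and sub-trails have a fixed size `chunk=3`. All function names are hypothetical.

```python
import cmath
import math

def features(window, n_coeffs=2):
    """First n_coeffs orthonormal DFT coefficients as real coordinates."""
    w = len(window)
    point = []
    for f in range(n_coeffs):
        c = sum(window[t] * cmath.exp(-2j * cmath.pi * f * t / w) for t in range(w)) / math.sqrt(w)
        point.extend((c.real, c.imag))
    return point

def build_index(s, w, chunk=3):
    """One (lo, hi, first_offset, count) entry per fixed-size sub-trail."""
    pts = [features(s[k:k + w]) for k in range(len(s) - w + 1)]
    index = []
    for start in range(0, len(pts), chunk):
        group = pts[start:start + chunk]
        lo = [min(col) for col in zip(*group)]
        hi = [max(col) for col in zip(*group)]
        index.append((lo, hi, start, len(group)))
    return index

def min_dist_to_box(q, lo, hi):
    """Smallest distance from point q to the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def search_short(s, index, q, e):
    """Offsets k where the window of s starting at k is within e of q."""
    qf, w, hits = features(q), len(q), []
    for lo, hi, start, count in index:
        if min_dist_to_box(qf, lo, hi) > e:
            continue  # MBR does not intersect the query sphere
        for k in range(start, start + count):
            # Post-processing: the true distance removes false alarms.
            d = math.sqrt(sum((s[k + i] - q[i]) ** 2 for i in range(w)))
            if d <= e:
                hits.append(k)
    return hits

S = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0]
index = build_index(S, w=4)
print(search_short(S, index, [1.0, 2.0, 3.0, 2.0], e=0.5))  # [1]
```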

• Now, what if Len(Q) > w?

• Requires more analysis, but basically assume that Len(Q) = p*w for an integer p.

• So we can split Q into p subsequences of length w.

• So we have...

• Algorithm Search_Long:

• Break sequence Q into p sub-queries of length w, each with radius e/sqrt(p)

• Retrieve from the index all the sub-trails whose MBRs intersect at least one of the sub-query regions.

• Examine the corresponding subsequences, discard false alarms.
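The radius e/sqrt(p) follows from a short counting argument, sketched here. The symbols Q_i (the i-th length-w piece of Q) and S_i (the aligned piece of a candidate subsequence S) are notation introduced for this note only.

```latex
D^2(Q, S) \;=\; \sum_{i=1}^{p} D^2(Q_i, S_i) \;\le\; e^2
\quad\Longrightarrow\quad
\min_{1 \le i \le p} D^2(Q_i, S_i) \;\le\; \frac{e^2}{p}
\quad\Longrightarrow\quad
\exists\, i:\; D(Q_i, S_i) \;\le\; \frac{e}{\sqrt{p}}
```

If every piece were farther than e/sqrt(p) from its counterpart, the squared distances would sum to more than e^2. So any true match has at least one piece inside the smaller radius, and retrieving the union of the p sub-query results cannot cause a false dismissal.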

• Stock price database with ~300,000 points

• 1 number = 4 bytes

• DFT keeping first 3 coefficients (actually 6 values, counting the real and imaginary parts)

• w = 512 bytes

• R*-tree

• Space

• Naïve methods: 24 MB

• This method: 5 KB

• Time - “short” queries (Len(Q) = w)

• 3 to 100 times better response times

• Time - “long” queries (Len(Q) > w)

• 10 to 100 times better response times