Efficient similarity search in sequence databases rakesh agrawal christos faloutsos and arun swami
1 / 18

Leila Kaghazian - PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Efficient Similarity Search in Sequence Databases Rakesh Agrawal, Christos Faloutsos and Arun Swami. Leila Kaghazian. Similarity. Exact queries Similarity Identify companies with similar pattern of growth Determine products with similar selling pattern

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

Leila Kaghazian

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Efficient Similarity Search in Sequence Databases Rakesh Agrawal, Christos Faloutsos and Arun Swami

Leila Kaghazian


  • Exact queries

  • Similarity

    • Identify companies with similar pattern of growth

    • Determine products with similar selling pattern

    • Discover stocks with similar movement in stock prices.

Similarity Queries

  • Whole Matching. The sequences to be compared have the same length n.

    • Range Query. Given a query sequences that are similar within distance “e”.

    • All-Pairs queries. Given N sequences, find the pairs of sequences that are within “e” of each other.

  • Subsequence Matching. The query sequence is smaller; we look for a subsequence in the large sequence that best matches the query sequence.

Extracting Features from Sequences

  • For numerical sequences, extracting K features, mapping it to k-dimensional space and using multidimensional index methods (R*-tree, R-tree,grid-files,…) to store and search these points.

  • Completeness of feature extracting

  • Dimensionality “curse”

Discrete Fourier Transform

  • All periodic waves can be generated by combining Sin and Cos waves of different frequencies

  • Number of Frequencies may not be finite

  • Fourier Transform Decomposes a Periodic Wave into its Component Frequencies

DFT Concept I

DFT Concept II

DFT Characteristics

  • Completeness of feature extracting

  • Dimensionality curse

  • Parseval theorem gives that Euclidean distance between two signals x and y in the time domain is the same as their Euclidean distance in the frequency domain

Proposed Technique

  • Obtain the coefficients of DFT of each sequence in the database

  • Build a multidimensional index (F-index) using the first fc (<5)Fourier coefficients.

  • For a range query, obtain the first fc Fourier coefficients of the query.

  • For an all-pairs query, doing a spatial join using the F-index (superset of the answer set)

  • The actual answer set is obtained in a post-processing step

Euclidean distance features

  • Euclidean distance is useful in many cases

  • It can be used with any other type of similarity measure

  • Euclidean distance is the optimal distance measure of estimation if signals are corrupted by Gaussian additive noise

  • It is preserved under orthonormal transforms

DFT Characteristics

  • Preserves the distance

  • Is easy to compute

  • Concentrate the energy of the signal in few coefficients

  • It’s a orthonormal transform

    • The data dependent ones

      • + better performance

      • - expensive data reorganization if data set evolves over time

    • Data independent ones(DFT, DCT, wavelet)

Number of Fourier coefficients

  • Worst-case signal is White noise when xt is completely independent of its neighbors.

    • It has the same energy in every frequency means all frequency are equally important. This is bad for F-index.

  • Random walks (brown noise)

    • Stock movements and exchange rates

  • Primary and secondary trends correspond to strong, low frequency signals while minor trends corresponds to weak, high frequency signals

Performance Experiments

  • How to choose the number of Fourier coefficients to be retained (cut-off frequency fc) in the F-index method.

    • A larger fc

      • reduces the false hits

      • increases the search time.

  • How does the search time grow as a function of number of sequences in the database?

  • How does the length n of the sequences affect the performance?

Number of Fourier coefficients

Range Queries

All-Pairs Queries

Different Sequence Set Size

All-Pairs Queries

Range Queries

Varying Sequence Length

All-Pairs Queries

Range Queries


  • The minimum execution time for both range and all-pairs queries is achieved for a small number of fc

  • Increasing the number of sequences in the database results in higher gains for this method

  • Increasing the length of the sequence n also results in higher gain for the method


  • Use DFT to extract sequence features

  • Only first few coefficient is strong enough

  • DFT is orthonotmal

  • Use R*-tree for indexing

  • Use Euclidean distance

  • Complexity is O(nlog(n))

  • Login