Similarity Searches in Sequence Databases

Sang-Hyun Park

KMeD Research Group

Computer Science Department

University of California, Los Angeles


Contents

  • Introduction

  • Whole Sequence Searches

  • Subsequence Searches

  • Segment-Based Subsequence Searches

  • Multi-Dimensional Subsequence Searches

  • Conclusion


What is Sequence?

[Figure: a temperature sequence, temperature (°C) between 5 and 25, sampled every two hours from 8AM to 10PM]

  • A sequence is an ordered list of elements.

    S = ⟨14.3, 18.2, 22.0, 22.4, 19.5, 17.1, 15.8, 15.1⟩

  • Sequences are a principal data format in many applications.


What is Similarity Search?

  • Similarity search finds sequences whose changing patterns are similar to that of a query sequence.

  • Example

    • Detect stocks with similar growth patterns

    • Find persons with similar voice clips

    • Find patients whose brain tumors have similar evolution patterns

  • Similarity search helps in clustering, data mining, and rule discovery.


Classification of Similarity Search

  • Similarity Searches are classified as:

    • Whole sequence searches

    • Subsequence searches

    • Example

      • S = ⟨1,2,3⟩

      • Subsequences (S) = { ⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨1,2⟩, ⟨2,3⟩, ⟨1,2,3⟩ }

      • In whole sequence searches, the sequence S itself is compared with a query sequence Q.

      • In subsequence searches, every possible subsequence of S can be compared with a query sequence q.


Similarity Measure

  • Lp Distance Metric

    • L1 : Manhattan distance or city-block distance

    • L2 : Euclidean distance

    • L∞ : maximum distance over all element pairs

    • requires that the two sequences have the same length
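The three Lp variants above can be illustrated with a minimal Python sketch (the function name lp_distance is mine):

```python
def lp_distance(s, q, p=2):
    """L_p distance between equal-length sequences; p=inf gives the maximum distance."""
    if len(s) != len(q):
        raise ValueError("L_p metrics require sequences of equal length")
    diffs = [abs(a - b) for a, b in zip(s, q)]
    if p == float("inf"):
        return max(diffs)          # L_inf: maximum distance in any element pair
    return sum(d ** p for d in diffs) ** (1 / p)

s = [14.3, 18.2, 22.0, 22.4]
q = [14.0, 18.0, 22.5, 22.0]
print(lp_distance(s, q, p=1))             # L1: Manhattan
print(lp_distance(s, q, p=2))             # L2: Euclidean
print(lp_distance(s, q, p=float("inf")))  # L_inf: maximum
```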


Similarity Measure (2)

  • Time Warping Distance

    • Originally introduced in the area of speech recognition

    • Allows sequences to be stretched along the time axis

      3,5,6 3,3,5,6  3,3,3,5,6  3,3,3,5,5,6  …

    • Each element of a sequence can be mapped to one or more neighboring elements of another sequence.

    • Useful in applications where sequences may be of different lengths or different sampling rates

Q = 10, 15, 20 

S =  10, 15, 16, 20 


Similarity Measure (3)

  • Time Warping Distance (2)

    • Defined recursively

    • Computed by dynamic programming technique, O(|S||Q|)

DTW (S, Q) = DBASE (S[1], Q[1]) + min ( DTW (S, Q[2:-]), DTW (S[2:-], Q), DTW (S[2:-], Q[2:-]) )

DBASE (S[1], Q[1]) = | S[1] − Q[1] |^p

(S[2:-] and Q[2:-] denote the suffixes of S and Q starting at the second element.)
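The recursion is normally evaluated bottom-up over a table rather than by literal recursion; a minimal Python sketch (names are mine), which reproduces the worked example on a later slide (DTW = 12):

```python
def dtw(s, q, p=1):
    """Time warping distance by dynamic programming, O(|S||Q|).
    Base distance per element pair is |a - b|**p."""
    INF = float("inf")
    n, m = len(s), len(q)
    # D[i][j] = cumulative distance of s[:i] vs q[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            base = abs(s[i - 1] - q[j - 1]) ** p
            # stretch s, stretch q, or advance both
            D[i][j] = base + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([4, 5, 6, 7, 6, 6], [3, 4, 3], p=1))  # 12.0, matching the slide's example
```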


Similarity Measure (4)

  • Time Warping Distance (3)

    • S = ⟨4,5,6,7,6,6⟩, Q = ⟨3,4,3⟩

    • When using L1 as a DBASE, DTW (S, Q) = 12

    • Cumulative distance table (rows follow S from bottom to top, columns follow Q from left to right):

            Q:   3    4    3
      S=6:     16   11   12
      S=6:     13    9   10
      S=7:     10    7    8
      S=6:      6    4    5
      S=5:      3    2    3
      S=4:      1    1    2

    • Each cell holds | S[i] − Q[j] | + min (V1, V2, V3), where V1, V2, V3 are the three previously filled neighboring cells (left, below, and diagonally below-left).


False Alarm and False Dismissal

  • False Alarm

    • Candidates not similar to a query.

    • Minimize false alarms for efficiency

  • False Dismissal

    • Similar sequences not retrieved by index search

    • Avoid false dismissals for correctness

[Figure: among all data sequences, the index returns a candidate set; candidates that are not truly similar are false alarms, and similar sequences outside the candidate set are false dismissals]


Contents

  • Introduction

  • Whole Sequence Searches

  • Subsequence Searches

  • Segment-Based Subsequence Searches

  • Multi-Dimensional Subsequence Searches

  • Conclusion


Problem Definition

  • Input

    • Set of data sequences {S}

    • Query sequence Q

    • Distance tolerance ε

  • Output

    • Set of data sequences whose distances to Q are within ε

  • Similarity Measure

    • Time warping distance function, DTW

    • L∞ as a distance function for each element pair

    • If the distance of every element pair is within ε, then DTW(S,Q) ≤ ε.


Previous Approaches

  • Naïve Scan [Ber96]

    • Read every data sequence from database

    • Apply dynamic programming technique

    • For m data sequences with average length L, O(mL|Q|)

  • FastMap-Based Technique [Yi98]

    • Use FastMap technique for feature extraction

    • Map features into multi-dimensional points

    • Use Euclidean distance in index space for filtering

    • Could not guarantee “no false dismissal”


Previous Approaches (2)

  • LB-Scan [Yi98]

    • Read every data sequence from database

    • Apply the lower-bound distance function Dlb which satisfies the following lower-bound theorem:

      Dlb (S,Q) > ε  ⇒  DTW (S,Q) > ε

    • Faster than the original time warping distance function (O(|S|+|Q|) vs. O(|S||Q|))

    • Guarantee no false dismissal

    • Based on sequential scanning


Proposed Approach

  • Goal

    • No false dismissal

    • High query processing performance

  • Sketch

    • Extract a time-warping invariant feature vector

    • Build a multi-dimensional index

    • Use a lower-bound distance function for filtering


Proposed Approach (2)

  • Feature Extraction

    • F(S) = ⟨ First(S), Last(S), Max(S), Min(S) ⟩

    • F(S) is invariant to time warping transformation.

  • Distance Function for Feature Vectors

    DFT (F(S), F(Q)) = max ( | First(S) − First(Q) |, | Last(S) − Last(Q) |, | Max(S) − Max(Q) |, | Min(S) − Min(Q) | )


Proposed Approach (3)

  • Distance Function for Feature Vectors (2)

    • Satisfies lower-bounding theorem:

      DFT (F(S),F(Q)) > ε  ⇒  DTW (S,Q) > ε

    • More accurate than Dlb proposed in LB-Scan

    • Faster than Dlb (O(1) vs. O(|S|+|Q|))


Proposed Approach (4)

  • Indexing

    • Build a multi-dimensional index from a set of feature vectors

    • Index entry ⟨ First(S), Last(S), Max(S), Min(S), Identifier(S) ⟩

  • Query Processing

    • Extract a feature vector F(Q)

    • Perform range queries in index space to find data points included in the following query rectangle:

      ⟨ [ First(Q) − ε, First(Q) + ε ], [ Last(Q) − ε, Last(Q) + ε ],

      [ Max(Q) − ε, Max(Q) + ε ], [ Min(Q) − ε, Min(Q) + ε ] ⟩

    • Perform post-processing to discard false alarms
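The filtering step above can be sketched as a linear scan over feature vectors; in the proposed approach the range query would run against an R-tree instead, but the lower-bound test is the same (function names are mine):

```python
# Whole-sequence filtering sketch: extract the time-warping-invariant feature
# vector <First, Last, Max, Min>, then keep only sequences whose feature
# distance D_FT is within the tolerance eps; survivors go to post-processing.
def features(s):
    return (s[0], s[-1], max(s), min(s))

def d_ft(fs, fq):
    # max over feature differences: a lower bound of the time warping distance
    return max(abs(a - b) for a, b in zip(fs, fq))

def filter_candidates(database, q, eps):
    fq = features(q)
    return [sid for sid, s in database.items() if d_ft(features(s), fq) <= eps]

db = {"S1": [10, 15, 16, 20], "S2": [50, 60, 70, 80]}
print(filter_candidates(db, [10, 15, 20], eps=2))  # ['S1']: S2 is filtered out
```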


Performance Evaluation

  • Implementation

    • Implemented with C++ on UNIX operating system

    • R-tree is used as a multi-dimensional index.

  • Experimental Setup

    • S&P 500 stock data set (m=545, L=232)

    • Random walk synthetic data set

    • SunSparc Ultra-5


Performance Evaluation (2)

  • Filtering Ratio

    • Better than LB-Scan


Performance Evaluation (3)

  • Query Processing Time

    • Faster than LB-Scan and Naïve-Scan


Contents

  • Introduction

  • Whole Sequence Searches

  • Subsequence Searches

  • Segment-Based Subsequence Searches

  • Multi-Dimensional Subsequence Searches

  • Conclusion


Problem Definition

  • Input

    • Set of data sequences {S}

    • Query sequence q

    • Distance tolerance ε

  • Output

    • Set of subsequences whose distances to q are within ε

  • Similarity Measure

    • Time warping distance function, DTW

    • Any Lp metric as a distance function for element pairs


Previous Approaches

  • Naïve-Scan [Ber96]

    • Read every data subsequence from database

    • Apply dynamic programming technique

    • For m data sequences with average length L, O(mL²|q|)


Previous Approaches (2)

  • ST-Index [Fal94]

    • Assume that the minimum query length (w) is known in advance.

    • Locates a sliding window of size w at every possible location

    • Extract a feature vector inside the window

    • Map a feature vector into a point and group trails into MBR (Minimum Bounding Rectangle)

    • Use Euclidean distance in index space for filtering

    • Could not guarantee “no false dismissal”


Proposed Approach

  • Goal

    • No false dismissal

    • High performance

    • Support diverse similarity measures

  • Sketch

    • Convert into sequences of discrete symbols

    • Build a sparse suffix tree

    • Use a lower-bound distance function for filtering

    • Apply branch-pruning to reduce the search space


Proposed Approach (2)

  • Conversion

    • Generate categories from the distribution of element values

      • Maximum-entropy method

      • Equal-interval method

      • DISC method

    • Convert each element to the symbol of its corresponding category

    • Example

      A = [0, 1.0], B = [1.1, 2.0], C = [2.1, 3.0], D = [3.1, 4.0]

      S = ⟨1.3, 1.6, 2.9, 3.3, 1.5, 0.1⟩

      SC = ⟨B, B, C, D, B, A⟩


Proposed Approach (3)

  • Indexing

    • Extract suffixes from sequences of discrete symbols.

    • Example

      From S1C = ⟨A, B, B, A⟩,

      we extract four suffixes: ABBA, BBA, BA, A


Proposed Approach (4)

  • Indexing (2)

    • Build a suffix tree.

      • The suffix tree was originally proposed for retrieving substrings that exactly match a query string.

      • A suffix tree consists of nodes and edges.

      • Each suffix is represented by the path from the root node to a leaf node.

      • The labels on the path from the root to an internal node Ni represent the longest common prefix of the suffixes under Ni.

      • A suffix tree is built with computation and space complexity O(mL).
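As a simplified illustration of the index, here is an uncompressed suffix trie over symbol sequences (names are mine). A true suffix tree additionally compresses single-child chains into multi-symbol edge labels, which is what makes O(mL) construction and storage possible:

```python
# Uncompressed suffix trie: each node is a dict from symbol to child node, and
# a "$" entry records which suffixes end at that node.
def build_suffix_trie(sequences):
    root = {}
    for name, seq in sequences.items():
        for i in range(len(seq)):
            node = root
            for sym in seq[i:]:
                node = node.setdefault(sym, {})
            node.setdefault("$", []).append(f"{name}[{i + 1}:-]")
    return root

trie = build_suffix_trie({"S1C": "ABBA", "S2C": "AB"})
print(trie["A"]["B"]["$"])  # the suffix whose path spells exactly "AB": ['S2C[1:-]']
```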


Proposed Approach (4)

  • Indexing (3)

    • Example : suffix tree from S1C= A, B, B, A and S2C= A, B

[Figure: suffix tree built from S1C = ⟨A, B, B, A⟩ and S2C = ⟨A, B⟩; each leaf, marked by $, corresponds to one suffix: S1C[1:-], S1C[2:-], S1C[3:-], S1C[4:-], S2C[1:-], S2C[2:-]]


Proposed Approach (5)

  • Query Processing

query (q, )

Index Searching

candidates

answers

Post Processing

suffix tree

data sequences


Proposed Approach (6)

  • Index Searching

    • Visit each node of suffix tree by depth-first traversal.

    • Build lower-bound distance table for q and edge labels.

    • Inspect the last columns of newly added rows to find candidates.

    • Apply branch-pruning to reduce the search space.

    • Branch-pruning theorem:

      If all columns of the last row of the distance table have values larger than the distance tolerance ε, adding more rows to this table cannot yield new values less than or equal to ε.
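The pruning test itself is a one-line check over the table's last row; a minimal sketch (the function name prune_branch is mine):

```python
# Branch-pruning test used during suffix-tree traversal: if every column of the
# distance table's last row already exceeds eps, extending the edge label
# (adding rows) can never bring any entry back within eps, because each new
# cell adds a non-negative base distance to a minimum over earlier cells.
def prune_branch(last_row, eps):
    return all(v > eps for v in last_row)

print(prune_branch([2.1, 2.1, 4.1], eps=1.5))  # True: stop descending this branch
print(prune_branch([1.0, 2.0, 1.1], eps=1.5))  # False: a match may still appear below
```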


Proposed Approach (7)

  • Index Searching (2)

    • Example : q = 2, 2, 1,  = 1.5

N1

A

1

2

2

A

2

2

1

q

…..

N2

B

D

B

1

1

1.1

D

2.1

2.1

4.1

A

1

2

2

N3

N4

A

1

2

2

2

2

1

q

2

2

1

q

…..

…..


Proposed Approach (8)

  • Lower-Bound Distance Function DTW-LB

    DBASE-LB (A, v) =

      0                 if v is within the range of A (A.min ≤ v ≤ A.max)

      (A.min − v)^p     if v is smaller than A.min

      (v − A.max)^p     if v is larger than A.max

    • The possible minimum distance is 0 when v lies inside the category, (A.min − v)^p when v is below it, and (v − A.max)^p when v is above it.
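The piecewise definition above is a point-to-interval distance; a minimal sketch (names are mine):

```python
# Lower-bound base distance between a category A = [lo, hi] and a query value
# v: the smallest possible |x - v|**p over any element x inside the category.
def d_base_lb(lo, hi, v, p=1):
    if lo <= v <= hi:
        return 0.0            # v falls inside the category: distance can be 0
    if v < lo:
        return (lo - v) ** p  # possible minimum is at the category's lower edge
    return (v - hi) ** p      # possible minimum is at the category's upper edge

print(d_base_lb(1.0, 2.0, 1.5))  # 0.0
print(d_base_lb(1.0, 2.0, 0.5))  # 0.5
print(d_base_lb(1.0, 2.0, 3.0))  # 1.0
```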


Proposed Approach (9)

  • Lower-Bound Distance Function DTW-LB (2)

    • satisfies the lower-bounding theorem

      DTW-LB (sC, q) > ε  ⇒  DTW (s, q) > ε

    • computation complexity O(|sC||q|)

    DTW-LB (sC, q) = DBASE-LB (sC[1], q[1]) + min ( DTW-LB (sC, q[2:-]), DTW-LB (sC[2:-], q), DTW-LB (sC[2:-], q[2:-]) )


Proposed Approach (10)

  • Computation Complexity

    • m is the number of data sequences.

    • L is the average length of data sequences.

    • The left expression is for index searching.

    • The right expression is for post-processing.

    • RP (≥ 1) is the reduction factor by branch-pruning.

    • RD (≥ 1) is the reduction factor by sharing distance tables.

    • n is the number of subsequences requiring post-processing.


Proposed Approach (11)

  • Sparse Indexing

    • The index size is linear in the number of suffixes stored.

    • To reduce the index size, we build a sparse suffix tree (SST).

    • That is, we store the suffix SC[i:-] only if SC[i] ≠ SC[i−1].

    • Compaction Ratio C = (total number of suffixes) / (number of stored suffixes)

    • Example

      • SC = ⟨A, A, A, A, C, B, B⟩

      • store only three suffixes (SC[1:-], SC[5:-], and SC[6:-])

      • compaction ratio C = 7/3
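The sparse selection rule can be sketched in a few lines (names are mine); it reproduces the slide's example:

```python
# Store the suffix starting at position i only when the symbol changes
# (S_C[i] != S_C[i-1]); the first suffix is always stored.
def sparse_suffix_starts(sc):
    return [i for i in range(len(sc)) if i == 0 or sc[i] != sc[i - 1]]

sc = "AAAACBB"
starts = sparse_suffix_starts(sc)
print([f"SC[{i + 1}:-]" for i in starts])  # ['SC[1:-]', 'SC[5:-]', 'SC[6:-]']
print(len(sc) / len(starts))               # compaction ratio C = 7/3
```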


Proposed Approach (12)

  • Sparse Indexing (2)

    • When traversing the suffix tree, we need to find non-stored suffixes and compute their distances to q.

    • Assume that k elements of sC have the same value.

    • Then, sC[1:-] is stored but sC[i:-] (i=2,3,…,k) is not stored.

    • For non-stored suffixes,

      we introduce another lower-bound distance function.

      DTW-LB2 (sC[i:-], q) = DTW-LB (sC, q) − (i − 1) × DBASE-LB (sC[1], q[1])

    • DTW-LB2 satisfies the lower-bounding theorem.

    • DTW-LB2 is O(1) when DTW-LB(sC, q) is given.


Proposed Approach (13)

  • Sparse Indexing (3)

    • With sparse indexing, the complexity becomes:

      • m is the number of data sequences.

      • L is the average length of data sequences.

      • C is the compaction ratio.

      • n is the number of subsequences requiring post-processing.

      • RP (≥ 1) is the reduction factor by branch-pruning.

      • RD (≥ 1) is the reduction factor by sharing distance tables.


Performance Evaluation

  • Implementation

    • Implemented with C++ on UNIX operating system

  • Experimental Setup

    • S&P 500 stock data set (m=545, L=232)

    • Random walk synthetic data set

    • Maximum-Entropy (ME) categorization

    • Disk-based suffix tree construction algorithm

    • SunSparc Ultra-5


Performance Evaluation (2)

  • Comparison with Naïve-Scan

    • increasing distance tolerances

    • S&P 500 stock data set, |q|=20


Performance Evaluation (3)

  • Scalability Test

    • increasing average length of data sequences

    • random-walk data set, |q|=20,m=200


Performance Evaluation (4)

  • Scalability Test (2)

    • increasing total number of data sequences

    • random-walk data set, |q|=20, L=200


Contents

  • Introduction

  • Whole Sequence Searches

  • Subsequence Searches

  • Segment-Based Subsequence Searches

  • Multi-Dimensional Subsequence Searches

  • Conclusion


Introduction

  • We extend the proposed subsequence searching method to large sequence databases.

  • In the retrieval of similar subsequences with the time warping distance function,

    • Sequential Scanning is O(mL²|q|).

    • The proposed method is O(mL²|q| / R) (R ≥ 1).

    • The quadratic dependence on L makes search algorithms suffer severe performance degradation when L is very large.

  • For a database with long sequences, we need a new searching scheme linear in L.


SBASS

  • We propose a new searching scheme: Segment-Based Subsequence Searching scheme (SBASS)

    • Sequences are divided into a series of piece-wise segments.

    • When a query sequence q with k segments is submitted, q is compared with those subsequences which consist of k consecutive data segments.

    • The lengths of segments may be different.

    • SS represents the segmented sequence of S.

      S = 4,5,8,9,11,8,4,3 |S| = 8

      SS = 4,5,8,9,11, 8,4,3 |SS| = 2


SBASS (2)

  • With |qS| = 2 and |SS| = 5, only four subsequences of SS are compared with qS:

    ⟨SS[1],SS[2]⟩, ⟨SS[2],SS[3]⟩, ⟨SS[3],SS[4]⟩, ⟨SS[4],SS[5]⟩
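The SBASS matching rule, enumerating runs of k consecutive data segments, can be sketched directly (names are mine):

```python
# A query with k segments is compared only against subsequences made of k
# consecutive data segments, giving |SS| - k + 1 candidates per data sequence.
def segment_windows(ss, k):
    return [ss[i:i + k] for i in range(len(ss) - k + 1)]

ss = [[4, 5, 8, 9, 11], [8, 4, 3], [5, 7], [9, 9, 6], [2]]  # 5 segments
print(len(segment_windows(ss, k=2)))  # 4 candidate subsequences, as on the slide
```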


SBASS (3)

  • For SBASS scheme, we define the piece-wise time warping distance function (where k = |qS| = |sS|).

  • Sequential scanning for SBASS scheme is O(mL|q|).

  • We introduce an indexing technique with O(mL|q|/R) (R ≥ 1).


Sketch of Proposed Approach

  • Indexing

    • Convert sequences to segmented sequences.

    • Extract a feature vector from each segment.

    • Categorize feature vectors.

    • Convert segmented sequences to sequences of symbols.

    • Construct suffix tree from sequences of symbols.

  • Query Processing

    • Traverse the suffix tree to find candidates.

    • Discard false alarms in post processing.


Segmentation

  • Approach

    • Divide at peak points.

    • Divide further if the maximum deviation from the interpolation line is too large.

    • Eliminate noise.

  • Compaction Ratio (C) = |S| / |SS|

  [Figure: a sequence segmented at peak points, with an extra split where the deviation from the interpolation line is too large, and noise spikes eliminated]


Feature Extraction

  • From each subsequence segment, extract a feature vector:

    (V1, VL, L, θ+, θ−)

    • V1 and VL are the segment's first and last values, L is its length, and θ+ / θ− are the maximum deviations above and below the line interpolating V1 and VL.
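A minimal sketch of per-segment feature extraction, assuming the five features are the first value, last value, length, and the maximum deviations above and below the straight line interpolating the segment's endpoints; this reading reproduces the worked example on the categorization slide (names are mine):

```python
# Extract (V1, VL, L, theta_plus, theta_minus) from a segment of length >= 2.
def segment_features(seg):
    n = len(seg)
    v1, vl = seg[0], seg[-1]
    # signed deviation of each point from the endpoint interpolation line
    devs = [seg[i] - (v1 + (vl - v1) * i / (n - 1)) for i in range(n)]
    return (v1, vl, n, max(max(devs), 0.0), max(-min(devs), 0.0))

print(segment_features([4, 5, 8, 8, 8, 8, 9, 11]))  # (4, 11, 8, 2.0, 1.0)
print(segment_features([8, 4, 3]))                  # (8, 3, 3, 0.0, 1.5)
```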


Categorization and Index Construction

  • Categorization

    • Group similar feature vectors together using multi-dimensional categorization methods like Multi-attribute Type Abstraction Hierarchy (MTAH).

    • Assign unique symbol to each category

    • Convert segmented sequences to sequences of symbols.

      S = 4,5,8,8,8,8,9,11,8,4,3

      SS = 4,5,8,8,8,8,9,11, 8,4,3

      SF = (4,11,8,2,1), (8,3,3,0,1.5)

      SC = A,B

  • From sequences of symbols, construct the suffix tree.


Query Processing

  • For query processing, we calculate lower-bound distances between symbols and keep them in a table.

  • Given the query sequence q and the distance tolerance ε,

    • Convert q to qS and then to qC.

    • Search the suffix tree to find those subsequences whose lower-bound distances to qC are within ε.

    • Discard false alarms in post processing.


Query Processing (2)

(q, ε) → qS → qC → [Index Searching, over the suffix tree] → candidates → [Post Processing, against the data sequences] → answers


Computation Complexity

  • Sequential scanning is O(mL|q|).

  • The complexity of the proposed search algorithm is:

    • n is the number of subsequences contained in candidates.

    • C is the compaction ratio, i.e., the average number of elements per segment.

    • RD (≥ 1) is the reduction factor by sharing edges of the suffix tree.


Performance Evaluation

  • Test Set : Pseudo Periodic Synthetic Sequences

  • m = 100, L = 10,000

  • Achieved up to 6.5 times speed-up compared to sequential scanning.

[Figure: query processing time (sec, 0–60) vs. distance tolerance (0.2–1.0) for SeqScan and Our Approach]


Contents

  • Introduction

  • Whole Sequence Searches

  • Subsequence Searches

  • Segment-Based Subsequence Searches

  • Multi-Dimensional Subsequence Searches

  • Conclusion


Introduction

  • So far, we assumed that elements have single-dimensional numeric values.

  • Now, we consider multi-dimensional sequences.

    • Image Sequences

    • Video Streams

[Figure: a medical image sequence]


Introduction (2)

  • In multi-dimensional sequences, elements are represented by feature vectors.

    S = S[1], …, S[N], S[i] = (S[i][1], …, S[i][F])

  • Our proposed subsequence searching techniques are extended to the retrieval of similar multi-dimensional subsequences.


Introduction (3)

  • Multi-Dimensional Time Warping Distance

    DMTW (S, Q) = DMBASE (S[1], Q[1]) + min ( DMTW (S, Q[2:-]), DMTW (S[2:-], Q), DMTW (S[2:-], Q[2:-]) )

    DMBASE (S[1], Q[1]) = Σ (i = 1..F) Wi · | S[1][i] − Q[1][i] |

    • F is the number of features in each element.

    • Wi is the weight of the i-th dimension.
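The multi-dimensional distance reuses the one-dimensional DP recurrence with a weighted-sum base distance; a minimal sketch (names are mine, and the base distance is assumed to be the weighted sum of absolute per-feature differences):

```python
# Multi-dimensional time warping distance: same DP table as 1-D DTW, but each
# element is a feature vector and the base distance is sum_i W_i * |s_i - q_i|.
def mdtw(s, q, w):
    INF = float("inf")
    n, m = len(s), len(q)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            base = sum(wf * abs(a - b) for wf, a, b in zip(w, s[i - 1], q[j - 1]))
            D[i][j] = base + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# elements are 2-feature vectors, weighted 0.6 / 0.4
print(mdtw([(1, 2), (2, 3)], [(1, 2), (2, 3)], w=(0.6, 0.4)))  # 0.0
```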


Sketch of Our Approach

  • Indexing

    • Categorize multi-dimensional element values using MTAH.

    • Assign unique symbols to categories.

    • Convert multi-dimensional sequences into sequences of symbols.

    • Construct suffix tree from a set of sequences of symbols.

  • Query Processing

    • Traverse suffix tree.

    • Find candidates whose lower-bound distances to q are within ε.

    • Do post processing to discard false alarms.


Application to KMeD

  • In the environment of KMeD, the proposed technique is applied to the retrieval of medical image sequences having similar spatio-temporal characteristics to those of the query sequence.

  • KMeD [CCT:95] has the following features:

    • Query by both image and alphanumeric contents

    • Model temporal, spatial and evolutionary nature of objects

    • Formulate queries using conceptual and imprecise terms

    • Support cooperative processing


Application to KMeD (2)

  • Query

    • Medical Image Sequence

    • Attribute names and their relative weights

    • Distance tolerance

  • Example attributes and weights: DistFromLV (0.6), Size (0.3), Circularity (0.1)


Application to KMeD (3)

Query → [Query Analysis, guided by the User Model] → [Contour Extraction] → [Feature Extraction] → [Similarity Searches, using the distance function, index structure, and medical image sequences] → matching sequences → [Visual Presentation] → user feedback


Contents

  • Introduction

  • Whole Sequence Searches

  • Subsequence Searches

  • Segment-Based Subsequence Searches

  • Multi-Dimensional Subsequence Searches

  • Conclusion


Summary

  • A sequence is an ordered list of elements.

  • Similarity search helps in clustering and data mining.

  • For sequences of different lengths or different sampling rates, time warping distance is useful.

  • We proposed a whole sequence searching method using a spatial access method and a lower-bound distance function.

  • We proposed a subsequence searching method using a suffix tree and lower-bound distance functions.

  • We proposed a segment-based subsequence searching method for large sequence databases.

  • We extended the subsequence searching method to the retrieval of similar multi-dimensional subsequences.


Contribution

  • We proposed a tighter and faster lower-bound distance function for efficient whole sequence searches without false dismissal.

  • We demonstrated the feasibility of using the time warping similarity measure on a suffix tree.

  • We introduced the branch-pruning theorem and a fast lower-bound distance function for efficient subsequence searches without false dismissal.

  • We applied categorization and sparse indexing for scalability.

  • We applied the proposed technique to a real application (KMeD).

