Similarity Searches in Sequence Databases

Sang-Hyun Park

KMeD Research Group

Computer Science Department

University of California, Los Angeles

Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion
[Chart: temperature (°C) sampled every two hours from 8AM to 10PM.]
What is Sequence?
  • A sequence is an ordered list of elements.

S = 14.3, 18.2, 22.0, 22,4, 19.5, 17.1, 15.8, 15.1

  • Sequences are principal data format in many applications.
What is Similarity Search?
  • Similarity search finds sequences whose changing patterns are similar to that of a query sequence.
  • Example
    • Detect stocks with similar growth patterns
    • Find persons with similar voice clips
    • Find patients whose brain tumors have similar evolution patterns
  • Similarity search helps in clustering, data mining, and rule discovery.
Classification of Similarity Search
  • Similarity Searches are classified as:
    • Whole sequence searches
    • Subsequence searches
    • Example
      • S = ⟨1,2,3⟩
      • Subsequences(S) = { ⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨1,2⟩, ⟨2,3⟩, ⟨1,2,3⟩ }
      • In whole sequence searches, the sequence S itself is compared with a query sequence Q.
      • In subsequence searches, every possible subsequence of S can be compared with a query sequence q.
Similarity Measure
  • Lp Distance Metric
    • L1: Manhattan distance (city-block distance)
    • L2: Euclidean distance
    • L∞: maximum distance over all element pairs
    • Requires that the two sequences have the same length
Similarity Measure (2)
  • Time Warping Distance
    • Originally introduced in the area of speech recognition
    • Allows sequences to be stretched along the time axis

3,5,6 3,3,5,6  3,3,3,5,6  3,3,3,5,5,6  …

    • Each element of a sequence can be mapped to one or more neighboring elements of another sequence.
    • Useful in applications where sequences may be of different lengths or different sampling rates

Q = 10, 15, 20 

S =  10, 15, 16, 20 

Similarity Measure (3)
  • Time Warping Distance (2)
    • Defined recursively
    • Computed by a dynamic programming technique in O(|S||Q|) time

DTW(S, Q) = DBASE(S[1], Q[1]) + min { DTW(S, Q[2:-]), DTW(S[2:-], Q), DTW(S[2:-], Q[2:-]) }

DBASE(S[1], Q[1]) = | S[1] − Q[1] |^p

(S[2:-] denotes the suffix of S starting at its second element.)
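The recurrence above can be sketched in Python (an illustrative implementation, not the thesis code), using L1 as the base distance:

```python
# Time warping distance via dynamic programming, O(|S||Q|).
def dtw(S, Q):
    INF = float("inf")
    n, m = len(S), len(Q)
    # d[i][j] = DTW distance between S[:i+1] and Q[:j+1]
    d = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                prev = 0
            else:
                prev = min(
                    d[i - 1][j] if i > 0 else INF,          # stretch Q
                    d[i][j - 1] if j > 0 else INF,          # stretch S
                    d[i - 1][j - 1] if i > 0 and j > 0 else INF,
                )
            d[i][j] = abs(S[i] - Q[j]) + prev  # L1 as DBASE
    return d[n - 1][m - 1]

# The deck's example: S = <4,5,6,7,6,6>, Q = <3,4,3>.
print(dtw([4, 5, 6, 7, 6, 6], [3, 4, 3]))  # -> 12
```

Note how sequences of different lengths (⟨10,15,20⟩ vs. ⟨10,15,16,20⟩) get a small distance, since 15 may map to both 15 and 16.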

Similarity Measure (4)
  • Time Warping Distance (3)
    • S = ⟨4,5,6,7,6,6⟩, Q = ⟨3,4,3⟩
    • When using L1 as DBASE, DTW(S, Q) = 12

Cumulative distance table:

          Q:    3    4    3
  S[1]=4:       1    1    2
  S[2]=5:       3    2    3
  S[3]=6:       6    4    5
  S[4]=7:      10    7    8
  S[5]=6:      13    9   10
  S[6]=6:      16   11   12

Each cell (i, j) holds | S[i] − Q[j] | + min(V1, V2, V3), where V1, V2, and V3 are the previously computed neighboring cells; the final cell gives DTW(S, Q) = 12.

False Alarm and False Dismissal
  • False Alarm
    • Candidates not similar to a query.
    • Minimize false alarms for efficiency
  • False Dismissal
    • Similar sequences not retrieved by index search
    • Avoid false dismissals for correctness

[Diagram: within the set of data sequences, the candidate set overlaps the set of truly similar sequences; candidates outside the overlap are false alarms, and similar sequences outside the candidate set are false dismissals.]
Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion
Problem Definition
  • Input
    • Set of data sequences {S}
    • Query sequence Q
    • Distance tolerance ε
  • Output
    • Set of data sequences whose distances to Q are within ε
  • Similarity Measure
    • Time warping distance function, DTW
    • L∞ as a distance function for each element pair
    • If the distance of every element pair is within ε, then DTW(S, Q) ≤ ε.
Previous Approaches
  • Naïve Scan [Ber96]
    • Read every data sequence from database
    • Apply dynamic programming technique
    • For m data sequences with average length L, O(mL|Q|)
  • FastMap-Based Technique [Yi98]
    • Use FastMap technique for feature extraction
    • Map features into multi-dimensional points
    • Use Euclidean distance in index space for filtering
    • Could not guarantee “no false dismissal”
Previous Approaches (2)
  • LB-Scan [Yi98]
    • Read every data sequence from database
    • Apply the lower-bound distance function Dlb which satisfies the following lower-bound theorem:

Dlb(S, Q) > ε  ⇒  DTW(S, Q) > ε

    • Faster than the original time warping distance function (O(|S|+|Q|) vs. O(|S||Q|))
    • Guarantees no false dismissals
    • Still based on sequential scanning
Proposed Approach
  • Goal
    • No false dismissal
    • High query processing performance
  • Sketch
    • Extract a time-warping invariant feature vector
    • Build a multi-dimensional index
    • Use a lower-bound distance function for filtering
Proposed Approach (2)
  • Feature Extraction
    • F(S) = ⟨First(S), Last(S), Max(S), Min(S)⟩
    • F(S) is invariant to time warping transformation.
  • Distance Function for Feature Vectors

DFT (F(S), F(Q)) = max { | First(S) − First(Q) |, | Last(S) − Last(Q) |, | Max(S) − Max(Q) |, | Min(S) − Min(Q) | }
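A minimal sketch of the feature vector and its distance (illustrative Python; time warping only repeats elements, so none of the four features change under warping):

```python
# F(S) = <First, Last, Max, Min> -- invariant to time warping.
def features(S):
    return (S[0], S[-1], max(S), min(S))

# D_FT: max over the per-feature absolute differences.
def d_ft(fs, fq):
    return max(abs(a - b) for a, b in zip(fs, fq))

# The earlier example: DTW(<4,5,6,7,6,6>, <3,4,3>) = 12, and the
# feature distance stays below it, as the lower-bounding theorem requires.
print(d_ft(features([4, 5, 6, 7, 6, 6]), features([3, 4, 3])))  # -> 3
```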

Proposed Approach (3)
  • Distance Function for Feature Vectors (2)
    • Satisfies lower-bounding theorem:

DFT (F(S), F(Q)) > ε  ⇒  DTW(S, Q) > ε

    • More accurate than Dlb proposed in LB-Scan
    • Faster than Dlb (O(1) vs. O(|S|+|Q|))
Proposed Approach (4)
  • Indexing
    • Build a multi-dimensional index from a set of feature vectors
    • Index entry: ⟨First(S), Last(S), Max(S), Min(S), Identifier(S)⟩
  • Query Processing
    • Extract a feature vector F(Q)
    • Perform range queries in index space to find data points included in the following query rectangle:

⟨ [First(Q) − ε, First(Q) + ε], [Last(Q) − ε, Last(Q) + ε],

[Max(Q) − ε, Max(Q) + ε], [Min(Q) − ε, Min(Q) + ε] ⟩

    • Perform post-processing to discard false alarms
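The query processing steps above can be sketched as follows; a linear scan over stored feature vectors stands in for the R-tree range query, and the `db` layout and function names are illustrative, not from the thesis:

```python
def features(S):
    # F(S) = <First, Last, Max, Min>, as on the previous slide
    return (S[0], S[-1], max(S), min(S))

def dtw(S, Q):
    # time warping distance with L1 base distance (earlier slides)
    INF = float("inf")
    d = [[INF] * len(Q) for _ in S]
    for i in range(len(S)):
        for j in range(len(Q)):
            prev = 0 if i == j == 0 else min(
                d[i - 1][j] if i else INF,
                d[i][j - 1] if j else INF,
                d[i - 1][j - 1] if i and j else INF,
            )
            d[i][j] = abs(S[i] - Q[j]) + prev
    return d[-1][-1]

def search(db, Q, eps):
    fq = features(Q)
    # "range query": keep feature points inside the query rectangle
    candidates = [sid for sid, S in db.items()
                  if all(abs(a - b) <= eps for a, b in zip(features(S), fq))]
    # post-processing: exact distance check discards false alarms
    return [sid for sid in candidates if dtw(db[sid], Q) <= eps]

db = {"s1": [1, 2, 3], "s2": [10, 11, 12], "s3": [1, 2, 2, 3]}
print(search(db, [1, 2, 3], 1.0))  # -> ['s1', 's3']
```

Because DFT lower-bounds DTW, the rectangle filter can produce false alarms but never false dismissals.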
Performance Evaluation
  • Implementation
    • Implemented in C++ on a UNIX operating system
    • R-tree is used as a multi-dimensional index.
  • Experimental Setup
    • S&P 500 stock data set (m=545, L=232)
    • Random walk synthetic data set
    • SunSparc Ultra-5
Performance Evaluation (2)
  • Filtering Ratio
    • Better than LB-Scan
Performance Evaluation (3)
  • Query Processing Time
    • Faster than LB-Scan and Naïve-Scan
Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion
Problem Definition
  • Input
    • Set of data sequences {S}
    • Query sequence q
    • Distance tolerance ε
  • Output
    • Set of subsequences whose distances to q are within ε
  • Similarity Measure
    • Time warping distance function, DTW
    • Any Lp metric as a distance function for element pairs
Previous Approaches
  • Naïve-Scan [Ber96]
    • Read every data subsequence from database
    • Apply dynamic programming technique
    • For m data sequences with average length L, O(mL²|q|)
Previous Approaches (2)
  • ST-Index [Fal94]
    • Assume that the minimum query length (w) is known in advance.
    • Locate a sliding window of size w at every possible position.
    • Extract a feature vector inside the window.
    • Map each feature vector to a point and group the trails into MBRs (Minimum Bounding Rectangles).
    • Use Euclidean distance in index space for filtering.
    • Could not guarantee "no false dismissal"
Proposed Approach
  • Goal
    • No false dismissal
    • High performance
    • Support diverse similarity measures
  • Sketch
    • Convert into sequences of discrete symbols
    • Build a sparse suffix tree
    • Use a lower-bound distance function for filtering
    • Apply branch-pruning to reduce the search space
Proposed Approach (2)
  • Conversion
    • Generate categories from the distribution of element values
      • Maximum-entropy method
      • Equal-interval method
      • DISC method
    • Convert element to the symbol of the corresponding category
    • Example

A = [0, 1.0], B = [1.1, 2.0], C = [2.1, 3.0], D = [3.1, 4.0]

S = ⟨1.3, 1.6, 2.9, 3.3, 1.5, 0.1⟩

SC = ⟨B, B, C, D, B, A⟩

Proposed Approach (3)
  • Indexing
    • Extract suffixes from sequences of discrete symbols.
    • Example

From S1C = ⟨A, B, B, A⟩, we extract four suffixes: ABBA, BBA, BA, and A.

Proposed Approach (4)
  • Indexing (2)
    • Build a suffix tree.
      • The suffix tree was originally proposed to retrieve substrings exactly matching a query string.
      • A suffix tree consists of nodes and edges.
      • Each suffix is represented by the path from the root node to a leaf node.
      • Labels on the path from the root to an internal node Ni represent the longest common prefix of the suffixes under Ni.
      • A suffix tree is built with O(mL) computation and space complexity.
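A minimal sketch of the index structure (an uncompressed suffix trie rather than a true suffix tree, so edges carry one symbol each and construction is O(mL²) instead of O(mL); the dict-based layout is illustrative):

```python
def build_suffix_trie(sequences):
    """Insert every suffix of every symbol sequence, terminated by '$'."""
    root = {}
    for sid, seq in sequences.items():
        for i in range(len(seq)):
            node = root
            for sym in seq[i:] + "$":  # one dict level per symbol
                node = node.setdefault(sym, {})
            # label the '$' leaf with its source suffix, e.g. ("S1C", 0)
            node["suffix"] = (sid, i)
    return root

# The deck's example: suffixes of S1C = <A,B,B,A> and S2C = <A,B>.
trie = build_suffix_trie({"S1C": "ABBA", "S2C": "AB"})
```

Shared prefixes (here, both sequences start with AB) share a path from the root, which is what lets later slides share distance-table rows across sequences.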
Proposed Approach (4)
  • Indexing (3)
    • Example: suffix tree from S1C = ⟨A, B, B, A⟩ and S2C = ⟨A, B⟩

A

B

B

B

$

A

A

B

$

$

$

$

A

$

S1C[1:-]

S2C[1:-]

S1C[4:-]

S1C[2:-]

S1C[3:-]

S2C[2:-]

Proposed Approach (5)
  • Query Processing

query (q, )

Index Searching

candidates

answers

Post Processing

suffix tree

data sequences

Proposed Approach (6)
  • Index Searching
    • Visit each node of suffix tree by depth-first traversal.
    • Build lower-bound distance table for q and edge labels.
    • Inspect the last columns of newly added rows to find candidates.
    • Apply branch-pruning to reduce the search space.
    • Branch-pruning theorem:

If all columns of the last row of the distance table have values larger than the distance tolerance ε, adding more rows to this table cannot yield new values less than or equal to ε.

Proposed Approach (7)
  • Index Searching (2)
    • Example: q = ⟨2, 2, 1⟩, ε = 1.5

[Figure: depth-first traversal of the suffix tree. At each node (N1–N4), a distance-table row is appended for the edge label (A, B, or D) against q = ⟨2, 2, 1⟩. For example, the row for edge label D contains 2.1, 2.1, and 4.1, all above ε, so branch-pruning cuts that subtree.]
Proposed Approach (8)
  • Lower-Bound Distance Function DTW-LB

DBASE-LB(A, v) =
    0                  if v is within the range of A (A.min ≤ v ≤ A.max)
    (A.min − v)^p      if v is smaller than A.min (v < A.min)
    (v − A.max)^p      if v is larger than A.max (v > A.max)

That is, DBASE-LB(A, v) is the possible minimum distance between v and any value in the category range [A.min, A.max].

Proposed Approach (9)
  • Lower-Bound Distance Function DTW-LB (2)
    • satisfies the lower-bounding theorem

DTW-LB(sC, q) > ε  ⇒  DTW(s, q) > ε

    • computation complexity O(|sC||q|)

DTW-LB(sC, q) = DBASE-LB(sC[1], q[1]) + min { DTW-LB(sC, q[2:-]), DTW-LB(sC[2:-], q), DTW-LB(sC[2:-], q[2:-]) }
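The two functions can be sketched together (the category ranges below are the earlier example's and are assumed for illustration; p = 1):

```python
RANGES = {"A": (0.0, 1.0), "B": (1.1, 2.0), "C": (2.1, 3.0), "D": (3.1, 4.0)}

def dbase_lb(sym, v):
    """Smallest possible distance between v and any value in the category."""
    lo, hi = RANGES[sym]
    if v < lo:
        return lo - v
    if v > hi:
        return v - hi
    return 0.0

def dtw_lb(sC, q):
    """Lower bound on DTW(s, q) computed from the symbol sequence sC alone."""
    INF = float("inf")
    n, m = len(sC), len(q)
    d = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            prev = 0.0 if i == 0 and j == 0 else min(
                d[i - 1][j] if i > 0 else INF,
                d[i][j - 1] if j > 0 else INF,
                d[i - 1][j - 1] if i > 0 and j > 0 else INF,
            )
            d[i][j] = dbase_lb(sC[i], q[j]) + prev
    return d[n - 1][m - 1]
```

For s = ⟨1.3, 1.6, 2.9, 3.3, 1.5, 0.1⟩ (sC = BBCDBA) and q = ⟨2, 2, 1⟩, this bound stays below the exact time warping distance, as the lower-bounding theorem requires.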

Proposed Approach (10)
  • Computation Complexity
    • m is the number of data sequences.
    • L is the average length of data sequences.
    • The left expression is for index searching.
    • The right expression is for post-processing.
    • RP (≥ 1) is the reduction factor from branch-pruning.
    • RD (≥ 1) is the reduction factor from sharing distance tables.
    • n is the number of subsequences requiring post-processing.
Proposed Approach (11)
  • Sparse Indexing
    • The index size is linear to the number of suffixes stored.
    • To reduce the index size, we build a sparse suffix tree (SST).
    • That is, we store the suffix SC[i:-] only if SC[i] ≠ SC[i−1].
    • Compaction Ratio
    • Example
      • SC = A, A, A, A, C, B, B
      • store only three suffixes (SC[1:-], SC[5:-], and SC[6:-])
      • compaction ratio C = 7/3
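The stored-suffix rule can be sketched as follows (0-based indices; `stored_suffixes` is an illustrative name):

```python
def stored_suffixes(SC):
    """Indices i whose suffix SC[i:] is stored: i == 0 or SC[i] != SC[i-1]."""
    return [i for i in range(len(SC)) if i == 0 or SC[i] != SC[i - 1]]

idx = stored_suffixes("AAAACBB")
print(idx)  # -> [0, 4, 5]  (SC[1:-], SC[5:-], SC[6:-] in the deck's 1-based notation)
print(len("AAAACBB") / len(idx))  # compaction ratio C = 7/3
```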
Proposed Approach (12)
  • Sparse Indexing (2)
    • When traversing the suffix tree, we need to find non-stored suffixes and compute their distances to q.
    • Assume that k elements of sC have the same value.
    • Then, sC[1:-] is stored but sC[i:-] (i=2,3,…,k) is not stored.
    • For non-stored suffixes, we introduce another lower-bound distance function:

DTW-LB2 (sC[i:-], q) = DTW-LB(sC, q) − (i − 1) × DBASE-LB(sC[1], q[1])

    • DTW-LB2 satisfies the lower-bounding theorem.
    • DTW-LB2 is O(1) when DTW-LB(sC, q) is given.
Proposed Approach (13)
  • Sparse Indexing (3)
    • With sparse indexing, the complexity becomes:
      • m is the number of data sequences.
      • L is the average length of data sequences.
      • C is the compaction ratio.
      • n is the number of subsequences requiring post-processing.
      • RP (≥ 1) is the reduction factor from branch-pruning.
      • RD (≥ 1) is the reduction factor from sharing distance tables.
Performance Evaluation
  • Implementation
    • Implemented in C++ on a UNIX operating system
  • Experimental Setup
    • S&P 500 stock data set (m=545, L=232)
    • Random walk synthetic data set
    • Maximum-Entropy (ME) categorization
    • Disk-based suffix tree construction algorithm
    • SunSparc Ultra-5
Performance Evaluation (2)
  • Comparison with Naïve-Scan
    • increasing distance-tolerances
    • S&P 500 stock data set, |q|=20
Performance Evaluation (3)
  • Scalability Test
    • increasing average length of data sequences
    • random-walk data set, |q|=20,m=200
Performance Evaluation (4)
  • Scalability Test (2)
    • increasing total number of data sequences
    • random-walk data set, |q|=20, L=200
Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion
Introduction
  • We extend the proposed subsequence searching method to large sequence databases.
  • In the retrieval of similar subsequences with the time warping distance function,
    • Sequential scanning is O(mL²|q|).
    • The proposed method is O(mL²|q| / R) (R ≥ 1).
    • The quadratic dependence on L causes severe performance degradation when L is very large.
  • For a database with long sequences, we need a new searching scheme linear in L.
SBASS
  • We propose a new searching scheme: Segment-Based Subsequence Searching scheme (SBASS)
    • Sequences are divided into a series of piece-wise segments.
    • When a query sequence q with k segments is submitted, q is compared with those subsequences which consist of k consecutive data segments.
    • The lengths of segments may be different.
    • SS represents the segmented sequence of S.

S = 4,5,8,9,11,8,4,3 |S| = 8

SS = 4,5,8,9,11, 8,4,3 |SS| = 2

SBASS (2)
  • Only four subsequences of SS are compared with qS:

⟨SS[1],SS[2]⟩, ⟨SS[2],SS[3]⟩, ⟨SS[3],SS[4]⟩, ⟨SS[4],SS[5]⟩

[Figure: S divided into five segments SS[1]–SS[5]; the query qS has two segments, qS[1] and qS[2].]
SBASS (3)
  • For SBASS scheme, we define the piece-wise time warping distance function (where k = |qS| = |sS|).
  • Sequential scanning for the SBASS scheme is O(mL|q|).
  • We introduce an indexing technique with O(mL|q| / R) (R ≥ 1).
Sketch of Proposed Approach
  • Indexing
    • Convert sequences to segmented sequences.
    • Extract a feature vector from each segment.
    • Categorize feature vectors.
    • Convert segmented sequences to sequences of symbols.
    • Construct suffix tree from sequences of symbols.
  • Query Processing
    • Traverse the suffix tree to find candidates.
    • Discard false alarms in post processing.
Segmentation
  • Approach
    • Divide at peak points.
    • Divide further if maximum deviation from interpolation line is too large.
    • Eliminate noise.
  • Compaction Ratio (C) = |S| / |SS|

[Figure: a segment whose deviation from the interpolation line is too large is divided further; small noise segments are eliminated.]
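The peak-point step can be sketched as follows (a simplified version: segments end at interior local extrema; the further-division and noise-elimination refinements are omitted):

```python
def segment(S):
    """Split S into piece-wise segments, ending a segment at each peak/valley."""
    segs, start = [], 0
    for i in range(1, len(S) - 1):
        is_peak = S[i - 1] < S[i] > S[i + 1]
        is_valley = S[i - 1] > S[i] < S[i + 1]
        if is_peak or is_valley:
            segs.append(S[start:i + 1])  # the extremum closes the segment
            start = i + 1
    segs.append(S[start:])
    return segs

# The deck's example: |S| = 8 becomes |SS| = 2, compaction ratio C = 4.
print(segment([4, 5, 8, 9, 11, 8, 4, 3]))  # -> [[4, 5, 8, 9, 11], [8, 4, 3]]
```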

Feature Extraction
  • From each subsequence segment, extract a feature vector:

(V1, VL, L, D+, D−)

where V1 is the first value, VL the last value, L the length of the segment, and D+ and D− the maximum deviations above and below the interpolation line connecting V1 and VL.
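A sketch of the per-segment feature computation (assuming segments of length ≥ 2; `segment_features` and the D+/D− reading of the vector are inferred from the worked example on the next slide):

```python
def segment_features(seg):
    """(V1, VL, L, D+, D-): endpoints, length, max deviations from the line."""
    v1, vl, L = seg[0], seg[-1], len(seg)
    dev_plus = dev_minus = 0.0
    for i, v in enumerate(seg):
        line = v1 + (vl - v1) * i / (L - 1)  # interpolation line value at i
        dev_plus = max(dev_plus, v - line)   # deviation above the line
        dev_minus = max(dev_minus, line - v) # deviation below the line
    return (v1, vl, L, dev_plus, dev_minus)

# Matches the next slide's SF example:
print(segment_features([4, 5, 8, 8, 8, 8, 9, 11]))  # -> (4, 11, 8, 2.0, 1.0)
print(segment_features([8, 4, 3]))                  # -> (8, 3, 3, 0.0, 1.5)
```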

Categorization and Index Construction
  • Categorization
    • Group similar feature vectors together using multi-dimensional categorization methods like Multi-attribute Type Abstraction Hierarchy (MTAH).
    • Assign a unique symbol to each category.
    • Convert segmented sequences to sequences of symbols.

S = 4,5,8,8,8,8,9,11,8,4,3

SS = 4,5,8,8,8,8,9,11, 8,4,3

SF = (4,11,8,2,1), (8,3,3,0,1.5)

SC = A,B

  • From sequences of symbols, construct the suffix tree.
Query Processing
  • For query processing, we calculate lower-bound distances between symbols and keep them in a table.
  • Given the query sequence q and the distance tolerance ε,
    • Convert q to qS and then to qC.
    • Search the suffix tree to find those subsequences whose lower-bound distances to qC are within ε.
    • Discard false alarms in post processing.
Query Processing (2)

Index Searching

candidates

answers

q, 

qS

qC

Post Processing

suffix tree

data sequences

Computation Complexity
  • Sequential scanning is O(mL|q|).
  • Complexity of the proposed search algorithm is :
    • n is the number of subsequences contained in candidates.
    • C is the compaction ratio or the average number of elements in segments.
    • RD (≥ 1) is the reduction factor from sharing edges of the suffix tree.
Performance Evaluation
  • Test Set : Pseudo Periodic Synthetic Sequences
  • m = 100, L = 10,000
  • Achieved up to 6.5 times speed-up compared to sequential scanning.

60

50

40

SeqScan

30

time (sec)

20

Our Approach

10

0.2

0.4

0.6

0.8

1.0

distance tolerance

Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion
Introduction
  • So far, we assumed that elements have single-dimensional numeric values.
  • Now, we consider multi-dimensional sequences.
    • Image Sequences
    • Video Streams

[Figure: a medical image sequence]

Introduction (2)
  • In multi-dimensional sequences, elements are represented by feature vectors.

S = S[1], …, S[N], S[i] = (S[i][1], …, S[i][F])

  • Our proposed subsequence searching techniques are extended to the retrieval of similar multi-dimensional subsequences.
Introduction (3)
  • Multi-Dimensional Time Warping Distance

DMTW(S, Q) = DMBASE(S[1], Q[1]) + min { DMTW(S, Q[2:-]), DMTW(S[2:-], Q), DMTW(S[2:-], Q[2:-]) }

DMBASE(S[1], Q[1]) = Σ i=1..F  Wi · | S[1][i] − Q[1][i] |

    • F is the number of features in each element.
    • Wi is the weight of i-th dimension.
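The base distance can be sketched as follows (the recursion is unchanged from the one-dimensional DTW; the feature values and weights below are illustrative, taken from the KMeD query example later in the deck):

```python
def dmbase(e1, e2, weights):
    """Weighted L1 distance between two F-dimensional elements."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, e1, e2))

# e.g. tumor features (DistFromLV, Size, Circularity) with weights 0.6/0.3/0.1
print(dmbase((5.0, 2.0, 0.8), (4.0, 2.5, 0.6), (0.6, 0.3, 0.1)))  # ~0.77
```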
Sketch of Our Approach
  • Indexing
    • Categorize multi-dimensional element values using MTAH.
    • Assign unique symbols to categories.
    • Convert multi-dimensional sequences into sequences of symbols.
    • Construct suffix tree from a set of sequences of symbols.
  • Query Processing
    • Traverse suffix tree.
    • Find candidates whose lower-bound distances to q are within ε.
    • Do post processing to discard false alarms.
Application to KMeD
  • In KMeD, the proposed technique is applied to retrieve medical image sequences whose spatio-temporal characteristics are similar to those of the query sequence.
  • KMeD [CCT:95] has the following features:
    • Query by both image and alphanumeric contents
    • Model temporal, spatial and evolutionary nature of objects
    • Formulate queries using conceptual and imprecise terms
    • Support cooperative processing
Application to KMeD (2)
  • Query
    • Medical Image Sequence
    • Attribute names and their relative weights
    • Distance tolerance

Example attribute weights: DistFromLV (0.6), Size (0.3), Circularity (0.1)

Application to KMeD (3)

Query

User Model

Query Analysis

Contour Extraction

Feature Extraction

Distance Function

matching seq.

Visual Presentation

Similarity Searches

feedback

medical image seq.

index structure

Contents
  • Introduction
  • Whole Sequence Searches
  • Subsequence Searches
  • Segment-Based Subsequence Searches
  • Multi-Dimensional Subsequence Searches
  • Conclusion
Summary
  • A sequence is an ordered list of elements.
  • Similarity search helps in clustering and data mining.
  • For sequences of different lengths or different sampling rates, time warping distance is useful.
  • We proposed a whole sequence searching method using a spatial access method and a lower-bound distance function.
  • We proposed a subsequence searching method using a suffix tree and lower-bound distance functions.
  • We proposed a segment-based subsequence searching method for large sequence databases.
  • We extended the subsequence searching method to the retrieval of similar multi-dimensional subsequences.
Contribution
  • We proposed a tighter and faster lower-bound distance function for efficient whole sequence searches without false dismissals.
  • We demonstrated the feasibility of using the time warping similarity measure on a suffix tree.
  • We introduced the branch-pruning theorem and a fast lower-bound distance function for efficient subsequence searches without false dismissals.
  • We applied categorization and sparse indexing for scalability.
  • We applied the proposed techniques to a real application (KMeD).