Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding

Ahmet Sacan and I. Hakki Toroslu

email: [ahmet,toroslu]

Computer Engineering Department, Middle East Technical University, Ankara, TURKEY



Outline

  • Background

    • Sequence Alignment

    • Blast

  • Embedding Subsequences

    • Fastmap, LMDS

    • Analysis of parameters to achieve stable and accurate mapping

  • Indexing Subsequences

Sequence Similarity Search

  • Sequence similarity search is at the heart of bioinformatics research

    • Similarity information allows: structural, functional, and evolutionary inferences

Sequence Alignment

  • Goal: maximize “alignment score”

  • Score of aligning two residues:

    • Substitution matrix

  • Optimal solution: Dynamic Programming

    • Global: Needleman-Wunsch (1970)

    • Local: Smith-Waterman (1981)
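
As a rough illustration of the dynamic-programming recurrence behind local alignment, the sketch below computes a Smith-Waterman score with a linear gap penalty. The function names, the gap value, and the +2/-1 scoring are illustrative assumptions; a real search would plug in a substitution matrix and usually affine gap costs.

```python
def smith_waterman_score(a, b, score, gap=-2):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # DP row for the previous residue of a
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            h = max(
                0,                                        # start a new local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            )
            cur.append(h)
            best = max(best, h)
        prev = cur
    return best


def simple_score(x, y):
    # Hypothetical scoring: +2 for a match, -1 for a mismatch.
    return 2 if x == y else -1
```

The global (Needleman-Wunsch) variant differs mainly in dropping the 0 option and initializing the borders with accumulated gap penalties.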

Blast (Basic Local Alignment Search Tool)

  • Popular tool for similarity search in sequence databases

  • Generate “k-tuples” (“k-mers”, “words”) from query


    • CDE → ADE, CDC, CCE, CDE, …

  • Find (exact) matching k-tuples in the database

  • For each candidate sequence, extend the k-tuple match in both directions.
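
The seeding steps above can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: the function names, the toy alphabet, and the threshold-based neighborhood are assumptions, and the extension phase is left out.

```python
from collections import defaultdict
from itertools import product


def build_index(database, k):
    """Map every k-tuple in the database to its (sequence id, offset) occurrences."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for off in range(len(seq) - k + 1):
            index[seq[off:off + k]].append((seq_id, off))
    return index


def neighborhood(word, alphabet, score, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(x, y) for x, y in zip(word, cand)) >= threshold
    ]


def seed_hits(query, index, k, alphabet, score, threshold):
    """Exact index lookups for every neighborhood word of every query k-tuple."""
    hits = []
    for q_off in range(len(query) - k + 1):
        for word in neighborhood(query[q_off:q_off + k], alphabet, score, threshold):
            for seq_id, db_off in index.get(word, []):
                hits.append((q_off, seq_id, db_off))
    return hits
```

Enumerating the neighborhood costs σ^k score evaluations per query word, which is one reason this style of seeding is practical only for small k.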

Time-accuracy trade-off

Proteins (20^3 tuples)

DNA (4^11 tuples)

  • Small k:

    • Too many k-tuple hits to process

    • Slows down the extension phase

  • Large k:

    • Few or no k-tuple hits

    • Fast execution

    • Exact k-tuple matching is not sensitive

    • Too many false negatives

  • Challenge:

    • Allow flexible matching for larger words in reasonable time
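
The tuple counts on this slide are read here as σ^k, i.e. 20^3 for proteins and 4^11 for DNA; the word sizes 3 and 11 are an assumption based on that reading. A small sketch of how quickly the tuple space grows with k (the function name is illustrative):

```python
def tuple_space_size(alphabet_size, k):
    """Number of distinct k-tuples over an alphabet of the given size."""
    return alphabet_size ** k


protein_words = tuple_space_size(20, 3)    # 8,000 possible protein 3-tuples
dna_words = tuple_space_size(4, 11)        # 4,194,304 possible DNA 11-tuples
protein_6mers = tuple_space_size(20, 6)    # 64,000,000 possible protein 6-tuples
```

With so many possible words at larger k, two related sequences rarely share an identical k-tuple, which is why exact matching loses sensitivity as k grows.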

Raising the bar for k

  • Map k-tuples to a vector space

    • Mapping cannot be perfect, thus “approximate results”

  • Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples

Mapping k-tuples

  • Requirements:

    • Need to support out-of-sample extension

    • Speed

  • Candidate methods:

    • Fastmap (Faloutsos, 1995)

    • Landmark MDS (de Silva, 2003)



Fastmap

  • Select two pivots

    • Distant pivots heuristic

  • Obtain projection using the cosine law

  • Project objects onto a new hyperplane

  • Repeat
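
The steps above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' code; `fastmap`, `fastmap_project`, the random starting object, and the early-exit tolerance are assumptions made for the example.

```python
import math
import random


def fastmap(objects, dist, d, seed=0):
    """Embed objects into at most d dimensions using only pairwise distances."""
    rng = random.Random(seed)
    n = len(objects)
    coords = [[0.0] * d for _ in range(n)]
    pivots = []  # one (a, b, residual squared distance) triple per axis

    def d2(i, j, col):
        # Squared distance left after projecting out the first `col` axes.
        # Can turn negative when the original distance is non-Euclidean.
        val = dist(objects[i], objects[j]) ** 2
        return val - sum((coords[i][c] - coords[j][c]) ** 2 for c in range(col))

    for col in range(d):
        # Distant-pivots heuristic: start anywhere, jump to the farthest object twice.
        b = rng.randrange(n)
        a = max(range(n), key=lambda i: d2(i, b, col))
        b = max(range(n), key=lambda i: d2(a, i, col))
        dab2 = d2(a, b, col)
        if dab2 <= 1e-12:
            break  # nothing left to spread out along a new axis
        pivots.append((a, b, dab2))
        for i in range(n):
            # Cosine law: coordinate of object i along the line through a and b.
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * math.sqrt(dab2))
    return coords, pivots


def fastmap_project(query, objects, dist, coords, pivots):
    """Map a new object using only its distances to the stored pivots."""
    x = []
    for col, (a, b, dab2) in enumerate(pivots):
        def q2(p):
            val = dist(query, objects[p]) ** 2
            return val - sum((x[c] - coords[p][c]) ** 2 for c in range(col))
        x.append((q2(a) + dab2 - q2(b)) / (2 * math.sqrt(dab2)))
    return x
```

For Euclidean input the residual squared distances stay non-negative and the mapping is exact once d reaches the data's dimensionality; with non-Euclidean distances they can go negative, which is one source of instability.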



Fastmap

  • Fast! O(Nd)

    • N: number of data points

    • d: target dimensionality

  • For a query, only the distances to the pivots need to be computed

  • Unstable (esp. if the original space is non-Euclidean)

Landmark MDS

  • Select n landmarks (pivots)

  • Embed landmarks using classical MDS

  • For the remaining objects, apply distance-based triangulation based on distances to landmarks
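
A minimal sketch of these three steps, assuming the landmarks are already chosen. Classical MDS is done here with a simple power iteration so that the example needs only the standard library; the function names and the power-iteration eigensolver are assumptions for illustration, and a production version would use a linear algebra library.

```python
import math
import random


def _top_eigenpairs(B, d, iters=500, seed=0):
    """Largest-magnitude eigenpairs of symmetric B via power iteration + deflation."""
    n = len(B)
    B = [row[:] for row in B]
    rng = random.Random(seed)
    pairs = []
    for _ in range(d):
        v = [rng.gauss(0, 1) for _ in range(n)]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * B[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]  # deflate the component just found
    return pairs


def lmds_fit(D, d):
    """Classical MDS on the n x n landmark distance matrix D."""
    n = len(D)
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    mean_sq = [sum(row) / n for row in sq]  # mean squared distance per landmark
    total = sum(mean_sq) / n
    # Double centering turns squared distances into inner products.
    B = [[-0.5 * (sq[i][j] - mean_sq[i] - mean_sq[j] + total) for j in range(n)]
         for i in range(n)]
    pairs = _top_eigenpairs(B, d)
    if any(lam <= 1e-12 for lam, _ in pairs):
        raise ValueError("fewer than d positive eigenvalues; lower d")
    coords = [[math.sqrt(lam) * v[i] for lam, v in pairs] for i in range(n)]
    pinv = [[x / math.sqrt(lam) for x in v] for lam, v in pairs]  # d x n
    return coords, pinv, mean_sq


def lmds_embed(dists_to_landmarks, pinv, mean_sq):
    """Distance-based triangulation of one object from its landmark distances."""
    delta = [dd * dd - m for dd, m in zip(dists_to_landmarks, mean_sq)]
    return [-0.5 * sum(p * x for p, x in zip(row, delta)) for row in pinv]
```

Embedding a new object costs n distance computations plus a d x n matrix-vector product, so queries never require recomputing the landmark embedding.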

Landmark MDS

  • Provides stable results

  • Good selection of landmarks is critical.

    • LMDSrandom

    • LMDSmaxmin

      • Add the new landmark that maximizes the minimum distance to the already selected landmarks

    • LMDSfastmap

      • Use the same landmarks as found by Fastmap
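
The maxmin strategy can be sketched as a greedy loop. The function name and the choice of the first landmark are assumptions for illustration; the slide does not say how the first landmark is picked.

```python
def maxmin_landmarks(objects, dist, n_landmarks, first=0):
    """Greedy maxmin selection; returns the indices of the chosen landmarks."""
    chosen = [first]
    # min_dist[i]: distance from object i to its nearest chosen landmark
    min_dist = [dist(objects[first], o) for o in objects]
    while len(chosen) < min(n_landmarks, len(objects)):
        # Pick the object that is farthest from everything chosen so far.
        nxt = max(range(len(objects)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, o in enumerate(objects):
            min_dist[i] = min(min_dist[i], dist(objects[nxt], o))
    return chosen
```

Each new landmark requires one pass over the data, so selecting n landmarks from N objects takes O(nN) distance computations.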



  • Synthetic datasets

    • Randomly generate k-tuples for a given k and alphabet size σ

  • Real dataset

    • Yeast proteins benchmark (σ=20)

    • 6,341 proteins, 2.9 million residues

    • 103 query proteins, 38-884 residues

  • Weighted Hamming distance

  • CB-EUC substitution matrix (Sacan, 2007)
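
A small sketch of a weighted Hamming distance, assuming it is the sum of per-position substitution costs between two equal-length k-tuples. The CB-EUC values are not reproduced here; the identity cost matrix and both function names are stand-ins for illustration.

```python
def weighted_hamming(u, v, cost):
    """Sum of per-position substitution costs between equal-length words."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return sum(cost[a][b] for a, b in zip(u, v))


def identity_cost(alphabet):
    # Stand-in matrix: 0 for identical residues, 1 otherwise (plain Hamming distance).
    return {a: {b: 0.0 if a == b else 1.0 for b in alphabet} for a in alphabet}
```

Replacing `identity_cost` with a distance-like substitution matrix lets similar residues count as closer than dissimilar ones.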

Target dimensionality (d)

  • Sammon’s metric stress:

  • Breaking point dimensionality

k=5, synthetic dataset, identity matrix
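
The stress formula itself is not preserved in this transcript. The sketch below assumes the standard Sammon definition, in which squared distance errors are weighted by the inverse of the original distance and normalized by the sum of original distances; the authors may have used a variant. The function name and matrix-based interface are illustrative.

```python
def sammon_stress(orig, emb):
    """orig, emb: symmetric matrices of original and embedded pairwise distances."""
    n = len(orig)
    total = 0.0  # sum of original distances over pairs i < j
    err = 0.0    # sum of squared errors, each divided by the original distance
    for i in range(n):
        for j in range(i + 1, n):
            if orig[i][j] > 0:  # skip coincident objects to avoid dividing by zero
                total += orig[i][j]
                err += (orig[i][j] - emb[i][j]) ** 2 / orig[i][j]
    return err / total if total else 0.0
```

A stress of 0 means every pairwise distance is preserved; plotting stress against d is one way to look for the dimensionality beyond which extra dimensions stop helping.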

Subsequence length (k) and alphabet size (σ)

Number of landmarks

k=5, d=7, synthetic dataset, identity matrix

Approximate k-tuple search performance

  • Find all k-tuples within a specified radius from a query k-tuple

k=6, d=8, real dataset, CB-EUC matrix
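
A hedged sketch of how such a range query and its accuracy could be evaluated once k-tuples have embedded coordinates. A linear scan stands in for the spatial index (an R-tree or X-tree would answer the same query faster); the function names and the dictionary layout are assumptions for illustration.

```python
import math


def range_query(query_vec, vectors, radius):
    """k-tuples whose embedded point lies within `radius` of the query point."""
    return {w for w, v in vectors.items() if math.dist(query_vec, v) <= radius}


def recall(retrieved, relevant):
    """Fraction of the truly relevant k-tuples that the approximate query returned."""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0
```

Here `relevant` would come from an exact range query under the original distance, so a recall below 1 measures what the approximate mapping misses.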

Homology search

k=6, d=8, real dataset, CB-EUC matrix

Search time

search radius=7

Database size=100,000



Conclusions

  • Applied an embedding-based approach to approximate sequence similarity search for the first time

  • Significant time improvements with negligible degradation in accuracy

  • Achieved more stable embedding with combined pivot selection strategy

  • Defined intrinsic Euclidean dimensionality of the dataset
