Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding

Ahmet Sacan and I. Hakki Toroslu

email: [ahmet,toroslu]

Computer Engineering Department, Middle East Technical University, Ankara, TURKEY

Outline

  • Background

    • Sequence Alignment

    • Blast

  • Embedding Subsequences

    • Fastmap, LMDS

    • Analysis of parameters to achieve stable and accurate mapping

  • Indexing Subsequences

Sequence Similarity Search

  • Sequence similarity search is at the heart of bioinformatics research

    • Similarity information allows: structural, functional, and evolutionary inferences

Sequence Alignment

  • Goal: maximize “alignment score”

  • Score of aligning two residues:

    • Substitution matrix

  • Optimal solution: Dynamic Programming

    • Global: Needleman-Wunsch (1970)

    • Local: Smith-Waterman (1981)
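
As a rough illustration of the dynamic-programming idea, the sketch below computes a Smith-Waterman local alignment score. The fixed match/mismatch scores and the linear gap penalty are illustrative assumptions standing in for a real substitution matrix; they are not taken from the slides.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Toy residue score; a substitution matrix lookup would go here.
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Removing the floor at 0 and initializing the borders with cumulative gap penalties would turn this into the global (Needleman-Wunsch) variant.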

Blast (Basic Local Alignment Search Tool)

  • Popular tool for similarity search in sequence databases

  • Generate “k-tuples” (“k-mers”, “words”) from query


    • CDE  ADE,CDC,CCE, CDE, …

  • Find (exact) matching k-tuples in the database

  • For each candidate sequence, extend the k-tuple match in both directions.
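
The seeding steps above can be sketched as follows. This is a minimal illustration of exact k-tuple lookup only: it omits the generation of neighboring words and the extension phase, and the function names (`kmers`, `build_index`, `seed_hits`) are hypothetical.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield (offset, k-tuple) for every overlapping window of length k."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]

def build_index(database, k):
    """Map each k-tuple to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset, word in kmers(seq, k):
            index[word].append((seq_id, offset))
    return index

def seed_hits(query, index, k):
    """Exact k-tuple matches between the query and the indexed database."""
    hits = []
    for q_off, word in kmers(query, k):
        for seq_id, db_off in index.get(word, []):
            hits.append((q_off, seq_id, db_off))
    return hits
```

Each hit (query offset, sequence id, database offset) is a seed that the extension phase would then grow in both directions.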

Time-accuracy trade-off

  • Proteins (20^3 tuples)

  • DNA (4^11 tuples)

  • Small k:

    • Too many k-tuple hits to process

    • Slows down the extension phase

  • Large k:

    • Few or no k-tuple hits

    • Fast execution

    • Exact k-tuple matching is not sensitive

    • Too many false negatives

  • Challenge:

    • Allow flexible matching for larger words in reasonable time

Raising the bar for k

  • Map k-tuples to a vector space

    • Mapping cannot be perfect, thus “approximate results”

  • Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples

Mapping k-tuples

  • Requirements:

    • Must support out-of-sample extension (mapping a new query k-tuple without recomputing the embedding)

    • Speed

  • Candidate methods:

    • Fastmap (Faloutsos, 1995)

    • Landmark MDS (de Silva, 2003)

Fastmap

  • Select two pivots

    • Distant pivots heuristic

  • Obtain projection using cosine law

  • Project objects to new hyperplane

  • Repeat
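
The steps above can be sketched in a few lines. This is a minimal reading of Fastmap, not the authors' implementation: starting the pivot search from object 0 and clamping negative residual distances to zero are assumptions made here for simplicity.

```python
import math

def fastmap(objects, dist, d):
    """Embed objects into d dimensions; returns one coordinate list per object."""
    n = len(objects)
    coords = [[0.0] * d for _ in range(n)]

    def residual_sq(i, j, dim):
        # Squared distance left after removing the first `dim` coordinates.
        sq = dist(objects[i], objects[j]) ** 2
        for c in range(dim):
            sq -= (coords[i][c] - coords[j][c]) ** 2
        return max(sq, 0.0)  # clamp: non-Euclidean input can go negative

    for dim in range(d):
        # Distant-pivots heuristic: farthest from object 0, then farthest from that.
        b = max(range(n), key=lambda i: residual_sq(0, i, dim))
        a = max(range(n), key=lambda i: residual_sq(b, i, dim))
        d_ab_sq = residual_sq(a, b, dim)
        if d_ab_sq == 0.0:
            break  # all remaining distances are zero
        d_ab = math.sqrt(d_ab_sq)
        for i in range(n):
            # Cosine law: projection of object i onto the line through a and b.
            coords[i][dim] = (residual_sq(a, i, dim) + d_ab_sq
                              - residual_sq(b, i, dim)) / (2 * d_ab)
    return coords
```

For points that really are Euclidean, the embedded pairwise distances reproduce the originals once d reaches the true dimensionality; for other distances the clamp hides information, which is one source of instability.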

Fastmap

  • Fast! O(Nd)

    • N: number of data points

    • d: target dimensionality

  • For a query, only the distances to the pivots need to be computed

  • Unstable (esp. if original space is non-Euclidean)

Landmark MDS

  • Select n landmarks (pivots)

  • Embed landmarks using classical MDS

  • For the remaining objects, apply distance-based triangulation based on distances to landmarks
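
A compact sketch of these three steps, assuming the landmark distances are (close to) Euclidean. To stay dependency-free it finds eigenvectors by power iteration with deflation, which is a simplification chosen here; a linear-algebra library would normally do this step. The names `lmds_fit` and `lmds_embed` are hypothetical.

```python
import math

def _top_eigenpairs(B, d, iters=500):
    """Top-d eigenpairs of a symmetric matrix by power iteration with deflation."""
    n = len(B)
    B = [row[:] for row in B]
    pairs = []
    for _ in range(d):
        v = [1.0 + i for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return pairs

def lmds_fit(landmark_sq_dists, d):
    """Classical MDS on the landmarks from their squared pairwise distances."""
    D = landmark_sq_dists
    n = len(D)
    row_mean = [sum(r) / n for r in D]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * H * D * H
    B = [[-0.5 * (D[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    # Keep only positive eigenvalues (assumption: drop non-Euclidean directions).
    pairs = [(lam, v) for lam, v in _top_eigenpairs(B, d) if lam > 1e-9]
    coords = [[math.sqrt(lam) * v[i] for lam, v in pairs] for i in range(n)]
    return coords, (pairs, row_mean)

def lmds_embed(sq_dists_to_landmarks, model):
    """Triangulate a new object from its squared distances to the landmarks."""
    pairs, mean_sq = model
    return [-0.5 * sum(v[i] * (sq_dists_to_landmarks[i] - mean_sq[i])
                       for i in range(len(v))) / math.sqrt(lam)
            for lam, v in pairs]
```

Only `lmds_embed` runs at query time, so mapping a new k-tuple costs one distance computation per landmark plus a small matrix-vector product.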

Landmark MDS

  • Provides stable results

  • Good selection of landmarks is critical.

    • LMDS_random

    • LMDS_maxmin

      • Add the new landmark that maximizes the minimum distance to the already selected landmarks

    • LMDS_fastmap

      • Use the same landmarks as found by Fastmap
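
The maxmin strategy can be sketched as a greedy loop. How the first landmark is seeded is not stated on the slide; the `first` parameter below is an assumption.

```python
def maxmin_landmarks(objects, dist, n, first=0):
    """Greedy max-min selection: each new landmark maximizes its minimum
    distance to the landmarks already chosen. Returns indices into objects."""
    chosen = [first]
    # min_dist[i] = distance from object i to its nearest chosen landmark
    min_dist = [dist(objects[first], o) for o in objects]
    while len(chosen) < min(n, len(objects)):
        nxt = max(range(len(objects)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, o in enumerate(objects):
            min_dist[i] = min(min_dist[i], dist(objects[nxt], o))
    return chosen
```

Keeping the running `min_dist` array means each added landmark costs one pass over the data rather than a pass per existing landmark.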

Evaluation

  • Synthetic datasets

    • Randomly generate k-tuples for a given k and alphabet size σ

  • Real dataset

    • Yeast proteins benchmark (σ=20)

    • 6,341 proteins, 2.9 million residues

    • 103 query proteins, 38-884 residues

  • Weighted Hamming distance

  • CB-EUC substitution matrix (Sacan, 2007)
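
One plausible reading of the weighted Hamming distance is the sum of per-position residue distances taken from a substitution matrix. The sketch below follows that reading; the matrix values used in the usage example are made up for illustration and are not CB-EUC entries.

```python
def weighted_hamming(s, t, sub_dist=None):
    """Sum of per-position residue distances between two equal-length k-tuples.

    sub_dist maps (residue, residue) pairs to a distance. When omitted, an
    identity matrix is assumed (0 for equal residues, 1 otherwise), which
    gives the plain Hamming distance.
    """
    if len(s) != len(t):
        raise ValueError("k-tuples must have the same length")
    if sub_dist is None:
        return sum(1 for a, b in zip(s, t) if a != b)
    return sum(sub_dist[a, b] for a, b in zip(s, t))

# Hypothetical toy matrix over a two-letter alphabet.
toy = {("A", "A"): 0.0, ("C", "C"): 0.0, ("A", "C"): 0.5, ("C", "A"): 0.5}
print(weighted_hamming("CDE", "ADE"))       # identity matrix
print(weighted_hamming("AAC", "CAA", toy))  # toy weighted matrix
```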

Target dimensionality (d)

  • Sammon’s metric stress:

  • Breaking point dimensionality

k=5, synthetic dataset, identity matrix
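
For reference, a sketch of Sammon's stress in its standard textbook form: the squared differences between original and embedded distances, each divided by the original distance, normalized by the sum of original distances. Whether the slide used exactly this normalization is an assumption.

```python
def sammon_stress(orig, emb):
    """Sammon's stress between original and embedded pairwise distances.

    orig and emb are square matrices (lists of lists) of pairwise distances.
    Pairs with zero original distance are skipped to avoid division by zero.
    """
    total = 0.0
    err = 0.0
    n = len(orig)
    for i in range(n):
        for j in range(i + 1, n):
            if orig[i][j] > 0:
                total += orig[i][j]
                err += (orig[i][j] - emb[i][j]) ** 2 / orig[i][j]
    return err / total if total else 0.0
```

A stress of 0 means every pairwise distance is preserved; plotting the stress against d is one way to look for the dimensionality at which it stops improving.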

Subsequence length (k) and alphabet size (σ)

Number of landmarks

k=5, d=7, synthetic dataset, identity matrix

Approximate k-tuple search performance

  • Find all k-tuples within a specified radius from a query k-tuple

k=6, d=8, real dataset, CB-EUC matrix
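
A range query over embedded k-tuples might look like the sketch below. Two assumptions are made: a linear scan stands in for the spatial index, and candidates are re-checked with the original distance, a refinement step the slides do not describe. Because the embedding is not exact, true neighbors whose embedded distance exceeds the radius are missed, which is why results are approximate.

```python
import math

def range_search(query_vec, query_word, indexed, radius, true_dist):
    """Approximate range query: filter in the embedded space, refine in the original.

    indexed is a list of (vector, k-tuple) pairs. The linear scan is a
    placeholder for a spatial access method such as an R-tree.
    """
    candidates = [w for v, w in indexed if math.dist(query_vec, v) <= radius]
    return [w for w in candidates if true_dist(query_word, w) <= radius]
```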

Homology search

k=6, d=8, real dataset, CB-EUC matrix

Search time

search radius=7

Database size=100,000

Conclusion

  • Applied an embedding-based approach to approximate sequence similarity search for the first time

  • Significant time improvements with negligible degradation in accuracy

  • Achieved more stable embedding with combined pivot selection strategy

  • Defined intrinsic Euclidean dimensionality of the dataset