Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding

Ahmet Sacan and I. Hakki Toroslu

email: [ahmet,toroslu]@ceng.metu.edu.tr

Computer Engineering Department, Middle East Technical University, Ankara, TURKEY


Outline

  • Background

    • Sequence Alignment

    • Blast

  • Embedding Subsequences

    • Fastmap, LMDS

    • Analysis of parameters to achieve stable and accurate mapping

  • Indexing Subsequences


Sequence Similarity Search

  • Sequence similarity search is at the heart of bioinformatics research

    • Similarity information allows: structural, functional, and evolutionary inferences


Sequence Alignment

  • Goal: maximize “alignment score”

  • Score of aligning two residues:

    • Substitution matrix

  • Optimal solution: Dynamic Programming

    • Global: Needleman-Wunsch (1970)

    • Local: Smith-Waterman (1981)
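
The local dynamic-programming recurrence can be sketched as follows. This is a minimal illustration, not the presenters' code: the linear gap penalty `gap` and the toy scoring function `match_score` (standing in for a real substitution matrix) are assumptions introduced here.

```python
def smith_waterman(a, b, score, gap=-2):
    """Return the best local alignment score between sequences a and b.

    score(x, y) plays the role of the substitution matrix; gap is a linear
    gap penalty (an assumption, the slides do not specify one).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                0,                                        # start a new local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def match_score(x, y):
    # Toy substitution scores (hypothetical): +2 for a match, -1 for a mismatch.
    return 2 if x == y else -1
```

Dropping the `0` term and initialising the first row and column with cumulative gap penalties would turn the same loop into the global (Needleman-Wunsch) variant.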


Blast (Basic Local Alignment Search Tool)

  • Popular tool for similarity search in sequence databases

  • Generate “k-tuples” (“k-mers”, “words”) from query

    • CDEFG → CDE, DEF, EFG

    • CDE → ADE, CDC, CCE, CDE, …

  • Find (exact) matching k-tuples in the database

  • For each candidate sequence, extend the k-tuple match in both directions.
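
The first two steps (splitting the query into k-tuples and expanding each one into similar words) might look like the sketch below. The brute-force enumeration, the score threshold, and the toy `match_score` function are assumptions for illustration; a production tool generates the neighborhood far more efficiently and scores it with a real substitution matrix.

```python
from itertools import product


def kmers(seq, k):
    """All overlapping k-tuples of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighborhood(word, alphabet, score, threshold):
    """Words of the same length whose summed per-position score against
    `word` is at least `threshold` (brute force over alphabet**len(word))."""
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(x, y) for x, y in zip(word, cand)) >= threshold
    ]


def match_score(x, y):
    # Toy substitution scores (hypothetical): +2 for a match, -1 for a mismatch.
    return 2 if x == y else -1
```

With this toy scoring, `neighborhood("CDE", "ACDE", match_score, 3)` keeps exactly the words that differ from CDE in at most one position, which includes ADE, CDC, CCE and CDE itself.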


Time-accuracy trade-off

[Figure: k axis from 1 to 11, marking Proteins (20^3 tuples) and DNA (4^11 tuples)]

  • Small k:

    • Too many k-tuple hits to process

    • Slows down the extension phase

  • Large k:

    • Few or no k-tuple hits

    • Fast execution

    • Exact k-tuple matching not sensitive

    • Too many false negatives

  • Challenge:

    • Allow flexible matching for larger words at reasonable time


Raising the bar for k

  • Map k-tuples to a vector space

    • Mapping cannot be perfect, thus “approximate results”

  • Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples


Mapping k-tuples

  • Requirements:

    • Need to support out-of-sample extension

    • Speed

  • Candidate methods:

    • Fastmap (Faloutsos, 1995)

    • Landmark MDS (de Silva, 2003)


Fastmap

  • Select two pivots

    • Distant pivots heuristic

  • Obtain projection using cosine law

  • Project objects to new hyperplane

  • Repeat
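
The loop above can be sketched in a few lines. This is an illustrative reading of the published Fastmap procedure rather than the presenters' implementation; the function name `fastmap`, the choice of object 0 as the starting point of the pivot heuristic, and the clamping of negative residual distances are assumptions made here.

```python
import math


def fastmap(objects, dist, d):
    """Embed `objects` into d dimensions using only the distance function."""
    n = len(objects)
    coords = [[0.0] * d for _ in range(n)]
    pivots = []

    def dist2(i, j, h):
        # Squared distance on the hyperplane left after the first h projections.
        s = dist(objects[i], objects[j]) ** 2
        for t in range(h):
            s -= (coords[i][t] - coords[j][t]) ** 2
        return max(s, 0.0)  # clamp: can turn negative for non-Euclidean input

    for h in range(d):
        # Distant pivots heuristic: farthest from an arbitrary object,
        # then farthest from that one.
        b = max(range(n), key=lambda j: dist2(0, j, h))
        a = max(range(n), key=lambda j: dist2(b, j, h))
        dab2 = dist2(a, b, h)
        pivots.append((a, b))
        if dab2 == 0.0:
            break  # nothing left to project
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine law: position of object i along the line through a and b.
            coords[i][h] = (dist2(a, i, h) + dab2 - dist2(b, i, h)) / (2 * dab)
    return coords, pivots
```

For Euclidean input whose intrinsic dimension is at most `d`, the embedded pairwise distances match the originals; the clamp in `dist2` is where non-Euclidean distances start to distort the result.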


Fastmap

  • Fast! O(Nd)

    • N: number of data points

    • d is the target dimensionality

  • For a query, only the distances to the set of pivots need to be calculated

  • Unstable (esp. if original space is non-Euclidean)


Landmark MDS

  • Select n landmarks (pivots)

  • Embed landmarks using classical MDS

  • For the remaining objects, apply distance-based triangulation based on distances to landmarks
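
A dependency-free sketch of these three steps follows. It assumes the usual Landmark MDS formulation (double-centred squared distances, top eigenpairs, triangulation through the pseudo-inverse); the names `lmds_fit` and `lmds_embed` are hypothetical, and the small Jacobi solver only stands in for a library eigensolver.

```python
import math


def jacobi_eigh(A, sweeps=50):
    """Eigenvalues and eigenvectors (as columns) of a small symmetric matrix,
    computed with cyclic Jacobi rotations."""
    n = len(A)
    A = [row[:] for row in A]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(A[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off < 1e-24:
            break
        for p in range(n):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                diff = A[q][q] - A[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2 * A[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for M in (A, V):  # rotate columns p and q: M <- M G
                    for k in range(n):
                        mp, mq = M[k][p], M[k][q]
                        M[k][p], M[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):  # rotate rows p and q: A <- G^T A
                    ap, aq = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * ap - s * aq, s * ap + c * aq
    return [A[i][i] for i in range(n)], V


def lmds_fit(landmarks, dist, d):
    """Embed the landmarks with classical MDS; return what triangulation needs."""
    n = len(landmarks)
    sq = [[dist(a, b) ** 2 for b in landmarks] for a in landmarks]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double centring of the squared-distance matrix.
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    vals, vecs = jacobi_eigh(B)
    order = sorted(range(n), key=lambda i: -vals[i])[:d]
    order = [i for i in order if vals[i] > 1e-9]  # keep positive eigenvalues only
    coords = [[vecs[k][i] * math.sqrt(vals[i]) for i in order] for k in range(n)]
    pinv = [[vecs[k][i] / math.sqrt(vals[i]) for k in range(n)] for i in order]
    return coords, pinv, row_mean


def lmds_embed(obj, landmarks, dist, pinv, row_mean):
    """Distance-based triangulation of one object from its landmark distances."""
    delta = [dist(obj, l) ** 2 for l in landmarks]
    return [-0.5 * sum(p * (dl - m) for p, dl, m in zip(row, delta, row_mean))
            for row in pinv]
```

Only `lmds_embed` runs at query time, so a new k-tuple costs one distance computation per landmark plus a small matrix-vector product.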


Landmark MDS

  • Provides stable results

  • Good selection of landmarks is critical.

    • LMDSrandom

    • LMDSmaxmin

      • Add the new landmark that maximizes the minimum distance to the already selected landmarks

    • LMDSfastmap

      • Use the same landmarks as found by Fastmap
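
The maxmin strategy is a greedy farthest-point selection and could be sketched as below. The function name `maxmin_landmarks` and the `first` seed parameter are assumptions; the slides do not say how the first landmark is chosen.

```python
def maxmin_landmarks(objects, dist, n, first=0):
    """Greedily pick n landmark indices, each maximizing its minimum
    distance to the landmarks chosen so far."""
    n = min(n, len(objects))
    chosen = [first]
    # min_d[i]: distance from object i to its nearest chosen landmark.
    min_d = [dist(objects[first], o) for o in objects]
    while len(chosen) < n:
        nxt = max(range(len(objects)), key=lambda i: min_d[i])
        chosen.append(nxt)
        for i, o in enumerate(objects):
            min_d[i] = min(min_d[i], dist(objects[nxt], o))
    return chosen
```

Keeping the running `min_d` array means each new landmark costs one pass over the data instead of a full recomputation.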


Evaluation

  • Synthetic datasets

    • Randomly generate k-tuples for a given k and alphabet size σ

  • Real dataset

    • Yeast proteins benchmark (σ=20)

    • 6,341 proteins, 2.9 million residues

    • 103 query proteins, 38-884 residues

  • Weighted Hamming distance

  • CB-EUC substitution matrix (Sacan, 2007)
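
As a rough illustration of the synthetic setup, the sketch below generates random k-tuples and compares them with a weighted Hamming distance, read here as the sum of per-position substitution costs. That reading is an assumption, as are the function names; the CB-EUC matrix values are not part of this transcript, so the identity cost (0 for a match, 1 otherwise) is used instead.

```python
import random


def random_ktuples(count, k, alphabet, seed=0):
    """Randomly generated k-tuples over `alphabet` (seeded for repeatability)."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alphabet) for _ in range(k)) for _ in range(count)]


def weighted_hamming(u, v, cost):
    """Sum of per-position substitution costs between equal-length tuples."""
    if len(u) != len(v):
        raise ValueError("k-tuples must have the same length")
    return sum(cost(x, y) for x, y in zip(u, v))


def identity_cost(x, y):
    # Identity substitution matrix: plain (unweighted) Hamming distance.
    return 0.0 if x == y else 1.0
```

Swapping `identity_cost` for a lookup into a real substitution distance matrix changes the weights without touching the rest of the pipeline.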


Target dimensionality (d)

  • Sammon’s metric stress:

  • Breaking point dimensionality

k=5, synthetic dataset, identity matrix
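
The formula on this slide did not survive in the transcript. For reference, the standard definition of Sammon's stress is given below, with $d_{ij}$ the original distance between objects $i$ and $j$ and $\hat{d}_{ij}$ their distance after embedding; the presentation may have used a variant of it.

```latex
E = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\left(d_{ij} - \hat{d}_{ij}\right)^{2}}{d_{ij}}
```

A stress of zero means every pairwise distance is preserved exactly; the value grows as the embedding distorts distances, with errors on small distances weighted more heavily.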


Subsequence length (k) and alphabet size (σ)


Number of landmarks

k=5, d=7, synthetic dataset, identity matrix


Approximate k-tuple search performance

  • Find all k-tuples within a specified radius from a query k-tuple

k=6, d=8, real dataset, CB-EUC matrix
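
A range query of this kind can be sketched as a filter step in the embedded space followed by an optional check against the true distance. The linear scan below only stands in for a spatial index such as an R-tree, and the function name and the `(vector, k-tuple)` layout of `index` are assumptions.

```python
import math


def range_search(index, query_vec, radius, refine=None):
    """Return the k-tuples whose embedded vectors lie within `radius` of
    `query_vec`. `index` is a list of (vector, ktuple) pairs; `refine`, if
    given, maps a k-tuple to its true distance from the query."""
    hits = []
    for vec, ktuple in index:
        embedded = math.sqrt(sum((x - y) ** 2 for x, y in zip(query_vec, vec)))
        if embedded > radius:
            continue  # filtered out in the embedded space
        if refine is None or refine(ktuple) <= radius:
            hits.append(ktuple)
    return hits
```

Because the mapping is not perfect, the filter step can both miss true neighbors and admit false ones; the refine step removes only the latter.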


Homology search

k=6, d=8, real dataset, CB-EUC matrix


Search time

search radius=7

Database size=100,000


Conclusion

  • Applied an embedding-based approach to approximate sequence similarity search for the first time

  • Significant time improvements with negligible degradation in accuracy

  • Achieved more stable embedding with combined pivot selection strategy

  • Defined intrinsic Euclidean dimensionality of the dataset

