Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding

Ahmet Sacan and I. Hakki Toroslu

email: [ahmet,toroslu]@ceng.metu.edu.tr

Computer Engineering Department, Middle East Technical University, Ankara, TURKEY


Outline

  • Background

    • Sequence Alignment

    • Blast

  • Embedding Subsequences

    • Fastmap, LMDS

    • Analysis of parameters to achieve stable and accurate mapping

  • Indexing Subsequences


Sequence Similarity Search

  • Sequence similarity search is at the heart of bioinformatics research

    • Similarity information allows structural, functional, and evolutionary inferences


Sequence Alignment

  • Goal: maximize “alignment score”

  • Score of aligning two residues:

    • Substitution matrix

  • Optimal solution: Dynamic Programming

    • Global: Needleman-Wunsch (1970)

    • Local: Smith-Waterman (1981)
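The dynamic-programming recurrences behind both alignment variants can be sketched in a few lines. This is a minimal illustration, not the presenters' code: the function name `alignment_score`, the toy scoring function `simple_score` (+1 match, -1 mismatch), and the linear gap penalty are all assumptions made for the example. A real substitution matrix would replace `simple_score`.

```python
def alignment_score(a, b, score, gap=-1, local=False):
    """Optimal alignment score by dynamic programming.

    score(x, y) plays the role of the substitution matrix. The linear gap
    penalty is an assumption; the slides do not specify a gap model.
    local=False gives a Needleman-Wunsch style global score,
    local=True gives a Smith-Waterman style local score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            h = max(
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                H[i - 1][j] + gap,                            # gap in b
                H[i][j - 1] + gap,                            # gap in a
            )
            if local:
                h = max(h, 0)        # a local alignment may restart anywhere
                best = max(best, h)  # and may end anywhere
            H[i][j] = h
    return best if local else H[-1][-1]


def simple_score(x, y):
    """Toy stand-in for a substitution matrix (hypothetical values)."""
    return 1 if x == y else -1
```

Both variants fill an (len(a)+1) x (len(b)+1) table, which is why exhaustive alignment against a whole database is costly and heuristic tools are used instead.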


Blast (Basic Local Alignment Search Tool)

  • Popular tool for similarity search in sequence databases

  • Generate “k-tuples” (“k-mers”, “words”) from query

    • CDEFG → CDE, DEF, EFG

    • CDE → ADE, CDC, CCE, CDE, …

  • Find (exact) matching k-tuples in the database

  • For each candidate sequence, extend the k-tuple match in both directions.
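The seeding steps listed above can be sketched as follows. This is a simplified illustration and not BLAST's actual implementation: the helper names, the toy scoring function, and the neighborhood threshold are assumptions chosen for the example, and the extension phase is left out.

```python
from itertools import product


def k_tuples(seq, k):
    """All overlapping words of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_score(x, y):
    """Hypothetical residue score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def neighborhood(word, alphabet, score, threshold):
    """Words of the same length scoring at least `threshold` against `word`.

    Enumerates all len(alphabet)**k candidates, so it is only practical
    for small k.
    """
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(x, y) for x, y in zip(word, cand)) >= threshold
    ]


def build_index(database, k):
    """Map every k-tuple in the database to its (sequence id, offset) list."""
    index = {}
    for seq_id, seq in enumerate(database):
        for offset, word in enumerate(k_tuples(seq, k)):
            index.setdefault(word, []).append((seq_id, offset))
    return index


def seed_hits(query, index, k, alphabet, score, threshold):
    """Exact lookups of each neighborhood word; returns (q_off, seq_id, db_off)."""
    hits = []
    for q_off, word in enumerate(k_tuples(query, k)):
        for neighbor in neighborhood(word, alphabet, score, threshold):
            for seq_id, db_off in index.get(neighbor, []):
                hits.append((q_off, seq_id, db_off))
    return hits
```

With the four-letter toy alphabet "ACDE" and a threshold of 1, the neighborhood of CDE contains CDE itself plus every word that differs from it in one position, including ADE, CDC, and CCE from the slide. Each returned hit would then be passed to the extension step.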


Time-accuracy trade-off

  • Proteins: 20^3 tuples (k=3)

  • DNA: 4^11 tuples (k=11)

  • Small k:

    • Too many k-tuple hits to process

    • Slows down the extension phase

  • Large k:

    • Few or no k-tuple hits

    • Fast execution

    • Exact k-tuple matching is not sensitive

    • Too many false negatives

  • Challenge:

    • Allow flexible matching for larger words in reasonable time


Raising the bar for k

  • Map k-tuples to a vector space

    • Mapping cannot be perfect, thus “approximate results”

  • Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples


Mapping k-tuples

  • Requirements:

    • Support for out-of-sample extension

    • Speed

  • Candidate methods:

    • Fastmap (Faloutsos, 1995)

    • Landmark MDS (de Silva, 2003)


Fastmap

  • Select two pivots

    • Distant pivots heuristic

  • Obtain projection using the cosine law

  • Project objects to a new hyperplane

  • Repeat


Fastmap

  • Fast! O(Nd)

    • N: number of data points

    • d: target dimensionality

  • For a query, only the distances to the set of pivots need to be calculated

  • Unstable (esp. if original space is non-Euclidean)
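The Fastmap procedure described on these two slides can be sketched as below. This is a minimal illustration under stated assumptions: the function names are made up for the example, the pivot search starts from object 0 rather than a random object so that the result is deterministic, and negative residual distances (which can occur for non-Euclidean inputs) are clamped to zero.

```python
import math


def fastmap(objects, dist, d):
    """Embed objects into d dimensions; returns (coords, pivots)."""
    n = len(objects)
    coords = [[0.0] * d for _ in range(n)]
    pivots = []  # one (a, b, residual distance between a and b) per axis

    def residual_sq(i, j, axis):
        # Squared distance left after removing the first `axis` coordinates,
        # i.e. the distance in the hyperplane orthogonal to those axes.
        sq = dist(objects[i], objects[j]) ** 2
        for t in range(axis):
            sq -= (coords[i][t] - coords[j][t]) ** 2
        return max(sq, 0.0)  # clamp: can go negative for non-Euclidean input

    for axis in range(d):
        # Distant-pivots heuristic: farthest from a start object, then
        # farthest from that one.
        b = 0
        a = max(range(n), key=lambda i: residual_sq(b, i, axis))
        b = max(range(n), key=lambda i: residual_sq(a, i, axis))
        dab_sq = residual_sq(a, b, axis)
        if dab_sq == 0.0:
            break  # all residual distances are zero; remaining axes stay 0
        dab = math.sqrt(dab_sq)
        pivots.append((a, b, dab))
        for i in range(n):
            # Cosine law: projection of object i onto the line through a and b.
            coords[i][axis] = (
                residual_sq(a, i, axis) + dab_sq - residual_sq(b, i, axis)
            ) / (2.0 * dab)
    return coords, pivots


def project_query(q, objects, dist, coords, pivots):
    """Map a new object using only its distances to the stored pivots."""
    x = []

    def residual_sq(i):
        sq = dist(q, objects[i]) ** 2
        for t in range(len(x)):
            sq -= (x[t] - coords[i][t]) ** 2
        return max(sq, 0.0)

    for a, b, dab in pivots:
        x.append((residual_sq(a) + dab * dab - residual_sq(b)) / (2.0 * dab))
    return x
```

For points that are already Euclidean, such as the corners of a 3 x 4 rectangle with `math.dist` as the distance, a two-dimensional embedding reproduces the pairwise distances. The clamping step is where information is lost for non-Euclidean distances, which is one way to see the instability noted above.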


Landmark MDS

  • Select n landmarks (pivots)

  • Embed landmarks using classical MDS

  • For the remaining objects, apply distance-based triangulation based on distances to landmarks
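The three steps above can be sketched with NumPy. This follows the standard Landmark MDS formulation (classical MDS on the landmarks, then a linear triangulation map for every other object); the function names `lmds_fit` and `lmds_embed` are assumptions for this example, and landmark selection is taken as given.

```python
import numpy as np


def lmds_fit(landmark_dists, d):
    """Classical MDS on the n x n landmark distance matrix.

    Returns (coords, pinv, mean_sq): the n x d landmark coordinates, the
    d x n triangulation matrix, and the mean squared distance to each
    landmark, the last two being all that is needed to embed new objects.
    """
    sq = np.asarray(landmark_dists, dtype=float) ** 2
    n = sq.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq @ centering      # double-centered matrix
    evals, evecs = np.linalg.eigh(gram)           # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:d]             # indices of the d largest
    lam, vecs = evals[top], evecs[:, top]
    if np.any(lam <= 1e-12):
        raise ValueError("fewer than d positive eigenvalues; lower d")
    coords = vecs * np.sqrt(lam)                  # landmark embedding
    pinv = (vecs / np.sqrt(lam)).T                # pseudoinverse used below
    return coords, pinv, sq.mean(axis=0)


def lmds_embed(dists_to_landmarks, pinv, mean_sq):
    """Distance-based triangulation of one object from its landmark distances."""
    delta = np.asarray(dists_to_landmarks, dtype=float) ** 2
    return -0.5 * pinv @ (delta - mean_sq)
```

Embedding a new object (including a query) costs n distance computations plus one d x n matrix-vector product. Non-positive eigenvalues signal that the landmark distances cannot be embedded in d Euclidean dimensions, so the sketch raises an error rather than returning distorted coordinates.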


Landmark MDS

  • Provides stable results

  • Good selection of landmarks is critical.

    • LMDSrandom

    • LMDSmaxmin

      • Add each new landmark so that it maximizes the minimum distance to the already selected landmarks

    • LMDSfastmap

      • Use the same landmarks as found by Fastmap
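The maxmin strategy can be sketched as a greedy loop. The function name and the choice of the first landmark (index 0 by default) are assumptions for this example; the sketch also assumes that n does not exceed the number of distinct objects.

```python
def maxmin_landmarks(objects, dist, n, first=0):
    """Greedy maxmin selection; returns the indices of n landmarks."""
    chosen = [first]
    # min_dist[i]: distance from object i to its nearest chosen landmark.
    min_dist = [dist(objects[first], obj) for obj in objects]
    while len(chosen) < n:
        # Pick the object that is farthest from all landmarks chosen so far.
        nxt = max(range(len(objects)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, obj in enumerate(objects):
            min_dist[i] = min(min_dist[i], dist(objects[nxt], obj))
    return chosen
```

Keeping the running minimum means each new landmark costs one pass over the data, so selecting n landmarks takes n passes in total.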


Evaluation

  • Synthetic datasets

    • Randomly generate k-tuples for a given k and alphabet size σ

  • Real dataset

    • Yeast proteins benchmark (σ=20)

    • 6,341 proteins, 2.9 million residues

    • 103 query proteins, 38-884 residues

  • Weighted Hamming distance

  • CB-EUC substitution matrix (Sacan, 2007)
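One plausible reading of the weighted Hamming distance is a position-wise sum of residue dissimilarities between two equal-length k-tuples, with the dissimilarities taken from a matrix. The sketch below assumes that reading. The cost values are hypothetical placeholders and are not taken from the CB-EUC matrix.

```python
def weighted_hamming(u, v, cost):
    """Sum of per-position residue costs between two equal-length k-tuples."""
    if len(u) != len(v):
        raise ValueError("k-tuples must have the same length")
    return sum(cost(x, y) for x, y in zip(u, v))


def identity_cost(x, y):
    """Identity matrix case: reduces to the plain Hamming distance."""
    return 0 if x == y else 1


def table_cost(table, default=1.0):
    """Build a symmetric cost function from a {(x, y): value} table."""
    def cost(x, y):
        if x == y:
            return 0.0
        return table.get((x, y), table.get((y, x), default))
    return cost
```

With `identity_cost`, "CDEFG" and "CDCFA" are at distance 2. A table such as `{("D", "E"): 0.25}` (a made-up value) makes substitutions between similar residues cheaper than the default.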


Target dimensionality (d)

  • Sammon’s metric stress:

  • Breaking point dimensionality

k=5, synthetic dataset, identity matrix
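As a point of reference, Sammon's stress in its commonly cited form compares every original pairwise distance with the corresponding embedded distance, weighting each squared error by the inverse of the original distance. The sketch below assumes that form; whether it matches the exact variant used in the presentation is not confirmed by the transcript.

```python
def sammon_stress(orig, emb):
    """Sammon's stress between two symmetric distance matrices.

    orig[i][j]: distance in the original space
    emb[i][j]:  distance in the embedded space
    Pairs with zero original distance are skipped to avoid division by zero.
    """
    n = len(orig)
    total = 0.0
    error = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if orig[i][j] == 0:
                continue
            total += orig[i][j]
            error += (orig[i][j] - emb[i][j]) ** 2 / orig[i][j]
    return error / total if total else 0.0
```

A stress of 0 means the embedding preserves all pairwise distances; plotting the stress against d is one way to look for the dimensionality beyond which adding axes stops helping.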


Subsequence length (k) and alphabet size (σ)


Number of landmarks

k=5, d=7, synthetic dataset, identity matrix


Approximate k-tuple search performance

  • Find all k-tuples within a specified radius from a query k-tuple

k=6, d=8, real dataset, CB-EUC matrix
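A range query of this kind, and one way to score its approximate answer against the exact one, can be sketched as follows. A linear scan stands in for the spatial index (R-tree, X-tree) to keep the example self-contained, and the function names and the recall/precision measures are assumptions for illustration.

```python
import math


def range_search(query_vec, vectors, radius):
    """Indices of embedded k-tuples within `radius` of the embedded query.

    A spatial access method would answer this without scanning every vector.
    """
    return [i for i, v in enumerate(vectors) if math.dist(query_vec, v) <= radius]


def exact_range(query, tuples, dist, radius):
    """Ground truth: indices of k-tuples within `radius` in the original space."""
    return [i for i, t in enumerate(tuples) if dist(query, t) <= radius]


def recall_precision(approx, exact):
    """Fraction of true neighbors found, and fraction of results that are true."""
    approx, exact = set(approx), set(exact)
    found = len(approx & exact)
    recall = found / len(exact) if exact else 1.0
    precision = found / len(approx) if approx else 1.0
    return recall, precision
```

Because the mapping is not perfect, the two result sets can differ in both directions: missed neighbors lower recall, and spurious ones lower precision.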


Homology search

k=6, d=8, real dataset, CB-EUC matrix


Search time

search radius=7

Database size=100,000


Conclusion

  • Applied an embedding-based approach to approximate sequence similarity search for the first time

  • Significant time improvements with negligible degradation in accuracy

  • Achieved more stable embedding with combined pivot selection strategy

  • Defined intrinsic Euclidean dimensionality of the dataset

