
Indexing Genome Sequences



Presentation Transcript


  1. Indexing Genome Sequences • Srikanta B. J. • Database Systems Lab (DSL), Indian Institute of Science • IITB - Bioinformatics Workshop 2001

  2. Genome Sequence Analysis • Hypothesize: function of proteins, phylogenetic trees, causes of diseases • First step in unraveling the mystery of life! • Sequence similarity → structural similarity → functional similarity

  3. Sequence Similarity • Alignment between two sequences, S1 and S2 (perhaps of unequal length) • Insert spaces into, or at the ends of, S1 and S2 • Place them so that every character or space in either string is opposite a unique character or space in the other. E.g.,
       q a c - d b d
       q a w x - b -
     • Global and local alignments
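To make the definition concrete, here is a minimal sketch that scores the slide's example alignment column by column. The function name and the scoring values (+1 match, -1 mismatch, -1 for a column containing a space) are assumptions chosen for illustration; the slides do not specify a scoring scheme.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, gap=-1):
    """Score two gapped rows of an alignment, one column at a time.

    '-' stands for an inserted space. The scoring values are
    illustrative assumptions, not taken from the slides.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap        # character opposite a space
        elif a == b:
            score += match      # identical characters
        else:
            score += mismatch   # two different characters
    return score


# The example from the slide: 3 matches, 1 mismatch, 3 columns with a space.
print(alignment_score("qac-dbd", "qawx-b-"))
```

Under these assumed values the three matches are outweighed by one mismatch and three spaces, which is why the choice of scoring scheme matters when comparing alignments.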

  4. Alignment • Global: given two sequences, find the best alignment over their full length. E.g., between (agtcacaaaact, actcgga):
       a g t c a c a a a a c t
       | | | | | | | | | | | |
       a c t c g g a - - - - -
     • Local: look for "islands" of high similarity. E.g., between (agtcacaaaact, actcgga): a g t c a c a a a a c t | | | a c t c g g a
     • Both take O(mn) time with dynamic programming, where m and n are the sequence lengths
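The O(mn) dynamic programs can be sketched as follows. This is a minimal illustration, not the presenter's code: the global version follows the standard Needleman-Wunsch recurrence and the local version the standard Smith-Waterman recurrence (neither is named on the slide), and the scoring values are assumptions. Only the best score is computed; recovering the alignment itself would need a traceback.

```python
def global_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best score of an alignment spanning the full length of both strings."""
    n = len(s2)
    prev = [j * gap for j in range(n + 1)]    # row 0: s2 prefix against spaces
    for i in range(1, len(s1) + 1):
        cur = [i * gap] + [0] * n             # column 0: s1 prefix against spaces
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[n]


def local_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings (an 'island' of similarity)."""
    n = len(s2)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, len(s1) + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere instead of going negative.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


s1, s2 = "agtcacaaaact", "actcgga"            # the pair used on the slide
print(global_score(s1, s2), local_score(s1, s2))
```

Both functions fill an (m+1) by (n+1) table one row at a time, which is where the O(mn) cost comes from; keeping only two rows reduces memory to O(n).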

  5. Search Process • Given a sequence to be studied • Want all similar (global/local) known sequences • Collections of sequences: NCBI GenBank, SwissProt, etc.; these contain millions of sequences

  6. State of the art • Dynamic Programming: slow but accurate; never misses a significant alignment • FastA: faster than Dynamic Programming; uses statistical heuristics; reduced sensitivity → false dismissals • BLAST: fastest and popular; lower sensitivity than FastA; requires whole database in memory!

  7. BLAST - on $1,000 Budget! • BODHI experience [DSL, 2001]: ~51,000 DNA sequences in database • CAFÉ experience [Williams and Zobel, 2001]: ~120,000 DNA sequences in memory • Time: 67.1 seconds/BLAST, 10.6 seconds/BLAST

  8. NCBI GenBank Growth • Doubles every 13 months • In 1998, an estimated 40,000 sequence similarity queries per day • That was 3 years ago!!
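As a rough arithmetic check of the growth claim, assuming the 13-month doubling time held steadily over the 3 years (36 months) mentioned on the slide, the database would have grown by a factor of 2^(36/13), roughly 6.8. The helper name below is illustrative.

```python
def growth_factor(months_elapsed, doubling_months=13):
    """Multiplicative growth after months_elapsed, assuming steady doubling."""
    return 2 ** (months_elapsed / doubling_months)


# 3 years at one doubling every 13 months
print(round(growth_factor(36), 1))
```

The same steady-doubling assumption says nothing about query volume, which the slide reports only for 1998.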

  9. We Need Indexes for Sequence Similarity Searching NOW!!

  10. Indexed Searching • Inverted Indexes • RAMdb [Fondrat and Dessen, 1995] • CAFÉ [Williams and Zobel, 2001] • FLASH [Califano and Rigoutsos, 1993] • Multi-Dimensional Indexes • MRS-indexing [Kahveci and Singh, 2001] • Persistent Prefix Tree [Hunt et al., 2001]

  11. RAMdb (Rapid Access Motif db) • Each sequence in the repository is indexed by its constituent overlapping sequences • 800-fold speedup over Dynamic Programming • Prohibitive index size • No ranking (goodness) of alignments • False dismissals • Example index entries: ACTC → Seq1, Seq2, …; CTCG → Seq1, Seq4, …
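The idea of indexing each sequence by its overlapping fixed-length pieces can be sketched as a small inverted index. This is an illustration of the general technique only, not RAMdb's actual data structure; the function names, the piece length k=4, and the three toy sequences are assumptions chosen so that the ACTC and CTCG entries resemble the ones shown on the slide.

```python
from collections import defaultdict


def build_kmer_index(sequences, k=4):
    """Map every overlapping length-k piece to the ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def candidates(index, query, k=4):
    """Ids of sequences sharing at least one exact length-k piece with the query."""
    hits = set()
    for i in range(len(query) - k + 1):
        hits |= index.get(query[i:i + k], set())
    return hits


# Hypothetical toy repository.
repo = {"seq1": "ACTCGGA", "seq2": "TTACTCA", "seq4": "GCTCGT"}
index = build_kmer_index(repo)
print(sorted(index["ACTC"]))   # sequences containing ACTC
print(sorted(index["CTCG"]))   # sequences containing CTCG
print(sorted(candidates(index, "ACTCG")))
```

The sketch also shows where the listed drawbacks come from: every position of every sequence contributes an entry, so the index grows quickly; a lookup returns an unranked set of ids; and a query that shares no exact piece with a related sequence retrieves nothing, which is a false dismissal.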

  12. CAFÉ • Partitioned search: coarse searching with a compressed inverted index, then fine searching in a small fraction of the database, with ranking • 14-fold speedup over BLAST • Compression reduces the index size • Distant sequence relationships are lost • Lower retrieval effectiveness
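The coarse-then-fine pattern can be sketched as below. This is a simplified illustration under stated assumptions, not CAFÉ's implementation: the index is uncompressed, the coarse step simply counts shared length-k pieces, the fine step ranks the shortlist with a local-alignment score, and all names, parameters, and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(db, k=3):
    """Uncompressed inverted index: length-k piece -> ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def coarse_search(index, query, k=3, keep=2):
    """Shortlist the `keep` sequences sharing the most distinct pieces with the query."""
    votes = Counter()
    for piece in {query[i:i + k] for i in range(len(query) - k + 1)}:
        for seq_id in index.get(piece, ()):
            votes[seq_id] += 1
    return [seq_id for seq_id, _ in votes.most_common(keep)]


def local_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Local-alignment score by dynamic programming (assumed scoring values)."""
    prev, best = [0] * (len(s2) + 1), 0
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def partitioned_search(db, index, query, k=3, keep=2):
    """Coarse filter via the index, then fine ranking of the shortlist only."""
    shortlist = coarse_search(index, query, k, keep)
    return sorted(((local_score(query, db[s]), s) for s in shortlist), reverse=True)


# Hypothetical toy database.
db = {"s1": "ACTCGGA", "s2": "TTTTTTT", "s3": "GGACTCG", "s4": "ACTAAAA"}
print(partitioned_search(db, build_index(db), "ACTCGG"))
```

The expensive alignment runs only on the shortlist, which is the source of the speedup. A sequence that is related to the query but shares few exact pieces never reaches the fine step, which is consistent with the slide's remark that distant relationships are lost.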

  13. MRS-Indexing • Uses progressive wavelet coefficients to represent a sequence

  14. MRS-Indexing (contd.) • Builds a hierarchy of Multi-Dim. Indexes • Only for edit distances - no general scoring schemes • Not suited for average DNA/Protein query lengths

  15. Summary • Rapid growth in sequence databases • Existing algorithms do not scale • Indexed approach to Sequence Similarity is necessary • Improvements needed in Indexed Searching methods
