
Multiple sequence comparison (MSC)

Reading: Setubal/Meidanis, 3.4; Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14. Why care about similarity? Similar sequences have similar structure. Does similar structure imply similar sequence? No, the converse is not true!

jasia

Presentation Transcript


  1. Multiple sequence comparison (MSC) • Reading: Setubal/Meidanis, 3.4; Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14

  2. Why care about similarity? • Similar sequences have similar structure

  3. Similar structure -> similar sequence? • No, the converse is not true! • Convergent evolution. Outwardly similar solutions to similar problems may be internally different. • Tiger and ‘Tasmanian tiger’. Fish and dolphin. Bat and bird. • Same is true of molecular ‘species’ and ‘anatomies’!

  4. Sequence --> function • Similar sequences have similar function • ‘[T]he same genes that work in flies are the ones that work in humans.’ -- Eric Wieschaus (1995 Nobel Prize for Drosophila work)

  5. Common origins • Similar sequences have common origins • ‘Descent with modification’ is Nature’s design mechanism • Strong similarity may imply recent common origin (what do we mean by ‘strong’ and ‘recent’?) • Strong similarity may imply strong conservation of sequence or motif

  6. Is multiple sequence comparison a generalization? • From a CS point of view, we’re going from two strings to many strings, a generalization • Yes, in that it helps detect faint similarities • No, in that we go from known biological similarity to suspected sequence similarity

  7. ‘Big’ uses for MSC • Represent protein families • Identify conserved sequence features • Deduce evolutionary history

  8. Profile representation • Definition: given a multiple alignment of a set of strings, a profile specifies for each column the frequency of each character

  9. Profile example
     Alignment:
       a b c - a
       a b a b a
       a c c b -
       c b - b c
     Profile (blank = 0; the last row is the gap character):
            C1    C2    C3    C4    C5
       a   .75         .25         .50
       b         .75         .75
       c   .25   .25   .50         .25
       -               .25   .25   .25
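As a minimal sketch of the definition on slide 8, the profile can be computed column by column from the alignment. This reads the slide's alignment as four rows of five columns and treats `-` as an ordinary character; the function name `build_profile` is made up for illustration.

```python
from collections import Counter


def build_profile(rows):
    """Return one {character: frequency} dict per alignment column."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    k = len(rows)
    profile = []
    for column in zip(*rows):  # zip(*rows) walks the alignment column-wise
        counts = Counter(column)
        profile.append({ch: n / k for ch, n in counts.items()})
    return profile


# The four aligned strings from the slide; "-" is the gap character.
alignment = ["abc-a", "ababa", "accb-", "cb-bc"]
profile = build_profile(alignment)
for i, col in enumerate(profile, start=1):
    print(f"C{i}", dict(sorted(col.items())))
```

Characters that never occur in a column are simply absent from that column's dict, which corresponds to the blank cells in the profile table.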

  10. Fit string S to profile P • Given a profile P and a string S, what is the best alignment (fit) of S to P? • Example:
      S: a a b - b c
      P: 1 - 2 3 4 5

  11. Two key issues • How to score an alignment of a string to a profile • How to compute an optimal alignment, given a scoring system

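  12. Scoring and alignment of profile • Scoring: assuming letter-to-letter scores are given, score a character against a column by the sum of its scores against each character in the column, weighted by that character's frequency • Optimal alignment: by dynamic programming (DP), similar to string-to-string (S-S) optimal alignment • Q: How would you do profile-to-profile scoring and alignment?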
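A minimal sketch of both bullets: the column score is the frequency-weighted sum of letter-to-letter scores, and the recurrence mirrors the string-to-string case. The scoring function `s` (match +2, mismatch -1, letter against gap -1, gap against gap 0) is an assumption for illustration and does not come from the slides; the profile is the one from slide 9.

```python
def s(x, y):
    """Hypothetical letter-to-letter score (an assumption, not from the slides)."""
    if x == "-" and y == "-":
        return 0.0
    if x == "-" or y == "-":
        return -1.0
    return 2.0 if x == y else -1.0


def column_score(x, column):
    """Weighted sum: score of x against each character, weighted by its frequency."""
    return sum(freq * s(x, y) for y, freq in column.items())


def fit_to_profile(string, profile):
    """Best global alignment score of a string against a profile, by DP."""
    n, m = len(string), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # string characters opposite no profile column
        V[i][0] = V[i - 1][0] + s(string[i - 1], "-")
    for j in range(1, m + 1):  # profile columns opposite a gap in the string
        V[0][j] = V[0][j - 1] + column_score("-", profile[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + column_score(string[i - 1], profile[j - 1]),
                V[i - 1][j] + s(string[i - 1], "-"),
                V[i][j - 1] + column_score("-", profile[j - 1]),
            )
    return V[n][m]


# Profile from slide 9 (zero-frequency characters omitted).
profile = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]
print(fit_to_profile("aabbc", profile))
```

The table has (n+1) x (m+1) cells and each cell takes time proportional to the alphabet size, so the cost is comparable to ordinary pairwise alignment. Recovering the alignment itself, not just its score, would need the usual traceback.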

  13. Signature (motif) representation • A motif is a regular expression (re) • Example: a helicase motif [&H][&A]D[DE]xn[TSN][x4][QK]Gx7[&A], where • [abc] = any of a, b, c • & = [ILVMFYW] • x = any amino acid • a3 = up to 3 a’s • an = any number of a’s • Find a motif by grep-ing
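The ‘grep-ing’ step can be sketched with Python's `re` module by translating the slide's notation by hand. Several readings here are assumptions: the second group is taken to be `[&A]` followed by `D`, ‘up to k’ is read as 0 to k repeats, and the test sequence is synthetic, not a real helicase.

```python
import re

HYDROPHOBIC = "ILVMFYW"  # the slide's "&" class

# Hand translation of the slide's motif under the assumptions stated above.
pattern = re.compile(
    f"[{HYDROPHOBIC}H]"  # [&H]
    f"[{HYDROPHOBIC}A]"  # [&A]
    "D"
    "[DE]"
    ".*"                 # xn: any number of any residue
    "[TSN]"
    ".{0,4}"             # x4: up to 4 of any residue (assumed to include 0)
    "[QK]"
    "G"
    ".{0,7}"             # x7: up to 7 of any residue (assumed to include 0)
    f"[{HYDROPHOBIC}A]"  # [&A]
)

synthetic = "GGLADEKKTAAQGPPPIGG"  # made-up sequence for illustration only
match = pattern.search(synthetic)
print(match is not None)
```

Because a motif is just a regular expression, scanning a sequence database is a linear pass with any regex engine; the biological work lies in choosing the pattern.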

  14. Finding optimal MS alignment • Need a scoring system • Given a scoring system, an (efficient) method of calculation • If no efficient method of getting the right answer, an efficient way of getting a plausible answer

  15. Need MSC measure • Desirable characteristics: • variable number of sequences • column-wise calculation • order independence • Example alignment:
      MQPILLL
      MLR-LL-
      MK-ILLL
      MPPVLIL

  16. Sum-of-pairs (SP) measure • Column score = sum of pairwise scores • k choose 2 pairs per column • Reduces to pairwise alignment when k = 2 • Need to assign a value to the (-,-) pair • May compute in either row or column order
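A minimal sketch of the SP measure on the alignment from slide 15, computed in both column order and row order. The pair score used here (0 for identical symbols, including gap against gap, and 1 otherwise) is a hypothetical distance-style choice, not one given in the slides.

```python
from itertools import combinations


def pair_score(x, y):
    """Hypothetical distance-style score: 0 if identical (gap-gap included), else 1."""
    return 0 if x == y else 1


def sp_score(rows):
    """Column order: sum the k-choose-2 pair scores within each column."""
    return sum(
        sum(pair_score(x, y) for x, y in combinations(column, 2))
        for column in zip(*rows)
    )


def sp_score_by_rows(rows):
    """Row order: sum the induced pairwise alignment scores over all row pairs."""
    return sum(
        sum(pair_score(x, y) for x, y in zip(r1, r2))
        for r1, r2 in combinations(rows, 2)
    )


alignment = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
print(sp_score(alignment), sp_score_by_rows(alignment))
```

Both functions add up the same set of symbol pairs, only grouped differently, which is why either order gives the same total and why reordering the sequences does not change it.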

  17. DP approach • Generalization of two-sequence comparison • k-dimensional array • space complexity is O(n^k) for k sequences of length n • MSC with SP measure is NP-complete
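To make the growth concrete, here is a small sketch that counts table cells and the predecessors each cell consults. It assumes all k sequences have the same length n; the function names are made up for illustration.

```python
def dp_table_size(n, k):
    """Cells in a k-dimensional DP table for k sequences of length n."""
    return (n + 1) ** k


def predecessors_per_cell(k):
    """Each cell looks back along every non-empty subset of the k axes."""
    return 2 ** k - 1


for k in (2, 3, 4, 5):
    print(k, dp_table_size(100, k), predecessors_per_cell(k))
```

Each additional sequence multiplies the table size by n + 1 and roughly doubles the work per cell, so exact DP is only practical for a handful of short sequences.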

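  18. MSA speedup heuristic • This ‘heuristic’ guarantees the right answer! • But .. it doesn’t guarantee the speedup • General idea (assuming alignment cost is minimized): • find an upper bound L on the cost of an optimal alignment, for example the cost of any known alignment • if a lower bound on the cost of every alignment through a cell exceeds L, that cell cannot enter into the optimal solution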

  19. Common method: iterative • Simplest implementation • Begin with Si and Sj which are pairwise closest • Iteratively merge in the additional string with the smallest edit distance from any string already in the multiple alignment • Equivalent to finding MSP on edit tree
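The selection rule above can be sketched as follows. This only computes the order in which strings would be merged, not the merged alignment itself, and it assumes unit-cost edit distance; `merge_order` is a made-up name.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def merge_order(strings):
    """Indices of the strings in the order the iterative method would merge them."""
    k = len(strings)
    if k < 2:
        return list(range(k))
    d = [[edit_distance(s, t) for t in strings] for s in strings]
    # Start with the pairwise closest pair (ties go to the first pair found).
    i, j = min(((i, j) for i in range(k) for j in range(i + 1, k)),
               key=lambda p: d[p[0]][p[1]])
    order = [i, j]
    remaining = set(range(k)) - {i, j}
    while remaining:
        # Next: the string closest to any string already merged.
        nxt = min(sorted(remaining), key=lambda r: min(d[r][m] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


print(merge_order(["aaaa", "aaab", "bbbb", "aabb"]))
```

Growing the set by always adding the nearest outside string is the same greedy rule Prim's algorithm uses to grow a minimum spanning tree over the pairwise edit distances.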

  20. Clustering method • Almost any clustering algorithm can be adapted to MSC • Usually start with small clusters and build big ones • Also possible to start with one big cluster and divide and conquer • Not clear which method is best
