1 / 89

Recent Progress in Multiple Sequence Alignments: A Survey

Recent Progress in Multiple Sequence Alignments: A Survey. Cédric Notredame. Our Scope. What are The existing Methods ?. How Do They Work: -Assemby Algorithms -Weighting Schemes. When Do They Work ?. Which Future ?. Outline. - Introduction. - A taxonomy of the existing Packages.

nalani
Download Presentation

Recent Progress in Multiple Sequence Alignments: A Survey

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Recent Progress in Multiple Sequence Alignments:A Survey Cédric Notredame

  2. Our Scope What are The existing Methods? How Do They Work: -Assemby Algorithms -Weighting Schemes. When Do They Work ? Which Future?

  3. Outline -Introduction -A taxonomy of the existing Packages -A few algorithms… -Performance Comparison using BaliBase

  4. Introduction

  5. LIKE ANYMODEL What Is A Multiple Sequence Alignment? A MSA is a MODEL It Indicates the RELATIONSHIP between residues of different sequences. It REVEALS -Similarities -Inconsistencies

  6. How Can I Use A Multiple Sequence Alignment? chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. ::: .: .. . : . . * . *: * chite AATAKQNYIRALQEYERNGG- wheat ANKLKGEYNKAIAAYNKGESA trybr AEKDKERYKREM--------- mouse AKDDRIRYDNEMKSWEEQMAE * : .* . : Extrapolation Motifs/Patterns Multiple Alignments Are CENTRAL to MOST Bioinformatics Techniques. Profiles Phylogeny Struc. Prediction

  7. How Can I Use A Multiple Sequence Alignment? Multiple Alignments Is the most INTEGRATIVE Method Available Today. We Need MSA to INCORPORATE existing DATA

  8. BIOLOGY:What is A Good Alignment COMPUTATIONWhat is THE Good Alignment chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP ***. ::: .: .. . : . . * . *: * Why Is It Difficult To Compute A multiple Sequence Alignment? A CROSSROAD PROBLEM

  9. Why Is It Difficult To Compute A multiple Sequence Alignment ? BIOLOGY COMPUTATION CIRCULAR PROBLEM.... Good Good Alignment Sequences

  10. A Taxonomy of Multiple Sequence Alignment Methods

  11. Grouping According to the assembly Algorithm

  12. Simultaneous As opposed to Progressive [Simultaneous: they simultaneously use all the information] Exact As opposed to Heursistic [Heuristics: cut corners like Blast Vs SW] [Heuristics: do not guarranty an optimal solution] StochasticAs opposed to Determinist [Stochastic: contain an element of randomness] [Stochastic: Example of a Monte Carlo Surface estimation ] Iterative As opposed to Non Iterative [Iterative: run the same algorithm many times] [Iterative: Most stochastic methods are iterative]

  13. Simultaneous Progressive Clustal DCA T-Coffee Combalign MSA POA Dialign Iteralign GAs GA SAGA Prrp SAM HMMer Non tree based HMMs Iterative OMA PralineMAFFT

  14. Simultaneous Progressive Clustal DCA T-Coffee Combalign MSA POA Dialign OMA PralineMAFFT Iteralign GA Prrp SAM HMMer Iterative SAGA Stochastic

  15. NEARLY EVERY OPTIMISATIONALGORITHMHAS BEEN APPLIED TO THEMSA PROBLEM!!!

  16. Grouping According to the Objective Function

  17. Scoring an Alignment: Evolutionary based methods BIOLOGY How many events separate my sequences? Such an evaluation relies on a biological model. COMPUTATION Every position musd be independant

  18. A A C AA A C C REAL Tree Model: ALL the sequences evolved from the same ancestor A C Tree: Cost=1 A A C PROBLEM: We do not know the true tree

  19. AA A C C STAR Tree Model: ALL the sequences have the same ancestor A C Star Tree: Cost=2 A A C A PROBLEM: the tree star is phylogenetically wrong

  20. [s(a,b): matrix] AA A C C A C [i: column i] Sums of Pairs: Cost=6 [k, l: seq index] A C A Sums of Pairs Model=Every sequence is the ancestor of every sequence PROBLEM: -over-estimation of the mutation costs -Requires a weighting scheme

  21. Cost= 5*N*(N-1)/2 [5: Leucine Vs Leucine with Blosum50] LL L L L Cost=5*N*(N-1)/2-(5)*(N-1) - (-4)*(N-1) G [glycine effect] Cost=5*N*(N-1)/2-(9)*(N-1) Sums of Pairs: Some of itslimitations (Durbin, p140)

  22. 2*(9)*(N-1) (9) Delta= = 5*N*(N-1) 5*N Delta G LL L L L N Sums of Pairs: Some of its limitations (Durbin, p140) Conclusion: The more Leucine, the less expensive it gets to add a Glycin to the column...

  23. AA A C C Enthropy based Functions Model: Minimize the enthropy (variety) in each Column [number of Alanine (a) in column i] [Score of column i] [a: alphabet] [P can incorporate pseudocounts] S=0 if the column is conserved PROBLEM: -requires a simultaneous alignment -assumes independant sequences

  24. AA A C C Consistency based Functions Model: Maximise the consistency (agreement) with a list of constraints (alignments) [kand l are sequences, i is a column] [the two residues are found aligned in the list of constraints] PROBLEM: -requires a list of constraints

  25. Weighted Sums of Pairs Concistency Based Combalign MSA T-Coffee ClustalPOA Praline DCA Dialign Prrp Iteralign SAGA MAFFTOMA GIBBS SAM HMMer Enthropy

  26. A few Multiple Sequence Alignment Algorithms

  27. A Few Algorithms MSA and DCA POA ClustalW MAFFT Dialign II Prrp SAGA GIBBS Sampler

  28. Simultaneous: MSA and DCA

  29. Simultaneous Alignments : MSA 1) Set Bounds on each pair of sequences (Carillo and Lipman) 2) Compute the Maln within the Hyperspace -Few Small Closely Related Sequence. -Memory and CPU hungry -Do Well When They Can Run.

  30. chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP MSA: the carillo and Lipman bounds ( ) S = ) ( S chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE + ) ( S … [Pairwise projection of sequences k and l]

  31. Upper ? Lower a(k,m) â(k,m) a(k,l) â(k,l) MSA: the carillo and Lipman bounds a(k,l)=score of the projection k l in the optimal MSA S(a(x,y))=score of the complete multiple alignment â(k,l)=score of the optimal alignment of k l

  32. â(k,l) ? LM+ â(k,l)-S(â(x,y)) a(k,l) â(k,l) MSA: the carillo and Lipman bounds LM: a lower bound for the complete MSA LM<=S(â(x,y)) - (â(k,l)-a(k,l)) a(k,l)>=LM +â(k,l)-S(â(x,y))

  33. MSA: the carillo and Lipman bounds â(k,l) LM+ â(k,l)-S(â(x,y)) a(k,l) ä(k,l) â(k,l) LM: can be measured on ANY heuristic alignment LM = S(ä(x,y)) The better LM, the tighter the bounds…

  34. MSA: the carillo and Lipman bounds N N 0 0 Best( M-i, N-j) Best( 0-i, 0-j) + M M Forward backward

  35. Simultaneous Alignments : MSA 1) Set Bounds on each pair of sequences (Carillo and Lipman) 2) Compute the Maln within the Hyperspace -Few Small Closely Related Sequence. -Memory and CPU hungry -Do Well When They Can Run.

  36. -Few Small Closely Related Sequence, but less limited than MSA -Memory and CPU hungry, but less than MSA -Do Well When Can Run. Simultaneous Alignments : DCA

  37. Simultaneous With a New Sequence Representaion: POA-Partial Ordered Graph

  38. POA POA makes it possible to represent complex relationships: -domain deletion -domain inversions

  39. Progressive: ClustalW

  40. Progressive Alignment: ClustalW Feng and Dolittle, 1988; Taylor 198ç Clustering

  41. Dynamic Programming Using A Substitution Matrix Progressive Alignment: ClustalW

  42. Tree based Alignment : Recursive Algorithm Align ( Node N) { if ( N->left_child is a Node) A1=Align ( N->left_child) else if ( N->left_child is a Sequence) A1=N->left_child if (N->right_child is a node) A2=Align (N->right_child) else if ( N->right_child is a Sequence) A2=N->right_child Return dp_alignment (A1, A2) } C A D F G E B

  43. Progressive Alignment : ClustalW -Depends on the CHOICE of the sequences. -Depends on the ORDER of the sequences (Tree). • -Depends on the PARAMETERS: • Substitution Matrix. • Penalties (Gop, Gep). • Sequence Weight. • Tree making Algorithm.

  44. Progressive Alignment : ClustalW Weighting Weighting Within ClustalW

  45. Progressive Alignment : ClustalW GOP Position Specific GOP

  46. Progressive Alignment : ClustalW 2 2 3 -Scales Well: N, N L -Greedy Heuristic (No Guarranty). -Fast ClustalW is the most Popular Method

  47. Progressive Alignment With a Heuristic DP: MAFFT

More Related