
MCALIGN Monte Carlo Align A sequence evolution model based alignment method




  1. MCALIGN (Monte Carlo Align): a sequence evolution model-based alignment method. Keightley PD, Johnson T. MCALIGN: stochastic alignment of noncoding DNA sequences based on an evolutionary model of sequence evolution. Genome Res. 2004 Mar;14(3):442-50.

  2. Non-coding DNA and heuristics
     • Aligning divergent non-coding DNA is more difficult than aligning coding DNA, and even more difficult than aligning protein sequences.
     • Alignments with too many gaps, or with over-fragmented gaps, tend to have too few nucleotide differences.
     • Alignments with too few gaps tend to have too many differences.
     • Alignment 1 (three nucleotide differences):
         TTATA----CAG
         TTAGCTAAGCCG
     • Alignment 2 (one nucleotide difference):
         TTA--TA--CAG
         TTAGCTAAGCCG
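The difference counts quoted for the two example alignments can be checked with a short script. This is a minimal sketch; the helper name `count_differences` is hypothetical and not part of MCALIGN.

```python
def count_differences(row_a: str, row_b: str, gap: str = "-") -> int:
    """Count columns where both rows hold a base and the bases differ."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(
        1
        for x, y in zip(row_a, row_b)
        if x != gap and y != gap and x != y
    )


target = "TTAGCTAAGCCG"
alignment_1 = "TTATA----CAG"  # one gap of length 4
alignment_2 = "TTA--TA--CAG"  # two gaps of length 2

print(count_differences(alignment_1, target))  # 3
print(count_differences(alignment_2, target))  # 1
```

Splitting the single gap into two shorter gaps removes two of the three mismatches, which is the trade-off between gap count and nucleotide differences that the slide describes.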

  3. Why a model-based method
     • Heuristic methods produce alignments by minimizing or maximizing scoring functions that are chosen more or less arbitrarily.
     • Inferences based on such alignments, such as estimates of sequence divergence (evolutionary distance), are biased.
     • The relation between the parameters of DNA sequence evolution and the relative penalties for substitutions and indels is unclear.
     • Explicit model-based approaches are therefore desirable.
     • This method is for global alignment of homologous noncoding DNA sequences.

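  4. Statistical framework
     Let a = variable describing the alignment, t = parameter of sequence evolution over time, S = observed sequence data.
     • Joint posterior: Pr(a, t | S) = Pr(S, a | t) Pr(t) / Pr(S)
     • Inference about a alone, when t is a "nuisance parameter": Pr(a | S) = ∫ Pr(a, t | S) dt
     • The unconditional probability Pr(S) is constant.
     • The key term in both equations is Pr(S, a | t), and it needs to be computed.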

  5. Probability model of sequence evolution
     Pr(S, a | t) = Pr(a | t) Pr(S | a, t)
     • Pr(a | t): probability of the indel pattern (alignment)
     • Pr(S | a, t): probability of the observed sequences given this indel pattern
     • t: parameter of sequence evolution

  6. The phylogeny of Drosophila species closely related to D. simulans (sim), including D. sechellia (sec), D. melanogaster (mel), and D. yakuba (yak)

  7. Assumptions
     • To start with, the two sequences were identical to that of the common ancestor.
     • There were no indels at that point.
     • Insertions and deletions then occurred independently, at a fixed rate.
     • An indel occurs with a fixed probability per interbase site.
     • The proportion of indels of length i is w_i, such that the w_i sum to 1.
     • An alignment is characterised by the lengths of its gaps and by m sites at which indels could have occurred (non-indels).
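Since the model describes an alignment through its gap lengths, it can help to see how those lengths are read off a pair of aligned rows. The sketch below is illustrative only: the function name `gap_lengths` is hypothetical, and it does not compute the number of non-indel sites, whose exact definition is not recoverable from the transcript.

```python
import re


def gap_lengths(row_a: str, row_b: str, gap: str = "-") -> list[int]:
    """Return the length of every run of gap characters in either row."""
    pattern = re.compile(re.escape(gap) + "+")
    return [
        len(match.group())
        for row in (row_a, row_b)
        for match in pattern.finditer(row)
    ]


# The two example alignments from the earlier slide.
print(gap_lengths("TTATA----CAG", "TTAGCTAAGCCG"))  # [4]
print(gap_lengths("TTA--TA--CAG", "TTAGCTAAGCCG"))  # [2, 2]
```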

  8. The probability of a given alignment, with gaps of given lengths and m non-indels, follows from these assumptions. To derive the probability of the observed sequences given the alignment, the Jukes-Cantor model of nucleotide substitution was used, where n = number of nucleotide differences, l = number of nucleotides not aligned to a gap, and u = number of nucleotides aligned to a gap.

  9. Jukes and Cantor, 1969
     • It is the simplest substitution model, and it rests on several assumptions: equal base frequencies (1/4 for each base) and equal mutation rates between all bases. The only parameter of the model is therefore μ, the overall substitution rate.
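The standard Jukes-Cantor formulas can be written down directly. The sketch below is not the expression from the MCALIGN slides, which did not survive in the transcript. It assumes that t is the expected number of substitutions per site separating the two sequences, that l counts aligned columns, and that each nucleotide opposite a gap contributes its equilibrium frequency of 1/4. All function names are hypothetical.

```python
import math


def jc69_probs(t: float) -> tuple[float, float]:
    """Return (P[same base], P[one specific different base]) at distance t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def jc69_log_likelihood(n: int, l: int, u: int, t: float) -> float:
    """Log-probability of n differences in l aligned columns plus u gapped bases.

    Assumption: every observed base has equilibrium frequency 1/4, so each
    aligned column and each base opposite a gap contributes a factor of 1/4.
    Requires t > 0 whenever n > 0.
    """
    p_same, p_diff = jc69_probs(t)
    log_lik = (l + u) * math.log(0.25) + (l - n) * math.log(p_same)
    if n > 0:
        log_lik += n * math.log(p_diff)
    return log_lik


def jc69_distance(p: float) -> float:
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Example: 1 difference in 8 aligned columns, 4 bases opposite gaps.
t_hat = jc69_distance(1 / 8)
print(t_hat, jc69_log_likelihood(1, 8, 4, t_hat))
```

Under these assumptions the distance returned by `jc69_distance` is also the value of t that maximizes `jc69_log_likelihood` for the same n and l.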

  10. Alignment Algorithm Characteristics
     • Monte Carlo hill-climbing algorithm that allows transitions between local optima.
     • Searches for the alignment with the highest probability.
     • Approximation for three-way alignments: Pr(a | S) is approximated by C(S) times the maximum of Pr(a, t | S) over t, where C(S) is some constant.
     • The approximation uses the height of the peak of the distribution instead of integrating over the whole range of t for a given a.
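The difference between scoring an alignment by the peak height and scoring it by the full integral over t can be illustrated numerically. The sketch below is a toy: `toy_joint` stands in for Pr(a, t | S) up to a constant, using only a Jukes-Cantor substitution term with a flat prior on t and ignoring the indel term entirely. The function names, grid range, and step count are assumptions made for illustration.

```python
import math


def toy_joint(t: float, n: int, l: int) -> float:
    """Toy stand-in for Pr(a, t | S): JC69 term for n differences in l columns."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return (p_same ** (l - n)) * (p_diff ** n)


def peak_and_integral(n: int, l: int, t_max: float = 3.0, steps: int = 3000):
    """Return (peak height, rectangle-rule integral) of toy_joint over (0, t_max]."""
    width = t_max / steps
    values = [toy_joint(width * (i + 1), n, l) for i in range(steps)]
    return max(values), sum(values) * width


# Two candidate alignments of the same sequences: 3 vs 1 differences in 8 columns.
peak_1, integral_1 = peak_and_integral(n=3, l=8)
peak_2, integral_2 = peak_and_integral(n=1, l=8)
print(peak_2 > peak_1, integral_2 > integral_1)
```

In this toy case both scores rank the two candidates the same way, while the peak needs only a one-dimensional maximization rather than an integral for every alignment visited.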

  11. Alignment Algorithm
     • The initial alignment comes from a heuristic "divide-and-conquer" algorithm. The best alignment is selected from a series of alignments scored with different scoring functions.
     • A new alignment (a2) is generated as a transformation of the current alignment (a1), then accepted with a randomized probability.
     • The transformation is one of the following, chosen randomly:
         • Add a gap pair at random sites
         • Remove a random gap pair, or parts thereof
         • Move a gap within a sequence
         • Split a gap within a sequence
         • Merge a pair of adjacent gaps

  12. Algorithm cont.
     • Each new alignment is accepted probabilistically.
     • The fraction of proposals accepted is ~0.4.
     • a_max holds the alignment with the highest probability found so far.
     • If Pr(a_i|S) < 0.01 Pr(a_max|S) for more than 100 iterations, the current alignment is reset to a_max.
     • The search stops after a preset number of iterations without any improvement in a_max.
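The control flow of such a search (track the best state, reset after a long stay in low-probability states, stop after a run without improvement) can be sketched generically. The acceptance formula is missing from the transcript, so the sketch assumes a Metropolis-style rule, min(1, Pr(new)/Pr(current)). That rule is an assumption, not the published one. The function `stochastic_search`, its parameters, and the integer toy problem are all hypothetical.

```python
import math
import random


def stochastic_search(start, log_prob, propose, rng,
                      patience=1000, reset_ratio=0.01, reset_after=100):
    """Return (best state, its log-probability) found by a randomized climb."""
    current, current_lp = start, log_prob(start)
    best, best_lp = current, current_lp
    since_improved = 0  # iterations since the best state last improved
    below_count = 0     # consecutive iterations far below the best state
    while since_improved < patience:
        candidate = propose(current, rng)
        candidate_lp = log_prob(candidate)
        # ASSUMED rule: accept with probability min(1, Pr(new) / Pr(current)).
        if (candidate_lp >= current_lp
                or rng.random() < math.exp(candidate_lp - current_lp)):
            current, current_lp = candidate, candidate_lp
        if current_lp > best_lp:
            best, best_lp = current, current_lp
            since_improved = 0
        else:
            since_improved += 1
        # Reset to the best state after too long below reset_ratio * Pr(best).
        if current_lp < best_lp + math.log(reset_ratio):
            below_count += 1
            if below_count > reset_after:
                current, current_lp = best, best_lp
                below_count = 0
        else:
            below_count = 0
    return best, best_lp


# Toy problem: integer states, probability peaked at 7, moves of +1 or -1.
best, best_lp = stochastic_search(
    start=0,
    log_prob=lambda x: -0.5 * (x - 7) ** 2,
    propose=lambda x, r: x + r.choice((-1, 1)),
    rng=random.Random(0),
    patience=200,
)
print(best)
```

In a real aligner the state would be an alignment, `propose` would apply one of the gap transformations listed on the previous slide, and `log_prob` would be the model's alignment probability.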

  13. Indel evolution parameters
     • Drosophila species data were used to estimate q and t.
     • Sequences of length ~6300 bases were found to have 193 substitutions and 44 indels (Sg), with 198 bases of indels and 6328 non-indel sites.
     • This gives a nucleotide difference of 0.0306 (t), which in turn gives a q of 0.225.
     • Proportions of indels (w_x) for 1-bp (0.455) and 2-bp (0.182) indels were adopted from the data.
     • Indels in the range 3-40 bp were assigned w from a function with a constant b and a parameter a, estimated to be 1.167.
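A full set of length proportions can be assembled from these numbers once a functional form is chosen for lengths 3-40. The transcript does not preserve that function, so the sketch below assumes a power law, w_x = b * x^(-a), and assumes that b is set so that all proportions sum to 1. Both choices are assumptions for illustration, and the function name is hypothetical.

```python
def indel_length_weights(w1=0.455, w2=0.182, a=1.167, max_len=40):
    """Return ({length: proportion}, b) under an ASSUMED power law for lengths >= 3."""
    remaining = 1.0 - w1 - w2  # probability mass left for lengths 3..max_len
    raw = {x: x ** (-a) for x in range(3, max_len + 1)}
    b = remaining / sum(raw.values())  # assumed: b normalizes the tail
    weights = {1: w1, 2: w2}
    weights.update({x: b * value for x, value in raw.items()})
    return weights, b


weights, b = indel_length_weights()
print(b, weights[3], weights[40], sum(weights.values()))
```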

  14. Performance evaluation
     • Evaluated over a range of t and q values, with 200 replicates for each set.
     • Alignment performance decreases with increasing sequence divergence.
     • Three-sequence alignments perform comparably to two-sequence alignments.
     • If the q of the evolution model (qe) is lower than the assumed q (qa), estimates of t are marginally lower. If qe is higher, estimates of t are substantially higher.
     • Execution time increases non-linearly with sequence length and as a function of t.

  15. Comparison
     • MCALIGN performs better at higher sequence divergences.
     • The caveat is that an a priori estimate of t was used.
     • The fraction of correctly aligned bases is comparable across all methods.

  16. Discussion
     • Criticisms
         • Requires appreciable homology between sequences.
         • Based on the Jukes-Cantor model of nucleotide substitution.
         • Parameters q and t are derived from training data limited to that genus.
         • The search space can have unreachable states.
         • Sequences longer than 1.5 kb cannot be aligned in reasonable time.
     • Strengths
         • Tackles the difficult problem of aligning non-coding regions better than heuristic alignments.
         • Evaluates a large number of alignments.
         • Returns estimates of sequence divergence and nucleotide substitution in addition to the most probable alignment.
