MCALIGN Monte Carlo Align A sequence evolution model based alignment method

Keightley PD, Johnson T. MCALIGN: stochastic alignment of noncoding DNA sequences based on an evolutionary model of sequence evolution. Genome Res. 2004 Mar;14(3):442-50.

Non-coding DNA and heuristics
• Aligning divergent non-coding DNA is more difficult than aligning coding DNA, and even more difficult than aligning protein sequences.
• Alignments with too many gaps, or with over-fragmented gaps, tend to have too few nucleotide differences.
• Alignments with too few gaps tend to have too many differences.
• Alignment 1 (three nucleotide differences):
  TTATA----CAG
  TTAGCTAAGCCG
• Alignment 2 (one nucleotide difference):
  TTA--TA--CAG
  TTAGCTAAGCCG
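The trade-off between gaps and differences in the two example alignments can be checked with a few lines of Python. This is a minimal sketch: the function name `alignment_summary` is hypothetical, and it simply counts mismatched columns and the lengths of gap runs in a pairwise alignment.

```python
from itertools import groupby

def alignment_summary(top, bottom, gap="-"):
    """Return (differences, gap_lengths) for two equal-length aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    # A difference is a column where both rows hold a base and the bases differ.
    differences = sum(
        1 for x, y in zip(top, bottom)
        if x != gap and y != gap and x != y
    )
    # Collect the length of every run of gap characters in either row.
    gap_lengths = []
    for row in (top, bottom):
        for is_gap, run in groupby(row, key=lambda ch: ch == gap):
            if is_gap:
                gap_lengths.append(len(list(run)))
    return differences, gap_lengths

alignment_1 = alignment_summary("TTATA----CAG", "TTAGCTAAGCCG")
alignment_2 = alignment_summary("TTA--TA--CAG", "TTAGCTAAGCCG")
print(alignment_1)  # (3, [4])
print(alignment_2)  # (1, [2, 2])
```

Splitting the single 4-base gap into two 2-base gaps removes two of the three differences, which is the bias toward fragmented gaps described above.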

Why a model-based method
• Heuristic methods produce alignments by minimizing or maximizing a scoring function that is chosen more or less arbitrarily.
• Inferences based on such alignments, such as estimates of sequence divergence/convergence (evolutionary distance), are biased.
• This makes the relation between the parameters of DNA sequence evolution and the relative penalties for substitutions and indels unclear.
• Therefore, explicit model-based approaches are desirable.
• This method is for global alignment of homologous noncoding DNA sequences.

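Statistical framework
Let
  a = variable describing the alignment
  t = parameter of sequence evolution over time
  S = observed sequence data
Joint inference about a and t:
  Pr(a, t | S) = Pr(S, a | t) Pr(t) / Pr(S)
Inference about a alone, when t is a “nuisance parameter”:
  Pr(a | S) = ∫ Pr(S, a | t) Pr(t) dt / Pr(S)
Unconditional Pr(S) = constant
The key term in both equations is Pr(S, a | t), and it needs to be computed.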

Probability model of sequence evolution
  Pr(S, a | t) = Pr(a | t) Pr(S | a, t)
• Pr(a | t): probability of the indel pattern (alignment)
• Pr(S | a, t): probability of the observed sequences given this indel pattern
• t: parameter of sequence evolution

The phylogeny of Drosophila species closely related to D. simulans (sim), including D. sechellia (sec), D. melanogaster (mel), and D. yakuba (yak)

Assumptions
• To start with, the two sequences were identical to that of the common ancestor.
• There were no indels.
• Insertions and deletions occurred independently at a rate q.
• The probability of an indel is qt per interbase site.
• The proportion of indels of length i is wi, such that the wi sum to 1 over all i.
• An alignment is characterised by gaps of length x1, x2, … and m sites at which indels could have occurred (non-indels).

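The probability of a given alignment with gaps of length x1, x2, … and m non-indels is given by an expression in q, t and the wx (equation not preserved in this transcript).
To derive the probability of the observed sequences given the alignment, the Jukes-Cantor model of nucleotide substitution was used (equation not preserved in this transcript), where
• n = # of nucleotide differences
• l = # of nucleotides not aligned to a gap
• u = # of nucleotides aligned to a gap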

Jukes and Cantor, 1969
• It is the simplest substitution model.
• It assumes equal base frequencies (πA = πC = πG = πT = 1/4) and equal mutation rates.
• The only parameter of this model is therefore μ, the overall substitution rate.
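Because the Jukes-Cantor model has a single rate parameter, the expected proportion of differing sites and the corrected distance have closed forms. The sketch below uses the standard textbook formulas; the function names are hypothetical, and t is taken here to mean the expected number of substitutions per site separating the two sequences, which is an assumption about how the slides use the symbol.

```python
import math

def jc_difference_probability(t):
    """Probability that a site differs between two sequences at distance t."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

def jc_distance(p):
    """Jukes-Cantor distance from an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) under Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Observed proportion from the Drosophila data quoted later in the slides.
p_observed = 193 / 6300
print(round(p_observed, 4))               # 0.0306
print(round(jc_distance(p_observed), 4))  # 0.0312
```

At this low divergence the correction is small, so the raw proportion of differences and the corrected distance nearly coincide.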

Alignment Algorithm Characteristics
• Monte Carlo hill-climbing algorithm. Transitions between local minima.
• Searches for the alignment with the highest probability.
• Approximation for three-way alignments: Pr(a | S) ≈ C(S) Pr(a, t̂ | S), where t̂ is the t that maximizes Pr(a, t | S) and C(S) is some constant.
• The approximation uses the height of the peak of the distribution instead of integrating over the whole range of t for a given a.

Alignment Algorithm
• The initial alignment comes from a heuristic “divide-and-conquer” algorithm. The best alignment is selected from a series of alignments scored with different scoring functions.
• A new alignment (a2) is generated as a transformation of the current alignment (a1), then accepted with a randomized probability.
• The transformation is one of the following, chosen randomly:
  • Add a gap pair at random sites
  • Remove a random gap pair or parts thereof
  • Move a gap within a sequence
  • Split a gap within a sequence
  • Merge a pair of adjacent gaps

Algorithm cont.
• The new alignment is accepted with a probability whose expression is not preserved in this transcript.
• Fraction of proposals accepted: ~0.4.
• amax holds the alignment with the maximum probability found so far.
• If Pr(ai|S) < 0.01 Pr(amax|S) for more than 100 iterations, the current alignment is reset to amax.
• The search is stopped after a preset number of iterations without an improvement in amax.
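The control flow of the search (propose, accept at random, track amax, reset, stop) can be sketched generically. The acceptance rule below is a Metropolis-style ratio, which is an assumption, since the slide's own expression is missing. The function `stochastic_search`, its parameters, and the toy integer problem are all hypothetical stand-ins for real alignment moves and probabilities.

```python
import math
import random

def stochastic_search(start, propose, log_prob, rng,
                      reset_ratio=0.01, reset_after=100, patience=1000):
    """Stochastic hill climbing that remembers the best state seen (a_max)."""
    current, current_lp = start, log_prob(start)
    best, best_lp = current, current_lp
    since_best = 0   # iterations since a_max last improved
    low_streak = 0   # consecutive iterations far below a_max
    log_reset = math.log(reset_ratio)
    while since_best < patience:
        candidate = propose(current, rng)
        cand_lp = log_prob(candidate)
        # ASSUMED acceptance rule: always accept improvements, otherwise
        # accept with probability Pr(candidate) / Pr(current).
        if cand_lp >= current_lp or rng.random() < math.exp(cand_lp - current_lp):
            current, current_lp = candidate, cand_lp
        if current_lp > best_lp:
            best, best_lp = current, current_lp
            since_best = 0
        else:
            since_best += 1
        # Reset to a_max after a long run below reset_ratio * Pr(a_max).
        low_streak = low_streak + 1 if current_lp - best_lp < log_reset else 0
        if low_streak > reset_after:
            current, current_lp = best, best_lp
            low_streak = 0
    return best, best_lp

# Toy problem: states are integers and the probability peaks at 7.
toy_log_prob = lambda x: -(x - 7) ** 2 / 2.0
toy_propose = lambda x, rng: x + rng.choice((-1, 1))
best_state, best_lp = stochastic_search(0, toy_propose, toy_log_prob, random.Random(0))
print(best_state)
```

Working in log probabilities avoids underflow, which matters because alignment probabilities over hundreds of sites are extremely small.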

Indel evolution parameters
• Drosophila spp. data were used to estimate q and t.
• Sequences of length ~6300 bases were found to have 193 substitutions and 44 indels (Sg), with 198 bases of indel and 6328 non-indel sites.
• This gives a nucleotide difference of 0.0306 (t), which in turn gives a q of 0.225.
• Proportions of indels (wx) for 1-bp (0.455) and 2-bp (0.182) indels were adopted from the data.
• Indels in the range 3-40 were assigned w from a function of indel length (formula not preserved in this transcript), where b is a constant and a is estimated to be 1.167.
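One way to turn these numbers into a full indel length distribution is sketched below. The power-law form w_x = b * x^(-a) is an assumption, because the slide's function is missing, and choosing b so that all weights sum to 1 is likewise an assumption. The function name `indel_length_weights` is hypothetical.

```python
def indel_length_weights(w1=0.455, w2=0.182, a=1.167, max_len=40):
    """Build w_x for x = 1..max_len: w1 and w2 are fixed, the tail is a power law."""
    tail_mass = 1.0 - w1 - w2  # probability left for lengths 3..max_len
    raw = {x: x ** (-a) for x in range(3, max_len + 1)}
    b = tail_mass / sum(raw.values())  # ASSUMED: b normalises the tail
    weights = {1: w1, 2: w2}
    weights.update({x: b * r for x, r in raw.items()})
    return weights, b

weights, b = indel_length_weights()
print(len(weights))                      # 40
print(round(sum(weights.values()), 6))   # 1.0
```

Under this assumed form, longer indels get monotonically smaller weights across the 3-40 range.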

Performance evaluation
• Evaluated over a range of t and q values, with 200 replicates for each set.
• Alignment performance decreases with increasing sequence divergence.
• Three-sequence alignments perform comparably to two-sequence alignments.
• If the q of the evolution model (qe) is lower than the assumed q (qa), estimates of t are marginally lower. If qe is higher, estimates of t are substantially higher.
• Execution time increases non-linearly with sequence length and as a function of t.

Comparison
• MCALIGN performs better at higher sequence divergences.
• The caveat is that an a priori estimate of t was used.
• The fraction of correctly aligned bases is comparable across all methods.

Discussion
• Criticisms
  • Requires appreciable homology between sequences.
  • Based on the Jukes-Cantor model of nucleotide substitution.
  • Parameters q and t are derived from training data limited to that genus.
  • The search space can have unreachable states.
  • Sequences longer than 1.5 kb cannot be aligned in reasonable time.
• Strengths
  • Tackles the difficult problem of aligning non-coding regions better than heuristic alignments.
  • Evaluates a large number of alignments.
  • Returns estimates of sequence divergence and nucleotide substitution in addition to the most probable alignment.
