MCALIGN: Stochastic Alignment of Non-coding DNA Sequences based on an Evolutionary Model of Sequence Evolution

Genome Research 2004, by Peter D. Keightley & Toby Johnson. Majid Kazemian.



Presentation Transcript


  1. MCALIGN: Stochastic Alignment of Non-coding DNA Sequences based on an Evolutionary Model of Sequence Evolution • Genome Research 2004 • By Peter D. Keightley & Toby Johnson • Majid Kazemian

  2. Motivations • The genomes of higher eukaryotes contain large amounts of non-coding DNA sequence (intergenic DNA and introns) • Sequence alignment is a major issue in the evolutionary analysis of non-coding DNA • Aligning non-coding DNA is more difficult than aligning protein-coding DNA sequences, in which indels almost always occur in multiples of three bp and rarely cross codon boundaries

  3. Which alignment is correct? • Alignments with few gaps tend to have many nucleotide differences • Alignments with too many fragmented gaps tend to have too few nucleotide differences • Alignment 1: TTATA----CAG aligned against TTAGCTAAGCCG • Alignment 2: TTA--TA--CAG aligned against TTAGCTAAGCCG
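The trade-off on this slide can be checked by counting. The sketch below is illustrative only: the function name `summarize` is hypothetical, and a gap is counted as one run of consecutive gap columns, regardless of which sequence carries it. It tallies the nucleotide differences and gaps in the two example alignments.

```python
def summarize(top: str, bottom: str) -> tuple[int, int]:
    """Return (mismatches, gap_runs) for two aligned rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    mismatches = 0
    gap_runs = 0
    in_gap = False
    for x, y in zip(top, bottom):
        is_gap = x == "-" or y == "-"
        if is_gap and not in_gap:
            gap_runs += 1          # a new run of gap columns starts here
        elif not is_gap and x != y:
            mismatches += 1        # aligned nucleotides that differ
        in_gap = is_gap
    return mismatches, gap_runs


alignment_1 = summarize("TTATA----CAG", "TTAGCTAAGCCG")
alignment_2 = summarize("TTA--TA--CAG", "TTAGCTAAGCCG")
print(alignment_1)  # (3, 1): one gap of length 4, three differences
print(alignment_2)  # (1, 2): two gaps of length 2, one difference
```

Neither count alone says which alignment is better; that depends on how indels and substitutions are weighted against each other, which is what the model supplies.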

  4. Why a model-based alignment method? • Heuristic methods produce alignments by optimizing a parametric scoring function whose parameter values are chosen in a more or less arbitrary fashion • How the penalties for substitution and indel events relate to DNA sequence evolution is unclear • So inferences based on such alignments, such as estimates of sequence divergence or conservation, are likely to be biased • Therefore, explicit model-based approaches are desirable

  5. Features of MCALIGN • It assumes a model that allows an arbitrary distribution of indel lengths • The distribution of indel lengths is derived empirically from additional data • It uses a stochastic hill-climbing algorithm to search for more probable alignments • It is intended for global alignment • Its statistical properties are examined using simulations and real data • It uses the Jukes-Cantor model, the simplest model of nucleotide substitution

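  6. Statistical framework • Let a = variable describing the alignment • t = parameter of sequence evolution over time • S = observed sequence data • By Bayes' theorem, $\Pr(a, t \mid S) = \Pr(S \mid a, t)\,\Pr(a, t) / \Pr(S)$, where the unconditional probability $\Pr(S)$ is a constant • Inference about a alone, when t is a "nuisance parameter", uses $\Pr(a \mid S) = \int \Pr(a, t \mid S)\,dt$ • In both equations we need to compute $\Pr(S \mid a, t)$ and the probability of the alignment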

  7. Probability model of sequence evolution • For two sequences there is a single parameter of sequence evolution, t = (t12), where t12 is the time of evolution between the two sequences • For three sequences a second parameter is added, so the vector is t = (t12, t(12)3) • $\Pr(a \mid t)$ is the probability of the alignment (indel pattern) • $\Pr(S \mid a, t)$ is the probability of the observed sequences given this indel pattern

  8. Assumptions • In the common ancestor, the two sequences were identical to the ancestral sequence, so there were no indels in the alignment • Insertions and deletions occurred independently at a rate θ • The probability of an indel is θ t12 per interbase site • The proportion of indels of length i is w_i, such that $\sum_i w_i = 1$ • An alignment a is characterized by g_i gaps of length i and m sites at which indels could have occurred but did not (non-indels)

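  9. Probability of a given alignment a with g_i gaps of length i and m non-indels is given by $\Pr(a \mid t_{12}) = (1 - \theta t_{12})^{m} \prod_i (\theta t_{12} w_i)^{g_i}$, where $\sum_i g_i$ is the total number of indels • The parameters θ and w_i are treated as known (they must be estimated from external data)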
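A minimal sketch of this alignment probability, assuming it takes the product form implied by the independence assumptions on the previous slide: each of the m non-indel sites contributes (1 − θ t12) and each gap of length i contributes θ t12 w_i. The function name, argument layout and log scale are illustrative choices, not MCALIGN's actual code.

```python
import math


def log_pr_alignment(gap_counts, m, theta, t12, w):
    """Log probability of an indel pattern under the ASSUMED product form.

    gap_counts: dict mapping gap length i -> number of gaps g_i
    m:          number of interbase sites with no indel
    theta:      indel rate; theta * t12 is the per-site indel probability
    w:          dict mapping gap length i -> proportion w_i of indels
    """
    p_indel = theta * t12
    log_p = m * math.log1p(-p_indel)            # non-indel sites
    for length, count in gap_counts.items():
        log_p += count * math.log(p_indel * w[length])  # gaps of this length
    return log_p


# Hypothetical usage with the Drosophila-derived values quoted later in the deck.
w = {1: 0.455, 2: 0.182}
one_long_gap = log_pr_alignment({2: 1}, m=10, theta=0.225, t12=0.0306, w=w)
two_short_gaps = log_pr_alignment({1: 2}, m=9, theta=0.225, t12=0.0306, w=w)
print(one_long_gap > two_short_gaps)  # a single 2-bp gap is more probable here
```

Working on the log scale avoids underflow when m is in the thousands, as it is for the sequences discussed later.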

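  10. $\Pr(S \mid a, t_{12}) = \left(\tfrac{1}{12} k_{12}\right)^{n} \left[\tfrac{1}{4}(1 - k_{12})\right]^{l - n} \left(\tfrac{1}{4}\right)^{u}$ • The Jukes-Cantor model of nucleotide substitution was used to derive this probability • n = number of nucleotide differences, l = number of nucleotide sites not aligned to a gap, u = number of nucleotides aligned to a gap • k12 is the probability that two aligned nucleotides differ after time t12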
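A hedged sketch of the sequence likelihood under the Jukes-Cantor model. It assumes the form (k12/12)^n [(1 − k12)/4]^(l − n) (1/4)^u, with l counted as aligned nucleotide pairs, and it assumes the textbook Jukes-Cantor expression k12 = 3/4 (1 − exp(−4 t12 / 3)) with t12 in expected substitutions per site. Neither the parameterization nor the function names are taken from MCALIGN itself.

```python
import math


def jc_difference_probability(t12):
    """ASSUMED textbook Jukes-Cantor probability that two aligned sites differ."""
    return 0.75 * (1.0 - math.exp(-4.0 * t12 / 3.0))


def log_pr_sequences(n, l, u, t12):
    """Log probability of the observed sequences given the alignment.

    n: aligned pairs that differ, l: aligned pairs in total (n <= l),
    u: nucleotides aligned to a gap, t12: divergence (must be > 0).
    """
    k12 = jc_difference_probability(t12)
    return (n * math.log(k12 / 12.0)                 # each specific differing pair
            + (l - n) * math.log((1.0 - k12) / 4.0)  # each specific identical pair
            + u * math.log(0.25))                    # each nucleotide opposite a gap


# Sanity check: over one aligned site, the 4 identical and 12 differing
# nucleotide pairs should have probabilities summing to 1.
total = 4 * math.exp(log_pr_sequences(0, 1, 0, 0.03)) + 12 * math.exp(log_pr_sequences(1, 1, 0, 0.03))
print(round(total, 12))
```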

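  11. Jukes and Cantor, 1969 • The simplest substitution model • Markov chain with four states: a, c, g, t • Transition matrix P has $1 - 3\alpha$ on the diagonal and $\alpha$ in every off-diagonal entry: $P = \begin{pmatrix} 1-3\alpha & \alpha & \alpha & \alpha \\ \alpha & 1-3\alpha & \alpha & \alpha \\ \alpha & \alpha & 1-3\alpha & \alpha \\ \alpha & \alpha & \alpha & 1-3\alpha \end{pmatrix}$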

  12. Jukes and Cantor, 1969 (cont.) • As a function of time n, with α the per-step probability of each specific substitution, we get $\Pr(x \to y) = 0.25 + 0.75\,(1 - 4\alpha)^{n}$ if x = y and $\Pr(x \to y) = 0.25 - 0.25\,(1 - 4\alpha)^{n}$ otherwise
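The closed form can be checked numerically against repeated multiplication of the one-step matrix. The sketch below writes α (a symbol assumed here for the per-step probability of changing to one specific other nucleotide) and uses plain Python lists; all function names are illustrative.

```python
def jc_matrix(alpha):
    """One-step Jukes-Cantor matrix: 1 - 3*alpha on the diagonal, alpha elsewhere."""
    return [[1 - 3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def mat_pow(p, n):
    """n-step transition matrix obtained by multiplying p with itself n times."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


def jc_closed_form(alpha, n, same):
    """Closed-form n-step probability for x == y (same=True) or x != y."""
    decay = (1 - 4 * alpha) ** n
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


p_n = mat_pow(jc_matrix(0.01), 50)
print(abs(p_n[0][0] - jc_closed_form(0.01, 50, same=True)) < 1e-12)
print(abs(p_n[0][1] - jc_closed_form(0.01, 50, same=False)) < 1e-12)
```

As n grows, the decay term vanishes and every entry tends to 0.25, the uniform stationary distribution of the model.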

  13. Alignment Algorithm • Start from an initial alignment a1 • A heuristic method similar to a "divide-and-conquer" algorithm • Generate a2 from a1 by randomly selecting one of the following transformations: • Add a gap pair at random sites • Remove a random gap pair, or parts thereof • Move a gap within a sequence • Split a gap within a sequence • Merge a pair of adjacent gaps

  14. Alignment Algorithm (cont.) • Accept a2 with the following probability: • Store the alignment with maximum probability, a_max • Reset the alignment if Pr(a_i|S) < 0.01 Pr(a_max|S) for more than 100 iterations • Stop after a preset number of iterations without improving on a_max
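The control flow of the search (propose, accept or reject, track the best state, reset, stop) can be sketched generically. This is not MCALIGN's implementation: the acceptance rule below, min(1, probability ratio), is an assumption, since the slide's acceptance formula is not shown here, and the toy state, `propose_step` and `toy_log_prob` are hypothetical stand-ins for alignments, the gap transformations and Pr(a|S).

```python
import math
import random


def stochastic_hill_climb(start, propose, log_prob, patience=1000,
                          reset_ratio=0.01, reset_after=100, rng=None):
    """Return (best_state, best_log_prob) found by a stochastic hill climb."""
    rng = rng or random.Random()
    current, current_lp = start, log_prob(start)
    best, best_lp = current, current_lp
    since_best = 0   # iterations since the best state last improved
    low_streak = 0   # consecutive iterations far below the best state
    while since_best < patience:
        candidate = propose(current, rng)
        cand_lp = log_prob(candidate)
        # ASSUMED acceptance rule: always accept improvements, otherwise
        # accept with probability equal to the probability ratio.
        if cand_lp >= current_lp or rng.random() < math.exp(cand_lp - current_lp):
            current, current_lp = candidate, cand_lp
        if current_lp > best_lp:
            best, best_lp = current, current_lp
            since_best = 0
        else:
            since_best += 1
        # Reset rule from the slide: Pr(current) < 0.01 * Pr(best)
        # for more than 100 iterations in a row.
        if current_lp < best_lp + math.log(reset_ratio):
            low_streak += 1
        else:
            low_streak = 0
        if low_streak > reset_after:
            current, current_lp = best, best_lp
            low_streak = 0
    return best, best_lp


def propose_step(x, rng):
    """Toy transformation: move one step left or right on the integers."""
    return x + rng.choice((-1, 1))


def toy_log_prob(x):
    """Toy log probability with a single peak at x = 7."""
    return -(x - 7) ** 2


best, best_lp = stochastic_hill_climb(0, propose_step, toy_log_prob,
                                      patience=200, rng=random.Random(0))
print(best, best_lp)
```

In an alignment setting, `propose` would pick one of the five gap transformations at random and `log_prob` would combine the indel-pattern and sequence probabilities.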

  15. Three-way alignment • Approximation for three-way alignment: $\Pr(a \mid S) \approx C(S)\,\Pr(a, \hat{t} \mid S)$, where $\hat{t}$ maximizes $\Pr(a, t \mid S)$ and $C(S)$ is a constant • Adjust the transformations appropriately

  16. Parameter estimation • Drosophila species data were used to estimate θ and t • Sequences of length ~6300 bases were found to have 193 substitutions, 44 indels and 6328 non-indels • This gives a nucleotide difference of t = 0.0306 and therefore θ = 0.225 • Proportions of indels (w_x) for 1-bp (0.455) and 2-bp (0.182) indels were adopted from the data • Indels in the range 3-40 bp were assigned w_x from a fitted function of x, where b is a constant and a is estimated to be 1.167
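The quoted estimates can be reproduced with simple arithmetic, under one reading of the slide: t is substitutions per base, and θ t12 is the fraction of interbase sites carrying an indel. The last part is more speculative: it assumes a power law w_x = b x^(−a) for lengths 3-40, with b chosen so that all proportions sum to 1. Both the power-law form and that normalization are assumptions made for illustration.

```python
substitutions = 193
sequence_length = 6300   # approximate length quoted on the slide
indels = 44
non_indels = 6328

t = substitutions / sequence_length           # nucleotide difference per site
indel_prob = indels / (indels + non_indels)   # theta * t, per interbase site
theta = indel_prob / t

print(round(t, 4))      # 0.0306
print(round(theta, 3))  # 0.225

# ASSUMED power law for indel lengths 3..40, normalized to the remaining mass.
a = 1.167
w = {1: 0.455, 2: 0.182}
remaining = 1.0 - sum(w.values())
b = remaining / sum(x ** -a for x in range(3, 41))
for x in range(3, 41):
    w[x] = b * x ** -a

print(abs(sum(w.values()) - 1.0) < 1e-12)
```

Under this reading, longer indels receive monotonically smaller proportions, which matches the role the length distribution plays in the alignment probability.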

  17. Results • Fraction of correctly aligned bases across all methods

  18. Conclusion & Discussion • MCALIGN aligns non-coding regions better than heuristic alignment methods • It returns estimates of sequence divergence and nucleotide substitution in addition to the most probable alignment • It is based on the Jukes-Cantor model of nucleotide substitution • Parameters θ and t are derived from training data and are limited to that genus • The search space can have unreachable states • Execution time increases non-linearly with sequence length and as a function of t

  19. Thanks for your attention
