
Substitution models and evolutionary distances

This presentation explores the concepts of substitution models and evolutionary distances in DNA sequences, including multiple substitutions at the same site, Markov chains, rate matrices, and time-reversible models.


Presentation Transcript


  1. Substitution models and evolutionary distances Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca

  2. Key Concepts
     • Multiple substitutions at the same site
     • Markov chains
     • Transition probability matrix
     • Rate matrix
     • Rate and frequency parameters, equilibrium frequencies
     • Time-reversible
     • Stationary
     • Evolutionary distances derived from Markov chain

  3. Nucleotide substitutions
     [Figure, from WHL: the ancestral sequence ACACTCGGATTAGGCT evolves along two lineages, illustrating single, multiple, coincidental, parallel, convergent, and back substitutions.]
     Observed sequences:
       ATACTCAGGTTAAGCT
       ACAATCCGGTTAAGCT
     • Actual number of substitutions during the evolution of the two daughter sequences: 12
     • Observed number of substitutions between the two daughter sequences: 3
     • Substitution models are for correcting multiple hits.

  4. Markov Chain
     • Markov property (memoryless): $P(X_{n+1}=j \mid X_n=i \text{ and } A) = P(X_{n+1}=j \mid X_n=i)$
     • Stationary property: $P(X_{n+1}=j \mid X_n=i)$ is constant for all $n > 0$.
     • State space: {1, 2, …, N}. For a nucleotide substitution model, the state space is {A, C, G, T}.

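  5. Markov Chain
     Nucleotide frequencies after one step:
     $$[A_{t+1}\; G_{t+1}\; C_{t+1}\; T_{t+1}] = [A_t\; G_t\; C_t\; T_t] \begin{pmatrix} P_{AA} & P_{AG} & P_{AC} & P_{AT} \\ P_{GA} & P_{GG} & P_{GC} & P_{GT} \\ P_{CA} & P_{CG} & P_{CC} & P_{CT} \\ P_{TA} & P_{TG} & P_{TC} & P_{TT} \end{pmatrix}$$
     In short, P(t+1) = P(t)*M. Example:
     $$M = \begin{pmatrix} 0.97 & 0.01 & 0.01 & 0.01 \\ 0.01 & 0.97 & 0.01 & 0.01 \\ 0.01 & 0.01 & 0.97 & 0.01 \\ 0.01 & 0.01 & 0.01 & 0.97 \end{pmatrix}$$
     $$[1\; 0\; 0\; 0]\, M = [0.97\; 0.01\; 0.01\; 0.01]$$
     $$[0.97\; 0.01\; 0.01\; 0.01]\, M = [0.9412\; 0.0196\; 0.0196\; 0.0196]$$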
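The two multiplications on this slide can be reproduced with a few lines of code. The following is a minimal sketch in plain Python; the helper name `step` is made up for illustration, and the state order A, G, C, T is taken from the slide.

```python
# One Markov-chain step: P(t+1) = P(t) * M, with states ordered A, G, C, T.
M = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]

def step(freqs, matrix):
    """Multiply a row vector of frequencies by a transition matrix."""
    n = len(freqs)
    return [sum(freqs[i] * matrix[i][j] for i in range(n)) for j in range(n)]

p0 = [1.0, 0.0, 0.0, 0.0]  # the site starts as A with certainty
p1 = step(p0, M)           # frequencies after one generation
p2 = step(p1, M)           # frequencies after two generations

print([round(x, 4) for x in p1])
print([round(x, 4) for x in p2])
```

The second printed vector should match the slide's [0.9412 0.0196 0.0196 0.0196] after rounding to four decimals.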

  6. Markov Chain
     $P(t) = P(0) \cdot M^t$
     If the substitution rate is really as high as 0.01 per generation per site, then the sequence will almost reach full substitution saturation in just about 100 generations. How far back can we trace history with a realistic neutral substitution rate?
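To see how close to saturation the chain gets, one can raise the example matrix (0.97 on the diagonal, 0.01 elsewhere) to the 100th power. This is a sketch in plain Python; `mat_mul` and `mat_pow` are illustrative helper names, not part of the presentation.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, t):
    """Compute m**t for a non-negative integer t by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while t > 0:
        if t & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        t >>= 1
    return result

# Example matrix from the previous slide (states ordered A, G, C, T).
M = [[0.97 if i == j else 0.01 for j in range(4)] for i in range(4)]

Mt = mat_pow(M, 100)
p_stay = Mt[0][0]   # probability that a site starting as A is A after 100 steps
print(round(p_stay, 3), [round(x, 3) for x in Mt[0]])
```

After 100 generations the probability of still observing the starting nucleotide is only slightly above the saturation value of 0.25, which is the point the slide makes.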

  7. Realistic neutral substitution rate
     With a neutral substitution rate of $10^{-8}$, the sequence will reach almost full substitution saturation in about $10^{8}$ (100,000,000) generations. Early simple organisms are likely to have had many generations per year, and may well have had a higher substitution rate because of less efficient DNA repair mechanisms. In particular, the neutral substitution rate is typically much higher than $10^{-8}$.
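The link between the rate and the number of generations to saturation can be made explicit for the symmetric example matrix. As a worked sketch, write $\mu$ for the per-generation probability of changing to each specific other nucleotide (the symbol $\mu$ is introduced here for illustration; it equals 0.01 in the example matrix, and the slide's $10^{-8}$ is assumed to play the same role). The matrix then has diagonal $1-3\mu$ and off-diagonal $\mu$, and

```latex
P_{AA}(t) = \frac{1}{4} + \frac{3}{4}\,(1 - 4\mu)^{t}
          \approx \frac{1}{4} + \frac{3}{4}\, e^{-4\mu t},
\qquad
P_{AA}\!\left(t = \tfrac{1}{\mu}\right) - \frac{1}{4}
          \approx \frac{3}{4}\, e^{-4} \approx 0.014 .
```

So after roughly $1/\mu$ generations the chain is within about 0.014 of the saturation value 1/4, which gives about 100 generations for $\mu = 0.01$ and about $10^{8}$ generations for $\mu = 10^{-8}$.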

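  8. Substitution Models: JC69
     $\pi_A = \pi_C = \pi_G = \pi_T = 1/4$
     $$Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}$$
     How to obtain transition probabilities? Three methods (following slides).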

  9. Method 1: solve rate equations (JC69)
     The equilibrium frequency is obtained when $dP_A(t)/dt = 0$, i.e., $-4\alpha P_A(t) + \alpha = 0$, so $P_A(t) = 1/4$.
     (e) $P_{ij}(t) = [1 - P_{ii}(t)] / 3$
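The intermediate steps of this method are not preserved in the transcript. A sketch of the standard JC69 derivation, assuming the site starts as A so that $P_A(0) = 1$, runs as follows:

```latex
\frac{dP_A(t)}{dt} = -3\alpha P_A(t) + \alpha\,[1 - P_A(t)]
                   = -4\alpha P_A(t) + \alpha
\;\;\Longrightarrow\;\;
P_{ii}(t) = \frac{1}{4} + \frac{3}{4}\, e^{-4\alpha t},
\qquad
P_{ij}(t) = \frac{1 - P_{ii}(t)}{3}
          = \frac{1}{4} - \frac{1}{4}\, e^{-4\alpha t} \quad (j \neq i).
```

Setting the derivative to zero recovers the equilibrium value 1/4, and letting $t \to \infty$ in the solution gives the same limit.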

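  10. Method 2: probability thinking (JC69)
     [Diagram: nucleotides A, G, C, T, each connected by arrows with rate α.]
     • Each nucleotide has a rate α of being substituted by any of the 4 nucleotides (including itself).
     • After time t, the expected number of substitutions is 4αt.
     • Poisson distribution: $P(x=0, \alpha, t) = e^{-4\alpha t}$, $P(x>0, \alpha, t) = 1 - e^{-4\alpha t}$.
     • Because we have four nucleotides, each gets ¼ of $P(x \ge 1, \alpha, t)$: $P_{ij}(t) = \tfrac{1}{4}(1 - e^{-4\alpha t})$ for $j \ne i$.
     • As a nucleotide (say A) can change to 3 others, we have $P_{ii}(t) = 1 - 3P_{ij}(t) = e^{-4\alpha t} + \tfrac{1}{4}(1 - e^{-4\alpha t})$, where the first term is "nothing changed" and the second is "changed to itself".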

  11. Method 3 (general): matrix exponential
     with(LinearAlgebra);
     Q := Matrix([[-3*a,a,a,a],[a,-3*a,a,a],[a,a,-3*a,a],[a,a,a,-3*a]]);
     MatrixExponential(Q);
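For readers without Maple, the same idea can be checked numerically. The sketch below is not the author's code: it approximates the matrix exponential with a truncated Taylor series (an assumption that is adequate only for small rate × time products) and compares the result with the standard JC69 closed form. The helper names `mat_mul` and `expm`, and the values of `alpha` and `t`, are illustrative.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(q, t, terms=60):
    """Approximate exp(q*t) by summing the first `terms` Taylor terms."""
    n = len(q)
    a = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # holds (q*t)**k / k!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

alpha, t = 0.1, 1.0                            # illustrative values
Q = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = expm(Q, t)

# Standard JC69 closed form for comparison.
p_same = 0.25 + 0.75 * math.exp(-4 * alpha * t)
p_diff = 0.25 - 0.25 * math.exp(-4 * alpha * t)
print(P[0][0], p_same)
print(P[0][1], p_diff)
```

Each printed pair should agree to many decimal places, which is a quick way to confirm that the three methods describe the same transition probabilities.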

  12. Method 1: K80
     Solving this set of four rate equations yields the transition probabilities.
     (e) The equilibrium frequency (i.e., when the frequencies no longer change) is obtained when the time derivatives are zero.

  13. K80
     Check: $P_A(t) + P_G(t) + 2P_Y(t) = 1$; otherwise the derivations must be wrong.

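  14. Method 2: probability thinking (K80)
     [Diagram: nucleotides A, G, C, T; transitions (A↔G, C↔T) occur at rate α and transversions at rate β.]
     Focus on nucleotide A:
     • Event 1 (e1): A has a rate β of being substituted by any of the 4 nucleotides (including itself).
     • Event 2 (e2): A has an additional rate α − β of changing to G or to itself, so that the total rate of A → G is (α − β) + β = α.
     After time t, the expected number of changes is 4βt + 2(α − β)t = 2(α+β)t.
     Poisson distribution:
     • $P(e_1 = 0, e_2 = 0, t) = e^{-2(\alpha+\beta)t}$
     • $P(e_1 \ge 1, t) = 1 - e^{-4\beta t}$
     • $P(e_2 \ge 1, e_1 = 0, t) = 1 - P(e_1 = 0, e_2 = 0, t) - P(e_1 \ge 1, t) = e^{-4\beta t} - e^{-2(\alpha+\beta)t}$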
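The three Poisson probabilities combine into the K80 transition probabilities as follows. This is a worked sketch, writing $\alpha$ for the transition rate and $\beta$ for the transversion rate, and assuming that after at least one event of type 1 the site is equally likely to be any of the four nucleotides, while after events of type 2 only it is equally likely to be A or G:

```latex
\begin{aligned}
P_{AA}(t) &= e^{-2(\alpha+\beta)t}
           + \tfrac{1}{2}\left[e^{-4\beta t} - e^{-2(\alpha+\beta)t}\right]
           + \tfrac{1}{4}\left[1 - e^{-4\beta t}\right]
           = \tfrac{1}{4} + \tfrac{1}{4}e^{-4\beta t} + \tfrac{1}{2}e^{-2(\alpha+\beta)t},\\
P_{AG}(t) &= \tfrac{1}{2}\left[e^{-4\beta t} - e^{-2(\alpha+\beta)t}\right]
           + \tfrac{1}{4}\left[1 - e^{-4\beta t}\right]
           = \tfrac{1}{4} + \tfrac{1}{4}e^{-4\beta t} - \tfrac{1}{2}e^{-2(\alpha+\beta)t},\\
P_{AC}(t) &= P_{AT}(t) = \tfrac{1}{4}\left[1 - e^{-4\beta t}\right].
\end{aligned}
```

Adding the four terms gives $P_{AA} + P_{AG} + 2P_{AC} = 1$, consistent with the sanity check used for the rate-equation solution.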

  15. Method 3 (general): matrix exponential
     with(LinearAlgebra);
     Q := Matrix([[-(a+2*b),a,b,b],[a,-(a+2*b),b,b],[b,b,-(a+2*b),a],[b,b,a,-(a+2*b)]]);
     MatrixExponential(Q);
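A numerical cross-check of the K80 rate matrix is sketched below in plain Python. It is an illustration rather than a translation of the Maple session: the matrix exponential is approximated by a truncated Taylor series (adequate only for small rate × time products), the closed form used for comparison is the standard K80 result, and the helper names and the values of `a`, `b`, and `t` are made up. The state order A, G, C, T follows the Maple matrix, with `a` the transition rate and `b` the transversion rate.

```python
import math

def mat_mul(x, y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(q, t, terms=60):
    """Approximate exp(q*t) by summing the first `terms` Taylor terms."""
    n = len(q)
    m = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # holds (q*t)**k / k!
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

def k80_rate_matrix(a, b):
    """K80 rate matrix with states ordered A, G, C, T."""
    d = -(a + 2 * b)
    return [[d, a, b, b], [a, d, b, b], [b, b, d, a], [b, b, a, d]]

def k80_closed_form(a, b, t):
    """Return (no change, transition, transversion) probabilities."""
    e1 = math.exp(-4 * b * t)
    e2 = math.exp(-2 * (a + b) * t)
    return (0.25 + 0.25 * e1 + 0.5 * e2,
            0.25 + 0.25 * e1 - 0.5 * e2,
            0.25 - 0.25 * e1)

a, b, t = 0.2, 0.05, 1.0                       # illustrative values
P = expm(k80_rate_matrix(a, b), t)
same, transition, transversion = k80_closed_form(a, b, t)
print(P[0])
print([same, transition, transversion, transversion])
```

The first row of the numerical matrix should match the closed-form values, and the row should sum to 1.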

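  16. Calculating distance
     SP1 AAG CCT CGG GGC CCT TAT TTT TTG
         ||  |   ||| ||| |   ||| ||| ||
     SP2 AAT CTC CGG GGC CTC TAT TTT TTT
     What are P and Q? P = 4/24 (proportion of sites with a transitional difference), Q = 2/24 (proportion of sites with a transversional difference).
     Comparison of distances: p-distance = 0.25, D_JC69 = 0.304099, D_K80 = 0.3150786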
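The numbers on this slide can be reproduced directly from the two sequences. The sketch below uses the standard JC69 and K80 distance formulas, which the slide does not spell out; the function name `count_differences` is made up for illustration.

```python
import math

# The two aligned sequences from the slide, with codon spacing removed.
sp1 = "".join("AAG CCT CGG GGC CCT TAT TTT TTG".split())
sp2 = "".join("AAT CTC CGG GGC CTC TAT TTT TTT".split())

PURINES = {"A", "G"}

def count_differences(s1, s2):
    """Count transitional and transversional differences between two sequences."""
    ts = tv = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        # A transition stays within purines (A, G) or within pyrimidines (C, T).
        if (x in PURINES) == (y in PURINES):
            ts += 1
        else:
            tv += 1
    return ts, tv

n = len(sp1)
ts, tv = count_differences(sp1, sp2)
P = ts / n                      # proportion of transitional differences
Q = tv / n                      # proportion of transversional differences
p = P + Q                       # p-distance

d_jc69 = -0.75 * math.log(1 - 4 * p / 3)
d_k80 = -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

print(ts, tv, p)
print(round(d_jc69, 6), round(d_k80, 7))
```

Both corrected distances are larger than the p-distance because they account for multiple substitutions at the same site, and K80 is slightly larger than JC69 here because it treats transitions and transversions separately.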

  17. TN93 Distance
     • By solving equations (1), (2), (3), we have
