Lecture 3: Markov models of sequence evolution

Alexei Drummond

Friday quiz: How many bacterial cells are there in an average adult human?
• 10^12 (1 trillion)
• 10^13 (10 trillion)
• 10^14 (100 trillion)
• 10^15 (1000 trillion)

Hint: There are about 10^14 human cells in the average adult human.

CS369 2007

Modeling genetic change
• Given two or more aligned nucleotide or amino acid sequences, the first goal is usually to calculate some measure of sequence similarity (or, conversely, distance).
• The simplest estimate of genetic distance is the p-distance: the number of differences between two sequences divided by the sequence length.
• The p-distance is the Hamming distance normalized by the length of the sequence, i.e. the proportion of positions at which the sequences differ.
• The p-distance can also be considered the probability that the two sequences differ at a randomly chosen position (site).
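The definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name p_distance is hypothetical, and it assumes the two sequences are already aligned, of equal length, and free of gaps.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    # Hamming distance: count the positions where the characters differ.
    mismatches = sum(a != b for a, b in zip(seq1, seq2))
    # Normalize by the sequence length to get a proportion.
    return mismatches / len(seq1)


# The two example sequences used later in the lecture.
print(p_distance("AATCTGTGTA", "ATCCTGGGTT"))  # 4 differing sites out of 10
```

Gapped or ambiguous sites would need an explicit policy (for example, skipping them), which this sketch does not attempt.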

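Modeling genetic change

[Figure: an ancestral sequence evolves along two lineages; asterisks mark the sites that changed on each lineage.]

Ancestor AACCTGTGCA
Seq1     AATCTGTGTA  (2 sites differ from the ancestor)
Seq2     ATCCTGGGTT  (4 sites differ from the ancestor)

Seq1 AATCTGTGTA
Seq2 ATCCTGGGTT
      **   *  *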

P-distance

Seq1 AATCTGTGTA
Seq2 ATCCTGGGTT
      **   *  *

p-distance = 4/10 = 0.4

The p-distance is the proportion of nucleotide sites that differ between two sequences.

It usually underestimates the true genetic (or evolutionary) distance d.

Multiple, parallel, and back-substitutions

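[Figure: ancestral sequence AACCTGTGCA and two descendant sequences, AACCAGTGAA and ACCCGGTGAA, with substitution paths at individual sites drawn along the branches. Asterisks mark the two sites at which the descendants differ.]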

Transition probabilities
• Definition: Let P_xy(t) be the probability that a nucleotide x evolves to a nucleotide y in time t. If x = y, the evolutionary pathway could involve 0, 2, 3 or more substitutions. If x ≠ y, the pathway could involve 1, 2, 3 or more substitutions.
• P(t) is then a square transition probability matrix of size 4 by 4.

Modeling nucleotide substitutions as a time-homogeneous time-continuous stationary Markov process (1)

Markov property

• At any given site in a sequence, the rate of change from base i to base j is independent of the base that occupied that site prior to i.

[Figure: a site that previously held base i (i = A, C, G, T) now holds G. After time t it holds G with probability P_GG(t) or A with probability P_GA(t); both probabilities are independent of i.]

Modeling nucleotide substitutions as a time-homogeneous time-continuous stationary Markov process (2)
• Homogeneity: substitution rates do not change over time.
• Stationarity: the relative frequencies of A, C, G, and T (π_A, π_C, π_G, π_T) are at equilibrium, i.e. they remain constant.

Models of DNA Substitution

Simplest

1. Base frequencies are equal and all substitutions are equally likely (Jukes-Cantor)

2. Base frequencies are equal but transitions and transversions occur at different rates (Kimura 2-parameter)

3. Unequal base frequencies, and transitions and transversions occur at different rates (Hasegawa-Kishino-Yano)

4. Unequal base frequencies, and all substitution types occur at different rates (General Reversible Model)

Most complex

The Q-matrix (instantaneous rate matrix)

[Matrix not preserved: 4 x 4 instantaneous rate matrix with rows and columns ordered A, C, G, T.]

• π_i: frequency of nucleotide i
• a, b, c, etc.: relative rate parameters
• Off-diagonal entries: rate of flow from nucleotide i to nucleotide j
• Diagonal entries: total rate of flow leaving nucleotide i (the rate at which nucleotide i disappears per site per sequence)
• Scale factor: chosen so that the total output per unit time = 1.0

The Q-matrix

[Matrix not preserved: rows and columns are labeled A, C, G, T.]

General Time Reversible (GTR) Models

[Matrix not preserved: rows and columns are labeled A, C, G, T.]

Substitutions from nucleotide i to nucleotide j have the same relative rate parameter as substitutions from nucleotide j to nucleotide i.

In general, f = 1 and a, b, c, d, e are estimated from the data via maximum likelihood.
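As a rough sketch of how such a rate matrix can be assembled, the Python below builds a GTR-style Q-matrix from base frequencies and six symmetric relative rates. Several things here are assumptions rather than the slide's own matrix: the common parameterization q_ij = (relative rate of the pair i, j) x (frequency of j), the pair labels "AC" to "GT", the function name gtr_rate_matrix, and the illustrative parameter values. The scaling step makes the expected number of substitutions per unit time equal to 1.

```python
NUCLEOTIDES = "ACGT"


def gtr_rate_matrix(freqs, rates):
    """Build a normalized GTR-style rate matrix as a list of rows.

    freqs: dict of equilibrium frequencies, summing to 1.
    rates: dict mapping unordered pairs such as "AC" to a relative rate.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(NUCLEOTIDES):
        for j, y in enumerate(NUCLEOTIDES):
            if i != j:
                pair = "".join(sorted(x + y))
                # Off-diagonal: symmetric relative rate times target frequency.
                q[i][j] = rates[pair] * freqs[y]
        # Diagonal: minus the total rate of leaving nucleotide x.
        q[i][i] = -sum(q[i])
    # Scale so that the frequency-weighted total outflow is 1 per unit time.
    total = -sum(freqs[x] * q[i][i] for i, x in enumerate(NUCLEOTIDES))
    return [[entry / total for entry in row] for row in q]


# Hypothetical parameter values, chosen only for illustration (GT fixed at 1).
freqs = {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}
rates = {"AC": 1.2, "AG": 3.0, "AT": 0.8, "CG": 1.1, "CT": 3.5, "GT": 1.0}
q = gtr_rate_matrix(freqs, rates)

# Time-reversibility check (detailed balance): pi_i * q_ij == pi_j * q_ji.
for i, x in enumerate(NUCLEOTIDES):
    for j, y in enumerate(NUCLEOTIDES):
        assert abs(freqs[x] * q[i][j] - freqs[y] * q[j][i]) < 1e-12
```

Setting all frequencies to 0.25 and all six relative rates to the same value reduces this construction to the Jukes-Cantor case.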

Time-reversibility

[Figure: two diagrams over the nodes x, y, and z, marked as equivalent.]

Evolutionary meaning of the Q-matrix for the JC model

μ = rate of replacement of nucleotide i (i = A, C, G, T) per unit time during evolution, in nucleotide substitutions per site per sequence per unit time

μt = d, the number of nucleotide substitutions per site between two sequences that are separated by time t

Estimating transition probabilities
• Once the Q matrix, and thus the evolutionary model, is specified, the probabilities of change from any base to any other during evolutionary time t, P(t), can be calculated by computing the matrix exponential P(t) = exp(Qt).
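To make the matrix exponential concrete, here is a small pure-Python sketch that approximates exp(Qt) with a truncated Taylor series combined with scaling and squaring. The function names, the number of series terms, and the Jukes-Cantor rate matrix used as input (total substitution rate 1, so every off-diagonal entry is 1/3) are assumptions made for illustration. In practice a library routine such as scipy.linalg.expm would normally be preferred.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=20, squarings=10):
    """Approximate P(t) = exp(Q t) by scaling and squaring a Taylor series."""
    n = len(q)
    scale = t / 2 ** squarings
    small = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term becomes (Q t / 2^squarings)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    # Undo the scaling: exp(A) = (exp(A / 2^s))^(2^s).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Assumed JC rate matrix with total rate 1: each off-diagonal entry is 1/3.
jc_q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
p_t = transition_matrix(jc_q, t=0.5)

# Compare the numerical result with the JC closed form for the diagonal.
print(p_t[0][0], 0.25 + 0.75 * math.exp(-4.0 * 0.5 / 3.0))
```

Each row of the resulting matrix sums to 1, since row i lists the probabilities of every possible end state for a site that starts in state i.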

Jukes and Cantor (JC) model solution

By computing P(t) = exp(Qt) with Q according to the JC model and substitution rate μ:

P_ii(t) = 1/4 + 3/4 exp(-4/3 μt) = probability that nucleotide i ends up as the same character after time t

P_ij(t) = 1/4 - 1/4 exp(-4/3 μt) = probability that nucleotide i ends up as a particular different character j after time t

Estimating the genetic distances (1)
• The total probability of two sequences sharing the same nucleotide at a position is P_ii(t), and therefore the probability of the two sequences being different is p = 1 - P_ii(t) = 3 P_ij(t).
• p = 3/4 (1 - exp(-4/3 μt))
• An estimator of p is the observed proportion of different sites between two sequences (the p-distance).

Estimating the genetic distances (2)
• Solving for μt we get μt = - 3/4 ln (1 - 4/3 p). Substituting d for μt, we finally obtain the Jukes-Cantor correction formula for the genetic distance d between two sequences:
• d = - 3/4 ln (1 - 4/3 p)
• It can also be demonstrated that the variance V(d) is given by
• V(d) = 9p(1-p) / ((3-4p)^2 n), where n is the sequence length
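The correction formula and its variance translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the variance uses the standard form in which the expression is divided by the number of aligned sites n.

```python
import math


def jc_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if not 0.0 <= p < 0.75:
        # The logarithm is undefined once p reaches the saturation value 3/4.
        raise ValueError("JC correction is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_variance(p: float, n: int) -> float:
    """Approximate sampling variance of the JC distance for n aligned sites."""
    return 9.0 * p * (1.0 - p) / ((3.0 - 4.0 * p) ** 2 * n)


p = 0.4   # observed p-distance of the lecture's example pair
n = 10    # number of aligned sites in that example
print(round(jc_distance(p), 4))  # 0.5716
print(jc_variance(p, n))
```

The guard on p reflects a real limitation: for very divergent sequences the observed proportion can reach or exceed 3/4, and the JC distance can then no longer be computed.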

Calculating JC distance

Seq1 AATCTGTGTA
Seq2 ATCCTGGGTT
      **   *  *

p-distance = 0.4

d (JC model) = - 3/4 ln [1- 4/3 (0.4)] = 0.5716

Calculating JC distance

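[Figure: the ancestral sequence and its two descendants, with asterisks marking the sites that changed on each lineage.]

Ancestor AACCTGTGCA
Seq1     AATCTGTGTA  (2 substitutions)
Seq2     ATCCTGGGTT  (4 substitutions)

p-distance = 0.4

d (JC model) = - 3/4 ln [1 - 4/3 (0.4)] = 0.5716

Along the two lineages, 2 + 4 = 6 substitutions occurred over 10 sites (0.6 per site), so the JC distance is closer than the p-distance to the true amount of change.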

F81 model correction formula
• p = observed distance (the p-distance)
• When π_A = π_T = π_C = π_G = 0.25, the frequency-dependent constant in the formula equals 3/4, and the formula becomes equivalent to the one obtained for the JC model
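The slide's formula itself did not survive in the transcript. A commonly cited form of the F81 correction is d = -B ln(1 - p/B) with B = 1 - (π_A^2 + π_C^2 + π_G^2 + π_T^2), and the sketch below assumes that form. The symbol B and the function name f81_distance are assumptions. With equal base frequencies B is 3/4, which recovers the Jukes-Cantor formula.

```python
import math


def f81_distance(p, freqs):
    """F81-style corrected distance, assuming d = -B * ln(1 - p / B)."""
    # B depends only on the base frequencies; it is 3/4 when all are 0.25.
    b = 1.0 - sum(f * f for f in freqs)
    if not 0.0 <= p < b:
        raise ValueError("correction is defined only for 0 <= p < B")
    return -b * math.log(1.0 - p / b)


# Equal frequencies reproduce the JC value for the lecture's example (p = 0.4).
print(round(f81_distance(0.4, [0.25, 0.25, 0.25, 0.25]), 4))  # 0.5716

# Unequal frequencies (the pol averages reported later) give a larger value.
print(f81_distance(0.4, [0.390, 0.166, 0.228, 0.216]))
```

Unequal base frequencies lower B, so the same observed p corresponds to a larger corrected distance.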

Q-matrix for the Kimura-2p (K80) model

[Matrix not preserved: entries are marked as transitions (A <-> G, C <-> T) or transversions (all other changes).]
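Because K80 treats transitions and transversions separately, a distance under this model needs the two kinds of difference counted separately. The sketch below uses the standard Kimura two-parameter distance, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the observed proportions of transitional and transversional differences. The function name is hypothetical, and the code assumes aligned sequences containing only A, C, G, and T.

```python
import math

PURINES = {"A", "G"}


def k80_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Same chemical class (purine/purine or pyrimidine/pyrimidine)
        # means a transition; otherwise the change is a transversion.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq1)    # proportion of transitional differences
    q = transversions / len(seq1)  # proportion of transversional differences
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))


# Lecture example pair: 1 transition and 3 transversions over 10 sites.
print(k80_distance("AATCTGTGTA", "ATCCTGGGTT"))
```

For this pair the result is slightly larger than the Jukes-Cantor distance of 0.5716. The formula is undefined when the argument of the logarithm is not positive, which can happen for highly divergent sequences.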

Average SEQUENCE COMPOSITION (HIV-O/HIV-M full pol)

Sequence    5% chi-square test   p-value
SE8538a     passed               97.80%
97TZ02a     passed               94.59%
BOLO122b    passed               99.94%
CAM1b       passed               96.73%
NY5CGb      passed               97.64%
98IN022c    passed               99.44%
94IN112c    passed               98.68%
93IN101c    passed               99.61%
VI850f      passed               97.09%
X138g       passed               86.61%
SE6165g     passed               95.73%
VI991h      passed               98.23%
SE9173j     passed               96.17%
SE92809j    passed               96.50%
MP535k      passed               69.92%
92UG001d    passed               86.20%
HIVO        passed               77.48%

π_A = 39.0%, π_C = 16.6%, π_G = 22.8%, π_T = 21.6%

Average Ti/Tv = 2.6

Average SEQUENCE COMPOSITION (SIV/HIV full envelope)

Sequence     5% chi-square test   p-value
MVP5180      passed               14.60%
SIVcpzUS     passed               48.09%
SIVcpzGAB    passed               51.77%
92UG037a     passed               84.58%
92UG975g     passed               99.73%
92RU131g     passed               97.45%
93IN905c     passed               77.15%
92BRO25c     passed               59.51%
92UG021d     passed               94.89%
92UG024d     passed               92.60%
BSSG3b       passed               97.86%
SFMHS20b     passed               92.40%
91TH652b     passed               92.86%
MBC18R01b    passed               99.59%

π_A = 34.5%, π_C = 17.4%, π_G = 23.4%, π_T = 24.7%

Average Ti/Tv = 1.5

Q-matrix for the F84 model

(very similar to the HKY85 model)

[Matrix not preserved: entries are marked as transitions or transversions.]

Nucleotide substitution patterns in HIV/SIV

Average frequency of changes between states (SIV/HIV-1 envelope)

From \ To   A       C       G       T
A           -       346.2   697.4   290.3
C           241.9   -       123     320.8
G           515.4   126.6   -       117.1
T           215.6   371     144.6   -

Transitions: A <-> G and C <-> T. Transversions: all other changes.

Average Ti/Tv = 1.5

More complex models…
• More complex models, like Tamura-Nei (TN93) or the general time reversible (GTR) model, usually require numerical algorithms in order to calculate d.
• Several software packages can estimate genetic distances between nucleotide sequences according to different evolutionary models:
• MEGA3
• PAUP*
• PHYLIP
• TREE-PUZZLE
• DAMBE
• Geneious 2.5.4

Estimating HIV genetic distances: env gene

HIV-1B vs HIV-O/SIVcpz/HIV-1C, full envelope

          p-distance     JC69           K80            Tajima-Nei
HIV-O     0.391 (.008)   0.552 (.018)   0.560 (.019)   0.572 (.019)
SIVcpz    0.266 (.009)   0.337 (.009)   0.340 (.010)   0.427 (.013)
HIV-1C    0.163 (.008)   0.184 (.008)   0.187 (.008)   0.189 (.008)

Estimating HIV genetic distances: pol gene

HIV-1B vs HIV-O/HIV-1C, full pol

          p-distance     JC69           K80            Tajima-Nei
HIV-O     0.257 (.007)   0.315 (.010)   0.318 (.011)   0.324 (.011)
HIV-1C    0.103 (.005)   0.111 (.005)   0.113 (.006)   0.114 (.006)

Conclusions
• The genetic distance between two sequences can be estimated using a Markov model of DNA substitution.
• Different models will estimate different genetic distances.
• We have focused on DNA models, but it is possible to consider models for proteins and models that take into account codons and the genetic code.
• Markov model approaches to estimating genetic distance do not deal with indels, and they presuppose an alignment.
• These models assume that all positions in a DNA sequence mutate at the same rate. We will talk about how to relax this assumption in later lectures.
