modeling molecular substitution l.
Skip this Video
Loading SlideShow in 5 Seconds..
Modeling Molecular Substitution PowerPoint Presentation
Download Presentation
Modeling Molecular Substitution

Loading in 2 Seconds...

play fullscreen
1 / 26

Modeling Molecular Substitution - PowerPoint PPT Presentation

  • Uploaded on

Modeling Molecular Substitution Von Bing Yap Statistics Department, UC Berkeley Variations on a theme Human genome project furnishes sequences from a few individuals, but is relevant to all of us: we are very similar. We are also slightly different: polymorphisms.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Modeling Molecular Substitution' - andrew

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
modeling molecular substitution
Modeling Molecular Substitution

Von Bing Yap

Statistics Department, UC Berkeley

variations on a theme
Variations on a theme

Human genome project furnishes sequences from a few individuals, but is relevant to all of us: we are very similar.

We are also slightly different: polymorphisms.

Species variations.

how did variations arise
How did variations arise?

Mutation: (a) Inherent: DNA replication errors are not always corrected. (b) External: exposure to chemicals and radiation.

Selection: Deleterious mutations are removed quickly. Neutral and rarely, advantageous mutations, are tolerated and stick around …

Fixation: It takes time for a new variant to be established (having a stable frequency) in a population.

why study molecular substitution
Why study molecular substitution?

(Relatively) early days … to test hypotheses from population genetics. Do rodents evolve faster than primates (the generation time hypothesis) ? Which sites in a protein are under much selective pressure?

More recently (from 70’s) …to derive better substitution matrices like the PAM and BLOSUM series for sequence alignment and building phylogenetic trees.

modeling dna base substitution
ModelingDNA basesubstitution

Strictly speaking, only applicable to regions undergoing little selection.


1. Site independence.

2. Site homogeneity.

3. Markovian: given current base, future substitutions independent of past.

4. Temporal homogeneity: stationary Markov chain.

markov chain on a c g t
Markov chain on {A,C,G,T}

A stationary Markov chain is a family of transition probability matrices P(t), t  0, satisfying

P(t)P(s) = P(t+s), t,s  0.

P(t,a,b) = Pr(b at time s+t | a at time s).

Let Q be a rate matrix, i.e., it has positive off-diagonal entries and each row sum is 0.

Eg., Q(1,2) is the instantaneous rate of A going to C.

Q defines an MC by

P(t) = exp(Qt) = I + Q t + Q2t2/2! + …

jukes cantor model 1969
Jukes-Cantor model(1969)

Let Q =

Then P(t) =

r = (1+3exp(-4t))/4, s = (1- exp(-4t))/4.

the stationary distribution
The stationary distribution

A probability distribution on {A,C,G,T}, , is a stationary distribution if

a(a) P(t,a,b) = (b), b, t  0,

or  P(t) = , t  0, or  Q = 0 (global balance).

Facts: MC built from our Q has a unique stationary distribution. Let X(t) be the base at time t. If X(0) ~ , then X(t) ~ . Given any initial distribution, the distribution of X(t)   as t .

For the JC model,  is the uniform distribution.

a pair of homologous bases
A pair of homologous bases


T years





Typically, ancestor is unknown.

more assumptions
More assumptions

5. Qh = sh Q and Qm = sm Q, for some positive

sh and sm, and some rate matrix Q.

6. The ancestor is sampled from the stationary distribution  of Q.

7. Q is reversible:

(a) P(t,a,b) = P(t,b,a) (b), b, t  0 (detailed balance).

new picture
New picture

ancestor ~ 

shT PAMs



smT PAMs



joint probability of a and c
Joint probability of A and C

Under the model, the joint probability is

a (a) P(shT,a,A) P(smT,a,C)

= a (A) P(shT,A,a) P(smT,a,C)

= (A) P(shT+ smT,A,C) = F(t,A,C).

The matrix F(t) is symmetric. It is equally valid to view A as the ancestor of C or vice versa.

t = shT+ smT is the “distance” between A and C. Note: Q and t are identifiable from F(t), but sh , sm and T are not.

joint probability of a pair of homologous sequences
Joint probability of a pair of homologous sequences


= k F(t,ak,bk)

= a,b F(t,a,b)c(a,b),

where c(a,b) = # {k : ak = a, bk = b}.

the choice of q
The choice of Q

By convention (Dayhoff 1978), Q is chosen to satisfy

–a (a) Q(a,a) = 0.01,

or the expected number of substitutions in 1 time unit is 0.01 per site.

This new time scale is called evolutionary time, measured in PAM ( Point Accepted Mutation).

m pairs of homologous sequences
M pairs of homologous sequences

The kth pair of sequences, separated by tk PAMs, gives the count matrix c(k). Assuming that the pairs are independent and underwent the same MC defined by Q over pair-specific distances tk’s.

Pr(all pairs) = k a,b F(tk,a,b)c(k,a,b).

The pair-specific distances allow for different rates of substitution across pairs.

Parameters : Q, t1, t2,…,tM.

Maximum likelihood estimation by numerical methods.

previous work
Previous work

This model is general enough to include almost all models in use as special cases.

For DNA base, there are JC, Kimura and Hasegawa-Kishino-Yano models.

For amino acid, the PAM (Dayhoff 1978) and BLOSUM (Henikoff 1993) substitution matrices are derived based on the model. Müller and Vingron (2000) presents an interesting method of estimation.

For codon, Z Yang and collaborators use constrained versions to estimate ratio of syn/nonsyn substitutions, among other things.


a formula for a reversible q
A formula for a reversible Q

: stationary distribution of Q; ,…,: positive constants.

strand symmetry
Strand symmetry

It seems plausible that the substitution process in noncoding regions is strand-symmetric, i.e., Q(A,C) = Q(T,G), Q(C,C) = Q(G,G), etc.

From the point of view of sequence alignment, SS implies that the same substitution matrix can be used for aligning either strand.

Strand-symmetry (of the lack of) is well-studied in bacteria (Francino, Ochman).

a formula for a reversible ss q
A formula for a reversible SS Q

: stationary distribution; 2((1) + (2)) = 1. ,…,: positive constants.

chromosome bands and human mouse alignment
Chromosome bands and human-mouse alignment

Staining chromosomes with the Giemsa dye produces characteristic banding patterns. G-bands: G1, G2, G3 or G4 (Francke,1994). The unstained regions are H3- or H3+.

Data: 22,490 local alignments (4.4 Mbp) of human chromosome 22 with homologous mouse reads, produced by blastz, and approximate boundary coordinates of the bands on chr 22, obtained from the UCSC Human genome browser.

data analysis
Data analysis

Only alignments in noncoding regions are used. Alignments that are closer than 10 bases are glued. Of the resultant alignments, those longer than 100 bases are used to fit band-specific substitution models.




amino acid substitution
Amino acid substitution

Exactly the same formulation applies to modeling amino acid substitution.

Fast-evolving and slow-evolving proteins are known to have different substitution rates. How can we summarize such observations?

Buried amino acids evolve differently from exposed residues. HMM with 2 hidden state?

d melanogastor vs virilis
D. melanogastor vs virilis

Bergman and Kreitman (2001) showed that the substitution patterns between intergenic and intronic regions are similar. The combined estimated rate matrix is

future work
Future work

Study other human chromosomes, and isochore-specific substitution patterns.

Multiple species. Holmes (2002) has a neat EM algorithm for estimating substitution rates if the process is reversible. What to do if irreversible?

Even for 2 species, irreversible substitution models are not fully explored.


Terry Speed

Webb Miller

David Haussler

Terry Furey

Casey Bergman

Anne Yap