

Evolutionary Models



CS 498 SS

Saurabh Sinha

- The DNA that we study in bioinformatics is the end(??)-product of evolution
- Evolution is a very complicated process
- Very simplified models of this process can be studied within a probabilistic framework
- Allows testing of various hypotheses about the evolutionary process, from multi-species data

Source: Ewens and Grant, Chapter 14.

- There IS genetic variation between individuals in a population
- But relatively little variation at nucl. level
- E.g., two humans differ at the nucl. level at one in 500 to 1000 nucls.
- Roughly speaking, a single nucleotide dominates the population at a particular position in the genome

- Over long time periods, the nucleotide at a given position remains the same
- But periodically, this nucleotide changes (over the entire population)
- This is called “substitution”, i.e., replacement of the predominant nucl. for that position with another predominant nucl.

- Markov chain to describe the substitution process at a position
- States are “a”, “c”, “g”, “t”
- The chain “runs” in certain units of time, i.e., the state may change from one time point to the next time point
- The unit of time (difference between successive time points) may be arbitrary, e.g., 20000 generations.

- A symbol such as $p_{ag}$ is the probability of a change from “a” to “g” in one unit of time
- When studying two extant species, the evolutionary model has to provide the joint probability of the two species’ data
- Sometimes, this is done by computing probability of the ancestor, starting from one extant species, and then the probability of the other extant species, starting from the ancestor
- If we want to do this, the evolutionary process (model) must be “time reversible”: P(x)P(x->y) = P(y)P(y->x)
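As a sketch of why reversibility helps here, suppose (these are assumptions, not stated on the slide) that the ancestral nucleotide z is drawn from a distribution P(z) that satisfies the reversibility condition, that the two lineages evolve independently, and that $P_s(z \to x)$ denotes the probability of going from z to x over s time units. The symbols s and t are introduced only for this illustration. The joint probability of seeing x in one species and y in the other then collapses to a single path from x to y:

```latex
\begin{align*}
\Pr(x, y) &= \sum_{z} P(z)\, P_s(z \to x)\, P_t(z \to y) \\
          &= \sum_{z} P(x)\, P_s(x \to z)\, P_t(z \to y) \\
          &= P(x)\, P_{s+t}(x \to y)
\end{align*}
```

The second line applies the reversibility condition to the first two factors (the condition carries over from one time unit to s time units), and the third line sums over the intermediate state z.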

- Markov chain with four states: a,c,g,t
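- Transition matrix P (the Jukes-Cantor, or J-C, model; rows and columns ordered a, c, g, t) given by:

$$P = \begin{pmatrix} 1-3\alpha & \alpha & \alpha & \alpha \\ \alpha & 1-3\alpha & \alpha & \alpha \\ \alpha & \alpha & 1-3\alpha & \alpha \\ \alpha & \alpha & \alpha & 1-3\alpha \end{pmatrix}$$

- $\alpha$ is a parameter depending on what a “time unit” means. If the time unit represents more generations, $\alpha$ will be larger
- $\alpha$ must be less than 1/3 though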

- Whatever the current nucl. is, each of the other three nucls. is equally likely to substitute for it
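A minimal sketch of this transition matrix in plain Python, assuming the parameter is called `alpha` and the states are ordered a, c, g, t. The function name and the value 0.01 are illustrative only.

```python
NUCLEOTIDES = "acgt"  # assumed state order for rows and columns


def jukes_cantor_matrix(alpha):
    """Return the 4x4 one-step transition matrix as a list of rows."""
    # The diagonal entry 1 - 3*alpha must stay positive, hence alpha < 1/3.
    if not 0 < alpha < 1 / 3:
        raise ValueError("alpha must lie strictly between 0 and 1/3")
    return [[1 - 3 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


P = jukes_cantor_matrix(0.01)  # illustrative value of alpha
for nucleotide, row in zip(NUCLEOTIDES, P):
    print(nucleotide, row)
```

Every row sums to 1, and every off-diagonal entry is the same, which is exactly the "equally likely to substitute" assumption.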

- Consider a transition matrix P, and a probability vector v (a row vector)
- What does w = vP represent ?
- If v is the probability distribution of the 4 nucls (at a position) now, w is the prob. distr. at the next time step.
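The product w = vP can be sketched in a few lines of plain Python. The helper name `step` and the value of `alpha` are assumptions made for this illustration, and the matrix used is the one-parameter matrix described above.

```python
def step(v, P):
    """Return w = vP for a row vector v and a square matrix P."""
    n = len(v)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]


alpha = 0.1  # illustrative per-step substitution probability
P = [[1 - 3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

v = [1.0, 0.0, 0.0, 0.0]  # the position is certainly "a" now
w = step(v, P)            # distribution over a, c, g, t one time step later
print(w)
```

Starting from a certain "a", w is simply the first row of P: probability 1 - 3·alpha of staying "a" and alpha for each other nucleotide.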

- Suppose we can find a vector $\pi$ such that $\pi P = \pi$
- If the probability distribution is $\pi$, it will continue to remain $\pi$ at future times
- This is called the stationary distribution of the Markov Chain

- Check that $\pi$ = (0.25, 0.25, 0.25, 0.25) satisfies $\pi P = \pi$
- Therefore, if a position evolves as per this model, for long enough, it will be equally likely to have any of the 4 nucls!
- This is the very long term prediction, but can we write down what the position will be as a function of time (steps) ?
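The check, and the long-term behaviour, can be tried numerically. This is a sketch under assumed values (`alpha = 0.05`, 200 steps), not part of the original slides.

```python
def step(v, P):
    """Return the row vector vP."""
    n = len(v)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]


alpha = 0.05  # illustrative per-step substitution probability
P = [[1 - 3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

# The uniform vector is unchanged by one step of the chain.
pi = [0.25, 0.25, 0.25, 0.25]
pi_next = step(pi, P)
largest_change = max(abs(a - b) for a, b in zip(pi, pi_next))
print(largest_change)

# Starting from a certain "a", the distribution drifts toward uniform.
v = [1.0, 0.0, 0.0, 0.0]
for _ in range(200):
    v = step(v, P)
print(v)
```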

- Recall that we found a $\pi$ such that $\pi P = \pi$

- Such a vector is called an “eigenvector” of P, and the corresponding “eigenvalue” is 1.
- In general, if $vP = \lambda v$ (for scalar $\lambda$), $\lambda$ is called an eigenvalue, and v is a left eigenvector of P

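- Similarly, if $P u^T = \lambda u^T$, then u is called a right eigenvector
- In general, there may be multiple eigenvalues $\lambda_j$ and their corresponding left and right eigenvectors $v_j$ and $u_j$
- We can write P as $P = \sum_j \lambda_j u_j^T v_j$, with the eigenvectors scaled so that $v_j u_k^T$ is 1 if j = k and 0 otherwise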

- Then, for any positive integer n, it is true that $P^n = \sum_j \lambda_j^n u_j^T v_j$
- Why is $P^n$ interesting to us ?
- Because it tells us what the probability distribution will be after n time steps
- If we started with v, then $vP^n$ will be the prob. distr. after n steps
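Before using eigenvalues, the n-step distribution can be computed by brute force. The sketch below uses repeated multiplication rather than the eigen-decomposition, and the helper names (`mat_mul`, `mat_pow`, `distribution_after`) and the value of `alpha` are assumptions for illustration.

```python
def mat_mul(A, B):
    """Multiply two square matrices of the same size (lists of rows)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, n):
    """Return P^n for a non-negative integer n by repeated multiplication."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


def distribution_after(v, P, n):
    """Return the row vector v P^n."""
    Pn = mat_pow(P, n)
    size = len(v)
    return [sum(v[i] * Pn[i][j] for i in range(size)) for j in range(size)]


alpha = 0.02  # illustrative per-step substitution probability
P = [[1 - 3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
start = [1.0, 0.0, 0.0, 0.0]  # the position is "a" at time 0
for n in (1, 10, 100):
    print(n, distribution_after(start, P, n))
```

Repeated multiplication costs n matrix products; the eigenvalue route is attractive because it replaces them with scalar powers.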

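- We reasoned that $\pi$ = (.25,.25,.25,.25) is a left eigenvector for the eigenvalue 1.
- Actually, the J-C transition matrix has this eigenvalue and the eigenvalue $(1-4\alpha)$, and if we do the math we get the spectral decomposition of P as:

$$P^n = \frac{1}{4} J + (1-4\alpha)^n \left( I - \frac{1}{4} J \right)$$

where I is the 4×4 identity matrix and J is the 4×4 matrix of all ones (set n = 1 to recover P).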

- So, if we started with (1,0,0,0), i.e., an “a”, the probability that we’ll see an “a” at that position after n time steps is:
$0.25 + 0.75(1-4\alpha)^n$

- And the probability that the “a” would have mutated to say “c” is:
$0.25 - 0.25(1-4\alpha)^n$

- As a function of time n, we therefore get
- Pr(x -> y) = $0.25 + 0.75(1-4\alpha)^n$ if x = y
- and = $0.25 - 0.25(1-4\alpha)^n$ otherwise
- If n -> $\infty$, we get back our (0.25, 0.25, 0.25, 0.25) calculation
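The closed form can be cross-checked against step-by-step multiplication. This is a sketch with assumed names (`jc_probability`, `jc_probability_brute_force`) and an illustrative `alpha`; the state order a, c, g, t is also an assumption.

```python
ORDER = "acgt"  # assumed state order


def jc_probability(x, y, n, alpha):
    """Closed-form probability of seeing y after n steps, starting from x."""
    decay = (1 - 4 * alpha) ** n
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def jc_probability_brute_force(x, y, n, alpha):
    """Same quantity, computed by applying the transition matrix n times."""
    P = [[1 - 3 * alpha if i == j else alpha for j in range(4)]
         for i in range(4)]
    v = [1.0 if state == x else 0.0 for state in ORDER]
    for _ in range(n):
        v = [sum(v[i] * P[i][j] for i in range(4)) for j in range(4)]
    return v[ORDER.index(y)]


alpha = 0.03  # illustrative per-step substitution probability
for n in (0, 1, 5, 50):
    print(n,
          jc_probability("a", "a", n, alpha),
          jc_probability_brute_force("a", "a", n, alpha))
```

At n = 1 the two formulas reduce to 1 - 3·alpha and alpha, the entries of P itself.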

- The J-C model made highly “symmetric” assumptions, in its formulation of the transition matrix P
- In reality, for example, “transitions” are more common than “transversions”
- What are these? Purine = A or G. Pyrimidine = C or T. Transition is substitution in the same category; transversion is substitution across categories
- Purines are similarly sized, and pyrimidines are similarly sized. More likely to be replaced by similar sized nucl.
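The two categories translate directly into a small classifier. The function name `substitution_type` and its return strings are hypothetical, chosen only for this sketch.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def substitution_type(x, y):
    """Classify a change from nucleotide x to nucleotide y."""
    x, y = x.lower(), y.lower()
    for nucleotide in (x, y):
        if nucleotide not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a nucleotide: {nucleotide!r}")
    if x == y:
        return "none"
    # Same category (purine/purine or pyrimidine/pyrimidine) is a transition.
    same_category = (x in PURINES) == (y in PURINES)
    return "transition" if same_category else "transversion"


for pair in ("ag", "ct", "ac", "gt"):
    print(pair, substitution_type(pair[0], pair[1]))
```

Each nucleotide has one possible transition and two possible transversions, so a bias toward transitions is not visible from counting alone.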

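- The “Kimura” model captures this transition/transversion bias

$$P = \begin{pmatrix} 1-\alpha-2\beta & \alpha & \beta & \beta \\ \alpha & 1-\alpha-2\beta & \beta & \beta \\ \beta & \beta & 1-\alpha-2\beta & \alpha \\ \beta & \beta & \alpha & 1-\alpha-2\beta \end{pmatrix}$$

(rows and columns ordered a, g, c, t; $\alpha$ is the probability of the transition and $\beta$ the probability of each transversion, per time unit)

- This of course is the transition probability matrix P of the Markov chain
- Two parameters now, instead of one.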

- Again, one of the eigenvalues is 1, and the left eigenvector corresponding to it is $\pi$ = (.25,.25,.25,.25)
- So again, the stationary distribution is uniform
- P(x -> x) = $.25 + .25(1-4\beta)^n + .5(1-2(\alpha+\beta))^n$
- P(x -> y) = $.25 + .25(1-4\beta)^n - .5(1-2(\alpha+\beta))^n$ if x is a purine and y is the other purine (and likewise for the two pyrimidines)
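A sketch that builds a two-parameter matrix of this kind and checks closed-form n-step probabilities against repeated multiplication. It assumes `alpha` is the per-step probability of the transition, `beta` the per-step probability of each of the two transversions, and the state order a, g, c, t. The transversion formula in the last branch is not in the text above; it follows from the requirement that each row sums to 1.

```python
ORDER = "agct"        # assumed order: purines first, then pyrimidines
PURINES = {"a", "g"}


def kimura_matrix(alpha, beta):
    """One-step transition matrix with separate transition/transversion rates."""
    def entry(x, y):
        if x == y:
            return 1 - alpha - 2 * beta
        same_category = (x in PURINES) == (y in PURINES)
        return alpha if same_category else beta
    return [[entry(x, y) for y in ORDER] for x in ORDER]


def kimura_probability(x, y, n, alpha, beta):
    """Closed-form probability of seeing y after n steps, starting from x."""
    term_b = (1 - 4 * beta) ** n
    term_ab = (1 - 2 * (alpha + beta)) ** n
    if x == y:
        return 0.25 + 0.25 * term_b + 0.5 * term_ab
    if (x in PURINES) == (y in PURINES):        # transition
        return 0.25 + 0.25 * term_b - 0.5 * term_ab
    return 0.25 - 0.25 * term_b                 # transversion


def brute_force(x, y, n, alpha, beta):
    """Same quantity, computed by applying the matrix n times."""
    P = kimura_matrix(alpha, beta)
    v = [1.0 if state == x else 0.0 for state in ORDER]
    for _ in range(n):
        v = [sum(v[i] * P[i][j] for i in range(4)) for j in range(4)]
    return v[ORDER.index(y)]


alpha, beta = 0.04, 0.01  # illustrative values with alpha > beta
for y in ORDER:
    print("a ->", y, kimura_probability("a", y, 20, alpha, beta),
          brute_force("a", y, 20, alpha, beta))
```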

- Get to greater levels of realism
- Kimura model still has a uniform stationary distribution, which is not true of real data
- One extension: purine to pyrimidine subst. prob. is different from pyrimidine to purine subst. prob.
- This leads to a non-uniform stationary probability

Transition probability proportional to the stationary probability of the target nucleotide. Stationary distribution is $(\pi_a, \pi_g, \pi_c, \pi_t)$.
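One way to read this description is that the probability of changing from x to a different nucleotide y is u times the stationary probability of y. The sketch below assumes that reading; the proportionality constant `u`, the function name, and the example frequencies are all hypothetical.

```python
def stationary_proportional_matrix(pi, u):
    """Off-diagonal entries are u * pi[y]; the diagonal completes each row."""
    size = len(pi)
    P = []
    for x in range(size):
        row = [u * pi[y] if y != x else 0.0 for y in range(size)]
        row[x] = 1.0 - sum(row)  # probability of no substitution
        if row[x] < 0:
            raise ValueError("u is too large for these frequencies")
        P.append(row)
    return P


pi = [0.3, 0.2, 0.2, 0.3]  # illustrative frequencies for a, g, c, t
P = stationary_proportional_matrix(pi, u=0.1)

# One step of the chain leaves pi unchanged, so pi is stationary.
pi_next = [sum(pi[x] * P[x][y] for x in range(4)) for y in range(4)]
print(pi_next)
```

Under this construction the stationary distribution is whatever frequency vector is supplied, so it need not be uniform.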

- Many inference procedures require that the evolutionary model be time reversible
- What does this mean?

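A time-reversible chain at stationarity looks statistically the same when time is reversed. That is, the chain is time reversible if we can find a $\pi$ such that $\pi_x P(x \to y) = \pi_y P(y \to x)$ for all x and y.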

The models we have seen today all have this property.
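The condition is easy to test numerically for a given matrix and candidate distribution. The sketch below checks it for a one-parameter and a two-parameter matrix of the kinds discussed above, with illustrative parameter values, and contrasts them with a made-up cyclic chain (not from the lecture) that has a uniform stationary distribution but is not reversible.

```python
def is_time_reversible(pi, P, tol=1e-12):
    """Check pi[x] * P[x][y] == pi[y] * P[y][x] for every pair of states."""
    n = len(pi)
    return all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) <= tol
               for x in range(n) for y in range(n))


uniform = [0.25, 0.25, 0.25, 0.25]
alpha, beta = 0.04, 0.01  # illustrative parameter values

jukes_cantor = [[1 - 3 * alpha if i == j else alpha for j in range(4)]
                for i in range(4)]

stay = 1 - alpha - 2 * beta
kimura = [  # states ordered a, g, c, t
    [stay, alpha, beta, beta],
    [alpha, stay, beta, beta],
    [beta, beta, stay, alpha],
    [beta, beta, alpha, stay],
]


def cyclic_entry(i, j):
    """Hypothetical chain that prefers moving 'forward' around a cycle."""
    if i == j:
        return 0.7
    if j == (i + 1) % 4:
        return 0.2
    if j == (i - 1) % 4:
        return 0.1
    return 0.0


cyclic = [[cyclic_entry(i, j) for j in range(4)] for i in range(4)]

print(is_time_reversible(uniform, jukes_cantor))
print(is_time_reversible(uniform, kimura))
print(is_time_reversible(uniform, cyclic))
```

Symmetric matrices with a uniform stationary distribution always pass; the cyclic chain fails because moving forward around the cycle is twice as likely as moving backward.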

Source: Wikipedia