Accelerated Statistical Alignment with TKF91 Model for Homology Testing

Statistical Alignment: Computational Properties, Homology Testing and Goodness-of-Fit
J. Hein, C. Wiuf, B. Knudsen, M.B. Moller and G. Wibling

Main Objective of the paper
• To show how to accelerate statistical alignment algorithms by several orders of magnitude, using the insertion-deletion model of Thorne, Kishino, and Felsenstein from 1991 (the TKF91 model).
• To propose a new homology test based on the model.
• To describe a goodness-of-fit test for the insertion-deletion process inherent to the model.

Why isn’t statistical alignment popular?
• Computationally VERY SLOW
  • The authors of the paper accelerated the statistical alignment algorithms by several orders of magnitude compared with the original TKF91 algorithm.
• Lack of user-friendly software?
  • Programs are usually written in Fortran or C, or the compiled program only runs in a UNIX environment, which most biologists are not familiar with.
  • The authors of the paper have provided a web interface to the program.

Parsimony and similarity alignments
• Parsimony strategy: minimizing the distance
  • For example:
• Similarity strategy: maximizing the similarity score
  • For example: BLAST
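The two strategies differ only in what is optimized, which a small dynamic program can make concrete. The sketch below is illustrative: the function names and the scoring values (unit costs for the distance; +1, -1, -1 for the similarity score) are assumptions chosen for the example, not values from the paper, and BLAST itself uses local rather than global alignment.

```python
def best_alignment_score(a, b, pair_score, gap_score, choose):
    """Global alignment score by dynamic programming.

    choose is min for a parsimony (distance) criterion and
    max for a similarity criterion.
    """
    n, m = len(a), len(b)
    f = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against the empty string uses only gaps.
    for i in range(1, n + 1):
        f[i][0] = i * gap_score
    for j in range(1, m + 1):
        f[0][j] = j * gap_score
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = choose(
                f[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),  # aligned pair
                f[i - 1][j] + gap_score,                          # gap in b
                f[i][j - 1] + gap_score,                          # gap in a
            )
    return f[n][m]


def edit_distance(a, b):
    # Parsimony: mismatches and gaps cost 1, matches cost 0; minimize.
    return best_alignment_score(a, b, lambda x, y: 0 if x == y else 1, 1, min)


def similarity_score(a, b):
    # Similarity: match +1, mismatch -1, gap -1 (illustrative values); maximize.
    return best_alignment_score(a, b, lambda x, y: 1 if x == y else -1, -1, max)
```

For instance, `edit_distance("AGG", "CTG")` counts the two substitutions that separate the sequences, while `similarity_score` rewards the single matching position and penalizes the rest.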

TKF91 model of substitutions
• Continuous-time Markov model on the state space of nucleotides or amino acids
• A rate matrix Q is specified
  • It describes the intensity of the different substitution events over an infinitesimal time period.
• Probability that i has changed to j after time t is $P_{ij}(t) = [e^{Qt}]_{ij}$
• The process is assumed to be time reversible: $\pi_i P_{ij}(t) = \pi_j P_{ji}(t)$, where $\pi$ is the equilibrium distribution
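For a continuous-time Markov chain with rate matrix Q, the transition probabilities after time t are the entries of the matrix exponential exp(Qt). The following is a minimal sketch in pure Python that approximates this by scaling and squaring with a truncated Taylor series. The Jukes-Cantor rate matrix and the names `transition_matrix` and `jukes_cantor_q` are assumptions made for illustration; the slides do not say which rate matrix is used.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    # Taylor series for exp(a), where a has been scaled to be small.
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    # Undo the scaling: exp(Q t) = exp(a) ** (2 ** squarings).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def jukes_cantor_q(alpha):
    """Jukes-Cantor rate matrix: every substitution has rate alpha (assumed model)."""
    return [[-3 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]
```

Each row of the result sums to one, and because the Jukes-Cantor matrix is symmetric with a uniform equilibrium distribution, the reversibility condition holds entry by entry. In practice a library routine for the matrix exponential would usually replace the hand-written series.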

TKF91 model of the indel process
• Can be viewed as a Markov model with all sequences as possible states
• Indel part of the model
  • links connect the letters of the sequences
  • each letter has a mortal link on its right
  • the left end has an immortal link
• For example: A G G
• If the type of the nucleotide is ignored, can be represented as

TKF91 model
• A mortal link can give birth to a new mortal link or die out
• The immortal link can also give birth but never dies
• Therefore, the rates can be written as:
      A      G      G
  I0  S1  I1/D1  S2  I2/D2  S3  I3/D3
  where I is the birth rate, D is the death rate (D > I), and S is the substitution rate
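The condition D > I matters because, under this birth-death process, the number of letters in a sequence has a geometric equilibrium distribution with ratio I/D, which is only a proper distribution when the ratio is below one. This is a standard property of the TKF91 model rather than something stated on the slide. A minimal sketch, with hypothetical function names:

```python
def equilibrium_length_prob(n, birth, death):
    """Equilibrium probability of a sequence with n letters: (1 - I/D) * (I/D)**n."""
    if not 0 < birth < death:
        raise ValueError("the equilibrium distribution requires 0 < birth < death")
    ratio = birth / death
    return (1 - ratio) * ratio ** n


def expected_length(birth, death):
    """Mean of the geometric length distribution: I / (D - I)."""
    if not 0 < birth < death:
        raise ValueError("the equilibrium distribution requires 0 < birth < death")
    return birth / (death - birth)
```

As the birth rate approaches the death rate, the expected length grows without bound, so the ratio of the two rates effectively sets the typical sequence length.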

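TKF91 model
To calculate the probability of a particular alignment:
s(1):  A  T  -
s(2):  C  T  G
$P(s^{(1)}, s^{(2)}, \text{alignment}) = p''_1 \, (\pi_A \, p_1 \, P_{AC}) \, (\pi_T \, p_2 \, P_{TT} \, \pi_G)$
Here $p''_1$ is the probability that the immortal link leaves no new descendants, $p_n$ is the probability that a mortal link survives with n descendants (itself included), $\pi$ denotes the equilibrium letter frequencies, and $P_{ij}$ is the substitution probability.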
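The factors in this product can be evaluated numerically. The sketch below uses the link probabilities from the TKF91 model, written with the slide's I (birth) and D (death) rates: a mortal link survives with n descendants with probability exp(-D t)(1 - I b)(I b)^(n-1), and the immortal link has n descendants with probability (1 - I b)(I b)^(n-1), where b = (1 - exp((I - D) t)) / (D - I exp((I - D) t)). The Jukes-Cantor substitution model, the uniform letter frequencies, and all function names are assumptions for illustration. Like the slide's formula, the product omits the term for the length of the first sequence.

```python
import math


def beta(birth, death, t):
    """Auxiliary quantity b(t) of the TKF91 link probabilities."""
    e = math.exp((birth - death) * t)
    return (1 - e) / (death - birth * e)


def p_survive(n, birth, death, t):
    """Mortal link survives and has n descendants, itself included (n >= 1)."""
    lb = birth * beta(birth, death, t)
    return math.exp(-death * t) * (1 - lb) * lb ** (n - 1)


def p_immortal(n, birth, death, t):
    """Immortal link has n descendants, itself included (n >= 1)."""
    lb = birth * beta(birth, death, t)
    return (1 - lb) * lb ** (n - 1)


def jc_prob(i, j, alpha, t):
    """Jukes-Cantor substitution probability (assumed substitution model)."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def example_alignment_probability(birth, death, alpha, t):
    """Probability of the alignment A T - over C T G, following the slide's product."""
    pi = 0.25  # uniform equilibrium frequency under Jukes-Cantor
    immortal = p_immortal(1, birth, death, t)                 # no insertion at the left end
    column_a = pi * p_survive(1, birth, death, t) * jc_prob("A", "C", alpha, t)
    column_t = pi * p_survive(2, birth, death, t) * jc_prob("T", "T", alpha, t) * pi  # T kept, G inserted
    return immortal * column_a * column_t
```

A call such as `example_alignment_probability(0.03, 0.04, 0.1, 1.0)` returns the joint probability of the two sequences and this one alignment under the assumed parameter values.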

Calculating the probability of two sequences
• Without conditioning on the alignment, it is necessary to sum over all alignments, weighted by their probabilities under the TKF91 process.
• Confining the likelihood calculations to a band close to the similarity-based alignment allows an efficient numerical optimization algorithm for finding the maximum likelihood estimate.
• The recursions originally presented by Thorne, Kishino and Felsenstein can be simplified.
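The idea of summing over all alignments, and of restricting that sum to a band, can be sketched with a generic dynamic program. This is not the TKF91 recursion or the simplified recursion from the paper: the weights are supplied by the caller, and the band is taken around the main diagonal rather than around a similarity-based alignment, both simplifying assumptions. The function name and parameters are hypothetical.

```python
def sum_over_alignments(a, b, pair_weight, gap_weight, band=None):
    """Sum the weights of all alignments of a and b.

    An alignment's weight is the product of its column weights.
    If band is given, only cells with |i - j| <= band are filled,
    so alignments straying further from the diagonal are ignored.
    """
    n, m = len(a), len(b)
    # f[i][j]: total weight of all alignments of a[:i] with b[:j]
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if band is not None and abs(i - j) > band:
                continue  # outside the band: contributes weight zero
            total = 0.0
            if i > 0 and j > 0:
                total += f[i - 1][j - 1] * pair_weight(a[i - 1], b[j - 1])
            if i > 0:
                total += f[i - 1][j] * gap_weight  # letter of a against a gap
            if j > 0:
                total += f[i][j - 1] * gap_weight  # letter of b against a gap
            f[i][j] = total
    return f[n][m]
```

With all weights set to one, the function simply counts alignments, which makes the effect of the band easy to see: the banded sum leaves out the alignments that wander far from the diagonal. In a likelihood setting those are typically the alignments with negligible probability, which is why the restriction saves time at little cost in accuracy. Note that a band narrower than the length difference of the two sequences returns zero.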
