Stochastic Models For Heterogeneous DNA Sequences

Stochastic Models For Heterogeneous DNA Sequences. Churchill, G.A. Bulletin of Mathematical Biology, Vol. 51, pp. 79-94, 1989. Presented by: Matthew McCall

DNA Representation • A typical representation of a DNA sequence is a single strand written from the 5’ to 3’ end. • Consider two binary representations – the purine-pyrimidine (AG-TC) and strong-weak hydrogen-bonding (GC-AT). • Together these two yield the same information as the single strand representation.
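The two binary codings above can be sketched in a few lines of Python. The names `to_binary` and `from_binary`, and the choice of which class is coded as 1, are illustrative assumptions and are not taken from the paper. The round trip shows that the two bit strings together recover the original bases.

```python
# Illustrative sketch: which class maps to 1 is an arbitrary assumption.
PURINES = {"A", "G"}  # purine -> 1, pyrimidine (T, C) -> 0
STRONG = {"G", "C"}   # strong hydrogen bonding -> 1, weak (A, T) -> 0

# Each (purine bit, strong bit) pair identifies exactly one base.
DECODE = {(1, 0): "A", (1, 1): "G", (0, 1): "C", (0, 0): "T"}


def to_binary(seq):
    """Return (purine_pyrimidine_bits, strong_weak_bits) for a DNA string."""
    seq = seq.upper()
    bad = set(seq) - set("ACGT")
    if bad:
        raise ValueError(f"unexpected symbols: {sorted(bad)}")
    ry = [1 if base in PURINES else 0 for base in seq]
    sw = [1 if base in STRONG else 0 for base in seq]
    return ry, sw


def from_binary(ry, sw):
    """Rebuild the base sequence from the two binary representations."""
    return "".join(DECODE[(r, s)] for r, s in zip(ry, sw))


if __name__ == "__main__":
    ry, sw = to_binary("GATTACA")
    print(ry, sw, from_binary(ry, sw))
```

Either bit string alone loses information (for example, the strong-weak coding cannot tell G from C), which is why the slide treats them as two separate binary sequences.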

Stochastic Modeling • Stochastic process: a non-deterministic process, in which the next state of the environment is not fully determined by the previous state. • Here, stochastic modeling is a method for extracting information from the sequence. • It does not try to mimic the process followed in nature, which is far too complex for any simple model.

Markov Chains • Markov Chains have traditionally been used to model sequences. • They work well when the sequence composition is fairly homogeneous but not when the composition is heterogeneous. • Compositional variation between segments of the sequence is likely to reflect functional or structural differences.

Proposed Solution • Sequences are assumed to be composed of homogeneous regions, which differ from one another. • As such, each region can be classified into one of a finite number of states. • These states represent the underlying structure of the sequence in that region and are assumed to develop according to a hidden Markov process.

The General Case • A sequence of random variables $\{Y_i : i = 1, \dots, n\}$ is observed. • Each has a corresponding unobservable state, $\{S_i\}$. • Denote the sequence of observed outcomes up to time $t$ by $y^t = \{y_1, \dots, y_t\}$ • and, similarly, the states by $s^t = \{s_1, \dots, s_t\}$. • Then the probability of an observation given the current state and past observations is $\Pr(y_t \mid s_t, y^{t-1})$. This is called the observation equation.

The General Case (cont.) • The sequence of states, $s^t$, cannot be observed; however, the states can be inferred from the observations. • The states are assumed to evolve according to a set of system equations with the Markov property: $\Pr(s_t \mid s^{t-1}) = \Pr(s_t \mid s_{t-1})$. • The problem addressed here is how to estimate the states $\{S_i\}$ given the observations $\{Y_i\}$. • The result, the smoothed estimate at time $t$, will be denoted $\Pr(s_t \mid y^n)$.
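The slides do not spell out how the smoothed estimate is computed. One standard way to obtain it for a model of this form, given here as a sketch rather than as the paper's exact formulation, is a forward pass (prediction and filtering) followed by a backward smoothing pass. Here $y^t$ denotes the observations up to time $t$, and the forward pass is started from an assumed initial distribution $\Pr(s_1)$.

```latex
\begin{align*}
\text{Prediction:}\quad
  & \Pr(s_t \mid y^{t-1}) = \sum_{s_{t-1}} \Pr(s_t \mid s_{t-1}) \, \Pr(s_{t-1} \mid y^{t-1}) \\
\text{Filtering:}\quad
  & \Pr(s_t \mid y^{t}) =
    \frac{\Pr(y_t \mid s_t, y^{t-1}) \, \Pr(s_t \mid y^{t-1})}
         {\sum_{s_t} \Pr(y_t \mid s_t, y^{t-1}) \, \Pr(s_t \mid y^{t-1})} \\
\text{Smoothing:}\quad
  & \Pr(s_t \mid y^{n}) = \Pr(s_t \mid y^{t})
    \sum_{s_{t+1}} \frac{\Pr(s_{t+1} \mid s_t) \, \Pr(s_{t+1} \mid y^{n})}
                        {\Pr(s_{t+1} \mid y^{t})}
\end{align*}
```

The forward pass runs for $t = 1, \dots, n$; the backward pass starts from the final filtered estimate $\Pr(s_n \mid y^n)$ and runs for $t = n-1, \dots, 1$.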

Parameter Estimation • The state-estimation algorithm requires that the parameters of the observation and system equations be specified. • These parameters are typically estimated from the data. • The paper suggests an EM algorithm for determining the approximate maximum likelihood estimate of these parameters.
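In generic form, an EM iteration for a hidden-state model alternates the two steps below. This is the textbook statement of EM, not a transcription of the paper's update equations. The symbol $\theta$ (all parameters of the observation and system equations) and the iteration index $k$ are notation introduced here for illustration.

```latex
\begin{align*}
\text{E-step:}\quad
  & Q(\theta \mid \theta^{(k)}) =
    \mathrm{E}\!\left[ \log \Pr(y^n, s^n; \theta) \,\middle|\, y^n; \theta^{(k)} \right] \\
\text{M-step:}\quad
  & \theta^{(k+1)} = \arg\max_{\theta} \, Q(\theta \mid \theta^{(k)})
\end{align*}
```

For a hidden Markov model, the expectation in the E-step only needs the smoothed state probabilities and the smoothed probabilities of adjacent state pairs, which is why parameter estimation and state estimation are so closely linked.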

Application to DNA Sequences • Bases of a sequence are viewed as the observed outcomes {Yi}. • The states {Si} are assumed to be fixed and finite in number. • Each region then has a specific state.

The Simplest Case • Think back to the binary representations (for example, strong-weak hydrogen bonding). • A sequence of binary outcomes $y_t \in \{0,1\}$, independent given the underlying states $s_t \in \{0,1\}$. • Then the observation equation is binomial (a single Bernoulli trial per position): $\Pr(y_t \mid s_t = j) = p_j^{y_t} (1 - p_j)^{1 - y_t}$, $j \in \{0,1\}$, where $p_0 = \Pr(y_t = 1 \mid s_t = 0)$ and $p_1 = \Pr(y_t = 1 \mid s_t = 1)$.

The Simplest Case (cont.) • The system equations are: $\Pr(s_t = j \mid s_{t-1} = i) = \lambda_{ij}$. • Define the transition probabilities: $\lambda_{01} = \Pr(s_t = 1 \mid s_{t-1} = 0)$ and $\lambda_{10} = \Pr(s_t = 0 \mid s_{t-1} = 1)$. • Both are assumed to be small, so states tend to persist. • In this case, the lengths of the regions have geometric distributions with means equal to the reciprocals of the transition probabilities, $1/\lambda_{01}$ and $1/\lambda_{10}$.
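The claim about region sizes can be checked with a small simulation. The sketch below assumes a two-state chain that starts in state 0. The names `simulate_states`, `region_lengths`, `lam01`, and `lam10` are hypothetical, with `lam01` and `lam10` standing for the two transition probabilities (0 to 1 and 1 to 0).

```python
import random


def simulate_states(n, lam01, lam10, seed=0):
    """Simulate n steps of a two-state Markov chain, starting in state 0."""
    rng = random.Random(seed)
    state = 0
    states = []
    for _ in range(n):
        states.append(state)
        leave = lam01 if state == 0 else lam10
        if rng.random() < leave:  # switch state with the transition probability
            state = 1 - state
    return states


def region_lengths(states):
    """Return maximal runs of equal states as a list of (state, length)."""
    runs = []
    start = 0
    for t in range(1, len(states) + 1):
        if t == len(states) or states[t] != states[start]:
            runs.append((states[start], t - start))
            start = t
    return runs


if __name__ == "__main__":
    # Drop the final run, which is cut off by the end of the simulation.
    runs = region_lengths(simulate_states(200_000, 0.01, 0.01))[:-1]
    for s in (0, 1):
        lengths = [length for state, length in runs if state == s]
        print(s, sum(lengths) / len(lengths))  # should be near 1 / 0.01 = 100
```

With a transition probability of 0.01, each region lasts 100 positions on average, so persistent states produce long homogeneous stretches.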

GC Clusters in Yeast mtDNA • The mitochondrial genome of yeast is 85 kb and appears to have three distinct types of regions with differing GC content. • Intergenic segments consist of large stretches of DNA with less than 5% GC content intermingled with short stretches with greater than 50% GC content. • The genes themselves have GC content of 18 to 28%.

GC Clusters in Yeast mtDNA (cont.) • Respiration-deficient mutants, ρ-, arise through a deletion of most of the wild-type (ρ+) DNA. • The small amount left is amplified in tandem repeats, replicated, and maintained in the mitochondrion. • When mated with wild-type strains, some of the ρ- mutants displace the ρ+ DNA in all the diploid descendants. • These are called hypersuppressive ρ- genomes.

GC Clusters in Yeast mtDNA (cont.) • The paper considers two hypersuppressive and two non-suppressive phenotypes. • Because the segments considered contain no coding regions, the binary model is used. • The model parameters are set at $p_0 = 0.9$ and $p_1 = 0.5$, with both state-transition probabilities (0 to 1 and 1 to 0) set to 0.01.
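To make the binary model concrete, here is a minimal sketch of a forward filter followed by a backward smoother for a two-state model with the parameter values quoted above. It is not the paper's code. The function name `smooth_states`, the uniform initial state distribution, and the synthetic input sequence are all assumptions made for illustration.

```python
def smooth_states(y, p=(0.9, 0.5),
                  lam=((0.99, 0.01), (0.01, 0.99)),
                  init=(0.5, 0.5)):
    """Return smoothed probabilities Pr(s_t = j | all of y) for each t.

    y    : list of 0/1 observations
    p    : p[j] = Pr(y_t = 1 | s_t = j)
    lam  : lam[i][j] = Pr(s_t = j | s_{t-1} = i)
    init : assumed distribution of the first state (not given in the slides)
    """
    n, k = len(y), len(p)
    if n == 0:
        return []

    def emit(j, obs):
        return p[j] if obs == 1 else 1.0 - p[j]

    # Forward pass: pred[t] conditions on y[0..t-1], filt[t] on y[0..t].
    pred, filt = [list(init)], []
    for t in range(n):
        w = [emit(j, y[t]) * pred[t][j] for j in range(k)]
        total = sum(w)
        filt.append([x / total for x in w])
        pred.append([sum(filt[t][i] * lam[i][j] for i in range(k))
                     for j in range(k)])

    # Backward pass: start from the last filtered estimate.
    smooth = [None] * n
    smooth[n - 1] = filt[n - 1]
    for t in range(n - 2, -1, -1):
        smooth[t] = [
            filt[t][i] * sum(lam[i][j] * smooth[t + 1][j] / pred[t + 1][j]
                             for j in range(k))
            for i in range(k)
        ]
    return smooth


if __name__ == "__main__":
    # Synthetic data: 60 ones, then 60 alternating symbols.
    y = [1] * 60 + [0, 1] * 30
    profile = smooth_states(y)
    print(round(profile[10][0], 3), round(profile[100][1], 3))
```

Plotting the smoothed probability of one state along the sequence gives the kind of profile the next slide compares across segments.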

Yeast mtDNA Results • The profiles of the hypersuppressive segments share a general structure of GC-rich regions and AT-rich regions that differ from the non-suppressive segments. • This suggests that the GC content is in some way related to the function of these hypersuppressive sequences.

Conclusion • In its simplest form, this algorithm is useful for studying compositional heterogeneity in DNA. • The stochastic models proposed are useful in extracting information from large and complex data sets such as DNA sequence data. • This algorithm could be used to study relationships between DNA primary structure and global organization of entire chromosomes.
