Stochastic Models For Heterogeneous DNA Sequences. Churchill, G.A. Bulletin of Mathematical Biology, Vol. 51, pp. 79-94, 1989. Presented By: Matthew McCall
DNA Representation • A typical representation of a DNA sequence is a single strand written from the 5’ to 3’ end. • Consider two binary representations – the purine-pyrimidine (AG-TC) and strong-weak hydrogen-bonding (GC-AT). • Together these two yield the same information as the single strand representation.
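As a small illustration of the two binary representations, the sketch below encodes a sequence as a purine/pyrimidine track and a strong/weak track, then recovers the bases from the pair. The function names and the choice of which class is coded as 1 are assumptions made for this example, not taken from the paper.

```python
# Assumed coding for this sketch: 1 = purine (A, G), 1 = strong (G, C).
PURINES = {"A", "G"}
STRONG = {"G", "C"}

def encode(seq):
    """Return (purine/pyrimidine bits, strong/weak bits) for a DNA string."""
    seq = seq.upper()
    bad = set(seq) - set("ACGT")
    if bad:
        raise ValueError(f"unexpected symbols: {sorted(bad)}")
    ry = [1 if base in PURINES else 0 for base in seq]
    sw = [1 if base in STRONG else 0 for base in seq]
    return ry, sw

# Each (purine bit, strong bit) pair identifies exactly one base.
DECODE = {(1, 0): "A", (1, 1): "G", (0, 1): "C", (0, 0): "T"}

def decode(ry, sw):
    """Rebuild the base sequence from the two binary tracks."""
    return "".join(DECODE[(r, s)] for r, s in zip(ry, sw))

ry, sw = encode("GATTACA")
assert decode(ry, sw) == "GATTACA"  # the two tracks together lose nothing
```

Either track alone is ambiguous (for example, the strong/weak track cannot tell G from C), which is why both are needed to match the single-strand representation.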
Stochastic Modeling • Stochastic process: a non-deterministic process, in which the next state of the environment is not fully determined by the previous state. • Here stochastic modeling is used as a method for extracting information from the sequence. • It doesn’t try to mimic the process followed in nature, which is far too complex for any simple model.
Markov Chains • Markov Chains have traditionally been used to model sequences. • They work well when the sequence composition is fairly homogeneous but not when the composition is heterogeneous. • Compositional variation between segments of the sequence is likely to reflect functional or structural differences.
Proposed Solution • Sequences are assumed to be composed of homogeneous regions, which differ from one another. • As such, each region can be classified into one of a finite number of states. • These states represent the underlying structure of the sequence in that region and are assumed to develop according to a hidden Markov process.
The General Case • A sequence of random variables $\{Y_i : i = 1, \dots, n\}$ • Corresponding unobservable states $\{S_i\}$ • Denote the sequence of observed outcomes up to time $t$ by $y^t = \{y_1, \dots, y_t\}$ • And similarly the states $s^t = \{s_1, \dots, s_t\}$ • Then the probability of an observation given the current state and past observations is: $\Pr(y_t \mid s_t, y^{t-1})$. This is called the observation equation.
The General Case (cont.) • The sequence of states, $s^t$, cannot be observed; however, the states can be inferred from the observations. • The states are assumed to evolve according to a set of system equations with the Markov property: $\Pr(s_t \mid s^{t-1}) = \Pr(s_t \mid s_{t-1})$ • The problem addressed here is how to estimate the states $\{S_i\}$ given the observations $\{Y_i\}$. • The result, the smoothed estimate at time $t$, will be denoted $\Pr(s_t \mid y^n)$.
Computing $\Pr(s_t \mid y^n)$ • $\Pr(s_t \mid y^n) = \Pr(s_t \mid y^t) \int \frac{\Pr(s_{t+1} \mid y^n)\,\Pr(s_{t+1} \mid s_t)}{\Pr(s_{t+1} \mid y^t)}\, ds_{t+1}$ • We can compute each of these terms, so the smoothed estimate at time $t$ can be expressed in terms of the system equations, $\Pr(s_{t+1} \mid s_t)$, quantities derived from filtering, $\Pr(s_t \mid y^t)$ and $\Pr(s_{t+1} \mid y^t)$, and the smoothed estimate at time $t+1$, $\Pr(s_{t+1} \mid y^n)$. • So we can employ a recursive algorithm to compute the smoothed estimate at each time $t$.
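When the states take finitely many values, the integral in the smoothing formula becomes a sum. The slides do not spell out the filtering step, so the prediction and update lines below are the standard filtering recursions, written out here as an assumption about what "quantities derived from filtering" refers to. Only the smoothing line restates the formula above.

```latex
\begin{align*}
\text{prediction:}\quad
  & \Pr(s_{t+1} \mid y^{t})
    = \sum_{s_t} \Pr(s_{t+1} \mid s_t)\,\Pr(s_t \mid y^{t}) \\
\text{update:}\quad
  & \Pr(s_{t+1} \mid y^{t+1})
    = \frac{\Pr(y_{t+1} \mid s_{t+1}, y^{t})\,\Pr(s_{t+1} \mid y^{t})}
           {\sum_{s'_{t+1}} \Pr(y_{t+1} \mid s'_{t+1}, y^{t})\,\Pr(s'_{t+1} \mid y^{t})} \\
\text{smoothing:}\quad
  & \Pr(s_t \mid y^{n})
    = \Pr(s_t \mid y^{t}) \sum_{s_{t+1}}
      \frac{\Pr(s_{t+1} \mid y^{n})\,\Pr(s_{t+1} \mid s_t)}{\Pr(s_{t+1} \mid y^{t})}
\end{align*}
```

The forward pass runs prediction and update for $t = 1, \dots, n$. The backward pass starts from the last filtered value, $\Pr(s_n \mid y^n)$, and applies the smoothing line for $t = n-1, \dots, 1$.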
Parameter Estimation • The previous algorithm requires that the parameters of the observation and system equations be specified. • These parameters are typically estimated from the data. • The paper suggests an EM algorithm for determining the approximate maximum likelihood estimate of these parameters.
Application to DNA Sequences • Bases of a sequence are viewed as the observed outcomes {Yi}. • The states {Si} are assumed to be fixed and finite in number. • Each region then has a specific state.
The Simplest Case • Think back to the binary representations (strong-weak hydrogen-bonding). • A sequence of independent binary outcomes $y_t \in \{0,1\}$ which depend on the underlying states $s_t \in \{0,1\}$. • Then the observation equation would be binomial: $\Pr(y_t \mid s_t = j) = p_j^{y_t} (1 - p_j)^{1 - y_t}$, $j \in \{0,1\}$, where $p_0 = \Pr(y_t = 1 \mid s_t = 0)$ and $p_1 = \Pr(y_t = 1 \mid s_t = 1)$
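The observation equation for the binary case can be written as a one-line function. This is a minimal sketch: the function name is made up for the example, and the values in `p` are hypothetical and chosen only for illustration.

```python
def observation_prob(y, state, p):
    """Pr(y_t = y | s_t = state), where p[j] = Pr(y_t = 1 | s_t = j)."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    # p_j^y * (1 - p_j)^(1 - y): picks p_j when y = 1 and 1 - p_j when y = 0.
    return p[state] ** y * (1 - p[state]) ** (1 - y)

p = (0.8, 0.3)  # hypothetical (p_0, p_1), for illustration only

# For each state the two outcomes are exhaustive, so the probabilities sum to 1.
for j in (0, 1):
    total = observation_prob(0, j, p) + observation_prob(1, j, p)
    assert abs(total - 1.0) < 1e-12
```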
The Simplest Case (cont.) • The system equations would be: $\Pr(s_t = j \mid s_{t-1} = i) = \lambda_{ij}$ • Define the transition probabilities: $\lambda_{01} = \Pr(s_t = 1 \mid s_{t-1} = 0)$ and $\lambda_{10} = \Pr(s_t = 0 \mid s_{t-1} = 1)$ • Both these are assumed to be small, so states tend to persist. • In this case, the size of the different regions will have a geometric distribution with means equal to the reciprocals of the transition probabilities.
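The claim about region sizes can be checked numerically. If the chain leaves a state with probability q at each step, a region has length k with probability (1 - q)^(k-1) q, and summing k times that probability should give 1/q. The sketch below does that sum directly; the function name, the truncation point, and the values of q are assumptions made for the example.

```python
def mean_region_length(leave_prob, max_len=100_000):
    """Truncated sum of k * (1 - q)^(k - 1) * q, the mean of a geometric length."""
    total = 0.0
    stay = 1.0  # holds (1 - q)^(k - 1) for the current k
    for k in range(1, max_len + 1):
        total += k * stay * leave_prob
        stay *= 1.0 - leave_prob
    return total

# Illustrative switching probabilities; each mean should be close to 1 / q.
for q in (0.01, 0.05):
    assert abs(mean_region_length(q) - 1.0 / q) < 1e-6
```

So a transition probability of 0.01 corresponds to regions about 100 bases long on average.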
GC Clusters in Yeast mtDNA • The mitochondrial genome of yeast is 85 kb and appears to have three distinct types of regions with differing GC content. • Intergenic segments consist of large stretches of DNA with less than 5% GC content intermingled with short stretches with greater than 50% GC content. • The genes themselves have GC content of 18 to 28%.
GC Clusters in Yeast mtDNA • Respiration-deficient mutants, ρ⁻, come about through a deletion of most of the wild-type DNA (ρ⁺). • The small amount left is amplified in tandem repeats, replicated, and maintained in the mitochondrion. • Some of the ρ⁻ mutants displace the ρ⁺ DNA in all the diploid descendants when mating with wild-type strains. • These are called hypersuppressive ρ⁻ genomes.
GC Clusters in Yeast mtDNA • This paper considers two hypersuppressive and two non-suppressive phenotypes. • Because the segments considered contain no coding regions, the binary model is used. • The model parameters are set at: $p_0 = 0.9$, $p_1 = 0.5$, and both transition probabilities equal to 0.01.
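To make the binary model concrete, here is a minimal sketch of a two-state filter and smoother run with the parameter values quoted above. Several things are assumptions rather than details from the paper: y = 1 is taken to mean a weak (A or T) base, the initial state distribution is uniform, the filtering step is the standard prediction/update recursion, and the test sequence is made up.

```python
def smooth(y, p=(0.9, 0.5), switch=0.01, prior=(0.5, 0.5)):
    """Return Pr(s_t = 1 | y_1..y_n) for each t in a two-state binary model.

    p[j] = Pr(y_t = 1 | s_t = j); switch is the probability of changing state.
    """
    if not y:
        return []
    # trans[i][j] = Pr(s_t = j | s_{t-1} = i)
    trans = [[1 - switch, switch], [switch, 1 - switch]]

    def emit(obs, j):
        return p[j] if obs == 1 else 1 - p[j]

    # Forward pass: predicted[t] is the state distribution before seeing y[t],
    # filtered[t] is the distribution after seeing y[t].
    predicted, filtered = [], []
    pred = list(prior)  # assumed uniform initial distribution
    for obs in y:
        predicted.append(pred)
        joint = [pred[j] * emit(obs, j) for j in (0, 1)]
        norm = sum(joint)
        filt = [v / norm for v in joint]
        filtered.append(filt)
        pred = [sum(filt[i] * trans[i][j] for i in (0, 1)) for j in (0, 1)]

    # Backward pass: the smoothing recursion with the integral replaced by a sum.
    n = len(y)
    smoothed = [None] * n
    smoothed[-1] = filtered[-1]
    for t in range(n - 2, -1, -1):
        smoothed[t] = [
            filtered[t][i]
            * sum(smoothed[t + 1][j] * trans[i][j] / predicted[t + 1][j]
                  for j in (0, 1))
            for i in (0, 1)
        ]
    return [s[1] for s in smoothed]

# Made-up sequence: an AT-only stretch followed by a GC-rich stretch.
seq = "ATTATAAATT" * 5 + "GCGGACCGTC" * 5
y = [1 if base in "AT" else 0 for base in seq]  # assumed coding: 1 = weak base
profile = smooth(y)
assert profile[10] < 0.5 < profile[80]  # state 1 is favored only in the GC-rich part
```

Plotting `profile` against position gives the kind of state profile discussed for the yeast segments, with values near 0 in AT-rich stretches and near 1 in GC clusters.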
Yeast mtDNA Results • The profiles of the hypersuppressive segments share a general structure of GC-rich regions and AT-rich regions that differ from the non-suppressive segments. • This suggests that the GC content is in some way related to the function of these hypersuppressive sequences.
Conclusion • In its simplest form, this algorithm is useful for studying compositional heterogeneity in DNA. • The stochastic models proposed are useful in extracting information from large and complex data sets such as DNA sequence data. • This algorithm could be used to study relationships between DNA primary structure and global organization of entire chromosomes.