330 likes | 509 Views
DNA Segmentation. Presented by Ming-Te Cheng IFT 6299 - Algorithmique de l’ADN Autumn 2004 November 15, 2004 . Overview. Introduction Segmentation Models Segmentation Methods Discussion. Introduction. Statistical analysis of DNA sequences are motivated by 3 areas of exploration
E N D
DNA Segmentation Presented by Ming-Te Cheng IFT 6299 - Algorithmique de l’ADN Autumn 2004 November 15, 2004
Overview • Introduction • Segmentation Models • Segmentation Methods • Discussion
Introduction • Statistical analysis of DNA sequences are motivated by 3 areas of exploration • DNA sequence data offer an extremely fine view where traditional methods of variation analysis can be extended • DNA sequence data allow fine-tuning and organization of genetic process • Comparison of sequences between species demands methods of determining similarities in evolution or function
Introduction • Large chunks of the genome are sequenced, where the functionality of many sequences are unknown • Scientists rely on homology (similarity) to analyze unknown sequences with previously well studies small sequences • Need methods of describing and assessing sequences that provide useful characterizations • Ex: Segmentation models
Segmentation Models • Assumption: Sequences can be partitioned into a number of segments • Each segment has a certain degree of internal homogeneity (or similarity) • Ex: Isochores • Large segments (> 300 kb) of DNA belonging to a number of classes defined by different [G+C] levels and by fairly homogeneous base compositions
Segmentation Methods • Common techniques • Moving Window • Maximum Likelihood Estimation • Hidden Markov Models • Recursive Segmentation
Segmentation Methods Moving Window • Most commonly used algorithm in biology community • Straightforward implementation • Calculate density of a sequence feature of interest within a window • Move window along sequence • Recalculate density again
Segmentation MethodsMoving Window • Drawbacks • Arbitrary choice of window size and moving distance • If window size is too large, local fluctuations that contain significant biological information may be averaged out • If moving distance is too long, one domain can be split between two windows and its distinctive feature may be lost
Segmentation MethodsMaximum-Likelihood Estimation • Algorithm that computes the maximum likelihood estimate for the number of changed segments
Segmentation Methods Maximum-Likelihood Estimation • Let X1,…,Xn represent a sequence of independent random letters from an alphabet • Let every Xi be one of two known distributions, specified by the probabilities and • Changed segment is a segment [a,b] of indices where for all • Unchanged segment is a segment [a,b] of indices where for all
Segmentation Methods Maximum-Likelihood Estimation • Let xi be the observed values of Xi for sequence • Let C be a non-intersecting set of hypothetical changed segments • Let z = (z1,…,zn) be the indicator vector for C
Segmentation Methods Maximum-Likelihood Estimation • Likelihood function is can be written as • Log-likelihood can be written as • First term represents log-likelihood of null hypothesis that there are no changed segments • Second term represents log-likelihood ratio of the alternative hypothesis
Segmentation MethodsHidden Markov Models • Example of Markov Model
Segmentation MethodsHidden Markov Models • Example of Hidden Markov Model
Segmentation MethodsHidden Markov Models • Assumes that different segments can be classified into a finite set of state, where the nucleotide data in each state follows a probability distribution
Segmentation MethodsHidden Markov Models • Let the finite number of r states underlying the observations be denoted by Si • Let the states follows a Markov process with transition matrix • System of equations for the hidden chain can be written as
Segmentation MethodsHidden Markov Models • Likewise, system of equations for the observations can be written as where yi = (yi,1,…,yi,m) represent vector of m possible observed outcomes, and where each observation is associated with one of the states
Segmentation MethodsHidden Markov Models • With the system equations for hidden chain and observations, the smoothing equations can be derived and be used to plot the homogeneous regions in the sequence
Segmentation MethodsRecursive Segmentation • Assumes that sequences exhibit hierarchical patterns (possibility of subdomains) • It is possible to apply a filter to convert the original four-base DNA sequence into k-symbol sequence • Ex: S(strong) = {C,G} and W(weak) = {A,T}
Segmentation MethodsRecursive Segmentation • Divide-and-conquer approach is applied • For k-symbol sequence of length N, calculate each position i (0 < i < N) the entropy H of the whole sequence, entropy Hl of the subsequence on the left side of the partition point, and entropy Hr of the subsequence on the right side.
Segmentation MethodsRecursive Segmentation • Entropy equations as defined by (Shannon 1948) where Nj, Nj,l, and Nj,r are the counts of symbol j in the whole, left, and right sequence, respectively
Segmentation MethodsRecursive Segmentation • Maximized Jensen-Shannon divergence was chosen to measure the heterogeneity of the sequence • If divergence is large enough, the sequence is heterogeneous and should be segmented • Equation is recursively applied for both the left and the right subsequence, as long as the calculated divergence value stays above the given threshold (similar to constructing a binary tree)
Segmentation MethodsRecursive Segmentation • Alternate approach to determining stopping criterion involves finding a model at the border between underfitting models (those that do not fit the data well) and overfitting models (those that fit the data too well by using too many parameters) • Bayesian Information Criterion (BIC) was used to balance goodness-of-fit of the model to data
Segmentation MethodsRecursive Segmentation • Alternate approach to determining stopping criterion involves finding a model at the border between underfitting models (those that do not fit the data well) and overfitting models (those that fit the data too well by using too many parameters) • Bayesian Information Criterion (BIC) was used to balance the “goodness-of-fit” of the model to data L is the likelihood of the model, K the number of free parameters, and N the sample size
Segmentation MethodsRecursive Segmentation • Two models can be compared: • Modelling the sequence as one single random sequence • Modelling it as two random subsequences with different base compositions • In order for recursive segmentation to continue, the following must apply where k is the number of different symbols in the sequence
Segmentation MethodsRecursive Segmentation • Alternate recursive segmentation algorithm condition can be used to define the segmentation strength s, i.e. • Recursive segmentation process can be continued as long as s > s0, where s0 is predefined by the user
Discussion • DNA sequences can be assumed to have segments where each has a degree of homogeneity • A number of statistical methods can be used to identify and analyse these segments • Isochores • CpG islands • Replication origin and terminus • Complex patterns in telomeres • Coding-noncoding borders • Other statistical methods for analysing DNA segmentation do exist, each with varying degrees of success • Bayesian approach • Walking Markov • Change-point methods
References • Braun J.V., Müller H.-G. “Statistical methods for DNA sequence segmentation,” Statistical Science, 13:142-162, 1998. • Duda R.O., Hart P.E., Stork D.G. (2001) Pattern Classification, New York: John Wiley & Sons, Inc. • Churchill, G.A. “Stochastic models for heterogeneous DNA sequences,” Bulletin of Mathematical Biology, 51:79-94, 1989. • Csürös M. “Algorithms for finding maximal-scoring segment sets,” Proc. WABI, 2004. • Li W., et al. “Applications of recursive segmentation to the analysis of DNA sequences,” Computational Chemistry, 26:491-510, 2002.