DNA Segmentation

DNA Segmentation Presented by Ming-Te Cheng IFT 6299 - Algorithmique de l’ADN Autumn 2004 November 15, 2004

Overview • Introduction • Segmentation Models • Segmentation Methods • Discussion

Introduction • Statistical analysis of DNA sequences are motivated by 3 areas of exploration • DNA sequence data offer an extremely fine view where traditional methods of variation analysis can be extended • DNA sequence data allow fine-tuning and organization of genetic process • Comparison of sequences between species demands methods of determining similarities in evolution or function

Introduction • Large chunks of the genome are sequenced, where the functionality of many sequences are unknown • Scientists rely on homology (similarity) to analyze unknown sequences with previously well studies small sequences • Need methods of describing and assessing sequences that provide useful characterizations • Ex: Segmentation models

Segmentation Models • Assumption: Sequences can be partitioned into a number of segments • Each segment has a certain degree of internal homogeneity (or similarity) • Ex: Isochores • Large segments (> 300 kb) of DNA belonging to a number of classes defined by different [G+C] levels and by fairly homogeneous base compositions

Segmentation Methods • Common techniques • Moving Window • Maximum Likelihood Estimation • Hidden Markov Models • Recursive Segmentation

Segmentation Methods Moving Window • Most commonly used algorithm in biology community • Straightforward implementation • Calculate density of a sequence feature of interest within a window • Move window along sequence • Recalculate density again

Segmentation MethodsMoving Window • Drawbacks • Arbitrary choice of window size and moving distance • If window size is too large, local fluctuations that contain significant biological information may be averaged out • If moving distance is too long, one domain can be split between two windows and its distinctive feature may be lost

Segmentation MethodsMaximum-Likelihood Estimation • Algorithm that computes the maximum likelihood estimate for the number of changed segments

Segmentation Methods Maximum-Likelihood Estimation • Let X1,…,Xn represent a sequence of independent random letters from an alphabet • Let every Xi be one of two known distributions, specified by the probabilities and • Changed segment is a segment [a,b] of indices where for all • Unchanged segment is a segment [a,b] of indices where for all

Segmentation Methods Maximum-Likelihood Estimation • Let xi be the observed values of Xi for sequence • Let C be a non-intersecting set of hypothetical changed segments • Let z = (z1,…,zn) be the indicator vector for C

Segmentation Methods Maximum-Likelihood Estimation • Likelihood function is can be written as • Log-likelihood can be written as • First term represents log-likelihood of null hypothesis that there are no changed segments • Second term represents log-likelihood ratio of the alternative hypothesis

Segmentation Methods Maximum-Likelihood Estimation

Segmentation MethodsHidden Markov Models • Example of Markov Model

Segmentation MethodsHidden Markov Models • Example of Hidden Markov Model

Segmentation MethodsHidden Markov Models • Assumes that different segments can be classified into a finite set of state, where the nucleotide data in each state follows a probability distribution

Segmentation MethodsHidden Markov Models • Let the finite number of r states underlying the observations be denoted by Si • Let the states follows a Markov process with transition matrix • System of equations for the hidden chain can be written as

Segmentation MethodsHidden Markov Models • Likewise, system of equations for the observations can be written as where yi = (yi,1,…,yi,m) represent vector of m possible observed outcomes, and where each observation is associated with one of the states

Segmentation MethodsHidden Markov Models • With the system equations for hidden chain and observations, the smoothing equations can be derived and be used to plot the homogeneous regions in the sequence

Segmentation MethodsHidden Markov Models

Segmentation MethodsRecursive Segmentation • Assumes that sequences exhibit hierarchical patterns (possibility of subdomains) • It is possible to apply a filter to convert the original four-base DNA sequence into k-symbol sequence • Ex: S(strong) = {C,G} and W(weak) = {A,T}

Segmentation MethodsRecursive Segmentation • Divide-and-conquer approach is applied • For k-symbol sequence of length N, calculate each position i (0 < i < N) the entropy H of the whole sequence, entropy Hl of the subsequence on the left side of the partition point, and entropy Hr of the subsequence on the right side.

Segmentation MethodsRecursive Segmentation • Entropy equations as defined by (Shannon 1948) where Nj, Nj,l, and Nj,r are the counts of symbol j in the whole, left, and right sequence, respectively

Segmentation MethodsRecursive Segmentation • Maximized Jensen-Shannon divergence was chosen to measure the heterogeneity of the sequence • If divergence is large enough, the sequence is heterogeneous and should be segmented • Equation is recursively applied for both the left and the right subsequence, as long as the calculated divergence value stays above the given threshold (similar to constructing a binary tree)

Segmentation MethodsRecursive Segmentation • Alternate approach to determining stopping criterion involves finding a model at the border between underfitting models (those that do not fit the data well) and overfitting models (those that fit the data too well by using too many parameters) • Bayesian Information Criterion (BIC) was used to balance goodness-of-fit of the model to data

Segmentation MethodsRecursive Segmentation • Alternate approach to determining stopping criterion involves finding a model at the border between underfitting models (those that do not fit the data well) and overfitting models (those that fit the data too well by using too many parameters) • Bayesian Information Criterion (BIC) was used to balance the “goodness-of-fit” of the model to data L is the likelihood of the model, K the number of free parameters, and N the sample size

Segmentation MethodsRecursive Segmentation • Two models can be compared: • Modelling the sequence as one single random sequence • Modelling it as two random subsequences with different base compositions • In order for recursive segmentation to continue, the following must apply where k is the number of different symbols in the sequence

Segmentation MethodsRecursive Segmentation • Alternate recursive segmentation algorithm condition can be used to define the segmentation strength s, i.e. • Recursive segmentation process can be continued as long as s > s0, where s0 is predefined by the user

Segmentation MethodsRecursive Segmentation

Discussion • DNA sequences can be assumed to have segments where each has a degree of homogeneity • A number of statistical methods can be used to identify and analyse these segments • Isochores • CpG islands • Replication origin and terminus • Complex patterns in telomeres • Coding-noncoding borders • Other statistical methods for analysing DNA segmentation do exist, each with varying degrees of success • Bayesian approach • Walking Markov • Change-point methods

References • Braun J.V., Müller H.-G. “Statistical methods for DNA sequence segmentation,” Statistical Science, 13:142-162, 1998. • Duda R.O., Hart P.E., Stork D.G. (2001) Pattern Classification, New York: John Wiley & Sons, Inc. • Churchill, G.A. “Stochastic models for heterogeneous DNA sequences,” Bulletin of Mathematical Biology, 51:79-94, 1989. • Csürös M. “Algorithms for finding maximal-scoring segment sets,” Proc. WABI, 2004. • Li W., et al. “Applications of recursive segmentation to the analysis of DNA sequences,” Computational Chemistry, 26:491-510, 2002.

DNA Segmentation

DNA Segmentation

Presentation Transcript

Segmentation

Segmentation

Segmentation

Segmentation II, International Segmentation

Segmentation

Segmentation

Segmentation

Segmentation

Segmentation

SEGMENTATION

Segmentation

Segmentation

Segmentation

Segmentation

SEGMENTATION

DNA Diagnostics Market Segmentation, Analysis and Forecast 2020

Segmentation

Segmentation

Segmentation