1 / 33

DNA Segmentation

DNA Segmentation. Presented by Ming-Te Cheng IFT 6299 - Algorithmique de l’ADN Autumn 2004 November 15, 2004 . Overview. Introduction Segmentation Models Segmentation Methods Discussion. Introduction. Statistical analysis of DNA sequences are motivated by 3 areas of exploration

cricket
Download Presentation

DNA Segmentation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DNA Segmentation Presented by Ming-Te Cheng IFT 6299 - Algorithmique de l’ADN Autumn 2004 November 15, 2004

  2. Overview • Introduction • Segmentation Models • Segmentation Methods • Discussion

  3. Introduction • Statistical analysis of DNA sequences are motivated by 3 areas of exploration • DNA sequence data offer an extremely fine view where traditional methods of variation analysis can be extended • DNA sequence data allow fine-tuning and organization of genetic process • Comparison of sequences between species demands methods of determining similarities in evolution or function

  4. Introduction • Large chunks of the genome are sequenced, where the functionality of many sequences are unknown • Scientists rely on homology (similarity) to analyze unknown sequences with previously well studies small sequences • Need methods of describing and assessing sequences that provide useful characterizations • Ex: Segmentation models

  5. Segmentation Models • Assumption: Sequences can be partitioned into a number of segments • Each segment has a certain degree of internal homogeneity (or similarity) • Ex: Isochores • Large segments (> 300 kb) of DNA belonging to a number of classes defined by different [G+C] levels and by fairly homogeneous base compositions

  6. Segmentation Methods • Common techniques • Moving Window • Maximum Likelihood Estimation • Hidden Markov Models • Recursive Segmentation

  7. Segmentation Methods Moving Window • Most commonly used algorithm in biology community • Straightforward implementation • Calculate density of a sequence feature of interest within a window • Move window along sequence • Recalculate density again

  8. Segmentation MethodsMoving Window • Drawbacks • Arbitrary choice of window size and moving distance • If window size is too large, local fluctuations that contain significant biological information may be averaged out • If moving distance is too long, one domain can be split between two windows and its distinctive feature may be lost

  9. Segmentation MethodsMaximum-Likelihood Estimation • Algorithm that computes the maximum likelihood estimate for the number of changed segments

  10. Segmentation Methods Maximum-Likelihood Estimation • Let X1,…,Xn represent a sequence of independent random letters from an alphabet • Let every Xi be one of two known distributions, specified by the probabilities and • Changed segment is a segment [a,b] of indices where for all • Unchanged segment is a segment [a,b] of indices where for all

  11. Segmentation Methods Maximum-Likelihood Estimation • Let xi be the observed values of Xi for sequence • Let C be a non-intersecting set of hypothetical changed segments • Let z = (z1,…,zn) be the indicator vector for C

  12. Segmentation Methods Maximum-Likelihood Estimation • Likelihood function is can be written as • Log-likelihood can be written as • First term represents log-likelihood of null hypothesis that there are no changed segments • Second term represents log-likelihood ratio of the alternative hypothesis

  13. Segmentation Methods Maximum-Likelihood Estimation

  14. Segmentation MethodsHidden Markov Models • Example of Markov Model

  15. Segmentation MethodsHidden Markov Models • Example of Hidden Markov Model

  16. Segmentation MethodsHidden Markov Models • Assumes that different segments can be classified into a finite set of state, where the nucleotide data in each state follows a probability distribution

  17. Segmentation MethodsHidden Markov Models • Let the finite number of r states underlying the observations be denoted by Si • Let the states follows a Markov process with transition matrix • System of equations for the hidden chain can be written as

  18. Segmentation MethodsHidden Markov Models • Likewise, system of equations for the observations can be written as where yi = (yi,1,…,yi,m) represent vector of m possible observed outcomes, and where each observation is associated with one of the states

  19. Segmentation MethodsHidden Markov Models • With the system equations for hidden chain and observations, the smoothing equations can be derived and be used to plot the homogeneous regions in the sequence

  20. Segmentation MethodsHidden Markov Models

  21. Segmentation MethodsRecursive Segmentation • Assumes that sequences exhibit hierarchical patterns (possibility of subdomains) • It is possible to apply a filter to convert the original four-base DNA sequence into k-symbol sequence • Ex: S(strong) = {C,G} and W(weak) = {A,T}

  22. Segmentation MethodsRecursive Segmentation • Divide-and-conquer approach is applied • For k-symbol sequence of length N, calculate each position i (0 < i < N) the entropy H of the whole sequence, entropy Hl of the subsequence on the left side of the partition point, and entropy Hr of the subsequence on the right side.

  23. Segmentation MethodsRecursive Segmentation • Entropy equations as defined by (Shannon 1948) where Nj, Nj,l, and Nj,r are the counts of symbol j in the whole, left, and right sequence, respectively

  24. Segmentation MethodsRecursive Segmentation • Maximized Jensen-Shannon divergence was chosen to measure the heterogeneity of the sequence • If divergence is large enough, the sequence is heterogeneous and should be segmented • Equation is recursively applied for both the left and the right subsequence, as long as the calculated divergence value stays above the given threshold (similar to constructing a binary tree)

  25. Segmentation MethodsRecursive Segmentation • Alternate approach to determining stopping criterion involves finding a model at the border between underfitting models (those that do not fit the data well) and overfitting models (those that fit the data too well by using too many parameters) • Bayesian Information Criterion (BIC) was used to balance goodness-of-fit of the model to data

  26. Segmentation MethodsRecursive Segmentation • Alternate approach to determining stopping criterion involves finding a model at the border between underfitting models (those that do not fit the data well) and overfitting models (those that fit the data too well by using too many parameters) • Bayesian Information Criterion (BIC) was used to balance the “goodness-of-fit” of the model to data L is the likelihood of the model, K the number of free parameters, and N the sample size

  27. Segmentation MethodsRecursive Segmentation • Two models can be compared: • Modelling the sequence as one single random sequence • Modelling it as two random subsequences with different base compositions • In order for recursive segmentation to continue, the following must apply where k is the number of different symbols in the sequence

  28. Segmentation MethodsRecursive Segmentation • Alternate recursive segmentation algorithm condition can be used to define the segmentation strength s, i.e. • Recursive segmentation process can be continued as long as s > s0, where s0 is predefined by the user

  29. Segmentation MethodsRecursive Segmentation

  30. Segmentation MethodsRecursive Segmentation

  31. Segmentation MethodsRecursive Segmentation

  32. Discussion • DNA sequences can be assumed to have segments where each has a degree of homogeneity • A number of statistical methods can be used to identify and analyse these segments • Isochores • CpG islands • Replication origin and terminus • Complex patterns in telomeres • Coding-noncoding borders • Other statistical methods for analysing DNA segmentation do exist, each with varying degrees of success • Bayesian approach • Walking Markov • Change-point methods

  33. References • Braun J.V., Müller H.-G. “Statistical methods for DNA sequence segmentation,” Statistical Science, 13:142-162, 1998. • Duda R.O., Hart P.E., Stork D.G. (2001) Pattern Classification, New York: John Wiley & Sons, Inc. • Churchill, G.A. “Stochastic models for heterogeneous DNA sequences,” Bulletin of Mathematical Biology, 51:79-94, 1989. • Csürös M. “Algorithms for finding maximal-scoring segment sets,” Proc. WABI, 2004. • Li W., et al. “Applications of recursive segmentation to the analysis of DNA sequences,” Computational Chemistry, 26:491-510, 2002.

More Related