
Panagiotis Papapetrou*, Gary Benson ** and George Kollios * *Department of Computer Science

Discovering Frequent Poly-Regions in DNA Sequences. Panagiotis Papapetrou*, Gary Benson** and George Kollios*. *Department of Computer Science, **Departments of Biology and Computer Science, Boston University.





Presentation Transcript


  1. Discovering Frequent Poly-Regions in DNA Sequences Panagiotis Papapetrou*, Gary Benson** and George Kollios* *Department of Computer Science **Departments of Biology and Computer Science Boston University

  2. Introduction and Motivation (1/3) • In cells, DNA forms long chains made up of four chemical units, known as nucleotides. • A number of important regions (known functional regions), at both large and small scales, contain a high occurrence of one or more nucleotides. Such regions are called poly-regions. • Example: • Poly-A: a region “rich” in nucleotide A. • Poly-(C,T): a region “rich” in nucleotides C and T. • Many methods have addressed the problem. However, they focus only on specific types of poly-regions.

  3. Introduction and Motivation (2/3) • Isochores: • Multi-megabase regions of genomic sequence. • Specifically, GC-rich or GC-poor. • CpG islands: • Regions of several hundred nucleotides, rich in the dinucleotide CpG. • The level of methylation of the cytosine (C) is associated with gene expression in nearby genes. • Protein binding regions: • Tens of nucleotides long. • DNA flexibility: dinucleotide, base-step composition.

  4. Introduction and Motivation (3/3) [Figure: per-nucleotide tracks (Nucleotide A, Nucleotide C, Nucleotide G) along a chromosome; axis: position in the gene, with the 3’ and 5’ ends marked.]

  5. Related Work • Statistical Methods: • Maximum Likelihood Estimation (MLE) of segments (Auger et al. 1989, Bement et al. 1977, Fu et al. 1990). • Hidden Markov Chain Model (Churchill et al. 1992). • Walking Markov Model (Fickett et al. 1992). • Change-points: (Carlstein et al. 1994, Braun et al. 1998, et al. 2000). • Hierarchical Segmentation (Grosse et al. 2002, Galvan et al. 2002, Zhang et al. 2005).

  6. Main Contributions • Formal definition of the problem of detecting poly-regions in a sequence. • Application of an existing recursive segmentation technique to solve the problem. • Development of an efficient algorithm based on multiple sliding windows. • Application of an efficient arrangement mining algorithm to extract the complete set of frequent arrangements of these regions. • Extensive experimental evaluation of our algorithms on the dog genome.

  7. Preliminaries (1/4) • Sequence: S = {s_1, s_2, …, s_m}, an ordered set of items, defined over an alphabet. • In our case, s_i corresponds to a nucleotide. • k Poly-Region: H_{d,k} = {I, p_start, p_end} • k: number of items in the region (the size of the item set I). • d: density of the region. • The region starts and ends with one of the k items. • Each of the k items has a frequency of at least (d/k)% in the region.

  8. Preliminaries (2/4) • Example of a k Poly-Region: • (1) poly-A • A: 8/10 • (2) poly-(A,C) • A: 4/10 • C: 4/10
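The definition and the two examples above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the function name `is_poly_region`, the choice to pass the density as an integer percentage, and the sample strings are assumptions introduced here for illustration.

```python
from collections import Counter


def is_poly_region(segment, items, d_percent):
    """Return True if `segment` satisfies the slide's two conditions
    for a poly-region over the item set `items` with density d_percent."""
    if not segment or not items:
        return False
    k = len(items)
    n = len(segment)
    # Condition 1: the region starts and ends with one of the k items.
    if segment[0] not in items or segment[-1] not in items:
        return False
    counts = Counter(segment)
    # Condition 2: each item has a frequency of at least (d/k)%.
    # Integer arithmetic avoids floating-point comparison issues.
    return all(counts[x] * 100 * k >= d_percent * n for x in items)


# Hypothetical 10-nucleotide strings matching the slide's ratios at d = 80%.
print(is_poly_region("AACAAGAAAA", {"A"}, 80))       # A: 8/10
print(is_poly_region("ACACGTACAC", {"A", "C"}, 80))  # A: 4/10, C: 4/10
```

Both calls print `True`; the second passes because each of the two items only needs 80/2 = 40% frequency.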

  9. Preliminaries (3/4) • Different types of relations [1] can occur among Poly-Regions. [1] J. F. Allen and G. Ferguson. “Actions and events in interval temporal logic”. Technical Report 521, The University of Rochester, July 1994.

  10. Preliminaries (4/4) • k-Arrangement: a set of k temporally correlated events in an e-sequence, denoted as A = {E, R}, where: • E: the set of labels of the event intervals in the arrangement. • R: the set of temporal relations between the events in E, where each element of R is the temporal relation between a pair of events E_i and E_j.
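To make the notion of a relation between two intervals concrete, here is a small sketch that classifies a pair of intervals using the relation names from Allen's interval algebra. The slides do not say which subset of relations the authors use, so the full set of thirteen is shown; the function name and the half-open `(start, end)` representation (with `start < end`) are assumptions.

```python
def allen_relation(a, b):
    """Name the Allen relation of interval `a` with respect to interval `b`.

    Intervals are half-open (start, end) pairs with start < end.
    """
    (a_s, a_e), (b_s, b_e) = a, b
    if a_e < b_s:
        return "before"
    if b_e < a_s:
        return "after"
    if a_e == b_s:
        return "meets"
    if b_e == a_s:
        return "met-by"
    # From here on the two intervals share at least one position.
    if a_s == b_s and a_e == b_e:
        return "equals"
    if a_s == b_s:
        return "starts" if a_e < b_e else "started-by"
    if a_e == b_e:
        return "finishes" if a_s > b_s else "finished-by"
    if a_s > b_s and a_e < b_e:
        return "during"
    if a_s < b_s and a_e > b_e:
        return "contains"
    return "overlaps" if a_s < b_s else "overlapped-by"


print(allen_relation((0, 5), (3, 9)))  # overlaps
print(allen_relation((2, 4), (0, 9)))  # during
```

An arrangement A = {E, R} could then be represented as a tuple of region labels together with the relation name for each pair, though that encoding is also an assumption.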

  11. Problem Statement • Given: • A sequence database S. • A minimum density constraint d. • A range [min, max]. • Find the complete set of maximal poly-regions H of size [min, max], and density of at least d %. • Given: • The set of maximal poly-regions H. • A minimum support threshold min_sup. • Extract the complete set of frequent arrangements of poly-regions in H.

  12. Recursive Segmentation (1/2) • Recursively segment the sequence. • The homogeneity difference between the segments is maximized with respect to a measure λ. • In our case we use the Jensen-Shannon Entropy (JSE). • The split point is chosen where JSE is maximized. • Segmentation of a subsequence is stopped when the minimum poly-region size is reached. • Expand each segment to define poly-regions.
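A single split step can be sketched as follows. This assumes the measure is the standard weighted Jensen-Shannon divergence between the two sides of a cut, JS = H(S) − (n1/n)·H(S1) − (n2/n)·H(S2), with H the Shannon entropy of the nucleotide composition; the slides do not give the exact formula, and the names `entropy` and `best_split` are illustrative.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (in bits) of a composition given as symbol counts."""
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c > 0)


def best_split(seq):
    """Return (position, js): the cut seq[:position] | seq[position:]
    that maximizes the weighted Jensen-Shannon divergence."""
    n = len(seq)
    total = Counter(seq)
    h_total = entropy(total, n)
    left = Counter()
    best_pos, best_js = None, -1.0
    for i in range(1, n):
        left[seq[i - 1]] += 1      # grow the left segment by one symbol
        right = total - left       # Counter subtraction drops zero counts
        js = (h_total
              - (i / n) * entropy(left, i)
              - ((n - i) / n) * entropy(right, n - i))
        if js > best_js:
            best_pos, best_js = i, js
    return best_pos, best_js


print(best_split("AAAACCCC"))
```

For `"AAAACCCC"` the best cut falls at position 4, between the run of A's and the run of C's. A recursive segmenter would call `best_split` again on each side until a segment reaches the minimum poly-region size.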

  13. Recursive Segmentation (2/2) • To improve the efficiency of the segmentation: • When looking for H-regions of two nucleotides, replace the rest of the nucleotides with a single literal. • Example: • S = AAACCCAGGTAGCT • Looking for poly-(A,C): • Snew = AAACCCAXXXAXCX
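The replacement step is a one-line transformation. In this sketch the function name and the default placeholder `"X"` (taken from the slide's example) are the only assumptions.

```python
def mask_other_nucleotides(seq, keep, placeholder="X"):
    """Replace every symbol not in `keep` with a single placeholder literal,
    reducing the alphabet before segmentation."""
    return "".join(ch if ch in keep else placeholder for ch in seq)


# Looking for poly-(A,C) in the slide's example sequence.
print(mask_other_nucleotides("AAACCCAGGTAGCT", {"A", "C"}))
```

The output has the same length as the input, so positions found in the masked sequence map directly back to the original.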

  14. Sliding Windows (1/3) • Define a set of sliding windows W = {w1, w2, …, wN} over the sequence • # of windows: N = max – min + 1. • size of window i : min + i -1. • Each window keeps statistics of a segment: • For each nucleotide: the # of occurrences in the segment. • Each window w = {C, Start, End} • C: set of statistics. • Start: pointer to the start point of the segment in the sequence. • End: pointer to the end point of the segment in the sequence.
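The window set and its per-window statistics might be sketched like this. The helper names are hypothetical, and the incremental count update (subtract the symbol leaving the window, add the one entering) is an assumed but common way to maintain such statistics; the slides do not show how the authors do it.

```python
from collections import Counter


def window_sizes(min_size, max_size):
    """Sizes of the N = max - min + 1 windows; window i has size min + i - 1."""
    return [min_size + i - 1 for i in range(1, max_size - min_size + 2)]


def slide(seq, size):
    """Yield (start, end, counts) for every position of a window of `size`.

    `start` and `end` are inclusive indices into `seq`; `counts` maps each
    symbol to its number of occurrences in the current segment.
    """
    if size > len(seq):
        return
    counts = Counter(seq[:size])
    yield 0, size - 1, dict(counts)
    for start in range(1, len(seq) - size + 1):
        counts[seq[start - 1]] -= 1          # symbol leaving the window
        counts[seq[start + size - 1]] += 1   # symbol entering the window
        yield start, start + size - 1, dict(counts)


print(len(window_sizes(8, 64)))
for start, end, counts in slide("AACA", 2):
    print(start, end, counts.get("A", 0))
```

With the experimental range of 8 to 64 nucleotides this gives 57 windows, each updated in constant time per slide.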

  15. Sliding Windows (2/3)

  16. Sliding Windows (3/3) • Heuristic: • NCi = # of items of type C in window i. • C is dense in wi if NCi / |wi| >= d. • Observe: the maximum size of the window where items of type C can fit and fulfill the density constraint is NCi / d. • This indicates which windows of the lower level should be searched for a candidate poly-region. • Start with the window of size max, and for each literal apply the heuristic. • Move to the lower levels.
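The heuristic reduces to two small calculations. The reasoning, as far as the slide states it: a sub-window of w_i cannot contain more than NCi items of type C, so any sub-window that is dense in C has size at most NCi / d. The sketch below uses hypothetical function names and an integer-percent density to keep the arithmetic exact.

```python
def is_dense(n_c, window_size, d_percent):
    """True if n_c occurrences in a window of `window_size` meet density d%."""
    return n_c * 100 >= d_percent * window_size


def max_feasible_size(n_c, d_percent):
    """Largest window size in which n_c occurrences still meet density d%."""
    return (n_c * 100) // d_percent


# Illustrative numbers: 20 occurrences of C in the window of size 64, d = 80%.
print(is_dense(20, 64, 80))       # the largest window itself is not dense
print(max_feasible_size(20, 80))  # only windows up to this size need checking
```

In this example the window of size 64 fails the constraint, and only lower-level windows of size 25 or less can possibly be dense in C at this position.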

  17. Frequent Arrangement Mining Algorithm • Use a sliding window W of size M>>max. • At each position of W find the set of arrangements in W. • Keep a global frequency of all arrangements. • Update after each slide. • For the enumeration, the arrangement enumeration tree is used.

  18. Experimental Setup • 39 Chromosomes. • Organism: Canis familiaris (Dog). • Two phases: • Extract poly-regions. • Discover frequent arrangements of poly-regions in the DNA sequences. • Density constraint d varied between 40% and 80%. • H-region size varied between 8 and 64 nucleotides.

  19. Performance Analysis • Recursive Segmentation: • Has accuracy of 85-90%. • Performs better in smaller sequences. • Sliding Window: • Extracts the complete set of poly-regions. • Faster in terms of run time.

  20. Sample Results (1/2) Chromosome 1 (Canis Familiaris)

  21. Sample Results (2/2) Frequent Arrangements

  22. Conclusions • The problem of discovering poly-regions and their frequent arrangements in DNA sequences has been introduced. • Two efficient methods for solving the problem have been discussed. • Recursive Segmentation: approximate. • Sliding Windows: exact. • An efficient algorithm for mining frequent arrangements of intervals has been applied to the extracted poly-regions.

  23. Future Work • Generalize the definition of a poly-region. • Poly-regions of dinucleotides/trinucleotides. • Poly-patterns. • Detect arrangements of poly-regions that occur frequently over: • Coding regions (Genes). • Nucleosomes.
