
svinga@itqb.unl.pt almeidaj@musc

2D-CGR/USM representation of DNA. Chaos Game Representation/Universal Sequence Map (CGR/USM): each iteration goes half the distance towards the corner representing the next symbol, so each point x_i corresponds to one symbol in its context.
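The half-distance iteration described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the corner coordinates (A and G on the bottom, C and T on the top of the unit square) and the starting point at the centre of the square are assumptions, and the name `cgr_coordinates` is hypothetical.

```python
# Assumed corner layout on the unit square (hypothetical choice; the slide
# labels the corners C, T, A, G but the text does not give coordinates).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 0.0), "T": (1.0, 1.0)}


def cgr_coordinates(sequence, start=(0.5, 0.5)):
    """Return the CGR/USM points x_1..x_N for a DNA string.

    Each step moves half the distance from the current point towards
    the corner of the next symbol. The start point is an assumption.
    """
    x, y = start
    points = []
    for symbol in sequence.upper():
        cx, cy = CORNERS[symbol]
        x = x + 0.5 * (cx - x)
        y = y + 0.5 * (cy - y)
        points.append((x, y))
    return points


if __name__ == "__main__":
    # Print one point per symbol of a short example sequence.
    for symbol, point in zip("ACGT", cgr_coordinates("ACGT")):
        print(symbol, point)
```

Because every step halves the distance to a corner, the last symbol alone fixes the quadrant of the point, the last two symbols fix the sub-quadrant, and so on.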


Presentation Transcript


  1. 2D-CGR/USM representation of DNA

J-9

Susana Vinga (1), Jonas S. Almeida (1,2)

(1) Biomathematics Group - Instituto de Tecnologia Química e Biológica, Univ. Nova de Lisboa (ITQB/UNL) - Oeiras, Portugal.
(2) Dept. Biostatistics, Bioinformatics and Epidemiology - Medical Univ. South Carolina - Charleston SC 29425, USA

svinga@itqb.unl.pt, almeidaj@musc.edu
http://bioinformatics.musc.edu/renyi

1. Introduction

Entropy estimation of DNA sequences provides a measure of their complexity and randomness level. Shannon's L-block discrete entropy definition is based on counting all length-L overlapping words in a sequence and has finite-size convergence problems.

Shannon's discrete entropy:
• Measures randomness or predictability
• Equivalent to Rényi entropy of order α = 1

2. Objectives

The Rényi continuous quadratic entropy proposed here generalizes former concepts without some of the problems encountered in Shannon's formalism. Furthermore, the continuity of Rényi's measure allows great flexibility and the extraction of new features in sequences.

3. Methods and Algorithms

  1. CGR/USM representation of DNA

Chaos Game Representation/Universal Sequence Map (CGR/USM) maps discrete sequences onto continuous maps. Each iteration goes half the distance towards the corner representing the next symbol, so each point x_i corresponds to one symbol in its context. This defines the CGR/USM mapping of an N-length DNA sequence.

[Figure: unit square with corners labeled C and T (top), A and G (bottom)]

Suffix property: strings ending in a specific suffix are in the sub-square labeled with that suffix.

  2. Rényi quadratic entropy (α = 2)

The method starts from the definition of the Rényi continuous entropy of a function and takes the quadratic case, α = 2.

  3. Parzen's Method

Parzen's pdf estimation is a non-parametric method for the estimation of a probability density function (pdf). For a given sample of i.i.d. points a_i ~ f, the pdf is estimated with a spherical Gaussian kernel.

[Figure: Parzen estimates for different kernel widths, annotated "Too large: less detail", "Too small: bad data generalization", and "Good estimation"]

The DNA entropy is defined on the basis of CGR/USM and Parzen's Method, with parameter σ², the variance of the Gaussian function used.

Simplification! The integral reduces to a sum, because the convolution of two Gaussians is Gaussian. The resulting Rényi entropy of a DNA sequence depends only on all pairwise squared Euclidean distances between the CGR/USM coordinates x_i.

4. Results and Discussion

[Table: DNA test set, and random DNA from Monte Carlo simulation (median values, by length N)]

[Figure: CGR/USM estimation; -ATC- motif detected]

Rényi continuous quadratic entropy for the DNA sequence dataset and for random sequences obtained by Monte Carlo simulation: the entropies for the dataset described in the table above are shown as a function of the logarithm of the Gaussian kernel variance used in Parzen's Method. The lower the value of the entropy H2, the less random or more structured the sequence is. The graph has theoretically demonstrated asymptotes, each given by a line, in two limiting cases.

5. Conclusions and Future Work

• Rényi continuous quadratic entropy H2 is a good measure of randomness of DNA sequences
• Simplifications with the Parzen method allow straightforward computation
• The method will hopefully provide new tools for the study of motifs and repeatability in biological sequences
• Explore more theoretical properties of H2
• Optimize the algorithm to accommodate longer sequences
• Choose variance values that have special significance

Acknowledgments

S. Vinga and J.S. Almeida thankfully acknowledge the financial support by grants SFRH/BD/3134/2000 and SAPIENS/34794/99 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência e do Ensino Superior.

References

S. Vinga and J.S. Almeida, Rényi continuous entropy of DNA sequences, Journal of Theoretical Biology, 2004 (accepted).
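The simplified computation described in the Methods (integral replaced by a sum over all pairwise squared Euclidean distances between CGR/USM points) can be sketched as below. The transcript does not preserve the poster's equations, so this sketch assumes the standard Parzen-window form of the quadratic entropy, H2 = -ln[(1/N²) Σ_i Σ_j G(x_i - x_j; 2σ²I)], with G a two-dimensional spherical Gaussian density. The corner layout, the starting point, and the names `cgr_points` and `renyi_quadratic_entropy` are likewise assumptions made for illustration.

```python
import math

# Assumed corner layout and start point (hypothetical; not given in the text).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 0.0), "T": (1.0, 1.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """CGR/USM points: move half-way to the corner of each next symbol."""
    x, y = start
    points = []
    for symbol in sequence.upper():
        cx, cy = CORNERS[symbol]
        x, y = x + 0.5 * (cx - x), y + 0.5 * (cy - y)
        points.append((x, y))
    return points


def renyi_quadratic_entropy(points, variance):
    """Parzen-window estimate of H2 for 2-D points (assumed formula).

    Convolving two Gaussian kernels of variance `variance` gives a Gaussian
    of variance 2 * variance, so the integral of the squared density
    estimate becomes a double sum over pairwise squared distances.
    """
    n = len(points)
    s2 = 2.0 * variance
    total = 0.0
    for xi, yi in points:
        for xj, yj in points:
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += math.exp(-d2 / (2.0 * s2))
    # Normalise by N^2 and by the 2-D Gaussian constant 1 / (2 * pi * s2).
    integral_of_squared_pdf = total / (n * n * 2.0 * math.pi * s2)
    return -math.log(integral_of_squared_pdf)


if __name__ == "__main__":
    # Compare a homopolymer with a mixed sequence at one kernel variance.
    for name, seq in [("poly-A", "A" * 50), ("mixed", "ACGT" * 12 + "AC")]:
        h2 = renyi_quadratic_entropy(cgr_points(seq), variance=0.01)
        print(name, h2)
```

The double loop costs O(N²) time, which is consistent with the poster's stated goal of optimizing the algorithm for longer sequences. Evaluating the function over a range of variances gives a curve of H2 against the logarithm of the kernel variance, as in the Results.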
