
Problems for Effective Analysis of Biological Data: Discrete Transforms on Symbolic Sequences for String-Matching, Pattern-Recognition and Grammar Detection

C. Karanikas, N. Atreas, P. Polychronidou, A. Bakalakos

Department of Informatics, Aristotle University of Thessaloniki


Abstract

  • We draw our inspiration from the basic operations that nature performs on biological sequences (Replication, Dilation, Translation, Splicing, etc.).

  • We introduce a variety of new discrete (linear or nonlinear) invertible transforms on symbolic sequences:

    • The Cyclic Class Transform

    • The Stern Brocot Transform

    • The Generalisation of the Haar Transform

    • The Haar-Riesz Product

  • Our main targets with these transforms are

    • to encode and decode local information on strings,

    • to perform fast non-exact string matching and pattern recognition,

    • to identify some of the grammatical rules of string collections.

  • Thus we deal with the notion of similarity and with distances such as the edit distance, i.e. distances that count the delete, insert and substitute operations required to make two strings identical.


New Era in Security-Related Research

  • Global Security depends on Information Security

  • The Mobility, Heterogeneity, Size and Complexity of Information Systems make it difficult to detect, prevent, respond to and overcome security threats.

  • We deal with a fundamental problem: Detect, recognize, interpret and ultimately assign a meaning to symbolic information such as:

    • Biological/Biometric data

    • Intelligence Information

    • Any other form of information

  • We formulate the problem in mathematical/information-theoretic terms: given a pattern and a string, find parts of the string similar to the pattern.

  • Or, find strings that are identical.

  • Or, find strings that are similar.

  • Also, make the algorithms work in a “noisy” environment (i.e. “forgive” a few accidental errors in a big string).

  • Also, find the underlying grammar of the string.

  • We need new mathematical tools for effective analysis of data.



Sample of an RNA (3,000,000,000 bases A,C,G,T)

  • CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT


3D Simulation of a Protein

  • Each protein can be written in 3D as a string in an alphabet of 20 letters.


A Protein


Coding using Prime Numbers

  • Let $\{x_1, \dots, x_n\}$ be a symbolic sequence in an alphabet $\{\varepsilon_1, \dots, \varepsilon_k\}$ (w.l.o.g. take the $\varepsilon$'s to be primes).

  • Let $p_1, \dots, p_n, \dots$ be an increasing sequence of primes.

  • The map $\{x_1, \dots, x_n\} \mapsto \sum_i x_i \, (p_i / p_{i+1})$ provides an invertible coding for the collection of all symbolic sequences (as above).

  • Similar maps: $\{x_1, \dots, x_n\} \mapsto \sum_i (1 - x_i / p_i)$ or

  • $\{x_1, \dots, x_n\} \mapsto \prod_i (1 - x_i / p_i)$


The Prime Coding


Decoding


Given a collection of strings (genes/proteins written in an alphabet of four/twenty letters), one can assign to each pair of strings in the collection a number that tells how distant or how similar they are.

The Edit (Levenshtein) distance counts the minimum number of Delete, Insert and Substitute operations needed to make two strings identical. E.g. the Edit distance of {1,0,1,1,0,1} and {1,1,1,0,1,0} is 2 (delete the second digit, then insert 0 at the end).

One dramatic way that mathematics has come in handy for the study of the genome involves the concept of distance.


Levenshtein (or Edit) Distance of {1,1,0,0,1,0,1} and {1,0,0,0,1,0,0,1}
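The edit distance described above can be computed with the classic dynamic-programming recurrence. The following is a minimal sketch, assuming unit cost for each delete, insert and substitute operation; the function name `edit_distance` is illustrative and not taken from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, unit cost per operation."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty prefix takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


# The two pairs of binary strings used on the slides.
print(edit_distance([1, 0, 1, 1, 0, 1], [1, 1, 1, 0, 1, 0]))        # 2
print(edit_distance([1, 1, 0, 0, 1, 0, 1], [1, 0, 0, 0, 1, 0, 0, 1]))  # 2
```

The table costs time proportional to the product of the two string lengths, which is the motivation for looking for faster, transform-based similarity measures on very long sequences.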


A new distance on strings based on their cyclic classes and the distribution of characters

  • The problem: given two strings, ACCGHAAGGHCC and ACGHHAGGHACC,

  • replace A → 0, C → 1, H → 2 and G → 3 and consider the corresponding numbers.

  • Find a new (fast) transform on symbolic sequences, and a measure on them, that is “invariant” under a small number of changes such as insert, delete and replace. Next we provide a new distance suitable for biodata (biological or biometric).


Stern-Brocot Transform

  • Consider the matrices T(0) and T(1) respectively (shown as an image on the slide).

  • Each string $\{\varepsilon_1, \dots, \varepsilon_n\}$, where $\varepsilon_i = 0$ or $1$, corresponds to the (dot) product of matrices $T(\varepsilon_1) \cdot T(\varepsilon_2) \cdots T(\varepsilon_n)$.

  • For example, the string {1,0,1,0,1} → T(1).T(0).T(1).T(0).T(1) (product matrix shown as an image on the slide).

  • The pair {8,13} (the sum of each row) is unique, and so {1,0,1,0,1} → {8,13}.
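The matrices T(0) and T(1) are not preserved in this transcript. The sketch below assumes the usual Stern-Brocot generators, T(0) = [[1,1],[0,1]] and T(1) = [[1,0],[1,1]], a choice that reproduces the slide's example {1,0,1,0,1} → {8,13}. The names `T`, `matmul2` and `stern_brocot_pair` are illustrative.

```python
# Assumed generators: the slide's own matrices are not in the transcript.
T = {
    0: ((1, 1), (0, 1)),
    1: ((1, 0), (1, 1)),
}


def matmul2(a, b):
    """Product of two 2x2 integer matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def stern_brocot_pair(bits):
    """Return the product T(b1)...T(bn) and the pair of its row sums."""
    m = ((1, 0), (0, 1))  # identity matrix
    for bit in bits:
        m = matmul2(m, T[bit])
    return m, (sum(m[0]), sum(m[1]))


product, pair = stern_brocot_pair([1, 0, 1, 0, 1])
print(product)  # ((5, 3), (8, 5))
print(pair)     # (8, 13)
```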


Stern-Brocot (1880) Transform for an alphabet of 3 symbols {0,1,2}

  • Now the matrices are T(0), T(1) and T(2) (shown as an image on the slide).

  • So we have {1,2,1,0,2} → {5, 31, 17}.

  • By a simple algorithm we get back {5, 31, 17} → {1,2,1,0,2}.
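Since the slide's matrices are missing from the transcript, the following sketch assumes that T(s) is the k × k identity matrix with row s replaced by ones. This assumption reproduces both examples given in the slides ({1,0,1,0,1} → {8,13} and {1,2,1,0,2} → {5,31,17}), but it is a reconstruction, not the authors' stated definition. The decoder is one possible "simple algorithm": under this assumption the largest entry of the vector marks the first symbol, and subtracting the other entries from it undoes one step.

```python
def encode(symbols, k):
    """Row sums of T(s1)...T(sn) for symbols in range(k), under the assumed T(s)."""
    v = [1] * k
    # Apply the factors to the all-ones vector, rightmost factor first.
    for s in reversed(symbols):
        v[s] = sum(v)  # row s of T(s) is all ones; the other rows are identity rows
    return v


def decode(v):
    """Recover the symbol string from its vector of row sums."""
    v = list(v)
    symbols = []
    while any(x != 1 for x in v):
        s = max(range(len(v)), key=v.__getitem__)  # index of the largest entry
        others = sum(v) - v[s]
        if v[s] <= others:
            raise ValueError("not the image of a string under this transform")
        symbols.append(s)
        v[s] -= others  # undo the leftmost factor T(s)
    return symbols


print(encode([1, 2, 1, 0, 2], 3))  # [5, 31, 17]
print(decode([5, 31, 17]))         # [1, 2, 1, 0, 2]
```

Working on the row-sum vector directly avoids forming the matrix product, so encoding and decoding each take one pass over the string.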


Inspired by the Stern-Brocot transform we develop a measure of similarity for strings

  • The second string below differs from the first in two places, yet the corresponding triples of numbers are similar. We have used this idea successfully to find similar parts of genes.

  • {B,A,A,B,A,A,B,G,A,B,G,G,G,B,B}

  • {B,A,B,B,A,A,G,A,B,G,G,G,B,B}

  • {0.740721, 0.0476564, 0.211623}

  • {0.735667,0.0487892,0.215544}


P-adic Dilation and Splicing/Translation operations on matrices and vectors

  • Definition 1: Let $M_{n,m}$ be the space of all $n \times m$ matrices. We define the p-adic dilation operation $D_p : M_{n,m} \to M_{n,pm}$, $p = 2, 3, \dots$, by

  • $D_p(M) = \{ M_{i,\lceil j/p \rceil},\ i = 1, \dots, n,\ j = 1, \dots, pm \}$,

  • where $\lceil x \rceil$ is the smallest integer greater than or equal to the number $x$.
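As a small illustration of Definition 1, the sketch below reads [x] as the ceiling of x, which is the reading that keeps the column index between 1 and m for j = 1, …, pm. Under that reading the dilation repeats every column p times. The function name `dilate` and the list-of-rows matrix representation are illustrative choices.

```python
def dilate(matrix, p):
    """p-adic dilation of a matrix given as a list of rows.

    Entry (i, j) of the result is M[i][ceil(j / p)] in 1-based indexing,
    which is row[j // p] in 0-based indexing.
    """
    return [[row[j // p] for j in range(p * len(row))] for row in matrix]


print(dilate([[1, 2], [3, 4]], 3))
# [[1, 1, 1, 2, 2, 2], [3, 3, 3, 4, 4, 4]]
```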



In Nature everything is unique, providing polymorphism. The Little Prince is right to say that his rose is unique in all the world.


Initial idea for discrete transforms coding local information

  • A mathematical transform mimicking antigen processing must code local information (as the Haar wavelet transform does), cutting data into pieces (peptides). Note that a protein (a symbolic sequence in an alphabet of 20 letters) is a union of peptides.

  • Introduce transforms/computational tools for string matching, for pattern recognition of grammars, and for detecting dynamical systems such as hidden Markov processes and Riesz products.

  • Results and experience from research projects:

  • European Project: IST-2000-26016, Immunocomputing

  • GSRT 2005-6: Mathematics for Bioinformatical applications – Multiresolution methods for the study of biodata.

  • Greek-Bulgarian project (2006-7), on Application of Wavelet Theory in Bioinformatics


Splicing vectors / vector translations / cyclic classes of numbers

  • Definition 2: Let $0_m = \{0, 0, \dots, 0\} \in M_{1,m}$. The p-adic vector translations $T_p : M_{1,m} \to M_{1,pm}$, $p = 2, 3, \dots$, are defined by $T_p(v) = \mathrm{Join}[0_{km}, v, 0_{jm}]$ with $k + j + 1 = p$, where Join means splicing vectors together.
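Definition 2 can be sketched as follows, assuming vectors are plain Python lists and that k counts the zero blocks placed in front of v, so that j = p − 1 − k zero blocks follow it. The helper names `translate` and `translations` are illustrative.

```python
def translate(v, p, k):
    """Splice k zero blocks before v and p - 1 - k zero blocks after it."""
    if not 0 <= k < p:
        raise ValueError("k must satisfy 0 <= k < p")
    m = len(v)
    j = p - 1 - k  # so that k + j + 1 == p
    return [0] * (k * m) + list(v) + [0] * (j * m)


def translations(v, p):
    """All p translations of v, one for each admissible k."""
    return [translate(v, p, k) for k in range(p)]


for row in translations([1, -1], 3):
    print(row)
# [1, -1, 0, 0, 0, 0]
# [0, 0, 1, -1, 0, 0]
# [0, 0, 0, 0, 1, -1]
```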



The case p = 3


Sparse Matrices (SPMM)

  • The matrices above, called SPMM, are generated iteratively by dilation and translation of block sub-matrices (work by N. Atreas, C. Karanikas and P. Polychronidou).

  • The determinant of SPMM is ± 1

  • The Inverse of SPMM is iteratively generated too.




The Gram-Schmidt orthonormalization of the SPMM matrices for p = 2 gives, for n = 1, 2, 3, the Haar matrices:
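The SPMM matrices themselves are not reproduced in this transcript, so the sketch below does not run Gram-Schmidt on them. As a related illustration only, it builds the standard orthonormal Haar matrices for p = 2 directly from the two operations defined earlier: column dilation of the previous matrix, plus translations of the block (1, −1), followed by row normalization. The function name `haar_matrix` is illustrative.

```python
import math


def haar_matrix(n):
    """Orthonormal 2^n x 2^n Haar matrix (n >= 1) via dilation and translation."""
    rows = [[1, 1], [1, -1]]
    for _ in range(n - 1):
        size = len(rows)
        # Dilation: repeat every column of the current matrix twice.
        dilated = [[x for x in row for _rep in range(2)] for row in rows]
        # Translation: place the block (1, -1) in each of the `size` positions.
        translated = [
            [0] * (2 * k) + [1, -1] + [0] * (2 * (size - 1 - k))
            for k in range(size)
        ]
        rows = dilated + translated
    # Scale every row to unit Euclidean norm.
    return [[x / math.sqrt(sum(y * y for y in row)) for x in row] for row in rows]


for row in haar_matrix(2):
    print([round(x, 3) for x in row])
```

Before normalization the rows are already mutually orthogonal: dilated rows are constant on each pair of adjacent columns, so they cancel against any translated (1, −1) block, and the translated blocks have disjoint supports.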


SPMM and Haar matrices, p = 3, n = 2


The Haar orthonormal system of $L^2[0,1]$ in base p = 3


Theorem: Given $t = \{t(k),\ k = 1, \dots, p^n\}$, find the coefficients $\{x(k),\ k = 1, \dots, p^n\}$ of the HRP

  • Theorem: Let $r_j$, $j = 1, \dots, p^n$, be the $j$-th row of the $p^n \times p^n$ Haar matrix, and let $R(p,n) = \prod_k (1 + x(k)\, r_k)$ be its Haar-Riesz product with coefficients $\{x(k),\ k = 1, \dots, p^n\}$. If $t = \{t(k),\ k = 1, \dots, p^n\}$ is any non-negative data, then the system $R(p,n) \cdot r_j = t \cdot r_j$ has a unique solution with respect to $\{x(k),\ k = 1, \dots, p^n\}$.


Riesz-Haar coefficients for the data $\{t_1, t_2, t_3, t_4, t_5\}$


Expanding the parentheses of the Haar-Riesz product we get: $\{t_1, t_2, t_3, t_4, t_5\}$


The system for the case p = 3 and n = 2. The coefficients of the system are elements of the corresponding matrix.


Find Hidden Markov structure of genes


Application: Given a collection of “Cantor Strings”, the Riesz-Haar Coefficients reveal the Cantor Grammar

  • Input: 729 (3^6) samples of a Cantor collection. Output: the Riesz-Haar coefficients. Observe that the collection is orthogonal to certain rows of the Haar matrix.


Thanks for listening