Visualization of K-Tuple Distribution in Prokaryote Complete Genomes and Their Randomized Counterparts

XIE Huimin (谢惠民)

Department of Mathematics, Suzhou University

and

HAO Bailin (郝柏林)

T-Life Research Center, Fudan University

Beijing Genomics Institute, Academia Sinica

Institute of Theoretical Physics, Academia Sinica


Prokaryote Complete Genomes (PCG)

[Diagram: topics linked to PCG]

  • Biology-inspired mathematics

  • Avoided and rare K-strings (K = 6-9, 15, 18)

  • Phylogeny based on PCG: compositional distance (success)

  • Species-specificity of avoidance

  • Combinatorics: Goulden-Jackson cluster method

  • Decomposition and reconstruction of AA sequences

  • Factorizable language

  • Phylogeny (failure)

  • Graph theory: Euler paths


1. 2D Histogram of K-Tuples


[Figure: square frame of the 2D histogram with labels g, c, t, a]


Thermoanaerobacter tengcongensis (K = 8)

[Figure: 2D histogram with labels g, c, a, t]


The Algorithm (Hao Histogram) Implemented at:

  • National Institute of Standards and Technology (NIST)

    http://math.nist.gov/~FHunt/GenPatterns/

  • European Bioinformatics Institute (EBI)

    http://industry.ebi.ac.uk/openBSA/bsa_viewers

    However, 2D histograms only; no 1D histograms.


Two Mathematical Problems

  • Dimensions of the complementary sets of portraits of tagged strings.

  • Number of true and redundant missing strings.

  • The two problems turn out to be one and the same, the first being a graphic representation of the second.


Two Methods to Solve the Problem

  • Combinatorial solution: Goulden-Jackson cluster method (1979); number of dirty and clean words.

  • Language theory solution: factorizable language, minimal deterministic finite-state automaton.


2. 1D Histogram of K-Tuples

  • Collect those K-tuples whose counts fall in a given bin,

  • Plot the number of such K-tuples versus the counts,

  • This is a 1D histogram or

  • An expectation curve.
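The steps above can be sketched in a few lines of Python. This is only an illustration, assuming overlapping K-tuples read along a linear sequence and a bin width of one count; the function names `ktuple_counts` and `k_histogram` are hypothetical and do not come from the slides.

```python
from collections import Counter


def ktuple_counts(seq, k):
    """Count every overlapping K-tuple in seq (linear, not circular)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def k_histogram(seq, k, alphabet="acgt"):
    """Map each count n to the number of distinct K-tuples occurring n times.

    K-tuples that never occur are reported under n = 0.
    Assumes seq contains only letters from `alphabet`.
    """
    counts = ktuple_counts(seq, k)
    hist = Counter(counts.values())
    missing = len(alphabet) ** k - len(counts)
    if missing:
        hist[0] = missing
    return dict(sorted(hist.items()))


toy = "acgtacgtaa"
print(k_histogram(toy, 2))  # {0: 11, 1: 1, 2: 4}
```

Plotting the dictionary values against its keys gives the 1D histogram; for a real genome one would use K around 8 and a wider bin.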


The effect of C+G content in 2D histograms of the original genome and the randomized sequence:


[Figure: Escherichia coli, original genome]

[Figure: Escherichia coli, randomized sequence]

[Figure: Haemophilus influenzae, randomized sequence]

[Figure: Mycobacterium leprae, original genome]

[Figure: Mycobacterium leprae, randomized sequence]

[Figure: Mycobacterium tuberculosis, original genome]

[Figure: Mycobacterium tuberculosis, randomized sequence]


G+C Content of Some Bacteria


3. Three Artificial Models Generating Sequences

  • Eiid: equal-probability independently and identically distributed model.

  • Niid: nonequal-probability independently and identically distributed model.

  • MMn: Markov model of order n
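A minimal sketch of the three generators follows, using only the Python standard library. The function names are hypothetical, and the Markov model is fitted here by counting transitions in a template sequence, which is one common choice and not necessarily the procedure used in the talk.

```python
import random
from collections import Counter, defaultdict


def eiid(n, rng, alphabet="acgt"):
    """Equal-probability i.i.d. sequence of length n."""
    return "".join(rng.choice(alphabet) for _ in range(n))


def niid(n, weights, rng, alphabet="acgt"):
    """Non-equal-probability i.i.d. sequence; weights follow alphabet order."""
    return "".join(rng.choices(alphabet, weights=weights, k=n))


def fit_markov(template, order):
    """Count, for each context of length `order`, which letter follows it."""
    table = defaultdict(Counter)
    for i in range(len(template) - order):
        table[template[i:i + order]][template[i + order]] += 1
    return table


def sample_markov(table, order, n, start, rng, alphabet="acgt"):
    """Generate n letters from the fitted model; `start` has length == order."""
    out = list(start)
    while len(out) < n:
        context = "".join(out[-order:]) if order else ""
        followers = table.get(context)
        if followers:
            letters, weights = zip(*followers.items())
            out.append(rng.choices(letters, weights=weights)[0])
        else:
            # Unseen context: uniform draw (an assumption of this sketch).
            out.append(rng.choice(alphabet))
    return "".join(out[:n])


rng = random.Random(0)
print(eiid(20, rng))
print(niid(20, [15, 35, 35, 15], rng))
print(sample_markov(fit_markov("acacacacac", 1), 1, 20, "a", rng))
```

The last line prints a strictly alternating sequence, because the template contains only the transitions a to c and c to a.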


Monte Carlo Method: estimation of expectation (ex) and standard deviation (sd) for an niid model

(The compositions of a, c, g, t are 15:35:35:15, the length of the sequence is [...], and K = 8.)
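The Monte Carlo estimate can be sketched as follows: generate many niid sequences, build the K-histogram of each, and take the mean and standard deviation of the histogram height at every count n. The sequence length, K, and number of trials below are toy values chosen so the example runs quickly; they are assumptions, not the settings of the slide.

```python
import random
from collections import Counter
from statistics import mean, stdev


def k_histogram(seq, k):
    """Number of distinct K-tuples occurring exactly n times, for n >= 1."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(counts.values())


def monte_carlo_ex_sd(weights, length, k, trials, max_n, seed=0):
    """Estimate ex and sd of the histogram height at n = 1..max_n.

    weights are relative frequencies of a, c, g, t; trials must be >= 2.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        seq = "".join(rng.choices("acgt", weights=weights, k=length))
        hist = k_histogram(seq, k)
        samples.append([hist.get(n, 0) for n in range(1, max_n + 1)])
    columns = list(zip(*samples))
    ex = [mean(col) for col in columns]
    sd = [stdev(col) for col in columns]
    return ex, sd


# Toy run with the 15:35:35:15 composition from the slide.
ex, sd = monte_carlo_ex_sd([15, 35, 35, 15], length=20_000, k=4,
                           trials=30, max_n=400)
print(max(range(len(ex)), key=ex.__getitem__) + 1)  # count n with the highest ex
```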


Validation of the Robustness of K-Histograms: a comparison of the absolute error from ex in an experiment, with sd as reference


Compare the population obtained by shuffling a given sequence with the population of sequences generated from a stochastic model.

F-test

t-test
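One way such a comparison could be set up is sketched below. It records the histogram height at a single count n for many shuffles of one template and for many fresh niid sequences, then computes the variance ratio (the F statistic) and Welch's t statistic. The choice of Welch's form, the toy parameters, and the function names are assumptions; the slide does not say which variant was used, and p-values are left out.

```python
import random
from collections import Counter
from statistics import mean, variance


def height_at(seq, k, n):
    """Number of distinct K-tuples occurring exactly n times in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sum(1 for c in counts.values() if c == n)


def two_sample_stats(xs, ys):
    """Return (F, t): variance ratio and Welch's t statistic."""
    vx, vy = variance(xs), variance(ys)
    if vx == 0 or vy == 0:
        raise ValueError("a sample has zero variance; statistics undefined")
    f = vx / vy
    t = (mean(xs) - mean(ys)) / (vx / len(xs) + vy / len(ys)) ** 0.5
    return f, t


rng = random.Random(0)
weights = [15, 35, 35, 15]
template = "".join(rng.choices("acgt", weights=weights, k=5000))

shuffled, modeled = [], []
for _ in range(40):
    letters = list(template)
    rng.shuffle(letters)  # keeps the letter counts of the template exactly
    shuffled.append(height_at("".join(letters), 3, 40))
    fresh = "".join(rng.choices("acgt", weights=weights, k=5000))
    modeled.append(height_at(fresh, 3, 40))

print(two_sample_stats(shuffled, modeled))
```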


4. A Theory for the Expectation Curve (1)

Definition. For each count $n$, define a random variable

$$X_n = \sum_{i=1}^{4^K} X_{n,i} \qquad (1)$$

where the random variable $X_{n,i}$ takes the value 1 if the i-th K-tuple occurs exactly $n$ times in the sequence, and the value 0 otherwise.

A Theory for the Expectation Curve (2)

Theorem. For each count $n$, the mathematical expectation of the random variable $X_n$ is given by

$$E(X_n) = \sum_{i=1}^{4^K} P(Y_i = n) \qquad (2)$$

where the random variable $Y_i$ is the number of occurrences of the K-tuple of the i-th type.
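The theorem is a direct consequence of linearity of expectation, as the short derivation below shows. The symbols are assumed labels, since the originals did not survive in the transcript: $X_n$ is the number of K-tuples occurring exactly $n$ times, $X_{n,i}$ is the indicator that the i-th K-tuple occurs exactly $n$ times, and $Y_i$ is the number of occurrences of the i-th K-tuple.

```latex
\begin{align*}
E(X_n) &= E\Bigl(\sum_{i=1}^{4^K} X_{n,i}\Bigr)
        = \sum_{i=1}^{4^K} E(X_{n,i}) \\
       &= \sum_{i=1}^{4^K} \bigl[\,1 \cdot P(Y_i = n) + 0 \cdot P(Y_i \neq n)\bigr]
        = \sum_{i=1}^{4^K} P(Y_i = n).
\end{align*}
```

No independence between different K-tuples is needed for this step, which is why the difficulty lies entirely in computing each $P(Y_i = n)$.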


The Exact Computation of the Expectation Curve

In order to compute the expectation curve we need to know the probability that the i-th K-tuple occurs exactly n times, for each i and n.

The Goulden-Jackson cluster method can be used successfully for the eiid model.

It is still difficult to do the computation for other models.


Two Experiments (for the eiid model):

[Figure: comparison with a K-histogram]

[Figure: comparison with the Monte Carlo method]

The red curves are the standard deviation estimates obtained by the Monte Carlo method.


Poisson Approximation for the Expectation Curve

For each K-tuple, calculate its expected number of occurrences $\lambda_i$ in a sequence of length N, then use the probability function of the Poisson distribution and sum over all K-tuples:

$$E(X_n) \approx \sum_{i=1}^{4^K} e^{-\lambda_i} \frac{\lambda_i^{\,n}}{n!}$$

where $X_n$ is the number of K-tuples occurring exactly $n$ times.

Remark. This follows from a theorem in Percus and Whitlock, ACM Transactions on Modeling and Computer Simulation, 5 (1995) 87-100 (the model, however, can only be eiid, and the tuples must be free of overlaps).
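The approximation can be sketched in Python as below. The sketch assumes an i.i.d. letter model and takes the expected count of a K-tuple w as (N - K + 1) times the product of its letter probabilities; the function name is hypothetical. It enumerates all K-tuples by brute force, so it is meant for small K only.

```python
import math
from itertools import product


def poisson_expectation_curve(probs, length, k, max_n):
    """Approximate E[X_n] for n = 0..max_n under an i.i.d. letter model.

    probs maps each letter to its (strictly positive) probability.
    """
    windows = length - k + 1
    curve = [0.0] * (max_n + 1)
    for word in product(probs, repeat=k):
        lam = windows * math.prod(probs[ch] for ch in word)
        log_lam = math.log(lam)
        for n in range(max_n + 1):
            # Poisson probability in log space, avoiding huge lam**n and n!.
            curve[n] += math.exp(-lam + n * log_lam - math.lgamma(n + 1))
    return curve


uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
curve = poisson_expectation_curve(uniform, length=1000, k=2, max_n=300)
print(round(sum(curve), 6))  # total number of 2-tuples: 16.0
```

For an niid model with K = 8, grouping K-tuples by letter composition before summing would avoid looping over all 65536 tuples.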


Comparison of Poisson approximation with K-histogram for U. urealyticum

Comparison of Poisson approximation with 7-histogram for Haemophilus influenzae

Comparison of Poisson approximation with 8-histogram for Haemophilus influenzae

A comparison of Poisson approximation with Monte Carlo method

In this computation the model is an niid, in which the parameters are taken from the randomized sequence of H. influenzae.


5. Analysis of the Mechanism of Multi-Modal K-Histograms

An example for H. influenzae. The length of its genome is 1830023. Under simplified conditions there are only 9 different types, as shown in the following list.

[Figure: the nine individual probability functions and their sum]

Notice the effect of the ratio of successive modes.

For E. coli the ratio is 0.968931, hence the result is quite different.
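A plausible reading of this slide, offered here as an assumption because the formulas are missing, is that the simplification sets p(a) = p(t) and p(c) = p(g) with K = 8. K-tuples then fall into K + 1 = 9 classes according to how many of their letters are c or g, and each class has its own expected count. The sketch below lists those classes; the G+C fraction 0.38 is an illustrative value, not a figure from the slides, and `mode_classes` is a hypothetical name.

```python
import math


def mode_classes(gc, length, k):
    """Group K-tuples by their number j of c/g letters.

    Assumes p(c) = p(g) = gc / 2 and p(a) = p(t) = (1 - gc) / 2.
    Returns a list of (j, number_of_tuples, expected_count) for j = 0..k.
    """
    p_s = gc / 2          # probability of c, and of g
    p_w = (1 - gc) / 2    # probability of a, and of t
    windows = length - k + 1
    rows = []
    for j in range(k + 1):
        n_tuples = math.comb(k, j) * 2 ** k   # choose positions, then letters
        lam = windows * p_s ** j * p_w ** (k - j)
        rows.append((j, n_tuples, lam))
    return rows


for j, n_tuples, lam in mode_classes(gc=0.38, length=1830023, k=8):
    print(j, n_tuples, round(lam, 2))
```

Successive expected counts differ by the constant factor p_s / p_w, or its inverse depending on direction. When that factor is far from 1 the nine Poisson peaks separate and the summed curve is multi-modal; when it is close to 1, as with the E. coli value quoted above, the peaks overlap.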


6. Analysis of Short-Range Correlation by K-Histograms

Two 8-histograms for E. coli: the left one is from its genome, and the right one is from its Markov model of order 1.



Using a Markov model of order 5 (MM5) and the Monte Carlo method to compare the 8-histogram of E. coli’s complete genome sequence with the ex and sd of MM5.

[Figure: the ratio curve (range 2-7 as labeled on the slide); the red curve is the expectation curve estimated from 50 simulations.]


Reference:

Huimin Xie, Bailin Hao, “Visualization of K-tuple distribution in prokaryote complete genomes and their randomized counterparts”, CSB2002: IEEE Computer Systems Bioinformatics Conference Proceedings, IEEE Computer Society, Los Alamitos, 2002, 31-42.


7. Discussion

Most of the results shown above are of an experimental nature; many problems are left for future study.

  • How to select a reasonable value of K?

  • How to apply 1D visualization to proteins?

  • What are the properties of the random variables?

  • How to compute exactly the expectation curve for the niid and MMn models?

  • Why is the Poisson approximation effective without considering the overlap of K-tuples?


Thanks!