Visualization of K-Tuple Distribution in Prokaryote Complete Genomes and Their Randomized Counterparts
XIE Huimin (谢惠民)
Department of Mathematics, Suzhou University
HAO Bailin (郝柏林)
T-Life Research Center, Fudan University
Beijing Genomics Institute, Academia Sinica
Institute of Theoretical Physics, Academia Sinica
Prokaryote Complete Genomes (PCG)
Avoided and Rare
Based on PCG
(Failure)
1. 2D Histogram of K-Tuples
However, 2D only, no 1D histograms.
The effect of c+g content in 2D histograms of the original genome and the randomized sequence:
Escherichia coli: original genome
Escherichia coli: randomized sequence
Haemophilus influenzae: randomized sequence
Mycobacterium leprae: original genome
Mycobacterium leprae: randomized sequence
Mycobacterium tuberculosis: original genome
Mycobacterium tuberculosis: randomized sequence
(the compositions of a,c,g,t are 15:35:35:15, the length of
sequence is , the value of K=8.)
Definition. For each $n \ge 0$, define a random variable
$$X_n = \sum_{i=1}^{4^K} \xi_{i,n},$$
where the random variable $\xi_{i,n}$ takes value 1 if the i-th K-tuple occurs
exactly n times in the sequence, and takes value 0 if it does not.
Theorem. For each $n \ge 0$, the mathematical expectation
of the random variable $X_n$ is given by
$$E(X_n) = \sum_{i=1}^{4^K} P(Y_i = n),$$
where the random variable $Y_i$ is the number of occurrences of
K-tuples of the i-th type.
In order to compute the expectation curve we need to know the probability $P(Y_i = n)$ for each $i$ and $n$.
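The histogram in the definition above can be computed directly from a sequence. The following is a minimal sketch, assuming the sequence contains only the letters a, c, g, t, that K-tuples are counted with overlapping windows, and that a randomized counterpart is obtained by shuffling the letters (which preserves base composition). The function names `k_histogram` and `randomized_counterpart` are hypothetical and do not come from the slides.

```python
import random
from collections import Counter

def k_histogram(seq, k):
    """Map n -> number of K-tuple types that occur exactly n times in seq."""
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Histogram over the occurrence numbers of the observed K-tuples.
    hist = Counter(counts.values())
    # K-tuple types that never occur (assumes a 4-letter alphabet).
    hist[0] = 4 ** k - len(counts)
    return dict(hist)

def randomized_counterpart(seq, seed=0):
    """Shuffle the letters of seq; composition is kept, order is destroyed."""
    letters = list(seq)
    random.Random(seed).shuffle(letters)
    return "".join(letters)

if __name__ == "__main__":
    genome = "acgtacgt"  # toy input, not real genome data
    print(k_histogram(genome, 2))
    print(k_histogram(randomized_counterpart(genome), 2))
```

Comparing the two printed histograms is the one-dimensional analogue of comparing an original genome with its randomized sequence.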
The Goulden-Jackson cluster method can be used successfully for the model of eiid.
It is still difficult to do the computation for other models.
Two experiments (for the eiid model), using a population of sequences generated from a stochastic model:
- compare with a K-histogram
- compare with the Monte Carlo method
The red curves are the standard deviation estimates
obtained by the Monte Carlo method.
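A Monte Carlo estimate of the expectation and standard deviation curves might look like the sketch below. It assumes an iid letter model (equal weights give the eiid case), overlapping K-tuple counting, and the sample standard deviation; the function name `simulate_histogram_stats` and its parameters are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev

def simulate_histogram_stats(length, k, runs, weights=(1, 1, 1, 1), seed=0):
    """Estimate mean and standard deviation of the K-histogram by simulation.

    Returns two lists indexed by n: the average and the sample standard
    deviation of the number of K-tuple types occurring exactly n times.
    Requires runs >= 2.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        # Draw one iid sequence over the alphabet a, c, g, t.
        seq = "".join(rng.choices("acgt", weights=weights, k=length))
        counts = Counter(seq[i:i + k] for i in range(length - k + 1))
        hist = Counter(counts.values())
        hist[0] = 4 ** k - len(counts)  # K-tuple types that never occur
        samples.append(hist)
    n_max = max(max(h) for h in samples)
    # A Counter returns 0 for a missing n, so every run contributes a value.
    ex = [mean(h[n] for h in samples) for n in range(n_max + 1)]
    sd = [stdev(h[n] for h in samples) for n in range(n_max + 1)]
    return ex, sd

if __name__ == "__main__":
    ex, sd = simulate_histogram_stats(length=2000, k=3, runs=50)
    for n, (e, s) in enumerate(zip(ex, sd)):
        print(n, round(e, 2), round(s, 2))
```

More runs tighten the estimates at the cost of run time; the sketch makes no attempt to be fast enough for genome-length sequences.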
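For each K-tuple, calculate its expected number of occurrences in a sequence of length N, then apply the probability function of the Poisson distribution and sum over all K-tuples: the expected number of K-tuple types occurring exactly n times is approximately $\sum_i e^{-\lambda_i} \lambda_i^n / n!$, where $\lambda_i$ is the expected number of occurrences of the i-th K-tuple.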
Remark. This follows from a theorem in Percus and Whitlock, ACM
Transactions on Modeling and Computer Simulation, 5 (1995) 87-100
(the model, however, can only be eiid, and the tuples must be overlapless).
In this computation the model is niid, with parameters
taken from the randomized sequence of H. influenzae.
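The Poisson approximation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the letter probabilities are supplied as a dictionary, the expected occurrence number of a K-tuple is taken to be (N - K + 1) times the product of its letter probabilities, and the function name `poisson_expected_histogram` is hypothetical.

```python
import math
from itertools import product

def poisson_expected_histogram(length, k, probs, n_max):
    """Approximate E[number of K-tuple types occurring exactly n times].

    probs maps each letter of "acgt" to its probability (iid model).
    Returns a list indexed by n = 0 .. n_max.
    """
    windows = length - k + 1  # number of overlapping K-tuple positions
    expected = [0.0] * (n_max + 1)
    for tup in product("acgt", repeat=k):
        # Expected number of occurrences of this K-tuple.
        lam = windows * math.prod(probs[b] for b in tup)
        # Poisson pmf by recurrence: p(0) = e^-lam, p(n+1) = p(n) * lam / (n+1).
        term = math.exp(-lam)
        for n in range(n_max + 1):
            expected[n] += term
            term *= lam / (n + 1)
    return expected

if __name__ == "__main__":
    # Toy composition in the spirit of a 15:35:35:15 sequence (values assumed).
    probs = {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15}
    curve = poisson_expected_histogram(length=20000, k=4, probs=probs, n_max=400)
    print(max(range(len(curve)), key=curve.__getitem__))
```

The loop visits all 4^K tuples, which is fine for K = 8 (65536 tuples). The recurrence starts from `exp(-lam)`, so it would underflow for very large expected counts (roughly above 700); that case is not handled here.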
An example for H. influenzae: the length of its genome is
1830023. Under the simplified conditions of
for , there are only 9 different types of , as shown
in the following list.
Note the effect of the ratio of successive modes:
Two 8-histograms for E. coli: the left one is from its genome, and the right one is from its Markov model of order 1.
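For readers who want to reproduce a comparison of this kind, a first-order Markov counterpart can be generated roughly as follows. This is a minimal sketch, assuming transition frequencies are estimated from adjacent letter pairs of the input sequence; the function names `fit_markov1` and `generate_markov1` are hypothetical and the slides do not specify how the model was fitted.

```python
import random
from collections import Counter, defaultdict

def fit_markov1(seq):
    """Count transitions a -> b between adjacent letters of seq."""
    trans = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        trans[a][b] += 1
    return trans

def generate_markov1(trans, first, length, seed=0):
    """Generate a sequence of the given length from the transition counts."""
    rng = random.Random(seed)
    states = sorted(trans)
    out = [first]
    while len(out) < length:
        nxt = trans.get(out[-1])
        if nxt:
            symbols, weights = zip(*nxt.items())
            out.append(rng.choices(symbols, weights=weights)[0])
        else:
            # Letter with no observed successor: pick any known state.
            out.append(rng.choice(states))
    return "".join(out)

if __name__ == "__main__":
    model = fit_markov1("acgtacgtaacc")  # toy input, not real genome data
    print(generate_markov1(model, first="a", length=20))
```

The generated sequence can then be passed to any K-histogram routine in place of the genome. Higher-order models follow the same pattern with longer contexts as dictionary keys.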
Comparison of the 8-histogram of E. coli's complete genome
sequence with the expectation (ex) and standard deviation (sd) of MM5.
This is the ratio curve.
The red curve is the expectation curve estimated from 50 simulation runs.
Huimin Xie, Bailin Hao, “Visualization of K-tuple distribution in prokaryote complete genomes and their randomized counterparts”, CSB2002: IEEE Computer Systems Bioinformatics Conference Proceedings, IEEE Computer Society, Los Alamitos, 2002, 31-42.
Most of the results shown above are experimental in nature; many problems are left for future study.
Thanks!