Kolmogorov Complexity for analysis of DNA sequence

Kolmogorov Complexity for analysis of DNA sequence Shijun Tang, Thiraphat Meesumrarn, Gaith Albadarin

Outline
• Kolmogorov Complexity: The Complexity of DNA; Methods
• Quantum Kolmogorov Complexity: Qubit and Definition of QKC

Kolmogorov Complexity
• The Kolmogorov complexity of any string x ∈ {0, 1}∗ is defined as C(x) := min{ℓ(p) | U(p) = x}, where U is a universal computer and ℓ(p) is the length of the program p.
• In words: the Kolmogorov complexity of x is the length of the shortest program that produces x as its output.
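C(x) cannot be computed exactly in general, so in practice it is usually bounded from above by the output length of a real compressor. The following is a minimal sketch under that assumption: zlib stands in for the universal machine U, and the helper name `compressed_length` is hypothetical.

```python
import os
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input.

    This is only an upper-bound proxy for C(x): zlib is one particular
    compressor, not the shortest program on a universal machine.
    """
    return len(zlib.compress(data, 9))

regular = b"ACGT" * 2500          # highly repetitive, 10,000 bytes
random_like = os.urandom(10000)   # incompressible with high probability

# The repetitive input compresses to far fewer bytes than the random one.
print(compressed_length(regular), compressed_length(random_like))
```

A different compressor would give different numbers; only the ordering between regular and random-looking inputs is meant to be illustrative.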

The Complexity of DNA • “genetic language” in DNA sequences (A, C, G, and T) • heterogeneity in DNA sequences (not random) • the long-range correlation • Compression

Methods • Entropy • Spectral Analysis • Kolmogorov Complexity

Entropy • Clausius entropy • Boltzmann entropy • Shannon entropy • Kolmogorov entropy • Tsallis entropy • Approximate entropy • Sample entropy • Multiscale entropy • ...

Entropy • Jensen-Shannon distance: the difference between the entropy calculated from the whole system and the weighted sum of the entropies calculated from the subsystems • The Jensen-Shannon distance D(i) is computed for each possible partition point i along the DNA sequence

[Figure: D(i) along a sequence of 230,208 nucleotides, with marked points 1 to 4 and a label near position 189,000]

• The larger D(i), the bigger the difference between the two subsequences obtained by partitioning at point i, and the better that point is as a place to partition the sequence • The average value of D(i) for a random sequence is at least 10 times lower than that for the yeast sequence • The ups and downs in D(i) for the random sequence are purely random fluctuations
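The partition scan can be sketched as follows, taking D(i) to be the Shannon entropy of the whole sequence minus the length-weighted entropies of the two parts. The function names and the toy sequence are hypothetical, and the brute-force scan is quadratic in the sequence length, so it is only meant for short examples.

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the symbol frequencies in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(seq).values())

def js_distance(seq: str, i: int) -> float:
    """D(i): entropy of the whole sequence minus the length-weighted
    entropies of the two parts seq[:i] and seq[i:]."""
    n = len(seq)
    left, right = seq[:i], seq[i:]
    return (shannon_entropy(seq)
            - (len(left) / n) * shannon_entropy(left)
            - (len(right) / n) * shannon_entropy(right))

def best_partition(seq: str) -> int:
    """Partition point with the largest D(i), scanning 1 <= i < len(seq)."""
    return max(range(1, len(seq)), key=lambda i: js_distance(seq, i))

toy = "AT" * 50 + "GC" * 50   # two compositionally different halves
print(best_partition(toy), js_distance(toy, 100))
```

On this toy input the largest D(i) falls at the boundary between the AT-rich and GC-rich halves, which is the behaviour the slide describes for choosing a partition point.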

Spectral Analysis • The power spectrum represents the correlation structure in a sequence according to wavelength (or frequency f = c/wavelength). • The power at a given frequency, P(f), is the contribution from that frequency component to the total variance of the fluctuation in the sequence. • A random sequence lacks correlation at any length scale, so its power spectrum is flat.
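A small sketch of a power spectrum for a symbolic sequence is given below. It assumes one common way of turning DNA into numbers, a 0/1 indicator sequence for a single base, which the slide does not specify; the function names are hypothetical, and the direct O(n²) transform is used only to keep the example free of dependencies.

```python
import cmath

def power_spectrum(signal):
    """Return P(k) = |X_k|^2 / n for k = 1 .. n//2, where X_k is the
    discrete Fourier transform of the mean-subtracted signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    spectrum = []
    for k in range(1, n // 2 + 1):
        x_k = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, v in enumerate(centered))
        spectrum.append(abs(x_k) ** 2 / n)
    return spectrum

def indicator(seq, base):
    """Assumed numeric mapping: 1.0 where seq has `base`, else 0.0."""
    return [1.0 if s == base else 0.0 for s in seq]

seq = "ACG" * 40  # period-3 toy sequence, n = 120
p = power_spectrum(indicator(seq, "A"))
peak_k = max(range(len(p)), key=p.__getitem__) + 1  # list index 0 is k = 1
print(peak_k, len(seq) / peak_k)  # frequency index and its wavelength
```

For a periodic toy sequence the power concentrates at the frequency index matching the period, in contrast to the flat spectrum expected for a random sequence.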

Kolmogorov Complexity for Analysis of DNA The search for DNA regions with low complexity is one of the pivotal tasks of modern structural analysis of complete genomes. The low complexity may be preconditioned by strong inequality in nucleotide content (biased composition), by tandem or dispersed repeats or by palindrome-hairpin structures, as well as by a combination of all these factors.

Four types of repeat, differing by orientation and localization in direct or complementary chains, are considered: direct, symmetric, inverted and direct complementary. • Direct and inverted repeats serve as standard prototypes. • Symmetric: the repeated sequence is oppositely oriented on the same DNA strand. • Direct complementary: a direct repeat on the complementary DNA strand.

Nucleotide sequence: the AP2 transcription factor binding site, GTGCCCCGCGGGAACCCCGC. Black and gray arrows mark the copied fragments and their prototypes. A tandem repeat characterized by partial overlapping of the prototype on the copied fragment is marked by a dotted line. In this decomposition, the first one-letter components, G and T, are produced by an operation generating a novel symbol. The complexity of this 20-letter sequence = 10 [the number of components in H(S)].

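Lempel-Ziv complexity
• S and Q are two strings; SQ = S + Q is their concatenation; SQP is SQ with its last letter deleted; V(SQP) is the set of all substrings of SQP.
• Start with c(n) = 1. Assume $S = s_1 s_2 \ldots s_r$ and $Q = s_{r+1}$.
• If Q ∈ V(SQP), S stays the same and Q is extended by one letter: $Q = s_{r+1} s_{r+2}$.
• Repeat until Q ∉ V(SQP), that is, until $Q = s_{r+1} s_{r+2} \ldots s_{r+i}$ is not a substring of $s_1 s_2 \ldots s_r s_{r+1} \ldots s_{r+i-1}$. Then increase c(n) by 1.
• Update $S = s_1 s_2 \ldots s_{r+i}$ and $Q = s_{r+i+1}$, and continue until Q takes the final letter.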
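The parsing procedure above can be sketched in Python as follows. This is a minimal, unoptimised illustration (the function name `lz76_components` is hypothetical): Q is extended while it still occurs in SQP, and a final fragment that runs to the end of the sequence while still being a copy is counted as one component.

```python
def lz76_components(s: str) -> int:
    """Count the components c(n) of the Lempel-Ziv (1976) parsing of s."""
    n = len(s)
    if n == 0:
        return 0
    c = 1   # the first letter is always a new component
    r = 1   # length of the already parsed prefix S
    while r < n:
        i = 1
        # Q = s[r:r+i], SQP = s[:r+i-1]; extend Q while it can be copied.
        while r + i <= n and s[r:r + i] in s[:r + i - 1]:
            i += 1
        c += 1   # Q is a new component (or the last, fully copied one)
        r += i
    return c

print(lz76_components("10101010"))
```

The substring test makes this quadratic or worse in the sequence length, which is acceptable for short windows but not for whole genomes.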

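• b(n) is the complexity value of a random sequence as n tends to infinity: $b(n) = \lim_{n \to \infty} c(n) = \frac{n}{\log_2 n}$
• Thus the normalized complexity is $C_{LZ} = c(n)/b(n)$
• For a random sequence, $C_{LZ} \to 1$; for an ordered sequence, $C_{LZ} \to 0$
• The smaller the complexity, the slower the variation: the data change in a regular way and show good periodicity.

The calculation of c(n) (Lempel-Ziv complexity, 1976), for the sequence S = (10101010)
1) S = s1 = 1, Q = s2 = 0, SQ = 10, SQP = 1. Q ∉ V(SQP), so Q is an insertion: SQ = 1●0
2) S = s1s2 = 10, Q = s3 = 1, SQ = 101, SQP = 10. Q ∈ V(SQP), so Q is a duplication: SQ = 1●0●1
3) S = s1s2 = 10, Q = s3s4 = 10, SQ = 1010, SQP = 101. Q ∈ V(SQP), so Q is a duplication: SQ = 1●0●10
• Repeating steps 2) and 3), Q remains a duplication up to the last letter: S = 1●0●101010, c(n) = 3
• $b(8) = 8/\log_2 8 = 8/3 \approx 2.67$, so the normalized complexity is $C_{LZ} = c(8)/b(8) = 3/2.67 \approx 1.125$. At n = 8 the asymptotic normalization is not yet meaningful; for a longer sequence with the same period, c(n) stays at 3 while b(n) grows (n = 1024 gives $b = 102.4$ and $C_{LZ} \approx 0.03$).
• Thus the complexity of this sequence is low, because the sequence is periodic.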

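Other estimates of text complexity The evaluation of complexity in a text region CWF by Wootton and Federhen (7) is given by the formula $C_{WF} = \frac{1}{N} \log_K \left( \frac{N!}{\prod_{i=1}^{K} n_i!} \right)$, where N is the window size, $n_i$ is the number of occurrences of letter i in the window, and K is the alphabet size (K = 4 for nucleotides).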

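Linguistic complexity can also be defined as the ratio of the sum of numbers of words occurring in a sequence analyzed to the maximum possible number of such words (12): $C_L = \sum_k V_k / \sum_k V_{\max,k}$, where $V_k$ is the number of distinct words of length k found in the sequence and $V_{\max,k}$ is the largest number of distinct words of length k that a sequence of the same length could contain.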
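The ratio described in words above can be sketched as follows. Two details are assumptions made here for illustration and are not spelled out on the slide: word lengths run from 1 to the sequence length N, and the maximum possible number of distinct words of length k is taken as min(4^k, N − k + 1). The function name is hypothetical.

```python
def linguistic_complexity(seq: str, alphabet_size: int = 4) -> float:
    """Sum of distinct word counts over all word lengths, divided by the
    sum of the maximum possible counts (assumed: min(K**k, N - k + 1))."""
    n = len(seq)
    observed = 0
    possible = 0
    for k in range(1, n + 1):
        # distinct words of length k that actually occur in seq
        observed += len({seq[i:i + k] for i in range(n - k + 1)})
        # at most K**k different words exist, and seq has n - k + 1 positions
        possible += min(alphabet_size ** k, n - k + 1)
    return observed / possible

print(linguistic_complexity("AAAA"), linguistic_complexity("ACGT"))
```

A homopolymer scores low because every word length contributes a single distinct word, while a sequence with no repeated words reaches the maximum of 1.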

Implementation and Results Calculation mode in a sliding window • a single extended sequence • a group of relatively short sequences up to 1 kb in length. A table of complexity values is constructed for a window of given size N sliding along the sequence. The sequence complexity is assigned to the window center. The calculation mode in a sliding window (complexity profile) is demonstrated here using the example of the Borrelia burgdorferi genome. In Figure 2, complexity profiles for a window sliding along the sequence are illustrated.
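A complexity profile of this kind can be sketched as below. The complexity measure used here is a plain Lempel-Ziv component count with direct repeats only, chosen for brevity; it is an assumption and not necessarily the measure behind Figure 2. The function names, window size and toy sequence are hypothetical.

```python
def lz_components(s: str) -> int:
    """Number of Lempel-Ziv (1976) components of s, direct repeats only."""
    n, c, r = len(s), 1, 1
    if n == 0:
        return 0
    while r < n:
        i = 1
        while r + i <= n and s[r:r + i] in s[:r + i - 1]:
            i += 1
        c += 1
        r += i
    return c

def complexity_profile(seq: str, window: int, step: int = 1):
    """List of (window_center, complexity) pairs for a sliding window."""
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        center = start + window // 2   # value is assigned to the window center
        profile.append((center, lz_components(seq[start:start + window])))
    return profile

toy = "A" * 40 + "ACGTTGCATCGGATCCTAGA"   # low-complexity run, then mixed text
for center, value in complexity_profile(toy, window=20, step=5):
    print(center, value)
```

Windows that lie entirely inside the poly-A run get the lowest values, so low-complexity regions show up as dips in the profile.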

http://news.bbc.co.uk/2/hi/8236943.stm

Quantum Kolmogorov Complexity
Are quantum computers more powerful than classical computers?
• Quantum entanglement
• Quantum factorization of integers
• Quantum computers can solve some problems faster than classical computers (→ Shor’s factoring algorithm).

8678234894390455325820387658927648846728276488478357578857990101745939579360286782348943904553258203876589276488467282764884783575788579901017459395793602 38757578689764649202092923747567520379898000084773622344552626377837477477476 46575868799898899995311906422876539300576869504869503847565674385565746488765 89005088573342257947602867756958696986758959511122344756900768768779957500472 66789953304578677748765778369119087568204639293000027264558393685793948745688 47637479496316118939005409586870347636374856969975765356445784995969976655610 98443348899046881020480568572231018209586704589944806808908069887677575969061 234894390498988999953119064799010174593957936
>> 2^(129/2) = 2.6088e+019
>> 2^125 = 4.2535e+037
>> 2^200 = 1.6069e+060

Prime factorization of large numbers: the principle of current cryptography
• In 1994, 1600 fast workstations obtained the prime factors of a number with L = 129 in about 8 months. For L = 250, the estimate is 800,000 years.
Factorization of integers
• A number N of approximate length L bits lies in the range $0$ to $2^L - 1$.
• If N is composite, it has a factor in the range $(1, \sqrt{N}]$.
• Trying each number in this range to find a factor of N takes up to $S \approx \sqrt{N} = 2^{L/2}$ steps.
• But for Shor’s quantum computation, $S \sim \mathrm{poly}(\log N)$.
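The step count for the classical search can be made concrete with a short sketch. It implements plain trial division up to the integer square root and prints the 2^(L/2) estimate for the two lengths mentioned on the slide; the function name is hypothetical, and trial division is used only because it matches the counting argument, not because it is the best classical method.

```python
import math

def trial_division(n: int):
    """Return (smallest nontrivial factor, number of candidates tried),
    or (None, tried) if no factor up to sqrt(n) exists (n is prime)."""
    tried = 0
    for d in range(2, math.isqrt(n) + 1):
        tried += 1
        if n % d == 0:
            return d, tried
    return None, tried

# Worst-case number of candidates for an L-bit number is about 2**(L/2).
for bits in (129, 250):
    print(bits, f"{2 ** (bits / 2):.4e}")

print(trial_division(15), trial_division(101), trial_division(89 * 97))
```

Even at a billion candidates per second, a count of the order of 2^(129/2) is far out of reach, which is the point of the comparison with a polynomial number of steps.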

The Factoring Firestorm (Peter Shor, 1994)
18819881292060796383869723946165043 98071635633794173827007633564229888 59715234665485319060606504743045317 38801130339671619969232120573403187 9550656996221305168759307650257059
= 3980750864240649373971 2550055038649119906436 2342526708406385189575 946388957261768583317
× 4727721461074353025362 2307197304822463291469 5302097116459852171130 520711256363590397527
• The best known classical algorithm takes time that grows faster than any polynomial in log N
• Shor’s quantum algorithm takes time polynomial in log N
• An efficient algorithm for factoring breaks the RSA public key cryptosystem

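Qubit • Pure state of a qubit: $|\psi\rangle = c_1|0\rangle + c_2|1\rangle$ with $|c_1|^2 + |c_2|^2 = 1$ • Basis: $\{|0\rangle, |1\rangle\}$ • Superposition of the states $|0\rangle$ and $|1\rangle$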

Qubit: • The element carrying information: the quantum state • |0>, |1> and any linear combination (superposition) c1|0> + c2|1> • Definition (Qubit Strings): a qubit string σ is a state vector or density operator

Quantum Parallelism
• A quantum computer can perform $2^n$ operations at the same time due to superposition: F[000], F[001], F[010], ..., F[111]
• However, we get only one answer when we measure the result: F[a,b,c]

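The Discrete Fourier Transform
• Assume L qubits hold any number x from $0$ to $2^L - 1$
• Any number x can be expressed as the state $|x\rangle = |x_{L-1} x_{L-2} \ldots x_1 x_0\rangle = |x_{L-1}\rangle |x_{L-2}\rangle \cdots |x_1\rangle |x_0\rangle$, where $x = \sum_{j=0}^{L-1} x_j 2^j$ and the product of kets is a tensor product
• $A_j$ acts only on the qubit represented by the j-th atom
• The operator $|i_j\rangle\langle k_j|$ acts on the state $|n_j\rangle$ as $|i_j\rangle\langle k_j|\,|n_j\rangle = \delta_{kn} |i_j\rangle$
• $A_j|0_j\rangle = \frac{1}{\sqrt{2}}(|0_j\rangle + |1_j\rangle)$, $A_j|1_j\rangle = \frac{1}{\sqrt{2}}(|0_j\rangle - |1_j\rangle)$
• $B_{jk}|0_{jk}\rangle = |0_{jk}\rangle$, $B_{jk}|1_{jk}\rangle = |1_{jk}\rangle$, $B_{jk}|2_{jk}\rangle = |2_{jk}\rangle$
• $B_{jk}|3_{jk}\rangle = \exp(i\pi/2^{k-j})\,|3_{jk}\rangle$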

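$A_0 B_{01} B_{02} A_1 B_{12} A_2 |x\rangle = \frac{1}{\sqrt{8}} \sum_{c=0}^{7} e^{2\pi i c x/8} |c\rangle$ (up to reversing the order of the output bits). For example, $x = 2$ gives $\frac{1}{\sqrt{8}}\{(|0\rangle+|4\rangle) - (|2\rangle+|6\rangle) + i(|1\rangle+|5\rangle) - i(|3\rangle+|7\rangle)\}$. So $A_0 B_{01} B_{02} A_1 B_{12} A_2$ performs a discrete Fourier transform!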
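The phases in the three-qubit example can be checked numerically. The sketch below evaluates the discrete Fourier transform amplitudes exp(2πi·x·c/8)/√8 directly rather than simulating the gate sequence; the input x = 2 is inferred here from the pattern of phases 1, i, −1, −i and is an assumption, as is the function name.

```python
import cmath
import math

def dft_amplitudes(x: int, n_qubits: int = 3):
    """Amplitudes of (1/sqrt(q)) * sum_c exp(2*pi*i*x*c/q) |c>, q = 2**n_qubits."""
    q = 2 ** n_qubits
    return [cmath.exp(2j * cmath.pi * x * c / q) / math.sqrt(q)
            for c in range(q)]

amps = dft_amplitudes(2)
for c, a in enumerate(amps):
    # Rescale by sqrt(8) so each entry is a bare phase: 1, i, -1 or -i.
    phase = a * math.sqrt(8)
    print(c, round(phase.real), round(phase.imag))
```

States |0> and |4> come out with phase 1, |2> and |6> with −1, |1> and |5> with i, and |3> and |7> with −i, matching the grouping in the expression above.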

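Shor’s algorithm
[1] Quantum Fourier transform: $|x\rangle \to \frac{1}{\sqrt{q}} \sum_{c=0}^{q-1} e^{2\pi i x c/q} |c\rangle$ ⇒ finding the period $r$ of the periodic function $f(x) = a^x \bmod N$
[2] $a^r \equiv 1 \pmod N$, then find the greatest common divisors $\gcd(a^{r/2} - 1, N)$ and $\gcd(a^{r/2} + 1, N)$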
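The classical part of the procedure can be sketched as follows, assuming the usual formulation: find the period r of f(x) = a^x mod N, then take greatest common divisors of a^(r/2) ± 1 with N. The period is found here by brute force, which is exactly the step the quantum Fourier transform is meant to replace; the function names and the example N = 15, a = 7 are hypothetical choices.

```python
from math import gcd

def find_period(a: int, n: int) -> int:
    """Smallest r > 0 with a**r % n == 1, by brute force.
    Requires gcd(a, n) == 1, otherwise no such r exists."""
    r, value = 1, a % n
    while value != 1:
        value = (value * a) % n
        r += 1
    return r

def factors_from_period(a: int, n: int):
    """Return a nontrivial factor pair of n, or None if this a fails."""
    g = gcd(a, n)
    if g != 1:
        return g, n // g          # lucky guess: a already shares a factor
    r = find_period(a, n)
    if r % 2 == 1:
        return None               # odd period: try another a
    half = pow(a, r // 2, n)
    if half == n - 1:
        return None               # a^(r/2) = -1 mod n gives trivial gcds
    p = gcd(half - 1, n)
    return p, n // p

print(find_period(7, 15), factors_from_period(7, 15))
```

When the function returns None, the standard remedy is to pick a different a and repeat.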

N-gate

Quantum computers
• U : input qubit string σ → output qubit string U(σ)
• Definition (Quantum Kolmogorov Complexity): Let U be a universal quantum computer and δ > 0. Then, for every qubit string ρ, define QC^δ(ρ) = min{ℓ(σ) | ||ρ − U(σ)||Tr ≤ δ}
• To measure the difference between two qubit strings ρ and σ, it is natural to use the trace distance, which is defined as ||ρ − σ||Tr := (1/2) Tr|ρ − σ|
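For a single qubit the trace distance has a closed form, which makes a small worked example possible. The sketch below assumes both inputs are valid 2×2 density matrices (Hermitian, trace one) given as nested lists; the function name is hypothetical, and larger qubit strings would need a general eigenvalue routine instead.

```python
import math

def trace_distance_qubit(rho, sigma):
    """(1/2) Tr|rho - sigma| for two 2x2 density matrices.

    The difference is Hermitian and traceless, so its eigenvalues are
    +lam and -lam with lam = sqrt(a**2 + |b|**2), where a is the top-left
    and b the top-right entry of the difference. Half the sum of the
    absolute eigenvalues is therefore lam itself.
    """
    a = complex(rho[0][0] - sigma[0][0]).real
    b = complex(rho[0][1] - sigma[0][1])
    return math.sqrt(a * a + abs(b) ** 2)

zero = [[1, 0], [0, 0]]            # |0><0|
one = [[0, 0], [0, 1]]             # |1><1|
plus = [[0.5, 0.5], [0.5, 0.5]]    # |+><+|, with |+> = (|0> + |1>)/sqrt(2)

print(trace_distance_qubit(zero, one), trace_distance_qubit(zero, plus))
```

Orthogonal states are at distance 1 and identical states at distance 0, so a tolerance δ between these values controls how closely the output U(σ) must reproduce ρ.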

References:
• Y. L. Orlov and V. N. Potapov. Complexity: an internet resource for analysis of DNA sequence complexity. Nucleic Acids Research, 2004, Vol. 32
• Fabio Benatti, Tyll Krüger, Markus Müller, Rainer Siegmund-Schultze, Arleta Szkoła. Entropy and Quantum Kolmogorov Complexity: A Quantum Brudno’s Theorem. Communications in Mathematical Physics 265, 437–461 (2006)
