Kolmogorov Complexity for analysis of DNA sequence


# Kolmogorov Complexity for analysis of DNA sequence

Kolmogorov Complexity for analysis of DNA sequence. Shijun Tang Thiraphat Meesumrarn Gaith Albadarin. Outline. Kolmogorov Complexity The Complexity of DNA Methods Quantum Kolmogorov Complexity Qubit and Definition of QKC. Kolmogorov Complexity.

Presentation Transcript

### Kolmogorov Complexity for analysis of DNA sequence

Shijun Tang

Thiraphat Meesumrarn

Outline
• Kolmogorov Complexity
  - The Complexity of DNA
  - Methods
• Quantum Kolmogorov Complexity
  - Qubit and Definition of QKC

Kolmogorov Complexity

The Kolmogorov complexity of any string x ∈ {0, 1}∗ is defined as:

C(x) := min{ℓ(p) | U(p) = x}

where U is a universal computer and ℓ(p) is the length of the program p. In other words, the Kolmogorov complexity of x is the length of the shortest program that produces x as its output.
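C(x) itself is not computable, so in practice it is usually approximated from above. The sketch below is only an illustration under that assumption: it uses the length of a zlib-compressed string as a crude stand-in for the length of the shortest program. The function name `compressed_length` and the two test strings are hypothetical and do not come from the slides.

```python
import random
import zlib


def compressed_length(text: str) -> int:
    """Bytes needed by zlib for `text`: a rough upper-bound proxy for C(x)."""
    return len(zlib.compress(text.encode("ascii"), 9))


# A strictly periodic sequence and a pseudo-random one of the same length.
periodic = "AC" * 500
rng = random.Random(0)
shuffled = "".join(rng.choice("ACGT") for _ in range(1000))

print(compressed_length(periodic), compressed_length(shuffled))
```

The periodic string compresses to far fewer bytes than the pseudo-random one, which is the behaviour the definition predicts: a short program ("repeat AC 500 times") suffices for the first but not for the second.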

The Complexity of DNA
• “genetic language” in DNA sequences (A, C, G, and T)
• heterogeneity in DNA sequences (not random)
• the long-range correlation
• Compression
Methods
• Entropy
• Spectral Analysis
• Kolmogorov Complexity
Entropy

• Clausius Entropy
• Boltzmann Entropy
• Shannon Entropy
• Kolmogorov Entropy
• Tsallis Entropy
• Approximate Entropy
  - Sample Entropy
  - Multiscale Entropy
• …

Entropy
• Jensen-Shannon distance: the difference between the entropy calculated from the whole system and the weighted sum of the entropies calculated from the subsystems
• The Jensen-Shannon distance D(i) is computed for each possible partition point i along the DNA sequence

[Figure: D(i) along a sequence of 230,208 nucleotides, with four numbered points (1 to 4) and a marked position near 189,000]

The larger D(i), the greater the difference between the two subsequences obtained by partitioning at point i, and the better that point is as a place to partition the sequence.

• the average value of D(i) of random sequence is at least 10 times lower than that for the yeast sequence.
• These ups and downs in D(i) for the random sequence are purely random fluctuations
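The definition of D(i) above can be sketched directly. This is a minimal illustration, assuming Shannon entropy over single nucleotides and weights equal to the relative lengths of the two subsequences; the names `shannon_entropy`, `js_profile` and `best_partition` are hypothetical, and the quadratic running time makes it suitable only for short sequences.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the letter frequencies in seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def js_profile(seq: str) -> list[float]:
    """D(i) for every partition point i = 1 .. len(seq) - 1."""
    n = len(seq)
    h_whole = shannon_entropy(seq)
    profile = []
    for i in range(1, n):
        left, right = seq[:i], seq[i:]
        weighted = (i / n) * shannon_entropy(left) + ((n - i) / n) * shannon_entropy(right)
        profile.append(h_whole - weighted)
    return profile


def best_partition(seq: str) -> int:
    """Partition point i with the largest D(i)."""
    profile = js_profile(seq)
    return 1 + max(range(len(profile)), key=profile.__getitem__)
```

For a sequence made of 50 A's followed by 50 C's, `best_partition` returns 50, the boundary between the two homogeneous halves.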
Spectral Analysis
• Power spectrum: represents the correlation structure in a sequence according to wavelength (or frequency f = c/wavelength).
• The power at a given frequency, P(f), is the contribution from that frequency component to the total variance of the fluctuation in the sequence.
• A random sequence lacks correlation at any length scale, and its power spectrum is flat.
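As a rough illustration of P(f), the sketch below turns a DNA sequence into a 0/1 indicator signal for one nucleotide and computes its power spectrum with a plain discrete Fourier transform. The indicator encoding, the normalization by sequence length, and the names `indicator` and `power_spectrum` are assumptions made for this example, not the method used in the slides.

```python
import cmath


def indicator(seq: str, base: str) -> list[float]:
    """1.0 where seq has `base`, 0.0 elsewhere."""
    return [1.0 if s == base else 0.0 for s in seq]


def power_spectrum(signal: list[float]) -> list[float]:
    """Power at frequencies k/n for k = 1 .. n//2 (mean removed first)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


spectrum = power_spectrum(indicator("ACGT" * 8, "A"))
```

For this period-4 sequence of length 32, the power is concentrated at k = 8 (wavelength 4) and its harmonic, and is zero at k = 1. A random sequence would instead spread its power roughly evenly over all k.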

Kolmogorov Complexity for Analysis of DNA

The search for DNA regions with low complexity is one of the pivotal tasks of modern structural analysis of complete genomes.

The low complexity may be preconditioned by strong inequality in nucleotide content (biased composition), by tandem or dispersed repeats or by palindrome-hairpin structures, as well as by a combination of all these factors.

Four types of repeat, differing by orientation and localization in direct or complementary chains, are considered: direct, symmetric, inverted and direct complementary.

• Direct and inverted repeats serve as standard prototypes.
• Symmetric: the repeated sequence is oppositely oriented on the same DNA strand.
• Direct complementary: a direct repeat on the complementary DNA strand.
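The four repeat types can be illustrated by the copy that each one would produce from a prototype fragment. The sketch assumes the usual convention that an inverted repeat is the reverse complement of its prototype; the function names are hypothetical.

```python
# Watson-Crick complement of each nucleotide.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def direct(fragment: str) -> str:
    """Same orientation, same strand."""
    return fragment


def symmetric(fragment: str) -> str:
    """Opposite orientation on the same strand (plain reversal)."""
    return fragment[::-1]


def direct_complementary(fragment: str) -> str:
    """Same orientation, read from the complementary strand."""
    return fragment.translate(COMPLEMENT)


def inverted(fragment: str) -> str:
    """Reverse complement (assumed convention for inverted repeats)."""
    return fragment.translate(COMPLEMENT)[::-1]
```

For the prototype `AACG`, the four copies are `AACG`, `GCAA`, `TTGC` and `CGTT` respectively.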

Nucleotide sequence: the AP2 transcription factor binding site, GTGCCCCGCGGGAACCCCGC.

Black and gray arrows mark the copied fragments and their prototypes. A tandem repeat characterized by partial overlapping of the prototype on the copied fragment is marked by a dotted line. In this decomposition, the first one-lettered components, G and T, are produced by an operation generating a novel symbol. The complexity of this 20-lettered sequence = 10 [the number of components in H(S)].

Lempel-Ziv complexity

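Let S and Q be two strings and let SQ = S + Q be their concatenation. SQP is SQ with its last letter deleted, and V(SQP) is the set of all substrings of SQP.

• Start with c(n) = 1, $S = s_1 s_2 \dots s_r$ and $Q = s_{r+1}$.
• If $Q \in V(SQP)$, keep S unchanged and extend Q to $Q = s_{r+1} s_{r+2}$.
• Continue until $Q \notin V(SQP)$: then $Q = s_{r+1} s_{r+2} \dots s_{r+i}$ is not a substring of $s_1 s_2 \dots s_r s_{r+1} \dots s_{r+i-1}$, and c(n) is increased by 1.
• Update $S = s_1 s_2 \dots s_r s_{r+1} \dots s_{r+i}$ and $Q = s_{r+i+1}$.
• Repeat until Q reaches the final letter.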

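b(n) is the complexity value that a random sequence approaches as n tends to infinity:

$b(n) = n / \log_2 n$

Thus, the normalized complexity is

$C_{LZN} = c(n) / b(n)$

The complexity of a random sequence ----> 1

The complexity of an ordered sequence ----> 0

The smaller the complexity, the more slowly new patterns appear ===> the data change regularly and show clear periodicity.

The calculation of c(n) (Lempel-Ziv complexity)

Lempel-Ziv Complexity 1976

For sequence S = (10101010)

1) S = s1 = 1, Q = s2 = 0, SQ = 10, SQP = 1, Q ∉ V(SQP): Q is an insertion, SQ = 1 ● 0

2) S = s1s2 = 10, Q = s3 = 1, SQ = 101, SQP = 10, Q ∈ V(SQP): Q is a duplication, SQ = 1 ● 0 ● 1

3) S = s1s2 = 10, Q = s3s4 = 10, SQ = 1010, SQP = 101, Q ∈ V(SQP): Q is a duplication, SQ = 1 ● 0 ● 10

• Repeating 2) and 3), Q remains a duplication up to the final letter, so S = 1 ● 0 ● 101010 and c(n) = 3.
• $b(8) = 8 / \log_2 8 = 8/3 \approx 2.67$, so the normalized complexity is $C_{LZN} = c(8)/b(8) = 9/8 = 1.125$. Because b(n) is an asymptotic value, this ratio is not yet meaningful for n = 8: c(n) stays at 3 however long the periodic sequence is continued, while b(n) keeps growing, so $C_{LZN}$ tends to 0.
• Thus, the complexity of this sequence is low because the sequence is periodic.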
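The parsing procedure described above can be sketched as follows. This is a minimal, unoptimized illustration that uses Python's substring test for the check of Q against V(SQP); the name `lz76_components` is hypothetical, and the sketch covers only direct copies, not the other repeat types mentioned earlier.

```python
def lz76_components(s: str) -> list[str]:
    """Split s into its Lempel-Ziv (1976) components; c(n) is the list length."""
    components = []
    start = 0                      # S is s[:start]
    n = len(s)
    while start < n:
        length = 1                 # Q is s[start:start + length]
        # Extend Q while it can still be copied from SQP = s[:start + length - 1].
        while start + length <= n and s[start:start + length] in s[:start + length - 1]:
            length += 1
        end = min(start + length, n)
        components.append(s[start:end])
        start = end
    return components


print(lz76_components("10101010"))   # ['1', '0', '101010']
```

The last component absorbs the whole periodic tail, so c(n) stays at 3 no matter how many further repetitions of `10` are appended.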
Other estimates of text complexity

The evaluation of complexity in a text region CWF by Wootton and Federhen (7) is given by the formula

Linguistic complexity can also be defined as the ratio of the sum of numbers of words occurring in a sequence analyzed to the maximum possible number of such words (12):
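Both estimates can be sketched in a few lines. The formulas used here are assumptions: the compositional complexity is taken in the commonly cited form $C_{WF} = \frac{1}{N}\log_K \frac{N!}{\prod_i n_i!}$ (N the window length, $n_i$ the letter counts, K the alphabet size), and the linguistic complexity sums distinct words over all word lengths from 1 to N. The function names are hypothetical.

```python
import math
from collections import Counter


def wootton_federhen(window: str, alphabet_size: int = 4) -> float:
    """(1/N) * log_K( N! / prod(n_i!) ), computed with log-gamma for stability."""
    n = len(window)
    log_multinomial = math.lgamma(n + 1) - sum(
        math.lgamma(count + 1) for count in Counter(window).values())
    return log_multinomial / (n * math.log(alphabet_size))


def linguistic_complexity(window: str, alphabet_size: int = 4) -> float:
    """Distinct words observed divided by the maximum possible, over all word lengths."""
    n = len(window)
    observed = 0
    possible = 0
    for k in range(1, n + 1):
        observed += len({window[i:i + k] for i in range(n - k + 1)})
        possible += min(alphabet_size ** k, n - k + 1)
    return observed / possible
```

A homopolymer such as `AAAA` scores 0 on the first measure and 0.4 on the second, while `ACGT` reaches the maximum of the second measure, 1.0.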

Implementation and Results

Calculation mode in a sliding window

• a single extended sequence
• a group of relatively short sequences up to 1 kb in length.

A table of complexity values is constructed for a window of a given size N sliding along the sequence. The sequence complexity is assigned to the window center. The calculation mode in a sliding window (complexity profile) is demonstrated here using the example of the Borrelia burgdorferi genome. In Figure 2, complexity profiles for a window sliding along the sequence are illustrated.
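A complexity profile of this kind can be sketched with a generic helper that slides a window along the sequence and assigns each value to the window center. The names `complexity_profile` and `distinct_letters` are hypothetical, and the toy measure is only a placeholder for a real complexity estimate.

```python
from typing import Callable


def complexity_profile(seq: str, window: int,
                       measure: Callable[[str], float],
                       step: int = 1) -> list[tuple[int, float]]:
    """Return (center position, complexity) pairs for a window sliding along seq."""
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        center = start + window // 2
        profile.append((center, measure(seq[start:start + window])))
    return profile


def distinct_letters(fragment: str) -> int:
    """Toy measure: number of different letters in the fragment."""
    return len(set(fragment))


print(complexity_profile("AAAACGT", 4, distinct_letters))
# [(2, 1), (3, 2), (4, 3), (5, 4)]
```

Passing the measure as a function keeps the windowing logic independent of which complexity estimate is plotted.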

Quantum Kolmogorov Complexity

Are quantum computers more powerful than classical computers?

Quantum Entanglement

Quantum Factorization of Integers

Quantum computers can solve some problems faster than classical computers (→ Shor’s factoring algorithm).

[Slide graphic: a long string of digits illustrating a large integer to be factored]

>> 2^(129/2) = 2.6088e+019

>> 2^125 = 4.2535e+037

>> 2^200 = 1.6069e+060

Prime factorization of large number

Principle of Current Cryptography

In 1994, about 1,600 workstations obtained the prime factors of a number of length L = 129 in about 8 months. For L = 250, the same approach would take about 800,000 years.

Factorization of Integers

The number N has an approximate length of L bits, i.e. N lies between 0 and $2^L - 1$.

A composite number N has a factor in the range $(1, \sqrt{N}]$.

Try each number in this range to find a factor of N. In the worst case this takes about

$S \sim \sqrt{N} = 2^{L/2}$

steps.

But, for Shor’s Quantum Computation

$S \sim \mathrm{poly}(\log N)$

The Factoring Firestorm

[Slide graphic: long integers illustrating the factoring problem]

Peter Shor, 1994

Best known classical algorithm: takes time that grows faster than any polynomial in L

Shor’s quantum algorithm: takes time polynomial in L

An efficient algorithm for factoring breaks the RSA public key cryptosystem

Qubit
• Pure state of a qubit
• Basis
• Superposition of the states |0> and |1>

Qubit:

• The element carrying information: the quantum state
• |0>, |1> and any linear combination (superposition) c1|0> + c2|1>

Definition (Qubit Strings)

A qubit string σ is a state vector or density operator
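A pure qubit state c1|0> + c2|1> can be represented numerically by its two complex amplitudes. The sketch assumes the standard normalization |c1|² + |c2|² = 1 and the usual rule that the squared magnitudes give the measurement probabilities; neither is spelled out in the slides, and the function names are hypothetical.

```python
import math


def normalize(c0: complex, c1: complex) -> tuple[complex, complex]:
    """Rescale the amplitudes so that |c0|^2 + |c1|^2 = 1."""
    norm = math.sqrt(abs(c0) ** 2 + abs(c1) ** 2)
    return c0 / norm, c1 / norm


def measurement_probabilities(c0: complex, c1: complex) -> tuple[float, float]:
    """Probabilities of observing |0> and |1> (assumed Born rule)."""
    c0, c1 = normalize(c0, c1)
    return abs(c0) ** 2, abs(c1) ** 2


print(measurement_probabilities(1, 1j))   # equal superposition with a relative phase
```

The relative phase between the amplitudes does not change these probabilities, but it does matter once further operations are applied to the qubit.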

Quantum Parallelism

A quantum computer can perform $2^n$ operations at the same time due to superposition:

However we get only one answer when we measure the result:

F[000] F[001] F[010] . . F[111]

The Discrete Fourier Transform
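• Assume L qubits hold any number x, from 0 to $2^L - 1$
• Any number x can be expressed as the state
• $|x\rangle = |x_{L-1} x_{L-2} \dots x_1 x_0\rangle = |x_{L-1}\rangle |x_{L-2}\rangle \dots |x_1\rangle |x_0\rangle$
• where $x = \sum_{j=0}^{L-1} x_j 2^j$ and the product of kets is a tensor product
• $A_j$ acts only on the qubit represented by the j-th atom
• The operator $|i_j\rangle\langle k_j|$ acts on the state $|n_j\rangle$ as
• $|i_j\rangle\langle k_j|n_j\rangle = \delta_{kn} |i_j\rangle$
• $A_j|0_j\rangle = \frac{1}{\sqrt{2}}(|0_j\rangle + |1_j\rangle)$, $A_j|1_j\rangle = \frac{1}{\sqrt{2}}(|0_j\rangle - |1_j\rangle)$
• $B_{jk}|0_{jk}\rangle = |0_{jk}\rangle$, $B_{jk}|1_{jk}\rangle = |1_{jk}\rangle$, $B_{jk}|2_{jk}\rangle = |2_{jk}\rangle$
• $B_{jk}|3_{jk}\rangle = \exp(i\theta_{jk})|3_{jk}\rangle$, with $\theta_{jk} = \pi / 2^{k-j}$

For L = 3 and x = 2:

$A_0 B_{01} B_{02} A_1 B_{12} A_2 |x\rangle = \frac{1}{\sqrt{8}}\{(|0\rangle + |4\rangle) - (|2\rangle + |6\rangle) + i(|1\rangle + |5\rangle) - i(|3\rangle + |7\rangle)\} = \frac{1}{\sqrt{8}} \sum_{y=0}^{7} e^{2\pi i x y / 8} |y\rangle$

In general, $|x\rangle \Rightarrow \frac{1}{\sqrt{2^L}} \sum_{y=0}^{2^L - 1} e^{2\pi i x y / 2^L} |y\rangle$

$A_0 B_{01} B_{02} A_1 B_{12} A_2$ performs a discrete Fourier transform!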
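The amplitude pattern (|0>+|4>) - (|2>+|6>) + i(|1>+|5>) - i(|3>+|7>) can be checked numerically. The sketch below computes the discrete Fourier transform of a basis state directly from its definition, assuming the sign convention exp(+2πi·x·y/2^L) and a 1/√(2^L) normalization; it does not simulate the gate sequence itself, and the function name is hypothetical.

```python
import cmath
import math


def dft_of_basis_state(x: int, num_qubits: int) -> list[complex]:
    """Amplitudes of the DFT of |x> over the basis states |0> .. |2^L - 1>."""
    dim = 2 ** num_qubits
    return [cmath.exp(2j * cmath.pi * x * y / dim) / math.sqrt(dim)
            for y in range(dim)]


amplitudes = dft_of_basis_state(2, 3)
# Rescale by sqrt(8) to compare with the coefficients 1, i, -1, -i in the text.
scaled = [a * math.sqrt(8) for a in amplitudes]
```

With x = 2 and three qubits, the rescaled amplitudes for |0> to |7> come out as 1, i, -1, -i, 1, i, -1, -i, which matches the grouping of basis states in the expression above.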

Shor’s algorithm

[1] Quantum Fourier Transform ==> finding the period of a periodic function

[2] Then find the Greatest Common Divisor

Quantum computers U :

input qubit string σ → output qubit string U(σ)

• Definition (Quantum Kolmogorov Complexity)

Let U be a universal quantum computer and δ > 0. Then, for every qubit string ρ, define

$QC^{\delta}(\rho) = \min\{\ell(\sigma) \mid \|\rho - U(\sigma)\|_{Tr} \le \delta\}$

• To measure the difference between two qubit strings ρ and σ, it is natural to use the trace distance, which is defined as $\|\rho - \sigma\|_{Tr} := \frac{1}{2}\mathrm{Tr}|\rho - \sigma|$
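For a single qubit the trace distance can be evaluated in closed form, because the difference of two density operators is a 2×2 Hermitian matrix whose eigenvalues are easy to write down. The sketch below is limited to that single-qubit case and represents each density operator as a nested list; the name `trace_distance_2x2` is hypothetical.

```python
import math


def trace_distance_2x2(rho, sigma) -> float:
    """(1/2) Tr|rho - sigma| for two 2x2 Hermitian matrices given as nested lists."""
    a = (rho[0][0] - sigma[0][0]).real      # diagonal entries are real
    d = (rho[1][1] - sigma[1][1]).real
    b = rho[0][1] - sigma[0][1]             # off-diagonal entry (may be complex)
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + abs(b) ** 2)
    # Eigenvalues of the difference are mean + radius and mean - radius.
    return 0.5 * (abs(mean + radius) + abs(mean - radius))


ket0 = [[1, 0], [0, 0]]                     # |0><0|
ket1 = [[0, 0], [0, 1]]                     # |1><1|
plus = [[0.5, 0.5], [0.5, 0.5]]             # |+><+|

print(trace_distance_2x2(ket0, ket1), trace_distance_2x2(ket0, plus))
```

Orthogonal states are at distance 1, identical states at distance 0, and |0> versus the equal superposition |+> gives 1/√2, so a tolerance δ below that value would not accept one as an approximation of the other.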
References:
• Y. L. Orlov and V. N. Potapov, Complexity: an internet resource for analysis of DNA sequence complexity, Nucleic Acids Research, 2004, Vol. 32
• Fabio Benatti, Tyll Krüger, Markus Müller, Rainer Siegmund-Schultze, Arleta Szkoła. Entropy and Quantum Kolmogorov Complexity: A Quantum Brudno’s Theorem, Communications in Mathematical Physics 265, 437–461 (2006)