Encoding Information for DNA computing


### Encoding Information for DNA computing

Shinnosuke Seki

Purpose
• What’s an advantage of encoding?
• To make a “good” or tractable code set for DNA computing.
• Development of polynomial-time algorithms which decide whether a given code set is “good” or “bad”.
Claude Elwood Shannon
• The father of information theory (Shannon’s entropy)
• Boolean algebra with binary arithmetic makes it possible to simplify electromechanical relays
• In “A mathematical theory of communication” [Shannon48], he showed that information can be sent with an arbitrarily small error rate even over a noisy channel.
• Chess program using minimax evaluation procedure
• etc. …
Shannon’s information channel
• R > C: overflow
• R ≤ C: we can make the error rate as small as desired.
• To attain R = C in the noisy channel, we need to find a ‘good’ code.

[Figure: sender, encoder, channel of capacity C, decoder; information flow R through the channel, which is subject to positive noise and negative noise]

Biological perspective
• A biological reaction can be described in terms of the information channel model.
• Example: the case of heredity
• Over billions of years, has Mother Nature developed a wonderful code system?
• Biology -> Computer Science

[Figure: heredity as a channel from the parent's DNA to the child's DNA, with natural selection and mutation acting on it]

Review: in vitro DNA computing
• Encode a given problem into single- or double-stranded DNAs (ssDNAs, dsDNAs).
• Computation by a succession of bio-operations.
• Decode the resulting solution and extract its output.

[Figure: the four nucleotides A, T, C, G and a double strand]

5’-ATCGGTCAACTGCCCTAATG-3’

3’-TAGCCAGTTGACGGGATTAC-5’

Review: WK-complementarity
• Hydrogen bonds
• Two strands which are complementary to each other and run in opposite directions can form a (complete) dsDNA.
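As a minimal sketch of the rule above, the following Python checks whether two strands, both written 5'->3', can form a complete dsDNA. The helper names `reverse_complement` and `forms_complete_duplex` are illustrative and do not come from the slides.

```python
# Watson-Crick pairing: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the WK partner of `strand`, read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def forms_complete_duplex(s1: str, s2: str) -> bool:
    """True if s1 and s2 (both 5'->3') pair base by base in opposite directions."""
    return s2 == reverse_complement(s1)

top = "ATCGGTCAACTGCCCTAATG"            # 5'->3', from the review slide
bottom_3_to_5 = "TAGCCAGTTGACGGGATTAC"  # written 3'->5' on the slide
print(forms_complete_duplex(top, bottom_3_to_5[::-1]))  # True
```

The lower strand is reversed before the check because the slide writes it 3'->5', while the function expects both strands in the 5'->3' direction.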

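• Example
• Find a solution of the Hamiltonian path problem in a (test-tube) solution, in a number of steps polynomial in the size of the input graph.
• The solution is filled with encoding oligonucleotides.

[Figure: input graph on vertices 1 to 4 and the encoding of the path 1 -> 2 -> 3 -> 4]
• Vertex 1: ACG CTT
• Vertex 2: ATA GAT
• Vertex 3: CGG TTA
• Vertex 4: ACT TAA
• Edge 1 -> 2: GAA TAT
• Edge 2 -> 3: CTA GCC
• Edge 3 -> 4: AAT TGA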
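The seven words in the example are consistent with the following reading, which is an inference from the data rather than something the transcript states: four of them encode vertices, and each edge word is the base-by-base complement of the second half of its source vertex followed by the first half of its target vertex. A short Python sketch under that assumption (the names `VERTEX` and `edge_word` are hypothetical):

```python
# Position-wise Watson-Crick complement (no reversal), matching how the
# words are drawn one above the other in the figure.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(word: str) -> str:
    return "".join(COMPLEMENT[b] for b in word)

# Assumed assignment of the example's words to vertices (two 3-base halves each).
VERTEX = {1: "ACGCTT", 2: "ATAGAT", 3: "CGGTTA", 4: "ACTTAA"}

def edge_word(u: int, v: int) -> str:
    """Edge u -> v: complement of (second half of u) + (first half of v)."""
    half = len(VERTEX[u]) // 2
    return complement(VERTEX[u][half:] + VERTEX[v][:half])

for u, v in [(1, 2), (2, 3), (3, 4)]:
    print(f"{u} -> {v}: {edge_word(u, v)}")
# 1 -> 2: GAATAT
# 2 -> 3: CTAGCC
# 3 -> 4: AATTGA
```

The computed edge words coincide with the remaining three words of the example, so an edge oligonucleotide can act as a splint that holds two consecutive vertex words together.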

What’s a good code set?
• Each code word (oligonucleotide) shouldn’t form any undesirable structure.
• Such a structure may make the code word inert.
• Code words shouldn’t interact with each other in an undesirable way.
• Structure formation is governed by
• WK-complementarity
• Gibbs free energy

[Figure: the code word ATA GAT of vertex 2 folded into a hairpin]

What’s a good code set? (cont.)
• Uniform melting temperature
• Preventing undesirable hybridizations
• Other constraints
• Avoiding repeated bases
• Forbidden subsequences
• When a restriction enzyme is used, its recognition site should appear only at the intended positions.
• Using only 3 types of nucleotides A, C, T
Melting temperature
• Melting temperature Tm of a dsDNA is
• the temperature at which half of the dsDNAs are denatured.
• The higher Tm is, the more stable the dsDNA is.
• R: gas constant,
• Ct: total oligo concentration,
• ΔH & ΔS : enthalpy & entropy
• α: 1 for self-complementary strands and 4 for non-self-complementary strands
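The formula for Tm is not preserved in this transcript. The symbols listed above match the standard two-state expression Tm = ΔH / (ΔS + R ln(Ct / α)), with Tm in kelvin, so the sketch below assumes that this is the intended formula. The input values for ΔH and ΔS are illustrative only and are not taken from the slides.

```python
import math

R = 1.987  # gas constant in cal/(K*mol)

def melting_temperature_celsius(dH_kcal, dS_cal, Ct_molar, self_complementary=False):
    """Two-state Tm = dH / (dS + R * ln(Ct / alpha)), converted to Celsius.

    dH_kcal: enthalpy in kcal/mol; dS_cal: entropy in cal/(K*mol);
    Ct_molar: total oligo concentration in mol/L.
    """
    alpha = 1.0 if self_complementary else 4.0
    tm_kelvin = (dH_kcal * 1000.0) / (dS_cal + R * math.log(Ct_molar / alpha))
    return tm_kelvin - 273.15

# Illustrative inputs only (assumed values, not measured parameters).
tm = melting_temperature_celsius(dH_kcal=-150.0, dS_cal=-400.0, Ct_molar=1e-6)
print(round(tm, 1))
```

In this formula a higher total concentration Ct raises Tm, which is one reason why concentrations have to be fixed before melting temperatures of different code words can be compared.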
Melting temperature (cont.)
• Uniform melting temperature
• Making Tm uniform across code words eliminates a bias in hybridization.
• GC content
• The ratio of the # of G’s and C’s over the total # of nucleotides in a sequence
• G-C pair is more stable than A-T pair.
• Higher GC content implies higher Tm.
• Sequences are designed with 50% GC content.
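The GC content defined above is straightforward to compute. A minimal Python sketch (the function name `gc_content` is illustrative), applied to the four vertex code words of the Hamiltonian path example:

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in `seq` (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)

# Code words taken from the Hamiltonian path example earlier in the slides.
for word in ["ACGCTT", "ATAGAT", "CGGTTA", "ACTTAA"]:
    print(word, round(gc_content(word), 2))
# ACGCTT 0.5
# ATAGAT 0.17
# CGGTTA 0.5
# ACTTAA 0.17
```

Only two of these four words meet the 50% target, which illustrates why GC content is usually imposed as an explicit design constraint.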
Gibbs free energy (ΔG)
• A well-known indicator of stability for DNA structures
• A structure with lower ΔG is more stable.
• The ΔG of an entire structure is the sum of the ΔG of its substructures [ZuSt81].
Nearest-neighbor method

For the parameters, refer to [AlSa97], [TKY04] ([8], [9] in the table on the original slide).

Template method[ArKo02]
• Prepare two bit sequences (a template and a map word), each of which has some desirable property
• (e.g., 50% GC content, error correction).
• Using a conversion rule, we construct one DNA sequence from these two bit sequences.
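The conversion rule itself is not given in this transcript. The sketch below assumes one plausible rule for illustration: the template bit decides whether a position holds a G/C base (1) or an A/T base (0), and the map bit picks one of the two. The table `CONVERT`, the function `combine`, and the toy map words are all hypothetical.

```python
# ASSUMED conversion rule (template bit, map bit) -> base; the slides do not
# spell it out. Template bit 1 marks a G/C position, 0 marks an A/T position.
CONVERT = {(0, 0): "A", (0, 1): "T", (1, 0): "C", (1, 1): "G"}

def combine(template: str, codeword: str) -> str:
    """Build a DNA word from two equal-length bit strings."""
    if len(template) != len(codeword):
        raise ValueError("template and map word must have the same length")
    return "".join(CONVERT[int(t), int(m)] for t, m in zip(template, codeword))

template = "110100"                    # three 1s, three 0s: 50% GC content
code = ["000000", "111000", "000111"]  # toy map words, pairwise Hamming distance >= 3
for word in code:
    print(word, "->", combine(template, word))
# 000000 -> CCACAA
# 111000 -> GGTCAA
# 000111 -> CCAGTT
```

Under this rule every generated word inherits the GC content of the template, while the mismatches between words come from the distance of the map code.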
Template method (cont.)
• Design criteria
• Template
• A template x should have at least d mismatches with each of xR, xx, xRxR, xxR, and xRx, where xR denotes the reversal of x.
• An exhaustive search to find a good template
• Map (error-correcting code)
• A code whose words have at least k mismatches with each other.
• e.g. BCH code
• Drawback
• It cannot prevent sequences from forming secondary structures.
AG-templates, GC-templates [KKA03]
• GC-template
• Template contains the same # of 0’s and 1’s (50% GC content).
• Map is an error-correcting code.
• AG-template
• Map is a constant-weight code (50% GC content).
• Results in a bigger set of sequences.
Other approaches
• DNASequenceGenerator[FBR00]
• Software with a GUI
• Creates sequences with a specified melting temperature and GC content, and without palindromes, start codons, or restriction sites.
Other approaches
• Suyama’s approach[YoSu00]
• Generate sequences randomly and add each one to the sequence set iff it satisfies all of the following constraints:
• Uniform melting temperature
• No mis-hybridization
• No formation of stable secondary structure
• Drawback: it easily falls into local optima.
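The generate-and-test loop described above can be sketched as follows. This is not the procedure of [YoSu00] itself: the constraints are simplified stand-ins chosen for illustration (50% GC content as a proxy for uniform melting temperature, Hamming distance as a proxy for mis-hybridization), and the secondary-structure check is omitted. All function and parameter names are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def revcomp(s):
    return "".join(COMPLEMENT[b] for b in reversed(s))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def acceptable(cand, accepted, min_dist):
    """Stand-in constraints (assumptions): 50% GC as a proxy for uniform Tm,
    Hamming distance as a proxy for 'no mis-hybridization'."""
    if 2 * sum(b in "GC" for b in cand) != len(cand):
        return False
    if hamming(cand, revcomp(cand)) < min_dist:
        return False
    return all(hamming(cand, w) >= min_dist and hamming(cand, revcomp(w)) >= min_dist
               for w in accepted)

def generate_and_test(n_words, length, min_dist, tries=10000, seed=0):
    rng = random.Random(seed)
    accepted = []
    for _ in range(tries):
        if len(accepted) == n_words:
            break
        cand = "".join(rng.choice("ACGT") for _ in range(length))
        if acceptable(cand, accepted, min_dist):
            accepted.append(cand)  # greedy: accepted words are never revisited
    return accepted

words = generate_and_test(n_words=5, length=8, min_dist=3)
print(words)
```

Because an accepted word is never removed, an unlucky early choice can block many later candidates, which is the local-optimum drawback noted above.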
Other approaches
• Hybrid randomized neighborhoods[TuHo03]
• Stochastic local search (SLS) algorithm
• With probability ε, it searches neighbors by randomly mutating the current best sequences.
• With probability 1-ε, it moves in the direction where the # of constraint conflicts decreases the most.
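A minimal sketch of such an ε-randomized local search is shown below. It is a simplified illustration and not the algorithm of [TuHo03]: the only constraint is a minimum pairwise Hamming distance, a neighbor is a code set with one base changed, and all names and default values are assumptions.

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def conflicts(words, min_dist):
    """Number of word pairs closer than min_dist (toy stand-in for constraint conflicts)."""
    return sum(hamming(words[i], words[j]) < min_dist
               for i in range(len(words)) for j in range(i + 1, len(words)))

def neighbours(words):
    """All code sets obtained by changing one base of one word."""
    for i, w in enumerate(words):
        for pos in range(len(w)):
            for base in "ACGT":
                if base != w[pos]:
                    yield words[:i] + [w[:pos] + base + w[pos + 1:]] + words[i + 1:]

def sls(n_words, length, min_dist, eps=0.2, max_steps=500, seed=0):
    rng = random.Random(seed)
    current = ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n_words)]
    best = current
    for _ in range(max_steps):
        if conflicts(best, min_dist) == 0:
            break
        options = list(neighbours(current))
        if rng.random() < eps:
            current = rng.choice(options)  # random-walk step, probability eps
        else:
            # greedy step: neighbour with the fewest conflicts, probability 1 - eps
            current = min(options, key=lambda s: conflicts(s, min_dist))
        if conflicts(current, min_dist) < conflicts(best, min_dist):
            best = current
    return best

result = sls(n_words=4, length=6, min_dist=3)
print(result, conflicts(result, 3))
```

The random-walk step is what lets the search leave a local optimum that the purely greedy step would be stuck in.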
Other approaches
• GA (genetic algorithm)-based approach[ANH00]
• Use a GA to evolve sequence sets, evaluating the fitness of each solution
• Fitness criteria:
• Restriction sites
• GC-content
• Hamming distance
• Same base repetition
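The four criteria above can be combined into a single score for a GA to optimize. The sketch below is one possible penalty function, not the fitness function of [ANH00]: the equal weighting, the thresholds, and the choice of the EcoRI recognition site GAATTC as the example restriction site are all assumptions.

```python
from itertools import groupby

def longest_run(seq):
    """Length of the longest stretch of one repeated base."""
    return max((len(list(g)) for _, g in groupby(seq)), default=0)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def penalty(seq, others, site="GAATTC", min_dist=3, max_run=3):
    """Toy penalty (lower is better); equal weighting of the criteria is an assumption."""
    p = 0
    p += seq.count(site)                                  # restriction sites
    p += abs(2 * sum(b in "GC" for b in seq) - len(seq))  # deviation from 50% GC
    p += sum(hamming(seq, o) < min_dist for o in others)  # too-similar words
    p += max(0, longest_run(seq) - max_run)               # same-base repetition
    return p

print(penalty("ACGTACGT", ["TTGCAACC"]))          # 0
print(penalty("GAATTCAAAAAA", ["GAATTCAAAAAT"]))  # 13
```

The second word is penalized for one restriction site, a GC imbalance of 8, one word at Hamming distance 1, and a run of six A's (3 above the allowed length).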
Other approaches
• Gibbs free energy based approach [TKY05], [KNO08]
• Taking thermodynamics into consideration
• Gibbs free energy as a stability measure
• Greater accuracy because it takes into account stability of loops or stacking between base-pairs
• More computational time to calculate free energy
• How to decrease this computational complexity?
A formal language approach
• Design a set of structure-free codes in terms of WK-complementarity.
• More reliable codes than the free-energy approach
• More efficient algorithms for decision problems
• Need to consider each structure separately.

[Figure: the strand TCATCCGATTTCGGG paired with its complementary strand AGTAGGCTAAAGCCC]

A formal language approach (cont.)
• Abstraction of biological concepts
• {A, C, G, T} → an alphabet V,
• WK-complementarity → an antimorphic involution
• Involution
• A mapping θ s.t. θ² is the identity (symmetry).
• Antimorphism
• θ(xy) = θ(y)θ(x) (opposite direction).
• e.g. θ(TCATCCGATTTCGGG) = CCCGAAATCGGATGA
Bond-free properties[KKS05]
• θ-non-overlapping:
• θ-compliant:
• Strictly (a): property (a) together with θ-non-overlapping
Bond-free properties[KKS05]
• θ-p-compliant:
• θ-s-compliant:
Bond-free properties[KKS05]
• θ-free:
• θ-sticky-free:
Bond-free properties[KKS05]
• θ-3’-overhang-free:
• θ-5’-overhang-free:
• θ-overhang-free: both of these
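The defining formulas of these properties are not preserved in this transcript. As a sketch only, the following Python checks two of them on a finite code set, assuming the definitions commonly given in the literature: L is θ-non-overlapping if L ∩ θ(L) = ∅, and L is θ-compliant if w ∈ L and xθ(w)y ∈ L imply xy = λ. Both definitions are assumptions here and should be checked against [KKS05].

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(word):
    return "".join(COMPLEMENT[b] for b in reversed(word))

def is_theta_non_overlapping(code):
    """Assumed definition: no code word is the theta-image of a code word."""
    return all(theta(w) not in code for w in code)

def is_theta_compliant(code):
    """Assumed definition: theta(w) is never a proper subword of a code word."""
    return not any(theta(w) in u and theta(w) != u for w in code for u in code)

good = {"AACC", "ACAC"}
bad = {"AACC", "TGGTTA"}   # theta("AACC") = "GGTT" occurs inside "TGGTTA"
print(is_theta_compliant(good), is_theta_compliant(bad))  # True False
print(is_theta_non_overlapping({"ACGT"}))                 # False: ACGT = theta(ACGT)
```

These brute-force checks only work for finite code sets; the decidability results that follow concern regular languages given by automata.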
Decidability [KKS05]
• Theorem
• The following problem is decidable in quadratic time w.r.t. |A|:
• Input: an NFA A,
• Output: Yes/No depending on whether L(A) satisfies a given one of the following properties (or its strict version):
• θ-compliant, θ-p-compliant, θ-s-compliant,
• θ-sticky-free,
• θ-3’-overhang-free, θ-5’-overhang-free, θ-overhang-free.
Decidability and maximality[KKS05]
• Theorem
• Let M be a regular language and L be a regular subset of M with a property ρ,
• where ρ is one of the following:
• θ-compliant,
• θ-p-compliant,
• θ-s-compliant, or
• θ-sticky-free
• Then it is decidable whether L is a maximal subset of M satisfying ρ.
Secondary structure prevention
• Secondary structures:
• Hairpin-loop (or simply hairpin)
• Internal loop
• Multiple-branch loop
• Pseudoknot
• They can be undesirable
• e.g. for Adleman’s encoding technique for the Hamiltonian Path Problem (HPP).

[Figure: Secondary Structures. Panels: hairpin, hairpin frame (multiple loop), internal loop, drawn on the strand TAA---ACG---CGTTA---CGT---CGGT with the 5’ and 3’ ends marked]

Hairpin-free language
• A formal model of a hairpin: x v y θ(v) z, where v and θ(v) bind to form the stack and y forms the loop.
• Hairpin freeness
• Intuitively, it is almost impossible to prevent hairpins of short stack length (say 2 or 3).
• Our aim is to prevent any hairpin whose stack length is at least some given parameter k.

Hairpin-free language [KKL06]
• A word w is (θ, k)-hairpin-free (abbr. hp(θ, k)-free) iff w = xvyθ(v)z with x, v, y, z ∈ Σ* implies |v| < k.
• hpf(θ, k): the set of all hp(θ, k)-free words in Σ*
• hp(θ, k): Σ* - hpf(θ, k).
• A language L is called (θ, k)-hairpin-free iff L ⊆ hpf(θ, k).

[Figure: a word w and θ(w)]
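For a single word, hairpin-freeness can be tested by brute force. The sketch below assumes the reading that w is hp(θ, k)-free iff it cannot be written as x v y θ(v) z with |v| ≥ k (and k ≥ 1); the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(word):
    return "".join(COMPLEMENT[b] for b in reversed(word))

def is_hairpin_free(w, k):
    """True iff w cannot be written as x v y theta(v) z with |v| >= k.

    It suffices to test stacks of length exactly k: a longer stack always
    contains one of length k.
    """
    for i in range(len(w) - 2 * k + 1):
        stem = theta(w[i:i + k])
        if w.find(stem, i + k) != -1:  # theta(v) must start after v ends
            return False
    return True

word = "ACTGTCGACAGT"  # the hairpin strand from the DNA-HRAM slide
print(is_hairpin_free(word, 6), is_hairpin_free(word, 7))  # False True
```

The formal model allows an empty loop y, so this check is stricter than a physical model that requires a minimum loop length.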

Regularity of hairpin languages
• hp(θ, k) and hpf(θ, k) are regular.
• Hence there exists a finite automaton M s.t. L(M) = hpf(θ, k), and likewise for hp(θ, k).
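One way to see the regularity of hp(θ, k) is to write it as a finite union: over all words v of length k, take Σ* v Σ* θ(v) Σ*. The sketch below builds that union as a Python regular expression; it illustrates the idea only and is not the automaton construction of [KKL06].

```python
import re
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(word):
    return "".join(COMPLEMENT[b] for b in reversed(word))

def hairpin_regex(k):
    """Regular expression for hp(theta, k): the union over all v of length k
    of  Sigma* v Sigma* theta(v) Sigma*  (search() supplies the outer Sigma*)."""
    alternatives = []
    for letters in product("ACGT", repeat=k):
        v = "".join(letters)
        alternatives.append(f"{v}[ACGT]*{theta(v)}")
    return re.compile("|".join(alternatives))

hp2 = hairpin_regex(2)
print(bool(hp2.search("AACGTT")), bool(hp2.search("AAAA")))  # True False
```

The expression has 4^k alternatives, so it is practical only for small k; it shows that the language is regular, not how to decide hairpin-freeness efficiently.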
Hairpin Freeness Problems
• Hairpin-Freeness problem
• Input: A nondeterministic automaton M.
• Output: Y/N depending on whether L(M) is hp(θ, k)-free.
• Maximal Hairpin-Freeness problem
• Input: A deterministic automaton M1 and an NFA M2.
• Output: Y/N depending on whether there is a word w ∈ L(M2) - L(M1) s.t. L(M1) ∪ {w} is hp(θ, k)-free.

Decidability
• The hairpin-freeness problem for regular languages is decidable in time.
• The maximal hairpin-freeness problem for regular languages is decidable in time.
Hairpin Frames
• Also known as a multiple loop
• hp-frame of degree n:
• The figure on the right is an example of an hp-frame of degree 3.
• A word u is an hp(fr, j)-word if it contains an hp-frame of degree j.
Regularity & decidability
• hp(θ, fr, j) : the set of all hp(fr, j)-words on Σ*
• hpf(θ, fr, j) : its complement in Σ*
• The languages hp(θ, fr, j) & hpf(θ, fr, j) are regular.
• The hp(fr, j)-freeness problem is decidable in linear time.
• The maximal hp(fr, j)-freeness problem is decidable in time.
Application: DNA-HRAMs
• An n-bit DNA-HRAM consists of n hairpins.
• Each hairpin stores 1 bit of information by closing into and opening from a hairpin, as shown in the figure.

[Figure: the strand --A-C-T-G-T-C-G-A-C-A-G-T-- switching between its open form and its closed hairpin form (opening / closing); the two forms represent 0 and 1]

n-bit DNA-HRAM
• Concatenation of n 1-bit RAMs, which is equivalent to an hp-frame of degree n.
• In order for this word to work as an n-bit RAM, the following subword should be hp(θ, 20)-free.
• DNA memory with 4 hairpins was proposed in [KYO08].
Reference
• [AlSa97] Allawi, H.T., SantaLucia, J.: Thermodynamics and NMR of internal G·T mismatches in DNA. Biochemistry 36(34) (1997) 10581-10594
• [ArKo02] Arita, M., Kobayashi, S.: DNA sequence design using templates. New Generation Computing 20 (2002) 263-277
• [ANH00] Arita, M., Nishikawa, A., Hagiya, M., Komiya, K., Gouzu, H., Sakamoto, K.: Improving sequence design for DNA computing. Proc. Genetic and Evolutionary Computation Conference (2000) 875-882
• [FBR00] Feldkamp, U., Saghafi, S., Rauhe, H.: A DNA sequence compiler. Proc. DNA6 (2000)
• [KKS05] Kari, L., Konstantinidis, S., Sosik, P.: Preventing undesirable bonds between DNA codewords. Proc. DNA10, LNCS 3384 (2005) 182-191
• [KKL06] Kari, L., Konstantinidis, S., Losseva, E., Sosik, P., Thierrin, G.: A formal language analysis of DNA hairpin structures. Fundamenta Informaticae 71 (2006) 453-475
• [KKA03] Kobayashi, S., Kondo, T., Arita, M.: On template method for DNA sequence design. Proc. DNA8, LNCS 2568 (2003) 205-214
Reference (cont.)
• [KNO08] Kawashimo, S., Ng, Y-K., Ono, H., Sadakane, K., Yamashita, M.: Speeding up local-search type algorithms for designing DNA sequences under thermodynamical constraints. Proc. DNA14 (2008) 152-161
• [KYO08] Kameda, A., Yamamoto, M., Ohuchi, A., Yaegashi, S., Hagiya, M.: Unravel four hairpins! Natural Computing 7 (2008) 287-298
• [RFL01] Ruben, A. J., Freeland, S. J., Landweber, L. F.: PUNCH: An evolutionary algorithm for optimizing bit set selection. DNA7 (2001) 150-160
• [Shannon48] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27 (1948) 379-423, 623-656
• [TKY04] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Thermodynamic parameters based on a nearest-neighbor model for DNA sequences with a single-bulge loop. Biochemistry 43(22) (2004) 7143-7150
• [TKY05] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Design of nucleic acid sequences for DNA computing based on a thermodynamic approach. Nucleic Acids Res. 33(3) (2005) 903-911
Reference (cont.)
• [TuHo03] Tulpan, D., Hoos, H.: Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. In Advances in Artificial Intelligence: 16th Conference of the Canadian Society for Computational Studies of Intelligence, 2671 (2003) 418-433
• [YoSu00] Yoshida, H., Suyama, A.: Solution to 3-sat by breadth first search. Proc. the 5th DIMACS Workshop on DNA Based Computers, 54 (2000) 9-22
• [ZuSt81] Zuker, M., Stiegler, P.: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9(1) (1981) 133-148