Encoding Information for DNA computing

Encoding Information for DNA computing Shinnosuke Seki

Purpose • What’s an advantage of encoding? • To make a “good” or tractable code set for DNA computing. • Development of polynomial-time algorithms which decide whether a given code set is “good” or “bad”.

Claude Elwood Shannon • The father of information theory (Shannon’s entropy) • Showed that Boolean algebra with binary arithmetic can simplify the design of electromechanical relay circuits • In “A mathematical theory of communication” [Shannon48], he showed that information can be sent with an arbitrarily small error rate even over a noisy channel. • Chess program using a minimax evaluation procedure • etc.

Shannon’s information channel • R > C: overflow • R ≤ C: we can make the error rate arbitrarily small. • To attain R = C in the noisy channel, we need to find a ‘good’ code. [Figure: sender → encoder → channel of capacity C, subject to positive and negative noise → decoder → receiver; information flows at rate R]

Biological perspective • A biological reaction can be described in terms of the information channel model. • Example: the case of heredity • Over billions of years, has Mother Nature developed a wonderful code system? • Biology -> Computer Science [Figure labels: parent DNA, heredity, child DNA, mutation, natural selection]

Review: in vitro DNA computing • Encode a given problem into single- or double-stranded DNAs (ssDNAs, dsDNAs). • Compute by a succession of bio-operations. • Decode the resulting solution and extract its output.

Review: WK-complementarity • Hydrogen bonds pair A with T and C with G. • Two strands that are complementary to each other and run in opposite directions can form a (complete) dsDNA. • Example:
5’ - A T C G G T C A A C T G C C C T A A T G - 3’
3’ - T A G C C A G T T G A C G G G A T T A C - 5’
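The pairing rule can be checked mechanically. The following is a minimal sketch, assuming the lower strand is written 3’ to 5’ as on the slide, so that position i of one strand faces position i of the other; the function name `forms_complete_duplex` is introduced here for illustration only.

```python
# Watson-Crick pairing: A-T and C-G.
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def forms_complete_duplex(top_5_to_3: str, bottom_3_to_5: str) -> bool:
    """True if every position of the two antiparallel strands is a WK pair."""
    if len(top_5_to_3) != len(bottom_3_to_5):
        return False
    return all(WK_PAIR[a] == b for a, b in zip(top_5_to_3, bottom_3_to_5))

top = "ATCGGTCAACTGCCCTAATG"     # upper strand, written 5' -> 3'
bottom = "TAGCCAGTTGACGGGATTAC"  # lower strand, written 3' -> 5'
print(forms_complete_duplex(top, bottom))  # True
```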

Adleman’s first trial • Find a solution to the Hamiltonian path problem in a (test-tube) solution, in time polynomial in the size of the input graph. • The solution is filled with encoding oligonucleotides. • Vertex encodings: 1 = ACG CTT, 2 = ATA GAT, 3 = CGG TTA, 4 = ACT TAA • Edge encodings: 1 -> 2 = GAA TAT, 2 -> 3 = CTA GCC, 3 -> 4 = AAT TGA [Figure: the input graph on vertices 1, 2, 3, 4]
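The edge oligonucleotides on this slide are consistent with a simple construction: each edge u -> v is the complement of the second half of vertex u followed by the first half of vertex v, so that it can bridge the two vertex strands. The sketch below reproduces the three edge codes from the four vertex codes. It assumes the edges are written position by position under the vertex strands (no reversal), and the names `VERTEX` and `edge_splint` are illustrative.

```python
# Position-wise Watson-Crick complement (no reversal), so a splint can be
# read directly underneath the two vertex strands it joins.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Vertex codes as listed on the slide.
VERTEX = {1: "ACGCTT", 2: "ATAGAT", 3: "CGGTTA", 4: "ACTTAA"}

def edge_splint(u: int, v: int) -> str:
    """Complement of (second half of vertex u) + (first half of vertex v)."""
    half = len(VERTEX[u]) // 2
    joint = VERTEX[u][half:] + VERTEX[v][:half]
    return "".join(COMPLEMENT[base] for base in joint)

for u, v in [(1, 2), (2, 3), (3, 4)]:
    print(f"{u} -> {v}: {edge_splint(u, v)}")
# 1 -> 2: GAATAT
# 2 -> 3: CTAGCC
# 3 -> 4: AATTGA
```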

What’s a good code set? • Each code word (oligonucleotide) shouldn’t form any undesirable structure, since this may render it inert. • Code words shouldn’t interact with each other in an undesirable way. • Structure formation is due to • WK-complementarity • Gibbs free energy [Figure: code word 2 (ATA GAT) forming a structure with itself]

What’s a good code set? (cont.) • Uniform melting temperature • Preventing undesirable hybridizations • Other constraints • Avoiding repeated bases • Forbidden subsequences • When a restriction enzyme is used, its recognition site should appear only at the intended positions • Using only 3 types of nucleotides: A, C, T

Melting temperature • The melting temperature Tm of a dsDNA is the temperature at which half of the dsDNAs are denatured. • The higher Tm is, the more stable the dsDNA is. • Tm = ΔH / (ΔS + R ln(Ct / α)), where • R: gas constant • Ct: total oligo concentration • ΔH, ΔS: enthalpy and entropy • α: 1 for self-complementary and 4 for non-self-complementary sequences
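The symbols listed on this slide are those of the standard two-state expression Tm = ΔH / (ΔS + R ln(Ct / α)). The sketch below evaluates it, assuming ΔH in cal/mol, ΔS in cal/(mol·K) and Ct in mol/L, which gives Tm in kelvin. The numeric inputs in the usage lines are hypothetical values chosen only to exercise the function, not measured data.

```python
import math

R_CAL = 1.987  # gas constant in cal/(mol*K)

def melting_temperature_kelvin(delta_h: float, delta_s: float,
                               total_conc: float,
                               self_complementary: bool = False) -> float:
    """Two-state melting temperature: dH / (dS + R * ln(Ct / alpha))."""
    alpha = 1.0 if self_complementary else 4.0
    return delta_h / (delta_s + R_CAL * math.log(total_conc / alpha))

# Hypothetical duplex: dH = -60 kcal/mol, dS = -165 cal/(mol*K), Ct = 1 uM.
tm = melting_temperature_kelvin(-60000.0, -165.0, 1e-6)
print(f"Tm = {tm:.1f} K = {tm - 273.15:.1f} C")
```

With ΔH and ΔS fixed, raising the total concentration Ct raises Tm, which is why Tm is always quoted for a stated concentration.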

Melting temperature (cont.) • Uniform melting temperature • Making Tm uniform across code words eliminates hybridization bias. • GC content • The ratio of the number of G’s and C’s to the total number of nucleotides in a sequence • A G-C pair is more stable than an A-T pair. • Higher GC content implies higher Tm. • Sequences are designed with 50% GC content.
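GC content as defined here is a one-line computation. A minimal sketch follows; the function name `gc_content` is illustrative, and the two test words are vertex codes from the Adleman example earlier in the deck.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("GC content is undefined for an empty sequence")
    return sum(base in "GC" for base in seq.upper()) / len(seq)

print(gc_content("ACGCTT"))  # 0.5
print(gc_content("ATAGAT"))  # one G out of six bases
```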

Gibbs free energy (ΔG) • A well-known indicator of stability for DNA structures • A structure with lower ΔG is more stable. • The ΔG of an entire structure is the sum of the ΔG of its substructures [ZuSt81].

Nearest-neighbor method • Refer to [AlSa97], [TKY04] (entries [8] and [9] in the parameter table on the original slide)
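In a nearest-neighbor model, the ΔG of a duplex is approximated by adding one contribution per pair of adjacent base pairs (a stack), plus an initiation term. The sketch below shows only that summation. The parameter table `toy_stack_table` contains invented numbers (more negative for GC-rich stacks) and is purely illustrative; real calculations must use published parameters.

```python
from itertools import product

def toy_stack_table() -> dict:
    """Invented stacking energies in kcal/mol; NOT measured parameters."""
    table = {}
    for a, b in product("ACGT", repeat=2):
        gc_count = (a in "GC") + (b in "GC")
        table[a + b] = (-1.0, -1.4, -2.0)[gc_count]
    return table

def duplex_delta_g(seq: str, stack_table: dict, initiation: float = 0.0) -> float:
    """Sum the stack contributions of all adjacent base pairs along seq."""
    stacks = (seq[i:i + 2] for i in range(len(seq) - 1))
    return initiation + sum(stack_table[s] for s in stacks)

table = toy_stack_table()
print(duplex_delta_g("GCGC", table))  # three GC-rich stacks
print(duplex_delta_g("ATAT", table))  # three AT-only stacks
```

Because the total is a sum over local terms, changing one base only changes the two stacks that contain it, which is what makes incremental re-evaluation during sequence search cheap.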

Secondary structures look like…

Template method [ArKo02] • Prepare two bit sequences, a template and a map, each of which has some desirable property (e.g., 50% GC content, error correction). • Using a conversion rule, construct one DNA sequence from these two bit sequences.

Template method (cont.) • Design criteria • Template • An element x should have at least d mismatches with x^R, xx, x^R x^R, x x^R, x^R x (x^R is the reverse of x). • A good template is found by exhaustive search. • Map (error-correcting code) • A code whose words have at least k mismatches with one another • e.g. BCH code • Drawback • It cannot prevent sequences from forming secondary structures.
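To make the template/map idea concrete, here is a minimal sketch of combining two bit strings into one DNA word. The conversion rule used (template bit 1 marks a G/C position, template bit 0 marks an A/T position, and the map bit picks the base within the pair) is an assumption made for illustration; the rule actually used in [ArKo02] may assign the bases differently.

```python
# Assumed conversion rule, keyed by (template bit, map bit).
# The template fixes WHERE G/C occurs; the map decides WHICH base is used.
CONVERT = {
    ("0", "0"): "A",
    ("0", "1"): "T",
    ("1", "0"): "C",
    ("1", "1"): "G",
}

def combine(template: str, code_word: str) -> str:
    """Build one DNA word from a template and a map code word."""
    if len(template) != len(code_word):
        raise ValueError("template and code word must have equal length")
    return "".join(CONVERT[(t, m)] for t, m in zip(template, code_word))

template = "110100"  # three 1's out of six bits: 50% GC content
for word in ("000000", "101101", "111111"):
    print(word, "->", combine(template, word))
```

Under this rule every word built from the same template has the same GC positions, so GC content is fixed by the template alone, while mismatches between code words of the map carry over to mismatches between DNA words.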

AG-templates, GC-templates [KKA03] • GC-template • The template contains the same number of 0’s and 1’s (50% GC content). • The map is an error-correcting code. • AG-template • The map is a constant-weight code (50% GC content). • Results in a bigger set of sequences

Other approaches • DNASequenceGenerator [FBR00] • Software with a GUI • Creates sequences with a specified melting temperature and GC content, and without palindromes, start codons, or restriction sites.

Other approaches • Suyama’s approach [YoSu00] • Generate sequences randomly, and add each one to the sequence set iff it satisfies all of the following constraints: • Uniform melting temperature • No mis-hybridization • No formation of stable secondary structures • Drawback: it easily falls into local optima.
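The generate-and-test loop described on this slide can be sketched as follows. The constraint checks are deliberately simplified stand-ins, not the tests used in [YoSu00]: exactly 50% GC content stands in for uniform melting temperature, and a minimum Hamming distance to every accepted word and to its reverse complement stands in for the mis-hybridization test. No secondary-structure test is included, and all names and parameters are illustrative.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def revcomp(seq: str) -> str:
    """Reverse complement of seq."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def acceptable(cand: str, accepted: list, min_dist: int) -> bool:
    # Stand-in for uniform Tm: exactly half of the bases are G or C.
    if 2 * sum(b in "GC" for b in cand) != len(cand):
        return False
    # Stand-in for mis-hybridization: far from its own reverse complement,
    # and far from every accepted word and that word's reverse complement.
    if hamming(cand, revcomp(cand)) < min_dist:
        return False
    return all(hamming(cand, w) >= min_dist and
               hamming(cand, revcomp(w)) >= min_dist for w in accepted)

def random_code_set(length: int, size: int, min_dist: int,
                    max_tries: int = 100_000, seed: int = 0) -> list:
    """Draw random words and keep each one iff it passes all checks."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == size:
            break
        cand = "".join(rng.choice("ACGT") for _ in range(length))
        if acceptable(cand, accepted, min_dist):
            accepted.append(cand)
    return accepted

print(random_code_set(length=8, size=5, min_dist=3, seed=1))
```

Words accepted early are never revisited, so an unlucky early choice can block many later candidates; this is one way to see the local-optimum drawback noted on the slide.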

Other approaches • Hybrid randomized neighborhoods [TuHo03] • Stochastic local search (SLS) algorithm • With probability ε, it explores neighbors by randomly mutating the current best sequences. • With probability 1-ε, it moves in the direction that maximally decreases the number of constraint conflicts.

Other approaches • GA (genetic algorithm)-based approach [ANH00] • Uses a GA and evaluates the fitness of candidate solutions by the following criteria: • Restriction sites • GC content • Hamming distance • Repetition of the same base

Other approaches • Gibbs free energy-based approach [TKY05], [KNO08] • Takes thermodynamics into consideration • Uses Gibbs free energy as a stability measure • Advantage • Greater accuracy, because it takes into account the stability of loops and the stacking between base pairs • Disadvantage • More computation time is needed to calculate free energy • How can this computational cost be reduced?

A formal language approach • Design a set of structure-free codes in terms of WK-complementarity. • Advantage • More reliable codes than the free-energy approach • More efficient algorithms for decision problems • Disadvantage • Each structure needs to be considered separately.

A formal language approach (cont.) • Abstraction of biological concepts • {A, C, G, T} → an alphabet V • WK-complementarity → an antimorphic involution θ • Involution • A mapping θ s.t. θ² is the identity (symmetry). • Antimorphism • θ(xy) = θ(y)θ(x) (opposite direction). • e.g. θ(TCATCCGATTTCGGG) = CCCGAAATCGGATGA [Figure: TCATCCGATTTCGGG paired with its complementary strand AGTAGGCTAAAGCCC]
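For the DNA alphabet, θ is the reverse complement. The sketch below implements it and checks the two defining properties on the slide’s example word; the function name `theta` and the split of the word into `x` and `y` are chosen here for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(word: str) -> str:
    """WK involution on DNA words: complement each base, reverse the order."""
    return "".join(COMPLEMENT[base] for base in reversed(word))

x, y = "TCATCCG", "ATTTCGGG"      # x + y is the example word from the slide
print(theta(x + y))               # CCCGAAATCGGATGA

# Involution: applying theta twice gives back the original word.
assert theta(theta(x + y)) == x + y
# Antimorphism: theta reverses the order of concatenation.
assert theta(x + y) == theta(y) + theta(x)
```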

Bond-free properties[KKS05] • θ-non-overlapping: • θ-compliant: • Strictly (a) : a property (a) with θ-non-overlapping

Bond-free properties[KKS05] • θ-p-compliant: • θ-s-compliant:

Bond-free properties[KKS05] • θ-free: • θ-sticky-free:

Bond-free properties[KKS05] • θ-3’-overhang-free: • θ-5’-overhang-free: • θ-overhang-free: both of these

Decidability [KKS05] • Theorem • The following problem is decidable in quadratic time w.r.t. |A|: • Input: an NFA A • Output: Yes/No depending on whether L(A) satisfies a given one of the following properties (or their strict versions): • θ-compliant, θ-p-compliant, θ-s-compliant, • θ-sticky-free, • θ-3’-overhang-free, θ-5’-overhang-free, θ-overhang-free.

Decidability and maximality [KKS05] • Theorem • Let M be a regular language and L be a regular subset of M with a property ρ, where ρ is one of the following: • θ-compliant, • θ-p-compliant, • θ-s-compliant, or • θ-sticky-free. • Then it is decidable whether L is a maximal subset of M satisfying ρ.

Secondary structure prevention • Secondary structures: • Hairpin-loop (or simply hairpin) • Internal loop • Multiple-branch loop • Pseudoknot • They can be undesirable • e.g. for Adleman’s encoding technique for Hamiltonian Path Problem (HPP).

Secondary Structures [Figure: a hairpin, an internal loop, and a hairpin frame (multiple loop), drawn on strands with their 5’ and 3’ ends marked]

Hairpin-free language • A formal model of a hairpin: x v y θ(v) z, e.g. TAA---ACG---CGTTA---CGT---CGGT with x = TAA, v = ACG, y = CGTTA, θ(v) = CGT, z = CGGT. • Hairpin freeness • Intuitively, it’s almost impossible to prevent hairpins of short stack length (say 2 or 3). • Our aim is to prevent any hairpin of stack length no less than some given parameter k.

Hairpin-free language [KKL06] • A word w is (θ, k)-hairpin-free (abbr. hp(θ, k)-free) iff w = xvyθ(v)z implies |v| < k. • hpf(θ, k): the set of all hp(θ, k)-free words in Σ* • hp(θ, k): Σ* - hpf(θ, k) • A language L is called (θ, k)-hairpin-free iff L ⊆ hpf(θ, k).
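Reading the hairpin model x v y θ(v) z literally, a word contains a hairpin of stack length at least k exactly when some factor v of length k is followed, anywhere later in the word, by θ(v); longer stacks always contain one of length k. The sketch below is a direct brute-force check for a single word, taking θ to be the reverse complement. It is not the automaton-based decision procedure of [KKL06], and the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(word: str) -> str:
    """Reverse complement, used here as the involution theta."""
    return "".join(COMPLEMENT[base] for base in reversed(word))

def has_hairpin(word: str, k: int) -> bool:
    """True if word = x v y theta(v) z for some factor v with |v| = k."""
    if k < 1:
        raise ValueError("stack length k must be at least 1")
    for i in range(len(word) - 2 * k + 1):
        v = word[i:i + k]
        # theta(v) must start at or after the end of v (y may be empty).
        if theta(v) in word[i + k:]:
            return True
    return False

def is_hairpin_free(word: str, k: int) -> bool:
    return not has_hairpin(word, k)

print(has_hairpin("TAAACGCGTTACGTCGGT", 3))  # True: v = ACG, theta(v) = CGT
print(is_hairpin_free("AAAAAA", 2))          # True: TT never occurs
```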

Regularity of hairpin languages • hp(θ, k) and hpf(θ, k) are regular. • For a hp(θ, k)-free language L, there exists a finite automaton M s.t. L = L(M).

Hairpin Freeness Problems • Hairpin-Freeness problem • Input: a nondeterministic automaton M. • Output: Y/N depending on whether L(M) is hp(θ, k)-free. • Maximal Hairpin-Freeness problem • Input: a deterministic automaton M1 and an NFA M2. • Output: Y/N depending on whether there is a word w ∈ L(M2) - L(M1) s.t. L(M1) ∪ {w} is hp(θ, k)-free.

Decidability • The hairpin-freeness problem for regular languages is decidable in time. • The maximal hairpin-freeness problem for regular languages is decidable in time.

Hairpin Frames • So-called multiple loop • hp-frame of degree n: • The figure on the right is an example of an hp-frame of degree 3. • A word u is an hp(fr, j)-word if it contains an hp-frame of degree j.

Regularity & decidability • hp(θ, fr, j) : the set of all hp(fr, j)-words on Σ* • hpf(θ, fr, j) : its complement in Σ* • The languages hp(θ, fr, j) & hpf(θ, fr, j) are regular. • The hp(fr, j)-freeness problem is decidable in linear time. • The maximal hp(fr, j)-freeness problem is decidable in time.

Application: DNA-HRAMs • An n-bit DNA-HRAM consists of n hairpins. • Each hairpin stores 1 bit of information by closing into and opening out of a hairpin, as shown in the figure. [Figure: the strand --A-C-T-G-T-C-G-A-C-A-G-T-- in its two states, open and closed (hairpin), representing the bit values 0 and 1]

n-bit DNA-HRAM • A concatenation of n 1-bit RAMs, which is equivalent to an hp-frame of degree n. • In order for this word to work as an n-bit RAM, the following subword should be hp(θ, 20)-free. • A DNA memory with 4 hairpins was proposed in [KYO08].

Reference • [AlSa97] Allawi, H.T., SantaLucia, J.: Thermodynamics and NMR of internal G·T mismatches in DNA. Biochemistry 36(34) (1997) 10581-10594 • [ArKo02] Arita, M., Kobayashi, S.: DNA sequence design using templates. New Generation Computing 20 (2002) 263-277 • [ANH00] Arita, M., Nishikawa, A., Hagiya, M., Komiya, K., Gouzu, H., Sakamoto, K.: Improving sequence design for DNA computing. Proc. Genetic and Evolutionary Computation Conference (2000) 875-882 • [FBR00] Feldkamp, U., Saghafi, S., Rauhe, H.: A DNA sequence compiler. Proc. DNA6 (2000) • [KKS05] Kari, L., Konstantinidis, S., Sosik, P.: Preventing undesirable bonds between DNA codewords. Proc. DNA10, LNCS 3384 (2005) 182-191 • [KKL06] Kari, L., Konstantinidis, S., Losseva, E., Sosik, P., Thierrin, G.: A formal language analysis of DNA hairpin structures. Fundamenta Informaticae 71 (2006) 453-475 • [KKA03] Kobayashi, S., Kondo, T., Arita, M.: On template method for DNA sequence design. Proc. DNA8, LNCS 2568 (2003) 205-214

Reference (cont.) • [KNO08] Kawashimo, S., Ng, Y-K., Ono, H., Sadakane, K., Yamashita, M.: Speeding up local-search type algorithms for designing DNA sequences under thermodynamical constraints. Proc. DNA14 (2008) 152-161 • [KYO08] Kameda, A., Yamamoto, M., Ohuchi, A., Yaegashi, S., Hagiya, M.: Unravel four hairpins! Natural Computing 7 (2008) 287-298 • [RFL01] Ruben, A.J., Freeland, S.J., Landweber, L.F.: PUNCH: An evolutionary algorithm for optimizing bit set selection. Proc. DNA7 (2001) 150-160 • [Shannon48] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27 (1948) 379-423, 623-656 • [TKY04] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Thermodynamic parameters based on a nearest-neighbor model for DNA sequences with a single-bulge loop. Biochemistry 43(22) (2004) 7143-7150 • [TKY05] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Design of nucleic acid sequences for DNA computing based on a thermodynamic approach. Nucleic Acids Res. 33(3) (2005) 903-911

Reference (cont.) • [TuHo03] Tulpan, D., Hoos, H.: Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. In Advances in Artificial Intelligence: 16th Conference of the Canadian Society for Computational Studies of Intelligence, 2671 (2003) 418-433 • [YoSu00] Yoshida, H., Suyama, A.: Solution to 3-SAT by breadth first search. Proc. the 5th DIMACS Workshop on DNA Based Computers, 54 (2000) 9-22 • [ZuSt81] Zuker, M., Stiegler, P.: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9(1) (1981) 133-148
