Vyacheslav V. Rykov

DNA CODES GENERATION USING AN IMPROVED METRIC Vyacheslav V. Rykov

Outline
• DNA Hybridization/Cross Hybridization
• DNA Codes
• Nearest Neighbor Thermodynamics
• Complete computations
• Bounds
• Overview of Applications and Purposes
• DNA Bitstring Library
• Biomolecular Computing

DNA Hybridization
• DNA strands are modeled by directed 3’-->5’ sequences of letters from the alphabet {A, C, G, T}.
• (A, T) and (C, G) are complementary pairs.
• Two oppositely directed DNA sequences are capable of coalescing into a duplex.
• Because an A (C) in one strand can (usually) only bind to a T (G) in the oppositely directed strand, the greatest energy of duplex formation is obtained when the two sequences are reverse-complements (complements).
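The reverse-complement operation described above can be sketched in a few lines of Python. The function name `reverse_complement` is an illustrative choice, not something taken from the slides; the example strand is one of the coding strands of the DNA code shown in these slides.

```python
# Complement table for the DNA alphabet {A, C, G, T}:
# A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Complement every base, then reverse the reading direction."""
    return strand.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("TACGCGACTTTC"))  # GAAAGTCGCGTA
```

Applying the function twice returns the original strand, which is why a strand and its reverse-complement always come as a pair.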

Orientation of single DNA strands is important for hybridization.

A DNA Code

Coding Strands for Ligation    Probing Complement Strands for Reading
5’ TACGCGACTTTC 3’             5’ GAAAGTCGCGTA 3’
5’ ATCAAACGATGC 3’             5’ GCATCGTTTGAT 3’
5’ TGTGTGCTCGTC 3’             5’ GACGAGCACACA 3’
5’ ATTTTTGCGTTA 3’             5’ TAACGCAAAAAT 3’
5’ CACTAAATACAA 3’             5’ TTGTATTTAGTG 3’
5’ GAAAAAGAAGAA 3’             5’ TTCTTCTTTTTC 3’

Watson Crick (WC) Duplexes: Must Have
• 5’TACGCGACTTTC3’ with 5’GAAAGTCGCGTA3’
• ATCAAACGATGC with GCATCGTTTGAT

Cross Hybridized (CH) Duplexes: Must Avoid
• TACGCGACTTTC with ATTTTTGCGTTA
• GCATCGTTTGAT with GAAAAAGAAGAA

x = 5’ggCaCaTcatAct3’
y = 5’agatgGttAAccT3’

x against y read in the opposite direction:
5’ggCaCaTcatAct3’
3’ TccAAttGgtaga5’

x against the reverse-complement of y:
5’ggCaCaTcatAct3’
5’ AggTTaaCcatct3’

DNA codes serve as universal components for biomolecular computing.
• DNA codes are closed under reverse-complementation.
• The strands in a DNA code have such binding specificity that a code strand will hybridize only with its reverse-complement and will not cross hybridize with any other strand in the code.
• Such collections of strands are crucial to the success of biomolecular computing and biomolecular nanotechnology.
• The basic idea is to have correct, parallel, and autonomous addressing.
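As a small illustration of the closure property, the following sketch checks that a set of strands is closed under reverse-complementation. The twelve strands are the example DNA code from these slides; the helper names are hypothetical, and the check covers closure only, not the absence of cross hybridization.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Complement every base, then reverse the strand."""
    return strand.translate(COMPLEMENT)[::-1]


def is_closed_under_reverse_complement(code):
    """True if the reverse-complement of every strand is also in the code."""
    strands = set(code)
    return all(reverse_complement(s) in strands for s in strands)


# The six coding strands followed by their six probing complements.
EXAMPLE_CODE = [
    "TACGCGACTTTC", "ATCAAACGATGC", "TGTGTGCTCGTC",
    "ATTTTTGCGTTA", "CACTAAATACAA", "GAAAAAGAAGAA",
    "GAAAGTCGCGTA", "GCATCGTTTGAT", "GACGAGCACACA",
    "TAACGCAAAAAT", "TTGTATTTAGTG", "TTCTTCTTTTTC",
]

if __name__ == "__main__":
    print(is_closed_under_reverse_complement(EXAMPLE_CODE))
```

Removing any single strand from the list breaks closure, because its partner then has no reverse-complement left in the set.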

“Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains” (Eason et al., PNAS).
• DNA codes for self-assembly of any components that can be attached to DNA. Their size presents the potential for increased complexity and location control in nanostructures produced by assembly that is driven by DNA duplex formation. Fundamental physical limits and increasing costs of fabrication facilities will force alternatives to conventional microelectronics manufacturing to be developed. In self-assembly, weak, local interactions among molecular components spontaneously organize those components into aggregates with properties that range from simple to complex.
• DNA memory: The capacity and storage density of such memories are potentially very large. Information could be mined through massively parallel template-matching reactions. In addition, information could be processed based upon context, and information matched associatively based upon content.

DNA Computing
• Interest in DNA computing was sparked in 1994 by Len Adleman, who showed how DNA molecules can be used to solve a mathematical problem (the Hamiltonian path problem).
• DNA computing relies on the fact that DNA strands can be represented as sequences of bases (4-ary sequences) and on the property of hybridization.
• Errors can occur in hybridization. Thus, error-correcting codes are required for efficient synthesis of DNA strands to be used in computing.

DNA Computing Strand Engineering
• No codeword-codeword CH (cc-CH)
• No codeword-probe CH (cp-CH)
• No probe-probe CH (pp-CH)

Codeword            Bead probe
AAAAAAAACC = T1     GGTTTTTTTT = BEAD PROBE (T1)
TTTCCAAAAA = F1     TTTTTGGAAA = BEAD PROBE (F1)
TTTCTTAACC = T2     GGTTAAGAAA = BEAD PROBE (T2)
ACTAACAAAA = F2     TTTTGTTAGT = BEAD PROBE (F2)
CATAAAACAC = T3     GTGTTTTATG = BEAD PROBE (T3)
ATCTTTTCAA = F3     TTGAAAAGAT = BEAD PROBE (F3)
CAATCCATTA = T4     TAATGGATTG = BEAD PROBE (T4)
CCTTCTAAAT = F4     ATTTAGAAGG = BEAD PROBE (F4)
ACTCCTAATA = T5     TATTAGGAGT = BEAD PROBE (T5)
TCTCTCTACT = F5     AGTAGAGAGA = BEAD PROBE (F5)

Only Allowed Hybridizations (no cc-CH, no cp-CH, no pp-CH)

Codeword              Probe
CCAACCAAAAAA = T1     TTTTTTGGTTGG = Probe(T1)
AAAAAAACCACC = F1     GGTGGTTTTTTT = Probe(F1)
TTTTTCCTTCCA = T2     TGGAAGGAAAAA = Probe(T2)
TTACCTCAAACC = F2     GGTTTGAGGTAA = Probe(F2)
TTTCACAACTCC = T3     GGAGTTGTGAAA = Probe(T3)
TTCAATCCACAA = F3     TTGTGGATTGAA = Probe(F3)
TCACTCTCTCAA = T4     TTGAGAGAGTGA = Probe(T4)
TCTTTCTCCTCT = F4     AGAGGAGAAAGA = Probe(F4)
CATCTCACCATC = T5     GATGGTGAGATG = Probe(T5)
AACACTACACAC = F5     GTGTGTAGTGTT = Probe(F5)

DNA Computing Strand Engineering: No codeword-probe CH (cp-CH)

Codeword              Probe
TTCAATCCACAA = F3     TTGTGGATTGAA = Probe(F3)
TCACTCTCTCAA = T4     TTGAGAGAGTGA = Probe(T4)
TCTTTCTCCTCT = F4     AGAGGAGAAAGA = Probe(F4)
CATCTCACCATC = T5     GATGGTGAGATG = Probe(T5)
AACACTACACAC = F5     GTGTGTAGTGTT = Probe(F5)
CCAACCAAAAAA = T1     TTTTTTGGTTGG = Probe(T1)
AAAAAAACCACC = F1     GGTGGTTTTTTT = Probe(F1)
TTTTTCCTTCCA = T2     TGGAAGGAAAAA = Probe(T2)
TTACCTCAAACC = F2     GGTTTGAGGTAA = Probe(F2)
TTTCACAACTCC = T3     GGAGTTGTGAAA = Probe(T3)

Library strand T1-F2-F3-T4-T5 (bitstring 1 0 0 1 1):
CAACCAAAAAA-TTACCTCAAACC-TTCAATCCACAA-TCACTCTCTCAA-CATCTCACCATC

• PROBE(F2) GGTTTGAGGTAA: Yes, WC bonding. Yes, bitstring is F2. Good read.
• PROBE(T2) GGAGTTGTGAA: Darn! CH bonding. No, bitstring is not T2. Bad read.

DNA Computing Strand Engineering: No pp-CH, no cc-CH

pp-CH interferes with reading (bonding site competition):
PROBE(F2) GGTTTGAGGTAA
PROBE(T4) TTGAGAGAGTG
T1-F2-F3-T4-T5: CAACCAAAAAA-TTACCTCAAACC-TTCAATCCACAA-TCACTCTCTCAA-CATCTCACCATC

cc-CH interferes with separation and leads to unwanted library strand interaction:
T1-F2-F3-T4-T5: CAACCAAAAAA-TTACCTCAAACC-TTCAATCCACAA-TCACTCTCTCAA-CATCTCACCATC
F1-F2-T3-T4-F5: TTTCCAAAA-ATTACCTCAAACC-TTTCACAACTCC-TCACTCTCTCAA-AACACTACACAC

Watson-Crick Nearest Neighbor Computation

WC Duplex:
5’g g c a c a3’
3’c c g t g t5’

Stacked pairs: gg 1.84, gc 2.24, ca 1.45, ac 1.44, ca 1.45
NNFE = 1.84 + 2.24 + 1.45 + 1.44 + 1.45 = 8.42
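The Watson-Crick computation above is a sum over consecutive (stacked) base pairs, which can be sketched as follows. The table `STACK` holds magnitudes, in kcal/mol, of a commonly used unified set of nearest-neighbor stacked-pair parameters; that this is the table behind the slide is an assumption, although it reproduces the slide's value for this example. Initiation and end corrections are deliberately left out, and the name `wc_nnfe` is illustrative.

```python
# Magnitudes of nearest-neighbor stacked-pair free energies (kcal/mol).
# Key "XY" is the step 5'-XY-3' paired with its Watson-Crick complement,
# so a step and its reverse-complement step share one value.
# ASSUMPTION: a commonly used unified parameter set; the slides do not
# name their source table.
STACK = {
    "AA": 1.00, "TT": 1.00,
    "AT": 0.88,
    "TA": 0.58,
    "CA": 1.45, "TG": 1.45,
    "GT": 1.44, "AC": 1.44,
    "CT": 1.28, "AG": 1.28,
    "GA": 1.30, "TC": 1.30,
    "CG": 2.17,
    "GC": 2.24,
    "GG": 1.84, "CC": 1.84,
}


def wc_nnfe(strand):
    """Sum stacked-pair energies of a strand bound to its reverse-complement."""
    s = strand.upper()
    return sum(STACK[s[i:i + 2]] for i in range(len(s) - 1))


if __name__ == "__main__":
    print(round(wc_nnfe("ggcaca"), 2))  # 8.42
```

A strand of length n contributes n - 1 stacked pairs, so longer perfectly matched duplexes accumulate energy roughly linearly in their length.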

Cross Hybridized Nearest Neighbor Upper Bound Computation

5’ggCaCaTcatAct3’
3’ TccAAttGgtaga5’

5’ggCaCaTcatAct3’
5’ AggTTaaCcatct3’

Stacked pair values: 1.45, 1.28, 1.84, 0.88
NNFE ≲ 1.45 + 1.28 + 1.84 + 0.88 = 5.45
With the additional 0.27 term: NNFE ≲ 5.45 + 0.27 = 5.72

Intermolecular Interactions
[Figure: duplexes with a symmetric loop and an asymmetric loop; values shown: 2.90 (5.3) and 2.20 (5.3)]

Intramolecular Interactions
CAAGACTTTTTGGTAGTAAA
***TTTCCC*********GGAA***GGGAAA***********TTCC***

NNFE ≲ 5.66
NNFE ≲ 5.66 + 0.59 + 0.32 = 6.57

[Figure: Virtual Stacked Pairs and Virtual Duplex]

5’ggCaCaTcatAct3’
3’ TccAAttGgtaga5’

5’ggC aC aT c a tA ct3’
3’ T cc AA t t G g t aga5’

5’ggCaCaTcatAct3’
5’ AggTTaaCcatc3’

5’ggC a C a T c a t A ct3’
5’ A ggTT a aC c a t ct3’

Sequences and their reverse-complements:
5’GGCACATCATACT3’ and 5’AGTATGATGTGCC3’
5’AGGTTAACCATCT3’ and 5’AGATGGTTAACCT3’

Nearest Neighbor Appr. Free Energy of duplex formation (WC):
5’GGCACATCATACT3’
5’AGTATGATGTGCC3’
Stacked pair values 2.24, 1.84, 1.45, … sum to 18.8

Cross hybridization of 5’GGCACATCATACT3’ with 5’AGGTTAACCATCT3’:
5’ggCaCaTcatAct3’
3’ TccAAttGgtaga5’
5’ggCaCaTcatAct3’
5’ AggTTaaCcatct3’
Stacked pair values: 1.28, 1.45, 0.88, 1.84
NNFE CH = 6.45

[Scatter plot: our FE bound versus precise FE; length 16, 435 random; correlation = 0.737]

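Basic Notations
Let $A_q^n$ denote a set consisting of all vectors (codewords) of length $n$ built over the alphabet $A_q = \{0, 1, \dots, q-1\}$, i.e. $A_q^n = \{x = (x_1, \dots, x_n) : x_i \in A_q\}$.
Let $d : A_q^n \times A_q^n \to \mathbb{R}$ be a function such that:
1) $d(x, y) \ge 0$, with equality if and only if $x = y$;
2) $d(x, y) = d(y, x)$;
3) $d(x, z) \le d(x, y) + d(y, z)$.
Let $C \subseteq A_q^n$ be such that $|C| = M$ and $d(x, y) \ge d$ for all distinct $x, y \in C$. Then $C$ is referred to as a code of length $n$, size $M$, and minimum distance $d$.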

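A sphere in $A_q^n$ (the set of all length-$n$ vectors over the alphabet) centered at $x$ having radius $d$: $S_d(x) = \{y \in A_q^n : d(x, y) \le d\}$
Volume of the sphere around $x$, of radius $d$: $V_d(x) = |S_d(x)|$

Spaces
• A space is HOMOGENEOUS when the volume of a sphere does not depend on where it is centered, i.e. $V_d(x) = V_d(y)$ for all $x$ and $y$.
• A space is NON-HOMOGENEOUS when the volume of a sphere does depend on where it is centered.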

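Similarity
A sequence $z = (z_1, \dots, z_k)$ is a subsequence of $x = (x_1, \dots, x_n)$ if and only if there exists a strictly increasing sequence of indices $1 \le i_1 < i_2 < \dots < i_k \le n$ such that $z_j = x_{i_j}$ for $j = 1, \dots, k$.
$LCS(x, y)$ is defined to be the set of longest common subsequences of $x$ and $y$.
$|LCS(x, y)|$ is defined to be the length of the longest common subsequence of $x$ and $y$.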

Example of LCS Just what it says: x = <A C G T C G A G C> y = <G A C G C T G A G> LCS(x, y) = {<A C G C G A G>} |LCS(x, y)| = 7

Insertion-Deletion Metric
Original Insertion-Deletion metric (Levenshtein 1966): $d(x, y) = n - |LCS(x, y)|$
This metric results from the number of deletions and insertions that need to be made to obtain ‘y’ from ‘x’. For vectors that have the same length $n$:
• the number of deletions that will be made is $n - |LCS(x, y)|$;
• likewise, the number of insertions that will be made is $n - |LCS(x, y)|$.
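A brute-force sketch can make the non-homogeneity of this space concrete. It assumes the insertion-deletion distance between equal-length strands is the strand length minus the LCS length (the common number of deletions and insertions); the function names are illustrative, and the enumeration is only practical for very short strands.

```python
from itertools import product


def lcs_length(x, y):
    """Length of a longest common subsequence, one table row at a time."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(x, y):
    """ASSUMED form of the metric: n - |LCS(x, y)| for strands of length n."""
    if len(x) != len(y):
        raise ValueError("strands must have the same length")
    return len(x) - lcs_length(x, y)


def sphere_volume(center, radius, alphabet="ACGT"):
    """Count all strands within the given distance of center (brute force)."""
    return sum(
        1
        for word in product(alphabet, repeat=len(center))
        if indel_distance(center, "".join(word)) <= radius
    )


if __name__ == "__main__":
    print(sphere_volume("AAAA", 1), sphere_volume("ACGT", 1))
```

The sphere of radius 1 around AAAA contains only the 13 strands with at least three A's, while the sphere around ACGT is larger, so the volume depends on the center.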

Better Metric?
• LCS is simple and easy to compute.
• LCS is essentially a count of the number of base pairings between two sequences, and thus does approximate bonding energy.
• Clue: if a base pair bonds, but neither its neighbor to the right nor its neighbor to the left bonds, it really doesn’t contribute much.
• We might call such inconsequential bonds “lone” bonds.

“Lone Bonds”
[Figure: two aligned strands of bases B joined by bonds, some of them drawn in red]
The red bonds are “lone bonds” that don’t contribute to the binding energy.

Block LCS
The longest common subsequence SUCH THAT: if $x_i$ is matched to $y_j$, THEN EITHER $x_{i-1}$ is matched to $y_{j-1}$, OR $x_{i+1}$ is matched to $y_{j+1}$.

Longest Common Stacked Pair Subsequence
A common subsequence $z$ is called a common stacked pair subsequence of length $|z|$ between $x$ and $y$ if two elements $z_i$, $z_{i+1}$ are consecutive in $x$ and consecutive in $y$, or, if they are non-consecutive in $x$ or non-consecutive in $y$, then $z_{i+1}$ and $z_{i+2}$ are consecutive in $x$ and $y$.
Let $S(x, y)$ denote the length of the longest sequence $z$ occurring as a common stacked pair subsequence between sequences $x$ and $y$. The number $S(x, y)$ is called the similarity of blocks between $x$ and $y$. The metric is defined in terms of this similarity.

Bounds in Coding Theory
We will be working in a NON-HOMOGENEOUS space, which makes obtaining exact formulas for sphere volumes and code sizes VERY HARD.
[6] L. M. G. M. Tolhuizen (1997): The Generalized Varshamov-Gilbert Bound is Implied by Turan’s Theorem, IEEE Transactions on Information Theory, 43:05.
Varshamov-Gilbert Lower Bound on Code Size in $A_q^n$ with any metric: $M \ge \frac{q^n}{\bar V_{d-1}}$, where $\bar V_{d-1}$ is the average volume of a sphere of radius $d - 1$.

Turan's Theorem
Let G be a simple graph on $N$ vertices and $e$ edges. G contains an M-clique if: $e > \frac{M-2}{M-1} \cdot \frac{N^2}{2}$
[Figure: CLIQUES]

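The edge set of G is constructed as follows: the vertices are the $q^n$ vectors of length $n$, and an edge (x, y) exists in G if and only if d(x, y) ≥ d. The first question is: how many edges does G have? This can be found by taking spheres of radius d − 1 around each vector and counting how many vectors are outside the particular sphere. Since edges will be double counted, we must divide by 2: $e = \frac{1}{2} \sum_{x} \left( q^n - V_{d-1}(x) \right)$, where $V_{d-1}(x)$ is the volume of the sphere of radius d − 1 around x.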

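From Turan to Varshamov-Gilbert
If: $\frac{1}{2} \sum_{x} \left( q^n - V_{d-1}(x) \right) > \frac{M-2}{M-1} \cdot \frac{q^{2n}}{2}$, where $V_{d-1}(x)$ is the volume of the sphere of radius $d - 1$ around $x$,
Then there exists a code of size M.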

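Let $\bar V_{d-1} = \frac{1}{q^n} \sum_{x} V_{d-1}(x)$ be the average volume of a sphere of radius $d - 1$.
Then: the number of edges is $e = \frac{1}{2} \left( q^{2n} - q^n \bar V_{d-1} \right)$, and the Turan condition $e > \frac{M-2}{M-1} \cdot \frac{q^{2n}}{2}$ is equivalent to $M - 1 < \frac{q^n}{\bar V_{d-1}}$.
Hence there exists a code of size M whenever $M < \frac{q^n}{\bar V_{d-1}} + 1$, and so the largest code size satisfies: $M \ge \frac{q^n}{\bar V_{d-1}}$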
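For very short strands, the average-sphere-volume form of the bound can be evaluated by exhaustive enumeration. The sketch below assumes the plain insertion-deletion distance n minus the LCS length, not the stacked pair metric, and assumes the bound takes the form "number of vectors divided by the average volume of a sphere of radius d - 1", rounded up; `gv_lower_bound` is a hypothetical name.

```python
from itertools import product


def lcs_length(x, y):
    """Length of a longest common subsequence, one table row at a time."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def gv_lower_bound(n, d, alphabet="ACGT"):
    """Guaranteed code size: ceil(N / average volume of radius d - 1 spheres)."""
    words = ["".join(w) for w in product(alphabet, repeat=n)]
    radius = d - 1
    # Sum of all sphere volumes = number of ordered pairs within the radius.
    # ASSUMPTION: distance is n - LCS length.
    total_volume = sum(
        1 for x in words for y in words if n - lcs_length(x, y) <= radius
    )
    count = len(words)
    # N / (total_volume / N) rounded up, using integer arithmetic only.
    return -(-count * count // total_volume)


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, gv_lower_bound(3, d))
```

For realistic strand lengths the enumeration is infeasible, so the sphere volumes would have to be bounded analytically or estimated statistically instead.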

Stacked Pair Metric Bounds The upper bound for the average sphere volume in this metric will be: The Varshamov-Gilbert bound becomes:

Bounds for Stacked Pair Metric d = 6 d = 7 d= 8 d= 9 d= 10

Insertion-deletion stacked pair thermodynamic metric Thermodynamic weight of virtual stacked pairs. • Can use statistical estimation of sphere volume.

Case 1: sequences end with the same symbol
A C G C G T T A
C T G A T A C A
Get the LCS of the two sequences without their final A, and add 1 for the A’s that have to match.

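Case 2: sequences end with different symbols
A C G C G T T A
C T G A T A C C
Take the best LCS of these two: the first sequence without its last symbol against the second, or the first against the second without its last symbol.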

Solve Problem Recursively
Here x(i) denotes the first i symbols of x, and y(j) the first j symbols of y.
• If x(i) and y(j) end with the same symbol, say A, then: LCS(x(i), y(j)) = LCS(x(i – 1), y(j – 1)) + A
• If x(i) and y(j) do NOT end with the same symbol, then: LCS(x(i), y(j)) = max[LCS(x(i – 1), y(j)), LCS(x(i), y(j – 1))]

Dynamic Programming
• Inefficient: we keep evaluating the same LCS(i, j) over and over.
• Instead, use dynamic programming.
• Fill in a table of LCS(i, j) values by i and j.
• You only have to figure each LCS(i, j) once.
• O(n^2).

In terms of the dynamic programming table:
[Figure: the cell we are trying to figure out, (i, j), and the information we use, the cells (i – 1, j – 1), (i – 1, j) and (i, j – 1)]
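The table-filling scheme can be sketched as follows. The function name `lcs` and the traceback step that recovers one longest common subsequence are illustrative additions; the recurrence itself is the two-case rule stated in these slides.

```python
def lcs(x, y):
    """Return one longest common subsequence of x and y as a string."""
    n, m = len(x), len(y)
    # table[i][j] = LCS length of the first i symbols of x
    # and the first j symbols of y.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Case 1: same final symbol, extend the diagonal cell.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Case 2: different final symbols, take the better neighbor.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right cell to recover the symbols.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(len(lcs("ACGTCGAGC", "GACGCTGAG")))  # 7
```

Each cell is computed once from three neighbors, which gives the quadratic running time; keeping the full table, rather than two rows, is what makes the traceback possible.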

Algorithm for Stacked Pair Metric
The longest common subsequence SUCH THAT there are no lone bonds: if $x_i$ is matched to $y_j$, THEN EITHER $x_{i-1}$ is matched to $y_{j-1}$, OR $x_{i+1}$ is matched to $y_{j+1}$.

Cannot “break” a block LCS
Big regular LCS:
A C T G C T
G A C G C T
Break to get two smaller regular LCS’s:
A C T G C T
G A C G C T

Cannot “break” a block LCS
Big block LCS:
G G T A G G
C C T A C C
CANNOT break to get two smaller block LCS’s:
G G T A G G
C C T A C C

Adding a single symbol to a string can have effects arbitrarily far back
A C T C C C C T
G G G G G A C T G
These three bonds make the LCSP.
A C T C C C C T G
G G G G G A C T G
Add just one symbol, G, and the red bond must be moved to make the new LCSP.

Two algorithms

Tail equality of two sequences
• Tail equality 3: A G C T C and A T C T C
• Tail equality 0: A G C T G and A T C T A

End count of a matching
• End count 2: A G C T C and A T C T C
• End count 0: A G C T G and A T C T A

• The end count of a matching between x and y cannot exceed the tail equality of x and y.
• Let $LCSP^{(k)}(i, j)$ be the largest length of a common stacked pair subsequence of x(i) and y(j) achievable with a matching of end count $k$.
• $LCSP(i, j) = \max_{0 \le k \le e} LCSP^{(k)}(i, j)$, where $e$ is the tail equality of x and y.
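One way to turn the tail-equality idea into code is sketched below. It is a straightforward cubic-time dynamic program and is not claimed to be either of the two algorithms mentioned in these slides. It assumes the length counts matched symbols and that a valid matching consists of blocks of at least two consecutive matched positions, so that no lone bonds remain; the function names are illustrative.

```python
def tail_equality(x, y):
    """Length of the longest common suffix of x and y."""
    k = 0
    while k < len(x) and k < len(y) and x[-1 - k] == y[-1 - k]:
        k += 1
    return k


def block_lcs_length(x, y):
    """Longest common subsequence made only of blocks of length >= 2."""
    n, m = len(x), len(y)
    # tail[i][j] = tail equality of the prefixes x[:i] and y[:j].
    tail = [[0] * (m + 1) for _ in range(n + 1)]
    # best[i][j] = block-LCS length of the prefixes x[:i] and y[:j].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                tail[i][j] = tail[i - 1][j - 1] + 1
            # End count 0: the last symbols are not matched to each other.
            value = max(best[i - 1][j], best[i][j - 1])
            # End count k >= 2: the last k symbols form a block.
            # End count 1 would be a lone bond, so it is skipped.
            for k in range(2, tail[i][j] + 1):
                value = max(value, best[i - k][j - k] + k)
            best[i][j] = value
    return best[n][m]


if __name__ == "__main__":
    print(block_lcs_length("ACTCCCCT", "GGGGGACTG"))   # 3
    print(block_lcs_length("ACTCCCCTG", "GGGGGACTG"))  # 4
```

The second call shows the effect of appending a single G: the best matching changes from the block ACT to the two blocks AC and TG, so a bond far from the end of the string has to move.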
