Universal DNA Tag Systems: A combinatorial design scheme Yao-lin Chang, Chi-hung Tsai, Han-yu Chuang, Yu-cheng Huang, Bo-j Chen
Motivation from biological needs by Yao-lin Chang
DNA Tag/AntiTag System (TAT) • Advantages • These universal components can be mass-produced, which reduces manufacturing costs. • [Figure labels: Reporter, AntiTag, Tag, Target-specific Part]
Example: Genotyping ATTGGCTATTGCCCATCGGGAA • Given: the positions of the SNPs (marked in red on the slide) • Goal: determine the variants present in a given sample (e.g. GTG)
Step 1 • CGATAACGGGTAGCCCTT TAG1 / ATTGGCTATTGCCCATCGGGAA • ACGGGTAGCCCTT TAG2 / ATTGGCTATTGCCCATCGGGAA • CTT TAG3 / ATTGGCTATTGCCCATCGGGAA
Step 2 • CGATAACGGGTAGCCCTT TAG1 / ATTGGCTATTGCCCATCGGGAA • ACGGGTAGCCCTT TAG2 / ATTGGCTATTGCCCATCGGGAA • CTT TAG3 / ATTGGCTATTGCCCATCGGGAA • [Figure: single nucleotides drawn around the strands: G, A, A, T, G, C, T, C]
Step 3 • C CGATAACGGGTAGCCCTT TAG1 / ATTGGCTATTGCCCATCGGGAA • A ACGGGTAGCCCTT TAG2 / ATTGGCTATTGCCCATCGGGAA • C CTT TAG3 / ATTGGCTATTGCCCATCGGGAA
Step 4 • C CGATAACGGGTAGCCCTT TAG1 ANTITAG1 • A ACGGGTAGCCCTT TAG2 ANTITAG2 • C CTT TAG3 ANTITAG3
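The four steps can be checked with a short script. This is only a sketch under assumptions: the strands are compared column by column exactly as they are drawn on the slides (strand orientation is ignored), each tagged oligo is taken to end at the right end of the target, and the names `call_variant`, `TARGET` and `OLIGOS` are made up for the illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    """Base-by-base Watson-Crick complement (orientation ignored, as drawn)."""
    return "".join(COMPLEMENT[base] for base in seq)

TARGET = "ATTGGCTATTGCCCATCGGGAA"
# Oligo sequences copied from Step 1; the dictionary keys are their tags.
OLIGOS = {"TAG1": "CGATAACGGGTAGCCCTT", "TAG2": "ACGGGTAGCCCTT", "TAG3": "CTT"}

def call_variant(target, oligo):
    """Return (SNP base, base added in Step 3) for one tagged oligo."""
    site = len(target) - len(oligo)          # assumed: oligo is right-aligned
    if complement(target[site:]) != oligo:
        raise ValueError("oligo does not match the target at the assumed site")
    snp = target[site - 1]                   # base just left of the oligo
    return snp, complement(snp)

calls = {tag: call_variant(TARGET, oligo) for tag, oligo in OLIGOS.items()}
variants = "".join(snp for snp, _ in calls.values())
extensions = "".join(ext for _, ext in calls.values())
print(variants, extensions)  # GTG CAC
```

The extension bases C, A, C are the ones drawn in front of the oligos in Step 3, and their complements spell the variant GTG from the example.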
Design problem • It is desirable to have as many tags as possible. • If too many tags are used, cross-hybridization may occur. • [Figure labels: Tag1, AntiTag1; bases shown: A C T A G A T C A A T G A T C]
Previous Work • Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Research, 1997. • Methods for sorting polynucleotides using oligonucleotide tags. US Patent, 1997. • Universal DNA microarray method for multiplex detection of low abundance point mutations. J. Mol. Biol., 1999.
Thermodynamic model (1) • DNA duplexes are held together by weak hydrogen bonds between Watson-Crick complementary nucleotides. • The energy required to melt a DNA duplex depends on • Strand length • C-G content (C-G pairs vs. A-T pairs)
Thermodynamic model (2) • Melting temperature t of two complementary strands U and V • At temperature t, half of the U and V strands are single-stranded and half are bound in duplexes. • TAT System
[Figure: copies of TAG1 shown together with Anti-1 and Anti-2 antitags]
Thermodynamic model (3) • Melting temperature estimation: the 2-4 rule (commonly used for short oligonucleotides) • The melting temperature in °C is approximately twice the number of A-T pairs plus four times the number of C-G pairs.
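A minimal sketch of the 2-4 rule, assuming the duplex is perfectly complementary so that counting bases on one strand counts the pairs. The function name `melting_temperature_2_4` is chosen for this illustration.

```python
def melting_temperature_2_4(seq):
    """Estimate the melting temperature (degrees Celsius) with the 2-4 rule."""
    seq = seq.upper()
    weak = seq.count("A") + seq.count("T")      # bases forming A-T pairs
    strong = seq.count("C") + seq.count("G")    # bases forming C-G pairs
    if weak + strong != len(seq):
        raise ValueError("sequence must contain only A, C, G, T")
    return 2 * weak + 4 * strong

print(melting_temperature_2_4("GACCAAT"))  # 2*4 + 4*3 = 20
```

The rule is only a rough estimate for short oligonucleotides; it ignores salt concentration and nearest-neighbor effects.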
Combinatorial Tag Design Problem • Given c and h, we call a set T of tags a valid c-h code if • Each tag has a weight of h or more • Any substring of weight c or more occurs at most once. • The weight w(s) of a string s = a_1 a_2 … a_k is w(s) = Σ w(a_i), where w(A) = w(T) = 1 and w(C) = w(G) = 2, so the 2-4 rule estimate for s is 2·w(s).
Combinatorial Tag Design Problem (cont.) • A valid c-h code corresponds to a solution of the TAT design problem. • Our goal is to find a maximum valid c-h code, i.e. one containing as many tags as possible, for given c and h.
[Figure labels: TAG1, TAG3, Anti-2, Anti-3; weight of substring = c; c-1; tag's weight at least h] • In this interval, no substring hybridization can exist, and the temperature is too low to break the bond between any tag and its antitag.
Tag-AntiTag System design problem by Chi-hung Tsai
TAT system design problem • Construct a TAT system with a maximum number of tag-antitag pairs such that the following properties hold: • For each tag-antitag pair (U, Ū), the melting temperature satisfies t_m(U, Ū) >= H • For any two distinct tags U and V and any common substring x of U and V, t_m(x, x̄) < C
Definition • The weight w(s) of a string s = a_1 a_2 … a_k is Σ w(a_i), where w(A) = w(T) = 1 and w(C) = w(G) = 2 • Given c and h, we call a set of strings a valid c-h code if the following two conditions are satisfied: • Condition 1: Each tag has a weight of h or more. • Condition 2: Any substring of weight c or more occurs at most once.
A valid 4-10 code with 12 tags GACCAAT CAGCTAT GTCGATA CTGGTTA CATTATCA GAAATTCT CTTAATGA GTATTTGT ATATAGTG TAAAACTC AATAAGAG TTTTACAC
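The two conditions can be checked by brute force. The sketch below counts every substring of weight c or more across all tags; the names `weight`, `is_valid_code` and `CODE_4_10` are introduced here for illustration, and the tag list is copied from the slide.

```python
from collections import Counter

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def weight(s):
    """Weight of a string: A and T count 1, C and G count 2."""
    return sum(WEIGHT[ch] for ch in s)

def is_valid_code(tags, c, h):
    """Check Condition 1 and Condition 2 by enumerating all substrings."""
    if any(weight(tag) < h for tag in tags):
        return False                       # Condition 1 violated
    occurrences = Counter()
    for tag in tags:
        for i in range(len(tag)):
            for j in range(i + 1, len(tag) + 1):
                if weight(tag[i:j]) >= c:
                    occurrences[tag[i:j]] += 1
    # Condition 2: every heavy substring occurs at most once overall.
    return all(count == 1 for count in occurrences.values())

CODE_4_10 = [
    "GACCAAT", "CAGCTAT", "GTCGATA", "CTGGTTA",
    "CATTATCA", "GAAATTCT", "CTTAATGA", "GTATTTGT",
    "ATATAGTG", "TAAAACTC", "AATAAGAG", "TTTTACAC",
]
print(is_valid_code(CODE_4_10, 4, 10))  # True
```

The check is quadratic in the tag length, which is fine for hand-sized examples like this one.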
Definition: c-token • We call a string t a c-token if w(t) >= c but no proper suffix of t has weight >= c.
Definition (restated with tokens) • The weight w(s) of a string s = a_1 a_2 … a_k is Σ w(a_i), where w(A) = w(T) = 1 and w(C) = w(G) = 2 • Given c and h, we call a set of strings a valid c-h code if the following two conditions are satisfied: • Condition 1: Each tag has a weight of h or more. • Condition 2': Any c-token occurs at most once.
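Tail weight • The tail weight of a token is the weight of its last character; the tail weight of a tag is the sum of the tail weights of its tokens. • Tag T: GACCAAT (c = 4) • Token GAC, tail C, tail weight 2 • Token CC, tail C, tail weight 2 • Token CCA, tail A, tail weight 1 • Token CAA, tail A, tail weight 1 • Token CAAT, tail T, tail weight 1 • Tail weight of T: 7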
Lemma 1: Tag tail weight • In the example, all characters of T except the first two terminate a token and thus contribute their weight to the tail weight of T • In general, the longest prefix of a tag in which no token ends has a total weight of at most c-1; every later character terminates a token • Lemma 1: Any tag in a valid c-h code has a tail weight of at least h-c+1
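The token and tail-weight definitions translate directly into code. The following sketch (function names `tokens` and `tail_weight` are illustrative) lists the c-tokens of a tag by scanning backwards from each end position and reproduces the GACCAAT example with c = 4.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def tokens(tag, c):
    """Return the c-tokens of tag, one for each position that ends a token."""
    found = []
    for end in range(1, len(tag) + 1):
        start, wt = end, 0
        # Grow the suffix ending at `end` until its weight reaches c.
        while start > 0 and wt < c:
            start -= 1
            wt += WEIGHT[tag[start]]
        if wt >= c:
            found.append(tag[start:end])   # shortest suffix of weight >= c
    return found

def tail_weight(tag, c):
    """Sum of the weights of the last characters of all tokens of tag."""
    return sum(WEIGHT[token[-1]] for token in tokens(tag, c))

print(tokens("GACCAAT", 4))       # ['GAC', 'CC', 'CCA', 'CAA', 'CAAT']
print(tail_weight("GACCAAT", 4))  # 7
```

With c = 4 and h = 10, Lemma 1 promises a tail weight of at least h-c+1 = 7, which this tag meets exactly.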
Definition • <n> denotes the set of strings of weight n, for n ∈ ℕ • G_n denotes the number of such strings • G_1 = 2: A, T • G_2 = 6: AA, AT, TA, TT, C, G • G_n = 2·G_{n-2} + 2·G_{n-1} for n >= 3
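The recurrence holds because a string of weight n ends either in a weak character (2 choices, leaving weight n-1) or in a strong one (2 choices, leaving weight n-2). A small sketch, with illustrative names `G` and `count_by_enumeration`, compares the recurrence with direct enumeration for small n.

```python
from itertools import product

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def G(n):
    """Number of strings over {A, C, G, T} of weight exactly n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    a, b = 2, 6                     # G_1 and G_2 from the definition
    for _ in range(n - 1):
        a, b = b, 2 * a + 2 * b     # G_n = 2*G_{n-2} + 2*G_{n-1}
    return a

def count_by_enumeration(n):
    """Count weight-n strings by brute force (only practical for small n)."""
    return sum(
        1
        for length in range(1, n + 1)
        for s in product("ACGT", repeat=length)
        if sum(WEIGHT[ch] for ch in s) == n
    )

print([G(n) for n in range(1, 6)])  # [2, 6, 16, 44, 120]
```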
Token classes • A character of weight 1 (A or T) is weak (W) • A character of weight 2 (C or G) is strong (S) • We partition tokens into four classes according to two properties: • The token is terminated by either a strong or a weak character • The token has a weight of either c or c+1
Number of tokens and tail weight in a valid c-h code
Token class | Max. occurrences in valid code | Max. tail weight
<c-2>S | 2·G_{c-2} | 4·G_{c-2}
S<c-3>S | 4·G_{c-3} | 8·G_{c-3}
<c-1>W | 2·G_{c-1} | 2·G_{c-1}
S<c-2>W | 2·G_{c-2} | 2·G_{c-2}
Lemma 2: Maximum Tail Weight • The total tail weight of all tags contained in a valid c-h code is at most 2·G_{c-1} + 6·G_{c-2} + 8·G_{c-3}
Theorem 1: Upper Bound • Combining Lemma 1 and Lemma 2 yields the following upper bound • Lemma 1: Any tag in a valid c-h code has a tail weight of at least h-c+1 • Lemma 2: The total tail weight of all tags contained in a valid c-h code is at most 2·G_{c-1} + 6·G_{c-2} + 8·G_{c-3} • Theorem 1: Any valid c-h code contains at most (2·G_{c-1} + 6·G_{c-2} + 8·G_{c-3}) / (h-c+1) tags
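The bound is easy to evaluate numerically. The sketch below assumes c >= 4 so that only G_1 and later values are needed, and it rounds down because the number of tags is an integer; `G` and `upper_bound` are names chosen for the example.

```python
def G(n):
    """Number of strings of weight n: G_1 = 2, G_2 = 6, G_n = 2*G_{n-2} + 2*G_{n-1}."""
    if n < 1:
        raise ValueError("n must be at least 1")
    a, b = 2, 6
    for _ in range(n - 1):
        a, b = b, 2 * a + 2 * b
    return a

def upper_bound(c, h):
    """Largest number of tags allowed by Lemma 1 and Lemma 2 (c >= 4, h >= c)."""
    if c < 4 or h < c:
        raise ValueError("this sketch assumes 4 <= c <= h")
    max_total_tail = 2 * G(c - 1) + 6 * G(c - 2) + 8 * G(c - 3)   # Lemma 2
    min_tail_per_tag = h - c + 1                                   # Lemma 1
    return max_total_tail // min_tail_per_tag

print(upper_bound(4, 10))   # 12
print(upper_bound(12, 30))  # 13840
```

For c = 4 and h = 10 the bound is 84 / 7 = 12, so the 12-tag example code is as large as a valid 4-10 code can be.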
Our Construction Using Circular Strings by Han-yu Chuang
Construction overview • A method of constructing a nearly optimal c-h code for arbitrary values of c and h • The construction's lower bound: [formula not preserved] • The upper bound stated in Theorem 1: (2·G_{c-1} + 6·G_{c-2} + 8·G_{c-3}) / (h-c+1)
Construction overview (cont'd) • Comparing the two bounds, this method achieves at least a factor of approximately 0.89(h-c+1)/(h-c+3) relative to the upper bound • For example, when c=12 and h=30, this construction yields 12119 tags, which corresponds to 87.6% of the upper bound of 13840 given by Theorem 1.
Construction overview (cont'd) • Two stages: • Stage 1: Construct a set of circular strings in which each token occurs at most once. • Stage 2: Extract tags as substrings from the circular strings.
Construction Stage 2 • In stage 2, the algorithm must ensure that • each extracted substring has a weight of h or more, to satisfy Condition 1 (each tag has a weight of h or more) • the overlap between two tags has a weight of at most c-1, to satisfy Condition 2' (any token occurs at most once)
Construction Stage 2 (cont'd) • In stage 2, a straightforward greedy algorithm starts at some position of the circular string and iterates the following two operations: • Collect characters until their cumulative weight reaches or exceeds h, forming one tag. • Track back over as many characters as possible without collecting a weight of c or more; the next tag starts there.
Construction Stage 2 (cont'd) • Stop criterion: as soon as an extracted tag overlaps the first extracted tag with weight >= c, that last tag is discarded and the algorithm stops. • Given the best start position, this algorithm produces the largest number of tags that are substrings of a given circular string and can be included in a valid code.
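The greedy extraction of stage 2 might be sketched as follows. This is an illustration under assumptions, not the presenters' implementation: the circular string is passed as an ordinary Python string read cyclically, the start position is a parameter rather than being optimized, and the name `extract_tags` and the example cycle are made up.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def extract_tags(circ, c, h, start=0):
    """Greedily cut tags of weight >= h out of the circular string circ."""
    n = len(circ)
    if not 0 < c <= h or sum(WEIGHT[ch] for ch in circ) < h:
        raise ValueError("need 0 < c <= h and a cycle of weight >= h")

    def char(i):
        return circ[i % n]              # positions are read cyclically

    tags, first_end, pos = [], None, start
    while True:
        # Operation 1: collect characters until the weight reaches h.
        end, wt = pos, 0
        while wt < h:
            wt += WEIGHT[char(end)]
            end += 1
        if first_end is None:
            first_end = end
        else:
            # Weight shared with the first tag, seen one lap later.
            lo, hi = max(pos, start + n), min(end, first_end + n)
            if sum(WEIGHT[char(i)] for i in range(lo, hi)) >= c:
                break                   # stop criterion: discard this tag
        tags.append("".join(char(i) for i in range(pos, end)))
        # Operation 2: track back while the overlap weight stays below c.
        back = 0
        while back + WEIGHT[char(end - 1)] < c:
            back += WEIGHT[char(end - 1)]
            end -= 1
        pos = end
    return tags

example_cycle = "GACCAATCAGCTATGTCGATACTGGTTA"   # arbitrary illustrative input
print(extract_tags(example_cycle, c=4, h=10))
```

Because a character weighs at most 2, every extracted tag has weight h or h+1. The function does not check that the input cycle contains each token at most once; that is the job of stage 1.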
Construction overview (cont'd) • Illustration: [Figure labels: tag weight h or h+1; overlap c-1 or c-2; (h+1)-(c-2); c or c+1]
Construction overview (cont'd) • Each circular string leads to at least [formula not preserved] tags. • Definition 3 (Circular String Problem): Given the parameters c > 0 and h > 0, construct a set C of circular strings that contain any substring of weight >= c at most once, and maximize [objective not preserved]
Construction Stage 1 • Meta-String μ and Bit-String β • Each character a ∈ Σ is identified with a pair (μ, β), where μ ∈ {W, S} and β ∈ {0, 1}.
Meta-String μ and Bit-String β • Each string s is identified by its pair of meta-string μ(s) and bit-string β(s). For example: [example not preserved] • We call s an instance of the meta-string μ(s).
Meta-String μ and Bit-String β(con’d) 3. Each circular string in our construction will be an instance of a long circular meta-string that arises from repeating a shorter meta-string.
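The decomposition into meta-string and bit-string can be illustrated with a few lines of code. The slides do not state which bit goes with which nucleotide, so the assignment in `PAIR_OF` below (A and C get bit 0, T and G get bit 1) is an assumption, as are the names `split` and `combine`.

```python
# Assumed assignment; only the W/S classes are fixed by the weights.
PAIR_OF = {"A": ("W", 0), "T": ("W", 1), "C": ("S", 0), "G": ("S", 1)}
CHAR_OF = {pair: ch for ch, pair in PAIR_OF.items()}

def split(s):
    """Return (meta-string, bit-string) of the DNA string s."""
    mu = "".join(PAIR_OF[ch][0] for ch in s)
    beta = [PAIR_OF[ch][1] for ch in s]
    return mu, beta

def combine(mu, beta):
    """Rebuild the instance of meta-string mu selected by bit-string beta."""
    if len(mu) != len(beta):
        raise ValueError("meta-string and bit-string must have equal length")
    return "".join(CHAR_OF[(m, b)] for m, b in zip(mu, beta))

mu, beta = split("GACCAAT")
print(mu, beta)            # SWSSWWW [1, 0, 0, 0, 0, 0, 1]
print(combine(mu, beta))   # GACCAAT
```

The weight of a string depends only on its meta-string (W counts 1, S counts 2), which is why the construction can choose meta-strings first and bit-strings afterwards.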
De Bruijn sequence • To avoid generating several identical tokens from repetitions of a meta-string μ, the construction ensures that each repetition of μ is paired with a different pattern in the bit-string. • For k ∈ ℕ, a binary De Bruijn sequence of order k is a cyclic binary sequence of length 2^k in which each possible binary string of length k occurs exactly once as a substring. We denote it by D_k. • Reading D_k once, starting from a specific offset i relative to a fixed origin position, we obtain a linearization. We denote it by D_k^i.
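A binary De Bruijn sequence can be generated with the standard recursive Lyndon-word construction. The slides do not say which De Bruijn sequence or which origin is used, so the particular sequence produced here and the helper name `linearization` are assumptions for illustration.

```python
def de_bruijn(k):
    """Binary De Bruijn sequence of order k as a list of 2**k bits."""
    a = [0] * (k + 1)
    sequence = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                sequence.extend(a[1:p + 1])   # append a Lyndon word
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, 2):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence

def linearization(k, i):
    """D_k read once, starting at offset i from the origin."""
    d = de_bruijn(k)
    i %= len(d)
    return d[i:] + d[:i]

d3 = de_bruijn(3)
windows = {tuple((d3 + d3)[i:i + 3]) for i in range(len(d3))}
print(len(d3), len(windows))  # 8 8
```

The final two lines check the defining property for k = 3: the 8 cyclic windows of length 3 are all different.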
Circular String Construction • Notation: if s is a string, s^k denotes k repetitions of s. • Each cycle is based on a meta-string μ of weight c.
General case • Let α be the shortest period of μW, i.e. μW = α^p. • Set k = k(μ) = gcd(|α|, 2^{|μ|})
General case (cont'd) • Meta-cycle MC(μ) := α^{2^{|μ|}/k} • Bit-cycle BC_i(μ) := (D_{|μ|}^i)^{|α|/k}, i = 0, …, k-1 • MC(μ) and BC_i(μ) have the same length, namely lcm(|α|, 2^{|μ|}).
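The general case can be turned into a short sketch. It rests on assumptions: the flattened "2| μ |" on the slide is read as 2^{|μ|} (the length of a De Bruijn sequence of order |μ|), the De Bruijn sequence comes from the standard Lyndon-word construction, and the function names `shortest_period` and `cycles` are illustrative.

```python
from math import gcd

def de_bruijn(k):
    """Binary De Bruijn sequence of order k (standard Lyndon-word method)."""
    a, sequence = [0] * (k + 1), []

    def db(t, p):
        if t > k:
            if k % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, 2):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence

def shortest_period(x):
    """Shortest alpha such that x is a whole number of repetitions of alpha."""
    for d in range(1, len(x) + 1):
        if len(x) % d == 0 and x[:d] * (len(x) // d) == x:
            return x[:d]

def cycles(mu):
    """Return the k pairs (meta-cycle, bit-cycle) built from meta-string mu."""
    alpha = shortest_period(mu + "W")
    period = 2 ** len(mu)                 # length of D_{|mu|}
    d = de_bruijn(len(mu))
    k = gcd(len(alpha), period)
    meta_cycle = alpha * (period // k)    # alpha repeated 2^|mu| / k times
    result = []
    for i in range(k):
        lin = d[i:] + d[:i]               # linearization with offset i
        result.append((meta_cycle, lin * (len(alpha) // k)))
    return result

for meta, bits in cycles("SWS"):
    print(meta, bits)
```

For μ = SWS, μW = SWSW has shortest period α = SW, so k = gcd(2, 8) = 2 and both cycles have length lcm(2, 8) = 8. Each pair can then be read position by position as a circular DNA string once a bit assignment for the nucleotides is fixed.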
Circular String Construction (cont'd) • For every meta-string μ with w(μ) = c, our code contains the k cycles C_i(μ) = (MC(μ), BC_i(μ)), i = 0, …, k-1 • The set of cycles we construct is
Special case (example) • μW cannot be written as a concatenation of two or more identical substrings (no shorter period α exists) • gcd(|μ|+1, 2^{|μ|}) = 1 • For meta-strings μ that satisfy the conditions above,