The Information Processing Mechanism of DNA and Efficient DNA Storage
Olgica Milenkovic, University of Colorado, Boulder. Joint work with B. Vasic.
Outline
PART I: HOW DOES DNA ENSURE ITS DATA INTEGRITY?
Information Theory of Genetics: an emerging discipline
What type of codes “operate” at the level of biochemical processes of the Central Dogma?
Helps in understanding the
EVOLUTION of DNA
FUNCTIONALITY of DNA
DISEASE DEVELOPMENT
IT community still not involved in this area!
Signal Processing Community is just getting involved:
Special Issue of Signal Processing Journal devoted to Genetics, 2003.
[Figure: structure of DNA. A sugar-phosphate backbone of alternating phosphate (PO4) and deoxyribose sugar units (carbons numbered 1’ to 5’) carries the bases; two such strands form the double helix.]
Purine Bases: Adenine (A); Guanine (G)
Pyrimidine Bases: Thymine (T); Cytosine (C)
A and T paired through TWO hydrogen bonds
G and C paired through THREE hydrogen bonds
RNA uses uracil instead of DNA's thymine, i.e. U replaces T. It is the RNA sequence of codes which biologists usually refer to as the genetic code (see Table 4 below).
In summary: all life as we know it contains DNA and its close relative RNA. These polymers
Consist of exons (coding) and introns (noncoding) regions
Length anywhere from several tens up to several million base pairs
EXAMPLE: Among the most complex identified genes is
DYSTROPHIN
(2 million bps, more than 60 exons, codes for 4000 amino acids)
Escherichia coli: around 4000 genes; Humans: 35000-40000 genes
Shown (last year) to be “somewhat” responsible for RNA coding
(Far from being “junk”, but function still not well understood…)
DNA → mRNA → Proteins
Replication (DNA → DNA), Transcription (DNA → mRNA), Translation (mRNA → Proteins)
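The transcription and translation steps of this chain can be sketched in a few lines. This is a minimal illustration, not a model of the real machinery: it assumes the input is the coding strand (so transcription reduces to replacing T with U), and `CODON_TABLE` is a deliberately tiny, hypothetical subset of the standard genetic code chosen only for the demo.

```python
# Tiny, deliberately incomplete codon table (entries from the standard genetic code).
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_strand):
    """DNA coding strand -> mRNA: same sequence with U in place of T (simplification)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first AUG until a stop codon is met."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = codon missing from the toy table
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


mrna = transcribe("ATGTTTGGCTAA")
print(mrna, translate(mrna))
```

From a channel point of view, every base substitution in the DNA input propagates through both maps, which is why the error question on the next slides matters.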
A Communication Theory Perspective:
Genetic Channel
DNA sequence
mRNA
Proteins
DNA sequence
What kind of errors are introduced by the Genetic Channel?
Processing in the Genetic Channel: DNA REPLICATION
Untying the knots: Topoisomerases
Unwinding the helix: Helicases
Getting it all started: Primers
Doing the hard work: Polymerases
Sealing the segments: Ligases
Helping to keep two sides apart: SSB
Timing for replication:
E. Coli: 40 min
Humans (parallel): < 2 hours
Can be prolonged for proofreading purposes
Rules: Replication always proceeds in 5’ to 3’ direction;
Replication is semiconservative;
Replication is a parallel process for eukaryotes;
Facts: Polymerases can stitch together any combination of bases (“polymerases are a little bit sloppy”)
Combination of substitution, deletion, insertion (replication fork), shift, reversal, etc errors
(Complete exon or intron deleted, or simple base pair deletions)
1. Tautomeric shifts (transition/transversion): *T-G, *G-T, *C-A, *A-C
2. Recombination between nonidentical molecules (“HETERODUPLEX mismatches”)
3. Spontaneous DEAMINATION (C to U, C to T, CG to TA), METHYLATION (CpG), rare
4. APURINIC/APYRIMIDINIC SITES (due to HYDROLYSIS)
5. CROSSLINKS
6. STRAND BREAKAGE, OXIDATIVE DAMAGE ERRORS
7. LOSS OF 5000-10000 PURINE and 200-500 PYRIMIDINE bases (20 hours) due to radiation
Replication Errors: Polymerase misinsertion probability between 10^-3 and 10^-5
Miscoding
AGATG
CTGCTAC
Slippage
AATG
CGTT AC
T
Slippage-Dislocation
GAATG
CG TTTAC
Miscoding-Realignment
AGATG
CT CTAC
G
Proofreading (Maroni, Molecular and Genetic Analysis of Human Traits):
Replication polymerases error rate ; human DNA with bps, total of 106 errors
Example:
C to U conversion causes presence of deoxyuridine, detected by uracil-DNA glycosylase
Glycosylase process acts like erasure channel
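The erasure-channel reading of this example can be sketched as follows: because uracil does not belong in DNA, a deaminated C is an error whose position is known, and the missing symbol can be restored from the partner strand. The helper names (`deaminate`, `find_erasures`, `repair`) are hypothetical, and the partner strand is stored position-aligned (not reversed) purely to keep the indexing simple.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def deaminate(strand, positions):
    """Spontaneous deamination: turn the C at each given position into U."""
    bases = list(strand)
    for p in positions:
        if bases[p] != "C":
            raise ValueError(f"position {p} holds {bases[p]}, not C")
        bases[p] = "U"
    return "".join(bases)


def find_erasures(strand):
    """Every U in a DNA strand marks a position known to be damaged (an erasure)."""
    return [i for i, base in enumerate(strand) if base == "U"]


def repair(damaged, partner):
    """Fill each erased position with the complement of the partner strand's base."""
    bases = list(damaged)
    for i in find_erasures(damaged):
        bases[i] = COMPLEMENT[partner[i]]
    return "".join(bases)


strand = "ATGCCGTA"
partner = "".join(COMPLEMENT[b] for b in strand)  # aligned position by position
damaged = deaminate(strand, [3, 4])
print(damaged, find_erasures(damaged), repair(damaged, partner))
```

Unlike a substitution, an erasure needs no search for the error location, which is what makes this repair path cheap.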
1. Proofreading based on semiconservative nature of replication
2. Excision Repair Mechanisms: Arrays of Exonucleases
Show large degree of pre-correction binding activity; correction performed by EXCISION
“Jumping’’ occurs between different genes !!! (Lin, Lloyd, Roberts, Nucleases)
Reduce error levels by an additional several orders of magnitude
Mismatch-specific post-replication enzymes
Total number of errors per human DNA replication: on average JUST ONE
Replication and Repair have been optimized for balancing spontaneous mutational load:
Permitting evolution without threatening fitness or survival
Error correction performed on different levels
Error correction performed in very short time
Extremely large number of very diverse errors corrected
Error correction tied to global structure of DNA
(not to consecutive base pairs)
Error correction also depends on DNA topology
Theories:
Acceptor/Donor: hydrogen atom/lone electrons
1 represents donor, 0 acceptor
Additionally, append 0 for a purine or 1 for a pyrimidine
Code: A 1010
G 0110
T 0101
C 1001
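A quick check of the four codewords listed above shows why this assignment is usually read as an error-detecting code. The sketch assumes the first three bits are the donor/acceptor pattern and the fourth bit is the purine/pyrimidine flag; the helper names are hypothetical.

```python
CODE = {"A": "1010", "G": "0110", "T": "0101", "C": "1001"}


def parity(word):
    """Parity of the number of 1s in the codeword (0 = even)."""
    return word.count("1") % 2


def hamming(u, v):
    """Number of positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))


def complement_bits(bits):
    """Swap donors (1) and acceptors (0)."""
    return "".join("1" if b == "0" else "0" for b in bits)


all_even = all(parity(w) == 0 for w in CODE.values())
min_distance = min(hamming(CODE[a], CODE[b]) for a in CODE for b in CODE if a < b)
# Watson-Crick partners carry complementary donor/acceptor patterns (first three bits).
pairs_match = all(
    CODE[a][:3] == complement_bits(CODE[b][:3]) for a, b in (("A", "T"), ("G", "C"))
)
print(all_even, min_distance, pairs_match)
```

All four words have even parity and any two differ in at least two positions, so a single flipped bit never turns one valid base into another.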
BEST ERROR CORRECTING MECHANISM: Deinococcus radiodurans
‘’Conan The Bacterium’’
(to conquer the Red Planet !)
Spin Glasses, the Ising Model, Hopfield Networks or “Boltzmann Machines”:
State x of a spin glass with N spins that may take values in {-1,+1}
Energy of the state x: E, external field h
The Hamiltonian
Hamiltonian for Ising model
Example:
Water exists as a gas, liquid or solid, but
all microscopic elements are H2O molecules
This is due to intermolecular interactions depending on temperature, pressure etc.
[Figure: interacting spins illustrating “frustration”]
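Frustration is easy to reproduce on three spins. The sketch below assumes the common textbook form of the Ising energy, E(x) = -sum over pairs of J_ij x_i x_j - h sum over i of x_i (the sign convention is an assumption of this sketch), and enumerates all states of a triangle by brute force.

```python
from itertools import product


def ising_energy(spins, couplings, h=0.0):
    """E(x) = -sum_{(i,j)} J_ij x_i x_j - h sum_i x_i (assumed sign convention)."""
    pair_term = sum(J * spins[i] * spins[j] for (i, j), J in couplings.items())
    field_term = h * sum(spins)
    return -pair_term - field_term


def ground_states(n, couplings, h=0.0):
    """Return the minimum energy and every state that attains it."""
    states = list(product((-1, +1), repeat=n))
    energies = [ising_energy(s, couplings, h) for s in states]
    e_min = min(energies)
    return e_min, [s for s, e in zip(states, energies) if e == e_min]


# Triangle with antiferromagnetic bonds: each pair "wants" opposite spins.
frustrated = {(0, 1): -1, (1, 2): -1, (0, 2): -1}
# Same triangle with ferromagnetic bonds: each pair "wants" equal spins.
ferromagnet = {(0, 1): +1, (1, 2): +1, (0, 2): +1}

print(ground_states(3, frustrated))
print(ground_states(3, ferromagnet))
```

With ferromagnetic bonds all three constraints can be met at once, so only the two aligned states are ground states. With antiferromagnetic bonds one bond is always violated, and the ground state becomes highly degenerate.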
Codes on graphs: the most powerful class of error correcting codes in information theory, including Turbo, Low-Density Parity-Check (LDPC), Repeat-Accumulate (RA) Codes
Most important consequence of graphical
description: efficient iterative decoding
Variable nodes communicate to check nodes their reliability
Check nodes decide which variables are unreliable and “suppress” their inputs
Number of edges in graph = density of H
Sparse = small complexity
Variables Checks
Detrimental for convergence of decoder: presence of short cycle in code graph
Applications of LDPC codes: cryptography, compression, distributed source coding for sensor networks, error control coding in optical and wireless communications and in magnetic and optical storage…
Works for the Binary Symmetric Channel (BSC):
Each variable sends its channel reliability unless all incoming messages from checks say “change”
Each check sends estimate of the bit based on modulo two sum of other bits participating in the check
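The two message rules above can be sketched on a toy code. This is a minimal illustration under stated assumptions: the parity checks are the seven lines of the Fano plane (chosen here only because the graph has no 4-cycles, it is not from the slides), and the final decision rule (flip a bit only when all of its checks disagree with the channel value) is one common choice rather than the only one.

```python
# Each check lists the variable nodes it involves (lines of the Fano plane).
FANO_CHECKS = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]


def gallager_a(checks, received, iterations=10):
    """Hard-decision message passing for the BSC (Gallager A style sketch)."""
    n = len(received)
    var_checks = [[c for c, members in enumerate(checks) if v in members] for v in range(n)]
    # Variable-to-check messages start as the channel values.
    v2c = {(v, c): received[v] for v in range(n) for c in var_checks[v]}
    decoded = list(received)
    for _ in range(iterations):
        # Check -> variable: modulo-two sum of the other bits in the check.
        c2v = {}
        for c, members in enumerate(checks):
            total = 0
            for v in members:
                total ^= v2c[(v, c)]
            for v in members:
                c2v[(c, v)] = total ^ v2c[(v, c)]
        # Variable -> check: channel value unless all other checks say "change".
        for v in range(n):
            for c in var_checks[v]:
                others = [c2v[(d, v)] for d in var_checks[v] if d != c]
                change = bool(others) and all(m != received[v] for m in others)
                v2c[(v, c)] = 1 - received[v] if change else received[v]
        # Tentative decision (assumed rule): flip only if every check disagrees.
        decoded = []
        for v in range(n):
            incoming = [c2v[(c, v)] for c in var_checks[v]]
            flip = bool(incoming) and all(m != received[v] for m in incoming)
            decoded.append(1 - received[v] if flip else received[v])
        # Stop as soon as all parity checks are satisfied.
        if all(sum(decoded[v] for v in members) % 2 == 0 for members in checks):
            break
    return decoded


codeword = [0, 0, 0, 1, 1, 1, 1]
noisy = list(codeword)
noisy[4] ^= 1  # one bit flipped by the channel
print(gallager_a(FANO_CHECKS, noisy))
```

All messages are single bits, which is why the rule can be rewritten as a Boolean function of neighbouring variables later in the talk.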
Alternative view: Variables=Atoms; Binary Values=Spins;
Variables “align” or “misalign” according to interaction patterns
LDPC equivalent to diluted spin glasses
Ground state search for above Hamiltonian = maximum a posteriori decoding of codeword
Average magnetization at a site = MAP decision for individual variable
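The correspondence between ground states and decoding can be checked by brute force on a small example. The sketch assumes the usual mapping bit b to spin x = 1 - 2b, one multi-spin coupling per parity check with a large strength `J`, and a unit external field aligned with each received bit (equal priors, BSC crossover below 1/2). The Fano-plane checks and all names are illustrative assumptions.

```python
from itertools import product

CHECKS = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]


def energy(spins, fields, checks, J=10.0):
    """E(x) = -J * sum_checks prod_{i in check} x_i - sum_i h_i x_i."""
    check_term = 0
    for members in checks:
        prod = 1
        for i in members:
            prod *= spins[i]
        check_term += prod  # +1 when the parity check is satisfied
    field_term = sum(h * x for h, x in zip(fields, spins))
    return -J * check_term - field_term


def ground_state_decode(received, checks, J=10.0):
    """Exhaustive ground-state search; returns the decoded bits."""
    fields = [1 - 2 * r for r in received]  # field points towards the received bit
    best = min(
        product((-1, 1), repeat=len(received)),
        key=lambda s: energy(s, fields, checks, J),
    )
    return [(1 - x) // 2 for x in best]


received = [1, 0, 0, 1, 1, 1, 1]  # codeword 0001111 with the first bit flipped
print(ground_state_decode(received, CHECKS))
```

With `J` large enough that violating a check always costs more than the field term can gain, the ground state is the codeword closest to the received word. Exhaustive search is only feasible for tiny examples, which is where iterative decoding comes in.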
The regulatory Network of
Gene Interactions (RNGI)
Kauffman (1960s): “NK” Evolution
through Changing Interactions
between Genes
Life exists at the Edge of Chaos!
BASED ON SPIN GLASSES!
RANDOM BOOLEAN FUNCTION MODEL:
Evolution carried by genes, not base pairs, and the way genes interact!
G1
G3
G2
Boolean networks: dynamical systems, characterized by network topology + choices of Boolean node functions
Attractors: point and periodic; number and period lengths
MOST IMPORTANT topological factor:
CONNECTIVITY
KEY: Sparse connectivity allows enough variability for evolutionary processes, produces self-organizing structures, but doesn’t allow the system to “get trapped in” chaotic behavior
MOST IMPORTANT Boolean function factors:
BIAS (number of 1 outputs)
CANALIZATION (depends on number of inputs determining output)
N = number of genes; K = number of genes co-interacting with one given gene
K=2 critical value (mainly frozen states with islands of changing interaction)
Interaction between genes in regulatory network: very limited in scope
K ranges anywhere from 2-3 up to 10-15; if we check carefully, it is logarithmic in N, i.e. the number of genes
Between 2 and 3 for Escherichia Coli (around a thousand genes)
4 and 8 for higher metazoa (several thousand genes)
Can explain the process of cell differentiation: genetic material of each cell
the same, yet cells functionally and morphologically very different
Each cell type CORRESPONDS TO ONE GIVEN ATTRACTOR of the RNGI
Counting attractors for networks with N=40000 genes, K=2 gives
Cell types (correct number 258).
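For very small N the attractors of a random NK network can be enumerated exhaustively. The sketch below is a toy illustration only: the wiring and truth tables are drawn at random from a fixed seed, updates are synchronous, and every function name is hypothetical. It does not attempt to reproduce the N=40000 estimate.

```python
import random
from itertools import product


def random_nk_network(n, k, seed=0):
    """Each of the n genes reads k randomly chosen genes through a random truth table."""
    rng = random.Random(seed)
    inputs = [rng.sample(range(n), k) for _ in range(n)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k)] for _ in range(n)]
    return inputs, tables


def step(state, inputs, tables):
    """Synchronous update: every gene looks up its next value from its inputs."""
    new = []
    for ins, table in zip(inputs, tables):
        idx = 0
        for i in ins:
            idx = (idx << 1) | state[i]
        new.append(table[idx])
    return tuple(new)


def find_attractors(n, inputs, tables):
    """Follow every initial state until it repeats; collect the resulting cycles."""
    attractors = set()
    for start in product((0, 1), repeat=n):
        seen = {}
        state, t = start, 0
        while state not in seen:
            seen[state] = t
            state = step(state, inputs, tables)
            t += 1
        # States visited from the first repeated state onwards form the cycle.
        cycle = [s for s, i in seen.items() if i >= seen[state]]
        attractors.add(frozenset(cycle))
    return attractors


inputs, tables = random_nk_network(8, 2, seed=1)
attractors = find_attractors(8, inputs, tables)
print(len(attractors), sorted(len(a) for a in attractors))
```

A cycle of length one is a point attractor and longer cycles are periodic attractors. In the cell-differentiation reading, each such cycle plays the role of one cell type.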
Example: LDPC Code under Gallager’s A Algorithm
G1
G2
G3
G4
In the Control Graph, edge (i,j) exists if the ith bit controls the jth bit (i.e. if i and j are at distance exactly two)
Boolean function determined by decoding algorithm: For Gallager’s A algorithm, takes form of truncated/periodically repeated MORSE-THUE sequence
LDPC Code:
Variables and Checks
LDPC Code:
The Control Graph
Morse-Thue:
n:      0 1 2  3  4   5   6   7 …
binary: 0 1 10 11 100 101 110 111 …
t(n):   0 1 1  0  1   0   0   1 … (parity of the number of 1s in the binary form of n)
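The three rows above are related by a one-line rule: the n-th term is the parity of the number of 1s in the binary expansion of n. A short sketch (the function name is hypothetical):

```python
def thue_morse(n):
    """n-th term: parity of the number of 1s in the binary expansion of n."""
    return bin(n).count("1") % 2


sequence = [thue_morse(n) for n in range(16)]
print(sequence)

# Self-similarity: even-indexed terms reproduce the sequence,
# odd-indexed terms reproduce its bitwise complement.
self_similar = all(
    thue_morse(2 * n) == thue_morse(n) and thue_morse(2 * n + 1) == 1 - thue_morse(n)
    for n in range(64)
)
# Balance: any prefix of length 2^k (k >= 1) holds as many 1s as 0s.
balanced = sum(thue_morse(n) for n in range(16)) == 8
print(self_similar, balanced)
```

The balance check is the sense in which a truth table filled with this sequence is unbiased.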
Properties:
SelfSimilar (fractal)
Results in unbiased Boolean functions
modulo two sums of variable nodes connected to controls
Can use mean-field theorems to see when initial perturbations in the codewords disappear in the limit: use the Boolean derivative, sensitivity analysis, iterative Jacobian and Lyapunov exponent (as in Shmulevich et al.):
The Jacobian $F'$ of the network map $F = (f_1, \dots, f_n)$ is an $n \times n$ matrix with $\partial f_i / \partial x_j$ in entry $(i,j)$.
Iterated Jacobian:
Lyapunov exponent:
The influence of variable $x_j$ on the Boolean function $f$ is defined as the expectation of the partial derivative with respect to the distribution of the variables: $I_j(f) = E\left[\partial f(x) / \partial x_j\right] = \Pr\{ f(x) \neq f(x \oplus e_j) \}$.
Influence carries important information about frozen states, error susceptibility etc.
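For a small number of inputs the influence can be computed by direct enumeration. The sketch assumes uniformly distributed, independent inputs (the slides leave the distribution general) and treats the partial derivative with respect to input j as the indicator that flipping input j changes the output.

```python
from itertools import product


def influence(f, n, j):
    """Fraction of inputs x for which flipping bit j changes f(x) (uniform inputs)."""
    changed = 0
    for x in product((0, 1), repeat=n):
        flipped = list(x)
        flipped[j] ^= 1
        changed += f(x) != f(tuple(flipped))
    return changed / 2 ** n


def parity3(x):
    """Modulo-two sum of three inputs."""
    return x[0] ^ x[1] ^ x[2]


def canalizing(x):
    """x0 = 1 forces the output to 1 regardless of the other inputs."""
    return x[0] | (x[1] & x[2])


print([influence(parity3, 3, j) for j in range(3)])
print([influence(canalizing, 3, j) for j in range(3)])
```

Every input of the parity function has influence 1, so any single perturbation always propagates. In the canalizing function the perturbation is often absorbed, which is the behaviour associated with frozen, stable cores.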
iterative change of size of “stable core”
Control of the chaotic phase in a Boolean network by means of periodic pulses (with period T) that “freeze” a fraction of nodes
Bold Conjecture: The ECC of DNA Replication operates on multiple levels
Carrier of information is gene, not base pair
The Global level involves Genes; Local levels may involve exons or base pairs in general
The Global Code is an LDPC Code!
Wigner observed that the same mathematical concepts turn up in entirely unexpected connections in the whole of science… (no explanation as of yet)
LDPC related to statistical physics (spin glasses) to neural networks to self-organizing systems to …
R. Sole and B. Goodwin, Signs of Life: How Complexity Pervades Biology
15-gene interaction example by Hashimoto (Shmulevich, Anderson Cancer Center)
Need q-ary LDPC code corresponding to different levels of interaction
Cancer: genetic disorder of somatic cells
To summarize: Various forms of cancer tightly linked to malfunctioning of proofreading (ECC) mechanism
Cancer cells: correspond to a special type of attractor of the RNGI
(A cancer cell is “just another configuration” of RNGI)
(Shmulevich et al., Anderson Cancer Research Center)
This attractor has genes interacting in a way that results in uncontrolled cell division
Key observation: Change in RNGI results in further weakening of the proofreading system, and vice versa
Aging: during each cell division, telomeres get shorter and shorter…
When they become too short, errors in replication happen, leading to cancer
(a time bomb in our body)
Cancer cells “cheat” proofreading mechanism and allow telomeres to maintain constant length
Finding the error-control mechanism: classifying diseases accurately, curing diseases (including cancer) by gene therapy, making telomere lengths constant over long time…
Example 2: Breast cancer susceptibility gene BRCA1 tightly linked to error control of DNA and cell division regulation
How to obtain results practically? DNA Microarrays!
Figure taken from Shmulevich et al.
Major paradigm shift from base-pair
distance to chromosomal distance
Bases within the human mitochondrion (length approximately 17000) appear with the following frequencies:
while within different regions of human fetal globin gene:
Parts of genetic sequences can be modeled by Markov chains of given order and transition probabilities; order 2-7
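Fitting such a model amounts to counting which base follows each context of `order` bases. A minimal sketch using maximum-likelihood estimates with no smoothing (the function name and the toy sequences are illustrative):

```python
from collections import Counter, defaultdict


def markov_transitions(seq, order=1):
    """Estimate P(next base | previous `order` bases) from relative frequencies."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - order):
        context = seq[i:i + order]
        counts[context][seq[i + order]] += 1
    return {
        context: {base: c / sum(nxt.values()) for base, c in nxt.items()}
        for context, nxt in counts.items()
    }


print(markov_transitions("ACACACAC", order=1))
print(markov_transitions("AACA", order=1))
print(markov_transitions("ACGACGACG", order=2))
```

The number of contexts grows as 4 to the power of the order, so higher orders need correspondingly longer sequences before the estimates become reliable.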
Regions of uniform distribution: isochores; can stretch in length up to hundreds of kbp
Repetitive patterns: tandem repeats (TR), random repeats (RR), short interspersed
repeat sequences (SINE’s, 9% of DNA), long interspersed repeat sequences (LINE’s).
Base pairs like CG have very small probability: the most notorious triplet repeats, related to Huntington’s disease and Fragile X mental retardation, consist of these very unlikely “CG” pairs: (CGG)m, (CCG)m, m = number of repetitions;
Junk DNA seems to have long-range (fractal) characteristics.
A fractal pattern arises from the so-called DNA walk: a graphical representation of the DNA sequence in which one moves up for C or T and down for A or G.
Can have two- or three-dimensional random walks: further differentiation of A, G, C, T
C A T G
Fractal dimension of the DNA molecule:
0.85 for higher species, 1 for lower
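The one-dimensional DNA walk is a running sum with +1 for a pyrimidine (C or T) and -1 for a purine (A or G). A minimal sketch (function name hypothetical):

```python
STEP = {"C": +1, "T": +1, "A": -1, "G": -1}


def dna_walk(seq):
    """Return the height of the walk after each base of the sequence."""
    position, path = 0, []
    for base in seq.upper():
        position += STEP[base]
        path.append(position)
    return path


print(dna_walk("CATG"))
print(dna_walk("CCTTAG"))
```

Plotting the returned heights against position gives the landscape whose roughness is summarized by the fractal dimension quoted above.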
Use lingual analysis of human languages for exploring DNA "language" (Zipf method)
http://library.thinkquest.org/26242/full/ap/ap13.html
DNAWalker http://athena.bioc.uvic.ca/pbr/walk/
Provata and Almirantis, 2003: Fractal Cantor pattern in DNA
Exons: filled regions
Introns: empty regions
Random, fractal, Cantor-like set
Implication: atom (carrier of information) exon/intron pairs
History-based random walk and DNA description in terms of urn models
Only introns in higher species have higher complexity than in lower species
Both coding and noncoding regions exhibit long range correlation, with spectral density of introns
GenCompress (Chen, ’97)
Biocompress (Grumbach/Tahi, ’94)
Fact (Rivals, ’00)
GenomeSequenceCompress (Sato et al., ’00)
Use characteristics of DNA like repeats, reverse complements…
Compression rate is about 1.74 bits per base (78% in compression ratio)
Two classes: statistical and grammar based compression algorithms
Huffman, Lempel-Ziv, Arithmetic Coding, Burrows-Wheeler,
Kieffer’s Grammar Based Schemes
(with DNA specific modifications)
No known algorithm specially suited for fractal nature of DNA, although 90% fractal!
Inference of context-free grammars from fractal data sets
Syntactic generation of fractals
Theory of formal languages can be used to state the problem of "syntactic fractal pattern recognition"
Explore Connections with Wavelets
(ideas by Jacques Blanc-Talon)
Barthel, Brandau, Hermesmeier, Heising: Fractal Prediction, 1997
Zerotree wavelet coding using fractal prediction
Example: Heighway dragon and Koch curve
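Both example curves can be generated syntactically by a rewriting grammar (an L-system). The rules below are the commonly quoted ones and are an assumption of this sketch rather than something taken from the slides. F means "draw forward", + and - are turns (60 degrees for the Koch curve, 90 degrees for the dragon), and X and Y are non-drawing symbols.

```python
def lsystem(axiom, rules, iterations):
    """Apply the rewriting rules to every symbol in parallel, `iterations` times."""
    word = axiom
    for _ in range(iterations):
        word = "".join(rules.get(symbol, symbol) for symbol in word)
    return word


KOCH_RULES = {"F": "F+F--F+F"}                # axiom "F"
DRAGON_RULES = {"X": "X+YF+", "Y": "-FX-Y"}   # axiom "FX"

print(lsystem("F", KOCH_RULES, 1))
print(lsystem("FX", DRAGON_RULES, 2))
print(lsystem("F", KOCH_RULES, 3).count("F"), lsystem("FX", DRAGON_RULES, 4).count("F"))
```

The number of drawn segments grows as 4^n for the Koch curve and 2^n for the dragon. A few grammar rules therefore describe an arbitrarily long string, which is the appeal of grammar-based descriptions for compression.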
Distributed Source Coding Problem: Peculiar Correlation Patterns
Could explore Wavelet Based Compression
Distributed Source Coding with LDPC Codes…
Major paradigm shift in genetic distance measure:
From base-pair distance (involving deletion, insertion and substitution; Sankoff, Kruskal, Time Warps, String Edits, and Macromolecules) to Chromosomal Distance based on global arrangements of genes
Inversions are primary mechanism of genome rearrangement!
REVERSAL DISTANCE
The smallest number of inversions necessary to transform one genome into another
Finding the minimum number of reversals needed to “sort” a permutation
Permutations are signed, indicating direction of transcription
Example: (+1 +3 +2) → (+1 -2 -3) → (+1 +2 -3) → (+1 +2 +3)
How does one perform one-way communication (SENDING INFORMATION TO A RECEIVER WHO POSSESSES CORRELATED INFORMATION) under the reversal distance measure?
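For short signed permutations the reversal distance can be found by breadth-first search over all reversals, where a reversal flips the order of a segment and the sign of every gene in it. This brute-force sketch is only meant to make the definition concrete and is not an efficient algorithm; the function names are hypothetical.

```python
from collections import deque


def reversals(perm):
    """Yield every signed permutation reachable from `perm` by one reversal."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            segment = tuple(-x for x in reversed(perm[i:j + 1]))
            yield perm[:i] + segment + perm[j + 1:]


def reversal_distance(start):
    """Minimum number of reversals that sorts `start` into (+1, +2, ..., +n)."""
    start = tuple(start)
    target = tuple(range(1, len(start) + 1))
    distance = {start: 0}
    queue = deque([start])
    while queue:
        perm = queue.popleft()
        if perm == target:
            return distance[perm]
        for nxt in reversals(perm):
            if nxt not in distance:
                distance[nxt] = distance[perm] + 1
                queue.append(nxt)
    return None


print(reversal_distance((1, 3, 2)))
```

Starting from (+1 +3 +2), the search confirms that three reversals are needed, so the chain in the example is optimal.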
DNA compression methods increase network efficiency by up to 10 times
Peribit's SR50 compressor
DNA Computing
Codes with Constant GC Content and invariant under Watson-Crick Inversion
Microarray Error Control Coding
Using design theory to reduce error rate of DNA array data
Use novel clustering algorithms for DNA Array Data
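The two constraints named for DNA computing codes are simple to test on a candidate word set. The sketch reads "invariant under Watson-Crick inversion" as "closed under reverse complementation", which is an interpretation rather than a definition given in the slides. The word sets and function names are purely illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Watson-Crick partner of a word, read in the same 5' to 3' direction."""
    return word.translate(COMPLEMENT)[::-1]


def gc_content(word):
    """Number of G and C bases in the word."""
    return sum(base in "GC" for base in word)


def has_constant_gc(words):
    """True if every codeword contains the same number of G/C bases."""
    return len({gc_content(w) for w in words}) == 1


def is_rc_invariant(words):
    """True if the reverse complement of every codeword is again a codeword."""
    return {reverse_complement(w) for w in words} == set(words)


good = {"AACG", "CGTT", "GATC"}
bad = {"AACG", "GGGG"}
print(has_constant_gc(good), is_rc_invariant(good))
print(has_constant_gc(bad), is_rc_invariant(bad))
```

Constant GC content keeps the melting temperatures of the words similar, since G-C pairs carry three hydrogen bonds and A-T pairs two.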
Genetics is the most exciting source of new ideas for coding theory
The atom of information is a gene, not a base pair or a triple of base pairs
The error control code of the genome is to be found operating on the level of genes
Compression, phylogenetic tree construction: comparison of species has to be performed on the level of genes first
Once the genes are compared, can move to local base pair comparisons