
EE381V: Genomic Signal Processing



EE381V: Genomic Signal Processing. Basic Information. Instructor: Haris Vikalo E-mail: [email protected] Phone: (512) 232-7922 Office: ACES 3.110 Hours: Tue, Thu, 11:00am-12:00pm Electronic course site: Blackboard courses.utexas.edu


EE381V: Genomic Signal Processing



Basic Information

  • Instructor: Haris Vikalo

    • E-mail: [email protected]

    • Phone: (512) 232-7922

    • Office: ACES 3.110

    • Hours: Tue, Thu, 11:00am-12:00pm

  • Electronic course site: Blackboard

    • courses.utexas.edu

    • distribution of homework assignments, solutions, and class notes

    • should be able to access it if you have UT EID and are registered

  • Course website: http://users.ece.utexas.edu/~hvikalo/ee381v.html

    • class notes (mirrored from Blackboard) and suggested reading

    • final project information

  • Lecture location & time: ENS 306, Tue, Thu 9:30am-11:00am

    • may have some guest lectures, not necessarily in the same room



Basic Information

  • Textbook: none

    • class notes, reading assignments will be distributed via course website, Blackboard

  • Optional reading (on reserve desk in Life Sciences Library):

    • R. Durbin et al., “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids,” Cambridge University Press, 1998.

    • M. Schena, “Microarray Analysis,” Wiley 2003.

  • Homeworks & Exams:

    • bi-weekly homeworks (rigorous algorithmic thinking, programming assignments)

    • midterm (probably take-home)

    • final project (tackle a research problem, write up a report)

  • Grading (tentative):

    • homeworks (30%), midterm (30%), final project (40%)

  • Prerequisites: EE381J Probability and Stochastic Processes

    • also, exposure to differential equations

    • no biology background is required

    • familiarity with Matlab (to carry out programming assignments)



Goals for the Term

  • Introduction to genomic signal processing

    • fundamental problems in genomic signal and information processing

    • research directions for active participation in the field

  • Duality: computation and biology

    • give a biology/technology background to motivate a computational task

    • provide background on the relevant signal processing / computational techniques

    • describe a solution

  • Foundations and frontiers

    • well defined conventional problems and general methodologies

    • contemporary challenges, future research directions, etc.

  • Scope of the topics

    • core biotechnologies: modeling and signal processing algorithms

    • cellular systems: algorithmic/computational tools for inferring their structure and understanding how they function



Signal Processing for Core Biotechnologies

Systems for sequencing and detection:

  • DNA sequencing: ABI Prism ® 310 Genetic Analyzer

  • Gene expression profiling: Affymetrix GeneChip ®

  • DNA amplification: Roche LightCycler ®



Signal Processing for Cellular Systems

  • Information flow in a cell (traditional view: Central Dogma):

[Figure: DNA → (transcription) → RNA → (translation) → Protein; the DNA double helix is about 2 nm wide, and 10 bases span 3.4 nm along its length]

  • Information (signal) is carried by molecules.



Signal Processing for Cellular Systems

  • Follow the information flow:

Sequences

  • Moreover, study the temporal changes in the information flow

    • gives insight into regulation mechanisms, biological network structure, etc.

  • Previously mentioned biotechnologies interrupt the information flow and so provide insight into the cellular structure and functions

Mechanisms



Computational and Signal Processing Challenges

  • Sequencing and genome assembly

  • Gene finding

  • Regulatory motif discovery

  • Sequence alignment

  • Comparative genomics

  • Database lookup

  • Evolutionary theory

  • Cluster discovery

  • Gene expression analysis

  • Regulatory networks inference

  • Emerging network properties

  • Protein network analysis

[Figure: DNA with example sequences ACATGCTAT, ACGTGATAA, AGAGGATAT, ATATCATAT, ATATGATTT]



Computational and Signal Processing Challenges

  • SEQUENCES: genome assembly, gene finding, regulatory motif discovery, sequence alignment, comparative genomics, database lookup, evolutionary theory

  • INTERACTIONS: cluster discovery, gene expression analysis, regulatory networks inference, emerging network properties, protein network analysis

[Figure: DNA with example sequences ACATGCTAT, ACGTGATAA, AGAGGATAT, ATATCATAT, ATATGATTT]



Computational and Signal Processing Challenges

  • Sample topics and computational / signal processing tools:

    • Sequencing and sequence analysis

      • modeling with hidden Markov models (HMM)

      • many problems require dynamic programming solutions

    • Technologies (systems) for bio-molecular detection

      • modeling with continuous-time Markov processes (discrete, stochastic), often use approximations (continuous-valued, deterministic)

      • estimation techniques for data recovery

    • Gene expression analysis / Cluster discovery

      • various data mining techniques

    • Network modeling and analysis

      • modeling with (multiple) continuous-time Markov processes, graph models (Boolean, Bayesian networks)

      • Monte Carlo simulation techniques, network inference



Computational and Signal Processing Challenges

  • Recent IEEE special issues (can be accessed via IEEE Xplore):

    • IEEE Transactions on Information Theory, Special Issue on Molecular Biology and Neuroscience, vol. 56, no. 2, February 2010.

    • IEEE Journal of Selected Topics in Signal Processing, Special Issue on Genomic and Proteomic Signal Processing, vol. 2, no. 3, June 2008.

    • IEEE Signal Processing Magazine, Special Issue on Signal Processing in Genomics, vol. 24, no. 1, January 2007.

    • IEEE Trans. on Signal Processing, Special Issue on Genomic Signal Processing, vol. 54, no. 6, June 2006.

  • Today’s goal: Molecular Biology Primer

    • will be complemented by a few papers posted to Blackboard



Biological Systems

Organisms are remarkably uniform at the molecular level.

Molecules → Macromolecules → Tissue/Cell → Organ → Organism



Biological Systems

Biological sciences study different morphology levels of biological systems

Levels: Molecules → Macromolecules → Tissue/Cell → Organ → Organism

Disciplines shown alongside these levels: Biophysics, Genomics, Molecular Biology, Biochemistry, Cell Biology



Biological Systems: DNA Molecules

  • In eukaryotes, DNA is tightly packaged into structures called chromosomes, inside the nucleus of a cell

  • Potato has 48 of them, goldfish 94

  • Prokaryotes (e.g., bacteria): only a single loop of stable chromosomal DNA



Structure of DNA

  • Four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T)

  • Forms a double helix – each strand is linked via sugar-phosphate bonds, strands are linked via hydrogen bonds

  • sugar-phosphate bonds are strong, hydrogen weak



Structure of DNA: Nucleotides

  • Structure of adenine (one of the nucleotides):

  • Symbolically:

  • Base pairing (A with T, C with G):



Structure of DNA: Backbone

  • What about the backbone?

[Figure: nucleotides (e.g., cytosine) joined one after another into a chain A, C, G, T, ... along the sugar-phosphate backbone]

  • Sugar-phosphate backbone is directional (5’-3’ or 3’-5’)

  • Why it matters: enzymes typically “care” about the direction
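Because each strand has a direction, the partner of a sequence written 5’-3’ is its reverse complement, not just its complement. A minimal sketch of that bookkeeping follows; the function name is illustrative and the input is assumed to contain only A, C, G, T.

```python
# Watson-Crick pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand_5_to_3: str) -> str:
    """Return the partner strand, also written 5'->3'."""
    # Complement every base, then reverse so the result reads 5'->3' again.
    return strand_5_to_3.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ACCGCT"))  # AGCGGT
```

Read 3’-5’, the same partner strand is TGGCGA, which is the plain complement of ACCGCT.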



Structure of DNA

  • Human Genome has 3.2 billion DNA base pairs

  • 3.2 billion (3.2 × 10^9) symbols:

    • 200 (1000 pages each) NYC phone books

    • 800 MB at 2 bits per base (roughly, a data CD)

    • a person typing 60 words/minute for 8 hours/day would take more than 50 years to type the entire human genome sequence

    • placed end-to-end, the DNA in one human cell extends almost 6 feet

    • if all the DNA in a body were connected this way, it would stretch approx. 67 billion miles!

      • about 150k round trips to the moon, or roughly 360 to the sun

  • DNA stores hereditary information

    • copied during the cell reproduction process
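The storage and typing comparisons above can be checked with a few lines of arithmetic. This is a rough sketch: the 2 bits per base follows from having four symbols, while the 5 keystrokes per typed word and typing every day of the year are assumptions that the slide does not state.

```python
# Rough check of the genome-size comparisons. Only the 3.2e9 base pairs,
# 60 words/minute and 8 hours/day come from the slide; the rest is assumed.
GENOME_BASES = 3.2e9      # base pairs in the human genome
BITS_PER_BASE = 2         # four symbols (A, C, G, T) need log2(4) = 2 bits
CHARS_PER_WORD = 5        # assumed average keystrokes per typed "word"

def genome_megabytes(bases: float = GENOME_BASES) -> float:
    """Storage in megabytes (10**6 bytes) at 2 bits per base."""
    return bases * BITS_PER_BASE / 8 / 1e6

def typing_years(bases: float = GENOME_BASES,
                 words_per_min: float = 60,
                 hours_per_day: float = 8) -> float:
    """Years needed to type the sequence at one keystroke per base."""
    chars_per_day = words_per_min * CHARS_PER_WORD * 60 * hours_per_day
    return bases / chars_per_day / 365

print(f"{genome_megabytes():.0f} MB")
print(f"{typing_years():.1f} years")
```

Under these assumptions the storage comes to 800 MB and the typing time to about 61 years, consistent with "more than 50 years".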



DNA Replication

  • During the cell reproduction process, DNA replicates

    • The twisted, compacted double helix of DNA has to unwind and separate its two strands

    • Each strand becomes a pattern, or template, for making a new strand, so the two new DNA molecules have one new strand and one old strand

    • The copy is done by a cellular protein machine called DNA polymerase, which reads the template DNA strand and stitches together the complementary new strand

    • The process of replication is astonishingly fast and accurate, although occasional mistakes, such as deletions or duplications, occur.



Agent of DNA Replication: DNA Polymerase

  • DNA polymerase adds free nucleotides to the 3’ end of the new strand

    • so, the new strand grows in a 5'-3' direction

    • requires a “primer” (short sequence) to initiate extension

  • New strands grow in 5’-3’ direction



Mistakes in DNA Replication

  • A built-in proofreading system (mismatch repair system) catches and corrects nearly all of these errors

    • DNA replication: 1 error per 1 billion bases

  • Mistakes that are not corrected can lead to diseases such as cancer and certain genetic disorders

    • Fanconi anemia, early aging diseases, etc.

  • A sidenote: many drugs used to treat cancer work by attacking DNA

    • chemotherapy drugs disrupt the DNA copying process, which goes on much faster in rapidly dividing cancer cells than in other cells

    • side-effect: most of these drugs do affect normal cells that grow and divide frequently, such as cells of the immune system and hair cells



Central Dogma

Stated by Francis Crick in 1958, re-stated in a Nature paper published in 1970:

“The central dogma of molecular biology is based on the principle that the flow of genetic information travels from DNA to RNA and finally to the translation of proteins.”  

Flow of Information

  • Genes carry hereditary information but do not do any of the actual work in cells

    • they serve as instruction books for making functional molecules such as ribonucleic acid (RNA) and proteins

    • two steps: transcription and translation



Transcription

  • In transcription, the information coded in DNA is copied into RNA

    • the RNA nucleotides are complementary to those on the DNA

    • note: RNA pairs a uracil (U), instead of a T, with an A on the DNA

[Figure: RNA polymerase (enzyme) and RNA building blocks (4 different nucleotides) act on DNA laid out as promoter sequence, gene sequence, termination sequence]

  1. Initiation: selective binding of the enzyme to the promoter region

  2. Polymerization: synthesizing RNA based on the DNA template

  3. Termination: enzyme detachment at the termination region

  • Reading and copying the DNA is facilitated by the RNA polymerase
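The pairing rule used in transcription (complementary bases, with U in RNA wherever the template has an A) can be sketched in a few lines. The function name is illustrative, and the template is assumed to be given 3’-5’ so that the output reads 5’-3’.

```python
# DNA template base -> RNA base that pairs with it (U replaces T in RNA).
DNA_TO_RNA = str.maketrans("ACGT", "UGCA")

def transcribe(template_3_to_5: str) -> str:
    """mRNA (5'->3') synthesized from a DNA template strand given 3'->5'."""
    return template_3_to_5.upper().translate(DNA_TO_RNA)

print(transcribe("TACGGT"))  # AUGCCA
```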



RNA Polymerase

  • Like DNA polymerase, RNA polymerase adds free nucleotides to the 3’ end of the new strand

    • only one DNA strand is copied

    • the new strand grows in a 5'-3' direction

  • The resulting mRNA sequence:

  • This direction is preferred for energy reasons…



Transcription

  • All cells contain the same DNA

    • so, what makes a nerve cell different from a red blood cell?

  • Each cell "turns on," or expresses, only a subset of genes

  • Activity of RNA polymerase is affected by a number of proteins

    • these proteins vary in different cell types throughout the body

  • Note: gene expression levels are affected by diseases



Translation

  • In translation, messenger RNA (mRNA) is mapped to a specific protein (string of amino acids) according to the rules specified by the genetic code

  • A four-letter alphabet is mapped to a 20-letter alphabet

    • there is an embedded redundancy



Translation

  • During translation, a four-letter alphabet is mapped to a 20-letter alphabet:

ACGCTACGTCAGTCGCATCGACTCGCCGCATCAGACGCGCCGATTTCACAAAAAAAAACGCGCTATACTATACGCAGGCATCGACGCCCCCTTACTTCGAGACACACACACACACACACACACACTTCTCCTCGCGGGGGGGTTTTTTACAGGCATCAGCATCTACGACGACTCATATTTTTTTTTCAGCCAAAAAAAAAAAACTCTCGGGGGGGGGATACTCAGCATCATACTTTTTTACGTGTATTTTATATTATATATGGCGCTTTACTTCTCACTCCTTTCCCATCCAGCGGGCTACGACTACTCAGCATTCACTTACCA

ACGCUACGUCAGUCGCAUCGACUCGCCGCAUCAGACGCGCCGAUUUCACAAAAAAAAACGCGCUAUACUAUACGCAGGCAUCGACGCCCCCUUACUUCGAGACACACACACACACACACACACACUUCUCCUCGCGGGGGGGUUUUUUACAGGCAUCAGCAUCUACGACGACUCAUAUUUUUUUUUCAGCCAAAAAAAAAAAACUCUCGGGGGGGGGAUACUCAGCAUCAUACUUUUUUACGUGUAUUUUAUAUUAUAUAUGGCGCUUUACUUCUCACUCCUUUCCCAUCCAGCGGGCUACGACUACUCAGCAUUCACUUACCA

TLRQSHRLAASDAPISQKKTRYTIRRHRRPLTSRHTHTHTHTSPRGGVFYRHQHLRRLIFFFQPKKKNSRGGDTQHHTFLRVFYIIYGALLLTPFPSSGLRLLSIHLP

RNA Sequence Length = 324 bases

Protein length = 108 amino acids

DNA Sequence Length = 324 bases
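The DNA → RNA → protein mapping in this example can be reproduced with the standard genetic code. The sketch below assumes, as the example does, that the DNA shown is the coding strand (so the RNA is the same sequence with U for T), that the reading frame starts at the first base, and that start and stop codons are not given special treatment. All names are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe_coding(dna: str) -> str:
    """Coding-strand DNA -> mRNA: same sequence with U in place of T."""
    return dna.upper().replace("T", "U")

def translate(rna: str) -> str:
    """Map consecutive codons to one-letter amino acids (frame 0, stops kept as '*')."""
    usable = len(rna) - len(rna) % 3      # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3))

dna = "ACGCTACGTCAGTCGCATCGACTCGCC"       # first 27 bases of the example above
print(translate(transcribe_coding(dna)))  # TLRQSHRLA
```

Three bases map to one amino acid, which is why 324 bases give 108 amino acids, and 64 codons for 20 amino acids is the redundancy mentioned earlier.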

[Figure: DNA → (transcription) → RNA → (translation) → Protein]



Translation

  • Simplified version of the translation mechanism:

[Figure: the ribosome and protein building blocks (20 different amino acids) act on RNA laid out as start code, RNA code, stop code]

  1. Initiation: the ribosome binds to a specific sequence of RNA

  2. Elongation: synthesizing proteins based on the RNA template

  3. Termination: ribosome detachment at the stop code region



Translation

  • This is a fairly good description of the translation process in prokaryotes (give or take a few details omitted for simplicity)

  • However, the process is much more complicated in eukaryotes



Translation in Eukaryotes

  • In eukaryotes, the region of DNA coding for a protein is usually not continuous. This region is composed of alternating stretches of exons and introns:

    • exons: pieces of coding sequence

    • introns: regions between exons

  • To get an mRNA molecule which is mapped to a working protein, the intron sections are trimmed and exon pieces stitched together: RNA splicing

  • If inaccurate, splicing may lead to an abnormal protein or no protein at all

    • a form of Alzheimer’s disease is caused by this



Translation in Eukaryotes

  • Alternative splicing: arranging exons in different patterns

    • enables cells to make different proteins from a single gene.

  • Leads to possibility of creating many proteins from a single gene

    • e.g., 25k human genes can make hundreds of thousands of different proteins



Translation in Prokaryotes

Detailed description of the translation process in prokaryotes is complicated:



Translation in Eukaryotes

Detailed description of the translation process in eukaryotes is even more complicated:



Central Dogma Revisited

  • The traditional view:

    • DNA: very stable molecule; very large length; only 4 building blocks

    • RNA: unstable molecules; length is a fragment of the DNA length; only 4 building blocks

    • Protein: complex molecules with a huge variety of shapes and forms; length proportional to RNA size; 20 building blocks

  • However, there is feedback, creating a control system:

External Stimuli



Central Dogma Revisited

  • Feedback, creating a control system:

[Figure: block diagram in which DNA, RNA, and Protein are linked by Transcription (RNA polymerase) and Translation (ribosome), with chemical building blocks as inputs, biochemical/biological functionality as the output, and external stimuli acting on the system]

  • We are interested in information/signals in these complex, nonlinear, and probabilistically described biomolecular systems with feedback



Central Dogma Revisited

  • The signal/information in these systems is carried by bio-molecules

    • so, we may be interested in their structure, amounts, interaction…

[Figure: network in which the environment and Proteins (1), (2), ..., (M) act on Genes (1), (2), (3), ... on the DNA; the genes give rise to RNA (1), RNA (2), ..., RNA (M), which in turn give the proteins]



Cell as a Control System

  • Information/signals are carried by molecules:

[Figure: block diagram in which DNA, RNA, and Protein are linked by Transcription (RNA polymerase) and Translation (ribosome), with chemical building blocks as inputs, biochemical/biological functionality as the output, and external stimuli acting on the system]

  • Signals are controlled via feedback

    • control allows adaptability to varying conditions

    • again, molecules facilitate the control



Controlling Transcription Mechanism

  • May be rather complex, involves transcription factors (proteins) which bind to promoter regions (upstream of a gene):

[Figure: RNA polymerase (enzyme) and RNA building blocks (4 different nucleotides) act on DNA laid out as promoter sequence, gene sequence, termination sequence]

  1. Initiation: selective binding of the enzyme to the promoter region

  2. Polymerization: synthesizing RNA based on the DNA template

  3. Termination: enzyme detachment at the termination region

  • Transcription can be upregulated/downregulated



Controlling Transcription Mechanism

Transcription factors bind to promoters and recruit RNA polymerase:

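[Figure: RNA polymerase (enzyme), RNA building blocks (nucleotides), and DNA laid out as promoter sequence, gene sequence, termination sequence; the end of the gene is drawn as …XXXXXXXXGCCGCCG]

Promoter layout (position 0 marks the start of the gene):

vvvvTTGACAvvv…vvvTATAATvvvXXXXXXXXXXX…

  • TTGACA sits at position -35 and TATAAT at position -10, separated by 17bp

  • v: nucleotides which may vary

  • X: transcribed sequence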

  • Since they are located close to the beginning of a gene, promoter regions are indicative of gene locations on DNA
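A promoter scan based on the -35 and -10 boxes can be sketched with a regular expression. This is only an illustration: it demands exact TTGACA and TATAAT boxes, whereas real promoters deviate from the consensus, and the 16-18 base spacer range (around the 17bp on the slide) is an assumption.

```python
import re

# Exact consensus boxes separated by 16-18 arbitrary bases (assumed range).
PROMOTER = re.compile(r"TTGACA[ACGT]{16,18}TATAAT")

def find_promoters(dna: str):
    """Return start indices of candidate promoter matches (non-overlapping)."""
    return [m.start() for m in PROMOTER.finditer(dna.upper())]

# Toy sequence: 4 leading bases, -35 box, 17-base spacer, -10 box, downstream bases.
seq = "GGCC" + "TTGACA" + "A" * 17 + "TATAAT" + "CCGGATG"
print(find_promoters(seq))  # [4]
```

A hit only suggests that a gene may start a few bases downstream; practical motif finders score approximate matches instead of requiring exact ones.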



A Few Words About Viruses

[Figure: Ebola virus and a bacteriophage; each consists of DNA or RNA enclosed in a capsid (protein coat)]

A virus is not a living organism. It is a biological system which merely contains genetic material.



Viruses Cont’d

Hepatitis B virus :

DNA Sequence

Start: 1 ctccacaacc ttccaccaaa ctctgcaaga tcccagggtg agaggcctgt atttccctgc 61 tggtggctcc agttcaggaa cagtaaaccc tgttccgact actgcctctc ccatatcgtc 121 aatcttctcg aggattgggg accctgcgct gaacatggag aacatcacat caggattcct 181 aggacccctg ctcgtgttac aggcggggtt tttcttgttg acaagaatcc tcacaatacc 241 gcagagtcta gactcgtggt ggacttctct caattttcta ggggggacca ccgtgtgtct 301 tggccaaaat tcgcagtccc caacctccaa tcactcacca acctcctgtc ctccaacttg 361 tcctggttat cgctggatgt gtctgcggcg ttttatcatc ttcctcttca tcctgctgct 421 atgcctcatc ttcttgttgg ttcttctgga ctatcaaggt atgttgcccg tttgtcctct 481 aattccagga tcttcaacca ccagcgtggg accatgcaga acctgcacga ctactgttca 541 aggaacctct atgtatccct cctgttgctg taccaaacct tcggacggaa attgcacctg 601 tattcccatc ccatcatcct gggctttcgg aaaattccta tgggagtggg cctcagcccg 661 tttctcctgg ctcagtttac tagtgccatt tgttcagtgg ttcgtagggc tttcccccac 721 tgtttggctt tcagttatat ggatgatgtg gtattggggg ccaagtctgt acagcatctt 781 gagtcccttt ttaccgctgt taccaatttt cttttgtctt tgggtataca tttaaaccct 841 aacaaaacta aaagatgggg ttactcttta aatttcatgg gctatgtcat tggatgttat 901 gggtcattgc cacaagatca catcatacaa aaaatcaaag aatgttttag aaaacttcct 961 gttaacaggc ctattgattg gaaagtctgt caacgtattg tgggtctttt gggttttgct 1021 gctcctttta cacaatgtgg ttatcctgct ttaatgccct tgtatgcctg tattcaatct 1081 aagcaggctt tcactttctc gccaacttac aaggcctttc tgtgtaaaca atacctgaac 1141 ctttaccccg ttgcccggca acggcccggt ctgtgccaag tgtttgctga cgcaaccccc 1201 actggctggg gcttggtcat gggccatcag cgcatgcgtg gaacctttct ggctcctttg 1261 ccgatccata ctgcggaact cctagccgct tgttttgctc gcagcaggtc tggagcaaac 1321 attctcggga cggataactc tgttgttctc tcccgcaaat atacatcatt tccatggctg 1381 ctaggctgtg ctgccaactg gatcctgcgc gggacgtcct ttgtttacgt cccgtcggcg 1441 ctgaatcccg cggacgaccc ttctcggggc cgcttgggac tctatcgtcc ccttctccgt 1501 ctgccgttcc gtccgaccac ggggcgcacc tctctttacg cggactcccc gtctgtgcct 1561 tctcatctgc cggaccgtgt gcacttcgct tcacctctgc acgtcgcatg gagaccaccg 1621 tgaacgccca ccacttcttg cccaaggtct tacataagag gactcttgga ctctctgtaa 1681 tgtcaacgac 
cgaccttgag gcatacttca aagactgttt gtttaaagac tgggaggagt 1741 tgggggagga gattagatta aaggtctttg tactaggagg ctgtaggcat aaattggtct 1801 gcgcaccagc accatgcaac tttttcacct ctgcctaatc atctcttgtt catgtcctac 1861 tgttcaagcc tccaagctgt gccttgggtg gctttggggc atggacattg acccttataa 1921 agaatttgga gctactgtgg agttactctc gtttttgcct tctgacttct ttccttcgct 1981 acgagatctt cttgataccg cctcagctct gtatcgggaa gccttagagt ctcctgagca 2041 ttgttcacct catcatactg cactcaggca agctatcctt tgctgggggg agctaatgac 2101 tctagctacc tgggtgggtg ttaatttgga agatccagca tctagggacc tagtagtcag 2161 ttatgtcaac actaatatgg gcctaaagtt caggcaacta ttgtggtttc acatttcttg 2221 tctcactttt ggaagagaaa cggtcataga gtatttggtg tctttcggag tgtggattcg 2281 cactccacca gcttatagac cacctaatgc ccctatctta tcaacacttc cggagactac 2341 tgttgttaga ggacgaggca ggtcctctag aagaagaact ccctcgcctc gcagacgaag 2401 gtctcaatcg ccgcgtcgca gaagatctca atctcgggaa tctcaatgtt agtattcctt 2461 ggactcataa ggtgggaaac tttacggggc tttattcctc tactgtacct gtctttaacc 2521 ctcattggaa aacaccttct tttcctaata tacatttaca ccaagacatc atcaaaaaat 2581 gtgaacaatt tgtaggtcca ctcacagtca atgagaaacg aagactgcaa ttaattatgc 2641 ctgctaggtt ttatccaaat gttaccaaat atttgccatt agataagggt attaaacctt 2701 attatccaga acatctagtt aatcattact tccaaaccag acattattta cacactcttt 2761 ggaaggcggg tatattatat aagagagaaa caacacgtag cgcctcattt tgtgggtcac 2821 catattcttg ggaacaaaag ctacagcatg gggcagaatc tttccaccag caaccctctg 2881 ggattctttc ccgaccacca gttggatcca gccttcagag caaactccgc aaatccagat 2941 tgggacttca atcccaacaa ggacacctgg ccagccgcca acaaggtagg agctggagca 3001 ttcgggctgg gattcacccc accgcacgga ggccttttgg ggtggagccc tcaggctcag 3061 ggcataatac aaaccttgcc agcaaatccg cctcctgcat ctaccaatcg ccagtcagga 3121 aggcagccta ccccgctgtc tccacctttg agaaacactc atcctcaggc catgcagtgg 3181 aa //

It is just one long (3182 bp) DNA!
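The listing above interleaves position counters with blocks of ten bases, in the style of a GenBank sequence record. A small sketch for recovering the bare sequence from such text follows; the function name is illustrative, and the rule "keep only whitespace-separated tokens made purely of a, c, g, t, n" is an assumption about the layout, not a full GenBank parser.

```python
import re

def clean_sequence(flat_text: str) -> str:
    """Drop position counters, labels and the trailing '//' from a sequence listing."""
    tokens = flat_text.lower().split()
    # Keep only tokens that consist entirely of nucleotide letters.
    return "".join(t for t in tokens if re.fullmatch(r"[acgtn]+", t))

# First line of the listing above plus the start of the second line.
sample = ("Start: 1 ctccacaacc ttccaccaaa ctctgcaaga tcccagggtg agaggcctgt "
          "atttccctgc 61 tggtggctcc //")
seq = clean_sequence(sample)
print(len(seq), seq[:10])
```

Applied to the complete listing, the length of the result can be compared against the stated 3182 bp as a sanity check.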



How Virus Reproduces

Viruses impose themselves into the genetic information flow of organisms:

[Figure: viral DNA or RNA enters from the external environment into the cell's internal information flow]

They exploit existing mechanisms!



How Virus Reproduces

HIV Virus:

Drugs: reverse transcriptase inhibitors



Biotechnological Methods

Use biological machinery to help us achieve desired goals

  • For instance, to examine a DNA/RNA strand, we first break it into pieces

  • might use mechanical force to fragment it randomly

  • alternative: restriction enzymes

    • ordinarily, they break foreign DNA into fragments for protection

    • make them break DNA strands we want to examine (as opposed to breaking them with mechanical force)

    • the cleavage sites are characterized by certain sequences (usually palindromes)

  • Another example: cloning to obtain sufficient quantities of a desired material

  • insert a DNA fragment of interest into an autonomously replicating DNA molecule, called a cloning vector (often plasmids - circular DNA - in bacteria)

  • the fragment of interest is obtained using a restriction enzyme

    • also, use restriction enzyme to create an insertion in the plasmid
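In the restriction-enzyme context, "palindrome" means that the site reads the same 5’-3’ on both strands, i.e., it equals its own reverse complement. A short sketch of that test and of locating such sites follows; the function names are illustrative, and GAATTC (the EcoRI recognition sequence) is used only as a familiar example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_palindromic_site(site: str) -> bool:
    """True if the site equals its own reverse complement."""
    site = site.upper()
    return site == site.translate(COMPLEMENT)[::-1]

def site_positions(dna: str, site: str):
    """Start indices of every (possibly overlapping) occurrence of the site."""
    dna, site = dna.upper(), site.upper()
    return [i for i in range(len(dna) - len(site) + 1) if dna.startswith(site, i)]

print(is_palindromic_site("GAATTC"))                   # True
print(site_positions("AAGAATTCGGGAATTCTT", "GAATTC"))  # [2, 10]
```

The positions returned are where the recognition sequence starts; the exact cut point within the site depends on the enzyme and is not modeled here.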



Biotechnological Methods

  • Recall: sugar-phosphate bonds are strong, hydrogen bonds are weak

    • when heated, hydrogen bonds break and strands separate

  • It goes the other way around, too: when complementary single stranded molecules (ssDNA) get close to each other, electrostatic interactions (via hydrogen bonds) may create dsDNA

[Figure: energy diagram comparing separated strands ACGT and TGCA with the duplex ACGT/TGCA held together by hydrogen bonds; ΔE ≈ 1-50kT at room temperature]

  • Because of thermal energy, the binding is a reversible stochastic process
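One way to see why binding energies of a few kT give reversible binding is a two-state Boltzmann model. This is a deliberately simplified sketch, not the slide's model: it assumes only two states (bound, unbound) with equal multiplicity, ignores concentration and entropy effects, and takes the binding energy in units of kT.

```python
import math

def bound_fraction(delta_e_in_kT: float) -> float:
    """Equilibrium probability of the bound state when it lies delta_e (in kT) below the unbound state."""
    return 1.0 / (1.0 + math.exp(-delta_e_in_kT))

for delta_e in (1, 5, 50):
    print(delta_e, bound_fraction(delta_e))
```

At ΔE = 1kT the two states are comparably populated, so strands bind and unbind frequently; near 50kT the duplex is effectively permanent at room temperature.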



Biotechnological Methods

  • Many biosensors exploit the hybridization property

    • when complementary single stranded molecules (DNA or RNA) get close to each other, they may bind due to electrostatic interactions

[Figure: energy diagram comparing separated strands ACGT and TGCA with the duplex ACGT/TGCA held together by hydrogen bonds; ΔE ≈ 1-50kT at room temperature]

  • The number of dsDNA molecules formed can be detected

    • detect the presence and quantify the amounts of DNA fragments



End of a (brief) molecular biology primer

Next: DNA Sequencing



DNA Sequencing: Human Genome Project

  • The goals of Human Genome Project:

    • Identify all of the approximately 30,000-35,000 genes in human DNA

    • determine the sequences of the 3 billion base pairs that make up human DNA

    • store this information in databases, etc.

  • Meeting the goals of Human Genome Project required further improvement of speed, reliability, and cost

    • in 1998, the total sequencing efforts produced 200 Mb

    • in January 2003, the DOE Joint Genome Institute alone sequenced 1.5 billion bases (1.5 Gb)



DNA Sequencing: Technology

  • Early sequencing methods were laborious, often required extensive use of hazardous chemicals (radioactive labeling), and did not scale

  • It all changed in the ’70s with the work of Sanger and Gilbert

    • shared the 1980 Nobel prize in chemistry

  • Today, sequencing tasks are often performed using the chain termination method (Sanger et al., 1977) or a variation thereof

    • separate ssDNA fragments using gel electrophoresis



DNA Sequencing: Background

  • Phosphate groups in the DNA backbone contain negatively charged oxygen atoms

  • The phosphate-sugar backbone of DNA has an overall negative charge

[Figure: double-stranded DNA with hydrogen-bonded base pairs C..G, A..T, G..C, T..A]

  • DNA fragments of different length have different overall negative charge



DNA Sequencing: Background

  • Distinguish between DNA fragments by subjecting them to an electric field

  • Due to their negative charge, DNA fragments move

    • force applied:

    • developed velocity:

  • A classical setup: movement through a gel

    • on a molecular scale, the gel looks like a matrix

    • longer DNA fragments travel slower (because they get trapped in the matrix)

    • dye reporters
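The expressions that followed "force applied" and "developed velocity" did not survive in this transcript. A standard first-order model, offered here as an assumption and not as the slide's own formulas, treats a fragment of net charge q in a uniform field E, moving against a friction coefficient γ; the symbols q, E, γ, and μ are chosen for this sketch.

```latex
F = qE, \qquad v = \frac{F}{\gamma} = \frac{q}{\gamma}\,E = \mu E
```

Here μ = q/γ is the electrophoretic mobility. In free solution both q and γ grow roughly in proportion to fragment length, so μ depends only weakly on length; it is the hindrance from the gel matrix that makes longer fragments migrate more slowly and allows separation by size.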



Polymerization

  • How do we generate copies of DNA fragments of different length?

[Figure: the ingredients are polymerase, nucleotides (A, C, G, T), an ssDNA template, and a primer]

DNA polymerization:

  1. Hybridization: the primer (5’-ACCGCT-3’) hybridizes to the matching sequence (3’-TGGCGA-5’) on the ssDNA template

  2. Enzyme binding: the polymerase binds only at the 3’ end of the primer hybridized to the template

  3. Polymerization: the polymerase copies the DNA using free nucleotides (A, C, G, T)



Polymerization: Terminators

  • In addition to standard nucleotides (A,C,G, T), we add modified (A’, C’, G’,T’)

    • when incorporated into the strand, they terminate the polymerization

  • Example: Add A’ to the original A, C, G, and T nucleotides

[Figure: reaction mix A’, A, C, G, T; primer 5’-ACCGCT-3’ hybridized to the template]

  • Output of polymerization process: a number of DNA fragments (varying length), each terminated where the template has a ‘T’
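The set of fragment lengths produced with a single terminator can be sketched directly from the pairing rule: extension can stop wherever the modified base is incorporated, i.e., opposite its partner in the template. The function below is illustrative; it assumes the template is given 3’-5’ starting at the first base after the primer, and it counts only the bases added beyond the primer.

```python
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}

def terminated_lengths(template_3_to_5: str, terminator: str = "A"):
    """Extension lengths at which a fragment can end in the modified base.

    Position i of the template (0-based) corresponds to an extension of
    i + 1 bases beyond the primer.
    """
    return [i + 1 for i, base in enumerate(template_3_to_5.upper())
            if PAIR[base] == terminator]

print(terminated_lengths("GTCATTGA"))  # [2, 5, 6]
```

With A’ in the mix the lengths mark the T positions of the template; repeating the reaction with C’, G’, and T’ marks the remaining positions.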

[Figure: fragments of different lengths, each ending with A’ opposite a T in the template]



DNA Sequencing

  • Example: Add A’ to the original mix of A, C, G, and T nucleotides

[Figure: reaction mix A’, A, C, G, T; primer 5’-ACCGCT-3’ hybridized to the template]

  • Subject the product of polymerization (denatured) to the gel electrophoresis:

[Figure: gel electrophoresis with a Ladder lane (marks at 0, 10, 20, 30, 40, 50) and a Sample lane (bands at 4, 13, 20, 40, 52)]



Sanger DNA Sequencing

  • If we use different nucleotides with terminators in different polymerization steps for the same sample, we sequence the entire template DNA

[Figure: four gel lanes, one per terminator (A’, C’, G’, T’); reading the bands in order of fragment length across the lanes spells out the sequence]

  • Improvements of the basic idea: capillary electrophoresis (CE)

    • process in the interior of a small capillary filled with an electrolyte
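Reading a sequence from four terminator reactions amounts to sorting all observed fragment lengths and noting which lane each came from. A minimal sketch, assuming ideal data (every length from 1 to n appears in exactly one lane) and using a hypothetical `lanes` dictionary as input:

```python
def read_sequence(lanes: dict) -> str:
    """Rebuild the synthesized strand from fragment lengths observed per terminator lane."""
    by_length = {length: base for base, lengths in lanes.items() for length in lengths}
    # The shortest fragment gives the first base after the primer, and so on.
    return "".join(by_length[n] for n in sorted(by_length))

lanes = {"A": [2, 5, 6], "C": [1, 7], "G": [3], "T": [4, 8]}
print(read_sequence(lanes))  # CAGTAACT
```

The result is the newly synthesized strand (5’-3’); the template is its reverse complement. Real traces need base calling that copes with noise and limited resolution, which is where the signal processing comes in.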



Sanger DNA Sequencing

  • Lots of data to acquire and process:

[Figure: gel data for Sample 1, Sample 2, Sample 3, and Sample 4]

  • Achievable resolution limits the length of fragments to be sequenced



Sanger DNA Sequencing

  • We can use labeled primers:

[Figure: reaction mix A, C, G, T, T’ with a labeled primer 5’-ACCGCT-3’ hybridized to the template]

  • Alternatively, we can use “color” labeled nucleotides to mix the polymerized samples:

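[Figure: color-labeled terminators A’, C’, G’, T’ added to A, C, G, T in one reaction with primer 5’-ACCGCT-3’ hybridized to the template; the product is analyzed by gel electrophoresis]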



Sanger DNA Sequencing

  • Even more data to acquire and process:



Shotgun Sequencing

  • Chain termination method is suitable only for short strands (300-1000bp)

    • longer sequences must be subdivided into smaller fragments

    • re-assembled to give the overall sequence

  • Two methods are used for the fragmentation:

    • chromosome walking -- going through the entire strand piece by piece

    • shotgun sequencing -- random fragments (faster, more complex)

  • Shotgun sequencing:

    • break the DNA randomly into small segments and sequence them

    • perform several rounds of the above step (multiple reads)

    • use the overlapping ends of different reads to assemble them into a contiguous sequence

      • assembly requires computational tasks such as overlap detection, error correction, etc.
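The overlap-and-merge idea behind shotgun assembly can be sketched with a greedy strategy: repeatedly merge the two reads with the longest suffix-prefix overlap. This is a toy illustration under strong assumptions (error-free reads, all from the same strand, no repeats, no read contained inside another), not a description of any production assembler; the function names and `min_len` parameter are illustrative.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))      # drop exact duplicate reads
    while len(contigs) > 1:
        k, a, b = max(((overlap(a, b, min_len), a, b)
                       for a in contigs for b in contigs if a != b),
                      key=lambda t: t[0])
        if k == 0:
            break                             # no overlaps left: several contigs remain
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[k:])             # glue b onto a, skipping the shared part
    return contigs

reads = ["ACATGCTAT", "GCTATGATT", "TGATTACG"]
print(greedy_assemble(reads))  # ['ACATGCTATGATTACG']
```

Sequencing errors and repeated regions break the exact-match assumption, which is why practical assembly also needs the error correction mentioned above.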

