
CS 5263 Bioinformatics

Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology


Outline

  • Administrivia

  • What is bioinformatics

  • Why bioinformatics

  • Course overview

  • Short introduction to molecular biology


Survey form

  • Your name

  • Email

  • Academic preparation

  • Interests

  • Purpose: to help me better design lectures and assignments


Course Info

  • Instructor: Jianhua Ruan

    Office: S.B. 4.01.48

    Phone: 458-6819

    Email: [email protected]

    Office hours: MW 2-3pm

  • Web: http://www.cs.utsa.edu/~jruan/teaching/cs5263_fall_2008/


Course description

  • A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint.

  • Prerequisite:

    • Programming experience

    • Some knowledge of algorithms and data structures

    • Basic understanding of statistics and probability

    • Appetite to learn some biology


Textbooks

  • An Introduction to Bioinformatics Algorithms

    by Jones and Pevzner

  • Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

    by Durbin, Eddy, Krogh and Mitchison

  • Additional resources

    • Papers

    • Handouts

    • See course website


Grading

  • Attendance: 10%

    • At most 2 classes missed without affecting grade

  • Homework: 50%

    • About 5 assignments

    • Combination of theoretical and programming exercises

    • No exams

    • No late submission accepted

    • Read the collaboration policy!

  • Final project and presentation: 40%


Why bioinformatics

  • Advances in experimental technology have generated a huge amount of data

    • The human genome is “finished”

    • Even if it were, that’s only the beginning…

  • The bottleneck is how to integrate and analyze the data

    • Noisy

    • Diverse



Genome annotations

Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006


What is bioinformatics

  • National Institutes of Health (NIH):

    • Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.


What is bioinformatics

  • National Center for Biotechnology Information (NCBI):

    • The field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.


What is bioinformatics

  • Wikipedia

    • Bioinformatics refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired from the management and analysis of biological data.


Course objectives

  • Learn the basics of sequence analysis and other computational biology algorithms

  • Become familiar with the research topics in bioinformatics

  • Be able to

    • Read / critique bioinformatics research articles

    • Identify subareas that best suit your background

    • Communicate and exchange ideas with (computational) biologists


What you will learn?

  • Basic concepts in molecular biology and genetics

  • Algorithms to address selected problems in bioinformatics

    • Dynamic programming, string algorithms, graph algorithms

    • Statistical learning algorithms: HMM, EM, Gibbs sampling

    • Data mining: clustering / classification

  • Applications to real data


What you will not learn?

  • Designing / performing biological experiments (duh!)

  • Programming (in Perl, etc.)

  • Building bioinformatics software tools (GUI, database, Web, …)

  • Using existing tools / databases (well, not exactly true)


Covered topics

1 week

  • Biology

  • Sequence analysis

    • Sequence alignment

      • Pairwise, multiple, global, local, optimal, heuristic

    • String matching

    • Motif finding

  • Gene prediction

  • RNA structure prediction

  • Phylogenetic tree

  • Functional Genomics

    • Microarray data analysis

    • Biological networks

8 weeks

5 weeks


Computer Scientists vs Biologists (courtesy Serafim Batzoglou, Stanford)


Biologists vs computer scientists

  • (almost) Everything is true or false in computer science

  • (almost) Nothing is ever true or false in Biology


Biologists vs computer scientists

  • Biologists seek to understand the complicated, messy natural world

  • Computer scientists strive to build their own clean and organized virtual world


Biologists vs computer scientists

  • Computer scientists are obsessed with being the first to invent or prove something

  • Biologists are obsessed with being the first to discover something



1. Genome sequencing

  • Whole genome: 3 × 10^9 nucleotides

  • One sequencing read: ~500 nucleotides, e.g.

    AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT

  • A big puzzle: ~60 million pieces

Computational Fragment Assembly

  • Introduced ~1980

  • 1995: assemble up to 1,000,000 long DNA pieces

  • 2000: assemble whole human genome
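
The size of the puzzle follows from simple arithmetic. The sketch below assumes that every base is sequenced about 10 times over on average; this coverage value is not stated on the slide and is chosen only because it reproduces the ~60 million figure. The function name `reads_needed` is illustrative.

```python
def reads_needed(genome_length, read_length, coverage):
    """Number of reads needed to cover the genome `coverage` times on average."""
    return coverage * genome_length // read_length


genome_length = 3 * 10**9  # nucleotides in the human genome (from the slide)
read_length = 500          # nucleotides per read (from the slide)
coverage = 10              # ASSUMED average coverage, not given on the slide

print(reads_needed(genome_length, read_length, coverage))  # 60000000
```

With a coverage of 1 the same formula gives 6 million reads, so the slide's figure implies that each region is read many times.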


2. Gene Finding

Where are the genes?

In humans:

~22,000 genes

~1.5% of human DNA


2. Gene Finding

[Figure: structure of a gene from 5’ to 3’: start codon (ATG), Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, stop codon (TAG/TGA/TAA), with splice sites marked]

Hidden Markov Models

(Well studied for many years in speech recognition)
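
To make the start/stop codon signals concrete, here is a deliberately naive sketch that scans one strand for open reading frames: an ATG followed, in the same reading frame, by one of the three stop codons. It is not a hidden Markov model and ignores introns and splice sites entirely; the function name `find_orfs` and the `min_codons` parameter are illustrative assumptions.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs of open reading frames on the given strand.

    `start` is the index of the A in ATG; `end` is one past the stop codon.
    `min_codons` counts the codons before the stop codon, ATG included.
    """
    orfs = []
    for frame in range(3):                      # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ORF
    return orfs


dna = "CCATGGCTTGTTAACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # 2 14 ATGGCTTGTTAA
```

Real gene finders must also decide which of the many such candidates are genuine and where exons begin and end, which is where probabilistic models come in.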


3. Protein Folding

  • The amino-acid sequence of a protein determines the 3D fold

  • The 3D fold of a protein determines its function

  • Can we predict 3D fold of a protein given its amino-acid sequence?

    • Holy grail of computational biology: a 40-year-old problem

    • Molecular dynamics, computational geometry, machine learning


4. Sequence Comparison: Alignment

Query vs. DB:

    AGGCTATCACCTGACCTCCAGGCCGATGCCC

    TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Aligned (| = match, x = mismatch, - = gap):

    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
     || |||||||  ||||x|  ||x|||  |||||
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Sequence Alignment

  • Introduced ~1970

  • BLAST: 1990, one of the most cited papers in history

  • Still very active area of research

BLAST

  • Efficient string matching algorithms

  • Fast database index techniques
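
An alignment like the one above can be computed by dynamic programming. The following is a minimal sketch of global alignment in the style of Needleman and Wunsch, with an assumed scoring scheme (match +1, mismatch -1, gap -1) that is not taken from the slides. BLAST itself is a faster heuristic and is not what is shown here.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two letters
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_alignment("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

The table has (n+1) × (m+1) cells, so time and memory grow with the product of the two lengths. That is affordable for two short sequences but not for scanning a whole database, which motivates the indexing and string matching techniques listed for BLAST.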


Lipman & Pearson, 1985

…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).

Database size today: 10^12

(increased 2 million-fold).

BLAST search: 1.5 minutes
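
The growth factor can be checked directly. The sketch below also estimates how long the 1985 search would take on today's database under two assumptions that are not from the slides: running time grows linearly with database size, and the hardware stays the same.

```python
db_1985 = 500_000   # residues in the 1985 library (from the quote)
db_today = 10**12   # database size "today" (from the slide)

growth = db_today // db_1985
print(growth)  # 2000000

minutes_1985 = 2                        # minicomputer search time (from the quote)
naive_minutes = minutes_1985 * growth   # ASSUMES linear scaling, unchanged hardware
naive_years = naive_minutes / (60 * 24 * 365)
print(round(naive_years, 1))  # 7.6
```

Under those assumptions the search would take years rather than the 1.5 minutes quoted, which illustrates how much of the gap is closed by faster hardware and better algorithms together.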


5. Microarray analysis: Clinical prediction of Leukemia type

  • 2 types

    • Acute lymphoid (ALL)

    • Acute myeloid (AML)

  • Different treatments & outcomes

  • Predict type before treatment?

Bone marrow samples: ALL vs AML

Measure amount of each gene


Some goals of biology for the next 50 years

  • List all molecular parts that build an organism

    • Genes, proteins, other functional parts

  • Understand the function of each part

  • Understand how parts interact physically and functionally

  • Study how function has evolved across all species

  • Find genetic defects that cause diseases

  • Design drugs rationally

  • Sequence the genome of every human, use it for personalized medicine

  • Bioinformatics is an essential component for all the goals above


A short introduction to molecular biology


Life

  • Two categories:

    • Prokaryotes (e.g. bacteria)

      • Unicellular

      • No nucleus

    • Eukaryotes (e.g. fungi, plant, animal)

      • Unicellular or multicellular

      • Has nucleus


Prokaryote vs Eukaryote

  • A eukaryotic cell has many membrane-bounded compartments

    • Different biological processes occur at different cellular locations


Organism, Organ, Cell

[Figure labels: Organism, Organ]


Chemical contents of cell

  • Water

  • Macromolecules (polymers) - “strings” made by linking monomers from a specified set (alphabet)

    • Protein

    • DNA

    • RNA

  • Small molecules

    • Sugar

    • Ions (Na+, K+, Ca2+, Cl-, …)

    • Hormone


DNA

  • DNA: forms the genetic material of all living organisms

    • Can be replicated and passed to descendants

    • Contains information to produce proteins

  • To computer scientists, DNA is a string made from alphabet {A, C, G, T}

    • e.g. ACAGAACGTAGTGCCGTGAGCG

  • Each letter is a nucleotide

  • Length varies from hundreds to billions


RNA

  • Historically thought to be information carrier only

    • DNA => RNA => Protein

    • New roles have since been found for RNA

  • To computer scientists, RNA is a string made from alphabet {A, C, G, U}

    • e.g. ACAGAACGUAGUGCCGUGAGCG

  • Each letter is a nucleotide

  • Length varies from tens to thousands


Protein

  • Protein: the actual “worker” for almost all processes in the cell

    • Enzymes: speed up reactions

    • Signaling: information transduction

    • Structural support

    • Production of other macromolecules

    • Transport

  • To computer scientists, protein is a string made from 20 kinds of characters

    • E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP

  • Each letter is called an amino acid

  • Length varies from tens to thousands


DNA/RNA zoom-in

  • Commonly referred to as Nucleic Acid

  • DNA: Deoxyribonucleic acid

  • RNA: Ribonucleic acid

  • Found mainly in the nucleus of a cell (hence “nucleic”)

  • Contain phosphoric acid as a component (hence “acid”)

  • They are made up of a string of nucleotides


Nucleotides

  • A nucleotide has 3 components

    • Sugar ring (ribose in RNA, deoxyribose in DNA)

    • Phosphoric acid

    • Nitrogen base

      • Adenine (A)

      • Guanine (G)

      • Cytosine (C)

      • Thymine (T) or Uracil (U)


Monomers of RNA: ribo-nucleotide

  • A ribonucleotide has 3 components

    • Sugar - Ribose

    • Phosphate group

    • Nitrogen base

      • Adenine (A)

      • Guanine (G)

      • Cytosine (C)

      • Uracil (U)


Monomers of DNA: deoxy-ribo-nucleotide

  • A deoxyribonucleotide has 3 components

    • Sugar – Deoxy-ribose

    • Phosphate group

    • Nitrogen base

      • Adenine (A)

      • Guanine (G)

      • Cytosine (C)

      • Thymine (T)


Polymerization: Nucleotides => nucleic acids

[Figure: three linked nucleotides, each made of a phosphate, a sugar, and a nitrogen base]


DNA

[Figure: a single DNA strand with bases A, G, C, G, A, C, T, G, running from the 5’ (5 prime) end, which carries a free phosphate, to the 3’ (3 prime) end. Inset: one nucleotide made of a phosphate, a sugar with carbons numbered 1 to 5, and a base.]

5’-AGCGACTG-3’, or simply AGCGACTG

Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. DNA replication, transcription, etc.


RNA

[Figure: a single RNA strand with bases A, G, U, G, A, C, U, G, running from the 5’ (5 prime) end, which carries a free phosphate, to the 3’ (3 prime) end]

5’-AGUGACUG-3’, or simply AGUGACUG

Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. translation.


DNA usually exists in pairs.

[Figure: two paired DNA strands running in opposite 5’ to 3’ directions, with each base bonded to its partner on the other strand]

Base-pair:

  • A = T

  • G = C

Forward (+) strand: 5’-AGCGACTG-3’

Backward (-) strand: 3’-TCGCTGAC-5’

One strand is said to be reverse-complementary to the other


DNA double helix

G-C pair is stronger than A-T pair (three hydrogen bonds versus two)


Reverse-complementary sequences

  • 5’-ACGTTACAGTA-3’

  • The reverse complement is:

    3’-TGCAATGTCAT-5’

    =>

    5’-TACTGTAACGT-3’

  • Or simply written as

    TACTGTAACGT
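
The two steps above (complement each base, then reverse so the result reads 5’ to 3’) can be sketched in a few lines. The function name `reverse_complement` is illustrative, and the sketch assumes the input contains only the letters A, C, G, T.

```python
# Translation table mapping each base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTTACAGTA"))  # TACTGTAACGT
```

Applying the function twice returns the original sequence, which matches the observation that each strand is the reverse complement of the other.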


Orientation of the double helix

  • Double helix is anti-parallel

    • 5’ end of each strand pairs with 3’ end of the other

    • 5’ to 3’ motion in one strand is 3’ to 5’ in the other

  • Double helix has no orientation

    • Biology has no “forward” and “reverse” strand

    • Relative to any single strand, there is a “reverse complement” or “reverse strand”

    • Information can be encoded by either strand or both strands

      5’TTTTACAGGACCATG 3’

      3’AAAATGTCCTGGTAC 5’


RNA

  • RNAs are normally single-stranded

  • Form complex structure by self-base-pairing

  • A=U, C=G

  • Can also form RNA-DNA and RNA-RNA double strands.

    • A=T/U, C=G


Protein zoom-in

  • Protein is the actual “worker” for almost all processes in the cell

  • A string built from 20 letters

    • E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH

  • Each letter is called an amino acid

Generic chemical form of amino acid (H2N: amino group, COOH: carboxyl group, R: side chain):

         R
         |
    H2N--C--COOH
         |
         H


Amino acid

  • 20 amino acids, only differ at side chains

    • Each can be expressed by three letters

    • Or a single letter: A-Y, except B, J, O, U, X, Z

    • Alanine = Ala = A

    • Histidine = His = H


Amino acids => peptide

         R                R
         |                |
    H2N--C--COOH     H2N--C--COOH
         |                |
         H                H

    =>

         R          R
         |          |
    H2N--C--CO--NH--C--COOH
         |          |
         H          H

Peptide bond: the CO--NH link between the two amino acids


Protein

[Figure: a peptide chain with side chains (R), running from the N-terminal (H2N) to the C-terminal (COOH)]

  • Has an orientation

  • Usually recorded from N-terminal to C-terminal

  • Peptide vs protein: basically the same thing

  • Conventions

    • Peptide is shorter (< 50aa), while protein is longer

    • Peptide refers to the sequence, while protein has 2D/3D structure


Protein structure

  • Linear sequence of amino acids folds to form a complex 3-D structure.

  • The structure of a protein is intimately connected to its function.


Genome and chromosome

  • Genome: the complete DNA sequences in the cell of an organism

    • May contain one (in most prokaryotes) or more (in eukaryotes) chromosomes

  • Chromosome: a single large DNA molecule in the cell

    • May be circular or linear

    • Contain genes as well as “junk DNAs”

    • Highly packed!


Formation of chromosome

50,000 times shorter than extended DNA

The total length of DNA present in one adult human is the equivalent of nearly 70 round trips from the earth to the sun


Gene

  • Gene: unit of heredity in living organisms

    • A segment of DNA with information to make a protein


Some statistics


Human genome

  • 46 chromosomes: 22 pairs of autosomes + 2 sex chromosomes (X and Y)

  • In each pair: 1 from mother, 1 from father

  • Female: X + X

  • Male: X + Y


Human genome

  • Every cell contains the same genomic information

    • Except sperm and egg cells, which only contain half of the genome

      • Otherwise your children would have 46 + 46 chromosomes


Cell division: mitosis

  • A cell duplicates its genome and divides into two identical cells

  • These cells build up different parts of your body


Cell division: meiosis

  • A reproductive cell divides into four cells, each containing only half of the genome

    • Diploid => haploid

  • Two haploid cells (sperm + egg) form a zygote

    • Which will then develop into a multi-cellular organism by mitosis


Central dogma of molecular biology

DNA replication is critical in both mitosis and meiosis


DNA Replication

  • The process of copying a double-stranded DNA molecule

    • Semi-conservative: each copy keeps one original strand and gains one newly synthesized strand

      5’-ACATGATAA-3’

      3’-TGTACTATT-5’

      5’-ACATGATAA-3’ 5’-ACATGATAA-3’

      3’-TGTACTATT-5’ 3’-TGTACTATT-5’


  • Mutation: changes in DNA base-pairs

  • Proofreading and error-correcting mechanisms exist to ensure extremely high fidelity


Central dogma of molecular biology


Transcription

  • The process by which a DNA sequence is copied to produce a complementary RNA

    • Called messenger RNA (mRNA) if the RNA carries instructions on how to make a protein

    • Called non-coding RNA if the RNA does not carry instructions on how to make a protein

    • Only consider mRNA for now

  • Similar to replication, but

    • Only one strand is copied


Transcription

  • DNA-RNA pair:

    • A=U, C=G

    • T=A, G=C

Coding strand (where genetic information is stored): 5’-ACGTAGACGTATAGAGCCTAG-3’

Template strand (for making mRNA): 3’-TGCATCTGCATATCTCGGATC-5’

mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’

Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.
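
Both routes to the mRNA can be checked with a short sketch. The function names are illustrative; the template strand is assumed to be written 3’ to 5’, as on the slide, so that the result reads 5’ to 3’.

```python
# DNA-to-RNA pairing rules from the slide: A=U, T=A, G=C, C=G.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_template(template_3_to_5):
    """Pair each template base (written 3' to 5') with its RNA partner."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template_3_to_5)


def coding_to_mrna(coding_5_to_3):
    """Shortcut: the mRNA equals the coding strand with T replaced by U."""
    return coding_5_to_3.replace("T", "U")


coding = "ACGTAGACGTATAGAGCCTAG"
template = "TGCATCTGCATATCTCGGATC"

print(transcribe_template(template))  # ACGUAGACGUAUAGAGCCUAG
print(coding_to_mrna(coding))         # ACGUAGACGUAUAGAGCCUAG
```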


Translation

  • The process of making proteins from mRNA

  • A gene uniquely encodes a protein

  • There are four bases in DNA (A, C, G, T), and four in RNA (A, C, G, U), but 20 amino acids in protein

  • How many nucleotides are required to encode an amino acid in order to ensure correct translation?

    • 4^1 = 4 (too few for 20 amino acids)

    • 4^2 = 16 (still too few)

    • 4^3 = 64 (enough)

  • The actual genetic code used by the cell is a triplet.

    • Each triplet is called a codon
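
The counting argument can be written as a small search for the shortest codon length whose number of possible words covers all amino acids. The function name `min_codon_length` is illustrative.

```python
def min_codon_length(alphabet_size, n_symbols):
    """Smallest k such that alphabet_size ** k >= n_symbols."""
    k = 1
    while alphabet_size ** k < n_symbols:
        k += 1
    return k


print(min_codon_length(4, 20))  # 3
print(min_codon_length(4, 21))  # 3 (20 amino acids plus a stop signal)
```

With 64 triplets for 20 amino acids, several codons can map to the same amino acid.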


The Genetic Code

[Table: the genetic code, indexed by the first, second, and third letter of each codon]


Translation

  • The sequence of codons is translated to a sequence of amino acids

  • Gene: -GCT TGT TTA CGA ATT-

  • mRNA: -GCU UGU UUA CGA AUU-

  • Peptide: - Ala - Cys - Leu - Arg - Ile -

  • Start codon: AUG

    • Also codes for Met

  • Stop codon: UGA, UAA, UAG
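
The codon-by-codon reading can be sketched as a table lookup. The table below is the standard genetic code written in a compact form, with bases ordered U, C, A, G and `*` marking stop codons. The function name `translate` is illustrative, and the output uses one-letter amino acid codes (A = Ala, C = Cys, L = Leu, R = Arg, I = Ile, M = Met).

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order; '*' = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


print(translate("GCUUGUUUACGAAUU"))  # ACLRI
print(translate("AUGGCUUGUUAA"))     # MAC
```

This sketch starts reading at the first base; in the cell, translation begins at a start codon, which fixes the reading frame.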


Translation

  • Transfer RNA (tRNA) – a different type of RNA.

    • Freely float in the cell.

    • Every amino acid has its own type of tRNA that binds to it alone.

  • Anti-codon – codon binding crucial.

[Figure: mRNA being translated; tRNA-Pro and tRNA-Leu bind codons through their anti-codons and extend the nascent peptide]


Transcriptional regulation

  • Will talk more in later lectures

  • RNA polymerase binds to a certain location on the promoter to initiate transcription

  • Transcription factors bind to specific sequences on the promoter to regulate transcription

    • Recruit RNA polymerase: induce

    • Block RNA polymerase: repress

    • Multiple transcription factors may coordinate

[Figure: a transcription factor and RNA polymerase bound to the promoter, next to the transcription starting site of the gene]


Splicing

  • Pre-mRNA needs to be “edited” to form mature mRNA

  • Will talk more in later lectures.

[Figure: a gene with its promoter and transcription starting site is transcribed into pre-mRNA (5’ UTR, exon, intron, exon, intron, exon, 3’ UTR); splicing yields the mature mRNA, whose open reading frame (ORF) runs from the start codon to the stop codon]


Summary

  • DNA: a string made from {A, C, G, T}

    • Forms the basis of genes

    • Has 5’ and 3’

    • Normally forms double-strand by reverse complement

  • RNA: a string made from {A, C, G, U}

    • mRNA: messenger RNA

    • tRNA: transfer RNA

    • Other types of RNA: rRNA, miRNA, etc.

    • Has 5’ and 3’

    • Normally single-stranded. But can form secondary structure

  • Protein: made from 20 kinds of amino acids

    • Actual worker in the cell

    • Has N-terminal and C-terminal

    • Sequence uniquely determined by its gene via the use of codons

    • Sequence determines structure, structure determines function

  • Central dogma: DNA transcribes to RNA, RNA translates to Protein

    • Both steps are regulated


Experimental techniques to manipulate DNA


DNA synthesis

  • Creating DNA synthetically in a laboratory

  • Chemical synthesis

    • Chemical reactions

    • Arbitrary sequences

    • Maximum length: 160-200 nucleotides

  • Cloning: make copies based on a DNA template

    • Biological reactions

    • Requires template

    • Many copies of a long DNA in a short time


in vivo DNA Cloning

  • Connect a piece of DNA to bacterial DNA, which can then be replicated together with the host DNA

bacterial DNA


in vitro DNA Cloning

  • Polymerase chain reaction (PCR)

[Figure: PCR cycle: denature the double strand, attach primers (< 30 bases), then extend each primer using DNA polymerase and dNTP]


Some terms

  • Denature: a DNA double-strand is separated into two strands

    • By raising temperature

  • Renature: the process by which two denatured DNA strands re-form a double strand

    • By cooling down slowly

  • Hybridization: two heterogeneous DNAs form a double-stranded DNA

    • may have mismatches

    • The rationale behind many molecular biological techniques including DNA microarray


DNA sequencing technology

  • Read out the letters from a DNA sequence

1974, Frederick Sanger

GTGAGGCGCTGC


DNA sequencing: Basic idea

  • PCR

    primer extension

    5’-TTACAGGTCCATACTA

    3’-AATGTCCAGGTATGATACATAGG-5’

  • We need to supply A, C, G, T for the synthesis to continue

  • Besides A, C, G, T, we add some A*, C*, G*, and T*

    • Very similar to ACGT in all aspects, except that

    • Extension stops as soon as one of them is incorporated
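
The idea can be simulated in an idealized way: if extension can stop at every possible position, the reaction yields one fragment per length, and reading the last base of each fragment in order of length spells out the new sequence. The sketch below uses the primer and template from the slide; the function name is illustrative, and the simulation ignores all chemistry (it simply assumes every fragment length occurs and can be separated by size).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def chain_termination_read(primer_5_to_3, template_3_to_5):
    """Read the bases added after the primer from idealized terminated fragments.

    The template is written 3' to 5' so that the primer pairs with its left end.
    """
    # Full extension product: primer plus complement of the uncovered template.
    uncovered = template_3_to_5[len(primer_5_to_3):]
    full = primer_5_to_3 + "".join(COMPLEMENT[base] for base in uncovered)
    # One fragment for every position where a terminator (A*, C*, G*, T*) stops it.
    fragments = [full[:k] for k in range(len(primer_5_to_3) + 1, len(full) + 1)]
    # Sort by length (as size separation would) and read each last base.
    fragments.sort(key=len)
    return "".join(fragment[-1] for fragment in fragments)


primer = "TTACAGGTCCATACTA"
template = "AATGTCCAGGTATGATACATAGG"
print(chain_termination_read(primer, template))  # TGTATCC
```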


DNA sequencing, cont.


Advances in DNA sequencing

  • 1969: three years to sequence 115nt DNA

  • 1979: three years to sequence ~1650nt

  • 1989: one week to sequence ~1650nt

  • 1995: Haemophilus genome sequenced at TIGR - 1,830,138nt

  • 2000: Human Genome - working draft sequence, 3 billion bases

  • 2003: (near) completion of human genome


The bioinformatics landmark

  • Completion of human genome sequencing is a success made possible by

    • Advancement in sequencing technology

    • Speed of computation

    • Algorithm development in bioinformatics

  • HGP (Human Genome Project) strategy

    • Hierarchical sequencing

    • Estimated 15 years (1990 – 2005), completed in 13 years

    • $3 billion

  • Celera strategy

    • Whole-genome shotgun sequencing

    • Three years (1998-2001)

    • $300 million


Now

  • Over 300 genomes have been sequenced

  • ~10^11 - 10^12 nt


2007

  • Genomes of three individual humans were sequenced

    • James Watson

    • Craig Venter

    • TBN Chinese

  • Cost for sequencing Watson’s genome

    • $3 million, 2 months

    • Compared to $3 billion, 13 years for HGP


  • Sequencing speed has improved tremendously

  • High efficiency and relatively low cost make it possible to sequence the genome of any individual from any species

    What’s next?


Continue to sequence more species?

More individuals?

What to do with those sequences?


Coming next: biological sequence analysis

