The Genes, the Whole Genes, and Nothing But the Genes

Jim Kent

University of California Santa Cruz


Ben Franklin - Childhood Hero

Hi Voltage Experiments


A Man of High Values


Early to bed

Early to rise


Rock Collection


Shell Collection


Bottlecap Collection


Bug Collection


Jim Kent - Genome Scientist not to be confused with Richard Stallman


Modern Bug Collection

if (a = b)

if (string == "something")

for (x=0; x<count; ++x);
    process(x);

for (x=0; x<width; ++x)
    for (y=0; y<height; ++x)
        plot(x, y, data[x][y]);


Naive Biological Questions

  • Is an ant an individual?

  • Do dolphins talk with each other?

  • How come newts and worms can regenerate so much better than we can?

  • How does a plant grow from a seed, an animal from an egg?


Regeneration by Nature

  • Among vertebrates only amphibians can regenerate limbs.

  • The process involves dedifferentiation, repatterning, and growth.

  • Not likely we’ll be able to engineer this soon.

  • Simpler regenerations though may be tractable and medically quite important.


From Egg to Adult in 3x10^9 Bases

  • A single cell, the fertilized egg, eventually differentiates into the ~300 different types of cells that make up an adult body.

  • With a few exceptions all of these cells contain the full human genome, but express only a subset of the genes.

  • Gene expression patterns are determined largely by the cell type, and vice versa.


From Totipotency to Senility

  • Human cells become more and more specialized during development

  • An egg can become anything. (Initially most of it will become placenta).

  • Liver cells only become liver cells.

  • A neuron can’t even reproduce.


Primary Flows of Information and Substance in Cell

[Diagram labels: DNA, creation, regulation, mRNA, transcription factors, splicing factors, receptors, enzymes, structural proteins, signaling molecules, structural sugars, structural lipids, environment & other cells]


An Extreme Case of Dedifferentiation

  • The cloning of Dolly the sheep showed that a differentiated genome could be reset.

  • An egg is huge compared to a normal cell. Putting a normal cell into an egg, as Wilmut et al. did, swamps the normal cell's transcription factors and receptors with those of the egg.

  • The cloning success rate is sometimes improved by passing a nucleus through multiple eggs.


Human Diseases Involving a Small Population of Cells

  • Parkinson’s - from the death of dopamine-producing neurons in the substantia nigra.

  • Macular degeneration - a leading cause of blindness in the elderly.

  • Type I Diabetes - from the death of insulin-producing cells in the pancreas.


Pancreas Differentiation Pathway

From Huang and Tsai, J Biomed Sci 2000;7:27-34, and Jensen et al., Diabetes 2000;49:163-176


Cell Type Determinants

  • Cell type of parent cell.

  • Interactions with other cells.

  • Interactions with the extracellular environment.


Flexibility of Stem Cells

  • In many cases stem cells are flexible enough that putting them into a particular tissue will cause them to differentiate into the type of cells that make up that tissue.

  • At low levels, transplanted bone marrow (blood stem cells) develops into neurons in stroke victims!

  • Making this happen at high enough levels to be useful will likely require some engineering.


Primary Flows of Information and Substance in Cell

[Diagram labels: DNA, creation, regulation, mRNA, transcription factors, splicing factors, receptors, enzymes, structural proteins, signaling molecules, structural sugars, structural lipids, environment, other cells]


To Understand the Body, We Need

  • The genome

  • A comprehensive list of genes

  • Gene expression data

  • Protein localization in cell

  • Protein/protein and protein/DNA interaction information.

  • Ways to store, display and query masses of data so human investigators can focus on relevant bits.

  • Many talented and hardworking human investigators.


Where are we now?

  • The genome - >95% complete. 98% complete in April.

  • A comprehensive list of genes - ~75% of coding regions. <50% of transcription start sites.

  • Gene expression data - publicly available on ~1/3 of genes.

  • Protein localization in cell - very spotty. Computer predictions are about 75% accurate.

  • Protein/protein and protein/DNA interaction information - just getting started.


The Genes

  • Identifying genes is a prerequisite for a great deal of other research.

    • Expression microarrays

    • In situ mRNA hybridization

    • Producing proteins for cellular localization experiments

    • Etc.


The Whole Genes

  • The full gene, including the 5’ and 3’ UTRs, is critical for

    • Avoiding misleading fragmentation/fusion artifacts.

    • Understanding mRNA targeting and stability

    • Finding transcription factor binding sites

    • Understanding the regulatory networks that drive and maintain cell differentiation.


Nothing But the Genes

  • Experimental analysis is expensive.

  • Unreal genes can mislead:

    • Analysis of multiple alignments to look for active sites etc.

    • Protein classification systems and phylogenies

  • One bogus gene can lead to another as much annotation is done via homology.


Methods of Identifying Genes

  • mRNA/cDNA sequencing

  • Microarrays covering entire genome

  • Genetics in model organisms

  • Cross species protein homology

  • Cross species genomic homology

  • HMM and other purely computational genefinding.


cDNA Sequencing

  • Extract RNA from cells.

  • Use reverse transcriptase and a poly-T primer to convert the mRNA to cDNA, starting at the poly-A tail.

  • Insert cDNA into vectors that grow in E. coli

  • Sequence a read from one or both sides of insert using primers on vector

  • If the EST looks to be new, sequence the full cDNA.

  • Artifacts and limitations are possible at each stage!
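The reverse-transcription step above can be sketched in a few lines of Python. This is a toy model only: it assumes the enzyme copies the whole message without errors, and the function name and the example sequence are hypothetical.

```python
# Toy model of first-strand cDNA synthesis (assumes a complete, error-free copy).
RNA_TO_DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}


def first_strand_cdna(mrna):
    """Return the cDNA strand (5'->3') complementary to an mRNA given 5'->3'."""
    # Synthesis starts opposite the 3' poly-A tail, so we walk the mRNA backwards.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


mrna = "AUGGCCUAAAAAA"          # hypothetical message with a short poly-A tail
print(first_strand_cdna(mrna))  # TTTTTTAGGCCAT
```

The cDNA begins with a run of T's because that is where the primer pairs with the poly-A tail; everything after it is the reverse complement of the message.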


Common cDNA Problems & Solutions

  • For rarely expressed genes little RNA is available.

    • Normalize libraries. Use embryonic and exotic tissues as mRNA source.

  • Splicing is not instantaneous, so retained introns can appear.

    • Spin out nuclei and just use cytoplasmic mRNA

    • Align to genome and look for splicing

  • Reverse transcriptase falls off before it’s finished

    • Preferentially taking larger cDNAs.

    • G-cap selected libraries (Sugano)

    • Normalizing only on 5’ ends (Soares)
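One way to picture "align to genome and look for splicing" is the sketch below. It assumes the cDNA has already been aligned and that the alignment is given as sorted, half-open (start, end) blocks on the forward strand of the genome; the function name and the toy sequence are hypothetical, and real aligners report far more detail.

```python
def classify_gaps(genome, blocks):
    """Label each gap between consecutive aligned blocks of a cDNA.

    blocks: sorted, non-overlapping, half-open (start, end) genome intervals.
    A gap that begins with GT and ends with AG looks like a spliced-out intron.
    """
    labels = []
    for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
        gap = genome[prev_end:next_start].upper()
        if len(gap) >= 4 and gap.startswith("GT") and gap.endswith("AG"):
            labels.append("canonical GT..AG intron")
        else:
            labels.append("non-canonical gap")
    return labels


# Toy genome: exon, a 12-base intron (GT...AG), exon.
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
print(classify_gaps(genome, [(0, 6), (18, 24)]))  # ['canonical GT..AG intron']
print(classify_gaps(genome, [(0, 24)]))           # [] : one block, no evidence of splicing
```

A cDNA that aligns as a single block across a region where other cDNAs show an intron is a candidate for a retained intron.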


More cDNA Problems & Solutions

  • Reverse transcriptase has a high error rate and is prone to small deletions.

    • Compare cDNA to genomic DNA

    • Sequence multiple cDNA clones

  • At a low level, the cell seems to tolerate a certain degree of nonsense transcription and splicing. Normalizing increases the concentration of these as well as of rare genes.

    • Ignore everything that’s not coding (ouch)

    • ???
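The "sequence multiple cDNA clones" remedy can be illustrated with a simple majority vote. The sketch assumes the clone reads are already aligned to equal length with "-" marking deletions; the function name and reads are made up for illustration.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over equal-length, pre-aligned clone reads."""
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    columns = zip(*aligned_reads)
    called = [Counter(column).most_common(1)[0][0] for column in columns]
    # Columns where the majority is a gap contribute nothing to the consensus.
    return "".join(called).replace("-", "")


reads = [
    "ATG-CCTAA",   # clone with a one-base deletion
    "ATGGACTAA",   # clone with a substitution
    "ATGGCCTAA",
]
print(consensus(reads))  # ATGGCCTAA
```

With three or more clones, an isolated reverse transcriptase error is outvoted; comparing the result to the genomic sequence gives an independent check.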


cDNA Status & Summary

  • ~10,000 cDNA sequences have been accumulated over the years by various labs working on gene families and pathways.

  • Riken project has ~33,000 unique cDNAs in mouse. ~11,000 of these seem to have retained introns. ~3,000 are noncoding antisense. ~70% include initial ATG

  • Mammalian Gene Collection (MGC) has ~15,000 human cDNAs with initial ATGs. Having to resort to exotic libraries and RT-PCR to get more.

  • Human refSeq has ~18,000 human cDNAs.


Whole Genome Microarrays

  • Perlegen and Affymetrix are making microarrays that cover the entire genome not masked by RepeatMasker. Results on chromosomes 21 and 22 have been published.

  • Based on 25-mers.

  • Rarely expressed genes may not stand out above background.

  • Have to cope with cross-hybridization issues, GC content, etc.

  • Advantages - no homology required, can sense lower concentrations of mRNA than random EST sequencing.
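A rough sketch of tiling 25-mer probes across unmasked sequence follows. It assumes repeats are soft-masked as lowercase letters and that probes are laid end to end; the actual probe spacing and selection rules used by the array makers are not given in these slides, so `step` and both function names are placeholders.

```python
def tile_probes(seq, probe_len=25, step=25):
    """Return (start, probe) pairs for windows made only of unmasked A/C/G/T."""
    probes = []
    for start in range(0, len(seq) - probe_len + 1, step):
        probe = seq[start:start + probe_len]
        if set(probe) <= set("ACGT"):   # rejects lowercase (masked) bases and N
            probes.append((start, probe))
    return probes


def gc_content(probe):
    """Fraction of G and C bases, one factor in hybridization strength."""
    return (probe.count("G") + probe.count("C")) / len(probe)


# Toy sequence: 25 unmasked bases, 25 masked bases, 25 unmasked bases.
seq = "G" * 25 + "a" * 25 + "AT" * 12 + "A"
for start, probe in tile_probes(seq):
    print(start, gc_content(probe))
# 0 1.0
# 50 0.0
```

Probes at the extremes of GC content, like the two above, are the ones most likely to need correction when comparing signal intensities.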


Cross-hybridization at Work

Zoomed in on right side:


Genetics in Model Organisms

  • Zap hapless yeast, worms, flies, and mice

  • Inbreed offspring and look for twisted ones.

  • Advantages:

    • Works at DNA level, so expression level doesn’t matter

    • You get hints of function right away.

    • Can look for gene interactions simply by breeding mutants.

  • Disadvantages:

    • Finding which DNA is mutated can take a long time.

    • Essential genes can be hard to find - all you see is reduced fertility in the inbreeding stage.

    • Genes only needed in certain environments and duplicated genes may be missed in screens.


Cross Species Genome Comparisons

  • Mutations occur more or less randomly across the genome, but

  • Mutations in functional areas tend to be weeded out by selection.

  • In comparing DNA across species, the functional areas are, in general, more conserved than the nonfunctional areas.
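The idea that functional regions stand out as more conserved can be sketched as percent identity in fixed windows of a pairwise alignment. The sketch assumes two pre-aligned, equal-length sequences with "-" for gaps; the window size, function name, and sequences are illustrative only.

```python
def window_identity(seq_a, seq_b, window=10):
    """Fraction of identical, non-gap columns in each consecutive window."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for start in range(0, len(seq_a) - window + 1, window):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        scores.append(matches / window)
    return scores


human = "ATGGCCAAGT" + "TTTTTTTTTT"   # hypothetical aligned sequences
mouse = "ATGGCCAAGT" + "TTACGTACTT"
print(window_identity(human, mouse))  # [1.0, 0.5]
```

In this toy case the first window would be flagged as the likelier functional region, while the second looks more like freely drifting sequence.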


Of Fish and Mice and Men


Comparative Genomics at BMP10


Conservation of Gene Features

Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another.
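Sampling a fixed number of evenly spaced bases lets features of very different lengths be averaged on a common axis. The sketch below shows one plausible way to pick the positions; how the original program handled rounding is not stated, so the integer arithmetic here is an assumption, as is the function name.

```python
def evenly_spaced_positions(start, end, samples=200):
    """Pick `samples` positions spread evenly over the half-open range [start, end)."""
    length = end - start
    if length <= 0 or samples <= 0:
        raise ValueError("need a non-empty region and a positive sample count")
    return [start + (i * length) // samples for i in range(samples)]


upstream = evenly_spaced_positions(0, 500)   # e.g. the 500 bases upstream
print(len(upstream), upstream[:5], upstream[-1])
# 200 [0, 2, 5, 7, 10] 497

# A region shorter than the sample count simply repeats positions.
print(evenly_spaced_positions(10, 13, samples=6))  # [10, 10, 11, 11, 12, 12]
```

Because every region yields the same number of samples, a 90-base exon and a 20,000-base intron contribute equally sized profiles to the average.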


Detail Near Translation Start

Note the relatively conserved base 3 before translation start (constrained to be a G or an A by the Kozak consensus sequence), and the first three translated bases (ATG).
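The constraint at the base three positions before the start codon is easy to check in code. This minimal sketch assumes a DNA-alphabet mRNA string and a known index for the A of the ATG; the function name and example sequences are hypothetical.

```python
def has_purine_at_minus3(seq, atg_index):
    """True if seq has ATG at atg_index and an A or G three bases upstream."""
    seq = seq.upper()
    if atg_index < 3 or seq[atg_index:atg_index + 3] != "ATG":
        return False
    return seq[atg_index - 3] in "AG"


print(has_purine_at_minus3("GCCACCATGG", 6))  # True  (A at position -3)
print(has_purine_at_minus3("GCCTCCATGG", 6))  # False (T at position -3)
```

A gene finder could use a test like this as weak supporting evidence when choosing among candidate start codons.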


Normalized eScores


Computational Gene Finding


Basic Techniques

  • Bacteria - look for open reading frames - long stretches between start and stop codons.

  • Eukaryotes - introns are challenging

    • Look for coding exons (bounded by AG / GT)

    • HMMs can model coding regions and splice sites simultaneously

    • Generalized HMMs (genscan) can string together probable exons

    • Homology based ones (GeneWise) can map proteins to genome allowing for considerable evolutionary divergence.
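The bacterial case, looking for open reading frames, is simple enough to sketch directly. The version below scans only the three forward-strand frames and uses only ATG as a start codon; a real tool would also scan the reverse complement and apply a much larger minimum length. The function name and toy sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return half-open (start, end) spans from an ATG through the next in-frame stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF in this frame
    return orfs


seq = "CCATGAAATTTTAGCC"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])  # 2 14 ATGAAATTTTAG
```

In eukaryotes the same scan fails on most genes because introns interrupt the reading frame, which is why the splice-aware methods listed above are needed.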


Limitations of Basic Approach

  • Introns are vast, GT/AG splice signals are small.

  • Coding signal is stronger than start/stop signal. As a result gene fragmentation and fusion is a big problem.

  • Pseudo genes, processed and otherwise, mimic coding regions.

  • Pure HMM approaches tend to overpredict

  • Pure homology approaches can only tell us about what we already know.


Composite Approaches

  • Use EST info to constrain HMMs (Genie)

  • Use protein homology info on top of HMMs (fgenesh++, GenomeScan)

  • Use cross species genomic alignments on top of HMMs (twinscan, fgenesh2, SLAM, SGP)


Computational Gene Finding


Acknowledgements

Individuals

David Haussler, Angie Hinrichs, Chuck Sugnet, Matt Schwartz, Robert Baertsch, Donna Karolchik,

Francis Collins, Bob Waterston, Eric Lander, John Sulston, Richard Gibbs,

Lincoln Stein, Sean Eddy, Olivier Jaillon, David Kulp, Victor Solovyev, Ewan Birney, Greg Schuler, Deanna Church, Asif Chinwalla, the Gene Cats.

Everyone else!

Institutions

NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers in the US and worldwide.

Whitehead, Sanger, Wash U, Baylor, Stanford, DOE, and the international sequencing centers.

NCBI, Ensembl, Genoscope, The SNP Consortium, UCSC, MGC, Softberry, Affymetrix.


THE END


Parasol and Kilo Cluster

  • UCSC cluster has 1000 CPUs running Linux

  • 1,000,000 BLASTZ jobs in 25 hours for mouse/human alignment

  • We wrote Parasol job scheduler to keep up.

    • Very fast and free.

    • Jobs are organized into batches.

    • Error checking at job and at batch level.
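The figures on this slide imply a demanding scheduling rate, which a little arithmetic makes concrete. The calculation below uses only the numbers quoted above and assumes, for the per-job estimate, that all 1000 CPUs were kept busy for the full 25 hours; the function names are hypothetical.

```python
def jobs_per_second(jobs, wall_hours):
    """Average rate at which the scheduler must finish (and restart) jobs."""
    return jobs / (wall_hours * 3600)


def cpu_seconds_per_job(jobs, wall_hours, cpus):
    """Average job length, assuming every CPU is busy for the whole run."""
    return cpus * wall_hours * 3600 / jobs


print(round(jobs_per_second(1_000_000, 25), 2))    # 11.11
print(cpu_seconds_per_job(1_000_000, 25, 1000))    # 90.0
```

Roughly eleven job completions per second, each job lasting about a minute and a half, is the load the scheduler had to sustain for a full day.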

