Genes, Genomes, and Genomics

Bioinformatics in the Classroom

plagiarized from:

June, 2003

Two. Again …

Francis Collins, HGP

Craig Venter, Celera Inc.

The value of genome sequences lies in their annotation
  • Annotation – Characterizing genomic features using computational and experimental methods
  • Genes: Four levels of annotation
    • Gene Prediction – Where are genes?
    • What do they look like?
    • Domains – What do the proteins do?
    • Role – What pathway(s) are they involved in?
How many genes?
  • Consortium: 35,000 genes?
  • Celera: 30,000 genes?
  • Affymetrix: 60,000 human genes on GeneChips?
  • Incyte and HGS: over 120,000 genes?
  • GenBank: 49,000 unique gene coding sequences?
  • UniGene: > 89,000 clusters of unique ESTs?
Current consensus (in flux …)
  • 15,000 known genes (similarity to previously isolated genes and expressed sequences from a large variety of different organisms)
  • 17,000 predicted (GenScan, GeneFinder, GRAIL)
  • Based on and limited to previous knowledge
What are genes? - 1
  • Complete DNA segments responsible for making functional products
  • Products
    • Proteins
    • Functional RNA molecules
      • siRNA (small interfering RNA, the effector of RNAi)
      • rRNA (ribosomal RNA)
      • snRNA (small nuclear)
      • snoRNA (small nucleolar)
      • tRNA (transfer RNA)
What are genes? - 2
  • Definition vs. dynamic concept
  • Consider
    • Prokaryotic vs. eukaryotic gene models
    • Introns/exons
    • Posttranscriptional modifications
    • Alternative splicing
    • Differential expression
    • Genes-in-genes
    • Genes-ad-genes
    • Posttranslational modifications
    • Multi-subunit proteins
Prokaryotic gene model: ORF-genes
  • “Small” genomes, high gene density
    • Haemophilus influenzae genome: 85% genic
  • Operons
    • One transcript, many genes
  • No introns.
    • One gene, one protein
  • Open reading frames
    • One ORF per gene
    • ORFs begin with a start codon and end with a stop codon (by definition)
Eukaryotic gene model: spliced genes
  • Posttranscriptional modification
    • 5’ cap, poly(A) tail, splicing
  • Open reading frames
    • Mature mRNA contains ORF
    • All internal exons are open “read-through” segments (no in-frame stop codon)
    • Pre-start and post-stop sequences are UTRs
  • Multiple translates
    • One gene – many proteins via alternative splicing
Expansions and Clarifications
  • ORFs
    • Start – triplets – stop
    • Prokaryotes: gene = ORF
    • Eukaryotes: spliced genes or ORF genes
  • Exons
    • Remain after introns have been removed
    • Flanking parts contain non-coding sequence (5’- and 3’-UTRs)
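The "start – triplets – stop" definition can be illustrated with a minimal sketch of an ORF scan. It assumes the standard genetic code (ATG start; TAA, TAG, TGA stops), looks at the forward strand only, and uses a hypothetical function name, `find_orfs`; it is not the method of any of the tools named in these slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
START_CODON = "ATG"


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the index of the A in ATG, end is the index just past the
    stop codon, and min_codons counts the start and stop codons too.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open an ORF at the first ATG seen
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next start
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

For a prokaryotic sequence each hit is a gene candidate; for a spliced eukaryotic gene the same scan only makes sense on the mature mRNA, after the introns have been removed.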
Where do genes live?
  • In genomes
  • Example: human genome
    • Ca. 3,200,000,000 base pairs
    • 25 chromosomes: 1-22, X, Y, mt
    • 28,000-45,000 genes (current estimate)
    • Gene length: 128 nucleotides (RNA gene) to 2,800 kb (DMD)
    • Ca. 25% of the genome consists of genes (introns, exons)
    • Ca. 1% of the genome codes for amino acids (CDS)
    • 30 kb gene length (average)
    • 1.4 kb ORF length (average)
    • 3 transcripts per gene (average)
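As a rough plausibility check, the averages above can be multiplied out to see whether they agree with the stated 25% and 1% fractions. This is only a back-of-the-envelope sketch: it assumes genes do not overlap, treats the average ORF length as the coding length per gene, and uses a hypothetical helper name, `genome_fractions`.

```python
GENOME_BP = 3_200_000_000   # ca. 3.2 billion base pairs (from the slide)
AVG_GENE_BP = 30_000        # 30 kb average gene length (from the slide)
AVG_ORF_BP = 1_400          # 1.4 kb average ORF length (from the slide)


def genome_fractions(gene_count):
    """Return (genic fraction, coding fraction) for a given gene count.

    Assumes non-overlapping genes and one average-length ORF per gene.
    """
    genic = gene_count * AVG_GENE_BP / GENOME_BP
    coding = gene_count * AVG_ORF_BP / GENOME_BP
    return genic, coding


if __name__ == "__main__":
    for genes in (28_000, 45_000):
        genic, coding = genome_fractions(genes)
        print(f"{genes} genes: {genic:.1%} genic, {coding:.2%} coding")
```

Under these assumptions the lower gene estimate gives about 26% genic and about 1.2% coding sequence, which is in line with the slide's round figures.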
Sample genomes

List of 68 eukaryotes, 141 bacteria, and 17 archaea at

Genomic sequence features
  • Repeats (“Junk DNA”)
    • Transposable elements, simple repeats
    • RepeatMasker
  • Genes
    • Vary in density, length, structure
    • Identification depends on evidence and methods and may require concerted application of bioinformatics methods and lab research
  • Pseudogenes
    • Look-alikes of genes; they obstruct gene-finding efforts.
  • Non-coding RNAs (ncRNA)
    • tRNA, rRNA, snRNA, snoRNA, miRNA
Gene identification
  • Homology-based gene prediction
    • Similarity Searches (e.g. BLAST, BLAT)
    • Genome Browsers
    • RNA evidence (ESTs)
  • Ab initio gene prediction
    • Gene prediction programs
    • Prokaryotes
      • ORF identification
    • Eukaryotes
      • Promoter prediction
      • PolyA-signal prediction
      • Splice site, start/stop-codon predictions
Gene prediction through comparative genomics
  • Highly similar (conserved) regions between two genomes are likely functional; otherwise they would have diverged
  • If genomes are too closely related all regions are similar, not just genes
  • If genomes are too far apart, analogous regions may be too dissimilar to be found
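The idea of picking out conserved regions can be sketched as a sliding-window percent-identity scan. This is a toy illustration that assumes the two sequences are already aligned to equal length (gaps written as "-"); the function names, window size, and threshold are all hypothetical choices, not values from the slides.

```python
def window_identity(a, b, window=10, step=5):
    """Return (start, identity) per window of two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(a) - window + 1, step):
        pairs = zip(a[start:start + window], b[start:start + window])
        # A gap never counts as a match, even against another gap.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, matches / window))
    return results


def conserved_windows(a, b, window=10, step=5, threshold=0.8):
    """Keep only windows whose identity reaches the threshold."""
    return [(s, ident) for s, ident in window_identity(a, b, window, step)
            if ident >= threshold]


if __name__ == "__main__":
    print(conserved_windows("AAAAACCCCC", "AAAAAGGGGG", window=5, step=5))
```

If the two genomes are too close, nearly every window passes the threshold; if they are too distant, almost none do, which is the trade-off the bullets describe.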
Genome Browsers

NCBI Map Viewer

Generic Genome Browser (CSHL)

Ensembl Genome Browser

UCSC Genome Browser

Apollo Genome Browser

Gene discovery using ESTs
  • Expressed Sequence Tags (ESTs) represent sequences from expressed genes.
  • If a region matches an EST with high stringency, then the region is probably a gene or pseudogene.
    • EST overlapping exon boundary gives an accurate prediction of exon boundary.
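How an EST that spans an exon boundary pins down that boundary can be shown with a toy spliced-match search. The sketch below assumes exact matches only, a single intron, and the canonical GT...AG intron ends; `spliced_match` and `min_anchor` are hypothetical names, and real alignment tools work very differently.

```python
def spliced_match(genome, est, min_anchor=4):
    """Split the EST into two exact genomic matches separated by an intron.

    Returns the exon and intron coordinates (0-based, end-exclusive), or
    None. Only introns that start with GT and end with AG are accepted.
    """
    for k in range(min_anchor, len(est) - min_anchor + 1):
        left, right = est[:k], est[k:]
        p = genome.find(left)
        while p != -1:
            exon1_end = p + k
            # The second exon must start downstream of the first one.
            q = genome.find(right, exon1_end + 1)
            while q != -1:
                intron = genome[exon1_end:q]
                if (len(intron) >= 4 and intron.startswith("GT")
                        and intron.endswith("AG")):
                    return {"exon1": (p, exon1_end),
                            "intron": (exon1_end, q),
                            "exon2": (q, q + len(right))}
                q = genome.find(right, q + 1)
            p = genome.find(left, p + 1)
    return None


if __name__ == "__main__":
    exon1, intron, exon2 = "ATGGCCAAA", "GTAAGTTTTTCAG", "GGCTTCTGA"
    genome = "CC" + exon1 + intron + exon2 + "TT"
    print(spliced_match(genome, exon1 + exon2))
```

Requiring the GT...AG ends resolves the ambiguity that arises when the first base of the intron happens to equal the first base of the next exon.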
Ab initio gene prediction
  • Prokaryotes
    • ORF-Detectors
  • Eukaryotes
    • Position, extent & direction: through promoter and polyA-signal predictors
    • Structure: through splice site predictors
    • Exact location of coding sequences: through determination of relationships between potential start codons, splice sites, ORFs, and stop codons
  • ORF detectors
    • NCBI:
  • Promoter predictors
    • CSHL:
    • BDGP:
    • ICG: TATA-Box predictor
  • PolyA signal predictors
    • CSHL:
  • Splice site predictors
    • BDGP:
  • Start-/stop-codon identifiers
    • DNALC: Translator/ORF-Finder
    • BCM: Searchlauncher
How it works I – Motif identification

Exon-Intron Borders = Splice Sites

[Figure: exon-intron diagram with a splice site labeled at each exon-intron border]

Motif Extraction Programs at

How it works II - Movies

Pribnow-Box Finder 0/1

Pribnow-Box Finder all
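A Pribnow-box search can be sketched as a consensus scan that tolerates a configurable number of mismatches. The sketch assumes the usual TATAAT consensus of the bacterial -10 element; the function names and the default of one mismatch are illustrative choices and are not taken from the finder shown in the movies.

```python
PRIBNOW_CONSENSUS = "TATAAT"  # consensus of the bacterial -10 element


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_motif(seq, motif=PRIBNOW_CONSENSUS, max_mismatches=1):
    """Return (position, window, mismatches) for every near-match."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        distance = hamming(window, motif)
        if distance <= max_mismatches:
            hits.append((i, window, distance))
    return hits


if __name__ == "__main__":
    print(find_motif("GGCTATAATGCGTACAATGG"))
```

Allowing mismatches raises the number of hits quickly, so a real promoter predictor would also weigh position relative to the transcription start and the presence of other elements.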

Gene prediction programs
  • Rule-based programs
    • Use explicit set of rules to make decisions.
    • Example: GeneFinder
  • Neural Network-based programs
    • Use data set to build rules.
    • Examples: Grail, GrailEXP
  • Hidden Markov Model-based programs
    • Use probabilities of states and transitions between these states to predict features.
    • Examples: Genscan, GenomeScan
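The "states and transitions" idea can be made concrete with a minimal two-state hidden Markov model decoded by the Viterbi algorithm. All probabilities below are made-up illustrative values (a GC-richer "coding" state C and an AT-richer "non-coding" state N); this is not the model used by Genscan or GenomeScan, which is far richer.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy labels)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1},
         "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}


def viterbi(seq):
    """Return the most probable state path for seq as a string of N/C."""
    seq = seq.upper()
    if not seq:
        return ""
    # Log-probabilities avoid numeric underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]])
             for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("AAAAAAAAGGGGGGGG"))
```

The high self-transition probabilities make the model reluctant to switch state, so a short GC-rich stretch inside AT-rich sequence is not labeled as coding unless it is long enough to pay for two transitions.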
Evaluating prediction programs
  • Sensitivity vs. Specificity
  • Sensitivity
    • How many genes were found out of all present?
    • Sn = TP/(TP+FN)
  • Specificity
    • How many predicted genes are indeed genes?
    • Sp = TP/(TP+FP)
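The two formulas can be applied to a set of predicted genes and a set of true genes as in the sketch below. The gene identifiers and the function name `evaluate` are hypothetical. Note that Sp as defined on this slide, TP/(TP+FP), is the quantity called precision or positive predictive value in other fields.

```python
def evaluate(predicted, actual):
    """Return (Sn, Sp) using the slide's definitions.

    Sn = TP / (TP + FN): share of real genes that were found.
    Sp = TP / (TP + FP): share of predictions that are real genes.
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # predicted and real
    fp = len(predicted - actual)   # predicted but not real
    fn = len(actual - predicted)   # real but missed
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, sp


if __name__ == "__main__":
    real = {"g1", "g2", "g3", "g4"}
    found = {"g1", "g2", "g5"}
    sn, sp = evaluate(found, real)
    print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

In this example two of four real genes are found (Sn = 0.50) and two of three predictions are correct (Sp = 0.67).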
Gene prediction accuracies
  • Nucleotide level: 95% Sn, 90% Sp (lows less than 50%)
  • Exon level: 75% Sn, 68% Sp (lows less than 30%)
  • Gene level: 40% Sn, 30% Sp (lows less than 10%)
  • Programs that combine statistical evaluations with similarity searches are the most powerful.
Common difficulties
  • First and last exons are difficult to annotate because they contain UTRs.
  • Smaller genes often fail to reach statistical significance, so they are thrown out.
  • Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.
  • Masking repeats frequently removes potentially indicative chunks from the untranslated regions of genes that contain repetitive elements.
The annotation pipeline
  • Mask repeats using RepeatMasker.
  • Run sequence through several programs.
  • Take predicted genes and do similarity search against ESTs and genes from other organisms.
  • Do similarity search for non-coding sequences to find ncRNA.
Annotation nomenclature
  • Known Gene – Predicted gene matches the entire length of a known gene.
  • Putative Gene – Predicted gene contains region conserved with known gene. Also referred to as “like” or “similar to”.
  • Unknown Gene – Predicted gene matches a gene or EST of which the function is not known.
  • Hypothetical Gene – Predicted gene that does not contain significant similarity to any known gene or EST.
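The four labels can be read as a decision rule over the similarity evidence for a predicted gene. The sketch below is one possible reading: the three boolean inputs and the order in which they are checked are assumptions, and `classify_prediction` is a hypothetical name.

```python
def classify_prediction(full_length_match, conserved_region,
                        matches_uncharacterized):
    """Map similarity evidence to one of the four annotation labels.

    full_length_match: matches the entire length of a known gene.
    conserved_region: shares a conserved region with a known gene.
    matches_uncharacterized: matches a gene or EST of unknown function.
    The precedence (strongest evidence first) is an assumption.
    """
    if full_length_match:
        return "Known Gene"
    if conserved_region:
        return "Putative Gene"
    if matches_uncharacterized:
        return "Unknown Gene"
    return "Hypothetical Gene"


if __name__ == "__main__":
    print(classify_prediction(False, True, False))
```

Checking the strongest evidence first means a prediction that qualifies for several labels always receives the most informative one.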