Genes, Genomes, and Genomics

Bioinformatics in the Classroom

plagiarized from:

http://www.dnalc.org/bioinformatics/presentations/hhmi_2003/2003_3.ppt

June, 2003

Two. Again …

Francis Collins, HGP

Craig Venter, Celera Inc.

The value of genome sequences lies in their annotation
  • Annotation – Characterizing genomic features using computational and experimental methods
  • Genes: Four levels of annotation
    • Gene Prediction – Where are genes?
    • What do they look like?
    • Domains – What do the proteins do?
    • Role – Which pathway(s) are they involved in?
How many genes?
  • Consortium: 35,000 genes?
  • Celera: 30,000 genes?
  • Affymetrix: 60,000 human genes on GeneChips?
  • Incyte and HGS: over 120,000 genes?
  • GenBank: 49,000 unique gene coding sequences?
  • UniGene: > 89,000 clusters of unique ESTs?
Current consensus (in flux …)
  • 15,000 known genes (identified by similarity to previously isolated genes and expressed sequences from a large variety of organisms)
  • 17,000 predicted (GenScan, GeneFinder, GRAIL)
  • Based on and limited to previous knowledge
What are genes? - 1
  • Complete DNA segments responsible for making functional products
  • Products
    • Proteins
    • Functional RNA molecules
      • Interfering RNA (siRNA, miRNA; acts in RNAi)
      • rRNA (ribosomal RNA)
      • snRNA (small nuclear)
      • snoRNA (small nucleolar)
      • tRNA (transfer RNA)
What are genes? - 2
  • Definition vs. dynamic concept
  • Consider
    • Prokaryotic vs. eukaryotic gene models
    • Introns/exons
    • Posttranscriptional modifications
    • Alternative splicing
    • Differential expression
    • Genes-in-genes
    • Genes-ad-genes
    • Posttranslational modifications
    • Multi-subunit proteins
Prokaryotic gene model: ORF-genes
  • “Small” genomes, high gene density
    • Haemophilus influenzae genome: 85% genic
  • Operons
    • One transcript, many genes
  • No introns.
    • One gene, one protein
  • Open reading frames
    • One ORF per gene
    • ORFs begin with a start codon and end with a stop codon (by definition)

TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl

NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html

Eukaryotic gene model: spliced genes
  • Posttranscriptional modification
    • 5’-CAP, polyA tail, splicing
  • Open reading frames
    • Mature mRNA contains ORF
    • All internal exons contain an open “read-through” frame (no in-frame stop codon)
    • Pre-start and post-stop sequences are UTRs
  • Multiple translation products
    • One gene – many proteins via alternative splicing
Expansions and Clarifications
  • ORFs
    • Start – triplets – stop
    • Prokaryotes: gene = ORF
    • Eukaryotes: spliced genes or ORF genes
  • Exons
    • Remain after introns have been removed
    • Flanking parts contain non-coding sequence (5’- and 3’-UTRs)
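
The "start – triplets – stop" definition above can be sketched as a small scanner. This is only an illustration, not the algorithm of any tool named in these slides: it assumes ATG as the only start codon, the three standard stop codons, and the forward strand only. The function name `find_orfs` and the `min_codons` parameter are made up for this example.

```python
# Minimal ORF scan: start codon, in-frame triplets, stop codon.
# Assumptions (not from the slides): ATG is the only start codon,
# only the forward strand is scanned, and no genetic-code variants apply.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the index of the A in ATG, end is one past the stop codon,
    and min_codons counts the start and stop codons as well.
    Within a frame, only the first ATG after the previous stop is used.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


example = "CCATGGCTTGATAGATGAAATAA"
for start, end, frame in find_orfs(example):
    print(frame, start, end, example[start:end])
```

For a prokaryotic sequence each hit is a candidate gene. For a eukaryotic sequence the same scan only makes sense on the mature mRNA, after the introns have been removed.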
Where do genes live?
  • In genomes
  • Example: human genome
    • Ca. 3,200,000,000 base pairs
    • 25 chromosomes: 1-22, X, Y, mt
    • 28,000-45,000 genes (current estimate)
    • Gene length ranges from 128 nucleotides (an RNA gene) to 2,800 kb (DMD)
    • Ca. 25% of the genome consists of genes (introns and exons)
    • Ca. 1% of genome codes for amino acids (CDS)
    • 30 kb gene length (average)
    • 1.4 kb ORF length (average)
    • 3 transcripts per gene (average)
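
As a rough consistency check, the averages above can be multiplied out. The sketch below uses only the numbers on this slide. It assumes genes do not overlap and that the averages apply across the whole gene-count range, so it is back-of-the-envelope arithmetic rather than a measurement. The helper name `fraction_of_genome` is made up for this example.

```python
# Back-of-the-envelope check of the slide's figures.
# Assumption: genes do not overlap and the averages hold for every gene.
GENOME_BP = 3_200_000_000          # ca. 3.2 billion base pairs
GENE_COUNT_RANGE = (28_000, 45_000)
AVG_GENE_BP = 30_000               # 30 kb average gene length
AVG_ORF_BP = 1_400                 # 1.4 kb average ORF length


def fraction_of_genome(count, avg_len_bp, genome_bp=GENOME_BP):
    """Fraction of the genome covered by `count` features of average length."""
    return count * avg_len_bp / genome_bp


genic = [fraction_of_genome(n, AVG_GENE_BP) for n in GENE_COUNT_RANGE]
coding = [fraction_of_genome(n, AVG_ORF_BP) for n in GENE_COUNT_RANGE]

print("genic fraction:  %.3f to %.3f" % tuple(genic))
print("coding fraction: %.4f to %.4f" % tuple(coding))
```

The genic fraction comes out at roughly 26% to 42% and the coding fraction at about 1.2% to 2.0%. The low end of the gene-count estimate is the one that agrees with the slide's "ca. 25%" and "ca. 1%".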
Sample genomes

List of 68 eukaryotes, 141 bacteria, and 17 archaea at

http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html

Genomic sequence features
  • Repeats (“Junk DNA”)
    • Transposable elements, simple repeats
    • RepeatMasker
  • Genes
    • Vary in density, length, structure
    • Identification depends on evidence and methods and may require concerted application of bioinformatics methods and lab research
  • Pseudogenes
    • Look-alikes of genes that obstruct gene-finding efforts.
  • Non-coding RNAs (ncRNA)
    • tRNA, rRNA, snRNA, snoRNA, miRNA
    • tRNAscan-SE, COVE
Gene identification
  • Homology-based gene prediction
    • Similarity Searches (e.g. BLAST, BLAT)
    • Genome Browsers
    • RNA evidence (ESTs)
  • Ab initio gene prediction
    • Gene prediction programs
    • Prokaryotes
      • ORF identification
    • Eukaryotes
      • Promoter prediction
      • PolyA-signal prediction
      • Splice site, start/stop-codon predictions
Gene prediction through comparative genomics
  • Highly similar (conserved) regions between two genomes are probably functional; otherwise they would have diverged
  • If genomes are too closely related all regions are similar, not just genes
  • If genomes are too far apart, analogous regions may be too dissimilar to be found
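
The idea of looking for conserved regions can be sketched as percent identity in sliding windows over two sequences that have already been aligned. This is a toy illustration under stated assumptions: the alignment is given (equal-length strings, `-` for gaps), and the window size, step and threshold are arbitrary values chosen for the example. The function names are hypothetical.

```python
def window_identity(aln_a, aln_b, window=10, step=5):
    """Fraction of identical, non-gap columns in each sliding window.

    aln_a and aln_b must be rows of the same alignment (equal length).
    Returns a list of (window_start, identity) pairs.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        result.append((start, matches / window))
    return result


def conserved_windows(aln_a, aln_b, threshold=0.8, window=10, step=5):
    """Start positions of windows whose identity reaches the threshold."""
    return [start
            for start, identity in window_identity(aln_a, aln_b, window, step)
            if identity >= threshold]


# First 10 columns identical, last 10 columns completely different.
seq_a = "ATGGCCAAGT" + "TTTTTTTTTT"
seq_b = "ATGGCCAAGT" + "ACGACGACGA"
print(window_identity(seq_a, seq_b))
print(conserved_windows(seq_a, seq_b))
```

With two very closely related genomes nearly every window would pass the threshold, and with very distant genomes almost none would, which is the trade-off the last two bullets describe.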
Genome Browsers

NCBI Map Viewer

www.ncbi.nlm.nih.gov/mapview/

Generic Genome Browser (CSHL)

www.wormbase.org/db/seq/gbrowse

Ensembl Genome Browser

www.ensembl.org/

UCSC Genome Browser

genome.ucsc.edu/cgi-bin/hgGateway?org=human

Apollo Genome Browser

www.bdgp.org/annot/apollo/

Gene discovery using ESTs
  • Expressed Sequence Tags (ESTs) represent sequences from expressed genes.
  • If a region matches an EST with high stringency, then the region is probably a gene or pseudogene.
    • An EST that overlaps an exon boundary gives an accurate prediction of that boundary.
Ab initio gene prediction
  • Prokaryotes
    • ORF-Detectors
  • Eukaryotes
    • Position, extent & direction: through promoter and polyA-signal predictors
    • Structure: through splice site predictors
    • Exact location of coding sequences: through determination of relationships between potential start codons, splice sites, ORFs, and stop codons
Tools
  • ORF detectors
    • NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
  • Promoter predictors
    • CSHL: http://rulai.cshl.org/software/index1.htm
    • BDGP: fruitfly.org/seq_tools/promoter.html
    • ICG: TATA-Box predictor
  • PolyA signal predictors
    • CSHL: argon.cshl.org/tabaska/polyadq_form.html
  • Splice site predictors
    • BDGP: http://www.fruitfly.org/seq_tools/splice.html
  • Start-/stop-codon identifiers
    • DNALC: Translator/ORF-Finder
    • BCM: Searchlauncher
How it works I – Motif identification

Exon-Intron Borders = Splice Sites

Exon | Intron | Exon

~~gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact~~

~~ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg~~

~~tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca~~

~~ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg~~

Splice site Splice site

Exon | Intron | Exon

~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~

~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~

~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~

~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~

Splice site Splice site

Motif Extraction Programs at http://www-btls.jst.go.jp/
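
The step from the raw alignment to the highlighted GT and AG can be sketched by counting bases column by column. The sequences below are copied from the four examples above, split into the donor side (exon|intron) and the acceptor side (intron|exon). The function names and the `min_fraction` parameter are made up for this illustration, and four sequences are far too few for a real motif model.

```python
from collections import Counter

# Donor side: 10 exon bases, '|', first 10 intron bases (from the slide).
DONORS = [
    "gaggcatcag|gtttgtagac",
    "ccgccgctga|gtgagccgtg",
    "tgtgaattag|gtaagaggtt",
    "ccatgaggag|gtgagtgcca",
]
# Acceptor side: last 10 intron bases, '|', 10 exon bases (from the slide).
ACCEPTORS = [
    "tgtgtttcag|tgcacccact",
    "tctattctag|gacgcgcggg",
    "atatctccag|atggagatca",
    "ttatttccag|gtatgagacg",
]


def column_counts(seqs):
    """Count the bases in each alignment column (a position count matrix)."""
    return [Counter(column) for column in zip(*seqs)]


def consensus(seqs, min_fraction=1.0):
    """Uppercase base where at least min_fraction of sequences agree, else '.'."""
    out = []
    for counts in column_counts(seqs):
        base, n = counts.most_common(1)[0]
        out.append(base.upper() if n / len(seqs) >= min_fraction else ".")
    return "".join(out)


print(consensus(DONORS))
print(consensus(ACCEPTORS))
```

On these four examples the GT at the start of the intron and the AG at its end come out as fully conserved. A couple of other columns also agree in such a small sample, which is why real motif extraction works from many more sequences and from frequencies rather than perfect agreement.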

How it works II - Movies

Pribnow-Box Finder 0/1

Pribnow-Box Finder all

Gene prediction programs
  • Rule-based programs
    • Use explicit set of rules to make decisions.
    • Example: GeneFinder
  • Neural Network-based programs
    • Use data set to build rules.
    • Examples: Grail, GrailEXP
  • Hidden Markov Model-based programs
    • Use probabilities of states and transitions between these states to predict features.
    • Examples: Genscan, GenomeScan
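
To make "probabilities of states and transitions" concrete, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. It is not the model used by Genscan or GenomeScan, which have many more states. The two states, every probability, and the GC-rich-means-coding simplification are invented purely for illustration.

```python
import math

# Toy model. All numbers are invented for illustration only.
STATES = ("N", "C")                      # N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1},      # states tend to persist
         "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},   # AT-rich
        "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}   # GC-rich


def viterbi(seq):
    """Most probable state path for seq under the toy model (log space)."""
    if not seq:
        return ""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]])
              for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state to arrive from when entering state s.
            prev = max(STATES,
                       key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


sequence = "AT" * 3 + "GC" * 4 + "AT" * 3
print(sequence)
print(viterbi(sequence))
```

The printed path labels the GC-rich middle as `C` and the AT-rich flanks as `N`. In a real gene finder the states would represent features such as exons, introns and intergenic sequence, and the probabilities would be estimated from known genes.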
Evaluating prediction programs
  • Sensitivity vs. Specificity
  • Sensitivity
    • How many genes were found out of all present?
    • Sn = TP/(TP+FN)
  • Specificity
    • How many predicted genes are indeed genes?
    • Sp = TP/(TP+FP)
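
The two formulas can be applied directly once predictions and the true annotation are expressed as sets, for example sets of gene names or of nucleotide positions. This is a minimal sketch with made-up function names and example data. Note that Sp as defined on this slide, TP/(TP+FP), is the quantity called precision in other fields.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN): share of real features that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tp, fp):
    """Sp = TP / (TP + FP), as defined on the slide: share of predictions that are real."""
    return tp / (tp + fp) if tp + fp else 0.0


def evaluate(predicted, actual):
    """Return (Sn, Sp) for two collections of comparable items."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)     # predicted and real
    fp = len(predicted - actual)     # predicted but not real
    fn = len(actual - predicted)     # real but missed
    return sensitivity(tp, fn), specificity(tp, fp)


# Nucleotide-level example with invented coordinates:
# the true exon covers positions 100-199, the prediction covers 120-219.
sn, sp = evaluate(range(120, 220), range(100, 200))
print(sn, sp)
```

The same function works at the exon or gene level by passing sets of exon coordinates or gene identifiers instead of positions.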
Gene prediction accuracies
  • Nucleotide level: 95% Sn, 90% Sp (lows below 50%)
  • Exon level: 75% Sn, 68% Sp (lows below 30%)
  • Gene level: 40% Sn, 30% Sp (lows below 10%)
  • Programs that combine statistical evaluations with similarity searches are the most powerful.
Common difficulties
  • First and last exons are difficult to annotate because they contain UTRs.
  • Small genes often fail to reach statistical significance, so they are discarded.
  • Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.
  • Masking repeats frequently removes potentially indicative chunks from the untranslated regions of genes that contain repetitive elements.
The annotation pipeline
  • Mask repeats using RepeatMasker.
  • Run sequence through several programs.
  • Take predicted genes and do similarity search against ESTs and genes from other organisms.
  • Do similarity search for non-coding sequences to find ncRNA.
Annotation nomenclature
  • Known Gene – Predicted gene matches the entire length of a known gene.
  • Putative Gene – Predicted gene contains region conserved with known gene. Also referred to as “like” or “similar to”.
  • Unknown Gene – Predicted gene matches a gene or EST of which the function is not known.
  • Hypothetical Gene – Predicted gene that does not contain significant similarity to any known gene or EST.
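
The four labels can be read as a small decision rule over the best similarity-search hit for a predicted gene. The sketch below is one possible reading of the definitions above, not an established standard. The `BestHit` record, its fields, and the 0.95 full-length cutoff are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestHit:
    """Best significant hit for a predicted gene (hypothetical record)."""
    coverage: float        # fraction of the matched gene's length that aligns
    function_known: bool   # whether the matched gene or EST has a known function


def classify(hit: Optional[BestHit], full_length: float = 0.95) -> str:
    """Map a predicted gene's best hit to one of the four labels."""
    if hit is None:
        return "hypothetical"      # no significant similarity to any gene or EST
    if not hit.function_known:
        return "unknown"           # matches something whose function is unknown
    if hit.coverage >= full_length:
        return "known"             # matches a known gene over its entire length
    return "putative"              # only a conserved region matches


print(classify(BestHit(coverage=0.99, function_known=True)))
print(classify(BestHit(coverage=0.40, function_known=True)))
print(classify(BestHit(coverage=0.80, function_known=False)))
print(classify(None))
```

The four calls print the labels known, putative, unknown and hypothetical, in that order.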