Gene Prediction and Annotation techniques Basics


Chuong Huynh
NIH/NLM/NCBI
Sept 30, 2004
[email protected]
Acknowledgement: Daniel Lawson, Neil Hall




GATCGGTCGAGCGTAAGCTAGCTAG

ATCGATGATCGATCGGCCATATATC

ACTAGAGCTAGAATCGATAATCGAT

CGATATAGCTATAGCTATAGCCTAT

What is gene prediction?

Detecting meaningful signals in uncharacterised DNA sequences.

Knowledge of the interesting information in DNA.

Sorting the ‘chaff from the wheat’

  • Gene prediction is ‘recognising protein-coding regions in genomic sequence’
Basic Gene Prediction Flow Chart

Obtain new genomic DNA sequence

1. Translate in all six reading frames and compare to protein sequence databases

2. Perform database similarity search of expressed sequence tag (EST) database of same organism, or cDNA sequences if available

Use gene prediction program to locate genes

Analyze regulatory sequences in the gene

Why is gene prediction important?
  • Increased volume of genome data generated
  • Paradigm shift from gene by gene sequencing (small scale) to large-scale genome sequencing.
  • No more one gene at a time. A lot of data.
  • Foundation for all further investigation. Knowledge of the protein-coding regions underpins functional genomics.

Note: this presentation covers the prediction of protein-coding genes only. It does not cover prediction of promoters or other sequences that regulate the activity of protein-coding genes.

Map Viewer

[Figure: Map Viewer display with tracks for Genome Scan models, Contig, GenBank, Genes, Mouse EST hits and Human EST hits]

Knowing what to look for

[Figure labels: Start, Middle, End]

What is a gene?

Not a full transcript with control regions

The coding sequence (ATG -> STOP)

ORF Finding in Prokaryotes
  • The simplest method of finding DNA sequences that encode proteins is to search for open reading frames (ORFs)
  • An ORF is a DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid
  • Six possible reading frames
  • Good for prokaryotic systems (no/little post-transcriptional processing, so genes are not interrupted by introns)
  • Runs from Met (AUG) on mRNA to a stop codon TER (UAA, UAG, UGA)
  • NCBI ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/
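
The ORF search described above can be sketched in a few lines. This is a minimal illustration, not the NCBI ORF Finder implementation: the function names, the `min_codons` cutoff and the half-open coordinate convention are assumptions made for the example, and the sequence is handled as DNA (ATG/TAA/TAG/TGA) rather than mRNA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns tuples (strand, frame, start, end, orf_sequence). start/end are
    0-based, half-open offsets on the strand that was scanned; the stop codon
    is included in the ORF and in the codon count.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG after the previous stop opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ATG in this frame
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGGG")` reports one ORF on the forward strand in frame 2. Real ORF finders add options such as alternative start codons and genetic codes, which this sketch omits.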
Annotation of eukaryotic genomes

Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) -> Mature mRNA (Gm3 ... AAAAAAA) -> (translation) -> Nascent polypeptide -> (folding) -> Active enzyme -> Function (Reactant A -> Product B)

  • ab initio gene prediction (w/o prior knowledge)
  • Comparative gene prediction (use other biological data)
  • Functional identification

Two Classes of Sequence Information
  • Signal Terms – short sequence motifs (such as splice sites, branch points, polypyrimidine tracts, start codons, and stop codons)
  • Content Terms – patterns of codon usage that are unique to a species and allow coding sequences to be distinguished from surrounding noncoding sequences by a statistical detection algorithm
Problem Using Codon Usage
  • Program must be taught what the codon usage patterns look like by presenting the program with a TRAINING SET of known coding sequences.
  • Different programs search for different patterns.
  • A NEW training set is needed for each species
  • Untranslated regions (UTR) at the ends of the genes cannot be detected, but most programs can identify polyadenylation sites
  • Non-protein-coding RNA genes cannot be detected (a few specialized programs attempt detection)
  • None of these programs can detect alternatively spliced transcripts
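
To make the idea of a training set concrete, the sketch below learns codon frequencies from known coding sequences and scores a window by its log-odds against a uniform background. It is an assumed, simplified content measure for illustration only: the function names, the pseudocount and the uniform 1/64 background are not taken from any of the programs discussed here.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(training_cds, pseudocount=1.0):
    """Estimate codon frequencies from a training set of in-frame CDS strings."""
    counts = Counter()
    for cds in training_cds:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            counts[cds[i:i + 3]] += 1
    # The pseudocount keeps unseen codons from getting probability zero.
    total = sum(counts[c] + pseudocount for c in ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(window, freqs):
    """Sum of log-odds (trained frequency vs. uniform 1/64) over in-frame codons."""
    window = window.upper()
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        if codon in freqs:  # skip codons containing ambiguous bases
            score += math.log(freqs[codon] * 64)
    return score
```

A window that reuses the codons favoured in the training set scores above zero, and one built from rarely used codons scores below zero. Because the frequencies come entirely from the training set, a model trained on one species transfers poorly to another, which is why a new training set is needed for each species.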
Gene finding: Issues
  • Issues regarding gene finding in general
    • Genome size (larger genome ~ more genes, but …)
    • Genome composition
    • Genome complexity (more complexity -> less coding density; fewer genes per kb)
    • cis-splicing (processing mRNA in eukaryotes)
    • trans-splicing (in kinetoplastids)
    • alternate splicing (e.g. in different tissues; higher organisms)
    • Variation of genetic code from the universal code
Gene finding: genome
  • Genome composition
    • Long ORFs tend to be coding
    • GC-rich genomes contain more putative ORFs, because the stop codons (UAA, UAG & UGA) are AT-rich and occur less often by chance
  • Genome complexity
    • Simple repetitive sequences (e.g. dinucleotide) and dispersed repeats tend to be non-coding
    • May need to mask sequence prior to gene prediction
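
The effect of base composition on chance ORFs can be estimated with a back-of-the-envelope calculation. The sketch below assumes bases are independent, with A and T equally frequent and G and C equally frequent. That is a simplification for illustration, not a property of real genomes, and the function names are hypothetical.

```python
def stop_codon_probability(gc):
    """Chance that a random codon is TAA, TAG or TGA at GC fraction gc."""
    at = (1.0 - gc) / 2.0  # probability of A (and of T)
    g = gc / 2.0           # probability of G (and of C)
    # P(TAA) = at^3; P(TAG) = P(TGA) = at^2 * g
    return at ** 3 + 2.0 * at * at * g


def expected_random_orf_codons(gc):
    """Mean number of codons up to and including the next stop in random sequence."""
    return 1.0 / stop_codon_probability(gc)
```

Under this model a stop codon turns up by chance about once every 21 codons at 50% GC, and noticeably less often as GC content rises. GC-rich genomes therefore contain more long ORFs that are not genes.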
Gene finding: coding density

As the coding/non-coding length ratio decreases, exon prediction becomes more complex

[Figure: coding density compared across Human, Fugu, worm and E. coli]

Gene finding: splicing
  • cis-splicing of genes
    • Finding multiple (short) exons is harder than finding a single (long) exon.
  • trans-splicing of genes
    • A trans-splice acceptor is no different to a normal splice acceptor

[Figure labels: worm, E. coli]

Gene finding: alternate splicing
  • Alternatively spliced isoforms are very difficult to predict.

[Figure labels: Human A, Human B, Human C]


GATCGGTCGAGCGTAAGCTAGCTAG

ATCGATGATCGATCGGCCATATATC

ACTAGAGCTAGAATCGATAATCGAT

CGATATAGCTATAGCTATAGCCTAT

ab initio prediction

What is ab initio gene prediction?

  • Prediction from first principles using the raw DNA sequence only.
  • Requires ‘training sets’ of known gene structures to generate statistical tests for the likelihood of a prediction being real.
Gene finding: ab initio
  • What features of an ORF can we use?
    • Size - large open reading frames
    • DNA composition - codon usage / 3rd position codon bias
    • Kozak sequence CCGCCAUGG
    • Ribosome binding sites
    • Termination signal (stops)
    • Splice junction boundaries (acceptor/donor)
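
As one example of turning a signal into a score, the sketch below counts how many of the six flanking positions around each ATG agree with the Kozak consensus quoted above. It is a deliberately crude match count with hypothetical names. Gene finders generally weight each position separately instead of counting exact matches.

```python
KOZAK = "CCGCCATGG"  # DNA form of the consensus CCGCCAUGG
ATG_OFFSET = 5       # index of the A of ATG within the consensus


def kozak_matches(seq):
    """Return (atg_position, flank_matches) for every ATG with full flanks.

    flank_matches counts agreement with the consensus at the six positions
    outside the ATG itself (five upstream, one downstream), so it runs 0..6.
    """
    seq = seq.upper().replace("U", "T")
    hits = []
    pos = seq.find("ATG")
    while pos != -1:
        start = pos - ATG_OFFSET
        window = seq[start:start + len(KOZAK)] if start >= 0 else ""
        if len(window) == len(KOZAK):
            flank = sum(
                1
                for i, (base, cons) in enumerate(zip(window, KOZAK))
                if base == cons and not ATG_OFFSET <= i < ATG_OFFSET + 3
            )
            hits.append((pos, flank))
        pos = seq.find("ATG", pos + 1)
    return hits
```

An ATG with a perfect consensus context scores 6, while an ATG in an unrelated context scores near 0. ATGs too close to either end of the sequence are skipped.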
Gene finding: features

Think of a CDS gene prediction as a linear series of sequence features:

Initiation codon
Coding sequence (exon)
N times:
    Splice donor (5’)
    Non-coding sequence (intron)
    Splice acceptor (3’)
    Coding sequence (exon)
Termination codon
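
The linear feature series above maps naturally onto a consistency check for a candidate gene model. The sketch below is an assumed, forward-strand-only illustration. The coordinate convention (0-based, half-open exon intervals), the function names and the canonical GT...AG intron rule are choices made for the example, not taken from the slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice_cds(genomic, exons):
    """Join exon intervals (0-based, half-open, sorted) into one CDS string."""
    return "".join(genomic[start:end] for start, end in exons).upper()


def check_gene_model(genomic, exons):
    """Return a list of problems found in a forward-strand gene model."""
    problems = []
    cds = splice_cds(genomic, exons)
    if not cds.startswith("ATG"):
        problems.append("no initiation codon")
    if len(cds) % 3:
        problems.append("length not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no termination codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("in-frame stop codon")
    # Each intron lies between consecutive exons: donor at its 5' end,
    # acceptor at its 3' end (canonical GT...AG assumed here).
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron = genomic[end1:start2].upper()
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"intron {end1}-{start2} lacks GT...AG")
    return problems
```

A model that passes returns an empty list. Shifting an exon boundary by one base typically triggers several messages at once, which matches the way a single wrong splice site breaks the frame of everything downstream.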

A model ab initio predictor
  • Locate and score all sequence features used in gene models
  • Use dynamic programming to build the highest-scoring model from the available features.
    • e.g. Genefinder (Green)
  • Run the sequence 5’->3’ through a Markov model based on a typical gene model
    • e.g. TBparse (Krogh), GENSCAN (Burge) or GLIMMER (Salzberg)
  • Run the sequence 5’->3’ through a neural net trained with confirmed gene models
    • e.g. GRAIL (Oak Ridge)
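
The dynamic programming step can be illustrated with its simplest form: given scored candidate exons, pick the non-overlapping set with the highest total score. This sketch is not Genefinder's algorithm. It ignores reading-frame compatibility and splice-site scores, and the names and the half-open interval convention are assumptions for the example.

```python
from bisect import bisect_right


def best_exon_chain(candidates):
    """Pick the highest-scoring set of non-overlapping candidate exons.

    candidates: iterable of (start, end, score) with half-open coordinates.
    Returns (total_score, chain) where chain is ordered 5'->3'.
    """
    exons = sorted(candidates, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that finish at or before this start
        j = bisect_right(ends, start, 0, i)
        take = best[j][0] + score
        if take > best[i][0]:
            best.append((take, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]
```

Each candidate is either taken, together with the best chain that ends before it starts, or skipped, so the table is filled in a single 5'->3' pass.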
Ab initio Gene finding programs
  • Most gene finding software packages use some variant of Hidden Markov Models (HMM).
  • Predict coding, intergenic, and intron sequences
  • Need to be trained on a specific organism.
  • Never perfect!
What is an HMM?
  • A statistical model that represents a gene.
  • Similar to a “weight matrix” that can recognise gaps and treat them in a systematic way.
  • Has different “states” that represent introns, exons, and intergenic regions.
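
A toy example may help show how hidden states label a sequence. The sketch below runs the Viterbi algorithm over a two-state model, intergenic (`i`) and exon (`e`). All probabilities are invented for illustration, including the assumption that exons are GC-rich and intergenic DNA is AT-rich. Real gene-finding HMMs have many more states and are trained on known genes.

```python
import math

# Toy parameters, made up for this example only.
STATES = ("i", "e")  # i = intergenic, e = exon
START = {"i": 0.5, "e": 0.5}
TRANS = {"i": {"i": 0.9, "e": 0.1}, "e": {"i": 0.1, "e": 0.9}}
EMIT = {
    "i": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "e": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(seq, states=STATES, start_p=START, trans_p=TRANS, emit_p=EMIT):
    """Return the most probable state path for a non-empty sequence."""
    # Work in log space to avoid underflow on long sequences.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}
    backpointers = []
    for symbol in seq[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][symbol])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

With these parameters, `viterbi("ATATGCGCGCATAT")` labels the GC-rich core as exon and the AT-rich flanks as intergenic. The transition probabilities discourage switching state on a single odd base, which is the systematic treatment of boundaries that a plain weight matrix lacks.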
Malaria Gene Prediction Tool
  • Hexamer – ftp://ftp.sanger.ac.uk/pub/pathogens/software/hexamer/
  • Genefinder – email [email protected]
  • GlimmerM – http://www.tigr.org/softlab/glimmerm
  • Phat – http://www.stat.berkeley.edu/users/scawley/Phat
  • Already trained for malaria! The more experimentally derived genes are used to train a gene prediction tool, the more reliable the predictor.
GlimmerM: Salzberg et al. (1999) Genomics 59: 24-31
  • Adaptation of the prokaryotic genefinder Glimmer: Delcher et al. (1999) NAR 27: 4636-4641
  • Based on an interpolated HMM (IHMM).
  • Only uses short chains of bases (Markov chains) to generate probabilities.
  • Trained identically to Phat
An end to ab initio prediction
  • ab initio gene prediction is inaccurate
  • Most predictors have high false positive rates but low false negative rates
  • Incorporating similarity information is meant to reduce the false positive rate, but at the same time it increases the false negative rate.
  • Biggest determinant of false positive/negative is gene size.
  • Exon prediction sensitivity can be good
  • Rarely used as a final product
    • Human annotation runs multiple algorithms and scores exons predicted by multiple predictors.
    • Used as a starting point for refinement/verification
  • Predictions need correction and validation
  • Why not just build gene models by comparative means?
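
False positive and false negative rates are easiest to reason about with a concrete measure. The sketch below scores predictions at the exon level, counting an exon as correct only when both boundaries match the reference exactly. This strict criterion, the function name and the returned keys are assumptions for illustration. The second measure is written as precision here, although gene-finding evaluations often report the same quantity under the name specificity.

```python
def exon_accuracy(predicted, annotated):
    """Compare predicted and reference exons given as (start, end) pairs."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)  # exons predicted with both ends correct
    fp = len(predicted - annotated)  # predicted exons with no exact match
    fn = len(annotated - predicted)  # reference exons that were missed
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        # sensitivity: fraction of real exons that were found
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # precision: fraction of predicted exons that are real
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }
```

A predictor that over-calls exons keeps sensitivity high at the cost of precision. Filtering predictions by similarity evidence pushes the balance the other way.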
Annotation of eukaryotic genomes

Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) -> Mature mRNA (Gm3 ... AAAAAAA) -> (translation) -> Nascent polypeptide -> (folding) -> Active enzyme -> Function (Reactant A -> Product B)

  • ab initio gene prediction (w/o prior knowledge)
  • Comparative gene prediction (use other biological data)
  • Functional identification

If a cell was human?
  • The cell ‘knows’ how to splice a gene together.
  • We know some of these signals but not all and not all of the time
  • So compare with known examples from the species and others

Central dogma for molecular biology: DNA (Genome) -> RNA (Transcriptome) -> Protein (Proteome)

When a human looks at a cell
  • DNA: extract DNA and sequence genome
  • RNA: extract RNA, reverse transcribe and sequence cDNA
  • Protein: peptide sequence inferred from gene prediction
  • Compare with the rest of the genome/transcriptome/proteome data
Comparative gene prediction
  • Use knowledge of known coding sequences to identify coding regions of genomic DNA by similarity
    • transcriptome - transcribed DNA sequence
    • proteome - peptide sequence
    • genome - related genomic sequence
Transcript-based prediction: datasets
  • Generation of large numbers of Expressed Sequence Tags (ESTs)
    • Quick, cheap but random
    • Subtractive hybridisation to find rare transcripts
    • Use multiple libraries for different life-stages/conditions
    • Single-pass sequence prone to errors
  • Generation of small number of full length cDNA sequences
    • Slow and laborious but focused
  • Large-scale sequencing of (presumed) full length cDNAs
    • Systematic, multiplexed cloning/sequencing of CDS
    • Expensive and only viable if part of bigger project
Gene Prediction in Eukaryotes – Simplified
  • For highly conserved proteins:
    • Translate DNA sequence in all 6 reading frames
    • BLASTX or FASTX to compare the sequence to a protein sequence database
    • Or
    • Compare a protein against a nucleic acid database, including genomic sequence, that is translated in all six possible reading frames by the TBLASTN or TFASTX/TFASTY programs.
  • Note: Approximation of the gene structure only.
Transcript-based prediction: How it works
  • Align transcript data to genomic sequence using a pair-wise sequence comparison

[Figure: gene model aligned with EST and cDNA evidence]

Transcript-based gene prediction: algorithm
  • BLAST (Altschul) (36 hours)
    • Widely used and understood
    • HSPs often have ‘ragged’ ends that extend into the introns
  • EST_GENOME (Mott) (3 days)
    • Dynamic programming post-process of BLAST
    • Slow and sometimes cryptic
  • BLAT (Kent) (1/2 hour)
    • Next generation of alignment algorithm
    • Designed for looking at nearly identical sequences
    • Faster and more accurate than BLAST
Peptide-based gene prediction: algorithm
  • BLAST (Altschul)
    • Widely used and understood
  • Smith-Waterman
    • Preliminary to further processing
  • Used in preference to DNA-based similarities for evolutionarily diverged species, as peptide conservation is significantly higher than nucleotide conservation
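
For readers who have not seen Smith-Waterman written out, a compact version follows. It is a sketch under simplifying assumptions: a flat match/mismatch score stands in for a proper amino-acid substitution matrix, the gap penalty is linear, and the parameter values and function name are arbitrary choices for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b: returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Given two peptides that share a core segment but differ at both ends, the function returns just the shared segment. That is the behaviour needed when a known protein matches only part of a translated genomic region.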
Genomic-based gene prediction: algorithm
  • BLAST (Altschul)
    • Can be used in TBLASTX mode
  • BLAT (Kent)
    • Can be used in a translated DNA vs translated DNA mode
    • Significantly faster than BLAST
  • WABA (Kent)
    • Designed to allow for 3rd position codon wobble
    • Slow with some outstanding problems
    • Only really used in C.elegans v C.briggsae analysis
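
Third-position wobble is easy to see by measuring identity separately at each codon position. The sketch below does only that, for two coding sequences assumed to be already aligned, in frame and gap-free. It is not the WABA algorithm, and the function name is hypothetical.

```python
def identity_by_codon_position(cds_a, cds_b):
    """Fraction of identical bases at codon positions 1, 2 and 3.

    Both inputs must be in-frame coding sequences of equal length that are
    already aligned without gaps.
    """
    if len(cds_a) != len(cds_b):
        raise ValueError("sequences must have the same length")
    matches = [0, 0, 0]
    totals = [0, 0, 0]
    for i, (x, y) in enumerate(zip(cds_a.upper(), cds_b.upper())):
        totals[i % 3] += 1
        matches[i % 3] += x == y
    return [m / t if t else 0.0 for m, t in zip(matches, totals)]
```

Orthologous coding regions often show much lower identity at position 3 than at positions 1 and 2, because many third-position changes leave the encoded amino acid unchanged. An aligner that tolerates mismatches at every third base can therefore follow coding sequence that a plain nucleotide comparison would lose.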
Comparative gene predictors
  • This can be viewed as an extension of the ab initio prediction tools – where coding exons are defined by similarities and not codon bias
    • GAZE (Howe) is an extension of Phil Green’s Genefinder in which transcript data is used to define coding exons. Other features are scored as in the original Genefinder implementation. This is being evaluated and used in the C.elegans project.
    • GENEWISE (Birney) is an HMM-based gene predictor which attempts to predict the closest CDS to a supplied peptide sequence. This is the workhorse predictor for the ENSEMBL project.
Comparative gene predictors
  • A new generation of comparative gene prediction tools is being developed to utilise the large amount of genomic sequence available.
    • Twinscan (WashU) attempts to predict genes using related genomic sequences.
    • Doublescan (Sanger) is an HMM-based gene predictor which attempts to predict two orthologous CDSs from genomic regions pre-defined as matching.
  • Both of these predictors are in development and will be used for the C.elegans v C.briggsae match and the Mouse v Human match later this year.
Summary
  • Genes are complex structures that are difficult to predict with the required level of accuracy/confidence
  • We can predict stops better than starts
  • We can only give gross confidence levels to predictions (i.e. confirmed, partially confirmed or predicted)
  • Gene prediction is only part of the annotation procedure
  • Movement from ab initio to comparative methodology as sequence data becomes available/affordable
  • Curation of gene models is an active process – the set of gene models for a genome is fluid and WILL change over time.
The Annotation Process

[Figure: DNA sequence is run through analysis software and reviewed by an annotator to yield useful information]

Annotation Process

[Figure: annotation workflow. DNA sequence is analysed with gene finders, Blastn, Blastx, Halfwise, tRNA scan and RepeatMasker to find repeats, promoters, rRNA, pseudo-genes, genes and tRNA; genes are then analysed with Fasta, BlastP, Pfam, Prosite, Psort, SignalP and TMHMM]
Artemis
  • Artemis is a free DNA sequence viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation.
  • http://www.sanger.ac.uk/Software/Artemis/
atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt

tttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatca

tttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccg

cagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatat

ataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattattt

atatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaaca

tacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattagg

agatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaa

ttgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatat

tatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaa

ttcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaa

taatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcat

taaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatat

atatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgt

attattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattacta

ccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaata

tatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaaga

atttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatata

tatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtattt

ataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg

taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattc

aaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataa

taaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcata

tctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaat

tctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgtt

ttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaa

tgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttt

tttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagtta

agcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaata

aagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaa

ttcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatct

ataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa

cacataacaattataatgacatatcaaataataataataataataataatattaatggggtgaaagaccatataaataataacactctggaaaataatga

tgaaccaatcttatctatatataatgaagatcttaatgttttatatatatgccaaaatatgtataacgtcctttttgttttgaatttaaataacctaagt

DNA in Artemis

[Figure: Artemis view showing GC content, forward translations, reverse translations (black bar = stop codon), and DNA and amino acids]