gene prediction
Download
Skip this Video
Download Presentation
Gene prediction

Loading in 2 Seconds...

play fullscreen
1 / 64

Gene prediction - PowerPoint PPT Presentation


  • 86 Views
  • Uploaded on

Gene prediction. Gene Prediction: Computational Challenge.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Gene prediction' - kasie


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
gene prediction computational challenge
Gene Prediction: Computational Challenge

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

gene prediction computational challenge1
Gene Prediction: Computational Challenge

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

gene prediction computational challenge2
Gene Prediction: Computational Challenge

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene!

gene prediction analogy
Gene Prediction Analogy
  • Newspaper written in unknown language
    • Certain pages contain encoded message, say 99 letters on page 7, 30 on page 12 and 63 on page 15.
  • How do you recognize the message? You could probably distinguish between the ads and the story (ads contain the “$” sign often)
  • Statistics-based approach to Gene Prediction tries to make similar distinctions between exons and introns.
slide6
Statistical Approach: Metaphor in Unknown Language

Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and numerical symbols could you distinguish between a story and the stock report in a foreign newspaper?

two approaches to gene prediction
Two Approaches to Gene Prediction
  • Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns).
  • Similarity-based: many human genes are similar to genes in mice, chicken, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.
slide8
Similarity-Based Approach: Metaphor in Different Languages

If you could compare the day’s news in English, side-by-side to the same news in a foreign language, some similarities may become apparent

annotation of genomic sequence
Annotation of Genomic Sequence
  • Given the sequence of an organism’s genome, we would like to be able to identify:
    • Genes
    • Exon boundaries & splice sites
    • Beginning and end of translation
    • Alternative splicings
    • Regulatory elements (e.g. promoters)
  • The only certain way to do this is experimentally,

but it is time consuming and expensive. Computational methods can achieve reasonable accuracy quickly, and help direct experimental approaches.

primary goals

secondarygoals

slide10
Prokaryotic Gene Structure

Promoter CDS Terminator

Genomic DNA

transcription

mRNA

  • Most bacterial promoters contain the Shine-Delgarno signal, at about -10 that has the consensus sequence: 5'-TATAAT-3'.
  • The terminator: a signal at the end of the coding sequence that terminates the transcription of RNA
  • The coding sequence is composed of nucleotide triplets. Each triplet codes for an amino acid. The AAs are the building blocks of proteins.
pieces of a eukaryotic gene on the genome
~ 1-100 Mbp

~ 1-1000 kbp

5’

3’

3’

5’

5’

3’

3’

5’

promoter (~103 bp)

Polyadenylation site

enhancers (~101-102 bp)

other regulatory sequences

(~ 101-102 bp)

Pieces of a (Eukaryotic) Gene(on the genome)

exons (cds&utr) / introns

(~ 102-103 bp) (~ 102-105 bp)

what is computational gene finding
What is Computational Gene Finding?

Given an uncharacterized DNA sequence, find out:

  • Which region codes for a protein?
  • Which DNA strand is used to encode the gene?
  • Which reading frame is used in that strand?
  • Where does the gene starts and ends?
  • Where are the exon-intron boundaries in eukaryotes?
  • (optionally) Where are the regulatory sequences for that gene?
prokaryotic vs eukaryotic gene finding
Prokaryotes:

small genomes 0.5 – 10·106bp

high coding density (>90%)

no introns

Gene identification relatively easy, with success rate ~ 99%

Problems:

overlapping ORFs

short genes

finding TSS and promoters

Eukaryotes:

large genomes 107 – 1010 bp

low coding density (<50%)

intron/exon structure

Gene identification a complex problem, gene level accuracy ~50%

Problems:

many

Prokaryotic Vs. Eukaryotic Gene Finding
what is it about genes that we can measure and model
Most of our knowledge is biased towards

protein-coding characteristics

ORF (Open Reading Frame): a sequence defined by in-frame AUG and stop codon, which in turn defines a putative amino acid sequence.

Codon Usage: most frequently measured by CAI (Codon Adaptation Index)

Other phenomena

Nucleotide frequencies and correlations:

value and structure

Functional sites:

splice sites, promoters, UTRs, polyadenylation sites

What is it about genes that we can measure (and model)?
general things to remember about protein coding gene prediction software
General Things to Remember about (Protein-coding) Gene Prediction Software
  • It is, in general, organism-specific
  • It works best on genes that are reasonably similar to something seen previously
  • It finds protein coding regions far better than non-coding regions
  • In the absence of external (direct) information, alternative forms will not be identified
  • It is imperfect! (It’s biology, after all…)
gene finding different approaches
Gene Finding: Different Approaches
  • Similarity-based methods (extrinsic) - use similarity to annotated sequences:
    • proteins
    • cDNAs
    • ESTs
  • Comparative genomics - Aligning genomic sequences from different species
  • Ab initio gene-finding (intrinsic)
  • Integrated approaches
similarity based methods
Similarity-based methods
  • Based on sequence conservation due to functional constraints
  • Use local alignment tools (Smith-Waterman algo, BLAST, FASTA) to search protein, cDNA, and EST databases
  • Will not identify genes that code for proteins not already in databases (can identify ~50% new genes)
  • Limits of the regions of similarity not well defined
comparative genomics
Comparative Genomics
  • Based on the assumption that coding sequences are more conserved than non-coding
  • Two approaches:
    • intra-genomic (gene families)
    • inter-genomic (cross-species)
  • Alignment of homologous regions
  • Difficult to define limits of higher similarity
  • Difficult to find optimal evolutionary distance (pattern of conservation differ between loci)
summary for extrinsic approaches
Summary for Extrinsic Approaches

Strengths:

  • Rely on accumulated pre-existing biological data, thus should produce biologically relevant predictions

Weaknesses:

  • Limited to pre-existing biological data
  • Errors in databases
  • Difficult to find limits of similarity
ab initio gene finding
Ab initio Gene Finding

Input: A DNA string over the alphabet {A,C,G,T}

Output: An annotation of the string showing for every nucleotide whether it is coding or non-coding

AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG

Gene finder

AAAGC ATGCAT TTA ACG A GT GCATC AG GA CTC CAT ACGTAA TGCCG

  • Using only sequence information
  • Identifying only coding exons of protein-coding genes (transcription start site, 5’ and 3’ UTRs are ignored)
  • Integrates coding statistics with signal detection
a eukaryotic gene
Introns

Final exon

Initial exon

3’ untranslated region

Internal exons

A eukaryotic gene
  • This is the human p53 tumor suppressor gene on chromosome 17.
  • Genscan is one of the most popular gene prediction algorithms.

This particular gene lies on the reverse strand.

observations
Observations
  • Given (walk, shop, clean)
    • What is the probability of this sequence of observations? (is he really still at home, or did he skip the country)
    • What was the most likely sequence of rainy/sunny days?
signals vs contents
Signals vs contents
  • In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content.
  • Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites
  • Examples of contents: exons, introns, UTRs, promoter regions
the cpg island problem
The CpG island problem
  • Methylation in human genome
    • “CG” -> “TG” happens in most places except “start regions” of genes and within genes
    • CpG islands = 100-1,000 bases before a gene starts
  • Question
    • Given a long sequence, how would we find the CpG islands in it?
promoters
Promoters are DNA segments upstream of transcripts that initiate transcription

Promoter attracts RNA Polymerase to the transcription start site

Promoters

5’

3’

Promoter

slide28
Donor site

5’

3’

Position

%

Splice site detection

real splice sites
Real splice sites
  • Real splice sites show some conservation at positions beyond the first two.
  • We can add additional arrows to model these states.

weblogo.berkeley.edu

prior knowledge
Prior knowledge
  • The translated region must have a length that is a multiple of 3.
  • Some codons are more common than others.
  • Exons are usually shorter than introns.
  • The translated region begins with a start signal and ends with a stop codon.
  • 5’ splice sites (exon to intron) are usually GT;
  • 3’ splice sites (intron to exon) are usually AG.
  • The distribution of nucleotides and dinucleotides is usually different in introns and exons.
gene prediction and motifs
Gene Prediction and Motifs
  • Upstream regions of genes often contain motifs that can be used for gene prediction

ATG

STOP

-35

-10

0

10

TTCCAA

TATACT

Pribnow Box

GGAGG

Ribosomal binding site

Transcription start site

positional dependence
In this data, every time a “G” appears in position 1, an “A” appears in position 3.

Conversely, an “A” in position 1 always occurs with a “T” in position 3.

ACTG

ACTT

GCAC

ACTT

ACTA

GCAT

ACTA

ACTT

Positional dependence
example of positional weight matrix
Example of (Positional) Weight Matrix

TACGAT

TATAAT

TATAAT

GATACT

TATGAT

TATGTT

  • Computed by measuring the frequency of every element of every position of the site (weight)
  • Score for any putative site is the sum of the matrix values (converted in probabilities) for that sequence (log-likelihood score)
  • Disadvantages:
    • cut-off value required
    • assumes independence between adjacent bases
conditional probability
Conditional probability

GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG

  • What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position?
conditional probability1
Conditional probability

GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG

  • What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position?
  • Answer: total number of CA’s divided by total number of C’s in position 1.
  • 3/11 = 27%
  • Probability of observing CA = 3/18 = 17%.
conditional probability2
Conditional probability

GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG

  • What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position?
conditional probability3
Conditional probability

GCGCAGCCGGCGCCGCCGGCGCCTCCGGGGCGGGCGAGGCAGCCTCATCCTGCG

  • What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position?
  • Answer: 9/12 = 75%.
promoter structure in prokaryotes e coli
Promoter Structure in Prokaryotes (E.Coli)
  • Transcription starts at offset 0.
  • Pribnow Box (-10)
  • Gilbert Box (-30)
  • Ribosomal Binding Site (+10)
open reading frames orfs
Open Reading Frames (ORFs)
  • Detect potential coding regions by looking at ORFs
    • A genome of length n is comprised of (n/3) codons
    • Stop codons break genome into segments between consecutive Stop codons
    • The subsegments of these that start from the Start codon (ATG) are ORFs
      • ORFs in different frames may overlap

ATG

TGA

Genomic Sequence

Open reading frame

long vs short orfs
Long vs.Short ORFs
  • Long open reading frames may be a gene. At random, we should expect one stop codon every (64/3) ~= 21 codons. However, genes are usually much longer than this
  • A basic approach is to scan for ORFs whose length exceeds certain threshold. This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
testing orfs codon usage
Testing ORFs: Codon Usage
  • Create a 64-element hash table and count the frequencies of codons in an ORF
  • Amino acids typically have more than one codon, but in nature certain codons are more in use
  • Uneven use of the codons may characterize a real gene
  • This compensate for pitfalls of the ORF length test
open reading frames in bacteria
Open Reading Frames in Bacteria
  • Without introns, look for long open reading frame (start codon ATG, … , stop codon TAA, TAG, TGA)
  • Short genes are missed (<300 nucleotides)
  • Shadow genes (overlapping open reading frames on opposite DNA strands) are hard to detect
  • Some genes start with UUG, AUA, UUA and CUG for start codon
  • Some genes use TGA to create selenocysteine and it is not a stop codon
coding statistics
Coding Statistics
  • Unequal usage of codons in the coding regions is a universal feature of the genomes
    • uneven usage of amino acids in existing proteins
    • uneven usage of synonymous codons (correlates with the abundance of corresponding tRNAs)
  • We can use this feature to differentiate between coding and non-coding regions of the genome
  • Coding statistics - a function that for a given DNA sequence computes a likelihood that the sequence is coding for a protein
coding statistics1
Coding Statistics
  • Many different ones
    • codon usage
    • hexamer usage
    • GC content
    • compositional bias between codon positions
    • nucleotide periodicity
codon usage in mouse genome
Codon Usage in Mouse Genome

AA codon /1000 frac

Ser TCG 4.31 0.05

Ser TCA 11.44 0.14

Ser TCT 15.70 0.19

Ser TCC 17.92 0.22

Ser AGT 12.25 0.15

Ser AGC 19.54 0.24

Pro CCG 6.33 0.11

Pro CCA 17.10 0.28

Pro CCT 18.31 0.30

Pro CCC 18.42 0.31

AA codon /1000 frac

Leu CTG 39.95 0.40

Leu CTA 7.89 0.08

Leu CTT 12.97 0.13

Leu CTC 20.04 0.20

Ala GCG 6.72 0.10

Ala GCA 15.80 0.23

Ala GCT 20.12 0.29

Ala GCC 26.51 0.38

Gln CAG 34.18 0.75

Gln CAA 11.51 0.25

codon usage and likelihood ratio
Codon Usage and Likelihood Ratio
  • An ORF is more “believable” than another if it has more “likely” codons
  • Do sliding window calculations to find ORFs that have the “likely” codon usage
  • Allows for higher precision in identifying true ORFs; much better than merely testing for length.
  • However, average vertebrate exon length is 130 nucleotides, which is often too small to produce reliable peaks in the likelihood ratio
  • Further improvement: in-frame hexamer count (frequencies of pairs of consecutive codons)
splicing signals
Splicing Signals
  • Try to recognize location of splicing signals at exon-intron junctions. This has yielded a weakly conserved donor splice site and acceptor splice site
  • Profiles for sites are still weak, and lends the problem to the Hidden Markov Model (HMM) approaches, which capture the statistical dependencies between sites
donor and acceptor sites gt and ag dinucleotides
Donor and Acceptor Sites: GT and AG dinucleotides
  • The beginning and end of exons are signaled by donor and acceptor sites that usually have GT and AC dinucleotides
  • Detecting these sites is difficult, because GT and AC appear very often

Donor

Site

Acceptor

Site

GT

AC

exon 1

exon 2

assessing performance sensitivity specificity
Assessing performance: Sensitivity & Specificity
  • Testing of predictions is performed on sequences where the gene structure is known
  • Sensitivity is the fraction of known genes (or bases or exons) correctly predicted
    • “Am I finding the things that I’m supposed to find”
  • Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes
    • “What fraction of my predictions are true?”
  • In general, increasing one decreases the other
measures of prediction accuracy part 1
FN

TN

FN

TP

FN

TN

TN

TP

FP

REALITY

PREDICTION

Measures of Prediction Accuracy, Part 1

Nucleotide level accuracy

Sensitivity =

Specificity =

number of correct exons

number of actual exons

number of correct exons

number of predicted exons

measures of prediction accuracy part 2
Measures of Prediction Accuracy, Part 2

Exon level accuracy

MISSING

EXON

WRONGEXON

CORRECTEXON

REALITY

PREDICTION

examples of gene finders
Examples of Gene Finders

FGENES – linear DF for content and signal sensors and DP for finding optimal combination of exons

GeneMark – HMMs enhanced with ribosomal binding site recognition

Genie – neural networks for splicing, HMMs for coding sensors, overall structure modeled by HMM

Genscan – WM, WA and decision trees as signal sensors, HMMs for content sensors, overall HMM

HMMgene – HMM trained using conditional maximum likelihood

Morgan – decision trees for exon classification, also Markov Models

MZEF – quadratic DF, predict only internal exons

ab initio gene finding is difficult
Ab initio Gene Finding is Difficult
  • Genes are separated by large intergenic regions
  • Genes are not continuous, but split in a number of (small) coding exons, separated by (larger) non-coding introns
    • in humans coding sequence comprise only a few percents of the genome and an average of 5% of each gene
  • Sequence signals that are essential for elucidation of a gene structure are degenerate and highly unspecific
  • Alternative splicing
  • Repeat elements (>50% in humans) – some contain coding regions
problems with ab initio gene finding
Problems with Ab initio Gene Finding
  • No biological evidence
  • In long genomic sequences many false positive predictions
  • Prediction accuracy high, but not sufficient
integrated approaches for gene finding
Integrated Approaches for Gene Finding
  • Programs that integrate results of similarity searches with ab initio techniques (GenomeScan, FGENESH+, Procrustes)
  • Programs that use synteny between organisms (ROSETTA, SLAM)
  • Integration of programs predicting different elements of a gene (EuGène)
  • Combining predictions from several gene finding programs (combination of experts)
combining programs predictions
Combining Programs’ Predictions
  • Set of methods used and they way they are integrated differs between individual programs
  • Different programs often predict different elements of an actual gene

they could complement each other yielding better prediction

related work
Related Work
  • This approach was suggested by several authors
  • Burset and Guigó (1996)
    • Investigated correlation between 9 gene-finding programs
    • 99% of exons predicted by all programs were correct
    • 1% of exons completely missed by all programs
  • Murakami and Tagaki (1998)
    • Five methods for combining the prediction by 4 gene-finding programs
    • Nucleotide level accuracy measures improved by 3-5% in comparison with the best single
and and or methods
AND and OR Methods

exon 1

exon 2

union

intersection

combining genscan and hmmgene
Combining Genscan and HMMgene
  • High prediction accuracy as well as reliability of their exon probability made them the best candidates for our study
  • Genscan predicted 77% of exons correctly, HMMgene 75%, both 87%

Genscan

111

624

91

HMMgene

ad