”Gene Finding in Eukaryotic Genomes”

”Gene Finding in Eukaryotic Genomes”, PhD course #27803, Spring 2003. Nikolaj Blom, Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark, nikob@cbs.dtu.dk. Human Genome Published: HUGO: Nature, 15 Feb 2001; Celera: Science, 16 Feb 2001.


Presentation Transcript


”Gene Finding in Eukaryotic Genomes”

PhD course #27803

Spring 2003

Nikolaj Blom

Center for Biological Sequence Analysis BioCentrum-DTU

Technical University of Denmark

nikob@cbs.dtu.dk


Human Genome Published

HUGO: Nature, 15 Feb 2001

Celera: Science, 16 Feb 2001


We Have the Human Genome Sequence...now what?

  • So, what is the problem?

    • Well...

    • We don’t know how many genes there are!

    • We don’t know where they are!

    • We don’t know what they do!


The cellular machinery recognizes genes without access to GenBank, SwissProt or computers – can we?


Needles in Haystacks...

  • Only about 2% of the human genome is coding sequence

  • Intron-exon structure of genes

    • Large introns (average 3365 bp)

    • Small exons (average 145 bp)

    • Long genes (average 27 kb)
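
The averages above can be combined into a rough consistency check. The sketch below assumes a gene is simply n exons separated by n-1 introns, all of average length, which ignores UTRs and the skewed length distributions; the function name is made up for illustration.

```python
# Back-of-envelope estimate using the averages quoted on the slide (in bp).
AVG_EXON = 145
AVG_INTRON = 3365
AVG_GENE = 27_000


def estimated_exon_count(gene_len, exon_len, intron_len):
    """Solve gene_len = n * exon_len + (n - 1) * intron_len for n."""
    return (gene_len + intron_len) / (exon_len + intron_len)


n_exons = estimated_exon_count(AVG_GENE, AVG_EXON, AVG_INTRON)
exonic_fraction = n_exons * AVG_EXON / AVG_GENE

print(f"about {n_exons:.1f} exons per gene")
print(f"about {exonic_fraction:.1%} of an average gene is exonic")
```

Under these assumptions only a few percent of a typical gene is exon, which is why the exons are the needles and the introns the haystack.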


AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGA
TAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTCCAGCTATTGTACTGTTTCTCTGCTTTTAATTTATTTTTATTTATTTATTTATTTATTTATTTATTTATTTTTGAGATGGAGCTTCACTCTTGTTGCCCAGGCTGGAGCGCAATGGCGCGATCTCAGCTCACCGCAACCTCTACTTCCCGAATTCAAGTGATTGTCCTGCCTCAGCCTCCCGAGTAGCCGGGATTACAGGCATGCGCCACCACGCCTGGCTAATTTTGTACTTTTAGTAGAGACGGGGTTTCTCCATGTTGCTCAGCCTGGTCACAAACTCCCGATCTCAGGTGATCTGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCACGCCCCACCGTCTCTGTTCTCTTTTAAAGCACAATCCCTCAACACAAGTGTCTATACTCAGCGTCTCCACTTTCCCTCCATCTGGTCTTCCCAGTGCCCCCTTGTCAGGTTTTCACCCCATGCTCCTCCAGGGCTAGTCTGCTCTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTGACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTACCTCAGTGAAAAGCTTTACTGGTTCCTCCACATCTCCCAGACCTCCAGTAATAACAGGAATGTACCATGCCATTGCTCTCTCTCTCTCCTTTTTTTTTTTTTTTTTTTTTTTTTGTTGAGACAGAGTCTCAATTTTATCACCCAGACTGAAGCACAATGGCATGATCATAGCTCATTGCAGTCTCGAACTCGTGGGCTCAAGCAATCCTCCCACCTCAGCCTCCTGAATAGCTGGGACTACAAGCAACACCACCATGCCCAGCTAACTTTCTATTTTTTATTTTTATTTTTTGTAGAGATGAGGTTTTACTATGTTGCCTAGGCTAGTCTTGAACTCCTGGGCCCAAATGATCCTCCCACCTTGGTCTCCCAAAGTGCTGGGATTATAGGCGTGAGCCACCGTGTCCAACTTCTCTTTCTTAATGGAATTTAGGCAAAAGTTATTACTCATGGCCTTGGAATGCTCTTTCCTCAGATAGCCACATGGCTCACCATTACTTCCTTCCAGCTTTCTTCAAAGATCCACTTCTCAGTGAAGCTTTGTCCTGACCACCCAGCTGAAAATTGCAATCCTCTTCTGTCTACCATGTACATACTCTCTATTTGCTTTCCTTCCTTTATTTCTCTCTGTAGGTGTGACCTAACATAACATATAATTTACTTCTGTACCTTGTTTGCTTTCTGTCTTCCCCTTTAGAACATAAGCTCCATGAGGGAAGGCGTTTTTGCCTGCTTTAGTCACTTTATCTCCAGCAACTACAACTATATGTATATATACACACACATATATATACACACACATATATATACACACACATATATATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGT
GGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG


Genes and Signals


Gene Features

  • Codon frequency/bias

    • Organism dependent

    • Hexamer statistics

  • Transcriptional

    • Promoters/enhancers

  • Exons/introns

    • Length distributions

    • ORFs

  • Splicing

    • Donor/acceptor sites

    • Branchpoints

  • Translational

    • Ribosome binding sites


Codon Bias

  • Gene finders are often organism-specific

  • Coding regions are often modelled by a 5th-order Markov chain (hexamers/di-codons)
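
A 5th-order Markov chain conditions each base on the five bases before it, so its parameters are effectively hexamer frequencies. The following is a minimal sketch of how such a model could score a sequence as coding versus non-coding. The training sequences are made-up toy strings, not real genes, and the function names and pseudocount are illustrative assumptions; real gene finders also track codon position, which this sketch does not.

```python
import math
from collections import defaultdict
from itertools import product

ORDER = 5  # each base is conditioned on the previous 5 bases (hexamer statistics)


def train_markov(seqs, order=ORDER, pseudocount=1.0):
    """Estimate log P(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for ctx in map("".join, product("ACGT", repeat=order)):
        total = sum(counts[ctx].values()) + 4 * pseudocount
        model[ctx] = {
            b: math.log((counts[ctx][b] + pseudocount) / total) for b in "ACGT"
        }
    return model


def log_odds(seq, coding, noncoding, order=ORDER):
    """Sum of per-base log-likelihood ratios; above 0 favours the coding model."""
    return sum(
        coding[seq[i - order:i]][seq[i]] - noncoding[seq[i - order:i]][seq[i]]
        for i in range(order, len(seq))
    )


# Toy, made-up training data (assumption: stands in for organism-specific sets).
coding_model = train_markov(["ATGGCCGCCGCCGCCGCCGCCGCCTAA" * 3])
noncoding_model = train_markov(["ATATATATTTTTAAAATATATATTTTA" * 3])

gc_rich_score = log_odds("GCCGCCGCCGCCGCC", coding_model, noncoding_model)
at_rich_score = log_odds("ATATATATATATATA", coding_model, noncoding_model)
print(gc_rich_score, at_rich_score)
```

Because the statistics are learned from training genes, a model trained on one organism can score another organism's genes poorly, which is one reason gene finders are organism-specific.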


Exon Size


Intron Size


Intron Prevalence


Gene Finding Challenges

  • Need the correct reading frame

    • Introns can interrupt an exon in mid-codon

  • There is no hard and fast rule for identifying donor and acceptor splice sites

    • Signals are very weak
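
To illustrate the reading-frame problem, the sketch below computes the phase of each intron from the lengths of the coding exons and splices a toy gene. The sequence and coordinates are hypothetical, chosen so that the intron falls in the middle of a codon.

```python
def intron_phases(exon_lengths):
    """Phase of the intron following each coding exon.

    Phase 0: the intron lies between two codons.
    Phase 1 or 2: the intron splits a codon after its 1st or 2nd base.
    """
    phases, coding_bases = [], 0
    for length in exon_lengths[:-1]:  # there is no intron after the last exon
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


def splice(genomic, exon_coords):
    """Join exons given as 0-based, end-exclusive (start, end) pairs."""
    return "".join(genomic[start:end] for start, end in exon_coords)


# Hypothetical toy gene: two coding exons separated by a GT...AG intron.
genomic = "ATGGC" + "GTAAGTTTTCAG" + "TTGA"
exons = [(0, 5), (17, 21)]

cds = splice(genomic, exons)
phases = intron_phases([end - start for start, end in exons])
print(cds, phases)
```

Here the codon GCT is split as GC|T by the intron, so the second exon can only be translated correctly if the predictor carries the frame across the splice junction.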


Overpredicting Genes

  • Easy to predict all exons

  • Report all sequences flanked by ..AG and GT.. as exons

  • Sensitivity = 100%

  • Specificity ~ 0%
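
The overprediction strategy can be sketched in a few lines. The example assumes the convention common in gene-finding evaluation, where sensitivity is TP/(TP+FN) and specificity is TP/(TP+FP); the sequence and the "true" exon coordinates are made up for illustration.

```python
def naive_exons(seq):
    """Every span preceded by AG and followed by GT (0-based, end-exclusive)."""
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    ends = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if e > s]


def sensitivity_specificity(predicted, true):
    """Exon-level Sn = TP / (TP + FN) and Sp = TP / (TP + FP)."""
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)
    sn = tp / len(true) if true else 0.0
    sp = tp / len(predicted) if predicted else 0.0
    return sn, sp


# Toy sequence with two hypothetical true internal exons.
seq = "CAGATGGTAAGTCAGCCTGTAGGAGTT"
true_exons = [(3, 6), (15, 18)]

predicted = naive_exons(seq)
sn, sp = sensitivity_specificity(predicted, true_exons)
print(len(predicted), sn, sp)
```

Even on this 27-base toy sequence most predictions are false. The number of AG/GT pairs grows roughly with the square of the sequence length, so on genomic-scale sequence the specificity approaches zero.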


Sensor-based methods

  • Similarity searches miss some/many genes

  • cDNA/EST libraries are not perfect

  • Ab initio Gene Finders

    • HMM-based

      • GenScan

      • HMMgene

    • Neural network-based

      • GRAIL

      • NetGene2 (splice sites)


Gene Prediction

  • ”Isolated” methods

    • Predict individual features

      • E.g. splice sites, coding regions

      • NetGene (Neural network)

        • http://www.cbs.dtu.dk/services/NetGene2/

  • ”Integrated” methods

    • Predict genes in context

      • ”Grammar” of genes

      • Certain elements in specific order are required

        • HMMgene http://www.cbs.dtu.dk/services/HMMgene/

        • GenScan (HMM-based) http://genes.mit.edu/GENSCAN.html


Gene Grammar

Isolated features

HAPPYEUGENEAWASGUYFINDER

Intron 3’UTR Exon Promoter Exon RBS

Integrated features

EUGENEFINDERWASAHAPPYGUY

Prom RBS Exon Intron Exon 3’UTR

”Isolated” methods (e.g. NN):

HAPPYEUGENEAWASGUYFINDER

”Integrated” methods (e.g. HMM):

EUGENEFINDERWASAHAPPYGUY


HMMs for Gene Finding

  • GenScan principle

    • E=exon

    • I=intron

    • F=5’ UTR

    • T=3’ UTR

    • P=promoter

    • N=intergenic
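
The state labels above can be used to illustrate what a "gene grammar" means. The sketch below is a toy, not the actual Genscan model: it assumes a single strand and that every gene follows the order promoter, 5’ UTR, exon, any number of intron-exon pairs, 3’ UTR, with intergenic sequence on both sides. It only checks whether a per-base state path obeys that order.

```python
import re
from itertools import groupby

# Toy grammar over the slide's state letters (assumption, forward strand only):
# intergenic, then zero or more genes of the form P F E (I E)* T, each
# followed by intergenic sequence again.
GENE_GRAMMAR = re.compile(r"N(PFE(IE)*TN)*")


def collapse(path):
    """Collapse runs of identical states, e.g. 'NNPPFE' -> 'NPFE'."""
    return "".join(state for state, _ in groupby(path))


def is_grammatical(path):
    """True if the per-base state path follows the toy gene grammar."""
    return GENE_GRAMMAR.fullmatch(collapse(path)) is not None


ordered = "NNPPFFEEEIIIIEEETTNN"    # elements in a valid order
scrambled = "NNEEIIPPFFTTNN"        # same kinds of elements, wrong order
print(is_grammatical(ordered), is_grammatical(scrambled))
```

An HMM-based gene finder goes further than this check: its transition structure only permits grammatical paths, and it picks the most probable one given the sequence.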


Genscan http://genes.mit.edu/GENSCAN.html


HMMgene http://www.cbs.dtu.dk/services/HMMgene/

  • Columns

    • Sequence identifier

    • Program name

    • Prediction (see the table below for the meaning)

    • Beginning

    • End

    • Score between 0 and 1

    • Strand: + for direct and - for complementary

    • Frame (for exons it is the position of the donor in the frame)

    • Group to which the prediction belongs. If several CDSs are found they will be called cds_1, cds_2, etc. 'bestparse:' is there because alternative predictions will also be available (see below).

Name: Meaning

firstex: The coding part of the first coding exon starting with the first base of the start codon.

exon_N: The N'th predicted internal coding exon.

lastex: The coding part of the last coding exon ending with the last base of the stop codon.

singleex: The coding part of an exon in a gene with only one coding exon.

CDS: Coding region composed of the exon predictions prior to this line.
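
The nine columns described above can be read into a small record type. This is a sketch under stated assumptions: the fields are taken to be whitespace-separated in the order listed, and the sample line is hypothetical, written only to match that description rather than copied from real HMMgene output.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    seq_id: str
    program: str
    feature: str   # firstex, exon_N, lastex, singleex or CDS
    begin: int
    end: int
    score: float   # between 0 and 1
    strand: str    # "+" direct, "-" complementary
    frame: str     # kept as text, since it may not be numeric on every line
    group: str


def parse_line(line):
    """Split one prediction line into its nine columns."""
    fields = line.split(None, 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    return Prediction(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]), float(fields[5]),
        fields[6], fields[7], fields[8].strip(),
    )


# Hypothetical example line (values invented for illustration).
example = "seq1 HMMgene firstex 608 825 0.578 + 1 bestparse:cds_1"
pred = parse_line(example)
print(pred.feature, pred.end - pred.begin + 1, pred.score)
```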


Defining the term ’exon’

  • Gene Prediction programs often use

    • Exon = CDS (coding sequence)

  • Real exons may contain 5’ or 3’ UTRs (untranslated regions)


Gene Prediction – NetGene2


NIX – Visualizing Gene Predictions

http://www.hgmp.mrc.ac.uk/NIX/


Gene Prediction – Performance of Genscan


Performance of Genscan – Exon Length


RepeatMasker

  • Repetitive sequences in human/eukaryotic genomes are a problem

  • Run gene predictions on large genomic regions before and after masking of repetitive sequence:

    • http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

  • Up to 45% of human genomic sequence is derived from transposable/repetitive elements
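
When comparing predictions before and after masking, it helps to know how much of the input was actually masked. The sketch below assumes the masked sequence marks repeats either as lowercase letters (soft masking) or as N/X characters (hard masking); the helper names and the toy sequence are illustrative.

```python
def masked_fraction(seq):
    """Fraction of positions that are lowercase (soft-masked) or N/X (hard-masked)."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower() or base in "NX")
    return masked / len(seq)


def hard_mask(seq):
    """Turn soft-masked (lowercase) positions into N for tools that ignore case."""
    return "".join("N" if base.islower() else base for base in seq)


# Toy sequence: 8 soft-masked and 4 hard-masked positions out of 20.
toy = "ACGTacgtacgtACGTNNNN"
fraction = masked_fraction(toy)
hard = hard_mask(toy)
print(fraction, hard)
```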


Future Challenges

  • Bootstrapping: prediction improves as more genes become known

    • ’Extreme’ genes (long/short) still difficult

    • Initial and terminal exons are predicted with lower confidence

  • Combine with Sequence Similarity Matches

  • Non-coding RNAs

    • Most gene prediction programs only predict protein-coding genes

    • tRNA and rRNA genes are not predicted

  • Prokaryotic gene finding

    • Much easier (no introns), but still not perfect

    • Especially short genes (<300 bp) difficult
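
Without introns, a first approximation to prokaryotic gene finding is a scan for open reading frames in all six frames. The sketch below is a simplified illustration, assuming ATG as the only start codon and a minimum-length cutoff; it is not the method of any particular program, and coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=300):
    """Return (strand, start, end) for ATG-to-stop ORFs of at least min_len bp.

    Coordinates are 0-based and end-exclusive on the scanned strand.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, start, i + 3))
                    start = None
    return orfs


# Toy 36 bp ORF: found with a low cutoff, missed with the 300 bp default.
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
short_cutoff = find_orfs(toy, min_len=30)
default_cutoff = find_orfs(toy)
print(short_cutoff, default_cutoff)
```

The cutoff shows why short genes are hard: lowering it recovers short genes but also admits many ORFs that arise by chance, while raising it removes those false positives together with the real short genes.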


Gene Prediction

  • Take home messages

    • Human genome sequence is known

    • Number of human genes is unknown!

      • Before 2001: est. 30,000-140,000

      • As of 2003: 30,000-40,000

    • Location, structure and function of many human genes are unknown!

    • Genes may be discovered by different means and methods

    • Genes may be predicted by computer programs

    • Masking of repetitive sequences may be required for large genomic sequences

    • ’Unusual’ genes are difficult (high GC%, short or terminal exons)

    • HMM-based gene prediction programs are suitable for “Gene Grammar”

  • Prediction methods are not perfect!


The End


Gene Prediction Exercises

I. Gene Finding in Prokaryotic Sequence

II. Gene Finding in Eukaryotic Sequence

Exercises at:

http://www.cbs.dtu.dk/phdcourse/programme.html

http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/pro.html

http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/euk.html


Gene Prediction Exercise

http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/exercises/gf_exercise_solution.html


Genome Browsing - Exercise #1

  • How many exons are encoded by the hoxA10 gene?

    • 2 exons

  • How many base pairs long is the transcript?

    • 2542 bp


  • On what chromosome is the hoxA10 gene?

    • Human chr.7

  • On which arm (short/p or long/q)?

    • p

  • What gene is located ca. 500 kb downstream of HoxA10?

    • Scap2

  • On what mouse chromosome is the ortholog/homolog of human HoxA10 located?

    • Mouse chr.6

  • In the overview panel, there is a gene located ca. 300 kb downstream of HoxA10. What is its name?

    • Scap2

