”Gene Finding in Eukaryotic Genomes”

”Gene Finding in Eukaryotic Genomes” PhD course #27803 Spring 2003. Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark nikob@cbs.dtu.dk. Human Genome Published HUGO: Nature, 15.feb.2001 Celera: Science, 16.feb.2001.


”Gene Finding in Eukaryotic Genomes”

PhD course #27803

Spring 2003

Nikolaj Blom

Center for Biological Sequence Analysis BioCentrum-DTU

Technical University of Denmark

nikob@cbs.dtu.dk


Human Genome Published

HUGO: Nature, 15.feb.2001

Celera: Science, 16.feb.2001


We Have the Human Genome Sequence...now what?

  • So, what is the problem?

    • Well...

    • We don’t know how many genes there are!

    • We don’t know where they are!

    • We don’t know what they do!


The cellular machinery recognizes genes without access to GenBank, SwissProt or computers – can we?


Needles in Haystacks...

  • Only 2% of the human genome consists of coding regions

  • Intron-exon structure of genes

    • Large introns (average 3365 bp)

    • Small exons (average 145 bp)

    • Long genes (average 27 kb)


AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGA
TAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTCCAGCTATTGTACTGTTTCTCTGCTTTTAATTTATTTTTATTTATTTATTTATTTATTTATTTATTTATTTTTGAGATGGAGCTTCACTCTTGTTGCCCAGGCTGGAGCGCAATGGCGCGATCTCAGCTCACCGCAACCTCTACTTCCCGAATTCAAGTGATTGTCCTGCCTCAGCCTCCCGAGTAGCCGGGATTACAGGCATGCGCCACCACGCCTGGCTAATTTTGTACTTTTAGTAGAGACGGGGTTTCTCCATGTTGCTCAGCCTGGTCACAAACTCCCGATCTCAGGTGATCTGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCACGCCCCACCGTCTCTGTTCTCTTTTAAAGCACAATCCCTCAACACAAGTGTCTATACTCAGCGTCTCCACTTTCCCTCCATCTGGTCTTCCCAGTGCCCCCTTGTCAGGTTTTCACCCCATGCTCCTCCAGGGCTAGTCTGCTCTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTGACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTACCTCAGTGAAAAGCTTTACTGGTTCCTCCACATCTCCCAGACCTCCAGTAATAACAGGAATGTACCATGCCATTGCTCTCTCTCTCTCCTTTTTTTTTTTTTTTTTTTTTTTTTGTTGAGACAGAGTCTCAATTTTATCACCCAGACTGAAGCACAATGGCATGATCATAGCTCATTGCAGTCTCGAACTCGTGGGCTCAAGCAATCCTCCCACCTCAGCCTCCTGAATAGCTGGGACTACAAGCAACACCACCATGCCCAGCTAACTTTCTATTTTTTATTTTTATTTTTTGTAGAGATGAGGTTTTACTATGTTGCCTAGGCTAGTCTTGAACTCCTGGGCCCAAATGATCCTCCCACCTTGGTCTCCCAAAGTGCTGGGATTATAGGCGTGAGCCACCGTGTCCAACTTCTCTTTCTTAATGGAATTTAGGCAAAAGTTATTACTCATGGCCTTGGAATGCTCTTTCCTCAGATAGCCACATGGCTCACCATTACTTCCTTCCAGCTTTCTTCAAAGATCCACTTCTCAGTGAAGCTTTGTCCTGACCACCCAGCTGAAAATTGCAATCCTCTTCTGTCTACCATGTACATACTCTCTATTTGCTTTCCTTCCTTTATTTCTCTCTGTAGGTGTGACCTAACATAACATATAATTTACTTCTGTACCTTGTTTGCTTTCTGTCTTCCCCTTTAGAACATAAGCTCCATGAGGGAAGGCGTTTTTGCCTGCTTTAGTCACTTTATCTCCAGCAACTACAACTATATGTATATATACACACACATATATATACACACACATATATATACACACACATATATATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGT
GGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG



Genes and Signals


Gene Features

  • Codon frequency/bias

    • Organism dependent

    • Hexamer statistics

  • Transcriptional

    • Promoters/enhancers

  • Exon/introns

    • Length distributions

    • ORFs

  • Splicing

    • Donor/acceptor sites

    • Branchpoints

  • Translational

    • Ribosome binding sites


Codon Bias

  • Gene Finders are often organism specific

  • Coding regions often modelled by 5th order Markov chain (hexamers/di-codons)
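
A 5th-order Markov chain predicts each base from the five bases before it, so it effectively scores hexamers. The sketch below is a minimal illustration under stated assumptions: the function names `train_markov5` and `log_odds`, the pseudocount, and the tiny training strings are all made up for this example, and the model is not frame-specific, which real gene finders typically are.

```python
import math
from collections import defaultdict


def train_markov5(seqs, pseudocount=1.0):
    """Estimate P(base | previous 5 bases) from training sequences.

    Returns a function prob(context, base). The pseudocount keeps
    unseen hexamers from getting probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(5, len(s)):
            counts[s[i - 5:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + 4 * pseudocount
        return (c.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over all hexamers in seq.

    Positive values mean seq looks more like the coding training set.
    """
    seq = seq.upper()
    return sum(
        math.log(coding(seq[i - 5:i], seq[i]) / noncoding(seq[i - 5:i], seq[i]))
        for i in range(5, len(seq))
    )


# Toy training data (hypothetical, far too small for real use).
coding_model = train_markov5(["GCCGCCGCCGCCGCCGCCGCCGCC"])
noncoding_model = train_markov5(["ATATATATATATATATATATATAT"])

print(log_odds("GCCGCCGCCGCC", coding_model, noncoding_model))
print(log_odds("ATATATATATAT", coding_model, noncoding_model))
```

Because the hexamer statistics come from the training set, a model trained on one organism may score another organism's genes poorly, which is one reason gene finders are organism specific.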


Exon Size


Intron Size


Intron Prevalence


Gene Finding Challenges

  • Need the correct reading frame

    • Introns can interrupt an exon in mid-codon

  • There is no hard and fast rule for identifying donor and acceptor splice sites

    • Signals are very weak
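
The reading-frame problem can be made concrete with a small sketch. It assumes exons are given as 0-based, half-open (start, end) coordinates on the forward strand; the helper names `splice` and `intron_phases` and the toy sequence are invented for this illustration.

```python
def splice(genomic, exons):
    """Concatenate exon segments given as 0-based half-open (start, end)."""
    return "".join(genomic[start:end] for start, end in exons)


def intron_phases(exon_lengths):
    """Phase of each intron: how many bases of an unfinished codon
    lie in the exon just before it (0 means the intron falls
    exactly between two codons)."""
    phases, total = [], 0
    for length in exon_lengths[:-1]:
        total += length
        phases.append(total % 3)
    return phases


# Toy gene: exon 1 = "ATGGC", intron = "GTAAGTTTTCAG", exon 2 = "TTAA"
genomic = "ATGGC" + "GTAAGTTTTCAG" + "TTAA"
exons = [(0, 5), (17, 21)]

cds = splice(genomic, exons)
codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
print(cds, codons)
print(intron_phases([end - start for start, end in exons]))
```

Here the codon GCT is split by the intron: two of its bases sit at the end of the first exon and the third at the start of the second, so neither exon alone reveals the correct frame.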


Overpredicting Genes

  • Easy to predict all exons

  • Report all sequences flanked by ..AG and GT.. as exons

  • Sensitivity = 100%

  • Specificity ~ 0%
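
The trade-off above can be sketched in a few lines. This example assumes that specificity means the fraction of predictions that are correct, TP / (TP + FP), as is common in gene-prediction evaluation; the function names, the toy sequence and the "true" exon coordinates are hypothetical.

```python
def candidate_exons(seq, max_len=300):
    """Report every segment preceded by AG and followed by GT.

    Coordinates are 0-based half-open: the exon is seq[start:end],
    with seq[start-2:start] == "AG" and seq[end:end+2] == "GT".
    """
    seq = seq.upper()
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    ends = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if s < e <= s + max_len]


def sensitivity_specificity(predicted, true_exons):
    """Sensitivity = TP / all true exons; specificity = TP / all predictions."""
    predicted, true_exons = set(predicted), set(true_exons)
    tp = len(predicted & true_exons)
    sens = tp / len(true_exons) if true_exons else 0.0
    spec = tp / len(predicted) if predicted else 0.0
    return sens, spec


seq = "CCAGATGGCTGTAAGTCCAGGGATCCGTAA"
true_exons = [(4, 10), (20, 26)]  # made-up annotation for the toy sequence
predictions = candidate_exons(seq)
print(len(predictions), sensitivity_specificity(predictions, true_exons))
```

On a real genomic sequence the number of AG...GT pairs grows roughly with the square of the sequence length, so the specificity of this naive rule drops towards zero while sensitivity stays at 100%.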


Sensor-based methods

  • Similarity searches miss some/many genes

  • cDNA/EST libraries are not perfect

  • Ab initio Gene Finders

    • HMM-based

      • GenScan

      • HMMgene

    • Neural network-based

      • GRAIL

      • NetGene2 (splice sites)


Gene Prediction

  • ”Isolated” methods

    • Predict individual features

      • E.g. splice sites, coding regions

      • NetGene (Neural network)

        • http://www.cbs.dtu.dk/services/NetGene2/

  • ”Integrated” methods

    • Predict genes in context

      • ”Grammar” of genes

      • Certain elements in specific order are required

        • HMMgene http://www.cbs.dtu.dk/services/HMMgene/

        • GenScan (HMM-based) http://genes.mit.edu/GENSCAN.html


Gene Grammar

Isolated features

HAPPYEUGENEAWASGUYFINDER

Intron 3’UTR Exon Promoter Exon RBS


Gene Grammar

Integrated features

EUGENEFINDERWASAHAPPYGUY

PromRBSExonIntronExon3’UTR


Gene Grammar

”Isolated” methods (e.g. NN):

HAPPYEUGENEAWASGUYFINDER

”Integrated” methods (e.g. HMM):

EUGENEFINDERWASAHAPPYGUY
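
The idea of a gene grammar can be illustrated with a regular expression over feature labels. This is only a sketch: the grammar below (promoter, RBS, exon, any number of intron-exon pairs, 3'UTR) follows the element order shown on the slides, and the name `follows_grammar` is made up for this example.

```python
import re

# Prom RBS Exon (Intron Exon)* 3'UTR, with labels separated by single spaces.
GENE_GRAMMAR = re.compile(r"Prom RBS Exon(?: Intron Exon)* 3'UTR")


def follows_grammar(features):
    """True if the ordered list of feature labels forms a valid gene."""
    return GENE_GRAMMAR.fullmatch(" ".join(features)) is not None


# Features found in isolation, in no meaningful order.
isolated = ["Intron", "3'UTR", "Exon", "Prom", "Exon", "RBS"]
# The same features arranged in context.
integrated = ["Prom", "RBS", "Exon", "Intron", "Exon", "3'UTR"]

print(follows_grammar(isolated))
print(follows_grammar(integrated))
```

An isolated method can find each element correctly and still produce an arrangement that no real gene could have; an integrated method only accepts arrangements that the grammar allows.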


HMMs for genefinding

  • GenScan principle

    • E=exon

    • I=intron

    • F=5’ UTR

    • T=3’ UTR

    • P=promoter

    • N=intergenic
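
How an HMM labels a sequence with such states can be sketched with a toy Viterbi decoder. This is not the GenScan model: it uses only three of the states (N, E, I), single-base emissions, and probabilities invented for illustration, and it assumes a non-empty sequence over A, C, G, T.

```python
import math

STATES = ["N", "E", "I"]  # intergenic, exon, intron

# All probabilities below are made up for this sketch.
START = {"N": 0.8, "E": 0.1, "I": 0.1}
TRANS = {
    "N": {"N": 0.90, "E": 0.10, "I": 0.00},
    "E": {"N": 0.05, "E": 0.85, "I": 0.10},
    "I": {"N": 0.00, "E": 0.10, "I": 0.90},
}
EMIT = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "E": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},  # GC-rich
    "I": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
}


def log(p):
    """Log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path as a string of N/E/I labels."""
    seq = seq.upper()
    column = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_column, pointer = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s.
            prev = max(STATES, key=lambda r: column[r] + log(TRANS[r][s]))
            new_column[s] = column[prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
            pointer[s] = prev
        column = new_column
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: column[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


seq = "ATTATAAT" + "GCCGCGGCGCCGGCGCGCCG" + "TATTAATA"
print(seq)
print(viterbi(seq))
```

The zero entries in the transition table are what encode the grammar: in this toy model an intron can only be entered from an exon and can only return to an exon.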


Genscan http://genes.mit.edu/GENSCAN.html



HMMgene http://www.cbs.dtu.dk/services/HMMgene/

  • Columns

    • Sequence identifier

    • Program name

    • Prediction (see table below for the meaning).

    • Beginning

    • End

    • Score between 0 and 1

    • Strand: + for direct and - for complementary

    • Frame (for exons, the position of the donor in the frame)

    • Group to which the prediction belongs. If several CDSs are found they are called cds_1, cds_2, etc. The prefix 'bestparse:' is present because alternative predictions will also be available (see below).

Name: Meaning

  • firstex: The coding part of the first coding exon, starting with the first base of the start codon.

  • exon_N: The N'th predicted internal coding exon.

  • lastex: The coding part of the last coding exon, ending with the last base of the stop codon.

  • singleex: The coding part of an exon in a gene with only one coding exon.

  • CDS: Coding region composed of the exon predictions prior to this line.
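
The column layout described above can be read with a small parser. This sketch assumes the nine columns are whitespace-separated and appear in the order listed; the example line is invented for illustration and is not real HMMgene output.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    seq_id: str
    program: str
    feature: str   # firstex, exon_N, lastex, singleex or CDS
    start: int
    end: int
    score: float   # between 0 and 1
    strand: str    # "+" or "-"
    frame: str     # kept as text, since the exact encoding is not assumed
    group: str     # e.g. "bestparse:cds_1"


def parse_prediction(line):
    """Parse one prediction line into a Prediction record."""
    fields = line.split()
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 columns, got {len(fields)}")
    return Prediction(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]), float(fields[5]),
        fields[6], fields[7], " ".join(fields[8:]),
    )


# Hypothetical line following the column order described above.
example = "SEQ1 HMMgene firstex 100 250 0.978 + 1 bestparse:cds_1"
pred = parse_prediction(example)
print(pred.feature, pred.start, pred.end, pred.score, pred.group)
```

Keeping the group column makes it straightforward to collect all exons that belong to the same predicted CDS.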


Defining the term ’exon’

  • Gene Prediction programs often use

    • Exon = CDS (coding sequence)

  • Real exons may contain 5’ or 3’ UTRs (untranslated regions)


Gene Prediction – NetGene2



NIX – Visualizing Gene Predictions

http://www.hgmp.mrc.ac.uk/NIX/


Gene Prediction – Performance of Genscan


Performance of Genscan – Exon Length


Repeatmasker

  • Repetitive sequences in human/eukaryotic genomes are a problem

  • Run gene predictions on large genomic regions before and after masking of repetitive sequence:

    • http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

  • Up to 45% of human genomic sequence derived from transposable/repetitive elements
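
What masking does to a sequence can be sketched as follows. The repeat intervals are assumed to be already known (finding them is the job of a tool such as RepeatMasker); the function names and the toy sequence and coordinates are made up for this example.

```python
def hard_mask(seq, repeats):
    """Replace bases inside repeat intervals with N.

    repeats: iterable of 0-based half-open (start, end) intervals.
    Intervals are clipped to the sequence and may overlap.
    """
    masked = list(seq)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(seq))):
            masked[i] = "N"
    return "".join(masked)


def masked_fraction(masked_seq):
    """Fraction of the sequence that has been replaced by N."""
    return masked_seq.count("N") / len(masked_seq) if masked_seq else 0.0


seq = "ACGTACGTAC"
masked = hard_mask(seq, [(2, 5)])  # hypothetical repeat at positions 2-4
print(masked, masked_fraction(masked))
```

Running a gene finder on both `seq` and `masked` shows which predictions depend on repetitive sequence, which is the before-and-after comparison suggested above.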



Future Challenges

  • Bootstrapping: prediction improves as more genes become known

    • ’Extreme’ genes (long/short) still difficult

    • Initial and terminal exons are predicted with lower confidence

  • Combine with Sequence Similarity Matches

  • Non-coding RNAs

    • Most gene prediction programs only predict protein-coding genes

    • tRNA and rRNA genes are not predicted

  • Prokaryotic gene finding

    • Much easier (no introns), but still not perfect

    • Short genes (<300 bp) are especially difficult
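
Without introns, a first approximation to prokaryotic gene finding is an open reading frame (ORF) scan. The sketch below is a simplified assumption-laden version: it scans only the forward strand, uses ATG as the only start codon, and the name `find_orfs` and the length threshold parameter are chosen for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=300):
    """Find ORFs (ATG ... stop) in the three forward frames.

    Returns (start, end, frame) tuples, 0-based half-open, stop codon
    included. ORFs shorter than min_len bases are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


seq = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
print(find_orfs(seq, min_len=9))    # short ORF is reported
print(find_orfs(seq, min_len=300))  # same ORF is filtered out
```

The threshold illustrates why short genes are hard: lowering `min_len` recovers them but also admits many ORFs that arise by chance, while raising it removes the chance ORFs together with the genuine short genes.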


Gene Prediction

  • Take home messages

    • Human genome sequence is known

    • Number of human genes is unknown!

      • Before 2001: est. 30,000-140,000

      • As of 2003: 30,000-40,000

    • Location, structure and function of many human genes are unknown!

    • Genes may be discovered by different means and methods

    • ...


Gene Prediction

  • Take home messages

    • Genes may be predicted by computer programs

    • Masking of repetitive sequences may be required for large genomic sequences

    • ’Unusual’ genes are difficult (high GC%, short or terminal exons)

    • HMM-based gene prediction programs are suitable for “Gene Grammar”

  • Prediction methods are not perfect!


The End


Gene Prediction Exercises

I. Gene Finding in Prokaryotic Sequence

II. Gene Finding in Eukaryotic Sequence

Exercises at:

http://www.cbs.dtu.dk/phdcourse/programme.html

http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/pro.html

http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/euk.html


Gene Prediction Exercise

http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/exercises/gf_exercise_solution.html



Genome Browsing - Exercise #1

  • How many exons are encoded by the hoxA10 gene?

    • 2 exons

  • How long is the transcript, in base pairs?

    • 2542 bp


Genome Browsing - Exercise #1

  • On what chromosome is the hoxA10 gene?

    • Human chr.7

  • On which arm (short/p or long/q)?

    • p

  • What gene is located ca. 500 kb downstream of HoxA10 ?

    • Scap2

  • On what mouse chromosome is the ortholog/homolog of human HoxA10 located?

    • Mouse chr.6

  • In the overview panel, there is a gene located ca. 300 kb downstream of HoxA10, what is the name?

    • Scap2

