The issue

TCCTGGCCTACATGTTCTTTGGCAAAGGATCTTCAAAATCAACGGCTCCCGGTGCGGCGATCATCCATTTCTTCGGAGGGATTCACGAGATTTACTTCCCGTACATTCTGATGAAACCTGGCCCTGATTCTCGCAGCCATTGCCGGCGGAGCAAGCGGACTCTTAACATTACGATCTTTAATGCCGGACTTGTCGCGGCAGCGTCACCGGGAAGCATTATCGCATTGATGGCAATGACGCCAAGAGGAGGCTATTTCGGCGTATTGGCGGGTGTATTGGTCGCTGCAGCTGTATCGTTCATCGTTTCAGCAGTGATCCTGAAATCCTCTAAAGCTAGTGAAGAAGACCTGGCTGCCGCAACAGAAAAAATGCAGTCCATGAAGGGGAAGAAAAGCCAAGCAGCAGCTGCTTTAGAGGCGGAACAAGCCAAAGCAGAGAAGCGTCTGAGCTGTCTCCTGAAAGCGCGAACAAAATTATCTTTTCGTGTGATCCGGGATGGGATCAAGTGCCATGGGGGCATCCATCTTAAGAAACAAAGTGAAAAAGCGGAGCTTGACATCAGTGTGACCAACACGGCCATTAACAATCTGCCAAGCGATGCGGATATTGTCATCACCCACAAAGATTTAACAGACCGCGCGAAAGCAAAGCTGCCGAACGCGACGCACATATCAGTGGATAACTTCTTAAACAGCCCGAAATACGACGAGCTGATTGAAAAGCTGAAAAGTAATCTTATAGAAAGAGAGTATTGTCATGCAAGTACTCGCAAAGGAAACATTAAACTCAATCAAACGGTATCATCAAAAGAAGAGGCTATCAAATTGGCAGGCCAGACGCTGATTGACAACGGCTACGTGACAGAGGATTACATTAGCAAAATGTTTGACCGTGAAGAAACGTCTTCTACGTTTATGGGGAATTTCATTGCCATTCCACACGGCACAGAAGAAGCGAAAAGCGAGGTGCTTCACTCAGGAATTTCAATCATACAGATTCCAGAGGGCGTTGAGTACGGAGAAGGCAACACGGCAAAAGTGGTATTCGGCATTGCGGGTAAAAATAATGAGCATTTAGACATTTTGTCTAACATCGCCATTATCTGTTCAGAAGAAGAAACATTGAACGCCTGATCTCCGCTAAAGCGAAGAAGATTTGATCGCCATTTCAACGAGGTGAACTGACATGATCGCCTTACATTTCGGTGCGGGAAATATCGGGAGAGGATTTATCGGCGCGCTGCTTCACCACTCCGGCTATGATGTGGTGTTTGCGGATGTGAACGAAACGATGGTCAGCCTCCTCAATGAAAAAAAAGAATACACAGTGGAACTGGCGGAAGAGGGACGTTCATCGGAGATCATTGGCCCGGTGAGCGCTATTAACAGCGGCAGTCAGACCGAGGAGCTGTACCGGCTGATGAATGAGGCGGCGCTCATCACAACAGCTGTCGGCCCGAATGTCCTGAAGCTGATTGCCCCGTCTATCGCAGAAGGTTTAAGACGAAGAAATACTGCAAACACACTGAATATCATTGCCTGCGAAAATATGATTGGCGGAAGCAGCTTCTTAAAGAAAGAAATATACAGCCATTTAACGGAAGCAGAGCAGAAATCCGTCAGTGAAACGTTAGGTTTTCCGAATTCTGCCGTTGACCGGATCGTCCCGATTCAGCATCATGAAGACCCGCTGAAAGTATCGGTTGAACCATTTTTCGAATGGGTCATTGATGAATCAGGCTTTAAAGGGAAAACACCAGTCATAAACGGCGCACTGTTTGTTGATGATTTAACGCCGTACATCGAACGGAAGCTGTTTACGGTCAATACCGGACACGCGGTCACAGCGTATGTCGGCTATCAGCGCGGACTCAAAACGGTCAAAGAAGCAATTGATCATCCGGAAATCCGCCGTGTTGTTCATTCGGCGCTGCTTGAAACTGGTGACTATCTCGTCAAATCGTATGGCTTTAAGCAAACTGAACACGAACAATATATTAA
AAATCAGCGGTCGCTTTTAAAATCCTTTCATTTCGGACGATGTGACCCGCGTAGCGAGGTCACCTCTCAGAAAACTGGGAGAAAATGTAGACTTGTAGGCCCGGCAAAGAAAATAAAAGAACCGAATGCACTGGCTGAAGGAATTGCCGCAGCACTGCGCTTCGATTTCACCGGTGACCCTGAAGCGGTTGAACTGCAAGCGCTGATCGAAGAAAAGGATACAGCGGCGTACTTCAAGAGGTGTGCGGCATTCAGTCCCATGAACCGTTGCACGCCATCATTTTAAAGAAACTTAATCAATAACCGACCACCCGTGACACAATGTCACGGGCTTTTTACTATCTCGCAATCTAGTATAATAGAAAGCGCTTACGATAACAGGGGAAGGAGAATGACGATGAAACAATTTGAGATTGCGGCAATACCGGGAGACGGAGTAGGAAAGAGGTTGTAGCGGCTGCTGAGAAAGTGCTTCATACAGCGGCTGAGGTACACGGAGGTTTGTCATTCTCATTCACAGCTTTTCCATGGAGCTGTGATTATTACTTGGAGCACGGCAAAAATGATGCCCGAAGATGGAATACATACGCTTACTCAATTTGAAGCAGTTTTTGGGAGCTGTCGGAAATCCGAAGCTGGTTCCCGATCATATATCGTTATGGGGCTGCTGCTGAAATCCGGAGGGAGCTTGAGCTTTCCATTAATATGAGACCCGCCAAACAAATGGCAGGCATTACGTCGCCGCTTCTGCATCCAAATGATTTTTGACTTCGTGGTGATTCGCGAGAACAGTGAAGGTGAATACAGTGAAGTTGTCGGGCGCATTCACAGAGGCGATGATGAAATCGCCATCCAGAATGCCGTGTTTACGAGAAAAGCGACAGAACGTGTCATGCGCTTTGCCTTCGAATTGGCGAAAAAACGGCGCACACTCGTGACAAGCGCCACAAAGTCTAACGGCATTTATCACGCGATGCCGTTTTGGGATGAAGTCTTTCAGCAGACAGCCGCTGATTATAGCGGAATCGAGACATCATCTCAGCATATTGATGCGCTGGCCGCTTTTTTTGTGACGCGTCCGGAAACGTTTGATGTCATTGTGGCGAGCAAATTGTTCGGTGATATTTTAACCGACATCAGCTCAAGCCTGATGGAAAGCATCGGCATTGCGCCTCCCGACATCAATCCATCCGGCAAATATCCGTCCATGTTTGAACCGGTTCACGGCTCAGCTCCTGACATTGCCGGACAGGCCTTGCCAATCCGATCGGCCAGATTTGGACAGCGAAGCTGATGCTCGACCACTTCGGAGAGGAAGAATTGGGGGCGAAAATTCTGGATGTAATGGAGCAAGTGACTGCCGACGGCATCAAAACACGCGACATTGGGGGACAAAGCACAACGGCTGAGGTCACTGATGAAATCTGTTCGCGCTTAAGAAAGCTCTGATGAATCAGGCCGGTGGCAGATGGCTGCCCCGGTCTGTCCATTTCCTTACGAAAATTTCCACGAAAGTCTAACCAAGCAGATCCAAATGCTGTATAATAATTTGGAATTCTTAGGAAAGCATCGGGTGAAGGAAGTTGAATGCAAAAACAATCACGTTAAAGAAAAAAAGAAAAATCAAAACGATCGTTGTACTCAGTATCATTATGATCGCAGCTCTCATTTTTACGATCAGATTGGTGTTTTACAAGCCTTTTCTTATTGAAGGATCATCAATGGCCCCAACGCTTAAAGACTCAGAAAGAATTCTGGTTGATAAAGCAGTCAAATGGACTGGCGGGTTTCACAGAGGAGACATCATAGTCATTCATGACAAAAAGAGCGGCCGCTCATTTGTCAAACGTTTAATCGGTTTGCCTGGTGACAGCATTAAAATGAAAAATGATCAGCTATACATAAATGATAAAAAGGTGGAAGAACCATACTTAAAGGAATATAAACAGGAGGTCAAAGAGTCGGGTGTAACCTTAACAGGTGACTTCGAAGTTGAGGTT
CCTTCCGGTAAATATTTTGTGATGGGAGATAACCCTGATATAAGTGGAGCAATTAAACAAAATGGCGCCAAAGGATGTACGCGCCCTGATACGAGAGGGGAAAATAAACGGGCCGACCGCAGGCATGTCCGGCGGCTACGCCCAAGCGAATCTTGTGGTTTTGAAAAAGGACCTTGCGTTTGATTTTCTGCTGTTTTGCCAGCGAAATCAAAAGCCCTGCCCCGTGCTGGATGTGACTGAAGCAGGTTCGCCTGTGCCGTCTCTGCTGCGCCGGATGCTGATATCCAGAACGGACTTTCCGAAATACCGTATTTACAGGCACGGTATCCTAACGGAAGAAGTATCTGATATTACGCCATACT


Annotation of the 400Kb contig around AP2 on chromosome IV


The gene and the transcript

[Figure: structure of a gene and of its transcript. Gene labels: Transcription Start Site, 5’UTR exon, 5’UTR intron, start exon (ATG, translation initiation), internal exons, internal introns, stop exon (stop), 3’UTR intron, 3’UTR exon; regions marked as coding or non coding. Transcript labels: CAP, 5’UTR, CDS (coding sequence, from ATG to stop), 3’UTR, AAAAAAA.]


The different strategies to build the structure of genes

  • experimental

  • predictive

    • extrinsic / comparative

    • intrinsic / ab initio


The experimental approach

Methods to localize genes on genome sequences

  • The experimental approach: identify and clone the cognate transcripts (as cDNA), sequence them, and compare cDNA and gDNA. It is the ONLY secure method!


  • The experimental approach: even this method has its bottlenecks:

    • cDNAs are rarely full length ...

    • There are often alternative transcripts, but only one or a few are cloned or considered for analysis

    • The nucleic acid sequence does not provide experimental information on the translation product(s)

  • A minimum of bioinformatics is needed:

    • cDNA and gDNA sequence comparison ...

    • ... and exact localization of splice sites at intron-exon borders: NNNag/Gtaagt……AG/gtNNN

    • this requires specific software for high throughput, e.g. Sim4
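Once a cDNA has been placed on the genomic sequence, the intron-exon borders can be checked mechanically: in the pattern above, an intron starts with GT and ends with AG. The following is a minimal sketch of that check, not a replacement for a spliced aligner such as Sim4. The function names, the coordinate convention (0-based, end-exclusive, forward strand) and the toy sequence are assumptions made for illustration.

```python
def intron_boundaries(gdna, exons):
    """Return the (donor, acceptor) dinucleotides of every intron.

    exons: sorted list of (start, end) pairs on the forward strand,
    0-based and end-exclusive (an assumed convention).
    """
    gdna = gdna.upper()
    result = []
    for (_, end_a), (start_b, _) in zip(exons, exons[1:]):
        donor = gdna[end_a:end_a + 2]          # first two bases of the intron
        acceptor = gdna[start_b - 2:start_b]   # last two bases of the intron
        result.append((donor, acceptor))
    return result


def is_canonical(donor, acceptor):
    """True for the canonical GT...AG intron borders."""
    return donor == "GT" and acceptor == "AG"


# Toy gene: exon 1 + intron + exon 2 (invented sequence).
exon1, intron, exon2 = "ATGGCCAAG", "GTAAGTTTTCTTTCAG", "GTTGACTAA"
gdna = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gdna))]
sites = intron_boundaries(gdna, exons)
```

An alignment whose exon ends are shifted by even one base gives non-canonical borders, which is a cheap way to flag alignments that need manual inspection.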


The predictive approaches

Methods to localize genes on genome sequences

  • Predictive methods: the extrinsic (comparative) method

    • search for similarities in protein and nucleic acid sequence databases

    • rationale: many genes and proteins are already documented; the genomic DNA may contain one of them, or at least a close or distant homologue


  • Predictive methods: the extrinsic method, protein databases

    • due to a richer alphabet (20 amino acids compared to 4 nucleotides), protein sequence databases are the most efficient and the most informative

    • in the best case, a hit in a database search indicates:

      • the existence of a gene

      • the complete exon-intron structure of this gene

      • the function this gene codes for


  • Predictive methods: the extrinsic method, limits & bottlenecks

    • closely homologous sequences need to be in the databases: orphan and fast-evolving genes are typically not found this way

    • partial and wrong sequences cause problems

    • this approach identifies and gives the structure of only a fraction of the genes in a complete genome (e.g. 40%), and gives incomplete information for another fraction (e.g. 20%)


  • Predictive methods: the extrinsic method, flaws & bottlenecks

    • protein searches rely on correct gene annotation in databases …

    • does a given database hit refer to an experimentally documented entity or to a virtual one?

    • how can the source of information be tracked and the features given in databases be validated?


  • Predictive methods: the extrinsic method, gDNA versus mRNAs

    • The EST case: what is it for real?

    • Expressed Sequence Tags are:

      • obtained from mRNA isolated from a given organ

      • cloned as cDNA in large libraries

      • sequenced from one extremity (often 3’) in a single pass, as far as possible (100-800 bp)


  • Predictive methods: the extrinsic method, EST pros & cons

    • Pros (+):

      • the closest to the experimental method

      • no assumption needed

      • alternative transcripts are often found this way

    • Cons (-):

      • poor quality of EST sequences (error rate >1%)

      • unequal coverage, depending on gene expression level

      • partial sequences (though they may be assembled)

      • directional: 3’ (and 5’) exons are best covered

      • many ESTs are needed for correct annotation: >10^6 for human


  • Predictive methods: the extrinsic method, gDNA versus gDNA

    • The “Conserved Exon” method: comparison of non-documented genomic DNA with another non-documented gDNA

    • Rationale: coding sequences being more conserved in evolution, (coding) exons should appear more similar to each other than introns and intergenic regions

    • No need for transcript or protein data

    • Applies well to the comparison of genomes of closely related species, e.g. mouse-human…
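The rationale of the conserved exon idea can be illustrated with a windowed identity profile over two genomic sequences that have already been aligned. This is only a sketch of the principle, assuming the alignment (with "-" for gaps) was produced by some other tool; it is not the published Conserved Exon Method. The function names, window size, step and threshold are arbitrary choices made for illustration.

```python
def window_identity(aln_a, aln_b, window=30, step=10):
    """Fraction of identical, non-gap columns in each sliding window.

    aln_a, aln_b: two rows of a pairwise alignment, same length.
    Returns a list of (window_start, identity) pairs.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        profile.append((start, matches / window))
    return profile


def conserved_windows(profile, threshold=0.8):
    """Start positions of windows at or above the identity threshold."""
    return [start for start, identity in profile if identity >= threshold]
```

Windows that stay above the threshold are candidates for coding exons; conserved non-coding elements pass the same filter, so the output is a list of hypotheses rather than a gene model.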


Methods to localize genes on genome sequences

  • Predictive methods: the intrinsic (ab initio) method


Intrinsic Gene Prediction

  • Not every DNA sequence is a gene

  • Sequences of genes have specific features, which are often linked to the expression of these genes:

  • this applies to properties of sequences as a whole

    • Coding sequences : 3bp-periodicity, codon usage, GC content

  • or to local signals

    • translation start and stops, splice sites, polyA site, TATA box, promoter cis-acting motifs....
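The whole-sequence properties listed above are simple to measure. The sketch below computes GC content, codon usage, and GC content per codon position, the last being one crude way to see the 3-bp periodicity of coding sequences. The function names are invented for illustration, and the input is assumed to be an in-frame coding sequence; real gene finders use trained statistical models rather than these raw counts.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds):
    """Count each codon of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def gc_by_codon_position(cds):
    """GC content at codon positions 1, 2 and 3 (a view of 3-bp periodicity)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [gc_content(cds[pos:usable:3]) for pos in range(3)]
```

In a non-coding sequence the three per-position values are expected to be similar, whereas coding sequences of many species show a marked difference at the third position.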


Intrinsic Gene Prediction

Relies on combinatorial, statistical and/or A.I. methods

May integrate several individual sensors

Needs training sets of documented genes

Is not universal!

Each (group of) species has its own genome “style”.

Therefore:

each method has to be trained and even adapted for a given genome, and needs a species-specific gene set for this purpose

the performance of a given algorithm or integrated software may vary a lot from one species to another...


The software (March 2002)


Splice site prediction

Program (reference), URL: organism; method

GeneSplicer (92): Arabidopsis, human; HMM + MDD

NetPlantGene (66), http://www.cbs.dtu.dk/services/NetPGene/: Arabidopsis; NN

NetGene2 (93), http://www.cbs.dtu.dk/services/NetGene2/: Human, C. elegans, Arabidopsis; NN + HMM

SpliceView (94), http://l25.itba.mi.cnr.it/~webgene/wwwspliceview.html: Eukaryotes; score with consensus

NNSplice (95), http://www.fruitfly.org/seq_tools/splice.html: Drosophila, Human, others; NN

SplicePredictor (96, 97), http://gremlin1.zool.iastate.edu/cgi-bin/sp.cgi: Arabidopsis, Maize; logitlinear models: (i) score with consensus, (ii) local composition

BCM-SPL, http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html, http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html: Human, Drosophila, C. elegans, Yeast, Plants; linear discriminant analysis

Abbreviations:

MM: Markov model

IMM: Interpolated MM

HMM: Hidden Markov model

CHMM: class HMM

GHMM: generalized HMM

DP: dynamic programming

MDD: maximal dependence decomposition

ML: maximum likelihood

NN: Neural Network

WAM: weight array matrix
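Several of the methods abbreviated above score a candidate splice site against position-specific base frequencies learned from documented sites. The sketch below shows the simplest such scheme, a position weight matrix with log-odds scores against a uniform background. It is not a WAM (which also conditions each position on its neighbour) and it is not the method of any program listed here. The function names, the pseudocount and the four toy donor sites are assumptions made for illustration.

```python
import math


def train_pwm(sites, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length sites.

    Scores are log2(frequency / 0.25), i.e. against a uniform background.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((c / total) / 0.25) for b, c in counts.items()})
    return pwm


def score(pwm, candidate):
    """Sum of per-position log-odds for a candidate of the same length."""
    return sum(column[base] for column, base in zip(pwm, candidate))


# Toy training set of donor-site 8-mers (invented, not real data).
toy_sites = ["AGGTAAGT", "AGGTGAGT", "TGGTAAGT", "AGGTAAGA"]
toy_pwm = train_pwm(toy_sites)
```

A candidate site would then be accepted when `score(toy_pwm, candidate)` exceeds a cutoff chosen on a training set, which is one reason such sensors have to be retrained for each species.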


Exon prediction and gene modeling

Program (reference), URL: organism; gene elements; gene model; database similarity

DAGGER (98): Human; Winnow classifier for ATG, splice sites and stop, exon 3-periodic MM; path deletion algorithm

EuGène (99), http://www.inra.fr/bia/T/EuGene: Arabidopsis; 3-periodic IMM for exons, 1 IMM for introns, 1 for intergenic regions, 1 for UTR, NetGene2 / SplicePredictor for splice sites; DP; EST/cDNA, protein

GeneId3 (100), http://apolo.imim.es/geneid.html: Vertebrates, Plants; rule-based method; WAM, discriminant analysis; DP; EST

GeneFinder (FGENE, FEX,..) (101), http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html, http://genomic.sanger.ac.uk/gf/gf.html: Human, Yeast, Drosophila, C. elegans, Plants; linear discriminant analysis; DP; protein

GeneFinder (Phil Green), http://www.ibc.wustl.edu/bio_data/genefinder/: log likelihood ratio score matrix on MM; DP

GeneGenerator (102): Maize; logitlinear models for splice sites and start, 3rd to 5th order MM for exons and introns; DP

GeneMark (103), http://genemark.biology.gatech.edu/GeneMark/: Prokaryotes, Eukaryotes; 5th order MM (homogeneous for introns, 3-periodic for exons); No

GeneMark.HMM (104), http://genemark.biology.gatech.edu/GeneMark/webgenemark.html: Human, C. elegans, Arabidopsis, rice, Drosophila, chicken, Chlamydomonas, …; 5th order MM (homogeneous for introns, 3-periodic for exons); GHMM, DP; under development


Program (reference), URL: organism; gene elements; gene model; database similarity (continued)

GeneModeler (37), ftp://ftp.tigr.org/pub/software/gm/: Eukaryotes; nucleotide and di-nucleotide composition, consensus for splice sites; rule-based method

GeneParser (105), http://beagle.colorado.edu/~eesnyder/GeneParser.html: Vertebrates; NN; DP; EST

Genie (95, 106), http://www-hgc.lbl.gov/projects/genie.html: Drosophila, Human, others; NN; GHMM, DP; protein

GenLang (107), http://www.cbil.upenn.edu/genlang/genlang_home.html: Vertebrates, Drosophila, Dicots; grammar rules; WAM, hextuple frequencies…; chart parsing DP

GenScan (108), http://CCR-081.mit.edu/GENSCAN.html: Vertebrate, Arabidopsis, Maize; WAM for acceptor, MDD for donor, 5th order MM (homogeneous for introns, 3-periodic for exons); GHMM, DP; protein; GenomeScan (53)

GENVIEW2 (109), http://l25.itba.mi.cnr.it/~webgene/wwwgene.html: Human, Mouse, Diptera; linear combination; dicodon statistic; DP

GlimmerM (22), [email protected]: small eukaryotes, Arabidopsis, rice; 3 IMM for exons (order 0 to 8), 1 IMM for introns, 2nd order MM for splice sites; DP

GRAIL/GAP3 (110, 111), http://compbio.ornl.gov/Grail-bin/EmptyGrailForm: Human, Mouse, Arabidopsis, Drosophila, E. coli; NN; DP; EST, cDNA


Program (reference), URL: organism; gene elements; gene model; database similarity (continued)

GRPL (112): Human, Drosophila, Arabidopsis; reference point logistic for splice sites, 5th order MM (homogeneous for introns, 3-periodic for exons); GHMM, DP; protein

HMMgene (113), http://www.cbs.dtu.dk/services/HMMgene/: Vertebrates, C. elegans; 3-periodic 4th order MM for exons, 3rd order MM for introns; CHMM

Morgan (114), http://www.cs.jhu.edu/labs/compbio/morgan.html: Vertebrates; decision tree system; DP

MZEF (115), http://sciclio.cshl.org/genefinder/: Human, Mouse, Arabidopsis, Fission Yeast; quadratic discriminant analysis; No

SORFIND (116): matrix method for start and splice sites, hexamer usage (Fourier measure); No

Twinscan (117): Mouse, human; Genscan's method, 5th order MM for UTR and intergenic, WAM for acceptor sites; GHMM; genomic sequence

VEIL (118), http://www.cs.jhu.edu/labs/compbio/veil.html: Vertebrates; HMM; DP

Xpound (119), http://bioweb.pasteur.fr/seqanal/interfaces/xpound-simple.html: Human; 3-periodic 1st order MM for exons, 1st order MM for introns and intergenic; HMM
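Many of the programs above list DP (dynamic programming) as the way candidate exons are assembled into a gene model. The core idea can be sketched as picking the highest-scoring chain of non-overlapping candidate exons. This is a deliberately reduced illustration: it ignores reading-frame compatibility, strand, and start/stop constraints, and it is not the algorithm of any specific program in the table. The function name and the tuple layout (start, end, score), with end-exclusive coordinates, are assumptions.

```python
from bisect import bisect_right


def best_exon_chain(candidates):
    """Pick the best-scoring set of non-overlapping candidate exons.

    candidates: list of (start, end, score), end-exclusive coordinates.
    Returns (total_score, chain) where chain is sorted by position.
    """
    exons = sorted(candidates, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] holds the optimum over the first i exons (by end position).
    best = [(0.0, [])]
    for i, (start, end, s) in enumerate(exons):
        # j = how many of the first i exons end at or before this start
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + s, best[j][1] + [(start, end, s)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]
```

Because every exon is examined once and its best predecessor is found by binary search, the assembly runs in O(n log n) time for n candidates, which is why DP is practical even with many overlapping exon predictions.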


The spliced alignment software

Program (reference), URL: organism; databank; alignment; gene reconstruction

AAT (120), http://genome.cs.mtu.edu/aat.html: Primates, rodents, other; cDNA, protein; DDS (improved BLASTX), DPS (improved BLASTN); NAP, GAP2

ALN (121): protein; tron code; PAM 250

CEM (122): 2 genomic seq; BLASTX output; WMM for sites; DP

EbEST (42), http://ares.ifrc.mcw.edu/EBEST/ebest.html: Human/other; dbEST; BLASTN, EST clustering, Smith-Waterman based gapped alignment; 3’UTR detection, assembly of EST-tagged exons

EST_genome (123): EST or cDNA; preferably BLASTN output; modified Smith-Waterman, Needleman-Wunsch algorithm; No

GeneSeqer (124, 125), http://gremlin1.zool.iastate.edu/cgi-bin/gs.cgi: Arabidopsis, Maize, generic plant; dbEST or your database, plus proteins; spliced alignment, splice recognition with SplicePredictor if missing EST match; Yes

GeneWise: 1 protein or a HMM profile; DP (Dynamite)

GENQUEST, http://compbio.ornl.gov/Grail-bin/EmptyGenquestForm: dbEST, SwissProt, Prosite, BLOCKS, GSDB; Smith-Waterman, Blast, Fasta


Program (reference), URL: organism; databank; alignment; gene reconstruction (continued)

ICE (126), http://theory.lcs.mit.edu/ice: dbEST, OWL; lookup; DP

INFO (127): Nr; 25-mer lookup table, protein/protein alignments scored with PAM40, PAM120, PAM250, BLOSUM62; No

ORFgene2 (128), http://l25.itba.mi.cnr.it/~webgene/wwworfgene2.html: Human, Mouse, Drosophila, Aspergillus, Arabidopsis, Caenorhabditis; SwissProt; BlastP, WAM for splice sites, identity score on frequencies of dipeptides; compatibility graph, DP

PredictGenes, http://cbrg.inf.ethz.ch/: SwissProt; PAM 250; DP

PROCRUSTES (39), http://wwwhto.usc.edu/software/procrustes/wwwserv.html: 1 homologous protein; protein/protein alignments scored with PAM 120; DP

ROSETTA (129), http://plover.lcs.mit.edu/cgi-bin/rosetta.cgi/: 2 genomic sequences; GLASS (Global Alignment SyStem), PAM20, Genscan method for splice sites; DP

SYNCOD (130), http://l25.itba.mi.cnr.it/~webgene/wwwsyncod.html: Human, Mouse, Drosophila, Arabidopsis, Aspergillus, Caenorhabditis; BLASTN output; silent/replacement ratio, Monte Carlo simulations; No

TAP (131), http://sapiens.wustl.edu/~zkan/TAP/: Human, Mouse; dbEST; WU-BLASTN, SIM4; Yes

Utopia (132): 2 genome sequences


Literature on eukaryote gene prediction

Mathé C, Sagot MF, Schiex T and Rouzé P (2002) Current methods of gene prediction, their strengths and weaknesses. Nucl Acids Res 30:4103-4117

Zhang M Q (2002) Computational prediction of eukaryotic protein-coding genes. Nature Rev. Genet. 3: 698-709


1. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860-921.

2. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796-815

3. Goff, S. A., Ricke, D., Lan, T. H., Presting, G., Wang, R., Dunn, M., Glazebrook, J., Sessions, A., Oeller, P., Varma, H., et al.(2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92-100.

4. Myers, E., Sutton, G., Delcher, A., Dew, I., Fasulo, D., Flanigan, M., Kravitz, S., Mobarry, C., Reinert, K., Remington, K., et al. (2000) A whole-genome assembly of Drosophila. Science 287, 2196-2204

5. Claverie, J. M., Poirot, O. and Lopez, F. (1997) The difficulty of identifying genes in anonymous vertebrate sequences. Comput. Chem. 21, 203-214

6. Cho, Y. and Walbot, V. (2001) Computational methods for gene annotation: the Arabidopsis genome. Curr Opin Biotechnol 12, 126-130

7. Borodovsky, M., Rudd, K. E. and Koonin, E. V. (1994) Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res. 22, 4756-4767

8. Fickett, J. W. (1996) The gene identification problem: an overview for developer. Comput. Chem 20, 103-118

9. Rouzé, P., Pavy, N. and Rombauts, S. (1999) Genome annotation: which tools do we have for it ? Current Opinion in Plant Biology 2, 90-95

10. Fickett, J. W. (1996) Finding genes by computer: the state of the art. Trends genet., 316-320

11. Claverie, J. M. (1997) Computational methods for the identification of genes in vertebrate genomes sequences. Human Molecular Genetics 6, 1735-1744

12. Guigó, R. (1997) Computational gene identification: an open problem. Comput. Chem. 21, 215-222

13. Haussler, D. (1998) Computational genefinding. Trends in Biotechnology, 12-15

14. Burge, C. and Karlin, S. (1998) Finding the genes in genomic DNA. Current Opinion in Structural Biology 8, 346-354

15. Burset, M. and Guigó, R. (1996) Evaluation of gene structure prediction programs. Genomics 34, 353-367

16. Rogic, S., Mackworth, A. and Ouellette, F. (2001) Evaluation of gene-finding programs on mammalian sequences. Genome Res. 11, 817-832

17. Pavy, N., Rombauts, S., Déhais, P., Mathé, C., Ramana, D. V. V., Leroy, P. and Rouzé, P. (1999) Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15, 887-899

18. Mignone, F., Gissi, C., Liuni, S. and Pesole, G. (2002) Untranslated regions of mRNAs. Genome Biol 3

19. Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85, 2444-2448

20. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403-410

21. Bailey, L. C., Searls, D. B. and Overton, G. C. (1998) Analysis of EST-driven gene annotation in human genomic sequence. Genome Res. 8, 362-376

22. Fickett, J. W. (1995) ORFs and Genes: How Strong a Connection ? J. Comput. Biol. 2, 117-123


23. Fickett, J. W. and Tung, C. S. (1992) Assessment of protein coding measures. Nucleic Acids Res. 20, 6441-6450

24. Hutchinson, G. B. and Hayden, M. R. (1992) The prediction of exons through an analysis of spliceable open reading frames. Nucleic Acids Res. 20, 3453-3462

25. Milanesi, L., Kolchanov, N. A., Rogozin, I. B., Ischenko, I. V., Kel, A. E., Orlov, Y. L., Ponomarenko, M. P. and Vezzoni, P. (1993) GenView: a computing tool for protein-coding regions prediction in nucleotide sequences. In "Proceedings of the Second International Conference on Bioinformatics, Supercomputing and Complex Genome Analysis" (Lim, H. A., Fickett, J. W., Cantor, C. R. and Robbins, R. J., eds) pp. 573-588, World Scientific Publishing, Singapore

26. Zhang, M. Q. (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis [published erratum appears in Proc Natl Acad Sci U S A 1997 May 13;94(10):5495]. Proc. Natl. Acad. Sci. U.S.A. 94, 565-568

27. Snyder, E. E. and Stormo, G. D. (1995) Identification of protein coding regions in genomic DNA. J. Mol. Biol. 248, 1-18

28. Solovyev, V. and Salamov, A. (1997) The Gene-Finder computer tools for analysis of human and model organisms genome sequences. In The Fifth International Conference on Intelligent Systems for Molecular Biology (Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. and Valencia, A., eds) pp. 294-302, AAAI Press, Halkidiki, Greece

29. Borodovsky, M. and McIninch, J. (1993) GENMARK: parallel gene recognition for both DNA strands. Comput. Chem. 17, 123-133

30. Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94

31. Schiex, T., Moisan, A. and Rouzé, P. (2001) EuGène: an eukaryotic gene finder that combines several sources of evidence. In First International Conference on Biology, Informatics, and Mathematics, JOBIM 2000 (Gascuel , O. and Sagot, M.-F., eds) Vol. 2006, Lecture Notes in Computer Science. Springer-Verlag

32. Salzberg, S., Delcher, A., Kasif, S. and White, O. (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 26, 544-548.

33. Salzberg, S. L., Pertea, M., Delcher, A. L., Gardner, M. J. and Tettelin, H. (1999) Interpolated Markov Models for Eukaryotic Gene Finding. Genomics 59, 24-31

34. Delcher, A. L., Harmon, D., Kasif, S., White, O. and Salzberg, S. L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27, 4636-4641

35. Fields, C. A. and Soderlund, C. A. (1990) gm: a practical tool for automating DNA sequence analysis. Comput. Appl. Biosc. 6, 263-270

36. Bernardi, G. (1989) The isochore organization of the human genome. Annu. Rev. Genet. 23, 637-661

37. Montero, L. M., Salinas, J., Matassi, G. and Bernardi, G. (1990) Gene distribution and isochore organization in the nuclear genome of plants. Nucleic Acids Res 18, 1859-1867

38. Duret, L., Mouchiroud, D. and Gautier, C. (1995) Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores. J. Mol. Evol. 40, 308-317

39. Rogozin, I. B. and Milanesi, L. (1997) Analysis of donor splice signals in different organisms. J. Mol. Evol. 45, 50-59

40. Kleffe, J., Hermann, K., Vahrson, W., Wittig, B. and Brendel, V. (1996) Logitlinear models for the prediction of splice sites in plant pre-mRNA sequences. Nucleic Acids Res. 24, 4709-4718


41. Brunak, S., Engelbrecht, J. and Knudsen, S. (1991) Prediction of human mRNA donor and acceptor sites from the DNA sequence. J. Mol. Biol. 220, 49-65

42. Hebsgaard, S. M., Korning, P. G., Tolstrup, N., Engelbrecht, J., Rouzé, P. and Brunak, S. (1996) Splice site prediction in Arabidopsis thaliana pre mRNA by combining local and global sequence information. Nucleic Acids Res. 24, 3439-3452.

43. Tolstrup, N., Rouzé, P. and Brunak, S. (1997) A Branch Point Consensus From Arabidopsis Found By Non Circular Analysis Allows For Better Prediction of Acceptor Sites. Nucleic Acids Res. 25, 3159-3163.

44. Reese, M. G., Eeckman, F. H., Kulp, D. and Haussler, D. (1997) Improved splice site detection in Genie. In First Annual International Conference on Computational Molecular Biology (RECOMB), ACM Press, New York., Santa Fe, NM

45. Zhang, M. Q. and Marr, T. G. (1993) A weight array method for splicing signal analysis. Comput. Appl. Biosci. 9, 499-509

46. Salzberg, S. L. (1997) A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput. Appl. Biosci. 13, 365-376

47. Henderson, J., Salzberg, S. and Fasman, K. (1997) Finding Genes in Human DNA with a Hidden Markov Model. J. Comput. Biol. 4, 127-141.

48. Salzberg, S., Delcher, A., Fasman, K. and Henderson, J. (1998) A Decision Tree System for Finding Genes in DNA. J. Comput. Biol. 5, 667-680

49. Rabiner, L. R. (1989) A tutorial on Hidden Markov models and Selected Applications for Speech Recognition. Proceedings of the IEEE 77, 257-285

50. Krogh, A. (1998) An Introduction to Hidden Markov Models for Biological Sequences. In Computational Methods in Molecular Biology (Salzberg, S. L., Searls, D. B. and Kasif, S., eds) pp. 46-63, Elsevier

51. Patterson, D. J., Yasuhara, K. and Ruzzo, W. L. (2002) Pre-mRNA Secondary Structure Prediction Aids Splice Site Prediction. In Pacific Symposium on Biocomputing (Altman, R. B., Dunker, A. K., Hunter, L., Lauderdale, K. and Klein, T. E., eds) Vol. 7 pp. 223-234, Hawaii, U.S.A.

52. Ohler, U. and Niemann, H. (2001) Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet. 17, 56-60

53. Pedersen, A. G., Baldi, P., Chauvin, Y. and Brunak, S. (1999) The biology of eukaryotic promoter prediction - a review. Computer & Chemistry (informatics and the genome issue) 23, 191-207

54. Pedersen, A. G. and Nielsen, H. (1997) Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and genome analysis. In The Fifth International Conference on Intelligent Systems for Molecular Biology (Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. and Valencia, A., eds) pp. 226-233, AAAI Press, Halkidiki, Greece

55. Zien, A., Ratsch, G., Mika, S., Scholkopf, B., Lengauer, T. and Muller, K. (2000) Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16, 799-807

56. Nishikawa, T., Ota, T. and Isogai , T. (2000) Prediction whether a human cDNA sequence contains initiation codon by combining statistical information and similarity with protein sequences. Bioinformatics 16, 960-967

57. Hatzigeorgiou, A. G. (2002) Translation initiation start prediction in human cDNAs with high accuracy. Bioinformatics 18, 343-350.

58. Gelfand, M. S. (1990) Computer prediction of the exon-intron structure of mammalian pre-mRNAs. Nucleic Acids Res. 18, 5865-5869

59. Gelfand, M. S., Mironov, A. A. and Pevzner, P. A. (1996) Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci. U.S.A. 93, 9061-9066


60. Birney, E. and Durbin, R. (1997) Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. Proc Int Conf Intell Syst Mol Biol 5, 56-64

61. Rogozin, I. B., Milanesi, L. and Kolchanov, N. A. (1996) Gene structure prediction using information on homologous protein sequence. Comput. Applic. Biosci. 12, 161-170.

62. Gotoh, O. (2000) Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps. Bioinformatics 16, 190-202

63. Laub, M. T. and Smith, D. W. (1998) Finding Intron/Exon Splice Junctions Using INFO, INterruption Finder and Organizer. J. Comput.Biol. 5, 307-321

64. Pachter, L., Batzoglou, S., Spitkovsky, V. I., Banks, E., Lander, E. S., Kleitman, D. J. and Berger, B. (1999) A Dictionary-Based Approach for Gene Annotation. Journal of Computational Biology 6, 419-430

65. Thayer, E., Bystroff, C. and Baker, D. (2000) Detection of protein coding sequences using a mixture model for local protein amino acid sequence. Journal of Computational Biology 7, 317-327

66. Huang, X., Adams, M. D., Zhou, H. and Kerlavage, A. R. (1997) A tool for analyzing and annotating genomic sequences. Genomics 46, 37-45

67. Usuka, J. and Brendel, V. (2000) Gene structure prediction by spliced alignment of genomic DNA with protein sequences: Increased accuracy by differential splice site scoring. Journal of Molecular Biology 297, 1075-1085

68. Usuka, J., Zhu, W. and Brendel, V. (2000) Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 16, 203-211

69. Florea, L., Hartzell, G., Zhang, Z., Rubin, G. M. and Miller, W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Research 8, 967-974

70. Wheelan, S. J., Church, D. M. and Ostell, J. M. (2001) Spidey: a tool for mRNA-to-genomic alignments. Genome Res 11, 1952-1957.

71. Fukunishi, Y., Suzuki, H., Yoshino, M., Konno, H. and Hayashizaki, Y. (1999) Prediction of human cDNA from its homologous mouse full-length cDNA and human shotgun database. FEBS Lett 464, 129-132

72. Rogozin, I. B., D'Angelo, D. and Milanesi, L. (1999) Protein-coding regions prediction combining similarity searches and conservative evolutionary properties of protein-coding sequences. Gene 226, 129-137

73. Jiang, J. and Jacob, H. J. (1998) EbEST: An Automated Tool Using Expressed Sequence Tags to Delineate Gene Structure. Genome Res. 8

74. Mott, R. (1997) EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Applic. Biosci. 13, 477-478

75. Kan, Z., Rouchka, E. C., Gish, W. R. and States, D. J. (2001) Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Research 11, 889-900

76. Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O. and Salzberg, S. L. (1999) Alignment of whole genomes. Nucleic Acids Res 27, 2369-2376.

77. Kent, W. J. and Zahler, A. M. (2000) Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. Genome Res 10, 1115-1125.


78. Schwartz, S., Zhang, Z., Frazer, K., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R. and Miller, W. (2000) PipMaker--a web server for aligning two genomic DNA sequences. Genome Research 10, 577-586

79. Morgenstern, B. (2000) A space-efficient algorithm for aligning large genomic sequences. Bioinformatics 16, 948-949

80. Batzoglou, S., Pachter, L., Mesirov, J., Berger, B. and Lander, E. S. (2000) Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Research 10, 950-958

81. Bafna, V. and Huson, D. (2000) The conserved exon method for gene finding. In Eighth International Conference on Intelligent Systems for Molecular Biology (Bourne, P., Gribskov, M., Altman, R., Jensen, N., Hope, D., Lengauer, T., Mitchell, J., Scheeff, E., Smith, C., Strande, S. and Weissig, H., eds), AAAI Press, San Diego, California (USA)

82. Blayo, P., Rouzé, P. and Sagot, M.-F. (2001) Orphan gene finding- An exon assembly approach. Theor. Comput. Sci., in press

83. Wiehe, T., Gebauer-Jung, S., Mitchell-Olds, T. and Guigo, R. (2001) SGP-1: prediction and validation of homologous genes based on sequence alignments. Genome Res 11, 1574-1583.

84. Novichkov, P. S., Gelfand, M. S. and Mironov, A. A. (2001) Gene recognition in eukaryotic DNA by comparison of genomic sequences. Bioinformatics 17, 1011-1018.

85. Jurka, J., Klonowski, P., Dagman, V. and Pelton, P. (1996) CENSOR: a program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem. 20, 119-121

86. Roytberg, M. A., Astakhova, T. V. and Gelfand, M. S. (1997) Combinatorial approaches to gene recognition. Comput. Chem. 21, 229-235

87. Guigó, R. (1998) Assembling Genes from Predicted Exons in Linear Time with Dynamic Programming. J. Comput. Biol. 5, 681-702

88. Guigó, R., Knudsen, S., Drake, N. and Smith, T. (1992) Prediction of gene structure. J. Mol. Biol. 226, 141-157

89. Xu, Y., Mural, R. J. and Uberbacher, E. C. (1994) Constructing gene models from accurately predicted exons: an application of dynamic programming. Comput. Appl. Biosci. 10, 613-623

90. Chuang, J. S. and Roth, D. (2001) Gene recognition based on DAG shortest paths. Bioinformatics 1, 1-9

91. Kleffe, J., Hermann, K., Vahrson, W., Wittig, B. and Brendel, V. (1998) GeneGenerator--a flexible algorithm for gene prediction and its application to maize sequences. Bioinformatics, 232-243

92. Viterbi, A. (1967) Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Informat. Theory IT-13, 260-269

93. Bellman, R. E. (1957) Dynamic Programming, Princeton Univ. Press, Princeton, New Jersey

94. Krogh, A., Mian, I. S. and Haussler, D. (1994) A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 22, 4768-4778

95. Kulp, D., Haussler, D., Reese, M. G. and Eeckman, F. H. (1996) A generalized Hidden Markov Model for the recognition of human genes in DNA. In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, (States, D. J., Agarwal, P., Gaasterland, T., Hunter, L. and Smith, R. F., eds), AAAI Press, St. Louis, MO, U.S.A.

96. Lukashin, A. V. and Borodovsky, M. (1998) GeneMark.hmm: New solutions for gene finding. Nucleic Acids Res. 26, 1107-1115


97. Hooper, P., Zhang, H. and Wishart, D. (2000) Prediction of genetic structure in eukaryotic DNA using reference point logistic regression and sequence alignment. Bioinformatics 16, 425-438

98. Krogh, A. (1997) Two methods for improving performance of a HMM and their application for gene finding. In The Fifth International Conference on Intelligent Systems for Molecular Biology (Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. and Valencia, A., eds) pp. 179-186, AAAI Press, Halkidiki, Greece

99. Yeh, R.-F., Lim, L. P. and Burge, C. B. (2001) Computational inference of homologous gene structures in the human genome. Genome Research 11, 803-816

100. Korf, I., Flicek, P., Duan, D. and Brent, M. R. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics 17, S140-S148

101. Murakami, K. and Takagi, T. (1998) Gene recognition by combination of several gene-finding programs. Bioinformatics 14, 665-675

102. Solovyev, V. V. and Salamov, A. A. (1999) INFOGENE: a database of known gene structures and predicted genes and proteins in sequences of genome sequencing projects. Nucleic Acids Res. 27, 248-250

103. Pavlovic, V., Garg, A. and Kasif, S. (2002) A Bayesian framework for combining gene predictions. Bioinformatics 18, 19-27.

104. Tabaska, J., Davuluri, R. and Zhang, M. (2001) Identifying the 3'-terminal exon in human DNA. Bioinformatics 17, 602-607.

105. Davuluri, R. V., Grosse, I. and Zhang, M. Q. (2001) Computational identification of promoters and first exons in the human genome. Nat Genet 29, 412-417.

106. Down, T. A. and Hubbard, T. J. (2002) Computational detection and location of transcription start sites in Mammalian genomic DNA. Genome Res 12, 458-461.

107. Graber, J. H., Cantor, C. R., Mohr, S. C. and Smith, T. F. (1999) In silico detection of control signals: mRNA 3'-end-processing sequences in diverse species. Proc. Natl. Acad. Sci. U. S. A. 96, 14055-14060

108. Guigó, R., Agarwal, P., Abril, J., Burset, M. and Fickett, J. (2000) An assessment of gene prediction accuracy in large DNA sequences. Genome Research 10, 1631-1642

109. Nobile, C., Marchi, J., Nigro, V., Roberts, R. G. and Danieli, G. A. (1997) Exon-intron organization of the human dystrophin gene. Genomics 45, 421-424

110. Duret, L., Dorkeld, F. and Gautier, C. (1993) Strong conservation of vertebrate non-coding sequences during vertebrate evolution: potential involvement in post-transcriptional regulation of gene expression. Nucleic Acids Res. 21, 2315-2322

111. Quesada, V., Ponce, M. R. and Micol, J. L. (1999) OTC and AUL1, two convergent and overlapping genes in the nuclear genome of Arabidopsis thaliana. FEBS Lett. 461, 101-106

112. Henikoff, S., Keene, M. A., Fechtel, K. and Fristrom, J. W. (1986) Gene within a gene: nested Drosophila genes encode unrelated proteins on opposite DNA strands. Cell 44, 33-42

113. Leader, D. J., Clark, G. P., Watters, J., Beven, A. F., Shaw, P. J. and Brown, J. W. (1997) Clusters of multiple different small nucleolar RNA genes in plants are expressed as and processed from polycistronic pre-snoRNAs. Embo J 16, 5742-5751.

114. Blumenthal, T. (1998) Gene clusters and polycistronic transcription in eukaryotes. Bioessays 20, 480-487


115. Mironov, A. A., Novichkov, P. S. and Gelfand, M. S. (2001) Pro-Frame: similarity-based gene recognition in eukaryotic DNA sequences with errors. Bioinformatics 17, 13-15

116. Fichant, G. A. and Quentin, Y. (1995) A frameshift error detection algorithm for DNA sequencing projects. Nucleic Acids Res. 23, 2900-2908

117. Salanoubat, M., Genin, S., Artiguenave, F., Gouzy, J., Mangenot, S., Arlat, M., Billault, A., Brottier, P., Camus, J. C., Cattolico, L., et al. (2002) Genome sequence of the plant pathogen Ralstonia solanacearum. Nature 415, 497-502

118. Iseli, C., Jongeneel, C. V. and Bucher, P. (1999) ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc Int Conf Intell Syst Mol Biol, 138-148.

119. Klein, M., Pieri, I., Uhlmann, F., Pfizenmaier, K. and Eisel, U. (1998) Cloning and characterization of promoter and 5'-UTR of the NMDA receptor subunit epsilon 2: evidence for alternative splicing of 5'-non-coding exon. Gene 208, 259-269

120. Sharp, P. A. and Burge, C. B. (1997) Classification of introns: U2-type or U12-type. Cell 91, 875-879

121. Burset, M., Seledtsov, I. and Solovyev, V. (2000) Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res. 28, 4364-4375

122. Hanke, J., Brett, D., Zastrow, I., Aydin, A., Delbrück, S., Lehmann, G., Luft, F., Reich, J. and Bork, P. (1999) Alternative splicing of human genes: more the rule than the exception? Trends in Genetics 15, 389-390

123. Mironov, A. A., Fickett, J. W. and Gelfand, M. S. (1999) Frequent Alternative Splicing of Human Genes. Genome Res. 9, 1288-1293

124. Croft, L., Schandorff, S., Clark, F., Burrage, K., Arctander, P. and Mattick, J. (2000) ISIS, the intron information system, reveals the high frequency of alternative splicing in the human genome. Nature Genetics 24, 340-341

125. Modrek, B., Resch, A., Grasso, C. and Lee, C. (2001) Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res 29, 2850-2859

126. Hastings, M. L. and Krainer, A. R. (2001) Pre-mRNA splicing in the new millennium. Curr Opin Cell Biol 13, 302-309.

127. Gautheret, D., Poirot, O., Lopez, F., Audic, S. and Claverie, J.M. (1998) Alternate polyadenylation in human mRNAs: A large-scale analysis by EST clustering. Genome Res. 8, 524-530

128. Kozak, M. (1999) Initiation of translation in prokaryotes and eukaryotes. Gene 234, 187-208

129. Riechmann, J. L., Toshiro, I. and Meyerowitz, E. (1999) Non-AUG Initiation of AGAMOUS mRNA Translation in Arabidopsis thaliana. Mol. Cell. Biol. 19, 8505-8512

130. Audic, S. and Claverie, J.-M. (1998) Self-identification of protein-coding regions in microbial genomes. Proc. Natl. Acad. Sci. U.S.A. 95, 10026-10031

131. Besemer, J. and Borodovsky, M. (1999) Heuristic approach to deriving models for gene finding. Nucleic Acids Res. 27, 3911-3920

132. Médigue, C., Rouxel, T., Vigier, P., Hénaut, A. and Danchin, A. (1991) Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222, 851-856

133. Mathé, C., Peresetsky, A., Déhais, P., Van Montagu, M. and Rouzé, P. (1999) Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction. J. Mol. Biol. 285, 1977-1991


134. Borodovsky, M., McIninch, J. D., Koonin, E. V., Rudd, K. E., Médigue, C. and Danchin, A. (1995) Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Res. 23, 3554-3562

135. Hayes, W. S. and Borodovsky, M. (1998) How to Interpret an Anonymous Bacterial Genome: Machine Learning Approach to Gene Identification. Genome Res. 8, 1154-1171

136. Besemer, J., Lomsadze, A. and Borodovsky, M. (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29, 2607-2618.

137. Mathé, C., Déhais, P., Pavy, N., Rombauts, S., Van Montagu, M. and Rouzé, P. (2000) Gene prediction and gene classes in Arabidopsis thaliana. J. Biotechnol. 78, 293-299

138. Pennisi, E. (1999) Keeping Genome Databases Clean and Up to Date. Science 286, 447-450

139. Smith, T. F. (1998) Functional genomics--bioinformatics is ready for the challenge. Trends Genet 14, 291-293

140. The Gene Ontology Consortium (2001) Creating the Gene Ontology Resource: Design and Implementation. Genome Research 11, 1425-1433

141. Brazma, A. (2001) On the importance of standardisation in life sciences. Bioinformatics 17, 113-114

142. Miller, W. (2001) Comparison of genomic DNA sequences: solved and unsolved problems. Bioinformatics 17, 391-397

143. Makalowski, W. (2000) Genomic scrap yard: how genomes utilize all that junk. Gene 259, 61-67.

144. Bergman, C. and Kreitman, M. (2001) Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res. 11, 1335-1345

145. Eddy, S. R. (1999) Noncoding RNA genes. Current Opinion in Genetics and Development 9, 695-699

146. Erdmann, V., Szymanski, M., Hochberg, A., Groot, N. and Barciszewski, J. (2000) Non-coding, mRNA-like RNAs database Y2K. Nucleic Acids Res. 28, 197-200

147. Rivas, E. and Eddy, S. (2000) Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics 16, 583-605

148. Pertea, M., Lin, X. and Salzberg, S. (2001) GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Research, 29, 1185-1190.

149. Brendel, V., Kleffe, J., Carle Urioste, J. C. and Walbot, V. (1998) Prediction of splice sites in plant pre-mRNA from sequence properties. J. Mol. Biol. 276, 85-104

150. Dong, S. and Searls, D. B. (1994) Gene Structure Prediction by Linguistic Methods. Genomics 23, 540-551

151. Xu, Y. X. and Uberbacher, E. C. (1997) Automated Gene Identification in Large-Scale Genomic Sequences. J. Comput. Biol. 4, 325-338

152. Thomas, A. and Skolnick, M. H. (1994) A probabilistic model for detecting coding regions in DNA sequences. IMA Journal of Mathematics Applied in Medicine and Biology 11, 149-160


The following additional slides were not part of the course given in Brussels. They are included for those who would like to go further on their own.


How does it work? 1. Coding sequence: codon usage, Markov models


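Genetic Code

1st base | 2nd base U | 2nd base C | 2nd base A | 2nd base G | 3rd base
U | Phenylalanine | Serine | Tyrosine | Cysteine | U
U | Phenylalanine | Serine | Tyrosine | Cysteine | C
U | Leucine | Serine | Stop | Stop | A
U | Leucine | Serine | Stop | Tryptophan | G
C | Leucine | Proline | Histidine | Arginine | U
C | Leucine | Proline | Histidine | Arginine | C
C | Leucine | Proline | Glutamine | Arginine | A
C | Leucine | Proline | Glutamine | Arginine | G
A | Isoleucine | Threonine | Asparagine | Serine | U
A | Isoleucine | Threonine | Asparagine | Serine | C
A | Isoleucine | Threonine | Lysine | Arginine | A
A | Methionine | Threonine | Lysine | Arginine | G
G | Valine | Alanine | Aspartate | Glycine | U
G | Valine | Alanine | Aspartate | Glycine | C
G | Valine | Alanine | Glutamate | Glycine | A
G | Valine | Alanine | Glutamate | Glycine | G

Pyrimidines (Y): U, C. Purines (R): A, G.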
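The genetic code lookup can be sketched in a few lines of Python. This is only an illustration of the standard genetic code: the names `CODON_TABLE` and `translate` are made up for this example, and the one-letter amino acid string follows the usual U, C, A, G ordering of the three codon positions.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids for the standard genetic code, in UCAG order
# for the 1st, 2nd and 3rd base; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate an in-frame RNA or DNA sequence up to the first stop codon."""
    rna = sequence.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For instance, `translate("AUGUUUUAA")` reads the codons AUG, UUU and UAA, and stops at UAA.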


Codon Usage and Gene Classes

  • Escherichia coli

  • 3 gene classes (Médigue et al., 1991)

    • class 1: low or moderate expression

    • class 2: high constitutive expression

    • class 3: horizontally transferred genes

      This has an impact on gene prediction: learning sets have to be built for each class.

      But what about the eukaryotes?

Arabidopsis?
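Building a learning set per class amounts to pooling codon counts over the genes assigned to that class. A minimal Python sketch, assuming in-frame coding sequences given as plain strings (the function name `class_codon_usage` and its input format are illustrative, not taken from the course):

```python
from collections import Counter


def class_codon_usage(coding_sequences):
    """Relative codon frequencies pooled over one gene class.

    coding_sequences: iterable of in-frame DNA strings (hypothetical input).
    Trailing bases that do not fill a complete codon are ignored.
    """
    counts = Counter()
    for cds in coding_sequences:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3
        counts.update(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}
```

One such table per class (for example one for each of the three E. coli classes) could then serve as the codon usage model for that class.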


Arabidopsis Codon Usage: Principal Component Analysis

[Figure: principal component analysis plot; axes: first principal component (68%) and second principal component (7%).]


Two classes of codon usage

[Figure: principal component analysis of the weighted relative frequencies of codons, with each codon plotted against the first and second principal components.]

(Mathé et al., 1999, J. Mol. Biol. 285: 1977-1991)


Relative contribution of codons

[Figure: relative contribution of codons; axes: first principal component (68%) and second principal component (7%).]


Codon Usage for the two A. thaliana Classes


Which genes in each class?

CU1

  • DNA metabolism

  • signal transduction: phosphatases, kinases...

  • mitochondrial and chloroplastic proteins

CU2

  • ribosomal proteins

  • photosynthesis

  • AA metabolism

  • other highly expressed genes

Correlation with expression level and prokaryotic origin


Constraints on codon usage

Class | Expression | Codon usage | %(G+C)3 | Length (bp)
CU1 | moderate | T | 41.4 (+/- 4.2) | 1315 (+/- 782)
CU2 | high | C | 49.7 (+/- 5.8) | 986 (+/- 543)

The major constraint comes from translation efficiency.

Are there other constraints?
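The %(G+C)3 value in the table is the G+C content at the third codon position. A short Python sketch of that calculation, assuming an in-frame coding sequence (the function name `gc3_percent` is hypothetical):

```python
def gc3_percent(cds):
    """Percentage of G or C at the third position of each complete codon."""
    cds = cds.upper()
    third_bases = cds[2::3]  # indices 2, 5, 8, ...: one base per complete codon
    if not third_bases:
        return 0.0
    gc = sum(1 for base in third_bases if base in "GC")
    return 100.0 * gc / len(third_bases)
```

Averaging this value over the genes of a class gives a figure comparable to the class-level percentages reported above.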


Codon bias and CDS length


Translation Initiation Codon

eIF-4E binds to the 5' cap

and eIF-4G binds to eIF-4E

eIF-4A unwinds the secondary structures at the 5' end

eIF-4B continues the unwinding

eIF-3 is bound to the 40S subunit

eIF-3 helps the ribosome, associated with the ternary complex, to bind the mRNA

The 40S subunit moves along the mRNA until it reaches the AUG codon

The eIF-5 GTPase allows the large 60S subunit to join

eIF-2 and eIF-3 are released

[Figure: AUG initiation codon in the two classes. CU1: 364 genes; CU2: 268 genes.]


How does it work? 2. Splice sites: a problem of information. NetPlantGene as an example: neural networks, rules, ...


Splicing mechanism


GT/AG splice sites


GC/AG splice sites = 1%
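To see why splice sites are a problem of information, it helps to list every position that merely matches the consensus dinucleotides. The Python sketch below does only that; it is not how NetPlantGene works, and the function name and the `donor_motifs` parameter are assumptions made for this illustration.

```python
def candidate_splice_sites(sequence, donor_motifs=("GT",)):
    """Return 0-based positions of candidate donor and acceptor dinucleotides.

    Donors are matched against donor_motifs (pass ("GT", "GC") to include
    the rarer GC donors); acceptors are matched against AG.
    """
    seq = sequence.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] in donor_motifs]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors
```

Even a 15-nt sequence such as `AAGGTAAGTTTCAGG` yields several candidates of each kind, which is why additional signal and content information is needed to pick the real sites.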


[Figure: donor and acceptor sites for the three intron types (type 0, type 1 and type 2).]


Validation of Gene Prediction

Pavy et al., Bioinformatics, 15:887-899, 1999


Gene Modeling: The Challenge

[Figure: example gene models, two marked "OK" and one marked "?".]

http://pgec-genome.pv.usda.gov


Gene Splitting & Gene Merging

[Figure: prediction versus reality.]

The prediction of exons is good, but are they internal or external?

Problems of prediction when dealing with gene extremities:

  • introns and intergenic regions have the same base composition

  • there are long introns and short intergenic regions

  • untranslated exons are difficult to predict

  • there are few experimental data about promoter sequences and the first ATG


The aim is to allow a realistic evaluation of the performance of individual gene prediction programs, and to analyze their strengths and complementarity.

A proper validation should therefore deal with

* multiple genes on the two DNA strands

* the various levels of prediction : sites, exons, genes

* genome style : Arabidopsis here

* gene borders: ability to distinguish genic regions from intergenic regions

* the effect of gene modeling on further protein database searches and structural genomics


AraSet: the Arabidopsis data set

  • 74 gene contigs, 566014 nt

  • genes per contig: 57 x 2 = 114, 14 x 3 = 42, 3 x 4 = 12

  • 168 genes, 1028 exons, 860 introns

  • 94 intergenic sequences

  • 2010 nt / genic region, 2446 nt / intergenic region, 4456 nt / gene

  • 197 nt / exon, 154 nt / intron


AraSet: how was it built?

1. Visual search of the AGI BAC contigs for several documented genes in a row. Found: 240.

2. Check of individual annotations: every entry with dubious assignments, doubts about intergenic regions, or a redundant gene is discarded. Obviously wrong assignments are corrected.

3. Entries with similarity to genes deposited before January 1997, which may have been used to train the prediction programs, are discarded.

4. The flanking sequences are cut: 2000 nt on both sides for use as program input, 300 nt for output analysis.

5. AraSet is documented and available at http://sphinx.rug.ac.be:8080/biocomp/napav


INTERGENIC SEQUENCES IN ARABIDOPSIS

[Figure: bar chart of intergenic sequence size (in bp) for three categories: 1 promoter + 1 terminator, 2 promoters, 2 terminators. Labels on the chart: 65, 17 and 16 sequences; 11258, 9649, 3372, 396, 179 and 339 bp.]


Distance between Arabidopsis genes

Intergenic sequences:

>> 1.7 kb (+/- 1.5) [100 bp : 6 kb]

>< 761 bp (+/- 774) [32 bp : 2.3 kb]

<> 3.2 kb (+/- 2) [304 bp : 7.1 kb]

4 cases of overlapping genes on opposite strands (3' UTR)


Effect of sequencing errors

Observed decrease in sensitivity when insertions and deletions are randomly introduced into AraSet

Error rate: GenScan, GM.hmm

10^-4: 11

10^-3: 10.2, 5.8

10^-2: 44, 29.9
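The kind of perturbation described here can be sketched as follows. This is a hedged illustration in Python, not the procedure actually used on AraSet: the function name, the 50/50 split between insertions and deletions, and the fixed random seed are all assumptions.

```python
import random


def introduce_indels(sequence, rate, seed=0):
    """Randomly insert or delete single bases at the given per-base rate.

    Assumption for this sketch: each affected position is either deleted
    or preceded by one random inserted base, with equal probability.
    """
    rng = random.Random(seed)
    mutated = []
    for base in sequence:
        if rng.random() < rate:
            if rng.random() < 0.5:
                continue  # deletion: skip this base
            mutated.append(rng.choice("ACGT"))  # insertion before this base
        mutated.append(base)
    return "".join(mutated)
```

Running a gene finder on `introduce_indels(seq, 1e-3)` and on the original `seq`, then comparing sensitivity, would reproduce the type of experiment summarized above.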


Evaluation metrics taking the frame into account

In this example, exons 2.x and 3.2 are correctly predicted. Exons 1.2, 3.1, 4.1 and 4.2 are overlapping, and exon 1.1 is missing. Genes 3 and 4 are merged; gene 5 is split.

The only correct gene model is the one for gene 2.


Sensitivity and specificity

sensitivity: true positives / actual coding

specificity: true positives / predicted as coding

calculated at the nucleotide and exon levels as in Burset and Guigó (1996)

Nucleotide level: Sn = TP / (TP + FN), Sp = TP / (TP + FP)

Exon level: Sne = ce / ae, Spe = ce / pe

Frame-wise, some true positives become false positives according to the frame:

FPf = FP + FPw
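The nucleotide-level and exon-level measures above can be computed as in the following Python sketch. It assumes exons are given as inclusive (start, end) coordinate pairs on one strand and counts an exon as correct only when both boundaries match; the function names are illustrative, and the frame-wise correction (FPw) is not modelled here.

```python
def coding_positions(exons):
    """Set of nucleotide positions covered by inclusive (start, end) exons."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def nucleotide_accuracy(actual_exons, predicted_exons):
    """Sn = TP / (TP + FN) and Sp = TP / (TP + FP) over coding nucleotides."""
    actual = coding_positions(actual_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(actual & predicted)
    fn = len(actual - predicted)
    fp = len(predicted - actual)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


def exon_accuracy(actual_exons, predicted_exons):
    """Sne = ce / ae and Spe = ce / pe, where ce counts exact exon matches."""
    actual = set(actual_exons)
    predicted = set(predicted_exons)
    ce = len(actual & predicted)
    sne = ce / len(actual) if actual else 0.0
    spe = ce / len(predicted) if predicted else 0.0
    return sne, spe
```

With actual exons (1, 10) and (21, 30) and predicted exons (1, 10) and (25, 40), 16 of the 20 coding nucleotides are found, while only one of the two exons is exactly right.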


EVALUATION OF THE PREDICTION OF PROTEIN SEQUENCES

  • How good are the exon predictions and gene models with respect to the encoded protein?

  • Identify nucleotides and exons predicted on the wrong strand or in the wrong frame

  • Compute the performance according to these additional criteria


The longest correctly predicted protein sequence

Efficient protein database search depends not only on the fraction of the protein correctly predicted in a gene, but also on the "patchiness" of the prediction.

One criterion for this is given by the computation of the longest correctly predicted sequence:

lgs = oel_i + ce_(i+1) + ... + ce_n + oer_(n+1)

where ce_(i+1), ..., ce_n are contiguous correct exons, oel_i is the left-most overlapping exon and oer_(n+1) is the right-most overlapping exon.
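One possible reading of this criterion is sketched below in Python. The input format is an assumption made for the example: each actual exon, in order along the gene, is given as a (status, length) pair, where status is "correct", "overlap" or "missing" and length is the number of correctly predicted nucleotides in that exon. The sketch further assumes that the correct part of an overlapping exon is adjacent to the neighbouring run of correct exons.

```python
def longest_correct_sequence(exons):
    """Longest run: optional overlapping exon, contiguous correct exons,
    optional overlapping exon. exons is a list of (status, length) pairs."""
    best = 0
    n = len(exons)
    i = 0
    while i < n:
        status, length = exons[i]
        if status == "correct":
            # Sum the maximal run of contiguous correct exons starting at i.
            j = i
            total = 0
            while j < n and exons[j][0] == "correct":
                total += exons[j][1]
                j += 1
            # Extend with an overlapping exon on the left and on the right.
            if i > 0 and exons[i - 1][0] == "overlap":
                total += exons[i - 1][1]
            if j < n and exons[j][0] == "overlap":
                total += exons[j][1]
            best = max(best, total)
            i = j
        else:
            if status == "overlap":
                best = max(best, length)
            i += 1
    return best
```

A gene with two long correct exons separated by a missed exon scores lower than one where the same exons are contiguous, which is what the "patchiness" criterion is meant to capture.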


EVALUATION OF EXON PREDICTION


Evaluation of some exon prediction programs (1)

Sn = true predicted / actual exons

Sp = true predicted / total predicted

Pavy et al. (1999) Bioinformatics 15 (11): 887-899


Evaluation of some gene prediction programs (1)

Correct gene model = all exons are well predicted

Sn = true predicted / actual genes

Sp = true predicted / total predicted

Pavy et al. (1999) Bioinformatics 15 (11): 887-899
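The gene-level definition above can be sketched in Python as follows, assuming each gene is described by its list of inclusive (start, end) exon coordinates. The function name and data layout are hypothetical; a gene model counts as correct only if its full exon set matches an actual gene exactly.

```python
def gene_level_accuracy(actual_genes, predicted_genes):
    """Sn = correct gene models / actual genes,
    Sp = correct gene models / predicted genes."""
    actual = {tuple(sorted(gene)) for gene in actual_genes}
    predicted = {tuple(sorted(gene)) for gene in predicted_genes}
    correct = len(actual & predicted)
    sn = correct / len(actual) if actual else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp
```

Because a single wrong exon boundary invalidates the whole model, gene-level scores computed this way are expected to be much lower than exon-level scores for the same prediction.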


LONGEST CODING SEQUENCES PREDICTED BY GENSCAN AND GENEMARK.HMM


Gene modeling and exon number


Evaluation of new gene prediction programs (2)

Exon level


Evaluation of some gene prediction programs (2)

Gene level


Take-home message

  • gene finding is improving fast, but is still far from perfect, even for "simple" genomes like Arabidopsis

  • exons are much better predicted than genes

  • gene finding is genome-specific: software has to be adapted and trained for each genome

  • the best software for species A (e.g. GenScan for human) is not necessarily the best for species B


  • An important step forward!

  • Two papers were recently published describing software addressing the "5' gene border" issue:

  • FirstEF: Computational identification of promoters and first exons in the human genome. (2001) Nature Genetics, 29: 412-417, Davuluri R.V., Grosse I. & Zhang M.

  • Eponine: Computational detection and location of transcription start sites in mammalian genomic DNA. (2002) Genome Research, 12: 458-461, Down T. A. & Hubbard T.J.P.


  • There is room left for improvement

  • yet to be addressed :

  • locating alternative gene transcripts :

  • transcription start and stop, splicing

  • locating other important genome elements :

  • SAR/MAR, promoters & enhancers

  • make use of other genomics data, besides sequence (transcriptome, proteome, …)


What to do for species for which NO SOFTWARE has been developed yet?

1. Remember: extrinsic predictions (relying on comparison) are universal. This is especially true when using protein sequences for searching.

2. For nucleic acid sequences, similarities quickly become meaningless as the divergence between the species used for comparison increases.

3. Intrinsic prediction can still be used when the species remains close enough to the model, and if the genome size does not differ too much.

