Gene finding and gene structure prediction

Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI 2008639500 Computational Bioinformatics 2012

Outline: Introduction to genes and proteins. Genetic code. Open reading frames.

Outline: Ab initio methods. Principles: signal detection and coding statistics. Methods to integrate signal detection and coding statistics. Examples of software.

Outline: Homology methods. Principles. An overview of the homology methods.

Introduction to genes and proteins Proteins are the main building blocks for many tasks in living organisms. They are themselves built up as a chain of amino acids (AA), typically 200-300 long. The amino-acid chain of a protein is produced by translation of an RNA sequence (via the ribosome; while translation takes place, the protein folds progressively to take its three-dimensional structure).

Introduction to genes and proteins The RNA sequence needed to produce a given protein is normally obtained by transcribing a part of the DNA contained in the genome (it is then called mRNA), and the corresponding subsequence of DNA is called a gene coding for that protein.

Introduction to genes and proteins A given genome can contain as few as 500 genes or as many as 30,000 genes. Central dogma: DNA → RNA → Protein.

Introduction: gene structure

Genetic code Correspondence between trimers (codons) of nucleotides and amino acids. There are 20 amino acids, but 64 codons (see book, or Internet, for explanations). Some amino acids correspond to several codons: A (Alanine) corresponds to GCA, GCC, GCG, GCT.

Genetic code Some codons do not correspond to an amino acid: TAA, TAG, TGA (these are stop codons, see below). One codon is special: ATG is the sole codon corresponding to Methionine, and is also called the start codon (see below). NB. Although it is RNA that is translated into amino acids, we use the DNA alphabet (T instead of U) to describe the genetic code, because we will directly search DNA sequences for protein-coding sequences.
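The codon-to-amino-acid correspondence described above can be sketched as a lookup table. This is a minimal illustration assuming the standard genetic code written in the DNA alphabet; the names `CODON_TABLE` and `translate` are chosen here for illustration and do not come from the slides.

```python
from itertools import product

# Standard genetic code in the DNA alphabet. Codons are enumerated with the
# bases in the order T, C, A, G at each position; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA sequence codon by codon from position 0.

    Translation stops at the first stop codon; a trailing partial codon
    is ignored. Start-codon detection is deliberately left out here.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Codons for Alanine and the three stop codons, read back from the table.
alanine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "A")
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
```

For example, `translate("ATGGCATAA")` reads ATG (Methionine), GCA (Alanine), then stops at TAA.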

Open reading frames An open reading frame is a sequence of DNA nucleotides that could be translated into a protein. We know that: Translation goes from the 5’ to the 3’ end of a strand (sense or anti-sense). Translation always starts with a methionine codon (ATG).

Open reading frames

Open reading frames Translation always stops as soon as a stop codon is found (and the AA sequence ends with the AA corresponding to the last non-stop codon).
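The rules above (5’ to 3’ reading, ATG start, first in-frame stop ends the ORF) can be sketched as a six-frame scan. This is a minimal sketch, not a gene finder: the function names and the `min_codons` parameter are assumptions made for illustration, and only the first ATG of each ORF is reported.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement, i.e. the other strand read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=1):
    """Scan all six reading frames for ATG ... stop open reading frames.

    Returns tuples (strand, start, end, sequence). Coordinates are 0-based,
    half-open, and relative to the strand that was scanned. min_codons is
    the minimum number of codons before the stop codon.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGGCATAAGG")
```

An ATG that is never followed by an in-frame stop codon is not reported, which is one reason real tools treat sequence ends specially.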

What is gene finding? From a genomic DNA sequence we want to predict the regions that will encode a protein: the genes. Gene finding is about detecting these coding regions and inferring the gene structure, starting from genomic DNA sequences.

What is gene finding? We need to distinguish coding from non-coding regions using properties specific to each type of DNA region.

What is gene finding? Gene finding is not an easy task! DNA sequence signals have low information content (small alphabet and short sequences). It is difficult to discriminate real signals from noise (degenerate and highly unspecific signals). Gene structure can be complex (sparse exons, alternative splicing, ...). DNA signals may vary between organisms. Sequencing errors occur (frameshifts, ...).

Gene structure in prokaryotes High gene density and simple gene structure. Short genes carry little information. Overlapping genes.

Gene structure in eukaryotes Low gene density and complex gene structure. Alternative splicing. Pseudo-genes.

Gene finding strategies Ab initio methods: based on statistical signals within the DNA. Signals: short DNA motifs (promoters, start/stop codons, splice sites, ...). Coding statistics: nucleotide compositional bias in coding and non-coding regions.

Gene finding strategies Strengths of ab initio methods: easy to run, with fast execution time; only the DNA sequence is required as input.

Gene finding strategies Weaknesses of ab initio methods: prior knowledge is required (training sets); high number of mispredicted gene structures.

Gene finding strategies Homology methods: gene structure is deduced using homologous sequences (EST, mRNA, protein). Very accurate results when using homologous sequences with high similarity.

Gene finding strategies Strengths of homology methods: accurate. Weaknesses: good homologous sequences are needed; execution is slow.

Gene finding: Ab initio methods

Ab initio methods: a simple view

Methods for signal detection Detect short DNA motifs (promoters, start/stop codons, splice sites, intron branch point, ...).

Methods for signal detection A number of methods are used for signal detection: Consensus string: based on the most frequently observed residue at each position. Pattern recognition: flexible consensus strings. Weight matrices: based on the observed frequencies of residues at each position. Uses standard alignment algorithms.
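The consensus-string and weight-matrix ideas can be sketched from a small set of aligned sites. This is an illustrative sketch only: the example sites are made up, and the pseudocount, the uniform background of 0.25, and the function names are assumptions rather than anything prescribed by the slides.

```python
import math
from collections import Counter


def column_counts(sites):
    """Count residues at each position of a set of equal-length aligned sites."""
    return [Counter(column) for column in zip(*sites)]


def consensus(sites):
    """Most frequently observed residue at each position."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(sites))


def weight_matrix(sites, background=0.25, pseudocount=1.0):
    """Log-odds weight matrix: log2(observed frequency / background) per position."""
    n = len(sites)
    matrix = []
    for counts in column_counts(sites):
        matrix.append({
            base: math.log2(
                ((counts[base] + pseudocount) / (n + 4 * pseudocount)) / background
            )
            for base in "ACGT"
        })
    return matrix


def score(matrix, window):
    """Sum the position-specific weights; positions are treated as independent."""
    return sum(column[base] for column, base in zip(matrix, window))


def best_hit(matrix, dna):
    """Start index of the highest-scoring window in dna."""
    width = len(matrix)
    return max(
        range(len(dna) - width + 1),
        key=lambda i: score(matrix, dna[i:i + width]),
    )


# Hypothetical aligned sites, invented for this example.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT", "TATATT"]
pwm = weight_matrix(sites)
```

Unlike the consensus string, the matrix still gives partial credit to a window that differs from the consensus at a weakly conserved position.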

Methods for signal detection A number of methods are used for signal detection: Weight array matrices: weight matrices based on dinucleotide frequencies. They take into account the non-independence of adjacent positions in the sites. Maximal dependence decomposition (MDD): starting from an aligned set of signals, MDD generates a model which captures significant dependencies between non-adjacent as well as adjacent positions.
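A weight array matrix can be sketched as a position-specific first-order Markov model, in which each position is scored conditionally on the previous residue. The sketch below rests on assumptions: the training sites are invented, and the pseudocount, the uniform background, and the names `train_wam` and `wam_log_odds` are illustrative choices.

```python
import math


def train_wam(sites, pseudocount=1.0):
    """Estimate P(first residue) and, for each later position, P(residue | previous residue)."""
    width = len(sites[0])
    first_counts = {b: pseudocount for b in "ACGT"}
    for site in sites:
        first_counts[site[0]] += 1
    total = sum(first_counts.values())
    first = {b: c / total for b, c in first_counts.items()}

    conditional = []
    for i in range(1, width):
        counts = {p: {b: pseudocount for b in "ACGT"} for p in "ACGT"}
        for site in sites:
            counts[site[i - 1]][site[i]] += 1  # dinucleotide at positions i-1, i
        conditional.append({
            p: {b: c / sum(row.values()) for b, c in row.items()}
            for p, row in counts.items()
        })
    return first, conditional


def wam_log_odds(model, window, background=0.25):
    """Log2-odds of the window under the WAM versus a uniform background."""
    first, conditional = model
    total = math.log2(first[window[0]] / background)
    for i in range(1, len(window)):
        prob = conditional[i - 1][window[i - 1]][window[i]]
        total += math.log2(prob / background)
    return total


# Hypothetical sites in which positions 2 and 3 are always "AG" or "GA".
sites = ["CAGT", "CGAT", "CAGT", "CGAT"]
wam = train_wam(sites)
```

In this toy set, A and G are equally frequent at both middle positions, so a plain weight matrix cannot tell "CAGT" from "CAAT"; the conditional probabilities can.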

Methods for signal detection Hidden Markov Models (HMMs): HMMs use a probabilistic framework to infer the probability that a sequence corresponds to a real signal. Neural Networks (NNs): NNs are trained with positive and negative examples. NNs "discover" the features that distinguish the two sets.

Methods for signal detection

Signal detection limitations Problems with signal detection: DNA sequence signals have low information content. Signals are highly unspecific and degenerate. It is difficult to distinguish true from false positives. How to improve signal detection: Take context into consideration (e.g. an acceptor site must be flanked by an intron and an exon). Combine with coding statistics (compositional bias).

Types of coding statistics Intergenic regions, introns, and exons have different nucleotide contents. These compositional differences can be used to infer gene structure. Examples of coding statistics: ORF length: assuming a uniform random distribution, a stop codon occurs every 64/3 codons (≈ 21 codons) on average. In coding regions, in-frame stop codons are less frequent, so ORFs are longer than expected by chance.
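The 64/3 figure follows from a short calculation, sketched here under the slide's assumption that all 64 codons are equally likely and independent; the symbols $p$, $L$ and $n$ are introduced only for this derivation.

```latex
% Probability that a random codon is one of the 3 stop codons:
p = \frac{3}{64}

% Number of codons L up to and including the first stop is geometric:
P(L = k) = (1 - p)^{k-1}\, p, \qquad
E[L] = \frac{1}{p} = \frac{64}{3} \approx 21.3

% Probability that n consecutive random codons contain no stop:
P(\text{no stop in } n \text{ codons}) = \left(\frac{61}{64}\right)^{n},
\qquad \left(\frac{61}{64}\right)^{100} \approx 0.008
```

Under this model a stop-free run of 100 codons is already unlikely by chance, which is why long ORFs are taken as evidence of coding potential.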

Types of coding statistics The ORF-length measure is sensitive to frameshift errors and cannot detect short coding regions. Bias in nucleotide content in coding regions: generally, coding regions are G+C rich. There are exceptions! For example, coding regions of P. falciparum are A+T rich.
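The nucleotide-content bias can be measured with a sliding window over the sequence. This is a minimal sketch; the window and step sizes and the function names are arbitrary choices made for illustration.

```python
def gc_content(seq):
    """Fraction of G and C residues in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window=100, step=50):
    """G+C fraction in successive windows, as (start index, fraction) pairs."""
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


profile = sliding_gc("ATATGCGC", window=4, step=4)
```

Whether high or low G+C points to coding sequence depends on the organism, as the P. falciparum example shows, so any threshold has to be calibrated on known genes of the genome under study.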

Integrating signal and compositional information for gene structure prediction A number of methods exist for gene structure prediction which integrate different techniques to detect signals (splice sites, promoters, etc.) and coding statistics. All these methods are classifiers based on machine learning theory. Training sets are required to train the algorithms.
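As a much-reduced illustration of such a trained classifier, the sketch below decodes a two-state HMM (non-coding `N`, coding `C`) with the Viterbi algorithm. Everything numeric here is an assumption: the transition and emission probabilities are invented toy values standing in for parameters that would be estimated from a training set, and the model uses composition only, with no signal sensors or codon structure.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
# Toy parameters (assumed, not trained): states are "sticky", and the
# coding state is G+C rich while the non-coding state is A+T rich.
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "C": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}


def viterbi(seq):
    """Return the most probable state path for seq, one state letter per base."""
    seq = seq.upper()
    if not seq:
        return ""
    # Log-probability of the best path ending in each state at position 0.
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_best, pointer = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            new_best[s] = (
                best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointer[s] = prev
        best = new_best
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


path = viterbi("AT" * 5 + "GC" * 5 + "AT" * 5)
```

The transition probabilities act as a penalty for switching state, so short G+C-rich stretches are not labelled as coding; with these toy values a run of about six or more G/C bases is needed before two switches pay off.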

Ab initio methods: Generalized HMMs

Other ab initio methods HMM-based: GENSCAN, HMMgene. Linear and quadratic discriminant analysis: FGENES, MZEF. Decision trees. Neural networks: GRAIL.

Gene finding: Homology methods

Homology methods: a simple view

Homology methods: Procrustes Procrustes: a robber who altered his victims to fit his bed by stretching them or cutting off their legs (Classical Mythology).

Homology methods: Procrustes

Homology methods: Genewise Uses HMMs to compare a DNA sequence to protein sequences at the level of its conceptual translation, while allowing for sequencing errors and introns. Principle: the exon model used in Genewise is an HMM with 3 base states (match, insert, delete), with additional transitions between states to account for frameshifts. Intron states have been added to the base model. Genewise directly compares HMM profiles of proteins or domains to the gene structure HMM.

Homology methods: Genewise Genewise is a powerful tool, but time-consuming. It requires strong similarity (>70% identity) to produce good predictions. Genewise is part of the Wise2 package: http://www.ebi.ac.uk/Wise2/.

Homology methods: Genewise

Homology methods: sim4 Aligns cDNA to genomic sequences. sim4 performs standard dynamic programming: it models splice sites, and introns are treated as a special kind of gap with low penalties. sim4 performs very well, but needs strong similarity between the sequences.

Homology methods: BLAST BLAST can be used to find genomic sequences similar to proteins, ESTs, and cDNAs. A BLAST hit does not necessarily mean an exon; some post-processing is required. BLAST can indicate the rough position of exons, but says nothing about the gene structure.

Homology methods: BLAST However, BLAST is fast, and can reduce the search space for other programs.

Homology methods: Trimming with BLAST
