Gone Fishing: An Introduction to Gene Finding Methods

Jarek Meller, Biomedical Informatics, CHRF
Additional materials for those who missed the Intro to Functional Genomics course
Introduction to bioinformatics

A couple of definitions:
• A short history of genes: from “hereditary basis for traits” to “one gene – one polypeptide”
• Modern definition of the gene: “a complete chromosomal segment responsible for making a functional product”
• Codon: a triplet of nucleotides encoding an amino acid
• Open Reading Frame (ORF): a string of codons bounded by start and stop signals (codons)
• Pseudogene: a potential gene with an impaired ability to make viable transcription (or translation) product

The Canonical Structure of Eukaryotic Genes:
[Figure: gene schematic reading 5’ Pr, 5’UTR, E1, I1, E2, I2, E3, 3’UTR, polyA 3’, annotated with the signals …TATA…, ATG…, GT… (donor) …AG (acceptor) flanking an intron, and …AATAAA…]
Eukaryotic genes are in general neither contiguous nor continuous: coding regions are typically split into a number of coding fragments (exons), separated by non-coding intervening fragments known as introns.

Motifs and Processes:
• TATA – TBP – transcription initiation
• AATAAA – poly-A polymerase – poly-A tail attachment (pre-mRNA processing)
• GT … AG – spliceosome complex – splicing (pre-mRNA processing)
• ATG – ribosome complex – translation initiation
• TGA, TAA, TAG – ribosome complex – translation termination

From Transcription to pre-mRNA Processing to Splicing to Translation:
[Figure: processed mRNA with 5’ CAP, 5’ UTR, 3’ UTR and poly-A tail (AAA…AA)]

Finding Genes in Prokaryotic Genomes:

   AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
f0 AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAa
f1 aAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
f2 aaAATGGGGGTGGGTGATGAGAGACTTAGATGAATaa

• A simple algorithm – for each reading frame on both strands:
  • Find start (ATG) and stop (TGA, TAG, TAA) codons
  • Find sufficiently long (threshold) Open Reading Frames (ORF)
  • For each ORF compute a “coding potential”, e.g. using codon usage
  • ORFs with sufficiently high score become candidate genes
• Refinements: alternative coding measures, homology, regulatory motifs
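The first two steps of the simple algorithm above can be sketched in a few lines of Python. This is only an illustration, not a production gene finder: the names find_orfs and reverse_complement and the min_codons threshold are assumptions made for this sketch, only the first ATG of each ORF is kept, and no coding potential is computed.

```python
# Stop codons as listed on the slide.
STOPS = {"TGA", "TAG", "TAA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan the three frames of both strands for ATG ... stop ORFs.

    Returns tuples (strand, frame, start, end, orf_sequence); start and
    end are 0-based, end-exclusive coordinates on the scanned strand.
    min_codons counts the start and stop codons (hypothetical threshold).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open an ORF at the first in-frame ATG
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # close the ORF and look for the next ATG
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA"):
        print(orf)
```

Candidate ORFs returned this way would still need to be ranked by a coding measure before being reported as candidate genes.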

Finding Genes in Eukaryotic Genomes:

   AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
g1 aaaATGGGGgtgggtgatgagagACTTAGatgaataa   Met Gly Thr
g2 aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA   Met Gly Val Asp Leu Asp Glu

A legal parse (candidate gene) must have a single ORF spanning all coding regions from the start to the stop codon.
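To make the notion of a legal parse concrete, the sketch below checks the two parses g1 and g2 using the slide's convention that upper case marks coding sequence and lower case marks non-coding sequence. The function name translate_parse is hypothetical, the standard genetic code is assumed, and the GT…AG check on introns is a simplification added for this sketch.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_parse(parse):
    """Return the protein encoded by a parse, or None if the parse is illegal.

    A parse is legal here if the spliced exons form one ORF (ATG ... stop,
    no internal stop) and every intron starts with gt and ends with ag.
    """
    # Introns are lower-case runs flanked by coding sequence on both sides.
    introns = re.findall(r"(?<=[A-Z])[a-z]+(?=[A-Z])", parse)
    if any(not (i.startswith("gt") and i.endswith("ag")) for i in introns):
        return None
    coding = "".join(ch for ch in parse if ch.isupper())  # splice the exons
    if len(coding) % 3 != 0 or not coding.startswith("ATG"):
        return None
    protein = "".join(CODON_TABLE[coding[i:i + 3]] for i in range(0, len(coding), 3))
    if protein.endswith("*") and "*" not in protein[:-1]:
        return protein[:-1]
    return None


if __name__ == "__main__":
    print(translate_parse("aaaATGGGGgtgggtgatgagagACTTAGatgaataa"))
    print(translate_parse("aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA"))
```

Both parses come from the same genomic sequence, which illustrates why a gene finder has to score competing legal parses rather than just enumerate ORFs.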

Further complications:
• Alternative splicing
• Alternative transcription initiation sites and start codons
• Overlapping (and embedded) genes
• Regulatory sites often separated by long intervening non-coding sequences
• Pseudogenes

Signals: a simple approach by using Weight Matrices

Example set of aligned sites used for training:
GAGGTAAGC
CAGGTCAGT
TCGGTAATT
ATGGTAACT
TAGGTCATT

The score of base b at position i is the log-odds ratio
$s(b,i) = \log \frac{P(b,i)}{P(b)}$
where P(b,i) is the probability of observing base b at position i, derived from a set of true examples used for the training, and P(b) is the prior (background) probability of observing b in the data. The signal score is then the sum of the individual scores over a window around the splice site.

Refinements: conditional probabilities and Markov models
Further refinements: supervised machine learning approaches, e.g. neural networks (NN)
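A weight matrix of this kind can be built directly from the five example sites listed above. The sketch below is a minimal illustration under stated assumptions: the background P(b) is taken as uniform (0.25 per base), a pseudocount of 1 is added so that unseen bases do not give log(0), and the function names are hypothetical. With only five training sites the numbers are purely illustrative.

```python
import math

# The five aligned example sites from the slide.
SITES = ["GAGGTAAGC", "CAGGTCAGT", "TCGGTAATT", "ATGGTAACT", "TAGGTCATT"]


def build_weight_matrix(sites, background=None, pseudocount=1.0):
    """Return a list of dicts, one per position, mapping base -> log-odds score."""
    background = background or {b: 0.25 for b in "ACGT"}  # assumed uniform prior
    matrix = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        scores = {}
        for b in "ACGT":
            # P(b, i), smoothed with a pseudocount (an assumption of this sketch)
            p = (column.count(b) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[b] = math.log(p / background[b])
        matrix.append(scores)
    return matrix


def score_site(matrix, window):
    """Signal score: sum of per-position scores over a window of matrix width."""
    return sum(col[b] for col, b in zip(matrix, window))


if __name__ == "__main__":
    wm = build_weight_matrix(SITES)
    for candidate in ("GAGGTAAGC", "CCCCCCCCC"):
        print(candidate, round(score_site(wm, candidate), 3))
```

A first-order Markov refinement would replace P(b,i) with the conditional probability of b at position i given the base at position i-1.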

Coding measures: a simple codon usage model

The decomposition of sequence S into codons Ck is reading frame dependent, so all reading frames are considered for prediction: the score is the maximum over all reading frames, computed within a sufficiently long sliding window. However, only the true reading frames of the exons in the training set are used to estimate the probability of each codon (see Codon Usage Table). The background probabilities, in turn, may be computed from all the sequences in the training set (including introns), taking into account all the reading frames.

Refinement: use homology and splice alignments
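The codon usage model described above can be sketched as a log-likelihood ratio between coding and background codon probabilities, maximized over the three frames of a window. Everything specific here is an assumption for illustration: the function names, the pseudocount, and the toy training sequences, which are made up and far too small to give meaningful statistics. Sequences are expected to contain only upper-case A, C, G and T.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(seqs, frames=(0,), pseudocount=1.0):
    """Estimate codon probabilities from seqs, reading each in the given frames."""
    counts = Counter()
    for s in seqs:
        for f in frames:
            for i in range(f, len(s) - 2, 3):
                counts[s[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(window, p_coding, p_background):
    """Maximum over the three frames of the summed codon log-odds scores."""
    best = -math.inf
    for f in range(3):
        score = sum(
            math.log(p_coding[window[i:i + 3]] / p_background[window[i:i + 3]])
            for i in range(f, len(window) - 2, 3)
        )
        best = max(best, score)
    return best


if __name__ == "__main__":
    # Toy, made-up training data: one "true exon" in frame 0, and one
    # sequence standing in for "all sequences" (exon plus non-coding tail).
    true_exons = ["ATGGCCGCCGCCGCCTAA"]
    all_seqs = ["ATGGCCGCCGCCGCCTAAATATATATATATAT"]
    p_cod = codon_frequencies(true_exons, frames=(0,))
    p_bg = codon_frequencies(all_seqs, frames=(0, 1, 2))
    for w in ("GCCGCCGCC", "ATATATATA"):
        print(w, round(coding_score(w, p_cod, p_bg), 3))
```

Sliding such a window along a genomic sequence and plotting the score gives a coding potential profile.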

[Figure: R. Guigo, sliding window of length 120 bp, human beta-globin]

Combining Sites and Coding Statistics
• Variety of approaches proposed, e.g. MORGAN, FGENES, GeneID, GRAIL
• The dynamic programming framework: find the best legal parse up to position n, given the best scoring and consistent parses up to position n-1 (analogy to sequence alignment)
• Hidden Markov Model statistical learning framework for gene finding
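The dynamic programming idea can be illustrated with a deliberately simplified chaining problem: given candidate exons that already carry a combined site and coding score, find the highest-scoring set of non-overlapping exons. This is a generic weighted-interval chaining sketch, not the algorithm of any of the programs named above; the function name and the example exons are hypothetical, and reading frame compatibility between consecutive exons, which a real legal parse requires, is ignored.

```python
def best_exon_chain(exons):
    """Return (total_score, chain) for the best chain of non-overlapping exons.

    exons is a list of (start, end, score) tuples with end exclusive.
    The best chain ending at exon i extends the best compatible chain
    ending at an earlier exon j, the same recurrence pattern as in
    sequence alignment.
    """
    exons = sorted(exons, key=lambda e: e[1])
    best = []  # best[i] = (score, chain) of the best chain ending with exons[i]
    for i, (start, end, score) in enumerate(exons):
        prev_score, prev_chain = 0.0, []
        for j in range(i):
            # exon j is compatible if it ends before exon i starts
            if exons[j][1] <= start and best[j][0] > prev_score:
                prev_score, prev_chain = best[j]
        best.append((prev_score + score, prev_chain + [exons[i]]))
    return max(best, key=lambda b: b[0], default=(0.0, []))


if __name__ == "__main__":
    candidates = [(0, 10, 2.0), (5, 20, 3.5), (12, 30, 2.5), (32, 40, 1.0)]
    total, chain = best_exon_chain(candidates)
    print(total, chain)
```

The quadratic loop over earlier exons keeps the sketch short; it could be replaced by a running maximum if the candidates are processed in sorted order.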

Problems and assignments:
• Use a eukaryotic genomic sequence from GenBank of length larger than 20 kb to estimate the frequency of putative donor and acceptor sites
• Use true splice sites in your sequence to derive 0-th and first order Markov models (weight matrices)
• Compare the results of the two models for false sites in the sequence
• Consider splice alignments into protein (cDNA) sequence databases as a method to detect coding sequences. What would be the role of the six reading frames in such an exercise?
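As a possible starting point for the first assignment, the sketch below counts putative donor (GT) and acceptor (AG) dinucleotides on one strand of a sequence. The function name is hypothetical, the short sequence is the example from the earlier slides rather than a real GenBank entry, and loading an actual record is left to the reader.

```python
def dinucleotide_sites(seq, dinuc):
    """Return (frequency, positions) of a dinucleotide on the given strand.

    Frequency is the number of occurrences divided by the number of
    dinucleotide positions in the sequence; positions are 0-based.
    """
    seq = seq.upper()
    positions = [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]
    return len(positions) / max(len(seq) - 1, 1), positions


if __name__ == "__main__":
    seq = "AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA"
    for name, dinuc in (("donor", "GT"), ("acceptor", "AG")):
        freq, pos = dinucleotide_sites(seq, dinuc)
        print(name, dinuc, round(freq, 3), pos)
```

Most of the sites found this way are false positives, which is what the weight matrix and Markov models in the following assignments are meant to filter.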
