Explore gene finding methods in bioinformatics, covering the history of genes, structures of eukaryotic genes, motifs, and processes involved in gene identification. Learn how to find genes in prokaryotic and eukaryotic genomes, including algorithms, coding potentials, regulatory motifs, and refinements. Discover supervised machine learning approaches, hidden Markov models, and the role of reading frames in gene prediction. Dive into how to estimate frequency of donor and acceptor sites, derive Markov models, compare results, and use splice alignments to detect coding sequences.
Gone Fishing: An Introduction to Gene Finding Methods
Jarek Meller
Biomedical Informatics, CHRF
Additional materials for those who missed the Intro to Functional Genomics course
Introduction to bioinformatics
A couple of definitions:
• A short history of genes: from “hereditary basis for traits” to “one gene – one polypeptide”
• Modern definition of the gene: “a complete chromosomal segment responsible for making a functional product”
• Codon: a triplet of nucleotides encoding an amino acid
• Open Reading Frame (ORF): a string of codons bounded by start and stop signals (codons)
• Pseudogene: a potential gene with an impaired ability to make a viable transcription (or translation) product
The Canonical Structure of Eukaryotic Genes:
[Figure: gene layout from 5’ to 3’: Pr (…TATA…), 5’UTR, E1, I1, E2, I2, E3, 3’UTR, polyA (…AATAAA…); ATG marks the translation start, and GT … AG mark the donor and acceptor ends of an intron]
Eukaryotic genes are in general neither contiguous nor continuous: coding regions are typically split into a number of coding fragments (exons), separated by non-coding intervening fragments known as introns.
Motifs and Processes:
• TATA – TBP – transcription initiation
• AATAAA – poly-A polymerase – poly-A tail attachment (pre-mRNA processing)
• GT … AG – spliceosome complex – splicing (pre-mRNA processing)
• ATG – ribosome complex – translation initiation
• TGA, TAA, TAG – ribosome complex – translation termination
From Transcription to pre-mRNA Processing to Splicing to Translation:
[Figure: processed mRNA with 5’ CAP, 5’ UTR, 3’ UTR and poly-A tail (AAA…AA)]
Finding Genes in Prokaryotic Genomes:
   AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
f0 AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAa
f1 aAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
f2 aaAATGGGGGTGGGTGATGAGAGACTTAGATGAATaa
• A simple algorithm – for each reading frame on both strands:
  • Find start (ATG) and stop (TGA, TAG, TAA) codons
  • Find sufficiently long (threshold) Open Reading Frames (ORF)
  • For each ORF compute a “coding potential”, e.g. using codon usage
  • ORFs with sufficiently high score become candidate genes
• Refinements: alternative coding measures, homology, regulatory motifs
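The first two steps of the simple algorithm above can be sketched in Python as follows. This is a minimal illustration, not a production gene finder: the function names (`find_orfs`, `orfs_in_frame`) and the `min_codons` threshold are assumptions made for this sketch, only ATG is treated as a start codon, and the coding-potential scoring step is left out.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for ATG...stop ORFs in one reading frame.

    `end` is exclusive and the ORF includes its stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i  # first ATG since the previous stop opens the ORF
        elif start is not None and codon in STOPS:
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def find_orfs(seq, min_codons=3):
    """Scan three frames on each strand (hypothetical length threshold).

    Coordinates for the "-" strand refer to the reverse-complemented string.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end, s[start:end]))
    return results


if __name__ == "__main__":
    for orf in find_orfs("AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA"):
        print(orf)
```

In a fuller version each reported ORF would then be scored with a coding measure and kept only if the score exceeds a threshold.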
Finding Genes in Eukaryotic Genomes:
   AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
g1 aaaATGGGGgtgggtgatgagagACTTAGatgaataa   Met Gly Thr
g2 aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA   Met Gly Val Asp Leu Asp Glu
A legal parse (candidate gene) must have a single ORF spanning all coding regions from the start to the stop codon.
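The legality check can be illustrated with a short Python sketch. It assumes the slide's convention that uppercase letters mark coding (exon) bases and lowercase letters mark non-coding bases; the helper names (`spliced_cds`, `translate`, `is_legal_parse`) are hypothetical, and the standard genetic code is used for translation.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def spliced_cds(parse):
    """Join the uppercase (exon) letters of a parse; lowercase is skipped."""
    return "".join(ch for ch in parse if ch.isupper())


def translate(cds):
    """Translate complete codons of an uppercase coding sequence."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def is_legal_parse(parse):
    """True if the spliced exons form one ORF: ATG ... single final stop."""
    cds = spliced_cds(parse)
    if len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return False
    protein = translate(cds)
    return protein.endswith("*") and "*" not in protein[:-1]


if __name__ == "__main__":
    g1 = "aaaATGGGGgtgggtgatgagagACTTAGatgaataa"
    g2 = "aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA"
    for name, parse in (("g1", g1), ("g2", g2)):
        print(name, translate(spliced_cds(parse)), is_legal_parse(parse))
```

Both parses of the same genomic sequence pass the check, which is why additional evidence (splice-site signals, coding measures) is needed to choose between them.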
Further complications:
• Alternative splicing
• Alternative transcription initiation sites and start codons
• Overlapping (and embedded) genes
• Regulatory sites often separated by long intervening non-coding sequences
• Pseudogenes
Signals: a simple approach using Weight Matrices
Aligned example sites:
GAGGTAAGC
CAGGTCAGT
TCGGTAATT
ATGGTAACT
TAGGTCATT
The score of base b at position i is computed from P(b,i), the probability of observing base b at position i, derived from a set of true examples used for training, and P(b), the prior (background) probability of observing b in the data. The signal score is then the sum of the individual scores over a window around the splice site.
Refinements: conditional probabilities and Markov models
Further refinements: supervised machine learning approaches, e.g. neural networks (NN)
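A weight matrix can be built from the five example sites in a few lines of Python. The scoring formula is not preserved on the slide, so this sketch assumes the common log-odds form, score(b,i) = log2(P(b,i) / P(b)); the uniform background of 0.25 per base and the pseudocount of 1 are also assumptions, chosen only to keep the toy example well defined.

```python
import math

# The five aligned example sites shown on the slide.
TRAINING_SITES = ["GAGGTAAGC", "CAGGTCAGT", "TCGGTAATT", "ATGGTAACT", "TAGGTCATT"]


def build_weight_matrix(sites, background=None, pseudocount=1.0):
    """Return one dict per position mapping base -> log2(P(b,i) / P(b)).

    Assumptions: log-odds scores, uniform background unless given,
    and a pseudocount so that unseen bases get a finite score.
    """
    background = background or {b: 0.25 for b in "ACGT"}
    total = len(sites) + 4 * pseudocount
    matrix = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        matrix.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in "ACGT"
        })
    return matrix


def signal_score(window, matrix):
    """Sum the position-specific scores over a window of matrix width."""
    return sum(matrix[i][b] for i, b in enumerate(window))


if __name__ == "__main__":
    wm = build_weight_matrix(TRAINING_SITES)
    for candidate in ("GAGGTAAGC", "GAGCCAAGC"):
        print(candidate, round(signal_score(candidate, wm), 3))
```

A weight matrix treats positions as independent (a zeroth-order model); a first-order Markov refinement would instead condition each base on its predecessor.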
Coding measures: a simple codon usage model
The decomposition of sequence S into codons Ck is reading-frame dependent, and all reading frames are considered for prediction (that is, the maximum score over all reading frames, computed with a sufficiently long sliding window, is taken). However, only the true reading frames of the exons in the training set are used to estimate the probability of each codon (see Codon Usage Table). The background probabilities, in turn, may be computed from all the sequences in the training set (including introns), taking into account all the reading frames.
Refinement: use homology and splice alignments
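The codon usage model can be sketched in Python under a few stated assumptions. The slide's scoring formula is not preserved, so the sketch uses a log-likelihood ratio, the sum over codons Ck of log(P(Ck) / P0(Ck)), where P comes from in-frame training exons and P0 from all frames of all training sequences. The function names, the pseudocount and the toy training sequences are hypothetical, and the sliding window is omitted (the whole query is scored).

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codons_in_frame(seq, frame):
    """Split seq into complete codons starting at offset `frame` (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def codon_probs(counts, pseudo=1.0):
    """Turn codon counts into probabilities with a pseudocount (assumption)."""
    total = sum(counts[c] for c in CODONS) + pseudo * len(CODONS)
    return {c: (counts[c] + pseudo) / total for c in CODONS}


def train(exons_in_frame, all_sequences):
    """Coding model from true frames only; background from all three frames."""
    coding = Counter()
    for exon in exons_in_frame:
        coding.update(codons_in_frame(exon, 0))
    background = Counter()
    for seq in all_sequences:
        for frame in range(3):
            background.update(codons_in_frame(seq, frame))
    return codon_probs(coding), codon_probs(background)


def coding_score(seq, p_coding, p_background):
    """Return (best score, best frame) over the three reading frames."""
    scored = []
    for frame in range(3):
        s = sum(math.log(p_coding[c] / p_background[c])
                for c in codons_in_frame(seq, frame))
        scored.append((s, frame))
    return max(scored)


if __name__ == "__main__":
    # Toy data, for illustration only (not real exons or introns).
    exons = ["ATGGCCGCCGCCGCC", "ATGGCCAAGGCCGCC"]
    introns = ["GTAAGTATTTTTTTTTCAG"]
    p, p0 = train(exons, exons + introns)
    print(coding_score("TGCCGCCGCCGCC", p, p0))
```

With real data the training set would need to be far larger, and the score would be evaluated in a sliding window along the genomic sequence.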
R. Guigo, sliding window of length 120 b, human beta-globin
Combining Sites and Coding Statistics
• Variety of approaches proposed, e.g. MORGAN, FGENES, GeneID, GRAIL
• The dynamic programming framework: find the best legal parse up to position n, given the best scoring and consistent parses up to position n-1 (analogy to sequence alignment)
• Hidden Markov Model statistical learning framework for gene finding
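The dynamic programming idea can be illustrated with a deliberately simplified Python sketch: given candidate exons that already carry a combined score, pick the highest-scoring chain of non-overlapping exons. This is not the algorithm of any of the programs named above. It is a weighted interval chaining example, and it ignores reading-frame consistency and splice-site pairing, which a real legal-parse check would enforce. The function name and the example coordinates and scores are hypothetical.

```python
def best_exon_chain(exons):
    """exons: list of (start, end, score) with inclusive coordinates.

    Returns (total_score, chain), where chain is the best-scoring list of
    non-overlapping exons in left-to-right order.
    """
    exons = sorted(exons, key=lambda e: e[1])
    # best[i] = (score of the best chain ending with exon i, predecessor index)
    best = []
    for i, (start, end, score) in enumerate(exons):
        prev_score, prev_idx = 0.0, None
        for j in range(i):
            # Exon j may precede exon i only if it ends before i starts.
            if exons[j][1] < start and best[j][0] > prev_score:
                prev_score, prev_idx = best[j][0], j
        best.append((score + prev_score, prev_idx))
    if not best:
        return 0.0, []
    i = max(range(len(best)), key=lambda k: best[k][0])
    total = best[i][0]
    chain = []
    while i is not None:  # trace back through the stored predecessors
        chain.append(exons[i])
        i = best[i][1]
    return total, chain[::-1]


if __name__ == "__main__":
    candidates = [(0, 10, 2.0), (5, 20, 3.5), (12, 30, 2.0), (25, 40, 1.0)]
    print(best_exon_chain(candidates))
```

As in sequence alignment, the best solution for each exon is built from the best solutions already stored for earlier positions, so no chain is scored more than once.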
Problems and assignments:
• Use a eukaryotic genomic sequence from GenBank, longer than 20 kb, to estimate the frequency of putative donor and acceptor sites
• Use true splice sites in your sequence to derive zeroth- and first-order Markov models (weight matrices)
• Compare the results of the two models for false sites in the sequence
• Consider splice alignments against protein (cDNA) sequence databases as a method to detect coding sequences. What would be the role of the six reading frames in such an exercise?
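As a possible starting point for the first assignment, the sketch below counts putative donor (GT) and acceptor (AG) dinucleotides on one strand of a sequence. The function name is hypothetical, the sequence is passed in as a plain string (reading a GenBank record is left to the reader), and the short slide sequence stands in for a real 20 kb region.

```python
def putative_site_frequencies(seq):
    """Count GT (donor) and AG (acceptor) dinucleotides on the given strand.

    Returns a dict with raw counts and frequencies per dinucleotide position.
    """
    seq = seq.upper()
    positions = max(len(seq) - 1, 0)
    counts = {"GT": 0, "AG": 0}
    for i in range(positions):
        dinucleotide = seq[i:i + 2]
        if dinucleotide in counts:
            counts[dinucleotide] += 1
    freqs = {k: (v / positions if positions else 0.0) for k, v in counts.items()}
    return {"counts": counts, "frequencies": freqs}


if __name__ == "__main__":
    print(putative_site_frequencies("AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA"))
```

Comparing these raw frequencies with the number of annotated splice sites gives a sense of how many false sites a signal model has to reject.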