Genome Annotation

What we are going to discuss
  • Finding RNA-only genes
  • Gene prediction
    • Prokaryotes vs. eukaryotes
    • Introns and exons
    • Transcription signals
    • ESTs
  • Functional annotation
  • Biochemical pathways and subsystems
  • Metabolic reconstruction of whole organisms
Genome Overview
  • What’s in a genome?
    • Protein coding genes.
      • In long open reading frames
      • ORFs interrupted by introns in eukaryotes
      • Take up most of the genome in prokaryotes, but only a small portion of the eukaryotic genome
    • RNA-only genes
      • Transfer RNA, ribosomal RNA, snoRNAs (which guide ribosomal and transfer RNA maturation), and RNAs involved in intron splicing, in guiding mRNAs to the membrane for translation, and in gene regulation. This is a growing list.
    • Gene control sequences
      • Promoters
      • Regulatory elements
    • Transposable elements, both active and defective
      • DNA transposons and retrotransposons
      • Many types and sizes
    • Repeated sequences.
      • Centromeres and telomeres
      • Many with unknown (or no) function
    • Unique sequences that have no obvious function
  • As a general rule, each part of a genomic sequence has only one function: protein-coding gene, RNA gene, control signal, transposable element, repeat sequence, or maybe no function at all. Most sequence elements overlap only slightly, if at all.
RNA Genes
  • The most universal genes, such as tRNA and rRNA, are very conserved and thus easy to detect. Finding them first removes some areas of the genome from further consideration.
    • One easy approach to finding common RNA genes is just looking for sequence homology with related species: a BLAST search will find most of them quite easily
  • Functional RNAs are characterized by secondary structure caused by base pairing within the molecule.
    • Determining the folding pattern is a matter of testing many possibilities to find the one with the minimum free energy, which is the most stable structure.
      • The free energy calculations are in turn based on experiments where short synthetic RNA molecules are melted
    • Related to this is the concept that paired regions (stems) will be conserved across species lines even if the individual bases aren’t conserved. That is, if there is an A-U pairing in one species, the same position might be occupied by a G-C in another species.
      • This is an example of compensatory evolution (covariation): a deleterious mutation at one site is cancelled by a compensating mutation at another site.
RNA Structures
  • RNA differs from DNA in having fairly common G-U base pairs. Also, many functional RNAs have unusual modified bases such as pseudouridine and inosine.
  • The pseudoknot, pairing between a loop and a sequence outside its stem, is especially difficult to detect: it breaks the nested pattern that RNA base pairing normally follows, so predicting it is computationally intense.
    • But pseudoknots seem to be fairly rare.
  • Essentially, RNA folding programs start with all possible short sequences, then build to larger ones, adding the contribution of each structural element.
    • There is an element of dynamic programming here as well.
    • And, “stochastic context-free grammars”, something I really don’t want to approach right now!
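
The build-up from short subsequences to longer ones can be sketched with a Nussinov-style dynamic program. This is a simplified stand-in, not what production folding programs do: it maximizes the number of nested base pairs instead of minimizing free energy, it cannot find pseudoknots, and the function name and the minimum loop size of 3 are assumptions made for illustration.

```python
def max_nested_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in rna (Nussinov-style DP)."""
    # Watson-Crick pairs plus the G-U wobble pair mentioned above
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(rna)
    # best[i][j] = max number of pairs within rna[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # short subsequences first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]               # case 1: base j is unpaired
            for k in range(i, j - min_loop):     # case 2: base j pairs with base k
                if (rna[k], rna[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

For the hairpin-like input "GGGAAAUCCC" the three G-C pairs of the stem are the best nested solution. A free-energy version would replace the "+ 1" with experimentally derived stacking and loop energies.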
Finding tRNAs
  • tRNAs have a highly conserved structure, with 3 main stem-and-loop structures that form a cloverleaf structure, and several conserved bases. Finding such sequences is a matter of looking in the DNA for the proper features located the proper distance apart.
  • Looking for such sequences is well-suited to a decision tree, a series of steps that the sequence must pass.
  • In addition, a score is kept, rating how well the sequence passed each step. This allows a more stringent analysis later on, to eliminate false positives.
tRNAscan Decision Tree
  • tRNAscan is estimated to have an error rate of 1 in 3 million bases. This is very suitable for prokaryotes, whose genomes are approximately this size.
Prokaryotic Genes
  • Gene finding in prokaryotes is relatively simple compared to eukaryotes:
    • no introns, so all genes are in open reading frames starting at a start codon and ending at a stop codon
    • most of the DNA is involved in coding for proteins.
  • Thus, you can find nearly every gene (close to 100% sensitivity), if you don’t mind false positives, by simply listing all possible ORFs above a certain size.
    • There is a problem in that it is not clear how many short ORFs (say less than 100 bp) are real genes.
  • If you compare predicted genes with actual genes, you can classify each base according to whether it is:
    • true positive: predicted correctly to be in a gene
    • true negative: predicted correctly to not be in a gene
    • false positive: predicted to be in a gene but actually not
    • false negative: predicted to not be in a gene but actually is within a gene.
  • The sensitivity (Sn) of a prediction is the fraction of bases in real genes that are predicted to be within genes.
  • The specificity (Sp) is the fraction of bases predicted to be in a gene that actually are.
  • Both of these parameters need to be optimized.

Sn = TP / (TP + FN)

Sp = TP / (TP + FP)
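
As a minimal sketch of the two formulas above, the per-base counts can be computed from two equal-length lists of booleans (True meaning "this base is in a gene"). The function name and input format are assumptions for illustration.

```python
def base_level_sn_sp(predicted, actual):
    """Return (Sn, Sp) from per-base gene/non-gene calls.

    predicted, actual: equal-length sequences of booleans, True = in a gene.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    sn = tp / (tp + fn) if tp + fn else float("nan")   # Sn = TP / (TP + FN)
    sp = tp / (tp + fp) if tp + fp else float("nan")   # Sp = TP / (TP + FP)
    return sn, sp
```

Note that true negatives never enter either formula, which is why listing every ORF drives Sn up while Sp falls.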

General Considerations
  • Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and a few others are occasionally used.
    • Remember that start codons are also used internally: the actual start codon may not be the first one in the ORF.
  • The stop codons are the same as in eukaryotes: TGA, TAA, TAG
    • stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation.
  • Genes can overlap by a small amount. Not much, but a few codons of overlap is common enough so that you can’t just eliminate overlaps as impossible.
  • Cross-species homology works well for many genes. It is very unlikely that non-coding sequence will be conserved.
    • But, a significant minority of genes (say 20%) are unique to a given species.
  • Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often found just upstream from the start codon
    • however, some aren’t recognizable
    • genes in operons don’t always have a separate ribosome binding site for each gene
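
The start and stop codon rules above can be turned into a naive ORF lister. This is only a sketch under stated assumptions: it scans the forward strand only, reports the longest ORF for each stop codon (the first start codon after the previous in-frame stop, which may not be the real start), and the function name and the 100 bp default are illustrative choices.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TGA", "TAA", "TAG"}

def find_orfs(dna, min_len=100):
    """List (start, end) of forward-strand ORFs; 0-based, end exclusive, stop included."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None                      # first start codon since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon in STOPS:
                if start is not None and i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None
            elif start is None and codon in STARTS:
                start = i
    return sorted(orfs)
```

A real gene finder would also scan the reverse complement and then choose among the candidate start codons, for example by looking for a ribosome binding site.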
Compositional Methods
  • The frequency of various codons is different in coding regions as compared to non-coding regions.
    • This extends to G-C content, dinucleotide frequencies, and other measures of composition. Dicodons (groups of 6 bases) are often used
    • Well documented experimentally.
  • The composition varies between different proteins of course, and it is affected within a species by the amounts of the various tRNAs present
    • horizontally transferred genes can also confuse things: they tend to have compositions that reflect their original species.
    • A second group with unusual compositions are highly expressed genes.
  • GeneMark uses fifth order Markov chains to examine dicodons. That is, every base is evaluated in terms of its probability given the previous 5 bases.
    • P(a|x1x2x3x4x5), the probability that the sixth base in the sequence is a given that the bases preceding it are x1x2x3x4x5, so the final sequence is x1x2x3x4x5a.
    • The necessary parameters are obtained by looking at pentamers (5-mers) within known genes and counting the number of times each base appears in the sixth position. These known genes are the training set. Pseudocounts may be used here.
  • GeneMark pays attention to reading frame also. Each reading frame gets its own set of statistics. Thus there is a separate P1(a|x1x2x3x4x5), P2(a|x1x2x3x4x5), and P3(a|x1x2x3x4x5), where 1, 2, and 3 are the reading frames.
    • Based on the position of the stop codon, each base in an ORF has a unique codon position.
    • Non-coding regions are assumed to have the same statistics for all frames.
    • The final probability is given as the probability that it is coding for a specified reading frame.
  • A 96 base sliding window is moved across the genome, scoring all possible reading frames. Start and stop codons are not accurately predicted, especially with overlapping genes, so they need to be identified separately.
  • GLIMMER also uses Markov chains, but they vary from zeroth order (i.e. GC content) to eighth order (what is the probability of a base given the previous 8 bases?).
    • The point of this is to help get around the need for huge sets of training data while avoiding pseudocounts.
    • Called “interpolated Markov models”
  • GLIMMER selects training data from a genome sequence by picking non-overlapping long ORFs, which are almost all genes.
    • Note that high GC-content genomes need “long” defined differently than low GC genomes, since random stop codons are rarer.
  • GLIMMER builds its Markov models from the lowest order up. At each step, there must be at least 400 observations to accept the model as valid.
    • If there are too few observations, the model is compared to the next-lower-order model, using a chi-square test.
      • If the new model isn’t significantly different from the lower order model, it is discarded.
      • If the new model is significantly different, it is weighted based on the number of observations and the significance level.
    • For example, if there are fewer than 400 observations of x1x2x3x4x5, the P(a|x1x2x3x4x5) for each base a is tested against P(a|x2x3x4x5) probabilities.
  • After all parameters are obtained, only the highest order model is used for any given subsequence.
  • Each ORF longer than a minimum is scored (as opposed to using a sliding window that ignores ORFs)
  • New versions don’t require that the “given” bases be adjacent to the base they are scored with.
  • ORPHEUS uses Markov models of codon frequency, based on a set of high-confidence (i.e. highly conserved) genes. However, ORPHEUS also looks for ribosome binding sites.
    • The score for a given codon abc is based on the frequency of this codon compared to the frequencies of the individual bases. This score is then summed for all codons in the training set and used as a parameter for the Markov model.
  • Each ORF in the genome is scored for the correct reading frame (as set by the stop codon) and for the other 2 forward, incorrect reading frames. If the correct frame score exceeds the incorrect frame scores by a certain amount, this ORF is accepted as protein-coding.
  • After a good ORF is found, it is extended 5’ to find possible start codons (but only allowing 6 bases of overlap with another gene).
  • Ribosome binding sites are then defined, based on genes that have only 1 possible start codon. Twenty bases upstream from the start sites are aligned.
    • RBS are not an exact distance upstream from the start codon.
    • The RBS scoring matrix derived from this is used to locate RBS for other genes.
  • The search is done progressively, starting with the longest ORFs and working towards the smaller ones. This avoids a lot of overlap problems.
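
The fixed-order, frame-aware Markov chain described for GeneMark can be sketched as follows. This is an illustration of the idea only, not GeneMark's actual code: the function names, the pseudocount of 1, and the uniform fallback for unseen contexts are all assumptions, and the sequences must contain only A, C, G and T.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_markov(seqs, order=5, periodic=True, pseudocount=1.0):
    """Count which base follows each `order`-mer, with pseudocounts.

    periodic=True keeps one table per codon position (i % 3), which assumes
    every training sequence starts at the first base of a codon.
    """
    tables = [defaultdict(lambda: dict.fromkeys(BASES, pseudocount)) for _ in range(3)]
    for seq in seqs:
        for i in range(order, len(seq)):
            frame = i % 3 if periodic else 0
            tables[frame][seq[i - order:i]][seq[i]] += 1
    return tables

def log_prob(seq, tables, order=5, periodic=True, pseudocount=1.0):
    """Log-probability of seq (after its first `order` bases) under the model."""
    total = 0.0
    for i in range(order, len(seq)):
        frame = i % 3 if periodic else 0
        table = tables[frame].get(seq[i - order:i])
        if table is None:                 # context never seen: fall back to uniform
            table = dict.fromkeys(BASES, pseudocount)
        total += math.log(table[seq[i]] / sum(table.values()))
    return total
```

A window is then scored as the difference between its log-probability under the coding model and under a non-periodic non-coding model; a positive difference favors coding. Trying the window in each of the three frames corresponds to the separate P1, P2 and P3 statistics.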
GeneMark.hmm
  • More Markov chains, but here, the probabilities are based on overall length of the ORF.
    • A true Markov model only considers the previous state, so this is a semi-Markov model.
  • It has been found that the length of coding regions can be modelled with a gamma distribution, and the length of non-coding regions can be modelled with an exponential distribution. (just empirical observations, not based on theory).
  • GeneMark.hmm changes the probability that a base is in a coding region depending on the length of the coding region defined to that point.
  • It also looks for ribosome binding sites.
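
To see how the two length distributions can be compared, the sketch below computes a log-odds score for a region of a given length: gamma density for coding, exponential density for non-coding. The shape, scale and mean values are made-up placeholders, not parameters from GeneMark.hmm; real values would be fitted to known genes.

```python
import math

def gamma_logpdf(x, shape, scale):
    """Log density of a gamma distribution at x > 0."""
    return ((shape - 1) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))

def exponential_logpdf(x, mean):
    """Log density of an exponential distribution at x >= 0."""
    return -math.log(mean) - x / mean

def length_log_odds(length, shape=2.0, scale=450.0, noncoding_mean=150.0):
    """Positive when the length looks more like a coding region (hypothetical parameters)."""
    return gamma_logpdf(length, shape, scale) - exponential_logpdf(length, noncoding_mean)
```

With these placeholder parameters a 900 bp region scores as more gene-like and a 30 bp region as less gene-like, which is the kind of length dependence a semi-Markov model adds on top of the composition statistics.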
Eukaryotic Gene Prediction
  • Some fundamental differences between prokaryotes and eukaryotes:
  • There is lots of non-coding DNA in eukaryotes.
    • First step: find repeated sequences and RNA genes
    • Note that eukaryotes have 3 main RNA polymerases. RNA polymerase 2 (pol2) transcribes all protein-coding genes, while pol1 and pol3 transcribe various RNA-only genes.
  • Most eukaryotic genes are split into exons and introns.
  • Only 1 gene per transcript in eukaryotes.
  • No ribosome binding sites: translation starts at the first ATG in the mRNA
    • thus, in eukaryotic genomes, searching for the transcription start site (TSS) makes sense.
  • Many fewer eukaryotic genomes have been sequenced
Exons and Introns
  • Size distribution of exons varies according to position in the gene. It is also quite different between plants and animals.
  • Exons are generally shorter than prokaryotic ORFs, as short as 10 bp.
    • Note that the leading exon and the trailing exon always contain some non-coding bases, and sometimes they are entirely non-coding.
    • Exon-intron boundaries can occur within a codon as well as between codons.
  • Introns can be incredibly long, with some human introns over 400,000 bp. Minimum size is about 50 bp.
  • Many genes have alternate splicing patterns: a sequence that is an exon in one tissue might be an intron in another tissue.
More Exon-Intron
  • Each gene has a transcription start site, but promoters and other features are not well conserved, as compared to coding sequences.
  • Splicing signals are not absolute (especially given alternative splicing), and they also vary widely.
    • In general, introns start with GT and end with AG, and have a splice acceptor region just upstream from the end.
  • There are also the relatively rare (< 1%) U12 introns, which are removed by different spliceosomes than the usual U2 introns. The U12 introns start with AT and end in AC.

Human on left, Arabidopsis on right

Predicting Exons and Introns
  • Exon sequences can often be identified by sequence conservation, at least roughly.
  • Dicodon statistics, as used for prokaryotes, are also useful
    • eukaryotic genomes tend to contain many isochores, regions of different GC content, and composition statistics can vary between isochores.
  • The initial and terminal exons contain untranslated regions, and thus special methods are needed to detect them.
  • Predicting splice junctions is a matter of collecting information about the sequences surrounding each possible GT/AG pair, then running this information through some combination of decision tree, Markov models, discriminant analysis, or neural networks, in an attempt to massage the data into giving a reliable score.
    • In general, sites are more likely to be correct if predicted by multiple methods
    • Experimental data from ESTs can be very helpful here.
  • Experimental information about intron/exon boundaries is mostly obtained by analyzing expressed sequence tags (ESTs).
  • EST production starts out by extracting mRNA from a specific tissue, then reverse-transcribing it to make double-stranded cDNA, then cloning the cDNA into a plasmid vector.
    • After the clone is produced, it is sequenced for one or both ends, just a single time.
    • A 5’ EST is sequenced from the 5’ end, which usually contains at least some protein-coding portion.
    • A 3’ EST, sequenced from the 3’ end, is often 3’ untranslated region, which is less conserved across species lines.
  • Single-pass sequencing leads to an imperfect sequence, but BLAST can generally locate its position in the genome exactly.
    • Also, lots of redundancy in an EST library, especially with highly expressed sequences.
  • ESTs provide evidence that a given sequence has been expressed.
  • They also show which sequences are exons, since introns have already been spliced out of the mRNA.
  • Large numbers of ESTs are deposited in dbEST, part of NCBI.
    • The UniGene set organizes ESTs from individual genes to remove a lot of redundancy.
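
The first step of splice junction prediction, collecting every possible donor/acceptor pair, can be sketched as below. It only enumerates candidates for the common GT...AG introns; the function name is hypothetical and the 50 bp minimum is the approximate figure quoted earlier. Scoring the surrounding sequence, which is the hard part, is not shown.

```python
def candidate_introns(dna, min_len=50, max_len=None):
    """List (start, end) of every GT...AG span; 0-based, end exclusive."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptor_ends = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    out = []
    for d in donors:
        for a in acceptor_ends:
            length = a - d
            if length >= min_len and (max_len is None or length <= max_len):
                out.append((d, a))
    return out
```

Because GT and AG each occur about once every 16 bases by chance, the number of candidate pairs grows very quickly with sequence length, which is why the scoring methods and EST evidence described above are needed.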
Finding the Transcription Start Site
  • The basic idea is to first create a model of transcription start sites based on experimentally-determined starts, then devise ways to score sequences relative to this model.
  • Work by Bucher in 1990 produced scoring matrices for the GC box, CCAAT box, TATA box, and RNA initiation/cap site (Inr), listed here 5’ to 3’, from furthest upstream to the TSS itself.
    • Only vertebrates have GC and CCAAT boxes
    • not all genes have recognizable TATA boxes
    • the cap signal is quite short, and thus noisy.
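
A scoring matrix of this kind can be sketched as a log-odds position weight matrix built from aligned example sites. The function names, the pseudocount, and the uniform background of 0.25 are assumptions for illustration, and the training sites in the usage note are invented, not Bucher's data.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds weight matrix from equal-length aligned sites (A/C/G/T only)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pwm.append({b: math.log2(((column.count(b) + pseudocount) / total) / background)
                    for b in "ACGT"})
    return pwm

def best_hit(seq, pwm):
    """Return (score, position) of the best-scoring window; seq must be at least as long as the matrix."""
    width = len(pwm)
    return max((sum(pwm[k][seq[i + k]] for k in range(width)), i)
               for i in range(len(seq) - width + 1))
```

For example, a matrix built from the invented sites TATAAA, TATAAA, TATATA and TATAAG places its best hit in GCGCTATAAAGCGC at position 4. Short matrices like the cap signal give noisy scores because many random windows score nearly as well.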
Eukaryotic ab initio Gene Prediction
  • Based on hidden Markov model (HMM)
  • As you move along the DNA sequence, a given nucleotide can be in an exon or an intron or in an intergenic region.
  • The oversimplified model on this slide doesn’t have the “non-gene” state
  • Use a training set of known genes (from the same or closely related species) to determine transition and emission probabilities.

Very simple HMM: each base is either in an intron or an exon, and gets emitted with different frequencies depending on which state it is in.

GeneMark scoring of the likelihood each nucleotide is in an intron, based on HMM.
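
The very simple two-state HMM can be decoded with the Viterbi algorithm, which finds the most probable exon/intron labeling of a sequence. The sketch below is generic; the function name is an assumption, and the probabilities in the usage note are invented (GC-rich exons, AT-rich introns) rather than trained on real genes.

```python
import math

def viterbi(seq, states, start, trans, emit):
    """Most probable state path for seq under an HMM (log-space Viterbi)."""
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []                                  # back-pointers, one dict per step
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            score[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][base])
            pointers[s] = best
        back.append(pointers)
    state = max(states, key=score.get)         # best final state
    path = [state]
    for pointers in reversed(back):            # trace back to the start
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

With states "EI", equal start probabilities, a 0.9 chance of staying in the same state, and emissions of 0.4 for G and C in exons versus 0.4 for A and T in introns, the sequence GCGCGCATATAT decodes as six E followed by six I.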

HMM Model with Intron Phases
  • A more realistic model: you can move from the non-gene state (N) to either a singleton exon (Es: a gene with just one exon) or to an initial exon (Ei).
  • From Es you move back to N.
  • From Ei you can move to an intron, which can be in any of 3 different phases.
    • Intron and exon phases designate whether the exon/intron boundary splits a codon: in phase 0 the boundary is between codons; phase 1 splits the codon between the first and second bases, and phase 2 splits the codon between the second and third bases.
    • Also, exon/intron boundaries don’t split stop codons, which necessitates the I1T etc. intron states.
  • Then back and forth between introns and exons, until you reach a terminal exon (Et), then back to the intergenic state (N).
  • SNAP: Korf (2004) BMC Bioinformatics 5:59.

A more realistic model from SNAP
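
The phase definition above reduces to simple arithmetic on the number of coding bases that precede each intron. A small sketch, with a hypothetical function name and an input format chosen for illustration (coding bases per exon, in order, excluding untranslated bases):

```python
def intron_phases(coding_exon_lengths):
    """Phase (0, 1 or 2) of the intron that follows each exon except the last."""
    phases, total = [], 0
    for length in coding_exon_lengths[:-1]:
        total += length
        # 0: boundary falls between codons; 1: after the first base of a codon;
        # 2: after the second base of a codon
        phases.append(total % 3)
    return phases
```

For exons carrying 90, 100, 52 and 20 coding bases, the three introns are in phases 0, 1 and 2. An HMM with separate intron states per phase uses this bookkeeping to keep the reading frame consistent across exons.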

Codon Bias within Exons
  • Depending on the GC content of the organism as well as other, less well defined characteristics, the frequency with which different synonymous codons are used can vary widely. This makes it necessary to train the HMM gene finder with a set of genes from the same or a closely related species.

At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Os: Oryza sativa

Exon/Intron Boundaries and Start Codons
  • Gene finders use HMMs that look for signals in the DNA by applying a “weight matrix” to each nucleotide based on the nucleotides around it. Thus, the HMM is considering more than just the immediately preceding nucleotide.

Sequence logos around (b) the intron splice donor site (usually GT) and (c) the ATG translation start codon, in four well-studied eukaryotes.

Some Results with SNAP
  • Here, sensitivity (SN) and specificity (SP) are listed for:
    • Whether a given nucleotide is contained in an exon
    • Whether a given predicted exon has exactly the same boundaries as a real exon
    • Whether a given gene has exactly the same intron/exon structure and boundaries as the actual gene.
Discriminant Analysis
  • Scoring sequences for the presence of eukaryotic promoters uses several techniques, including hidden Markov models, neural networks (which we will discuss later), and other scoring schemes.
  • Discriminant analysis is a statistical technique for combining scores from several different parameters and drawing a line that discriminates between “good” and “bad”.
  • Each factor is considered an independent dimension on a multi-dimensional plot.
    • As opposed to just adding up scores for each factor
    • or, using individual scores as part of yes/no decisions
  • Each sequence from a training set is plotted, knowing in advance which sequences are genuine promoters and which are not.
  • Using a least-squares fitting method, draw the line (a hyperplane really) that best separates the two groups.
    • This is linear discriminant analysis
  • Quadratic discriminant analysis draws a curved boundary (such as a parabola) instead. Sometimes this works better.
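
A linear discriminant for two scoring factors can be sketched in plain Python. Note the substitution: this uses Fisher's linear discriminant (class means and pooled scatter) rather than the least-squares fit named above, although for two classes the two give the same direction up to scaling. The function names are hypothetical and the threshold is simply placed midway between the class means.

```python
def fit_lda(good, bad):
    """Two-feature Fisher discriminant. Returns (w, threshold); score = w . x."""
    def mean(points):
        n = len(points)
        return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

    m_good, m_bad = mean(good), mean(bad)
    a = b = c = 0.0                      # pooled scatter matrix [[a, b], [b, c]]
    for points, m in ((good, m_good), (bad, m_bad)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            a += dx * dx
            b += dx * dy
            c += dy * dy
    det = a * c - b * b
    if det == 0:
        raise ValueError("scatter matrix is singular")
    d0, d1 = m_good[0] - m_bad[0], m_good[1] - m_bad[1]
    # w = inverse(scatter) * (difference of class means)
    w = [(c * d0 - b * d1) / det, (a * d1 - b * d0) / det]
    threshold = w[0] * (m_good[0] + m_bad[0]) / 2 + w[1] * (m_good[1] + m_bad[1]) / 2
    return w, threshold

def is_good(point, w, threshold):
    """True if the point falls on the 'good' side of the separating line."""
    return w[0] * point[0] + w[1] * point[1] > threshold
```

Each training sequence contributes one point (its two factor scores); a new sequence is then classified by which side of the line it falls on. More factors would need a general matrix inverse instead of the 2x2 formula used here.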

Several factors used to score promoter sequences. This is part of a neural network model, but the factors are common to many programs.

Discriminant Analysis
  • Illustrated here for 2 factors, but of course there can be many more.
  • The quadratic discriminant works much better in this case.
  • The position of each sequence in a scan of a region can be scored according to where it falls on the plot.
  • Support Vector Machines (SVM) are a fancier way of doing this: they can generate a much more complex curve than a hyperplane to separate the groups.
  • Once genes have been identified, we need to assign them names and functions.
    • In well-studied genomes, such as Drosophila, there are many already-named genes, some of which are quite whimsical. They often reflect the mutant phenotype, e.g. white eyes. Another example is a mutant whose wings are held at an unusual angle: Frodo (“lowered of the wings”).
    • But in general, gene names from genome projects tend to be descriptions of function. For example, the gene for glucose 6-phosphate dehydrogenase is just called that in bacteria, but it is “Zwischenferment” (a German word) in Drosophila.
  • Who is going to do the annotations? There are a lot of genes, and no one is an expert in all of them.
    • One approach: use amateurs who are trained to follow certain guidelines and have easy access to as much useful information as possible. Problem: inconsistent results
    • Another approach: have experts in specific genes annotate all examples of that gene. Problem: getting experts for all genes and keeping them interested.
    • Yet another: do as much automated annotation as possible, with trained personnel examining only the hard cases. Problem: identifying the hard cases.
More Annotation
  • Need for experimental evidence. All gene identification is based on experimental work: biochemistry, genetics, etc.
    • Most annotation is thus based on logic like “Gene X in my organism is similar to gene Y in another organism that has been experimentally determined to have such-and-such a function.”
    • How similar is “similar”? Are there other functions that might use similar proteins?
  • Gene function predictions vary in their reliability: how well does the current gene match previously discovered genes?
  • We need gene names that are computer-recognizable. This means using a controlled vocabulary: only certain words and punctuation are used, and standard genes are named the same way in all organisms.
    • Gene Ontology descriptions are useful, but they are not detailed enough, and they tend to focus on human genes at the expense of bacterial genes.
    • Enzyme Commission (E.C.) numbers are very useful for enzymes because they describe a function precisely.
    • Otherwise, you either follow the conventions of the group you are working with or try to mimic the best BLAST hits
Confidence in Name Assignment
  • The basic hierarchy:
    • Confident assignment. We are almost certain we know what this gene is, based on its similarity to other genes.
      • If all of the top hits are high quality (say, better than 35% identical amino acids and within 20% of the same length), and they all have similar names, a gene name can be confidently given.
    • Some uncertainty, often with regard to exact enzyme or transporter specificity. Names are often called “putative” here.
    • We know it belongs to a gene family, or it contains a known domain, but function is unclear
    • Conserved hypothetical genes. Found in other species, but with no known function.
    • Hypothetical genes. The gene caller predicts a gene, but there is no match to any gene in another species.
  • But in fact these ideas are only loosely applied across many different annotation systems, and it is common to find highly similar genes given slightly different names.
    • Also, sometimes “hypothetical” is used too freely. It is always correct to call a gene hypothetical, but it doesn’t convey any useful information.
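
The hierarchy above can be approximated by a small rule set. This sketch is deliberately crude and every name in it is hypothetical: it uses the 35% identity and 20% length figures quoted above, treats "similar names" as identical names ignoring case, and skips the gene family/domain tier because that needs domain search results rather than BLAST hits.

```python
def name_confidence(hits, query_len, min_identity=35.0, max_len_diff=0.20):
    """Classify a gene call from its top BLAST hits.

    hits: list of (percent_identity, subject_length, name) tuples.
    """
    if not hits:
        return "hypothetical"            # no match in any other species
    good = [h for h in hits
            if h[0] > min_identity and abs(h[1] - query_len) <= max_len_diff * query_len]
    names = {h[2].lower() for h in hits}
    if len(good) == len(hits) and len(names) == 1:
        return "confident"               # all hits are high quality and agree
    if all("hypothetical" in n for n in names):
        return "conserved hypothetical"  # found elsewhere, but no known function
    return "putative"
```

In practice the hard part is deciding when two differently worded names describe the same function, which is one reason annotation systems end up giving similar genes slightly different names.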
Gene Ontology
  • One of my “rules of biology” I tell the introductory students is that quite often there is more than one word used to describe the same phenomenon, and the same word is often used to describe completely different phenomena
    • The citric acid cycle is also the tricarboxylic acid cycle and the Krebs cycle
    • “nucleus” of a cell and an atom
  • The Gene Ontology (GO) consortium is an attempt to describe gene products with a structured controlled vocabulary, a set of invariant terms that have a known relationship to each other.
  • Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For example, GO:0005102 is “receptor binding”.
  • There are 3 root terms: biological process, cellular component, and molecular function. A gene product will probably be described by GO terms from each of these “ontologies”. (ontology is a branch of philosophy concerned with the nature of being, and the basic categories of being and their relationships.)
    • For instance, cytochrome c is described with the molecular function term “oxidoreductase activity”, the biological process terms “oxidative phosphorylation” and “induction of cell death”, and the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”
  • The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a tree. This means simply that each term can have more than one parent term, but the direction of parent to child (i.e. less specific to more specific) is always maintained.
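
The difference between a tree and a directed acyclic graph shows up as soon as the ancestors of a term are collected: a term with two parents reaches the root by more than one route. The sketch below uses a toy graph with placeholder term names, not real GO relationships, and both function names are assumptions.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")

def is_go_id(text):
    """True if text has the form GO:nnnnnnn (exactly 7 digits)."""
    return GO_ID.fullmatch(text) is not None

def ancestors(term, parents):
    """All less specific terms reachable from term by child -> parent links.

    parents: dict mapping each term to a list of its parent terms.
    """
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:       # a term may be reached by several routes
                seen.add(parent)
                stack.append(parent)
    return seen
```

With the toy graph {"leaf": ["mid_a", "mid_b"], "mid_a": ["root"], "mid_b": ["root"]}, the ancestors of "leaf" are mid_a, mid_b and root, with root counted once even though two paths lead to it. Annotating a gene product with a specific term implicitly annotates it with all of that term's ancestors.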
More GO
  • Cellular component describes what larger structure the gene product is part of. For example, “ribosome” or “endoplasmic reticulum” or “cytoplasm”.
  • Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level. They are always described as activities: the enzyme adenylate cyclase is given the term “adenylate cyclase activity”
  • Biological process describes the higher level activity that the molecular function contributes to. For example, “signal transduction” or “mannose transport”
  • GO doesn’t go above the level of the cell, and it doesn’t deal with cell types.
    • It also doesn’t describe disease states or abnormal functions (cancer, for example).
    • It also doesn’t describe individual protein domains or gene structure.
  • Terms range in a hierarchy from very specific to very general. During annotation, the trick is to find terms that are as specific as possible without over-interpreting the data. This can be tricky with unfamiliar gene functions.
  • My opinion is that GO is a great tool, but hard to do well. And, it doesn’t quite get down to the level of exactly what the gene does. We really do want to name a gene “cytochrome c”, and not just use GO terms as descriptions.
Enzyme Nomenclature
  • Enzyme functions: which reactants are converted to which products
    • Across many species, the enzymes that perform a specific function are usually evolutionarily related. However, this isn’t necessarily true. There are cases of two entirely different enzymes evolving similar functions.
    • Often, two or more gene products in a genome will have the same E.C. number.
  • Enzyme functions are given unique numbers by the Enzyme Commission.
    • E.C. numbers are four integers separated by dots. The left-most number is the least specific
    • For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose components indicate the following groups of enzymes:
      • EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
      • EC 3.4 are hydrolases that act on peptide bonds
      • EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
      • EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide
  • Top level E.C. numbers:
    • E.C. 1: oxidoreductases (often dehydrogenases): electron transfer
    • E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between molecules.
    • E.C. 3: hydrolases: splitting a molecule by adding water to a bond.
    • E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule
    • E.C. 5: isomerases: rearrangements of atoms within a molecule
    • E.C. 6: ligases: joining two molecules using energy from ATP
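
Because E.C. numbers run from least to most specific, a full number can be expanded into its chain of enclosing groups. A minimal sketch, with hypothetical function names, using the six top-level classes listed above and the tripeptide aminopeptidase number 3.4.11.4 as the example input:

```python
import re

EC_CLASSES = {1: "oxidoreductases", 2: "transferases", 3: "hydrolases",
              4: "lyases", 5: "isomerases", 6: "ligases"}

EC_PATTERN = re.compile(r"(?:E\.?C\.?\s*)?(\d+)\.(\d+)\.(\d+)\.(\d+)")

def ec_lineage(ec):
    """'EC 3.4.11.4' -> ['3', '3.4', '3.4.11', '3.4.11.4'] (least to most specific)."""
    match = EC_PATTERN.fullmatch(ec.strip())
    if match is None:
        raise ValueError(f"not a complete E.C. number: {ec!r}")
    parts = match.groups()
    return [".".join(parts[:i]) for i in range(1, 5)]

def ec_class(ec):
    """Name of the top-level class for a full E.C. number."""
    return EC_CLASSES[int(ec_lineage(ec)[0])]
```

Two gene products with the same full number are annotated as catalyzing the same reaction, even if they are not evolutionarily related; numbers that share only a prefix share only the broader category.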
Information Used in Annotation
  • BLAST searches
  • HMM models of specific genes or gene families (Pfam, TIGRfam, FIGfam).
  • Sequence motifs and domains. If the gene is not a good match to previously known genes, these provide useful clues.
  • Cellular location predictions, especially for transmembrane proteins.
  • Genomic neighbors, especially in bacteria, where related functions are often found together in operons and divergons (genes transcribed in opposite directions that use a common control region).
  • Biochemical pathway/subsystem information. If an organism has most of the genes needed to perform a function, any apparently missing functions are probably present too.
    • Also, experimental data about an organism’s capacities can be used to decide whether the relevant functions are present in the genome.
Transmembrane Predictions
  • Integral membrane proteins contain amino acid sequences that go through the membrane one or several times.
    • There are also peripheral membrane proteins that stick to the hydrophilic head groups by ionic and polar interactions
    • There are also some that have covalently bound hydrophobic groups, such as myristate, a 14 carbon saturated fatty acid that is attached to the N-terminal amino group.
  • There are 2 main protein structures that cross membranes.
    • Most are alpha helices, and in proteins that span the membrane multiple times, these alpha helices are packed together in a coiled-coil. Length = 15-30 amino acids.
    • Less commonly, there are proteins with membrane spanning “beta barrels”, composed of beta sheets wrapped into a cylinder. An example: porins, which let small hydrophilic molecules diffuse across the membrane.
Hydrophobicity and Amphipathy
  • Membrane interiors are hydrophobic, so the simplest way of finding membrane-spanning regions is to look for relatively hydrophobic regions.
  • There are several measures of amino acid hydrophobicity available, based on partitioning in water vs. solvent or on crystallography of membrane proteins. No one scale dominates prediction models.
  • However, beta barrels and coiled-coils of alpha helices have interior regions that don’t need to be hydrophobic because they don’t interact with the hydrophobic fatty acid chains of the membrane.
    • Thus, many membrane-spanning regions are amphipathic: they have a hydrophobic side and a hydrophilic side.
    • The helical wheel is a simple way of visualizing this. It is a view looking down the helix. If most of the hydrophobic residues fall on one side, the sequence is likely to be membrane-spanning.
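
The simplest hydrophobicity search is a sliding-window average over a per-residue scale. The sketch below uses the Kyte-Doolittle hydropathy values as one of the several available scales; the 19-residue window and 1.6 cutoff are commonly cited settings for that scale and should be read as adjustable assumptions, as should the function names.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of the given size (standard 20 residues only)."""
    values = [KD[aa] for aa in protein.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def hydrophobic_windows(protein, window=19, cutoff=1.6):
    """Start positions of windows hydrophobic enough to be candidate membrane spans."""
    return [i for i, v in enumerate(hydropathy_profile(protein, window)) if v >= cutoff]
```

A plain average like this misses amphipathic spans, where only one face of the helix is hydrophobic. That is one reason the HMM approaches that model the whole topology tend to do better.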
HMM Prediction of Transmembrane Regions
  • Hidden Markov models seem to do a good job predicting transmembrane regions.
  • The states are: loops inside the cell, loops outside the cell, and transmembrane regions.
    • In addition, the cap amino acids (at the membrane/aqueous interface) can be a state, and it is possible to model globular domains either inside or outside the cell.
  • The HMM is circular, allowing for multiple passes through the membrane.
    • Many of the states allow transition back to themselves: there is more than one amino acid in the membrane interior, for example.
  • The model is parameterized using known membrane proteins (from X-ray crystallography).
  • The model pictured here is TMHMM.
Biochemical Pathways and Co-localization
  • Operon structure is often maintained over fairly large taxonomic regions.
    • Sometimes gene order is altered, and sometimes one or more enzymes are missing.
    • But in general, this phenomenon allows recognition or verification that widely diverged enzymes do in fact have the same function.
  • This is an operon that contains part of the glycolytic pathway.
    • 1: phosphoglycerate mutase
    • 2: triosephosphate isomerase
    • 3: enolase
    • 4: phosphoglycerate kinase
    • 5: glyceraldehyde 3-phosphate dehydrogenase
    • 6: central glycolytic gene regulator
Alternate Pathways
  • There are often alternate ways of going through a pathway.
    • Often dependent on taxonomic group (but beware of horizontal gene transfer).
    • Reversible pathways often have irreversible steps that need alternate enzymes to get around. And some species will only have the pathway functioning in one direction.
  • This pathway is glycolysis and gluconeogenesis in Bacillus megaterium. The colored boxes indicate enzymes that are present.
    • Both glycolysis and gluconeogenesis are present
    • Several alternative enzymes are not found here.
  • BIOLOG is a company that performs batteries of tests on bacteria. The idea is to develop a complete metabolic profile for the organism.
    • They are grown in microtiter plates with standard growth media supplemented or substituted with various possible nutrients or growth inhibitors. For example, carbon sources, nitrogen sources, phosphate sources, various osmotic strengths and pHs
    • Growth is checked over several days
    • Strain comparison or individual data
  • The yellow triangles in each well position are growth curves.
    • Red = strain A grew better than strain B; green is the opposite
    • Outlined boxes were significant by the company’s standards