Genome annotation

Genome annotation The complete genome sequence by itself is just raw data and does not mean anything unless annotated Annotation means adding specific structural or functional information about a specific region of the genome or for the whole genome Eg. Total number of genes in the genome, repeat richness, GC content of the genome, location, structure and function of genes, splice variants of a gene, transcription factor binding sites, promoters etc. One of the major objectives of the human genome project was to identify all genes in the human genome which were estimated to be anywhere between 50,000 to 100,000 Gene prediction algorithms and wet lab evidence is used to predict and confirm genes in the human genome.

Approaches to in silicogene prediction A. Statistical or abinitio methods: These methods attempt to predict genes based on statistical properties of the given DNA sequence e.g. Genscan, GeneID, GENIE and FGENEH. B. Homology based methods: The given DNA string is compared with a similar DNA string from a different species at the appropriate evolutionary distance and genes are predicted in both sequences based on the assumption that exons will be well conserved, whereas introns will not. Programs are e.g. CEM (conserved exon method) and Twinscan. c. Combined evidence approach – using the two together The given DNA sequence may also be compared with known protein structures. Programs are e.g. TBLASTN or TBLASTX, Procrustes and GeneWise.

Prokaryotic genes may be organized into operons Prokaryotic gene organization Bacterial genes are organized into operons, or clusters of coregulated genes, encoded by a single mRNA In addition to being physically close in the genome, these genes are regulated such that they are all turned on or off together. Grouping related genes under a common control mechanism allows bacteria to rapidly adapt to changes in the environment Generally - No split gene structure in prokaryotes

Prokaryote Gene annotation relatively easy • Smal size • 90% or more of the genome is protein coding • No split gene structure • Operons organization – genes of the same pathway arranged in tandem • Regulatory regions are relatively simple • Thus prokaryotic genomes are more amenable to annotation • 6 reading frame : explain

Prokaryotic genome annotation Majorly involves predicting ORFs higher than a threshold length

gene prediction in prokaryotes ( heading) In prokaryotes, greater than 90% of the genome codes for proteins and therefore prediction of ORFs > 100 AMINO ACIDS is a fairly good indication of a putative gene

De novo genome annotation A well annotated genome represents the highest resolution physical map However, annotating eukaryotic genomes with respect to genes is more difficult as compared to prokaryotes because of increased size, low gene density (only 1.5% of the genome accounts for genes) and the inherent complexity of gene structure in eukaryotes – split genes

Eukaryotic gene organization Gene structure in Eukaryotes Proximal or core promoter, with distal regulatory elements like enhancer, silencer, boundary etc. Gene sequence itself has a 5’ UTR and 3’ UTR which have additional regulatory functions Split gene structure – exons, introns, Alternative splicing Genes form a very small part of the eukaryotic genomes – 1.5% of the human genome

Basic rules … For eukaryotic gene prediction – basic structural feature include start and stop codons, and splice acceptor and splice donor sites Start site – ATG (AUG), Stop site – TAA, TGA and TAG ; Splice donor site – GT and acceptor site - AG Since eukaryotic genomes are large and genes account for only 1.5% of the genome, predictions based only on the above criteria are likely to be wrong – e.g. every ATG does not represent a start codon Therefore, additional parameters need to taken account into for accurate predictions – e.g. the presence of promoter elements such as TATA box, CAAT box or GC box upstream of real start codon, coupled with the codon usage assists in positive identification

Codon usage table Codon bias The genetic code is degenerate i.e. one amino acid can be encoded by more than one codon Eg – Leucine In such cases it has generally been seen that one of the codons is used more commonly than other for coding the amino acid Eg. CUG is used in 49% cases as compared to the other five

Structural features of a eukaryotic gene GT AG Splice sites ATG STOP -35 -10 0 10 TTCCAA TATACT GGAGG The codon usage information can be overlaid on this to ensure accuracy of predictions. Helps differentiate between genic and intergenic regions Prinbow box Ribosomal binding site Transcription start site Preferred codons occur more frequently in coding regions as compared to non protein coding region

Common gene prediction programs GLIMMER (Gene Locator and Interpolated Markov ModelER) is a system for finding genes in microbial DNA, especially the genomes of bacteria and archaea using interpolated Markov models to identify coding regions GENSCAN - is a system for finding genes in eukaryotic DNA, uses modified hidden markov models based on statistical methods and on data from an annotated training set TWINSCAN – Uses both HMM and Homology search to annotate genes

Genome annotation (wet lab) All software predictions need to be confirmed by wet lab experiments, a process called curation, by isolation of corresponding RNA of the gene cDNA libraries, expressed sequence tags and cross species hybridization or PCR helped in positive identification of human genes E1 I1 E2 I2 E3 I3 E4 Genomic DNA mRNA or EST from same or closely related genomes cDNAs or ESTs confirm presence of genes However, c DNA libraries are stage and time specific, and are biased towards genes that have higher expression levels Insilico prediction of genes and confirmation in wet lab is therefore important

Annotation projects - ENCODE • ENCODE – Encyclopedia of functional DNA elements in human genome - to develop techniques and protocols to functionally annotate the genome completely • ENCODE investigators employ a variety of assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing RNA from a diverse range of sources, comparative genomics, integrative bioinformatic methods, and human curation. Regulatory elements are typically investigated through DNA hypersensitivity assays, assays of DNA methylation, and chromatin immunoprecipitation (ChIP) of proteins that interact with DNA, including modified histones and transcription factors, followed by sequencing (ChIP-Seq) http://genome.ucsc.edu/ENCODE/ • modENCODE - Encyclopedia of functional DNA elements in other model organisms - Drosophila

Repeat Annotation Relatively young, but rapidly growing field propelled by the advances in sequencing of whole genomes, high end computing and enhanced storage capacity. Revolves around transposon discovery and annotation and uses the following methods • De novo repeat discovery • Homology based methods • Structure based methods • Comparative genomics methods

De novo repeat discovery As the name indicates prior information about TE structure or homology to known TE sequences is not used. The inherent repetitive nature of transposons and other types of repeats such as tandem repeats is employed for identification and classification. De novo repeat discovery basically involves identification of repeats in the sequence and defining repeat families. Several strategies including dot-plot, k-mer, spaced–seed, self comparison and periodicity approaches have evolved to identify repeats in a given sequence ab initio.

Homology based methods Based on detecting homology to known TE protein coding sequences, it is the most common approach to detect TE families. Offers several advantages over de novo TE discovery being more accurate and in identifying TEs with very low copy numbers However, it is biased towards detection of previously identified TEs and to TEs not too diverged from the known ones Another disadvantage is that it is not applicable to TEs that consist entirely of non-coding sequences such as SINES (Small interspersed nuclear elements eg Alu in humans) or MITES (miniature inverted repeat transposable elements) Homology based methods are more commonly used for detection as in RepeatMasker and Censor

Structure based methods Structure based methods rely on identifying structural hallmarks of a particular class of TEs e.g. LTRs for LTR-retrotransposons. These have high specificity and are not biased against TEs with low copy numbers. However, these are limited by the fact that there is great diversity in TEs structure and some TEs do not have well defined architectural features so as to merit structure based identification. LTR retrotransposons are particularly suited to this approach since full length elements possess several distinct architectural features such as LTRs, target site duplications (TSDs), primer binding sites (PBSs), polypurine tracts (PPTs) and ORFs for gag and pol, which facilitate structure based identification. Structure based algorithms - LTR_STRUC and LTR_FINDER

Comparative genomics methods Largely done to annotate newly sequenced genomes as also to identify differential insertions at orthologous loci in closely related species May help in identification of species specific repeat families e.g. RISCI – Repeat Induced Sequence Changes Identifier

Genome annotation

Genome annotation

Presentation Transcript

Genome analysis and annotation

MICROBIAL GENOME ANNOTATION

Computational Genome Annotation

Genome Annotation

Genome Annotation

Eukaryotic Genome Annotation

Genome Assembly and Annotation

Genome Annotation

Genome Annotation

Bioinformatics and Genome Annotation

Genome Annotation

Basics of Genome Annotation

Genome Annotation Continued

microbial genome annotation

Genome Annotation

Genome Annotation

VectorBase genome annotation

Eukaryotic Genome Annotation

Arabidopsis Genome Annotation

Genome sequencing and annotation

Genome analysis and annotation

Bioinformatics and Genome Annotation