NGS Bioinformatics Workshop
1.5 Genome Annotation
April 4th, 2012, IRMACS 10900, SFU
Facilitator: Richard Bruskiewich, Adjunct Professor, MBB
Acknowledgment: This week’s slides are partly courtesy of Professor Fiona Brinkman, MBB, with material from Wyeth W. Wasserman and Shannan Ho Sui.
Overview
• General observations about genomes
• Repeat masking
• Gene finding
• Gene regulation and promoter analysis
Some General Observations about Genomes
General Variables of Genomes
• Prokaryote versus eukaryote versus organelle
• Genome size:
  • Number of chromosomes
  • Number of base pairs
  • Number of genes
• GC/AT relative content
• Repeat content
• Genome duplications and polyploidy
• Gene content
See: Terence A. Brown, Genomes, 2nd edition. ISBN-10: 0-471-25046-5
See NCBI Bookshelf: http://www.ncbi.nlm.nih.gov/books/NBK21128/
Genome Size
• Physical:
  • Amount of DNA / number of base pairs
  • Number of chromosomes/linkage groups
• Information resources:
  • NCBI: http://www.ncbi.nlm.nih.gov/genome
  • Animals: http://www.genomesize.com/
  • Plants: http://data.kew.org/cvalues/
  • Fungi: http://www.zbi.ee/fungal-genomesize/
• Genetic:
  • Number of genes in the genome
Gregory TR. 2002. Genome size and developmental complexity. Genetica 115(1):131-46.
Size of Organelle Genomes http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5511
Size of Prokaryote Genomes http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5524
Size of Eukaryote Genomes http://www.ncbi.nlm.nih.gov/books/NBK21120/table/A5471
Number of Genes
http://www.ncbi.nlm.nih.gov/genome
(*) Haberer et al. Structure and architecture of the maize genome. Plant Physiol. 2005 Dec;139(4):1612-24
AT/GC Content
• Regional variation correlates with genomic content and function, such as transposable element distribution, gene density, gene regulation, methylation, etc.
• Often introduces bias in sequencing processes (e.g. library yields, PCR amplification, NGS sequencing)
Romiguier et al. 2010. Contrasting GC-content dynamics across 33 mammalian genomes: Relationship with life-history traits and chromosome sizes. Genome Res. 20: 1001-1009
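Regional GC variation can be inspected with a sliding window. The following is a minimal sketch, not a tool from the workshop: the function names, window size and toy sequence are illustrative assumptions.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G+C among unambiguous bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (window start, GC fraction) for each full window along seq."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence (assumption): an AT-rich stretch followed by a GC-rich stretch.
example = "ATATATATGCGCGCGC"
for start, gc in windowed_gc(example, window=8, step=8):
    print(start, gc)
```

On real genomes the window is typically kilobases wide; ambiguous bases (N) are excluded from the denominator here so that gaps do not drag the estimate down.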
Repeat Content
• Large genomes generally reflect evolutionary expansion of large families of repetitive DNA (by RNA/DNA transposon amplification/insertion, genetic recombination)
• Repeats drive genome mutational processes:
  • Recombination resulting in insertion, deletion, translocation, segmental duplication of DNA
  • Insertional mutagenesis, possibly including de novo creation of genes
  • Insertion of novel regulatory signals
• Repeats generally confound genome sequence assembly (especially for NGS, due to short reads). Gene annotation can also be problematic, as transposons mimic gene structures.
Jurka et al. 2007. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet. 8:241-59.
Genome Duplications/Polyploidy
• Segmental duplications (i.e. by recombination)
  • Tandem: direct and inverted
• Whole genome duplication & loss, e.g.
  • Ancestral vertebrate: 2 rounds
  • HOX gene clusters…
• Polyploidy: ~70% of all angiosperms
  • Genomic hybridization (allopolyploids)
  • Can lead to immediate and extensive changes in gene expression
  • Mapping of homeologous gene loci can be tricky
Dehal P and Boore JL. 2005. Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biol 3(10): e314. doi:10.1371
Adams and Wendel. 2005. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8(2):135–141
Case Study: Rice
• Ten duplicated blocks, containing 47% of the total predicted genes, were identified across all 12 chromosomes of Oryza sativa subspecies indica.
• A possible genome duplication occurred ~70 million years ago, supporting the polyploid origin of rice.
• An additional segmental duplication, ~5 million years ago, was identified involving chromosomes 11 and 12.
• Following the duplications, there have been large-scale chromosomal rearrangements and deletions. About 30–65% of duplicated genes were lost shortly after the duplications, leading to rapid diploidization.
Wang et al. 2005. Duplication and DNA segmental loss in the rice genome: implications for diploidization. New Phytologist 165: 937–946
Gene Content http://www.ncbi.nlm.nih.gov/books/NBK21120/figure/A5501
The Bottom Line
All of these genomic variables:
• Type of organism, i.e. prokaryote versus eukaryote
• Genome size
• GC/AT relative content
• Repeat content
• Genome duplications and polyploidy
• Gene content
are important factors that can drive the strategy, expected outcome and efficacy of genome sequence assembly and annotation.
Genome Repeat Masking
Genomic (DNA) Sequence Repeat Masking
• Classic approach: search against repeat libraries
• RepeatMasker: http://www.repeatmasker.org/
  • Uses a previously compiled library of repeat families
  • Uses a (user-configured) external sequence search program
  • Computationally intensive, but the project web site also provides “pre-masked” genomic data for many completed genomes, complete with some statistical characterization.
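Masked sequence is often distributed "soft-masked", with repeats in lowercase. The sketch below assumes that convention and shows how masked regions might be summarized or converted to hard masking (N); the function names and toy sequence are illustrative and not part of RepeatMasker.

```python
def masked_intervals(seq: str) -> list[tuple[int, int]]:
    """Return 0-based, half-open (start, end) intervals of lowercase (soft-masked) bases."""
    intervals = []
    start = None
    for i, base in enumerate(seq):
        if base.islower():
            if start is None:
                start = i  # a masked run begins here
        elif start is not None:
            intervals.append((start, i))  # the run ended at the previous base
            start = None
    if start is not None:
        intervals.append((start, len(seq)))
    return intervals


def masked_fraction(seq: str) -> float:
    """Fraction of the sequence that is soft-masked."""
    return sum(base.islower() for base in seq) / len(seq) if seq else 0.0


def hard_mask(seq: str) -> str:
    """Replace every soft-masked base with N."""
    return "".join("N" if base.islower() else base for base in seq)


toy = "ACGTacgtACGTaa"  # assumption: lowercase marks repeats
print(masked_intervals(toy))
print(masked_fraction(toy))
print(hard_mask(toy))
```

Soft masking keeps the underlying bases available to downstream tools, whereas hard masking discards them, which is why gene finders are often run on soft-masked input.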
More Repeat Masking …
• De novo identification and classification:
  • RECON: http://www.genetics.wustl.edu/eddy/recon
  • RepeatGluer: http://nbcr.sdsc.edu/euler/
  • PILER: http://www.drive5.com/piler
• Repeat databases:
  • Repbase: http://www.girinst.org/repbase/index.html
  • Plants: http://plantrepeats.plantbiology.msu.edu/
• Related algorithms:
  • “Probability clouds”
Gu et al. 2008. Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem. 380(1): 77–83
Gene Finding … or “What is a gene?”
Objectives
• Review the differences between prokaryotic and eukaryotic gene organization.
• Understand the consequences and challenges for gene-finding algorithms in prokaryotes and eukaryotes.
• Appreciate the hidden Markov model (HMM) as a powerful tool (in many areas of computational biology!)
• Be reminded that not all genes encode proteins, and that predicting such genes has its own computational challenges.
Gene Annotation Questions
• Which genes are present?
• How did they get there (evolution)?
• Are the genes present in more than one copy?
• Which genes are not there that we would expect to be present?
• What order are the genes in, and does this have any significance?
• How similar is the genome of one organism to that of another?
Why Gene-finding?
• Whole-genome annotation
  • A genome sequence does not give you a list of all genes
• Fully characterizing Yfg (“your favourite gene”)
  • Example: a disease is associated with a SNP at a location in the human genome. BLAST finds similarity to a protein-coding gene in the area, but it’s only similar to part of the whole protein. What’s the whole gene?
What is a gene?
• A distinct functional unit encoded (at minimum, transcribed) in the genome
• Eukaryotes: the region that is transcribed; the mRNA defines it
• Prokaryotes: the coding region, because multiple genes may be on one transcript as an operon
• Both eukaryotes and prokaryotes: non-coding RNAs are treated in much the same way
Raw Biological Materials
Prokaryotes:
• High gene density
• mRNA transcription and translation are coupled
• Genes are usually contiguous stretches of coding DNA
• mRNAs are often polycistronic
Eukaryotes:
• Low gene density
• mRNA is transcribed, then transported to the cytoplasm for translation
• Genes’ coding DNA is often split by non-coding introns
• mRNAs are generally monocistronic
(Figure: gene and transcript diagrams for prokaryotes and eukaryotes)
Great real-time transcription-translation video: http://www.youtube.com/watch?v=41_Ne5mS2ls
How many genes in the human genome?
• 2000: must be at least 100,000 (rice has ~40,000, C. elegans has ~19,000)
• 2001: only 35,000?
• 2005: Ensembl NCBI 35 release: 22,218 genes (33,869 transcripts)
• 2006: Ensembl NCBI 36 release: 23,710 protein-coding genes, plus 4421 RNA genes (48,851 transcripts)
• Today: Ensembl 64 release, Sept 2011: 20,900 coding genes + 14,266 RNA genes, but with alternative splicing these likely produce >100,000 proteins (178,191 currently annotated in Ensembl)
Gene Density: 1 gene in how many base pairs?
• 1:10,000,000
• 1:1,000,000
• 1:100,000 roughly for human
• 1:10,000 (1:5000 for C. elegans)
• 1:1000 roughly for most bacteria
• 1:100
• 1:10
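These orders of magnitude can be sanity-checked by dividing genome size by gene count. In the sketch below the gene counts for human and C. elegans come from the earlier slide, while the genome sizes and the E. coli gene count are rounded figures supplied here as assumptions.

```python
def bp_per_gene(genome_bp: float, n_genes: int) -> float:
    """Average number of base pairs per annotated gene."""
    return genome_bp / n_genes


# (genome size in bp, gene count); sizes are approximate assumptions.
genomes = {
    "human": (3.1e9, 20_900),        # protein-coding genes only
    "C. elegans": (1.0e8, 19_000),
    "E. coli K-12": (4.6e6, 4_300),  # assumed example of a typical bacterium
}

for name, (size, genes) in genomes.items():
    print(f"{name}: 1 gene per ~{bp_per_gene(size, genes):,.0f} bp")
```

The human figure lands nearer 1:150,000 when only protein-coding genes are counted, which is consistent with the slide's "roughly 1:100,000".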
Approaches to Finding Genes
• AB INITIO (‘from the beginning’):
  • Search by content: find genes by statistical properties that distinguish protein-coding DNA from non-coding DNA
  • Search by signals/sites: splice donor/acceptor sites, promoters, etc.
• HOMOLOGY:
  • Search by sequence similarity to homologous sequences
State-of-the-art gene finding combines these strategies and also uses other data, such as EST, RNA-seq and microarray data.
Gene-finding in Prokaryotes: Easy? … or not?
• ORF Finder
  • Open reading frame (ORF): from a methionine codon to the first stop codon
  • ORFs linked to BLAST
  • http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Problem: not all ORFs are genes. Why? How can this be improved?
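The ORF definition above (methionine codon to first in-frame stop) can be sketched as a six-frame scan. This is an illustrative toy, not the NCBI ORF Finder implementation; it only considers ATG starts, and the function names and `min_len` parameter are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 0) -> list[tuple[str, int, int, str]]:
    """Find ATG-to-first-stop ORFs in all six reading frames.

    Returns (strand, start, end, orf_sequence); coordinates are 0-based,
    half-open, and relative to the strand that was scanned.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

A scan like this reports every ORF, including many that arise by chance; raising `min_len` removes some of these, which is part of why content-based scoring is needed on top.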
Gene Finding: Ab Initio Search by Content
• The protein code affects the statistical properties of a DNA sequence:
  • Some amino acids are used more frequently than others (Leu is more popular than Trp)
  • Different amino acids have different numbers of codons (Leu has 6, Trp has 1)
  • For a given amino acid, usually one codon is used more frequently than the others
    • This is termed codon preference
    • These preferences vary by species
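Codon preference can be measured by counting codons in known coding sequences and comparing synonymous codons. A minimal sketch, assuming an in-frame coding sequence as input; the function names and the toy sequence are illustrative.

```python
from collections import Counter

# The six leucine codons and the single tryptophan codon (standard genetic code).
LEU_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]
TRP_CODONS = ["TGG"]


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons of a coding sequence; a trailing partial codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def relative_usage(counts: Counter, synonymous: list[str]) -> dict[str, float]:
    """Share of each codon among a set of synonymous codons."""
    total = sum(counts[codon] for codon in synonymous)
    return {codon: (counts[codon] / total if total else 0.0) for codon in synonymous}


toy_cds = "ATGCTGCTGTTATGG"  # Met Leu Leu Leu Trp (toy example)
counts = codon_counts(toy_cds)
print(relative_usage(counts, LEU_CODONS))
```

In practice the counts would be pooled over many trusted genes of the species of interest, since the preferences are species-specific.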
Gene-finding in Prokaryotes: Improving predictions…
Common way to search by content:
• Build Markov models of coding & noncoding regions
• Apply to ORFs or fixed-size sequence windows
Markov model approaches to prokaryotic gene prediction:
• Glimmer
  • http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
  • http://cbcb.umd.edu/software/glimmer/
  • Open source
• GeneMark
  • http://opal.biology.gatech.edu/GeneMark/
  • Not open source, but slightly more accurate
Markov chain example (weather): if it’s sunny, the next weather state is unlikely to suddenly become rainy. If it’s cloudy, it is more likely that the next weather state will be rainy. Reproduced with permission, Chris Burge
For gene prediction: certain bases follow others at different frequencies in coding sequence versus non-coding sequence. Reproduced with permission, Chris Burge
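The idea that base-to-base transitions differ between coding and non-coding DNA can be sketched as two first-order Markov chains compared by a log-odds score. This is a deliberately simplified illustration: Glimmer uses interpolated Markov models of higher order, and the training sequences, pseudocount and function names here are assumptions.

```python
import math

BASES = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | current base)."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in seqs:
        seq = seq.upper()
        for a, b in zip(seq, seq[1:]):
            if a in counts and b in counts[a]:  # skip ambiguous bases such as N
                counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
        for a in BASES
    }


def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over transitions; > 0 favours coding."""
    seq = seq.upper()
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        if a in coding and b in coding[a]:
            score += math.log(coding[a][b] / noncoding[a][b])
    return score


# Toy training data (assumption): "coding" is GC-rich, "non-coding" is AT-rich.
coding_model = train_markov(["GCGCGCGCGC"])
noncoding_model = train_markov(["ATATATATAT"])

print(log_odds("GCGC", coding_model, noncoding_model))
print(log_odds("ATAT", coding_model, noncoding_model))
```

Scoring each candidate ORF or window this way gives a ranking: a positive score means the sequence looks more like the coding training set than the non-coding one.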
What’s Hidden in a Hidden Markov Model?
• The HMM is a model of what actually happened (the true gene structure), but all you can see is the sequence
• Each nucleotide is not labeled with its state; it’s hidden
• Based on training data, which is critical!
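Recovering the hidden state labels from the visible sequence is commonly done with the Viterbi algorithm. The sketch below uses a toy two-state HMM (N = non-coding, C = coding); all probabilities are invented for illustration, whereas a real gene finder would estimate them from training data and use many more states.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs (computed in log space)."""
    if not obs:
        return ""
    # scores[t][s]: best log-probability of any path ending in state s at position t
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        scores.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((p, scores[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            scores[t][s] = best + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))


# Toy parameters (assumptions): states tend to persist; coding is GC-rich.
STATES = "NC"
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

sequence = "ATATATGCGCGCGCATATAT"
print(sequence)
print(viterbi(sequence, STATES, START, TRANS, EMIT))
```

Printing the path under the sequence makes the "hidden" labels visible: each position is assigned the state that best explains it given its neighbours, not in isolation.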
Glimmer version 3
• Annotates microbial genomes with a sensitivity of 99% and accuracy of >98%
• Resolves overlapping genes
• Requires an appropriate gene dataset for training the program
• No annotation of rRNA and tRNA genes
• Employs an interpolated Markov model
GeneMark.hmm
• HMM plus ribosome binding site signals in the primary nucleotide sequence
• Training also required
• Most accurate prokaryotic gene predictor (by a bit over Glimmer)
• GeneMarkS version, translation start prediction:
  • 83.2% of Bacillus subtilis genes
  • 94.4% of Escherichia coli genes
Main issues with prokaryotic gene prediction
• Which ATG or GTG start to choose? Start prediction is not that accurate
• Sequencing errors introduce frameshifts that lead to errors in gene prediction
• Very small genes are either not predicted at all (most prokaryotic genome projects don’t annotate any genes < 90 bp) or poorly predicted
• Non-coding RNA genes tend to be ignored!
• RNA-seq data are being incorporated now…
RNA-seq
• Sequence cDNA made from RNA using “next generation” sequencing
• Map the sequence reads to a reference genome
• Count the number of sequence reads (depth of reads) for a given window of the reference sequence
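The counting step can be sketched once reads have been mapped. The example below assumes reads are already reduced to 0-based (start, length) pairs on a single reference; in practice these would come from an aligner's output, and the function names are illustrative.

```python
def per_base_depth(reads, ref_length):
    """Depth at every reference position, from (start, length) read placements."""
    diff = [0] * (ref_length + 1)
    for start, length in reads:
        begin = max(start, 0)
        end = min(start + length, ref_length)
        if begin < end:
            diff[begin] += 1  # coverage rises where the read starts
            diff[end] -= 1    # and falls just past where it ends
    depth, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth


def window_counts(read_starts, ref_length, window):
    """Number of reads whose start falls in each consecutive window."""
    counts = [0] * ((ref_length + window - 1) // window)
    for pos in read_starts:
        if 0 <= pos < ref_length:
            counts[pos // window] += 1
    return counts


reads = [(0, 3), (2, 3)]  # toy placements (assumption)
print(per_base_depth(reads, ref_length=6))
print(window_counts([start for start, _ in reads], ref_length=6, window=3))
```

Windows with consistently elevated depth outside annotated genes are candidates for missed transcripts, which is how RNA-seq evidence feeds back into annotation.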
Rne (RNase E) Upstream Region
• The upstream region of rne in E. coli forms a stem loop to regulate mRNA synthesis
• The region aligns reasonably well between spp.
• Also predicted to encode a small non-coding RNA using QRNA and RNAz
(Figure: upstream region alignment, E. coli versus PAO1)
Gene prediction in eukaryotes
• An overview of eukaryotic gene structure will illustrate the gene-finding problem
• The components described are some of those that gene-finding tools must model
• These signals are never 100% reliable because there are (almost) as many exceptions as rules
• Signals (and noise) vary with organism
  • 3% of the human genome is “coding” vs. 25% in fugu
The problem with eukaryotes: the easy stuff?
The problem with eukaryotes: the hard stuff
• ~50% of human genes undergo alternative splicing
• Genes can be nested: overlapping on the same or opposite strand, or inside an intron
• Pseudogenes: “dead” genes