Biology 224 Instructor: Tom Peavy Oct 12 & 14, 2009

Gene Structure & Genomes
(Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner)

Similarities & Differences: Prokaryotic vs. Eukaryotic Genomic DNA
• Size of genome?
• Complexity of genes?
• Open reading frames (1 gene per stretch)?
• Regulatory sequences for transcription?
• Density of genes?
• One gene = 1 transcript?

Finding genes in eukaryotic DNA
Types of genes include:
• protein-coding genes
• pseudogenes
• functional RNA genes: tRNA, rRNA and others
  -- snoRNA: small nucleolar RNA
  -- snRNA: small nuclear RNA
  -- miRNA: microRNA
There are several kinds of exons:
-- noncoding
-- initial coding exons
-- internal exons
-- terminal exons
-- some single-exon genes are intronless

Eukaryotic gene prediction algorithms distinguish several kinds of exons

Gene-finding algorithms
Homology-based searches (“extrinsic”): rely on previously identified genes.
Algorithm-based searches (“intrinsic”): investigate nucleotide composition, open reading frames, and other intrinsic properties of genomic DNA.
(Refer to Chapter 16, Eukaryotic Chromosome, Figure 16-9 for a list of extrinsic vs. intrinsic algorithms.)

Extrinsic, homology-based searching: compare genomic DNA to expressed genes (ESTs).
[Figure labels: DNA, intron, RNA, protein]

Intrinsic, algorithm-based searching: identify open reading frames (ORFs). Compare DNA in exons (unique codon usage) to DNA in introns (unique splice sites) and to noncoding DNA.
[Figure labels: DNA, RNA]
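As a rough illustration of the ORF-identification step, the sketch below scans all six reading frames for ATG...stop stretches. It is a minimal example, not how GENSCAN or any real gene finder works: it ignores introns, splice sites, and codon usage. The function names and the `min_codons` parameter are hypothetical, and the input is assumed to contain only A, C, G, and T.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) for every ATG...stop ORF.

    start and end are 0-based, end-exclusive positions on the strand
    that was scanned. The codon count includes the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))
```

In eukaryotic genomic DNA an ORF scan like this finds only single-exon stretches, which is why intrinsic programs also model splice sites and coding composition.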

Comparative genomics: compare gene models between species. (For annotation of the chimpanzee genome reported in 2005, BLAT and BLASTZ searches were used to align the two genomes.)
[Figure labels: human DNA, chimpanzee DNA]

Finding genes in eukaryotic DNA
Cautionary notes:
-- The quality of EST sequence is sometimes low
-- Highly expressed genes are disproportionately represented in many cDNA libraries
-- ESTs provide no information on genomic location

Finding genes in eukaryotic DNA Both intrinsic and extrinsic algorithms vary in their rates of false-positive and false-negative gene identification. Programs such as GENSCAN and Grail account for features such as the nucleotide composition of coding regions, and the presence of signals such as promoter elements.

Finding genes in eukaryotic DNA
In a study using 100,000 base pairs of human DNA, intrinsic algorithms correctly identified several exons of RBP4, but failed to generate a complete gene model. As another example, initial annotation of the rice genome yielded over 75,000 gene predictions, only 53,000 of which were complete (having initial and terminal exons). Also, it is very difficult to accurately identify exon-intron boundaries. Estimates of gene content improve dramatically when finished (rather than draft) sequence is analyzed.

Genome sequencing projects
There are three main resources for genomes:
• EBI: European Bioinformatics Institute, http://www.ebi.ac.uk/genomes/
• NCBI: National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov
• TIGR: The Institute for Genomic Research, http://www.tigr.org

C value paradox: why eukaryotic genome sizes vary
The haploid genome size of eukaryotes, called the C value, varies enormously.
Small genomes include:
• Encephalitozoon cuniculi (2.9 Mb)
• A variety of fungi (10-40 Mb)
• Takifugu rubripes (pufferfish) (365 Mb) (same number of genes as other fish or as the human genome, but 1/10th the size)
Large genomes include:
• Pinus resinosa (Canadian red pine) (68 Gb)
• Protopterus aethiopicus (marbled lungfish) (140 Gb)
• Amoeba dubia (amoeba) (690 Gb)

Genome sizes in nucleotide base pairs
[Figure: genome size ranges on an axis from 10^4 to 10^11 bp for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals]
The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA. The human genome is thought to contain ~30,000-40,000 genes.
http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt

C value paradox: why eukaryotic genome sizes vary The range in C values does not correlate well with the complexity of the organism. This phenomenon is called the C value paradox. Why?

Britten and Kohne (1968) identified repetitive DNA classes
Reassociation kinetics: isolate genomic DNA, shear it, denature (melt) it, and measure the rate of DNA reassociation.
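One way to see why reassociation rates reveal repetitive DNA is the ideal second-order rate law, under which the fraction of DNA still single-stranded after time t is 1 / (1 + k·C0·t), where C0 is the starting concentration. The sketch below assumes that idealized law; the rate constants are hypothetical and chosen only to contrast a fast (repetitive) component with a slow (single-copy) one.

```python
def fraction_single_stranded(c0, t, k):
    """Fraction of DNA still single-stranded under ideal second-order kinetics."""
    return 1.0 / (1.0 + k * c0 * t)

def cot_half(k):
    """C0*t value at which half of the DNA has reassociated (equals 1/k)."""
    return 1.0 / k

# Hypothetical rate constants: repetitive DNA finds partners faster
# because each sequence is present in many copies.
k_repetitive = 100.0
k_single_copy = 0.01

for label, k in (("repetitive", k_repetitive), ("single-copy", k_single_copy)):
    print(label, "Cot1/2 =", cot_half(k))
```

A component with a low Cot1/2 reassociates early, which is how distinct repetitive classes show up as separate steps in a reassociation curve.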

Protein-coding genes in eukaryotic DNA: a new paradox
Why is the number of protein-coding genes about the same for worms, flies, plants, and humans? This has been called the N-value paradox or the G-value paradox (both names refer to the number of genes).

Five main classes of repetitive DNA
1. Interspersed repeats (RNA/DNA transposon-derived)
   -- approx. 45% of human genome (e.g. LINEs, SINEs, Alu)
2. Processed pseudogenes (gene loss)
3. Simple sequence repeats
   -- Microsatellites (1-12 bp); minisatellites (12-500 bp)
4. Segmental duplications
   -- blocks of about 1 kilobase to 300 kb that are copied intra- or interchromosomally (5% of human genome)
5. Blocks of tandem repeats
   -- includes telomeric and centromeric repeats and can span millions of bp (often species-specific)

The spectrum of variation
Category of variation         Size            Type
Single base pair changes      1 bp            SNPs, point mutations
Small insertions/deletions    1 – 50 bp
Short tandem repeats          1 – 500 bp      microsatellites
Fine-scale structural var.    50 bp – 5 kb    del, dup, inv, tandem repeats
Retroelement insertions       0.3 – 10 kb     SINEs, LINEs, LTRs, ERVs
Intermediate-scale struct.    5 kb – 50 kb    del, dup, inv, tandem repeats
Large-scale structural var.   50 kb – 5 Mb    del, dup, inv, large tandem repeats
Chromosomal variation         >>5 Mb          aneuploidy
Adapted from Sharp AJ et al. (2006) Annu Rev Genomics Hum Genet 7:407-42

[Figure: human chromosome 21 at www.ensembl.org; labels: nucleolar organizing center, centromere]

[Figure: human chromosome 21 at UCSC Genome Browser; label: centromere]

Chromosomes can be highly dynamic, in several ways.
• Whole genome duplication (autopolyploidy) can occur, as in yeast (Chapter 15) and some plants.
• The genomes of two distinct species can merge, as in the mule (male donkey, 2n = 62 and female horse, 2n = 64)
• An individual can acquire an extra copy of a chromosome (e.g. Down syndrome, trisomy 13 or 18)
• Chromosomes can fuse; e.g. human chromosome 2 derives from a fusion of two ancestral primate chromosomes
• Chromosomal regions can be inverted or deleted
• Segmental and other duplications occur

Conservative nature of chromosome evolution
Among placental mammals, the number of diploid chromosomes is:
• 84 in black rhinoceros
• 46 in Homo sapiens
• 17 in two rodent species
The process of chromosome evolution tends to remain conservative. Heterozygous carriers of most types of chromosomal rearrangements are semisterile. Thus many chromosomal changes cannot be fixed.
Ohno (1970) p. 41

Diploidization of the tetraploid
A species can become tetraploid. All loci are duplicated, and what was formerly the diploid chromosome complement is now the haploid set of the genome. Polyploid evolution occurs commonly in plants. For example, in the cereal plant Sorghum:
• S. versicolor (diploid) 2n = 2 x 5; 10 chromosomes
• S. sudanense (tetraploid) 4n = 4 x 5; 20 chromosomes
• S. halepense (octoploid) 8n = 8 x 5; 40 chromosomes
Ohno (1970) pp. 98-101

“Retrotransposons constitute over 40% of the human genome and consist of several millions of family members. They play important roles in shaping the structure and evolution of the genome and in participating in gene functioning and regulation. Since L1, Alu, and SVA retrotransposons are currently active in the human genome, their recent and ongoing retrotranspositional insertions generate a unique and important class of genetic polymorphisms (for the presence or absence of an insertion) among and within human populations. As such, they are useful genetic markers in population genetics studies due to their identical-by-descent and essentially homoplasy-free nature. Additionally, some polymorphic insertions are known to be responsible for a variety of human genetic diseases. dbRIP is a database of human Retrotransposon Insertion Polymorphisms (RIPs). dbRIP contains all currently known Alu, L1, and SVA polymorphic insertion loci in the human genome.” --dbRIP Homoplasy: having some states arise more than once on a tree.

http://falcon.roswellpark.org:9090/index.html

Five main classes of repetitive DNA
2. Processed pseudogenes
These genes have a stop codon or frameshift mutation and do not encode a functional protein. They commonly arise from retrotransposition, or following gene duplication and subsequent gene loss. For a superb on-line resource, visit Mark Gerstein’s website, http://www.pseudogene.org. Gerstein and colleagues (2006) suggest that there are ~19,000 pseudogenes in the human genome, slightly fewer than the number of functional protein-coding genes (11,000 non-processed; 8,000 processed, which lack introns).

Five main classes of repetitive DNA
3. Simple sequence repeats
Microsatellites: from one to a dozen base pairs. Examples: (A)n, (CA)n, (CGG)n. These may be formed by replication slippage.
Minisatellites: a dozen to 500 base pairs.
Simple sequence repeats of a particular length and composition occur preferentially in different species. In humans, an expansion of triplet repeats such as CAG is associated with at least 14 disorders (including Huntington’s disease).
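A simple way to spot perfect microsatellites such as (CA)n or (CAG)n is a regular expression with a back-reference. The sketch below is illustrative only: the function name and thresholds are hypothetical, it reports only perfect (uninterrupted) repeats, and dedicated repeat finders handle imperfect repeats that this does not.

```python
import re

def find_microsatellites(seq, max_unit=12, min_repeats=4):
    """Return (start, unit, copies) for perfect tandem repeats.

    The lazy quantifier makes the regex try the shortest repeat unit
    first, so CAGCAGCAGCAG is reported as (CAG)4 rather than (CAGCAG)2.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

# A short stretch containing five copies of the CAG triplet.
print(find_microsatellites("TTCAGCAGCAGCAGCAGTT"))
```

Counting copies this way is the basic measurement behind triplet-repeat expansion, although clinical thresholds for any disorder are not encoded here.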

Successive tandem gene duplications (after Lacazette et al., 2000) Fig. 16.3
[Figure label: observed today]


Transcription factor databases
In addition to identifying repetitive elements and genes, it is also of interest to predict the presence of genomic DNA features such as promoter elements and GC content.
Websites that predict transcription factor binding sites and related sequences include:
• AliBaba2 (http://www.gene-regulation.de/)
• Eukaryotic Promoter Database (http://www.epd.isb-sib.ch)
• PlantProm (http://mendel.cs.rhul.ac.uk)

Eponine predicts transcription start sites in promoter regions. The algorithm uses a set of DNA weight matrices recognizing sequence motifs that are associated with a position distribution relative to the transcription start site.
[Figure: the Eponine model]
The specificity is good (~70%), and the positional accuracy is excellent. The program identifies ~50% of TSSs, although it does not always know the direction of transcription.
http://www.sanger.ac.uk/Users/td2/eponine
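To make the idea of a DNA weight matrix concrete, the sketch below scores every window of a sequence against a small log-odds matrix. This is not Eponine's model, which also attaches a position distribution to each matrix; it shows only the generic matrix-scoring step. The 4-bp motif, its probabilities, and the threshold are hypothetical values invented for illustration, and the input is assumed to contain only A, C, G, and T.

```python
import math

# Hypothetical per-position base probabilities for a 4-bp TATA-like motif.
MOTIF_PROBS = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]

def log_odds_matrix(probs, background=0.25):
    """Convert base probabilities to log2 odds against a uniform background."""
    return [{b: math.log2(p / background) for b, p in col.items()}
            for col in probs]

def scan(seq, matrix, threshold):
    """Return (position, score) for each window scoring at or above threshold."""
    seq = seq.upper()
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(matrix[j][seq[i + j]] for j in range(width))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits

matrix = log_odds_matrix(MOTIF_PROBS)
print(scan("GGCTATAGGC", matrix, threshold=5.0))
```

Raising the threshold trades sensitivity for specificity, the same trade-off reflected in the specificity and TSS-recovery figures quoted for Eponine.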

The ENCODE project
Goal of ENCODE: build a list of all sequence-based functional elements in human DNA. This includes:
► protein-coding genes
► non-protein-coding genes
► regulatory elements involved in the control of gene transcription
► DNA sequences that mediate chromosomal structure and dynamics

VISTA output for an alignment of human and mouse genomic DNA (including RBP4)

Chronology of genome sequencing projects
1977 First viral genome (Sanger et al., bacteriophage φX174; 11 genes)
1981 Human mitochondrial genome: 16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA). Today, over 400 mitochondrial genomes sequenced
1986 Chloroplast genome: 156,000 base pairs (most are 120 kb to 200 kb)
1995 Haemophilus influenzae genome sequenced
1996 Saccharomyces cerevisiae (1st eukaryotic genome) and archaeal genome, Methanococcus jannaschii

Chronology of genome sequencing projects
1997 More bacteria and archaea; Escherichia coli: 4.6 megabases, 4200 proteins (38% of unknown function)
1998 Nematode Caenorhabditis elegans (1st multicellular organism): 97 Mb; 19,000 genes
1999 First human chromosome: chromosome 22 (49 Mb, 673 genes)
2000 Drosophila melanogaster (13,000 genes); plant Arabidopsis thaliana; human chromosome 21
2001 Draft sequence of the human genome (public consortium and Celera Genomics)

Overview of genome analysis
[1] Selection of genomes for sequencing
[2] Sequence one individual genome, or several?
[3] How big are genomes?
[4] Genome sequencing centers
[5] Sequencing genomes: strategies
[6] When has a genome been fully sequenced?
[7] Repository for genome sequence data
[8] Genome annotation

Overview of genome analysis
[1] Selection of genomes for sequencing is based on criteria such as:
• genome size (some plants are >>> human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture

Overview of genome analysis
[2] Sequence one individual genome, or several?
-- Each genome center may study one chromosome from an organism
-- It is necessary to measure polymorphisms (e.g. SNPs) in large populations
For viruses, thousands of isolates may be sequenced. For the human genome, cost is the impediment.

Overview of genome analysis
[3] How big are genomes?
Viral genomes: 1 kb to 350 kb (Mimivirus: 800 kb)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 686,000 Mb (686 Gb)

Overview of genome analysis [4] 20 Genome sequencing centers contributed to the public sequencing of the human genome. Many of these are listed at the Entrez genomes site.

Overview of genome analysis
[5] There are two main strategies for sequencing genomes:
a) Whole genome shotgun (WGS) method
   -- applied to the entire genome all at once (sequenced fragments ordered by alignment of overlaps)
VERSUS
b) Hierarchical shotgun method
   -- applied to large overlapping DNA fragments of known location in the genome (assemble contigs from chromosomes, then systematically sequence them and reassemble the complete sequence)
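Ordering sequenced fragments "by alignment of overlaps" can be illustrated with a toy greedy assembler that repeatedly merges the pair of reads with the longest suffix-prefix overlap. This is a minimal sketch under strong assumptions: reads are error-free, all come from the same strand, and the genome has no repeats. Real assemblers do not work this simply, and the function names and `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads by longest overlap until no overlaps of min_len remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Three overlapping fragments of a short made-up sequence.
print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]))
```

Repeats longer than a read can make the greedy choice join the wrong fragments, which is one reason the hierarchical method first restricts assembly to large clones of known location.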

Overview of genome analysis
[6] When has a genome been fully sequenced?
A typical goal is to obtain five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously sequenced to a high quality standard (error rate of 0.01%). There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
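Fold coverage is simply total sequenced bases divided by genome size. The sketch below computes it and, assuming reads land uniformly at random (the Poisson approximation of Lander and Waterman), estimates the fraction of the genome left unsequenced as e^(-coverage). The read length of 500 bp is an assumed value for illustration, not a figure from these slides.

```python
import math

def coverage(n_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_size

def expected_fraction_uncovered(c):
    """Poisson estimate of the fraction of bases hit by no read at coverage c."""
    return math.exp(-c)

def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

GENOME_SIZE = 3_000_000_000  # ~3 x 10^9 bp human genome
READ_LENGTH = 500            # assumed read length in bp

for c in (1, 5, 10):
    print(c, "x coverage:", reads_needed(c, READ_LENGTH, GENOME_SIZE),
          "reads, fraction uncovered ~", expected_fraction_uncovered(c))
```

Under these assumptions even five-fold coverage leaves a small fraction of bases unsequenced, which is consistent with draft sequence containing gaps.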

Overview of genome analysis [7] Repository for genome sequence data Raw data from many genome sequencing projects are stored at the trace archive at NCBI or EBI (main NCBI page, bottom right)
