Genome organisation and evolution

Genome organisation and evolution Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections 3.1.4/5 and 3.3

Single-copy protein coding genes Dispersed Multigene families Tandemly repeated Regulatory sequences Satellite DNA Tandemly repeated DNA Minisatellites Transposable elements And retroviruses Microsatellites Spacer DNA The eukaryotic genome Coding DNA Non-coding DNA

Mus musculus Homo sapiens Pisum sativum Amoeba dubia Xenopus laevis Escherichia coli Lilium longiflorium Caenorhabditis elegans Protopterus aethiopicus Mycoplasma pneumoniae Saccharomyces cerevisiae Drosophila melanogaster The C-value paradox • The amount of DNA per haploid genome is known as the C-value • Contrary to expectation, the amount of DNA is not correlated with complexity: • The protist, Amoeba dubia has about 200 times more DNA (670,000,000 kbp) than humans (3,300,000 kbp) • Cannot be explained by differences in gene number

The structure of genes • There are many forms of genes: • Those which produce a protein, a tRNA or an rRNA are referred to as structural genes • Those which control how and when genes are expressed are called regulatory genes • Some housekeeping genes need to be expressed in all tissues e.g. those involved in protein synthesis • Other, tissue-specific genes, are only expressed in a particular cell or tissue type e.g. the insulin gene is only expressed in the pancreatic β-cells • Whatever their function, all genes contain a coding region which specifies a polypeptide or an RNA molecule

Regulation of gene expression • Coding regions of genes are usually flanked by regulatory regions which control gene expression through transcription and translation • Upstream promoter regions: • In bacteria, there is a Pribnow box (TATAAT) about 10 bp upstream from where transcription starts, the ‘-35 site’ (TTGACA) about 35 bp upstream and the Shine-Dalgarno box (AGGAGG) about 7 bp before the initiation codon • In eukaryotes, as well as the TATA box, some promoter regions contain a CAAT box about 40 bp before initiation codon and a GC box (GGGCGG) about 110 bp upstream • Downstream elements such as the polyadenylation signal (AATAA) signify the end of transcription and increase stability of RNA transcripts

Promoter region • TATA box • CAAT box (in mammals) • GC box (GGGCGGG) • Promoter region • Shine-Dalgarno box (AGGAGG) • Pribnow box (TATAAT) • -35 site (TTGACA) Polyadenylation signal AATAA Eukaryote Exon 1 Exon 2 Exon 3 Exon 4 5’ 3’ Intron 1 Intron 2 Intron 3 Initiation codon Initiation codon Stop codon Stop codon Prokaryote 5’ 3’ Structure of a typical gene - alcohol dehydrogenase (Adh)

Introns • Occur frequently within eukaryotic genomes and make up most of the length of very long genes • Number, size and organisation of introns varies: • Histones have no introns: chicken pro-a2-collagen gene has over fifty • SV40 virus contains an intron of 31 bp: human dystrophin gene has an intron of over 210,000 bp • Some introns have genes contained within them - the Adh gene in Drosophila is located within the intron of the outspread gene • Strong conservation of intron-exon boundaries - nearly always begin with GT and end with AG

Types of introns • Most introns in eukaryotes are spliceosomal introns (‘nuclear introns’) because they are spliced by a spliceosome of proteins and RNA • Some introns can splice without the aid of proteins (“self-splicing introns”): • One class - group I introns - are sometimes mobile because they encode proteins such as DNA endonucleases. They are found in mitochondrial and chloroplast genomes, rRNAs of some eukaryotes and in T4 bacteriophage • Group II introns are found in organelles and their bacterial ancestors and contain reverse transcriptase-like sequences • Group III introns are found in a few protists and are similar to group II introns with the central portion removed

The evolution of introns • There are two competing hypotheses for the evolution of spliceosomal introns: • The introns-early hypothesis, proposed by Walter Gilbert, suggests that introns mark the boundaries between ancient genes which encoded distinct proteins. • Throughout evolution these once-independent proteins have been put together in new combinations to produce more complex proteins by exon shuffling • An alternative hypothesis (introns-late) suggests that introns only invaded eukaryote genomes fairly recently

The evolution of introns (continued) • A crucial prediction of the introns-early hypothesis is that spliceosomal introns delineate structural or functional units within proteins: • Introns are found in the same places in all known globin genes, including myoglobin and plant leghaemoglobins • More frequently, however, introns do not appear to separate functionally distinct parts of proteins • Other problem with introns-early hypothesis is absence from Archaea and Bacteria: • Massive intron loss has been postulated but does not explain why they are found in nuclear copies of organelle genes but not in the genes of the organelles or their precursors • Exon shuffling has probably been a factor in later eukaryotes

Multigene families • Many genes are found not as individual copies but as part of multigene families, larger families of related genes: • Important evolutionary innovation: proteins with similar function can be arranged so that they are regulated efficiently • Vertebrates have a variety of multipolypeptide globin genes, produced by gene duplication, which are adapted to varying oxygen requirements of different developmental stages • Not all genes are functional: • Pseudogenes arise through gene duplications but acquire mutations since only one copy is required • Processed pseudogenes, which lack promoters and introns, have been produced by reverse transcription of mRNA

Embryonic Foetal Pseudogene Adult e g1 g2 yh d b 0 100 Millions of years ago 200 Multigene families (continued)

Evolution of multigene families • Most obvious way in which gene number can change between species is through gene duplication: • Can arise through unequal crossing-over • May occur by duplication of entire genomes (polyploidy): • Common in plants: around 50% of angiosperms are polyploid • Xenopus laevis is tetraploid: normal meiosis is possible • Other members of the genus Xenopus have chromosome numbers ranging from 20 to 108 • Another mechanism of geneduplication is transposition • Fate of new gene depends on function: redundancy vs. natural selection • Genes can also acquire new functions without duplication e.g. e-crystallin and LDH

Gene duplication in the Hox gene family • Homeotic genes control the development of body plan in animals • In both vertebrate Hox and invertebrate HOM genes, there is a highly conserved protein motif known as a homeobox • Mutations in Hox/HOM genes can drastically affect the organisation of body parts

Gene duplication in the Hox gene family • Although Hox/HOM genes are related, their organisation differs between organisms: • In vertebrates, there are multiple clusters of Hox genes: the mouse has four clusters, each located on a different chromosome and covering over 100 kb • HOM genes in Drosophila are found in two clusters, Antennipedia and Bithorax, on the same chromosome • In amphioxus – a class of marine invertebrates which are the closest relatives to the vertebrates – there is a single cluster of at least 10 Hox genes each of which is homologous to a different Hox gene in vertebrates: origin of vertebrates coincided with a series of gene duplications • Example of a dispersed gene family in vertebrates

Gene Duplications (four clusters) Amphioxus Hypothetical Common Ancestor Drosophila lab pb Dfd Scr Antp Ubx AbdA AbdB Gene duplication in the Hox gene family

ETS ITS1 ITS2 NTS 18S 5.8S 28S Tandem arrays • Tandem arrays contain multiple copies of genes with the same function • Good example is the rDNA array: • Large quantities of rRNA required • Genes and spacers co-transcribed and separated by non-transcribed spacer • Variation in size of arrays: • 1 copy in Tetrahymena • 19,300 copies in Amphiuma

Evolution of rDNA arrays • Because they contain both highly conserved (18S) and highly variable (NTS) regions, rDNA sequences have been used frequently in molecular systematics • Despite this, they do not evolve in a simple manner: • Although there is a high degree of sequence similarity within species, there is great divergence between them • Due to unequal crossing-over and gene conversion, concerted evolution can take place which allows genes to evolve together by spreading mutations throughout members • This makes phylogenetic analysis difficult since it is not easy to discern which genes are truly homologous • Often leads to “mosaics” of sequences, each with different phylogenetic history

Class Copy number Organisation Satellite DNA Highly repetitive (>104) Tandemly repeated Mini-/microsatellite Moderately repetitive Tandemly repeated Transposable elements Moderately/highly repetitive Dispersed Non-coding repetitive DNA

Tandemly repeated DNA • Much of the non-coding repetitive DNA in eukaryotes consists of tandem repeats of short sequence motifs: • Satellite DNA is located mainly in the heterochromatin and consists of motifs up to 40 kb in length: • The a-satellite DNA of primates based on a 171 bp motif repeated for hundreds of kilobases • Over 60% of the genome of Drosophila nasutoides is satellite DNA • Minisatellites and microsatellites are comprised of shorter motifs duplicated through unequal crossing over and DNA slippage: • Minisatellites motifs are 11 – 60 bp in length and contain a G-rich “core” sequence • Microsatellites are shorter, generally dinucleotide repeats • Both exhibit extremely high mutation rates and multiple alleles are usually found in populations • Used in population genetics / forensics

Transposable elements • Transposable elements increase copy number by moving around the genome making additional copies: • Around 50% of the maize genome may be transposable elements • 10-20% of the Drosophila genome • Three groups of transposable elements: • Class I (retroelements) transpose through an intermediate RNA stage via reverse transcriptase cf. retroviruses • Class II (DNA elements) transpose directly from DNA to DNA • Little is known about miniature inverted-repeat transposable elements (MITEs): around 100 – 400 bp in length and transpose by as yet unknown means

LTR Reverse transcriptase LTR Retrotransposons Reverse transcriptase Retroposons AAAAAA Transposase Ac-like elements Short repeat e.g. Tourist and Stowaway Terminal repeat Transposable elements Class I transposable elements (retroelements) Class II transposable elements (DNA elements) Miniature inverted-repeat transposable elements (MITEs)

Retroelements • Two subgroups: • Retrotransposons contain long terminal repeats at both ends: example is copia element which is found 20 – 60 times in the genome of D. melanogaster • Retroposons have no LTR and have a poly-A tail: • Long interspersed nuclear elements (LINEs) are 6 – 8 kb in length and present in thousands of copies: the L1 family is present in 590,00 copies in the human genome (17% of total) • Short interspersed nuclear elements (SINEs) do not produce reverse transcriptase and so are not considered true retroelements: they vary in size from 130 – 300 bp and have copy numbers from 50,000 to over 1,000,000 • Originally derived from RNA transcripts • Endogenous retroviruses are proviruses which have been integrated into the germ-line of eukaryotes

Class II (DNA) elements • Possess terminal repeats but unlike retrotransposons these are short (generally < 100 bp) and usually inverted • Encode a special transposase protein • Best known types: • Mariner elements in animals • Hobo and P elements in Drosophila: • P elements can move between species and affect host phenotype • Increased infertility due to chromosome breakage (hybrid dysgenesis) occurs in D. melanogaster.P elements are not found in closely related species (D. simulans, D. sechellia, D. mauritania) but are found in more distantly related species e.g. D. willistoni group: transferred after D. melanogaster split from sibling species • Insertion can have “knock-out” effect on phenotype e.g. white gene in flies lacking red eye pigment

Genome organisation and evolution

Genome organisation and evolution

Presentation Transcript

Genome evolution

Genome evolution

Genome Evolution

PHAR2811: Genome Organisation

Genome evolution

Genome evolution

Genome Evolution

Genome Organisation II

GENOME EVOLUTION

Genome Evolution

Genome evolution

Genome Evolution

Genome evolution

Genome evolution

Genome evolution:

Genome Evolution

Human genome organisation

Genome evolution

Genome evolution