Formation of novel protein-coding genes

Formation of novel protein-coding genes Level 3 Molecular Evolution and Bioinformatics Jim Provan Patthy Chapter 6

De-novo formation of novel protein-coding genes • Creation of simple structural elements such as a-helices, b-sheets and reverse turns seems to be rather trivial: • So many alternative ways of forming these structures • Have been “invented” independently several times • Proteins with repetitive structure are most likely to arise de novo: • Repetitive oligonucleotide sequences can expand, forming periodic protein structures: • Collagen-like • Leucine-rich repeat (LRR) • Probably arose several times during evolution

Evolution of serum antifreeze glycoproteins • Fish that live in polar waters have serum antifreeze glycoproteins (AFGPs) which allow them to tolerate temperatures of as low as –1.9°C • It has been shown that fish from the north and south poles have evolved very similar AFGPs independently: • AFGP of Antarctic fish, made up of a simple tripeptide repeat: evolved by recruitment of the 5’ and 3’ ends of an ancestral trypsinogen gene (secretory signal and 3’ UTR) and de novo amplification of a 9bp Thr-Ala-Ala motif • Arctic cod also have a Thr-Ala-Ala tripeptide repeat-based AFGP but this has no relationship with the trypsinogen gene • Threonines are O-linked to galactosyl-N-acetylgalactosamine and periodicity of repeats matches periodicity of water molecules • Convergent evolution of the tripeptide-based AFGP

De novo creation of complex proteins • Probability of de novo creation of more complex, globular proteins is inversely proportional to complexity: • Those that consist of a single supersecondary structure element (TIM barrel proteins) have higher probability of independent creation • TIM barrel structure likely to have evolved several times • Easier to remodel replicas of old protein folds than to invent them from scratch: • Creation of first folded proteins was probably the rate-limiting step in protein-based life • All extant proteins probably arose from a limited number of ancestral folds through divergence

Evolutionary convergence • Previous examples highlight convergence to similar primary, secondary or tertiary structure (structural convergence) • Unlike structural convergence, functional convergence and mechanistic convergence are relatively common: • Several types of proteinases that have similar function (i.e. they cleave proteins) but have different structures and catalytic mechanisms and have evolved independently • Example of mechanistic convergence is the serine proteases of the subtilisin and trypsin families: • Similar active sites and catalytic mechanisms but no sequence or conformational homology • Catalytic triad residues (His, Asp, Ser) occur in different order in primary structures • Difficult to prove structural convergence

Gene duplications • Evolutionary significance of gene duplication is that it gives rise to a redundant duplication of a gene • Duplicated gene may acquire divergent mutations and eventually emerge as a new gene • Gene duplication is the predominant and most important mechanism by which new genes arise: • Genes derived by a duplication event are said to be paralogous and are found in different loci of the chromosome • Different from orthologous genes gained by speciation events, which are found in different loci of the corresponding species

Types of DNA duplications • An increase in the number of copies of a DNA segment can be brought about by several types of DNA duplication: • Partial, intragenic or internal gene duplication – only an internal segment of a protein-coding gene is duplicated • Complete gene duplication, including flanking regions necessary for expression • Partial chromosome duplication – several adjacent genes are duplicated • Chromosomal duplication (aneuploidy) • Genome duplication (polyploidy)

Mechanisms of gene duplication • Major mechanisms for short intragenic duplications is disengagement of the DNA polymerase from the strand that is being copied and reattachment at the wrong point (slipped strand mispairing) • Major mechanism for larger duplications involves unequal crossing over: • Involves mistaken pairing and recombination between homologous chromosomes • Most likely in already-duplicated regions: • Allows rapid expansion of repeats within genes and expansion of gene families • May facilitate “homogenisation” of gene sequences and thus slow down divergence (concerted evolution)

Anti-Lepore d b d b d b d b d b Lepore Unequal crossing-over

lys lys lysP lysM Gene duplications in lysozyme • In ruminants, lysozyme gene has been duplicated ~10 times and is expressed less in extra-intestinal tissues • In mice, intestinal lysozyme is expressed from lysP gene, whereas in other tissues it is encoded by the lysM gene • Original gene duplication through unequal crossing-over in Alu-like B2 middle repetitive elements

Retrosequences • Copies of protein-coding genes may be produced by duplicative transposition: • DNA is transcribed into RNA, which is reverse-transcribed into a cDNA (retroposition) • During re-insertion, small segments of host DNA (4-12bp) are duplicated, forming direct repeats • Significant diagnostic features of retrosequences: • Lack introns (where parent gene would have introns) • Lack upstream promoter elements of parent gene • Contain poly(A) stretches at 3’ end • Flanked by short, direct repeats • Different chromosomal location from original gene

Functionality of retrosequences • Depending on whether the copied gene is functional or not, we can distinguish processed genes (retrogenes) and processed pseudogenes (retropseudogenes): • Several reasons why functional retrogenes are unlikely: • Process of reverse-transcription is very inaccurate • Lacks necessary regulatory elements • Generally truncated at 5’ end (reverse transcriptase failure) • May be inserted in genomic region unsuitable for expression • More likely to form retropseudogenes • Some examples of processed functional genes have been found e.g. human phosphoglycerate kinase: • X-linked gene has 11 exons and 10 introns • Autosomal PGK gene has no introns and a poly(A) tail

Alu elements • Processed pseudogenes of the RNA gene specifying 7SL RNA which cuts signal sequences of secreted proteins: • About 300 bp long • Around 500,000 copies in the human genome (5-6%) • Named after characteristic AluI restriction site • Derived from functional 7SL sequence by duplication, two deletions and many mutations • Play a key role in genome plasticity since they facilitate unequal crossing-over: • Gene duplication • Exon shuffling

Fate of duplicated genes • Determined by functional consequences of having extra copies of same gene and increased amounts of protein • Duplications can be advantageous, deleterious or neutral: • If an organism is exposed to a toxic environment, there may be an advantage in overproduction of detoxifying enzymes • Disadvantage will result of overproduction of protein upsets regulatory balance • Most duplications are neutral – fate determined by selection and drift • Duplicated gene is unlikely to be fixed unless it acquires a novel and useful function: • May specialise in different subfunctions of ancestral gene • May acquire drastically different functions (hepatocyte growth factor vs. plasminogen)

Formation of gene families • Recently duplicated gene families are generally found in close proximity on the same chromosome • Some multigene families contain invariant repeated genes: • Common when large quantities of protein product are required • Histones have to be synthesised at a high rate during a well-defined, short period of cell division • Some members of multigene families serve the same function but differ in tissue specificity, developmental regulation or biochemical properties e.g. isozymes

Concerted evolution in multigene families • Paralogous members of multigene families are very similar to each other within one species although orthologous members of the same family may differ greatly between even closely related species: • Suggests that mechanisms exist which cause gene families to evolve together as a unit (concerted evolution) • Process of concerted evolution of multigene families under the effects of random genetic drift is known as molecular drive • Gene correction mechanisms may homogenise genes – difficult to trace true evolutionary history of many multigene families

Dating gene duplications • Assuming duplicated genes diverge at a constant rate, we can estimate the date of a gene duplication, TD, that gave rise to two paralogous genes (A and B) if we have sequences of these paralogues from two different species (1 and 2) and we know the time of speciation TS: • If genes evolved at a constant rate then: • Average number of substitutions per site ([KA + KB]/2) in the two orthologue comparisons (A1 vs. A2, B1 vs. B2) is proportional to TS • Average number of substitutions per site KAB in the four paralogous comparisons (A1 vs. B1, A2 vs. B2, A1 vs. B2, A2 vs. B1) is proportional to the time since duplication TD • Thus, the following equation holds: TD/TS = 2KAB/[KA + KB]

Dating gene duplications (continued) • All vertebrates have both myoglobin and haemoglobin • Myoglobin differs from both the a and b subunits of haemoglobin more than they differ from each other: • Myoglobin diverged (TD = 600-800 mya) before the a and b genes arose (TD = 500 mya) • Mammals, reptiles, birds, amphibians and bony fish all have distinct a and b subunits, whereas the most primitive vertebrates, the Agnatha (jawless fish), contain only one type of haemoglobin subunit: • Myoglobin and haemoglobin diverged prior to the separation of agnathans and jawed vertebrates • Duplication giving rise to a and b subunits occurred in the ancestor of all jawed vertebrates following its divergence from agnathans

Evolutionary history and linkage patterns in a- and b-globin clusters • In humans, gene cluster of the a-globin family (c16) consists of four functional genes (z, a1, a2, q1) and three unprocessed pseudogenes (Yz, Ya1, Ya2): • Embryonic type z is most divergent (estimated TD > 300 mya) • q1 is less divergent (estimated TD~ 260 mya) • Genes a1 and a2 produce identical polypeptide and have near-identical nucleotide sequence, suggesting recent divergence • b-globin family (c11) contains five functional genes and Yb: • Adult types (b and d) diverged from non-adult types (Gg, Ag and e) around 155-200 mya • Ancestor of both g genes diverged from e about 100-140 mya • Duplication that formed Gg andAg occurred after separation of human lineage from New World monkeys (35 mya) • Divergence of adult genes (b and d) occurred about 80 mya

b e gG gA d 30 20 Myr 25 40 Myr 20 Distance (kb) 100 Myr 15 10 200 Myr 5 10 kb 50 100 150 200 250 300 Age (Myr) Intergene distance and time since duplication

Formation of novel protein-coding genes

Formation of novel protein-coding genes

Presentation Transcript

Coding Theory and Protein Synthesis

Novel approaches to protein expression analysis

Protein Denaturation and Gel Formation

Identification of Novel Virulence-Associated Genes via Genome Analysis of Hypothetical Genes

Phylogenetic inference on the evolution of protein-coding genes

Phylogenetic inference on the evolution of protein-coding genes

Novel Protein design

FISHING FOR NOVEL CAROTENOID BIOSYNTHESIS RELATED GENES

Functional Non-Coding DNA Part I Non-coding genes and non-coding elements of coding genes

Genes and Protein Synthesis

Genes and Protein Synthesis

Number of substitutions between two protein-coding genes

Protein Synthesis – Turning genes into proteins

Origins and Evolution of Novel Genes

Discovery and Characterization of protein-coding genes in D. melanogaster

Impact of microRNAs on Organization of Protein Interactions and Formation of Protein Complexes

The Number of Genes with Novel Mutations

Searching for Novel Fusion Genes

A Novel Lipid Inhibitor of Protein Phosphatase-1

Transcription of Protein-Coding Genes and Formation of Functional mRNA

Protein formation