
Orthology predictions for whole mammalian genomes

Orthology predictions for whole mammalian genomes. Leo Goodstadt MRC Functional Genomics Unit Oxford University. Finishing. “Evolution of Orthologues” Selection pressures in orthologues and paralogs. “Gene Duplications” Reproduction, immunity or chemosensation.


Presentation Transcript


  1. Orthology predictions for whole mammalian genomes Leo Goodstadt MRC Functional Genomics Unit Oxford University

  2. Finishing • “Evolution of Orthologues”: selection pressures in orthologues and paralogues • “Gene Duplications”: reproduction, immunity or chemosensation • “Gene birth in the human lineage”: ongoing duplications underlie polymorphism • “Synonymous substitution rates”: mutation and selection vary by chromosome size

  3. Orthology is the key

  4. How it started • We are “consumers” of orthology / paralogy • Started off using Ensembl predictions • Ensembl 1:1 covered 50% of predicted mouse genes; Ewan’s manual survey said 80%

  5. 1) General observations for all mammalian genomes Paralogues evolve fast (and are fun!)

  6. 2) Observations for whole clades of species [Figure: dN/dS (y-axis, 0.00 to 0.14) by species (x-axis) for Drosophila, nematodes and amniotes; legend entry: lineage specific. Species shown: dvir, cbri, dgri, cele, ggal, dere, dmel, dsim, dmoj, dyak, dpse, hsap, dana, cfam, oana, crem, c2801, mmus, mdom]

  7. 3) Inparalogues define lineage specific biology. Marsupial / Monodelphis biology revealed by lineage specific genes • Chemosensation (OR, V1R and V2R) • Reproduction (vomeronasal receptors, lipocalins, beta-microseminoprotein (12:1)) • Immunity (IG chains, butyrophilins, leukocyte IG-like receptors, T-cell receptor chains and carcinoembryonic antigen-related cell adhesion molecules); pancreatic RNAses • Detoxification (hypoxanthine phosphoribosyltransferase homologues; nitrogen poor diets) • KRAB zinc fingers

  8. 4) Interesting stories in the aggregate

  9. 5) Treasure trove in the details. Ongoing mouse inparalogue analysis: lots and lots of reproductive genes
     clade: #2 (ortholog_id = 17117 in panda)
     159 mus genes
     47 genes new to assembly 36
     10 genes completely new to assembly 36
     Interpro matches for this clade:
     !!! Expansion mainly on chr5 and 14, although single (pseudogene?) versions on chr13 and chr16.
     !!! Mouse DLG5 is: chr14:22,966,420-22,978,653 (expressed in testis: AK147699)
     gene identifier order chrm exons stop length
     -------------------- ----- ---- ----- ---- ------
     MUS_GENE_21705 6639 5 spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 ; ENSMUSP00000086007 4 182
     MUS_GENE_22420 6643 5 predicted gene, EG623898 ; ENSMUSP00000099126 2 72 <
     MUS_GENE_19599 6646 5 spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 (Speer1-ps1) on chromosome 5 ; NCBIMUSP_83776567 4 157 <
     MUS_GENE_23688 6651 5 predicted gene, EG623898 ; ENSMUSP00000094421 2 72
     MUS_GENE_19774 6657 5 spermatogenesis associated glutamate (E)-rich protein 3 ;

  10. 6) Candidates for evolutionary and functional analyses Secretoglobin Protein Family members: Androgen-binding proteins. Emes et al. (2004) Genome Res. 14(8):1516-29

  11. Available Genomes And Divergences. Hedges, SB. Nature Reviews Genetics 3, 838-849 (2002)

  12. How do we find function in the genome? • Nothing in Biology Makes Sense Except in the Light of Evolution. Theodosius Dobzhansky (1900-1975).

  13. How to find the function in the genome? Similar Sequences (Genes / Genome regions) Common Ancestry (homology) Similar Structures / Folds Similar Functions ?

  14. How much of the genome is functional? Compare with the mouse • Ancestral Repetitive (AR) sequence is non-functional and has evenly distributed conservation scores (red): a symmetrical bell shape due to biological variation • Whole genome sequence contains some functional sequence under purifying selection and thus has a small excess of conserved sequence (asymmetrical) • Functional sequence = Whole Genome - Ancestral Repetitive = 5% • N.B. This is an estimate that doesn’t take into account sequence that is turning over rapidly (not shared by mouse/human) or is under positive (diversifying) selection

  15. The human genome (euchromatic sequence) Conserved non-coding (3.5% ?) Protein coding: 1.2% UTR: 0.3% Neutral Repeats (Transposable elements, …) ~45% Unknown (old repetitive junk?)

  16. Conserved non-coding material • Transcription factor binding sites • Enhancers, insulators and other non-transcribed regulatory elements • Alternative splicing signals • Transfer RNAs, ribosomal RNAs • Small RNAs (e.g. snoRNAs, microRNAs, siRNAs and piRNAs): regulatory / gene silencing / RNA degradation • MacroRNAs (e.g. Xist): enzymatic? / chromosome inactivation

  17. Functional parts of genes are highly conserved

  18. How many protein coding genes? • Walter Gilbert [1980s] 100k • Antequera & Bird [1993] 70-80k • John Quackenbush et al. (TIGR) [2000] 120k • Ewing & Green [2000] 30k • Tetraodon analysis [2001] 35k • Human Genome Project (public) [2001] ~31k • Human Genome Project (Celera) [2001] 24-40k • Mouse Genome Project (public) [2002] 25-30k • Lee Rowen [2003] 25,947 • Human Genome Project (finishing) [2004] 20-25k • Current predictions [2008] 19-20k

  19. Traditional Genome Orthology Reciprocal BLAST best hits between longest transcript of each gene (+ synteny) Assumes: • Protein similarity is proportional to evolutionary distance (selection is invariant!) • Pairwise relationships adequately represent the evolutionary tree • No gene losses or missing predictions • Alternative splicing can be ignored! • No gene translocations after tandem duplication

  20. Orthology prediction methods • Two genomes: reciprocal best BLAST hit • Multiple genomes: clustering of reciprocal best hits or of protein similarities [Figure labels: Query, Blast hits]

  21. Reciprocal Blast Best Hits Advantages: • Fast, Well understood • Works well for distant lineages • Can correlate with protein structure (domains) Disadvantages: • Only provides 1:1 orthologues in the best case • Can be difficult to reconcile with the species tree
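
The reciprocal best hit idea on the slides above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the talk: the gene names and bit scores are made up, and the dictionaries stand in for parsed BLAST output in each direction. It shows why the method yields at most 1:1 pairs: the hypothetical paralogue `h3` has `m2` as its best hit, but `m2` prefers `h2`, so `h3` is left without an orthologue.

```python
def best_hits(scores):
    """Map each query gene to its single highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical bit scores, keyed by (query, subject); not real data.
human_vs_mouse = {
    ("h1", "m1"): 200, ("h1", "m2"): 150,
    ("h2", "m2"): 180,
    ("h3", "m2"): 170,  # h3 is a paralogue of h2 in this toy example
}
mouse_vs_human = {
    ("m1", "h1"): 198,
    ("m2", "h2"): 181, ("m2", "h3"): 169,
    ("m3", "h3"): 90,
}

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

Ties are broken here by whichever hit is seen first, which is an arbitrary choice made for the sketch.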

  22. Genes on chromosome of species 1 Genes on chromosome of species 2

  23. ? Reciprocal Blast Best Hits

  24. ? Reciprocal Blast Best Hits

  25. How to add duplicated genes? Synteny (Ensembl compara in the past) • Local gene order tends to be conserved in mammalian lineages • Look for inparalogs locally even if the protein distances don’t add up (sequence error, sampling error etc.)

  26. ? Blast Best Hits in Local Regions

  27. ? Blast Best Hits in Local Regions

  28. Problems with relying only on synteny. Local homologs are often not inparalogs: • Local rearrangements • Missing predictions (neighbouring orphans) • Need sanity checking

  29. Human and Mouse chromosomes: • Extensive rearrangements only over larger regions • Conservation of gene order in the short range

  30. Olfactory orthology from compara: mouse chromosome 2 vs. rat chromosome 3 [Figure legend: one to one, one to many, many to one, many to many]

  31. Olfactory orthology: mouse chromosome 2 vs. rat chromosome 3 [Figure legend: one to one, one to many, many to one, many to many]

  32. Inparanoid • Remm,M., Storm,C.E. and Sonnhammer,E.L.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041–1052. • Avoids multiple alignments and phylogenetic methods for speed and to avoid errors • Heuristics are implicitly phylogenetic

  33. How Inparanoid works • 1. Longest transcripts: pairwise alignment scores, use cutoff • 2. Reciprocal best hits are orthologues • 3. Add lineage specific duplicates (inparalogs) with confidences • 4. Resolve conflicts • 5. Orthology

  34. Identify “main” orthologues, then identify “inparalog” candidates • 1. Longest transcripts: pairwise alignment scores, use cutoff • 2. Reciprocal best hits are orthologues • 3. Add lineage specific duplicates (inparalogs) with confidences • 4. Resolve conflicts • 5. Orthology

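  35. Confidence values for inparalogs [Figure labels: A, B] • The most confident inparalog is one that is sequence identical to the main orthologue • Maximum value $= \mathrm{score}_{\mathrm{identical}} - \mathrm{score}_{\mathrm{orthologs}}$ • $\mathrm{Confidence} = (\mathrm{score}_{\mathrm{inparalog}} - \mathrm{score}_{\mathrm{orthologs}}) / (\mathrm{score}_{\mathrm{identical}} - \mathrm{score}_{\mathrm{orthologs}})$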
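
As a worked example of the confidence formula on this slide, here is a small Python sketch. The function name, argument names and scores are hypothetical. It assumes the three scores are the inparalog versus main orthologue score, the score between the two main orthologues, and the main orthologue's self (identical) score, which is one reading of the slide rather than a statement about the Inparanoid source code.

```python
def inparalog_confidence(score_inparalog, score_orthologs, score_identical):
    """Confidence = (inparalog - orthologs) / (identical - orthologs).

    Gives 1.0 when the inparalog is sequence identical to the main
    orthologue, and 0.0 when it is only as similar as the orthologue
    in the other species.
    """
    span = score_identical - score_orthologs  # the "maximum value" on the slide
    if span <= 0:
        raise ValueError("identical score must exceed the orthologue score")
    return (score_inparalog - score_orthologs) / span


# Hypothetical scores: self-score 500, main orthologues 300, inparalog 450.
confidence = inparalog_confidence(450, 300, 500)
print(confidence)
```

With these made-up numbers the inparalog sits three quarters of the way between the orthologue score and the identical score.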

  36. Resolving conflicts • Merge if orthologs already clustered in same group • Merge if two equally good best hits • Delete weaker group • Merge significantly overlapping • Divide overlapping [Pipeline: 1. Longest transcripts: pairwise alignment scores, use cutoff; 2. Reciprocal best hits are orthologues; 3. Add inparalogs with confidences; 4. Resolve conflicts; 5. Orthology]

  37. Why are there conflicts? • Protein differences are a proxy for evolutionary time • Protein similarity scores approximate protein differences (sequence, alignment, estimation errors) • Pairwise scores can be used to (conceptually) recover phylogenetic (tree) data

  38. Alternatives: phylogenetic methods • Inparanoid is great because it models phylogeny implicitly • Why not use phylogenetic methods directly? • Multiple estimators of protein distance: 4 pairwise scores used out of 30

  39. Phylogenetic methods • Iterative distance methods are very fast, suitable for whole genome analyses (variants on neighbor joining) • Statistically consistent with evolutionary models (can have explicit error model with evolutionary distances, e.g. bionj) • Inparanoid type consistency checking can be carried out after phylogeny is predicted

  40. Is protein similarity a good proxy for evolutionary distance? Advantages • Does not saturate over long evolutionary distances • Easy to align / predict genes (unlike non-coding regions) • Sometimes cDNA sequence is not available Disadvantages • Assumes constant evolutionary rate • Assumes invariant selection

  41. Use Silent Mutations as a genetic clock • Redundant genetic code, e.g. GCA, GCC, GCG, GCT → Alanine • Third base of a codon “wobbles” without changing the translated amino acid • dS approximates neutral mutation rate (without selection) in coding regions
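
The redundancy of the genetic code mentioned on this slide can be checked with a short Python sketch. It builds the standard genetic code (NCBI translation table 1) from its usual compact string form and lists which third-base changes leave the amino acid unchanged. The function name is made up for illustration, and this only enumerates silent third-position changes; it does not estimate dS.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (first base varies slowest); "*" is stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def silent_third_base_variants(codon):
    """Codons sharing the first two bases and the encoded amino acid."""
    amino_acid = CODON_TABLE[codon]
    return [
        codon[:2] + base
        for base in BASES
        if CODON_TABLE[codon[:2] + base] == amino_acid
    ]


alanine = silent_third_base_variants("GCA")
tryptophan = silent_third_base_variants("TGG")
print(alanine, tryptophan)
```

Alanine codons are fourfold degenerate at the third position, so any third-base substitution in `GCA` is silent, whereas tryptophan (`TGG`) has no silent third-base change.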

  42. dS as proxy for evolutionary distance • Easier to align than Ancestral Repeats • Not neutral sequence!! • Genomic > 2x variation in dS • Assumes most gene families are local due to tandem duplication and share dS • Assume (partial) gene conversions are infrequent

  43. dS Caveats • Saturates at long evolutionary distances (but less so than many think) • Beware of GC / codon frequency biases (use ML rather than heuristic methods) • Multiple alignment / tree rather than pairwise for best results • Slow to estimate accurately • Missing values (where dS saturates)

  44. codeml dS accuracy at 400 codons

  45. yn00 dS accuracy at 400 codons

  46. Use all transcripts

  47. PhyOP: transcript trees from dS • Whole genome alignment identifies homologues • codeml for dS calculation • Ignore large dS • Hierarchical cluster • Fitch Margoliash modified to handle missing values to give giant transcript tree • Heuristics based on lowest dS to select 1 “representative” transcript per gene • Map Gene tree to species tree

  48. Fitch Margoliash • Minimize $\sum_{i<j} (d_{ij} - p_{ij})^2 / d_{ij}^2$, where $d_{ij}$ is the pairwise distance estimate and $p_{ij}$ is the distance between i and j on the tree • Assumes that the error is a fixed proportion of the total distance (Fitch and Margoliash, 1967) • Easily adapted for missing values
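
To illustrate why the Fitch Margoliash criterion adapts easily to missing values, here is a minimal Python sketch of the weighted least squares score for a fixed tree. It assumes the standard 1967 weighting of $1/d_{ij}^2$ and simply skips pairs with no distance estimate (for example, where dS is saturated). The function name and the toy distances are hypothetical, and this scores a given set of tree distances only; it does not search over trees or branch lengths, and is not the PhyOP implementation.

```python
import math


def fitch_margoliash_score(observed, tree_distances):
    """Sum of (d_ij - p_ij)^2 / d_ij^2 over pairs with an observed distance.

    observed: {(i, j): d_ij}, with None or NaN marking a missing estimate.
    tree_distances: {(i, j): p_ij}, path lengths between i and j on the tree.
    Observed distances are assumed to be strictly positive.
    """
    total = 0.0
    for pair, d in observed.items():
        if d is None or math.isnan(d):
            continue  # a missing value contributes nothing to the score
        p = tree_distances[pair]
        total += (d - p) ** 2 / d ** 2
    return total


# Toy example: the b-c estimate is missing (e.g. saturated dS).
observed = {("a", "b"): 0.10, ("a", "c"): 0.20, ("b", "c"): None}
tree_distances = {("a", "b"): 0.11, ("a", "c"): 0.18, ("b", "c"): 0.25}

score = fitch_margoliash_score(observed, tree_distances)
print(round(score, 6))
```

Because each term is divided by the squared observed distance, a 10% error on a short distance counts the same as a 10% error on a long one, which matches the fixed proportional error assumption on the slide.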

  49. PhyOP pipeline Part 1

  50. 3 ways in which transcript trees map to genes • Simple clades: only 1 transcript per gene in orthologous relationship (most genes) • Unambiguous clades: alternative transcripts are in the same orthologous relationships
