Molecular Evolution, Multiple Sequence Alignment & Phylogenetics. Canadian Bioinformatics Workshop Thursday June 21st. David Lynn M.Sc., Ph.D., Postdoctoral Research Associate, Brinkman Lab., Department of Molecular Biology & Biochemistry, Simon Fraser University, Greater Vancouver, B.C.


Presentation Transcript


  1. Molecular Evolution, Multiple Sequence Alignment & Phylogenetics. Canadian Bioinformatics Workshop Thursday June 21st. David Lynn M.Sc., Ph.D., Postdoctoral Research Associate, Brinkman Lab., Department of Molecular Biology & Biochemistry, Simon Fraser University, Greater Vancouver, B.C.

  2. Evidence for Evolution – Fact not Theory • Fossils • Observable – e.g. viral evolution – under HIV drug treatment we can predict which sites will change. It is also why you need a flu vaccine every year! • Overwhelming scientific evidence. • We are about 99% identical to the chimp at the DNA level.

  3. Nothing in biology makes sense except in the light of evolution Dobzhansky, 1973

  4. Why Learn about Evolution: • Tells us where we come from, classification of species, which species are most closely related. • Understand the fundamentals of life. • Practical side: • Foundation of most bioinformatics analyses: • Gene family identification. • Gene discovery – inferring gene function, gene annotation. • Origins of a genetic disease, characterization of polymorphisms.

  5. Jean Baptiste de Lamarck: Besoin (the need or desire for change in phenotype) → change in phenotype → change in genotype → inherited → change in phenotype of offspring.

  6. August Weismann: genotype unaffected by changes in phenotype. Spontaneous and random changes in genes during reproduction → offspring has changed genotype → change in phenotype of offspring. Weismann distinguished somatic and germline mutation.

  7. Part of Darwin’s Theory • The world is not constant, but changing • All organisms are derived from common ancestors by a process of branching. • Classify organisms based on shared traits inherited from common ancestor • Morphological character-based analysis – didn’t know about DNA

  8. For evolution to happen, there must be heredity and variation – descent with modification.

  9. Variation by DNA mutation • Nucleotide substitution • Replication error • Chemical reaction • Insertions or deletions (indels) • single base indels • Unequal crossing over

  10. What happens when a new mutation arises?

  11. Positive Selection • A new allele (mutant) confers some increase in the fitness of the organism • Selection acts to favour this allele • Also called adaptive evolution NOTE: Fitness = ability to survive and reproduce

  12. Advantageous Allele Herbicide resistance gene in nightshade plant

  13. Negative selection • A new allele (mutant) confers some decrease in the fitness of the organism • Selection acts to remove this allele • Also called purifying selection

  14. Deleterious allele: human breast cancer gene, BRCA2 • 5% of breast cancer cases are familial • Mutations in BRCA2 account for 20% of familial cases [Figure: normal (wild type) allele compared with mutant allele (Montreal 440 Family); labels: "4 base pair deletion", "Causes frameshift", "Stop codon".]

  15. Neutral mutations • Neither advantageous nor disadvantageous • Invisible to selection (no selection) • Frequency subject to ‘drift’ in the population • Random drift – random changes in small populations

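  16. Random Genetic Drift vs. Selection [Figure: two panels, "Random Genetic Drift" and "Selection"; y-axis: allele frequency (0 to 100); curves labelled "advantageous" and "disadvantageous".]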
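The contrast between drift and selection on slides 15 and 16 can be explored with a small simulation. The sketch below uses a Wright-Fisher style resampling model, which is an assumption of this example and not something the slides specify; the function name, the `selection` parameter and the population sizes are hypothetical.

```python
import random


def wright_fisher(pop_size, start_freq, generations, selection=0.0, seed=None):
    """Simulate the frequency of one allele over time.

    pop_size   -- number of allele copies resampled each generation (assumed constant)
    start_freq -- initial allele frequency, between 0 and 1
    selection  -- relative fitness advantage of the allele (0.0 = neutral, must be > -1)
    """
    rng = random.Random(seed)
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Selection shifts the expected frequency; with selection=0 it is unchanged.
        expected = freq * (1.0 + selection) / (1.0 + selection * freq)
        # Drift: each copy in the next generation is drawn at random.
        copies = sum(rng.random() < expected for _ in range(pop_size))
        freq = copies / pop_size
        trajectory.append(freq)
    return trajectory


if __name__ == "__main__":
    # Neutral allele in a small population: the frequency wanders at random.
    print(wright_fisher(pop_size=20, start_freq=0.5, generations=30, seed=1))
    # Advantageous allele in a larger population: selection tends to dominate.
    print(wright_fisher(pop_size=2000, start_freq=0.1, generations=30, selection=0.2, seed=1))
```

Running the neutral case with different seeds and a small `pop_size` shows how quickly an allele can be lost or fixed by chance alone.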

  17. Evolutionary models • Neo-Darwinian (Pan-selectionist) – positive selection only. • Mutationist – mutation and random drift. • Neutralist – mutation, random drift, and negative selection.

  18. Neo-Darwinian Model • Mutation is recognised as the origin of variation. • Gene substitution (new allele replacing old) occurs by positive selection only. • Polymorphism (multiple alleles co-existing) caused by balancing selection.

  19. Neutral Theory • Too much polymorphism to be explained by mutation and positive selection alone (Neo-Darwinian model). • Why so much? • Neutral Theory of Molecular Evolution • Motoo Kimura (1924-94), 1968 • Most polymorphism is selectively neutral. • Majority of evolutionary changes caused by random genetic drift of selectively neutral (or almost neutral) alleles. • Still allows for some selection.

  20. What about the rate of evolution?

  21. Molecular Clock Hypothesis • Rate of evolution of DNA is constant over time and across lineages • Resolve history of species • Timing of events • Relationship of species • Early protein studies showed approximately constant rate of evolution • As more data accumulated, it was quickly shown that there is no universal molecular clock. • But: still useful if you compare like with like.

  22. Different Rates within a Gene or Genome • Coding sequences evolve more slowly than non-coding sequences. • Synonymous substitutions are often more common than non-synonymous. • 3rd codon position sites evolve faster than others. • Some sequences are under functional constraint. • Different genes evolve at different rates. • Different regions of genome – higher mutation, higher recombination rates. • Genes in different species evolve at different rates, e.g. • rodents vs primates → generation time hypothesis. • sharks vs mammals → metabolic rate hypothesis.

  23. Two Sequence Alignment

  24. Inferring Function by Homology • The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology searching, the equivalent genes in one species to those known to be important in other model species. • Logic: if the linear alignment of a pair of sequences is similar, then we can infer that the 3-dimensional structure is similar; if the 3-D structure is similar then there is a good chance that the function is similar.

  25. BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST) • BLAST programs (there are several) compare a query sequence to all the sequences in a database in a pairwise manner. • Breaks query and database sequences into fragments known as "words", and seeks matches between them. • Attempts to align query words of length "W" to words in the database such that the alignment scores at least a threshold value, "T"; these word matches are the seeds, or hits. • Hits are then extended in either direction in an attempt to generate an alignment with a score exceeding another threshold, "S"; such alignments are known as High-Scoring Segment Pairs (HSPs), and the best-scoring one as the Maximal-Scoring Segment Pair (MSP).
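The word-matching step described on slide 25 can be illustrated with a deliberately simplified sketch. It only finds exact matches between words of length W; the score threshold T, the extension step and everything else BLAST actually does are left out, and the function name `word_hits` is hypothetical.

```python
from collections import defaultdict


def word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where words of length w match exactly.

    Simplification: real BLAST also accepts similar words scoring at least T
    and then extends each hit; neither step is modelled here.
    """
    # Index every word of the subject sequence by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)

    # Look up every word of the query in that index.
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(word_hits("GARFIELD", "XXGARFXX", w=3))
```

Hits that fall on the same diagonal (same difference between the two positions) are the natural candidates for extension into a longer alignment.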

  26. Two Sequence Alignment To align GARFIELDTHECAT with GARFIELDTHERAT is easy:

          GARFIELDTHECAT
          ||||||||||| ||
          GARFIELDTHERAT

  27. Gaps Sometimes, you can get a better overall alignment if you insert gaps:

          GARFIELDTHECAT
          ||||||||   |||
          GARFIELDA--CAT

      is better (scores higher) than

          GARFIELDTHECAT
          ||||||||
          GARFIELDACAT

  28. No Gap Penalty But there has to be some sort of a gap penalty, otherwise you can align ANY two sequences:

          G-R--E------AT
          | |  |      ||
          GARFIELDTHECAT

  29. Affine Gap Penalty • Could set a score for each indel. • Usually use affine (open + extend). • Open –10, extend -0.05
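The effect of an affine gap penalty on the GARFIELD alignments from slides 27 and 28 can be checked with a short scoring sketch. The gap values (open -10, extend -0.05) are the ones on slide 29; the match and mismatch scores (+5 and -4) are assumptions made for this example, as is the convention that the first gap position costs the open penalty and each further position costs the extend penalty (tools differ on this).

```python
def score_alignment(top, bottom, match=5, mismatch=-4,
                    gap_open=-10, gap_extend=-0.05):
    """Score an already-aligned pair of sequences, with '-' marking gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0.0
    gap_row = None  # which row the current run of gaps is in, if any
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # First position of a gap pays 'open', later positions pay 'extend'.
            score += gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


if __name__ == "__main__":
    # Slide 27: gapped alignment versus the ungapped overlap.
    print(score_alignment("GARFIELDTHECAT", "GARFIELDA--CAT"))
    print(score_alignment("GARFIELDTHEC", "GARFIELDACAT"))
    # Slide 28: free gaps make a poor alignment look good; penalties fix that.
    print(score_alignment("G-R--E------AT", "GARFIELDTHECAT", gap_open=0, gap_extend=0))
    print(score_alignment("G-R--E------AT", "GARFIELDTHECAT"))
```

Under these assumed scores the gapped alignment of slide 27 beats the ungapped one, while the heavily gapped alignment of slide 28 drops below zero once gaps are penalised.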

  30. 2+ Similar Sequences • When doing a similarity search against a database you are trying to decide which of many sequences is the CLOSEST match to your search sequence. • Which of the following alignment pairs is better?:

  31. Scoring Alignments

          GARFIELDTHECAT
          ||||   |||||||
          GARFRIEDTHECAT

          GARFIELDTHECAT
          ||| |||  |||||
          GARWIELESHECAT

          GARFIELDTHECAT
          ||  ||||||| ||
          GAVGIELDTHEMAT
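A quick way to see why the question on slide 30 is not trivial is to count identical positions in the three alignments of slide 31. The sketch below does only that; the function name is hypothetical.

```python
def count_identities(a, b):
    """Count columns where two equal-length, ungapped sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))


if __name__ == "__main__":
    query = "GARFIELDTHECAT"
    for candidate in ("GARFRIEDTHECAT", "GARWIELESHECAT", "GAVGIELDTHEMAT"):
        print(candidate, count_identities(query, candidate))
```

All three candidates share the same number of identical positions with the query, so identity alone cannot rank them; the kind of residue that was substituted has to be taken into account.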

  32. Willie Taylor’s AA Venn Diagram

  33. Substitution Matrices

          # BLOSUM 90
              A   R   N   D   C   Q   E   G   H   I   L
          A   5  -2  -2  -3  -1  -1  -1   0  -2  -2  -2
          R  -2   6  -1  -3  -5   1  -1  -3   0  -4  -3
          N  -2  -1   7   1  -4   0  -1  -1   0  -4  -4
          D  -3  -3   1   7  -5  -1   1  -2  -2  -5  -5
          C  -1  -5  -4  -5   9  -4  -6  -4  -5  -2  -2
          Q  -1   1   0  -1  -4   7   2  -3   1  -4  -3
          E  -1  -1  -1   1  -6   2   6  -3  -1  -4  -4
          G   0  -3  -1  -2  -4  -3  -3   6  -3  -5  -5
          H  -2   0   0  -2  -5   1  -1  -3   8  -4  -4
          I  -2  -4  -4  -5  -2  -4  -4  -5  -4   5   1
          L  -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5
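To show how a substitution matrix turns residue pairs into an alignment score, the sketch below parses the 11-residue excerpt of BLOSUM 90 shown on slide 33 and sums the scores column by column. Only the residues in that excerpt are covered, gaps are not handled, and the peptide strings are made-up examples.

```python
# Excerpt of BLOSUM 90 as shown on slide 33 (11 of the 20 amino acids).
BLOSUM90_EXCERPT = """
    A   R   N   D   C   Q   E   G   H   I   L
A   5  -2  -2  -3  -1  -1  -1   0  -2  -2  -2
R  -2   6  -1  -3  -5   1  -1  -3   0  -4  -3
N  -2  -1   7   1  -4   0  -1  -1   0  -4  -4
D  -3  -3   1   7  -5  -1   1  -2  -2  -5  -5
C  -1  -5  -4  -5   9  -4  -6  -4  -5  -2  -2
Q  -1   1   0  -1  -4   7   2  -3   1  -4  -3
E  -1  -1  -1   1  -6   2   6  -3  -1  -4  -4
G   0  -3  -1  -2  -4  -3  -3   6  -3  -5  -5
H  -2   0   0  -2  -5   1  -1  -3   8  -4  -4
I  -2  -4  -4  -5  -2  -4  -4  -5  -4   5   1
L  -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5
"""


def parse_matrix(text):
    """Turn the whitespace-separated table into a {(row, col): score} dict."""
    rows = [line.split() for line in text.strip().splitlines()]
    columns = rows[0]
    return {(row[0], col): int(value)
            for row in rows[1:]
            for col, value in zip(columns, row[1:])}


def score_ungapped(a, b, matrix):
    """Sum substitution scores over the columns of an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(x, y)] for x, y in zip(a, b))


if __name__ == "__main__":
    m = parse_matrix(BLOSUM90_EXCERPT)
    print(score_ungapped("DEIL", "DEIL", m))  # identical residues
    print(score_ungapped("DEIL", "EDLI", m))  # chemically similar swaps
    print(score_ungapped("DEIL", "CCGG", m))  # dissimilar swaps
```

Swapping residues for chemically similar ones (D/E, I/L) still scores positively, while dissimilar substitutions are penalised heavily.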

  34. Low Complexity Masking • Some sequences are similar even if they have no recent common ancestor. • Huntington's disease is caused by poly CAG tracts in the DNA, which result in polyGlutamine (Gln, Q) tracts in the protein. • If you do a homology search with QQQQQQQQQQ you get hits to other proteins that have a lot of glutamines but have totally different functions.

  35. Low Complexity Masking Huntingtin:

          MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA

      hits

          >MM16_MOUSE MATRIX METALLOPROTEINASE-16
          Score = 34.4 bits (78), Expect = 0.18
          Identities = 21/65 (32%), Positives = 25/65 (38%), Gaps = 2/65 (3%)

          FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP
          F Q + + Q Q+ PP PPP LP PP P P+ P PP
          FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP

      But not because it is involved in microtubule-mediated transport!
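The idea of masking can be sketched very simply: replace long runs of a single residue with X before searching, so that they cannot produce spurious hits. This is a toy stand-in and not how real low-complexity filters work; the function name and the run-length cutoff are assumptions for illustration.

```python
import re


def mask_homopolymer_runs(seq, min_run=5, mask_char="X"):
    """Replace every run of at least min_run identical residues with mask_char."""
    # (.) captures one residue; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


if __name__ == "__main__":
    # Start of huntingtin as shown on slide 35, with the spaces removed.
    huntingtin = ("MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQQQ"
                  "PPPPPPPPPPPQLPQPPPQA")
    print(mask_homopolymer_runs(huntingtin))
```

The masked sequence keeps its length, so alignment coordinates reported by a search still refer to the original protein.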

  36. E values • An E-value is the number of hits with an equal or better score that you would expect to find by chance in a search of this size; for small values it is close to the probability of the hit occurring by chance. • Dependent on the size of the query sequence and the database. • The lower the E-value, the more confidence you can have that a hit is a true homologue (sequence related by common descent).
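The dependence of the E-value on query and database size can be made concrete with the standard relation between a bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m and n are the query and database lengths. The sketch below uses raw lengths, whereas BLAST itself uses corrected "effective" lengths, so treat the numbers as illustrative; the function names and the lengths in the example are hypothetical.

```python
import math


def evalue_from_bit_score(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def pvalue_from_evalue(evalue):
    """Probability of at least one such chance hit (Poisson assumption)."""
    return 1.0 - math.exp(-evalue)


if __name__ == "__main__":
    # The same bit score becomes less significant as the database grows.
    for db_len in (1e6, 1e8, 1e10):
        e = evalue_from_bit_score(34.4, query_len=60, db_len=db_len)
        print(db_len, e, pvalue_from_evalue(e))
```

For small E the probability and the E-value are nearly equal, which is why the two are often used interchangeably when judging strong hits.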

  37. Dotplot theory Another way of comparing 2 sequences. Task: align ATGATATTCTT and ATTGTTC

            A T G A T A T T C T T
          A . . . . . . . . . . .
          T . . . . . . . . . . .
          T . . . . . . . . . . .
          G . . . . . . . . . . .
          T . . . . . . . . . . .
          T . . . . . . . . . . .
          C . . . . . . . . . . .

  38. Go along the first seq inserting a + wherever 2/3 bases in a moving window match. The first seq is compared to ATT (the first 3 bases in the vertical sequence).

            A T G A T A T T C T T
          A . . . . . . . . . . .
          T . + . . + . + . . + .
          T . . . . . . . . . . .
          G . . . . . . . . . . .
          T . . . . . . . . . . .
          T . . . . . . . . . . .
          C . . . . . . . . . . .

  39. Then go along the first seq inserting a + wherever 2/3 bases in a moving window match. The first seq is compared to TTG (the next 3 in the vertical sequence).

            A T G A T A T T C T T
          A . . . . . . . . . . .
          T . + . . + . + . . + .
          T . + . . . . . + . . .
          G . . . . . . . . . . .
          T . . . . . . . . . . .
          T . . . . . . . . . . .
          C . . . . . . . . . . .

  40. Iterate until

            A T G A T A T T C T T
          A . . . . . . . . . . .
          T . + . . + . + . . + .
          T . + . . . . . + . . .
          G . . + . . . . . + . .
          T . . . + . . . . . + .
          T . . . . . . . + . . .
          C . . . . . . . . . . .

  41. The finished dotplot:

            A T G A T A T T C T T
          A
          T   +     +   +     +
          T   +           +
          G     +           +
          T       +           +
          T               +
          C

      The human eye is particularly good at picking up structure from the pattern of dots. You might see a hint of a duplicated region in the horizontal sequence that is not so clear from the sequence itself.
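The moving-window procedure of slides 37 to 41 can be written as a short function. This is a minimal sketch that assumes each + is placed at the centre of the two 3-base windows being compared, which matches the rows drawn for ATT and TTG; the function name and parameters are hypothetical. Applied strictly, the 2-of-3 rule also marks two cells that are not drawn on the slides (window TAT against TGT, and ATT against GTT).

```python
def dotplot(horizontal, vertical, window=3, min_matches=2):
    """Return the dotplot as a list of strings, one per base of `vertical`."""
    half = window // 2
    grid = [["." for _ in horizontal] for _ in vertical]
    for i in range(len(vertical) - window + 1):
        v_word = vertical[i:i + window]
        for j in range(len(horizontal) - window + 1):
            h_word = horizontal[j:j + window]
            matches = sum(a == b for a, b in zip(v_word, h_word))
            if matches >= min_matches:
                # Mark the cell at the centre of both windows.
                grid[i + half][j + half] = "+"
    return [" ".join(row) for row in grid]


if __name__ == "__main__":
    horizontal, vertical = "ATGATATTCTT", "ATTGTTC"
    print("  " + " ".join(horizontal))
    for base, row in zip(vertical, dotplot(horizontal, vertical)):
        print(base, row)
```

Raising `window` and `min_matches` reduces background noise on longer sequences, at the cost of missing short or weakly conserved matches.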

  42. Multiple Sequence Alignments

  43. Why Do MSAs? • Although BLAST may give you a good E-value, an MSA is more convincing evidence that a protein is related and can be aligned over its entire length. • Identification of conserved regions or domains in proteins. • Regions that are evolutionarily conserved are likely to be important for structure/function. • Mutations in these areas are more likely to affect function. • Identification of conserved residues in proteins. • Prerequisite for doing phylogenetic trees.

  44. Identification of Conserved Domains:

  45. Human β-defensins

  46. Computing MSAs • Problem: Once you attempt to align more than a few sequences, MSA quickly becomes computationally intensive and eventually intractable. • Solution: Clustal – invented in Kennedy’s pub, Trinity College Dublin. • Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680. • Download Clustalx: ftp://ftp-igbmc.ustrasbg.fr/pub/ClustalX/clustalx1.81.msw.zip • Adding evolutionary theory to multiple sequence alignment.
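A rough calculation shows why aligning many sequences at once becomes intractable. The sketch assumes that an exact method extends pairwise dynamic programming to one dimension per sequence, so the table has (L + 1)^N cells for N sequences of length L, and compares that with the N(N - 1)/2 pairwise comparisons a progressive method starts from. Both function names are hypothetical and the figures are order-of-magnitude illustrations only.

```python
def exact_dp_cells(n_seqs, length):
    """Cells in an n-dimensional dynamic programming table (assumed exact method)."""
    return (length + 1) ** n_seqs


def pairwise_comparisons(n_seqs):
    """Number of distinct sequence pairs among n_seqs sequences."""
    return n_seqs * (n_seqs - 1) // 2


if __name__ == "__main__":
    length = 300  # a typical protein length, chosen for illustration
    for n in (2, 3, 5, 10):
        print(n, exact_dp_cells(n, length), pairwise_comparisons(n))
```

The table size grows exponentially with the number of sequences, whereas the number of pairs grows only quadratically, which is what makes a progressive heuristic practical.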

  47. How MSAs are computed

  48. You still may have to do some hand-editing!!
