Molecular Evolution, Multiple Sequence Alignment & Phylogenetics. Canadian Bioinformatics Workshop, Thursday June 21st. David Lynn M.Sc., Ph.D., Postdoctoral Research Associate, Brinkman Lab., Department of Molecular Biology & Biochemistry, Simon Fraser University, Greater Vancouver, B.C.
Evidence for Evolution – Fact not Theory • Fossils • Observable, e.g. viral evolution: under HIV drug treatment we can predict which sites will change. It is also why you need a flu vaccine every year! • Overwhelming scientific evidence. • We are about 99% identical to the chimpanzee at the DNA level.
Nothing in biology makes sense except in the light of evolution Dobzhansky, 1973
Why Learn about Evolution: • Tells us where we come from, classification of species, which species are most closely related. • Understand the fundamentals of life. • Practical side: • Foundation of most bioinformatics analyses: • Gene family identification. • Gene discovery – inferring gene function, gene annotation. • Origins of a genetic disease, characterization of polymorphisms.
Jean Baptiste de Lamarck: Besoin (the need or desire for change in phenotype) → Change in phenotype → Change in genotype → Inherited → Change in phenotype of offspring
August Weismann: Genotype unaffected by changes in phenotype. Spontaneous and random changes in genes during reproduction → Offspring has changed genotype → Change in phenotype of offspring. Weismann distinguished somatic and germline mutation.
Part of Darwin’s Theory • The world is not constant, but changing • All organisms are derived from common ancestors by a process of branching. • Classify organisms based on shared traits inherited from common ancestor • Morphological character-based analysis – didn’t know about DNA
For evolution to happen, there must be heredity and variation: descent with modification.
Variation by DNA mutation • Nucleotide substitution • Replication error • Chemical reaction • Insertions or deletions (indels) • single base indels • Unequal crossing over
Positive Selection • A new allele (mutant) confers some increase in the fitness of the organism • Selection acts to favour this allele • Also called adaptive evolution NOTE: Fitness = ability to survive and reproduce
Advantageous allele: herbicide resistance gene in the nightshade plant
Negative selection • A new allele (mutant) confers some decrease in the fitness of the organism • Selection acts to remove this allele • Also called purifying selection
Deleterious allele: human breast cancer gene, BRCA2 • 5% of breast cancer cases are familial • Mutations in BRCA2 account for 20% of familial cases • [Figure: normal (wild type) allele vs mutant allele (Montreal 440 Family); a 4 base pair deletion causes a frameshift and a stop codon]
Neutral mutations • Neither advantageous nor disadvantageous • Invisible to selection (no selection) • Frequency subject to ‘drift’ in the population • Random drift – random changes in small populations
Random Genetic Drift vs Selection [Figure: allele frequency (0 to 100) under random genetic drift and under selection, for advantageous and disadvantageous alleles]
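Drift is easiest to see in a simulation. The sketch below assumes a simple Wright-Fisher style model (an assumption, not named on the slides): each generation, every gene copy is drawn at random from the previous generation, with no selection. The function name and parameters are illustrative only.

```python
import random

def wright_fisher(pop_size, start_freq, generations, rng):
    """Follow the frequency of one neutral allele in a diploid population."""
    copies = 2 * pop_size                      # number of gene copies
    count = round(start_freq * copies)         # copies carrying the allele
    freqs = [count / copies]
    for _ in range(generations):
        p = count / copies
        # Each copy in the next generation inherits the allele with probability p.
        count = sum(rng.random() < p for _ in range(copies))
        freqs.append(count / copies)
    return freqs

if __name__ == "__main__":
    rng = random.Random(1)                     # fixed seed so runs are repeatable
    for n in (10, 1000):
        traj = wright_fisher(n, 0.5, 100, rng)
        print(f"N={n}: final frequency {traj[-1]:.3f}")
```

Small populations tend to wander to 0 (loss) or 1 (fixation) quickly, while large populations stay near the starting frequency for much longer.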
Evolutionary models • Neo-Darwinian (Pan-selectionist) – positive selection only. • Mutationist – mutation and random drift. • Neutralist – mutation, random drift, and negative selection.
Neo-Darwinian Model • Mutation is recognised as the origin of variation. • Gene substitution (new allele replacing old) occurs by positive selection only. • Polymorphism (multiple alleles co-existing) caused by balancing selection.
Neutral Theory • Too much polymorphism to be explained by mutation and positive selection alone (NeoDarwinian model). • Why so much? • Neutral Theory of Molecular Evolution • Motoo Kimura, 1968 • Most polymorphism is selectively neutral. • Majority of evolutionary changes caused by random genetic drift of selectively neutral (or almost neutral) alleles. • Still allows for some selection. Motoo Kimura (1924-94)
Molecular Clock Hypothesis • Rate of evolution of DNA is constant over time and across lineages • Resolve history of species • Timing of events • Relationship of species • Early protein studies showed an approximately constant rate of evolution • As more data accumulated, it was quickly shown that there is no universal molecular clock. • But: still useful if you compare like with like.
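If a clock is assumed to hold for the sequences being compared, timing an event is simple arithmetic: two lineages that split T years ago have each accumulated changes, so the distance between them is K = 2rT and T = K / (2r). A minimal sketch; the function name and the numbers are hypothetical, chosen only to illustrate the calculation.

```python
def divergence_time(distance, rate):
    """Years since two lineages split, assuming a constant clock.

    distance: substitutions per site between the two sequences (K)
    rate:     substitutions per site per year along one lineage (r)
    """
    return distance / (2.0 * rate)

if __name__ == "__main__":
    # Hypothetical values: K = 0.02, r = 1e-9 per site per year.
    print(f"{divergence_time(0.02, 1e-9):.3g} years")
```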
Different Rates within a Gene or Genome • Coding sequences evolve more slowly than non-coding sequences. • Synonymous substitutions are often more common than non-synonymous. • 3rd codon position sites evolve faster than others. • Some sequences are under functional constraint. • Different genes evolve at different rates. • Different regions of genome: higher mutation, higher recombination rates. • Genes in different species evolve at different rates, e.g. • rodents vs primates: generation time hypothesis. • sharks vs mammals: metabolic rate hypothesis.
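Why the 3rd codon position is freer to change can be checked against the standard genetic code. The sketch below counts, for each codon position, what fraction of single-base changes in sense codons leave the amino acid unchanged (synonymous). The compact table layout (bases in TCAG order) and the function name are choices made for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_fraction_by_position():
    """Fraction of single-base changes that are synonymous, per codon position."""
    fractions = []
    for pos in range(3):
        synonymous = total = 0
        for codon, aa in CODON_TABLE.items():
            if aa == "*":
                continue                       # only mutate sense codons
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                synonymous += CODON_TABLE[mutant] == aa
        fractions.append(synonymous / total)
    return fractions

if __name__ == "__main__":
    for pos, frac in enumerate(synonymous_fraction_by_position(), start=1):
        print(f"codon position {pos}: {frac:.1%} of changes are synonymous")
```

Most changes at position 3 are silent, so purifying selection removes far fewer of them.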
Inferring Function by Homology • The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology searching, the equivalent genes in one species to those known to be important in other model species. • Logic: if the linear alignment of a pair of sequences is similar, then we can infer that the 3-dimensional structure is similar; if the 3-D structure is similar then there is a good chance that the function is similar.
BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST) • BLAST programs (there are several) compare a query sequence to all the sequences in a database in a pairwise manner. • Breaks query and database sequences into fragments known as "words", and seeks matches between them. • Attempts to align query words of length "W" to words in the database such that the alignment scores at least a threshold value, "T". • Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding another threshold, "S"; these are known as High-Scoring Segment Pairs (HSPs). • The best-scoring HSP for a pair of sequences is known as the Maximal-Scoring Segment Pair (MSP).
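The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified toy, not BLAST itself: it assumes exact word matches as seeds (no threshold "T" or substitution matrix) and extends only while residues are identical (no threshold "S"). The function name and the subject sequence are hypothetical.

```python
def seed_and_extend(query, subject, w=3):
    """Return (query_start, subject_start, length) segments, longest first."""
    # Index every word of length w in the subject sequence.
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)

    segments = set()
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            # Extend the seed to the left while residues match.
            qs, ss = i, j
            while qs > 0 and ss > 0 and query[qs - 1] == subject[ss - 1]:
                qs -= 1
                ss -= 1
            # Extend the seed to the right while residues match.
            qe, se = i + w, j + w
            while qe < len(query) and se < len(subject) and query[qe] == subject[se]:
                qe += 1
                se += 1
            segments.add((qs, ss, qe - qs))
    return sorted(segments, key=lambda seg: -seg[2])

if __name__ == "__main__":
    query = "GARFIELDTHECAT"
    subject = "MMGARFIELDTHERATMM"     # hypothetical database sequence
    for qs, ss, length in seed_and_extend(query, subject):
        print(qs, ss, query[qs:qs + length])
```

Many seeds on the same diagonal collapse into a single extended segment, which is why indexing words first makes the search fast.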
Two Sequence Alignment
To align GARFIELDTHECAT with GARFIELDTHERAT is easy:
GARFIELDTHECAT
||||||||||| ||
GARFIELDTHERAT
Gaps
Sometimes, you can get a better overall alignment if you insert gaps:
GARFIELDTHECAT
||||||||   |||
GARFIELDA--CAT
is better (scores higher) than
GARFIELDTHECAT
||||||||
GARFIELDACAT
No Gap Penalty
But there has to be some sort of gap penalty, otherwise you can align ANY two sequences:
G-R--E------AT
| |  |      ||
GARFIELDTHECAT
Affine Gap Penalty • Could set a score for each indel. • Usually use affine (open + extend). • Open -10, extend -0.05
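A worked version of the affine penalty, using the open and extend values from the slide. One assumption to flag: the sketch charges the extension cost for every gapped position including the first; some programs fold the first position into the opening cost instead.

```python
GAP_OPEN = -10.0      # charged once per gap (value from the slide)
GAP_EXTEND = -0.05    # charged per gapped position (value from the slide)

def affine_gap_penalty(length, gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Penalty for a single gap spanning `length` positions."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length

if __name__ == "__main__":
    one_long = affine_gap_penalty(2)           # e.g. the "--" in GARFIELDA--CAT
    two_short = 2 * affine_gap_penalty(1)      # two separate single gaps
    print(f"one gap of length 2: {one_long:.2f}")
    print(f"two gaps of length 1: {two_short:.2f}")
```

One long gap is much cheaper than several short ones, which matches the biology: a single indel event can insert or delete many bases at once.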
2+ Similar Sequences • When doing a similarity search against a database you are trying to decide which of many sequences is the CLOSEST match to your search sequence. • Which of the following alignment pairs is better?
Scoring Alignments
GARFIELDTHECAT
||||   |||||||
GARFRIEDTHECAT

GARFIELDTHECAT
||| |||  |||||
GARWIELESHECAT

GARFIELDTHECAT
||  ||||||| ||
GAVGIELDTHEMAT
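Counting identical positions is the simplest possible score. A quick sketch (the helper name is illustrative) shows why it is not enough for these three pairs:

```python
def identities(a, b):
    """Number of aligned positions where the two sequences are identical."""
    return sum(x == y for x, y in zip(a, b))

if __name__ == "__main__":
    query = "GARFIELDTHECAT"
    for hit in ("GARFRIEDTHECAT", "GARWIELESHECAT", "GAVGIELDTHEMAT"):
        print(hit, identities(query, hit), "/", len(query))
```

All three pairs share the same number of identical positions, so deciding which is "better" requires a score that also weighs what each mismatch is.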
Substitution Matrices
#BLOSUM 90 (excerpt)
    A   R   N   D   C   Q   E   G   H   I   L
A   5  -2  -2  -3  -1  -1  -1   0  -2  -2  -2
R  -2   6  -1  -3  -5   1  -1  -3   0  -4  -3
N  -2  -1   7   1  -4   0  -1  -1   0  -4  -4
D  -3  -3   1   7  -5  -1   1  -2  -2  -5  -5
C  -1  -5  -4  -5   9  -4  -6  -4  -5  -2  -2
Q  -1   1   0  -1  -4   7   2  -3   1  -4  -3
E  -1  -1  -1   1  -6   2   6  -3  -1  -4  -4
G   0  -3  -1  -2  -4  -3  -3   6  -3  -5  -5
H  -2   0   0  -2  -5   1  -1  -3   8  -4  -4
I  -2  -4  -4  -5  -2  -4  -4  -5  -4   5   1
L  -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5
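To score an ungapped alignment with a substitution matrix, look up each aligned pair and add the values. The sketch below uses only the 11-residue excerpt shown on the slide, so it works only for peptides built from those residues; the example peptides are hypothetical.

```python
RESIDUES = "ARNDCQEGHIL"
# Rows copied from the BLOSUM 90 excerpt on the slide (11 residues only).
ROWS = """
A  5 -2 -2 -3 -1 -1 -1  0 -2 -2 -2
R -2  6 -1 -3 -5  1 -1 -3  0 -4 -3
N -2 -1  7  1 -4  0 -1 -1  0 -4 -4
D -3 -3  1  7 -5 -1  1 -2 -2 -5 -5
C -1 -5 -4 -5  9 -4 -6 -4 -5 -2 -2
Q -1  1  0 -1 -4  7  2 -3  1 -4 -3
E -1 -1 -1  1 -6  2  6 -3 -1 -4 -4
G  0 -3 -1 -2 -4 -3 -3  6 -3 -5 -5
H -2  0  0 -2 -5  1 -1 -3  8 -4 -4
I -2 -4 -4 -5 -2 -4 -4 -5 -4  5  1
L -2 -3 -4 -5 -2 -3 -4 -5 -4  1  5
"""

MATRIX = {}
for line in ROWS.strip().splitlines():
    row, *values = line.split()
    for col, value in zip(RESIDUES, values):
        MATRIX[row, col] = int(value)

def score_ungapped(a, b):
    """Sum of substitution scores over aligned positions (no gaps)."""
    return sum(MATRIX[x, y] for x, y in zip(a, b))

if __name__ == "__main__":
    print(score_ungapped("ACDE", "ACDE"))   # identical peptides
    print(score_ungapped("ACDE", "ACEQ"))   # conservative changes D->E, E->Q
    print(score_ungapped("ACDE", "ACCC"))   # unfavourable changes D->C, E->C
```

Conservative replacements (D with E, I with L) keep a positive score, while unlikely ones are penalised heavily.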
Low Complexity Masking • Some sequences are similar even if they have no recent common ancestor. • Huntington's disease is caused by poly-CAG tracts in the DNA, which result in polyglutamine (Gln, Q) tracts in the protein. • If you do a homology search with QQQQQQQQQQ you get hits to other proteins that have a lot of glutamines but have totally different functions.
Low Complexity Masking
Huntingtin: MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA
hits > MM16_MOUSE MATRIX METALLOPROTEINASE-16
Score = 34.4 bits (78), Expect = 0.18
Identities = 21/65 (32%), Positives = 25/65 (38%), Gaps = 2/65 (3%)
FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP
F Q  +    +      Q  Q+   PP   PPP   LP  PP     P  P+    P PP
FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP
But not because it is involved in microtubule mediated transport!
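The idea of masking can be illustrated with a very crude filter that replaces long runs of a single residue with X before searching. This is only a toy: real search tools use more general complexity filters (SEG is the usual one for proteins), and the run-length threshold here is an arbitrary assumption.

```python
import re

def mask_runs(seq, min_run=5, mask_char="X"):
    """Replace any run of min_run or more identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)

if __name__ == "__main__":
    huntingtin_start = ("MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQQQ"
                        "PPPPPPPPPPPQLPQPPPQA")
    print(mask_runs(huntingtin_start))
```

With the poly-Q and poly-P runs masked, only the informative residues are left to seed hits.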
E values • An E-value is the number of hits with a score at least this good that you would expect to see purely by chance in a search of this size. • Dependent on the size of the query sequence and the database. • The lower the E-value, the more confidence you can have that a hit is a true homologue (sequence related by common descent).
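The dependence on search size can be made concrete with the standard bit-score relationship E = m × n × 2^(-S'), where S' is the bit score and m and n are the (effective) query and database lengths. The lengths used below are hypothetical, chosen only to show the effect of database size.

```python
import math

def e_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)

def p_value(e):
    """Probability of seeing at least one such chance hit."""
    return -math.expm1(-e)

if __name__ == "__main__":
    # Same bit score, two hypothetical database sizes (in residues).
    for db_len in (1e6, 1e9):
        e = e_value(34.4, query_len=60, db_len=db_len)
        print(f"db={db_len:.0e}: E={e:.2g}, P={p_value(e):.2g}")
```

The same alignment looks far less significant against a larger database, which is why E-values from different searches are not directly comparable.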
Dotplot theory
Another way of comparing 2 sequences. Task: align ATGATATTCTT and ATTGTTC
  A T G A T A T T C T T
A . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .
Go along the first seq inserting a + wherever at least 2 of 3 bases in a moving window match. The first seq is compared to ATT (the first 3 bases in the vertical sequence).
  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . . . . . . . . . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .
Then go along the first seq inserting a + wherever at least 2 of 3 bases in a moving window match. The first seq is compared to TTG (the next 3 in the vertical sequence).
  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . + . . . . . + . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .
Iterate until
  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . + . . . . . + . . .
G . . + . . . . . + . .
T . . . + . . . . . + .
T . . . . . . . + . . .
C . . . . . . . . . . .
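The windowing procedure can be sketched as a short function. It assumes an odd window size and places the + at the centre of each window pair, as in the ATT and TTG steps. Note that applying the 2-of-3 rule strictly to every window can mark a few more cells than the hand-drawn grid shows.

```python
def dotplot(horizontal, vertical, window=3, min_match=2):
    """Return one string per row of the vertical sequence ('+' = window match)."""
    half = window // 2
    grid = [["."] * len(horizontal) for _ in vertical]
    for i in range(half, len(vertical) - half):          # window centre, vertical
        for j in range(half, len(horizontal) - half):    # window centre, horizontal
            matches = sum(vertical[i + k] == horizontal[j + k]
                          for k in range(-half, half + 1))
            if matches >= min_match:
                grid[i][j] = "+"
    return ["".join(row) for row in grid]

if __name__ == "__main__":
    horizontal, vertical = "ATGATATTCTT", "ATTGTTC"
    print("  " + " ".join(horizontal))
    for base, row in zip(vertical, dotplot(horizontal, vertical)):
        print(base + " " + " ".join(row))
```

Raising `window` and `min_match` suppresses random dots and makes true diagonals stand out, at the cost of missing short matches.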
  A T G A T A T T C T T
A
T   +     +   +     +
T   +           +
G     +           +
T       +           +
T               +
C
The human eye is particularly good at picking up structure from the pattern of dots. You might see a hint of a duplicated region in the horizontal sequence that is not so clear from the sequence itself.
Why Do MSAs? • Although BLAST may give you a good E-value, an MSA is more convincing evidence that a protein is related and can be aligned over its entire length. • Identification of conserved regions or domains in proteins. • Regions that are evolutionarily conserved are likely to be important for structure/function. • Mutations in these areas are more likely to affect function. • Identification of conserved residues in proteins. • Prerequisite for doing phylogenetic trees.
Computing MSAs • Problem: once you attempt to align more than a few sequences, MSA quickly becomes computationally intensive and eventually intractable. • Solution: Clustal, invented in Kennedy’s pub, Trinity College Dublin. • Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680. • Download Clustalx: ftp://ftp-igbmc.ustrasbg.fr/pub/ClustalX/clustalx1.81.msw.zip • Adding evolutionary theory to multiple sequence alignment.