
Molecular Evolution, Multiple Sequence Alignment & Phylogenetics. Canadian Bioinformatics Workshop, Thursday June 21st

David Lynn M.Sc., Ph.D.,

Postdoctoral Research Associate,

Brinkman Lab.,

Department of Molecular Biology & Biochemistry,

Simon Fraser University,

Greater Vancouver, B.C.



Evidence for Evolution – Fact not Theory

  • Fossils

  • Observable – e.g. viral evolution: under HIV drug treatment we can predict which sites will change. It is also why you need a flu vaccine every year!

  • Overwhelming scientific evidence.

  • Humans are about 99% identical to chimpanzees at the DNA level.



Nothing in biology makes sense except in the light of evolution

Dobzhansky, 1973



Why Learn about Evolution:

  • Tells us where we come from, classification of species, which species are most closely related.

  • Understand the fundamentals of life.

  • Practical side:

  • Foundation of most bioinformatics analyses:

  • Gene family identification.

  • Gene discovery – inferring gene function, gene annotation.

  • Origins of a genetic disease, characterization of polymorphisms.


Jean Baptiste de Lamarck

Besoin (the need or desire for change in phenotype) → change in phenotype → change in genotype → change in phenotype of offspring (inherited)


August Weismann

Genotype unaffected by changes in phenotype.

Spontaneous and random changes in genes during reproduction → offspring has changed genotype → change in phenotype of offspring

Weismann distinguished somatic and germline mutation



Part of Darwin’s Theory

  • The world is not constant, but changing

  • All organisms are derived from common ancestors by a process of branching.

  • Classify organisms based on shared traits inherited from common ancestor

  • Morphological character-based analysis – didn’t know about DNA



For evolution to happen, there must be heredity and variation – Descent with modification.



Variation by DNA mutation

  • Nucleotide substitution

    • Replication error

    • Chemical reaction

  • Insertions or deletions (indels)

    • single base indels

    • Unequal crossing over



What happens when a new mutation arises?



Positive Selection

  • A new allele (mutant) confers some increase in the fitness of the organism

  • Selection acts to favour this allele

  • Also called adaptive evolution

    NOTE: Fitness = ability to survive and reproduce



Advantageous Allele

Herbicide resistance gene in nightshade plant



Negative selection

  • A new allele (mutant) confers some decrease in the fitness of the organism

  • Selection acts to remove this allele

  • Also called purifying selection



Deleterious allele

Human breast cancer gene, BRCA2

5% of breast cancer cases are familial

Mutations in BRCA2 account for 20% of familial cases

Figure: normal (wild type) allele vs. mutant allele (Montreal 440 Family); a 4 base pair deletion causes a frameshift and a stop codon.



Neutral mutations

  • Neither advantageous nor disadvantageous

  • Invisible to selection (no selection)

  • Frequency subject to ‘drift’ in the population

  • Random drift – random changes in small populations


Figure: allele frequency (0 to 100) under random genetic drift and under selection (advantageous vs. disadvantageous alleles).
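
A rough way to see the difference between drift and selection is to simulate it. The sketch below assumes a haploid Wright-Fisher population of fixed size; the function name, the selection coefficient s and all parameter values are illustrative choices, not taken from the slides.

```python
import random


def wright_fisher(pop_size, start_freq, generations, s=0.0, seed=0):
    """Track the frequency of one allele over time.

    s is a hypothetical selection coefficient (must be > -1):
    s > 0 favours the allele, s < 0 disfavours it, s = 0 is pure drift.
    """
    rng = random.Random(seed)
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        # Selection shifts the expected frequency in the next generation.
        expected = freq * (1 + s) / (freq * (1 + s) + (1 - freq))
        # Drift: each of the pop_size offspring picks its allele at random.
        copies = sum(rng.random() < expected for _ in range(pop_size))
        freq = copies / pop_size
        history.append(freq)
    return history


neutral = wright_fisher(pop_size=50, start_freq=0.5, generations=200)
favoured = wright_fisher(pop_size=50, start_freq=0.5, generations=200, s=0.1)
print("neutral final frequency: ", neutral[-1])
print("favoured final frequency:", favoured[-1])
```

Rerunning with a different seed or a smaller pop_size shows how strongly chance alone can move a neutral allele in a small population.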



Evolutionary models

  • Neo-Darwinian (Pan-selectionist) – positive selection only.

  • Mutationist – mutation and random drift.

  • Neutralist – mutation, random drift, and negative selection.



Neo-Darwinian Model

  • Mutation is recognised as the origin of variation.

  • Gene substitution (new allele replacing old) occurs by positive selection only.

  • Polymorphism (multiple alleles co-existing) caused by balancing selection.



Neutral Theory

  • Too much polymorphism to be explained by mutation and positive selection alone (Neo-Darwinian model).

  • Why so much?

  • Neutral Theory of Molecular Evolution

    • Motoo Kimura, 1968

  • Most polymorphism is selectively neutral.

  • Majority of evolutionary changes caused by random genetic drift of selectively neutral (or almost neutral) alleles.

  • Still allows for some selection.

Motoo Kimura (1924-94)



What about the rate of evolution?



Molecular Clock Hypothesis

  • Rate of evolution of DNA is constant over time and across lineages

  • Resolve history of species

    • Timing of events

    • Relationship of species

  • Early protein studies showed approximately constant rate of evolution

  • As more data accumulated, it was quickly shown that there is no universal molecular clock.

  • But: still useful if you compare like with like.



Different Rates within a Gene or Genome

  • Coding sequences evolve more slowly than non-coding sequences.

  • Synonymous substitutions are often more common than non-synonymous.

  • 3rd codon position sites evolve faster than others.

  • Some sequences are under functional constraint.

  • Different genes evolve at different rates.

  • Different regions of genome – higher mutation, higher recombination rates.

  • Genes in different species evolve at different rates e.g.

    • rodents vs primates → generation time hypothesis.

    • sharks vs mammals → metabolic rate hypothesis.



Two Sequence Alignment



Inferring Function by Homology

  • The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology searching, the equivalent genes in one species to those known to be important in other model species.

  • Logic: if the linear alignment of a pair of sequences is similar, then we can infer that the 3-dimensional structure is similar; if the 3-D structure is similar then there is a good chance that the function is similar.



BASIC LOCAL ALIGNMENT SEARCH TOOLS (BLAST)

  • BLAST programs (there are several) compare a query sequence to all the sequences in a database in a pairwise manner.

  • Breaks query and database sequences into fragments known as "words", and seeks matches between them.

  • Attempts to align query words of length "W" to words in the database such that the alignment scores at least a threshold value, "T". These matches are known as High-Scoring Segment Pairs (HSPs).

  • HSPs are then extended in either direction in an attempt to generate an alignment with a score exceeding another threshold, "S", known as a Maximal-Scoring Segment Pair (MSP)
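
The word-then-extend idea can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST itself: it assumes exact word matches and extends only while residues are identical, whereas real BLAST scores words with a substitution matrix against "T" and extends by score. The function name and the default w=3 are assumptions made for the example.

```python
def seed_and_extend(query, subject, w=3):
    """Find exact word matches of length w, then extend them without gaps.

    Returns (query start, subject start, length) tuples, longest first.
    """
    # Index every word of length w in the subject by its start position.
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)

    hits = set()
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            # Extend left and right while the residues keep matching.
            qs, ss = i, j
            while qs > 0 and ss > 0 and query[qs - 1] == subject[ss - 1]:
                qs, ss = qs - 1, ss - 1
            qe, se = i + w, j + w
            while qe < len(query) and se < len(subject) and query[qe] == subject[se]:
                qe, se = qe + 1, se + 1
            hits.add((qs, ss, qe - qs))
    return sorted(hits, key=lambda h: (-h[2], h[0], h[1]))


print(seed_and_extend("GARFIELDTHECAT", "GARFIELDTHERAT"))  # [(0, 0, 11)]
```

Every seed on the shared GARFIELDTHE stretch extends to the same 11-residue segment, so the many word hits collapse into one extended match.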



Two Sequence Alignment

To align GARFIELDTHECAT with GARFIELDTHERAT is easy

GARFIELDTHECAT

||||||||||| ||

GARFIELDTHERAT



Gaps

Sometimes, you can get a better overall alignment if you insert gaps

GARFIELDTHECAT

||||||||   |||

GARFIELDA--CAT

is better (scores higher) than

GARFIELDTHECAT

||||||||

GARFIELDACAT



No Gap Penalty

But there has to be some sort of a gap-penalty otherwise you can align ANY two sequences:

G-R--E------AT

| |  |      ||

GARFIELDTHECAT



Affine Gap Penalty

  • Could set a score for each indel.

  • Usually use affine (open + extend).

  • Open -10, extend -0.05
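
A minimal sketch of affine gap scoring, applied to the GARFIELD examples from the previous slides. The match and mismatch values (+1 and -1) are assumptions made for illustration; the gap values are the ones quoted above. The convention assumed here is that the first position of a gap costs the opening penalty and each further position costs the extension penalty.

```python
def score_alignment(a, b, match=1.0, mismatch=-1.0, gap_open=-10.0, gap_extend=-0.05):
    """Score two equal-length aligned strings; '-' marks a gap position."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0.0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # First gap position pays the opening cost, later ones the extension cost.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


# With no gap penalty, five scattered letters "align" for free...
print(score_alignment("G-R--E------AT", "GARFIELDTHECAT", gap_open=0.0, gap_extend=0.0))
# ...with an affine penalty, the three separate gaps are heavily punished.
print(score_alignment("G-R--E------AT", "GARFIELDTHECAT"))
# One longer gap is much cheaper than several short ones.
print(score_alignment("GARFIELDTHECAT", "GARFIELDA--CAT"))
```

The design point of the affine scheme is visible in the numbers: opening a gap is expensive, lengthening an existing one is cheap, so a single indel event is preferred over many scattered ones.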



2+ Similar Sequences

  • When doing a similarity search against a database you are trying to decide which of many sequences is the CLOSEST match to your search sequence.

  • Which of the following alignment pairs is better?:



Scoring Alignments

GARFIELDTHECAT

||||   |||||||

GARFRIEDTHECAT

GARFIELDTHECAT

||| |||  |||||

GARWIELESHECAT

GARFIELDTHECAT

||  ||||||| ||

GAVGIELDTHEMAT
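
Counting identical positions is the simplest possible score, and a quick check shows why it is not enough here. The helper name below is an assumption made for the example.

```python
def identities(a, b):
    """Number of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


reference = "GARFIELDTHECAT"
for candidate in ("GARFRIEDTHECAT", "GARWIELESHECAT", "GAVGIELDTHEMAT"):
    print(candidate, identities(reference, candidate))  # 11 in every case
```

All three candidates match at 11 of 14 positions, so identity alone cannot rank them. Telling them apart needs a score that knows which amino acid replacements are conservative.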



Willie Taylor’s AA Venn Diagram



Substitution Matrices

#BLOSUM 90

     A   R   N   D   C   Q   E   G   H   I   L
A    5  -2  -2  -3  -1  -1  -1   0  -2  -2  -2
R   -2   6  -1  -3  -5   1  -1  -3   0  -4  -3
N   -2  -1   7   1  -4   0  -1  -1   0  -4  -4
D   -3  -3   1   7  -5  -1   1  -2  -2  -5  -5
C   -1  -5  -4  -5   9  -4  -6  -4  -5  -2  -2
Q   -1   1   0  -1  -4   7   2  -3   1  -4  -3
E   -1  -1  -1   1  -6   2   6  -3  -1  -4  -4
G    0  -3  -1  -2  -4  -3  -3   6  -3  -5  -5
H   -2   0   0  -2  -5   1  -1  -3   8  -4  -4
I   -2  -4  -4  -5  -2  -4  -4  -5  -4   5   1
L   -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5
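
A small sketch of how such a matrix is used: parse the excerpt into a lookup table and ask for the score of a residue pair. Only the 11 residues shown on the slide are included, so pairs involving other amino acids cannot be scored with this excerpt. The function and variable names are assumptions made for the example.

```python
EXCERPT = """
   A  R  N  D  C  Q  E  G  H  I  L
A  5 -2 -2 -3 -1 -1 -1  0 -2 -2 -2
R -2  6 -1 -3 -5  1 -1 -3  0 -4 -3
N -2 -1  7  1 -4  0 -1 -1  0 -4 -4
D -3 -3  1  7 -5 -1  1 -2 -2 -5 -5
C -1 -5 -4 -5  9 -4 -6 -4 -5 -2 -2
Q -1  1  0 -1 -4  7  2 -3  1 -4 -3
E -1 -1 -1  1 -6  2  6 -3 -1 -4 -4
G  0 -3 -1 -2 -4 -3 -3  6 -3 -5 -5
H -2  0  0 -2 -5  1 -1 -3  8 -4 -4
I -2 -4 -4 -5 -2 -4 -4 -5 -4  5  1
L -2 -3 -4 -5 -2 -3 -4 -5 -4  1  5
"""


def parse_matrix(text):
    """Turn a whitespace-separated matrix into {(row, column): score}."""
    rows = [line.split() for line in text.strip().splitlines()]
    columns = rows[0]
    return {
        (row[0], col): int(value)
        for row in rows[1:]
        for col, value in zip(columns, row[1:])
    }


matrix = parse_matrix(EXCERPT)
print("D -> E:", matrix[("D", "E")])  # similar acidic residues, mild positive score
print("I -> L:", matrix[("I", "L")])  # similar hydrophobic residues
print("C -> E:", matrix[("C", "E")])  # very different residues, strongly negative
```

Identical residues score highest, chemically similar pairs score around zero or slightly above, and unlikely replacements are penalised.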



Low Complexity Masking

  • Some sequences are similar even if they have no recent common ancestor.

  • Huntington's disease is caused by poly-CAG tracts in the DNA, which result in polyglutamine (Gln, Q) tracts in the protein.

  • If you do a homology search with QQQQQQQQQQ you get hits to other proteins that have a lot of glutamines but have totally different function.



Low Complexity Masking

Huntingtin:

MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ

QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA

Hits: >MM16_MOUSE MATRIX METALLOPROTEINASE-16; Score = 34.4 bits (78), Expect = 0.18, Identities = 21/65 (32%), Positives = 25/65 (38%), Gaps = 2/65 (3%):

FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP

F Q  +    +      Q  Q+   PP   PPP   LP  PP     P  P+    P PP

FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP

But not because it is involved in microtubule mediated transport!
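
A minimal sketch of the masking idea, assuming the crudest possible rule: replace any run of five or more identical residues with X before searching. Real masking programs use a proper sequence-complexity measure rather than this rule; the function name, the threshold and the mask character are assumptions made for the example.

```python
import re


def mask_homopolymers(seq, min_run=5, mask_char="X"):
    """Replace every run of min_run or more identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


# Start of huntingtin, as shown on the slide.
huntingtin_start = "".join([
    "MATLEKLMKA", "FESLKSFQQQ", "QQQQQQQQQQ",
    "QQQQQQQQQQ", "PPPPPPPPPP", "PQLPQPPPQA",
])
print(mask_homopolymers(huntingtin_start))
```

After masking, the polyQ and polyP stretches can no longer seed spurious hits, while the informative residues on either side are left untouched.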



E values

  • An E-value is the number of hits with a score at least this good that you would expect to occur by chance.

  • Dependent on the size of the query sequence and the database.

  • The lower the E-value the more confidence you can have that a hit is a true homologue (sequence related by common descent).
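
The dependence on query and database size can be made concrete with the usual bit-score relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below assumes that simple form with no edge-effect or effective-length corrections; the sizes used are hypothetical.

```python
def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score.

    Assumption: E = m * n * 2**(-S') with raw lengths, which ignores the
    length corrections that search programs apply in practice.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes, chosen only to show how E scales with the database.
for db_len in (1_000_000, 10_000_000):
    print(db_len, expected_hits(bit_score=34.4, query_len=60, db_len=db_len))
```

The same alignment score becomes ten times less convincing when the database is ten times larger, which is why E-values from different databases should not be compared directly.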



Dotplot theory

Another way of comparing 2 sequences

Task: align ATGATATTCTT and ATTGTTC

A T G A T A T T C T T

A . . . . . . . . . . .

T . . . . . . . . . . .

T . . . . . . . . . . .

G . . . . . . . . . . .

T . . . . . . . . . . .

T . . . . . . . . . . .

C . . . . . . . . . . .



Go along the first seq inserting a + wherever 2/3 bases in a moving window match. The first seq is compared to ATT (the first 3 bases in the vertical sequence).

A T G A T A T T C T T

A . . . . . . . . . . .

T . + . . + . + . . + .

T . . . . . . . . . . .

G . . . . . . . . . . .

T . . . . . . . . . . .

T . . . . . . . . . . .

C . . . . . . . . . . .



Then go along the first seq inserting a + wherever 2/3 bases in a moving window match. The first seq is compared to TTG (the next 3 in the vertical sequence).

A T G A T A T T C T T

A . . . . . . . . . . .

T . + . . + . + . . + .

T . + . . . . . + . . .

G . . . . . . . . . . .

T . . . . . . . . . . .

T . . . . . . . . . . .

C . . . . . . . . . . .



Iterate until

A T G A T A T T C T T

A . . . . . . . . . . .

T . + . . + . + . . + .

T . + . . . . . + . . .

G . . + . . . . . + . .

T . . . + . . . . . + .

T . . . . . . . + . . .

C . . . . . . . . . . .


  A T G A T A T T C T T

A

T   +     +   +     +

T   +           +

G     +           +

T       +           +

T               +

C

The human eye is particularly good at picking up structure from the pattern of dots. You might see a hint of a duplicated region in the horizontal sequence that is not so clear from the sequence itself.
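
The windowed dotplot procedure can be sketched as follows, assuming a window of 3, a threshold of 2 matches, and a + placed at the centre of each matching window (which is where the slides put it). The function name is an assumption. A strict application of the 2-out-of-3 rule may mark a few cells that the hand-drawn matrices above leave empty.

```python
def dotplot(horizontal, vertical, window=3, min_matches=2):
    """Return one string per vertical base: '+' for a hit, '.' otherwise."""
    half = window // 2
    grid = [["."] * len(horizontal) for _ in vertical]
    for r in range(len(vertical) - window + 1):
        v_word = vertical[r:r + window]
        for c in range(len(horizontal) - window + 1):
            h_word = horizontal[c:c + window]
            if sum(x == y for x, y in zip(v_word, h_word)) >= min_matches:
                # Mark the cell at the centre of the two windows.
                grid[r + half][c + half] = "+"
    return ["".join(row) for row in grid]


rows = dotplot("ATGATATTCTT", "ATTGTTC")
print("  " + " ".join("ATGATATTCTT"))
for base, row in zip("ATTGTTC", rows):
    print(base, " ".join(row))
```

Raising the window size or the threshold removes background noise at the cost of missing weaker similarities, which is the main tuning decision when drawing dotplots.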



Multiple Sequence Alignments



Why Do MSAs?

  • Although BLAST may give you a good E-value, an MSA is more convincing evidence that the proteins are related and can be aligned over their entire length.

  • Identification of conserved regions or domains in proteins.

    • Regions that are evolutionarily conserved are likely to be important for structure/function.

    • Mutations in these areas more likely to affect function.

  • Identification of conserved residues in proteins.

  • Prerequisite for doing phylogenetic trees.



Identification of Conserved Domains:



Human β-defensins



Computing MSAs

  • Problem: Once you attempt to align more than a few sequences – MSA quickly becomes computationally intensive and eventually intractable.

  • Solution: Clustal – invented in Kennedy’s pub, Trinity College Dublin.

  • Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.

  • Download Clustalx: ftp://ftp-igbmc.ustrasbg.fr/pub/ClustalX/clustalx1.81.msw.zip

  • Adding evolutionary theory to multiple sequence alignment.



How MSAs are computed



You still may have to do some hand-editing!!



Alignment Editors

  • Several multiple sequence alignment editors are available for manually editing MSAs.

    • GeneDoc http://www.nrbsc.org/gfx/genedoc/index.html

    • Jalview http://www.jalview.org/



T-Coffee Vs Clustal

  • ClustalW http://www.ebi.ac.uk/clustalw/ is the standard program for MSAs.

  • However, the newer program T-Coffee http://www.tcoffee.org/ often does a better job, particularly with more distantly related proteins.

  • Other programs, e.g. Muscle http://www.drive5.com/muscle/, may be better than T-Coffee at aligning large numbers of sequences.



Phylogenetics – Inferring the evolutionary relationships between genes/sequences/species.


Terminology

  • Node

  • Branch – length proportional to amount of evolution (not in all trees).

  • Operational taxonomic units (OTUs), e.g. genes, species, populations. In this case: protein sequences.

  • Clade

  • Bootstrap values (%) showing level of statistical confidence in clade.

  • Outgroup



Different Views of the Same Trees


Star-shaped phylogeny.

No branch lengths shown



Why Do Trees?

  • Classification of life.

  • Investigate the evolutionary relationship between genes/species/strains.

    • What can this tell us about function?

  • Epidemiology: tracing pathogen evolution/origins e.g. viruses, SARS, foot & mouth, Avian Influenza.

  • Assign orthology to related genes.

  • The closest BLAST hit is often not the nearest neighbor.

    • Koski LB, Golding GB J Mol Evol. 2001.



SARS as an example

SARS forms a distinct clade within genus Coronavirus.

Implications for vaccine and drug design.

Implications for epidemiology.


Orthologs & Paralogs

Orthologs – Genes derived from a speciation event, i.e. the ‘same’ gene in different species.

Paralogs – Genes derived from a gene duplication event. Evolutionarily related but not the ‘same’ gene → may have similar functions but likely also different ones.


Importance of Ortholog Prediction:

  • Why important → implies likely conservation of function in different species → necessary to make inferences of function based on analysis in one of the species.

  • Example: knock out gene A in species 1 → observe phenotype → infer gene A in species 2 has same/similar function

    • Only holds if comparing orthologous genes.

Figure labels: Species1_GeneA, Species2_GeneA, Outgroup_GeneA.


Common Problems in Ortholog Prediction

  • Reciprocal Best BLAST Hit (RBH) → commonly used high-throughput method for ortholog identification.

  • Incomplete genome sequence or gene loss often results in paralogs being predicted as orthologs.

Figure labels: Species 1, Species 2, BLAST.
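
The RBH rule, and the way gene loss can fool it, can be sketched with plain dictionaries. The gene names and scores below are hypothetical; in practice the scores would come from BLAST searches in both directions.

```python
def reciprocal_best_hits(hits_1_vs_2, hits_2_vs_1):
    """Each argument maps a query gene to a dict {subject gene: score}.

    Returns the pairs in which each gene is the other's best-scoring hit.
    """
    def best(hits):
        return {query: max(subjects, key=subjects.get)
                for query, subjects in hits.items() if subjects}

    best_12 = best(hits_1_vs_2)
    best_21 = best(hits_2_vs_1)
    return sorted((g1, g2) for g1, g2 in best_12.items() if best_21.get(g2) == g1)


# Hypothetical case: species 1 has lost gene A and species 2 has lost gene B,
# so the only remaining family members are paralogs of each other.
hits_1_vs_2 = {"sp1_geneB": {"sp2_geneA": 180.0}}
hits_2_vs_1 = {"sp2_geneA": {"sp1_geneB": 175.0}}
print(reciprocal_best_hits(hits_1_vs_2, hits_2_vs_1))
```

The two genes are reported as a reciprocal best pair even though, by construction, they are paralogs. RBH has no way to notice that the true orthologs are missing.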



Common Problems in Ortholog Prediction



Real Example: Assigning orthology of a novel chicken IRAK.

Lynn et al., 2003



Ortholuge: Improving the specificity of high-throughput ortholog prediction

  • Solution to problem: Putative orthologs from 2 species are compared to a third outgroup species and phylogenetic distances are calculated.

  • Unusual phylogenetic distances are used to identify possible/probable paralogs.



Phylogenetic Methods

  • UPGMA

    • assumes constant rate of evolution – molecular clock: don’t publish UPGMA trees

  • Neighbor-Joining

    • very fast. Often a “good enough” tree.

  • Maximum Parsimony

    • Minimum # mutations to construct tree. Slower than NJ.

  • Maximum Likelihood

    • Very CPU intensive. Requires explicit model of evolution – rate and pattern of nucleotide substitution. Only use if you know what you are doing. Rubbish in, rubbish out!



Distance Methods

  • Distance matrix

  • UPGMA assumes constant rate of evolution – molecular clock: don’t publish UPGMA trees

  • Neighbor joining is very fast.

  • Often a “good enough” tree.

  • Embedded in ClustalW.

  • Use in publications only if too many taxa to compute with MP or ML
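
Every distance method starts from a matrix of pairwise distances. A minimal sketch, assuming the simplest measure (the proportion of differing sites, ignoring columns with a gap in either sequence); the function names and the toy alignment are assumptions made for the example.

```python
def p_distance(a, b):
    """Proportion of aligned, ungapped sites at which a and b differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """alignment maps names to aligned sequences; returns {(name, name): distance}."""
    names = list(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n]) for m in names for n in names}


toy = {"a": "ACGT", "b": "ACGA", "c": "A-GA"}
for pair, d in distance_matrix(toy).items():
    print(pair, round(d, 3))
```

Collapsing each pair of sequences into one number is what makes these methods fast, and it is also where information about individual sites is lost.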



Maximum Parsimony

  • Minimum # mutations to construct tree.

  • Better than NJ – information lost in distance matrix – but much slower.

  • Sensitive to long-branch attraction.

  • No explicit evolutionary model.

  • Protpars refuses to estimate branch lengths.

  • Informative sites.
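
Only some alignment columns can discriminate between trees under parsimony. The sketch below uses the usual definition of a parsimony-informative site: at least two different character states, each present in at least two sequences. Treating gaps as missing data is an assumption made for the example, as is the function name.

```python
from collections import Counter


def informative_sites(alignment):
    """Return the 0-based indices of parsimony-informative columns."""
    sites = []
    for i, column in enumerate(zip(*alignment)):
        counts = Counter(c for c in column if c != "-")
        # Count the states that occur in two or more sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(i)
    return sites


toy = ["AAGT", "AAGA", "ACTA", "ACTT"]
print(informative_sites(toy))  # [1, 2, 3]
```

Column 0 is constant, so it fits every tree equally well and says nothing about branching order; the other three columns each support one grouping of the four sequences.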



Maximum Likelihood

  • Very CPU intensive.

  • Requires explicit model of evolution – rate and pattern of nucleotide substitution.

    • JC Jukes/Cantor

    • K2P Kimura 2 parameter transition/transversion

    • F81 Felsenstein – base composition bias

    • HKY85 merges K2P and F81

  • Explicit model → preferred statistically.

  • Assumes change more likely on long branch.

  • Much less sensitive to long-branch attraction than parsimony.

  • Wrong model → wrong tree.
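
What "explicit model" means is easiest to see in the pairwise distance corrections derived from the two simplest models in the list. This is a sketch of those corrections only, not of a likelihood calculation; the function names are assumptions. Here p is the observed proportion of differing sites, P the proportion of transitions and Q the proportion of transversions.

```python
import math


def jukes_cantor(p):
    """JC distance: all substitutions equally likely."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """K2P distance: transitions and transversions occur at different rates."""
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


p = 0.30
print("observed:", p)
print("JC      :", jukes_cantor(p))
# Same total difference, but mostly transitions: K2P gives a larger estimate.
print("K2P     :", kimura_2p(P=0.25, Q=0.05))
```

Both corrections exceed the observed difference because they account for multiple changes at the same site, and they disagree with each other because they assume different patterns of substitution. That disagreement is the practical meaning of the last bullet.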



DNA Trees

  • More info in DNA than proteins.

  • Systematic 3rd position changes can confuse.

  • For distant relationships: remove 3rd positions.

  • Advice: Use DNA directly only if evolutionary distance is short.

  • Translate into protein to align

    • then copy gaps back to DNA

  • Many issues can confuse tree – Beware.
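
Copying gaps back to the DNA can be sketched as below. It assumes the coding sequence is in frame, has no stop codon at the end, and has exactly one codon per residue of the aligned protein; the function name is an assumption made for the example.

```python
def thread_gaps(aligned_protein, dna):
    """Insert '---' into dna wherever aligned_protein has a gap."""
    if len(dna) % 3:
        raise ValueError("DNA length is not a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(aa != "-" for aa in aligned_protein)
    if residues != len(codons):
        raise ValueError("protein and DNA lengths do not correspond")
    next_codon = iter(codons)
    # Each residue consumes one codon; each protein gap becomes three DNA gaps.
    return "".join("---" if aa == "-" else next(next_codon) for aa in aligned_protein)


print(thread_gaps("M-KV", "ATGAAAGTT"))  # ATG---AAAGTT
```

Because gaps are only ever inserted in multiples of three, codon boundaries are preserved and the resulting DNA alignment can be analysed by codon position.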



Things to be aware of….

  • Beware base composition bias in unrelated taxa e.g. 2 species with high G+C content will tend to group together.

  • Are sites (hairpins, CpGs?) independent? → most models assume that they are.

  • Are substitution rates equal across dataset? → if not some methods can account for this.

  • Long branches prone to error – remove them?

  • Excellent alignment = few informative sites.

  • Exclude unreliable data – toss all gaps → but also removes phylogenetically informative indels.
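
"Tossing all gaps" amounts to dropping every alignment column that contains at least one gap. A minimal sketch, with a hypothetical function name and toy data:

```python
def drop_gapped_columns(alignment):
    """Remove every column in which any sequence has a gap."""
    kept = [column for column in zip(*alignment) if "-" not in column]
    return ["".join(column[i] for column in kept) for i in range(len(alignment))]


toy = ["AC-T", "A-GT", "ACGT"]
print(drop_gapped_columns(toy))  # ['AT', 'AT', 'AT']
```

In the toy case half of the columns disappear, which illustrates the trade-off in the bullet above: the remaining data are cleaner, but any signal carried by the indels themselves is gone.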



Bootstrapping – statistical confidence in a tree.
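
The resampling step behind bootstrap values can be sketched as follows: columns of the alignment are drawn with replacement to build a pseudo-alignment of the same size. In a full analysis a tree is built from each pseudo-alignment, and the bootstrap value of a clade is the percentage of those trees that contain it; the tree-building step is left out here. The function name and toy data are assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n_columns = len(alignment[0])
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


rng = random.Random(42)  # fixed seed so the example is repeatable
toy = ["ACGT", "ACGA", "TCGA"]
for replicate in range(3):
    print(bootstrap_alignment(toy, rng))
```

Each replicate keeps the rows together column by column, so the relationships between sequences within a site are preserved while the weight given to each site varies.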



Acknowledgements

  • Thanks to Aoife McLysaght, Trinity College Dublin, Ireland for sharing some of her slides on molecular evolution with me.

  • Some of the slides were adapted from material used last year at the CBW by Prof. Fiona Brinkman, Simon Fraser University.

  • Some of the material used here was originally given as part of a course “Introduction to Bioinformatics” designed and implemented by myself and Dr. Andrew Lloyd, University College Dublin.

  • Figures for some of the slides on phylogenetics have been taken from Baldauf SL, 2003, “Phylogeny for the faint of heart: a tutorial”, Trends in Genetics 19(6).

