Alignments and phylogeny Peter Hantz EMBL Heidelberg. Evolution. Mutations: changes in the DNA sequence chemicals, physical conditions, replication errors (cca 10/replication/human genome) coding region mutations: >silent results in the same AA > mis-sense
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
changes in the DNA sequence
chemicals, physical conditions,
replication errors (cca 10/replication/human genome)
coding region mutations:
results in the same AA
results in a different AA
results in a stop codon
non-coding region mutations:
e.g. Transcription factor binding sites - developmental deficiencies
gene dublications + "neofunctionalization" of the extra copy
new function - new evolutionary pressure
Homology, orthology, paralogy
genes having a common ancestor
common ancestor, in different species
common ancestor, gene dublication in the same species
-to compare DNA or protein sequences
that diverged during evolutionary processes
Why is it good for?
-a new piece of DNA/protein: what's it?
(proteins: structural relationships)
-to trace evolutionary relationships
-to find mutations that cause diseases
-for gene cloning: degenerate PCR primer design
-forensics, whale protection...
Alignments are the first step
of building evolutionary trees
Malaria - Sickle Cell Anemia
Global: assumed to be similar overall (e.g. closely related ones)
"forces" the alignment to span the entire length of the sequences
e.g. to find evolutionary relationships
e.g. to find mutations in a gene
Local: if short pieces (motifs, domains) are similar
to identify regions of similarity within long sequences
e.g. To find coding regions in a genomic dna full with introns
Source: Gene Cloning
Global/local alignmnets: method
G: Needleman-Wunsch algorithm
L: Smith-Waterman algorithm
How do we quantify their similarity?
Measure of similarity: f(#of matches, mismatches, gaps)
How do we count for identities, mismatches, gaps?
A simple model (standard BLAST scoring matrix):
Gap creation -5
Gap extension -2
Terminal Mismatch 0
All mismatches are "similar"?
purine-purine and pirimidine-pirimidine : more probable
Score of two aligned DNA sequences (or parts of them):
TCCTAGGACTCATCGTAAGGTCCTAG - - AACCTCGTAAGG
+2+2+2 +2+2+2 -5-2 -3 -3 +2 -2 -3 +2+2 +2 +2+2+2+2 =28-7-9=14
Aligning Proteins alignments
some subtitutions less relevant than other ones
change to a similar AA: not too much change in the protein structure
change to sg. else: unprobable, it will be selkected against
letter =match, +=conservative substitution, Ø= non-cons. substitution, - =gap
Identities: same AA
"similar" AA-s, incl. identities
How "similar" are these sequences? alignments
Measure of similarity: f(#of matches, # of cons/non-cons ch, #gaps)
quantification by the "substitution matrix"
>the simplest one:the identity matrix
>A more realistic one: BlockSUbstitutionMatrix
diagoinal:prob. let like this
Gap penalties: usually -10 for gap open and -2 for gap extension
measure of similarity of a segment/the enitre length of two aligned protein sequences
high score means: good alignment/real similarity
score for a given alignment = symbol-wise score total (matrix) + gap penalty total
A A B B C C D D - - E E F
A A - - - - D D K K K E F G G
4+4 -10-2-2-2+6+6 -10-1+1+5+6 0 0 =
Significance: E-value alignments
Why score S is not enough?
To what do we compare it? Statistics is needed.
Given our query sequence (DNA or protein)
Let's take a random sequence from a hypothetical database of size D:
Can an alignment as good or better that this occur BY CHANCE?
(calculated from a random database sequence)
From the score S a so-called "Expectation value" is calculated
If the chance is tiny (E<10^-6):
unlikely that the observed alignment is due to chance alone
Note: D is very high, the E-value increases
"Progressive alignment methods" (e.g. Clustal)
Iterative methods (e.g. Muscle)
Multiple alignment of
Nitric Oxide Synthase
If two sequences have a large % of identity,
they can be interpreted to be homologous (y/n)
histons - very conservative
MHC gene pool - evolves like crazy
Scored pairwise local alignments are generated very fast
"Is my new sequence related to sg I know about?"
Also used for aligning sequences
How does it work?
In a nutshell:
-List the words of length 3 (by default) of the query:
PQGEFG >> PQG, GGE, GEF, EFG
-Scan the database sequences with "the relevant ones" of these
-try extending the exact matches, via local alignments,
until there are not too much mismatches (until a score level)
>> HSP-s (High-Scoring Segment Pairs)
-Evaluate the significance and E-value of the HSP-s
Blastn: Search a nucleotide database using a nucleotide queryWhat can my sequence be? – close relatives
(What are the sequences similar to my sequence?)
BlastP: Search protein database using a protein queryWhat can my protein be?
(What are the proteins similar to my sequence?)
Find conserved domains in the query
Find members of the protein family
Blastx: Search protein database using a translated nucleotide query
Find coding sequences in a piece of genomic DNA
Protein sequences: more conserved!
Tblastn: Search translated nucleotide database using a protein query
Find similar proteins e.g. not annotated DNA sequences
Tblastx: Search translated nucleotide database using a translated nucleotide query
Find coding sequences in a piece of genomic DNA
Note: translation is done in all 6 frames, and all of these are locally analyzed
Making Phylogenetic Trees alignments)
to show evolutinary relationships of:
genes, species (they evolve ALL the time)
even computer softwares...
Ji et al., 2008
The new rRNA-based animal phylogeny (1995) alignments)
present in all creatures
slow/fast evolving parts
Halanych et al, Aduotte et al., after 1995
The new rRNA-based Tree of Life alignments)
Woese et al., after 1990
Description of Evolutionary Trees alignments)
hypothetical ancestral organisms/genes
existing organisms/genes (Operational Taxonomy Units, OTU)
the last common ancestor of the entire group
on either side of a split, with a common ancestor
and no additional descendents
A group containing an ancestor and all of its descendants
most recent common ancestor of the group
Only the terminal nodes (OTU) exist right now
They all evolve in time!
Description of Phylogenetic Trees alignments)
branch length unscaled
branch length=amounts of evol. divergence
(horizontals doesn't count)
Biology: sequences, organisms evolve all the time
Can the # of mutations of a DNA or protein sequence correlated
with the time lapsed after the Last Common Ancestor?
If the rate is cca constant: YES
Different rates - different advantages GENE PHYLOGENY ≠ SPECIES PHYLOGENY
too fast: saturation - back mutations - underestimation of the distance:
calibration: some dated fossil records are needed
One step-one mutation
(a) MSATHC (b) ITATHC (c) ITAGHC (d) LTAAHC
A "rooted" tree: an assumption for the commn ancestor
source: Gene Cloning
Building Phylogenetic Trees alignments)
"Distance" of two sequences: "metric" in mathematics (4 axioms)
several ones: euclidean... non-euclidean...
Green: Euclidean distance
Others: "Manhattan" distance
A distance between two strings: Leveshtein distance d(IJ)
the minimum number of edits
(insertion, deletion, or substitution of a single character)
needed to transform one string into the other
Its calculation: intuitively easy, practically complicated (dynamic programing)
Example: d(kitten/6/, sitting/7/)=3
Leveshtein distance: a special case of the score:
gaps/mismatches/missing ends: 1; matches: 0
kitten → sitten ( 's' for 'k')
sitten → sittin ('i' for 'e')
sittin → sitting (insert 'g').
The Building of a Tree: the UPGMA method alignments)
Sequential clustering method
Starting point: distance matrix [d(IJ)] of the sequences (triangular m.)
-grouping the pairwise distances corresponding to the pairs of strings
with the with the smallest pairwise distance:
-node "placed" at the half of the distance
-creating a reduced matrix:
these two are "joined"
distances are re-calculated:
FIRST DO AN ALIGNMENT alignments)
Eliminate obiously wrong sequence regions
e.g. "forgotten" introns when investigating proteins, AG|GT...AG|G
e.g. obvious sequencing errors
e.g. bad sequences
Correct the distances for multiple substitutions (homoplasy)
(measure of distances: change/site, 0…1)
Building a tree
use a program...
Rooting a tree
Rooting induces a directionality
>"automatically" done by several software:
(root on the branch on equal distance from the most distant OTU-s)
by choosing an "outgroup": a homologous, but "quite far" sequence
root: on the branch between the tree and the outgroup
Problems might still appear!...
Bootstrap values: how roboust is our tree
Randomizing a bit the sequences.... does the tree/subtree persist?
Larger% - better