Alignments and phylogeny

Alignments and phylogeny Peter Hantz EMBL Heidelberg - PowerPoint PPT Presentation



Alignments and phylogeny

Peter Hantz

EMBL Heidelberg


Evolution

Mutations:

changes in the DNA sequence

chemicals, physical conditions,

replication errors (ca. 10/replication/human genome)

coding region mutations:

>silent

results in the same AA

>mis-sense

results in a different AA

>nonsense

results in a stop codon

non-coding region mutations:

e.g. Transcription factor binding sites - developmental deficiencies

gene duplications + "neofunctionalization" of the extra copy

new function - new evolutionary pressure


Evolution

Homology, orthology, paralogy

Homologous:

genes having a common ancestor

orthologous:

common ancestor, in different species

paralogous:

common ancestor, gene duplication in the same species

wikipedia


Alignments

Aims:

-to compare DNA or protein sequences

that diverged during evolutionary processes

What is it good for?

-a new piece of DNA/protein: what is it?

(proteins: structural relationships)

-to trace evolutionary relationships

(phylogenetic trees)

-to find mutations that cause diseases

-for gene cloning: degenerate PCR primer design

-forensics, whale protection...


Some applications...

Forensics

Alignments are the first step

of building evolutionary trees

A. Budd

Malaria - Sickle Cell Anemia

snp

23andme.com


Global and Local Alignments

Global: assumed to be similar overall (e.g. closely related ones)

"forces" the alignment to span the entire length of the sequences

e.g. to find evolutionary relationships

e.g. to find mutations in a gene

Local: if short pieces (motifs, domains) are similar

to identify regions of similarity within long sequences

e.g. to find coding regions in genomic DNA full of introns

Source:wikipedia



Global/local alignments: method

Dynamic Programming

G: Needleman-Wunsch algorithm

L: Smith-Waterman algorithm

Aligning DNA

|: identical

nothing: mismatch

-: gap

How do we quantify their similarity?
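
The dynamic programming idea behind both algorithms named above can be sketched in a few lines. This is a minimal sketch, not a reference implementation: the function name `alignment_score`, the default scores (match +1, mismatch -1) and the single linear gap penalty (-2 per gap position) are assumptions made for illustration. With `local=False` it fills the table the Needleman-Wunsch way (global); with `local=True` it uses the Smith-Waterman rule (no cell drops below 0, best cell anywhere wins).

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of strings a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    The linear gap penalty is an illustrative simplification.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are paid for in the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                        # global
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))   # local
```

The only differences between the two modes are the initialization of the borders, the floor at 0, and where the final score is read from.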


Aligning DNA, continued, concepts for both global and local alignments

Measure of similarity: f(#of matches, mismatches, gaps)

How do we score identities, mismatches, gaps?

A simple model (standard BLAST scoring matrix):

Identity +2

Mismatch -3

Gap creation -5

Gap extension -2

Terminal Mismatch 0

All mismatches are "similar"?

Kimura model:

purine-purine and pyrimidine-pyrimidine changes: more probable

Score of two aligned DNA sequences (or parts of them):

TCCTAGGACTCATCGTAAGG
||||||    | ||||||||
TCCTAG--AACCTCGTAAGG

+2+2+2+2+2+2 -5-2 -3-3 +2 -3 +2+2+2+2+2+2+2+2 = 30-7-9 = 14
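
The scoring of an already aligned pair can be checked mechanically. The sketch below applies the simple model from this slide (identity +2, mismatch -3, gap creation -5, gap extension -2) position by position; the function name `score_alignment` is hypothetical, and the two aligned strings are the example sequences read as two rows of 20 columns each.

```python
def score_alignment(top, bottom, match=2, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Sum the per-column scores of two aligned sequences ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # First gap column costs gap_open, each following one gap_extend.
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += match if x == y else mismatch
            in_gap = False
    return total


print(score_alignment("TCCTAGGACTCATCGTAAGG",
                      "TCCTAG--AACCTCGTAAGG"))  # 15 matches, 3 mismatches, one 2-column gap -> 14
```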


Aligning Proteins

some substitutions are less relevant than others

change to a similar AA: not too much change in the protein structure

change to something else: improbable, it will be selected against

letter =match, +=conservative substitution, Ø= non-cons. substitution, - =gap

(software: Blast)

Identities: same AA

Positives:

"similar" AA-s, incl. identities


How "similar" are these sequences?

Measure of similarity: f(#of matches, # of cons/non-cons ch, #gaps)

AA changes:

quantification by the "substitution matrix"

>the simplest one: the identity matrix

>A more realistic one: BLOcks SUbstitution Matrix

(BLOSUM62):

+ probable, - improbable,

diagonal: prob. of being left as is

Gap penalties: usually -10 for gap open and -2 for gap extension

The score:

measure of similarity of a segment/the entire length of two aligned protein sequences

high score means: good alignment/real similarity

score for a given alignment = symbol-wise score total (matrix) + gap penalty total

A   A   B   B   C   C   D   D   -   -   E   E   F   -   -
A   A   -   -   -   -   D   D   K   K   K   E   F   G   G
+4  +4  -10 -2  -2  -2  +6  +6  -10 -1  +1  +5  +6  0   0   =


Significance: E-value

Why is the score S not enough?

To what do we compare it? Statistics is needed.

Ex.

Given our query sequence (DNA or protein)

Let's take a random sequence from a hypothetical database of size D:

Can an alignment as good as or better than this occur BY CHANCE?

(calculated from a random database sequence)

From the score S a so-called "Expectation value" is calculated

E = P(S(random) ≥ S(query)) · D

If the chance is tiny (E<10^-6):

unlikely that the observed alignment is due to chance alone

Note: if D is very high, the E-value increases
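
To make the formula concrete: a minimal sketch that estimates the E-value empirically, assuming we already have a list of scores obtained by aligning the query against random sequences. The function name `empirical_e_value` and the toy numbers are hypothetical; BLAST itself uses an analytic statistical model rather than this kind of sampling.

```python
def empirical_e_value(query_score, random_scores, database_size):
    """E = P(S(random) >= S(query)) * D, with P estimated from sampled scores."""
    if not random_scores:
        raise ValueError("need at least one random score")
    hits = sum(1 for s in random_scores if s >= query_score)
    probability = hits / len(random_scores)
    return probability * database_size


# Toy numbers: 2 of 4 random scores reach the query score, database of 1000 sequences.
print(empirical_e_value(10, [1, 5, 10, 12], 1000))  # 500.0
```

The same probability gives a ten times larger E-value in a ten times larger database, which is the point of the note on D.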


Multiple alignments: more complex than the pairwise ones

"Progressive alignment methods" (e.g. Clustal)

Iterative methods (e.g. Muscle)

Multiple alignment of

Nitric Oxide Synthase

protein sequences

(P. Hantz)

Viagra!

If two sequences have a large % of identity,

they can be interpreted to be homologous (y/n)

histones - very conserved

MHC gene pool - evolves like crazy


The BLAST (Basic Local Alignment Search Tool)

Scored pairwise local alignments are generated very fast

"Is my new sequence related to something I know about?"

Also used for aligning sequences

How does it work?

In a nutshell:

-List the words of length 3 (by default) of the query:

PQGEFG >> PQG, QGE, GEF, EFG

-Scan the database sequences with "the relevant ones" of these

-try extending the exact matches, via local alignments,

as long as there are not too many mismatches (down to a score threshold)

>> HSP-s (High-Scoring Segment Pairs)

-Evaluate the significance and E-value of the HSP-s
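
The first step of the list above (listing the words of the query) is easy to sketch. The function name `query_words` is hypothetical; the default word length 3 follows the slide. The later steps (selecting the relevant words, scanning, extension) are not shown.

```python
def query_words(query, k=3):
    """All overlapping words of length k in the query, in order."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


print(query_words("PQGEFG"))  # ['PQG', 'QGE', 'GEF', 'EFG']
```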


Using it: BLAST sub-programs (beside the pairwise alignments)

Blastn: Search a nucleotide database using a nucleotide query

What can my sequence be? - close relatives

(What are the sequences similar to my sequence?)

Blastp: Search a protein database using a protein query

What can my protein be?

(What are the proteins similar to my sequence?)

Find conserved domains in the query

Find members of the protein family

Blastx: Search protein database using a translated nucleotide query

Find coding sequences in a piece of genomic DNA

Protein sequences: more conserved!

Tblastn: Search translated nucleotide database using a protein query

Find similar proteins, e.g. in unannotated DNA sequences

Tblastx: Search translated nucleotide database using a translated nucleotide query

Find coding sequences in a piece of genomic DNA

Note: translation is done in all 6 frames, and all of these are locally analyzed
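
What "all 6 frames" means can be shown with a short sketch: three reading frames on the given strand and three on its reverse complement. This is an illustration under assumptions, not how BLAST is implemented: the names `translate` and `six_frame_translations` are hypothetical, the standard genetic code is assumed, the input must contain only A, C, G, T, and `*` marks a stop codon.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Frames +1, +2, +3 on the given strand and -1, -2, -3 on the reverse complement."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCC").items():
    print(frame, protein)
```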


Making Phylogenetic Trees

Aims:

to show evolutionary relationships of:

genes, species (they evolve ALL the time)

even computer software...

A. Budd

Ji et al., 2008


The new rRNA-based animal phylogeny (1995)

Annelida

Mollusca

Platyhelm.

(1995)

Deuterostomia

Protostomia

rRNA:

present in all creatures

slow/fast evolving parts

secondary structure

Halanych et al., Adoutte et al., after 1995


The new rRNA-based Tree of Life

Woese et al., after 1990


Description of Evolutionary Trees

Internal nodes:

hypothetical ancestral organisms/genes

Terminal nodes:

existing organisms/genes (Operational Taxonomic Units, OTU)

Root:

the last common ancestor of the entire group

Sister groups:

on either side of a split, with a common ancestor

and no additional descendants

Monophyletic group

A group containing an ancestor and all of its descendants

most recent common ancestor of the group

Only the terminal nodes (OTU) exist right now

They all evolve in time!


Description of Phylogenetic Trees

Cladogram:

branch length unscaled

Phylogram:

branch length=amounts of evol. divergence

(horizontals don't count)

Biology: sequences, organisms evolve all the time

time

Molecular clocks:

Can the # of mutations of a DNA or protein sequence be correlated

with the time elapsed since the Last Common Ancestor?

If the rate is ca. constant: YES

Different rates - different advantages

GENE PHYLOGENY ≠ SPECIES PHYLOGENY

too fast: saturation - back mutations - underestimation of the distance:

A-B-C-D-E-A-...

calibration: some dated fossil records are needed


Building Phylogenetic trees: A very simple example

One step-one mutation

(a) MSATHC (b) ITATHC (c) ITAGHC (d) LTAAHC

Mutations (a)<>(b)<>(c)

Mutations (a)<>(d)<>(b)

A "rooted" tree: an assumption for the common ancestor

source: Gene Cloning


Building Phylogenetic Trees

Distance-based methods

"Distance" of two sequences: "metric" in mathematics (4 axioms)

several ones: euclidean... non-euclidean...

Green: Euclidean distance

Others: "Manhattan" distance

A distance between two strings: Levenshtein distance d(IJ)

the minimum number of edits

(insertion, deletion, or substitution of a single character)

needed to transform one string into the other

Its calculation: intuitively easy, practically complicated (dynamic programming)

Example: d(kitten/6/, sitting/7/)=3

Levenshtein distance: a special case of the score:

gaps/mismatches/missing ends: 1; matches: 0

kitten → sitten ( 's' for 'k')

sitten → sittin ('i' for 'e')

sittin → sitting (insert 'g').
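
The dynamic programming calculation of this edit distance can be sketched as follows. It is a minimal sketch with a hypothetical function name `levenshtein`; it keeps only two rows of the table and charges 1 for each insertion, deletion or substitution and 0 for a match, as defined above.

```python
def levenshtein(s, t):
    """Minimum number of single-character edits turning s into t."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]  # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```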


The Building of a Tree: the UPGMA method

Sequential clustering method

Starting point: distance matrix [d(IJ)] of the sequences (triangular m.)

-grouping the pair of strings

with the smallest pairwise distance:

-node "placed" at the half of the distance

-creating a reduced matrix:

these two are "joined"

distances are re-calculated:
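
The clustering loop described above can be sketched as follows. Assumptions for illustration: the function name `upgma`, the input format (string labels plus a dictionary of pairwise distances), the output as nested tuples `(left, right, node_height)`, and the toy distances in the example. The re-calculation step uses the usual UPGMA rule, the average of the two old distances weighted by cluster size.

```python
def upgma(names, distances):
    """Build a UPGMA tree as nested tuples (left, right, node_height).

    names:     list of string labels
    distances: dict mapping (label_a, label_b) -> distance, one entry per pair
    """
    clusters = {name: (name, 1) for name in names}  # id -> (subtree, number of leaves)
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(clusters) > 1:
        # Group the pair with the smallest pairwise distance.
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        height = d[pair] / 2  # node "placed" at half of the distance
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        new_id = f"({a},{b})"
        # Reduced matrix: distances to the joined pair are re-calculated
        # as the size-weighted average of the two old distances.
        for other in clusters:
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            d[frozenset((new_id, other))] = (n_a * d_a + n_b * d_b) / (n_a + n_b)
        del d[pair]
        clusters[new_id] = ((tree_a, tree_b, height), n_a + n_b)
    (tree, _), = clusters.values()
    return tree


# Toy distance matrix (made up for illustration).
print(upgma(["A", "B", "C"], {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6}))
# (('A', 'B', 1.0), 'C', 3.0)
```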



FIRST DO AN ALIGNMENT

Pre-processing:

Eliminate obviously wrong sequence regions

e.g. "forgotten" introns when investigating proteins, AG|GT...AG|G

e.g. obvious sequencing errors

e.g. bad sequences

Correct the distances for multiple substitutions (homoplasy)

(measure of distances: change/site, 0…1)

Building a tree

use a program...

Rooting a tree

Rooting induces a directionality

>"automatically" done by several programs:

midpoint rooting

(root on the branch at equal distance from the two most distant OTUs)

>"by hand"

by choosing an "outgroup": a homologous, but "quite far" sequence

root: on the branch between the tree and the outgroup

Problems might still appear!...
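
One common way to do the correction for multiple substitutions mentioned in the pre-processing steps above is the Jukes-Cantor formula. The slides do not name a specific model, so its use here is an assumption for illustration, as is the function name `jukes_cantor_distance`. The input p is the observed fraction of differing sites (change/site, between 0 and 1); the output is the estimated number of substitutions per site.

```python
import math


def jukes_cantor_distance(p):
    """Corrected distance d = -3/4 * ln(1 - 4p/3) for DNA (Jukes-Cantor model).

    p is the observed proportion of differing sites; the formula is only
    defined for p < 0.75, where the sequences are not yet saturated.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.3, 0.6):
    print(p, round(jukes_cantor_distance(p), 3))
```

The corrected distance is always at least as large as the observed one, and the gap widens as p grows, which is the underestimation caused by saturation.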


Bootstrap values: how robust is our tree?

Randomizing the sequences a bit... does the tree/subtree persist?

Larger % - better

Resample datasets (with replacement): Sample1, Sample2, Sample3, ..., Sample99, Sample100

[Figure: alignment columns of Taxa1-Taxa4 resampled with replacement into each sample; the result: a support value (60%) on a branch]

T. Larsson
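
The resampling step can be sketched as follows. Assumptions for illustration: the function names, the toy alignment, and the callback `clade_present`, a hypothetical stand-in for "build a tree from this resampled alignment and check whether the subtree of interest is in it". Columns are drawn with replacement, so each sample has the original length but some columns appear several times and others not at all.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; alignment maps taxon -> aligned sequence."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in alignment.items()}


def bootstrap_support(alignment, clade_present, replicates=100, seed=0):
    """Percentage of resampled datasets in which clade_present(sample) is True."""
    rng = random.Random(seed)
    hits = sum(bool(clade_present(bootstrap_alignment(alignment, rng)))
               for _ in range(replicates))
    return 100.0 * hits / replicates


# Toy alignment (made up for illustration).
toy = {"Taxa1": "GATAGACATG", "Taxa2": "GATAGACATG",
       "Taxa3": "CTATCAGTAC", "Taxa4": "CTATCAGTTC"}
sample = bootstrap_alignment(toy, random.Random(1))
print(sample)
```

A real analysis would pass a function that rebuilds the tree for every sample; the percentage returned is what gets written on the corresponding branch.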

