Distance methods: p distances and the least squares (LS) approach

slide3

General concept of distance based methods

  • Two steps:
  • Compute a distance D(i,j) between any two sequences i and j.
  • Find the tree that agrees most with the distance table.
slide4

Simplest distance: the “p” distance

SEQ1 AACAAGCG

SEQ2 AACGAGCA

There are 2 differences, so the distance = 2.

The problem is that now, if you have a longer pair of sequences:

SEQ3 AACAAGCGCCCTCAGTCCGCTCGCACAA

SEQ4 AACGAGCACCCTCAGTCCGCTCGCACAA

The number of differences is still 2, but the distance between 3 and 4 should be smaller than the distance between 1 and 2, because the same 2 differences are spread over 28 positions rather than 8.

slide5

Simplest distance: the “p” distance

SEQ1 AACAAGCG

SEQ2 AACGAGCA

There are 2 differences, the length = 8, so the distance is 2/8

This is called the p distance.
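
The p distance can be sketched in a few lines of Python. This is a minimal illustration that assumes the two sequences are already aligned and of equal length; the function name p_distance is hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


short = p_distance("AACAAGCG", "AACGAGCA")  # SEQ1 vs SEQ2: 2/8
long_ = p_distance("AACAAGCGCCCTCAGTCCGCTCGCACAA",
                   "AACGAGCACCCTCAGTCCGCTCGCACAA")  # SEQ3 vs SEQ4: 2/28
```

Dividing by the length is what makes the longer pair come out closer than the shorter pair, even though both pairs differ at 2 positions.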

slide6

Distance estimation

There are better and more accurate methods to compute the distance D(i,j) between any two sequences i and j. For example, one can take into account different probabilities between transitions and transversions…
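
As one illustration of such a correction, the sketch below uses the Kimura two-parameter (K2P) distance, d = -0.5 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The choice of K2P is an assumption made here for illustration; the slides do not name a specific model, and the function name is hypothetical.

```python
import math

PURINES = {"A", "G"}  # the remaining bases, C and T, are pyrimidines


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance for two aligned DNA sequences."""
    n = len(seq_a)
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        # A<->G and C<->T stay within one chemical class: transitions.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    p = transitions / n
    q = transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


d12 = k2p_distance("AACAAGCG", "AACGAGCA")  # both differences are transitions
```

For SEQ1 and SEQ2 the corrected distance is larger than the p distance of 0.25, because the model allows for multiple changes at the same position.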

slide7

From a distance table to a tree

Each tree has branch lengths from which a “predicted” set of distances d(i,j) can be computed (lowercase d denotes the distance along the branches, unlike the observed pairwise distances D).

[Figure: three-taxon tree with branch lengths Human = 0.3, Gorilla = 0.41, Chimp = 0.25]

d(Human,Chimp) = 0.3 + 0.25 = 0.55

d(Human,Gorilla) = 0.3 + 0.41 = 0.71

d(Chimp,Gorilla) = 0.25 + 0.41 = 0.66

slide8

From a distance table to a tree

The question is: can we find branch lengths so that the d’s are equal to the D’s?

[Figure: three-taxon tree with unknown branch lengths X (Human), Y (Gorilla), Z (Chimp)]

D(Human,Chimp) = 0.3

D(Human,Gorilla) = 0.4

D(Chimp,Gorilla) = 0.5

slide9

From a distance table to a tree

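[Figure: three-taxon tree with unknown branch lengths X (Human), Y (Gorilla), Z (Chimp)]

D(Human, Chimp) = 0.3

D(Human, Gorilla) = 0.4

D(Chimp, Gorilla) = 0.5

d(Human, Chimp) = X+Z

d(Human, Gorilla) = X+Y

d(Chimp, Gorilla) = Y+Z

X+Z = 0.3

X+Y = 0.4

Y+Z = 0.5

Subtracting the first equation from the second gives Y-Z = 0.1. Together with Y+Z = 0.5 we get:

Y = 0.3

Z = 0.2

X = 0.1

So the answer is YES.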

slide10

Is there always a solution?

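[Figure: three-taxon tree with unknown branch lengths X (Human), Y (Gorilla), Z (Chimp)]

D(Human, Chimp) = D1

D(Human, Gorilla) = D2

D(Chimp, Gorilla) = D3

d(Human, Chimp) = X+Z

d(Human, Gorilla) = X+Y

d(Chimp, Gorilla) = Y+Z

X+Z = D1

X+Y = D2

Y+Z = D3

We get 3 independent linear equations with 3 variables: there’s always a solution! (A branch length can come out negative, though, if the D’s violate the triangle inequality.)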

slide11

Exercise

[Figure: three-taxon tree with unknown branch lengths X (Human), Y (Gorilla), Z (Chimp)]

D(Human, Chimp) = D1

D(Human, Gorilla) = D2

D(Chimp, Gorilla) = D3

d(Human, Chimp) = X+Z

d(Human, Gorilla) = X+Y

d(Chimp, Gorilla) = Y+Z

X+Z = D1

X+Y = D2

Y+Z = D3

Show that for a 3-taxon tree there’s always a solution, and that it is given by:

X = 0.5(D1+D2-D3)

Y = 0.5(D2+D3-D1)

Z = 0.5(D1-D2+D3)
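
The closed-form solution can be checked numerically. The sketch below plugs in the observed distances from the earlier Human, Chimp, Gorilla example; the function name is hypothetical.

```python
def three_taxon_branches(d1, d2, d3):
    """Branch lengths (X, Y, Z) for D1 = X+Z, D2 = X+Y, D3 = Y+Z."""
    x = 0.5 * (d1 + d2 - d3)
    y = 0.5 * (d2 + d3 - d1)
    z = 0.5 * (d1 - d2 + d3)
    return x, y, z


# D(Human,Chimp) = 0.3, D(Human,Gorilla) = 0.4, D(Chimp,Gorilla) = 0.5
x, y, z = three_taxon_branches(0.3, 0.4, 0.5)
```

Up to floating-point rounding this gives X = 0.1, Y = 0.3, Z = 0.2, the same values obtained by solving the equations by hand.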

slide12

Is there always a solution??

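[Figure: unrooted four-taxon tree with leaves A, B, C, D and five branches V, W, X, Y, Z]

5 variables, 6 equations: it might be that there’s no solution.

D(A, B) = 2   D(A, D) = 3

D(A, C) = 3   D(B, C) = 3

D(B, D) = 3   D(C, D) = 4

An example where the equations cannot all be satisfied with positive branch lengths: v=w=x=y=z=1 solves the first 5 equations, but predicts d(C,D) = 2 rather than 4. (These particular distances fit exactly only if the internal branch has length 0, with terminal branches 1, 1, 2, 2 for A, B, C, D.)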
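
The mismatch can be made explicit with a short check. The sketch assumes the topology in which A groups with B and C groups with D (the grouping implied by the first five equations when all branches equal 1); the helper name predicted is hypothetical.

```python
observed = {
    ("A", "B"): 2, ("A", "D"): 3, ("A", "C"): 3,
    ("B", "C"): 3, ("B", "D"): 3, ("C", "D"): 4,
}


def predicted(pair, leaf, internal):
    """Path length between two leaves of the assumed tree ((A,B),(C,D))."""
    x, y = pair
    same_side = {x, y} in ({"A", "B"}, {"C", "D"})
    return leaf[x] + leaf[y] + (0 if same_side else internal)


# All five branches set to 1, as on the slide.
leaf = dict.fromkeys("ABCD", 1)
residuals = {pair: predicted(pair, leaf, 1) - d for pair, d in observed.items()}
```

Five of the residuals are 0 and the one for (C, D) is -2, so this choice of branch lengths satisfies the first five equations but not the sixth.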

slide13

Is there always a solution??

In real life, for n>3 sequences, there is almost never an exact solution.

One might try to find the “best” solution.

slide14

Is there always a solution??

The simplest case where it might be that equations have no solution: two equations with 1 parameter

a = 2

a = 3

We want to find the “best” approximate solution to these equations.

slide15

Is there always a solution??

Putting it another way:

a – 2 = 0

a – 3 = 0

Let’s replace the zeros with error terms:

a – 2 = e1

a – 3 = e2

Ideally, we want e1 and e2 to be as small as possible (e1=e2=0 would be best).

slide16

The least square solution

a – 2 = e1

a – 3 = e2

We want the distance of the point (e1,e2) from (0,0) to be the smallest.

I.e., we want to find the “a” for which:

sqrt(e1^2 + e2^2) is minimal.

slide17

The least square solution

The term sqrt(e1^2 + e2^2) reaches its minimum when the term e1^2 + e2^2 reaches its minimum.

So for:

a – 2 = e1

a – 3 = e2

we want to minimize: (a-2)^2 + (a-3)^2

slide18

The least square solution

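Min[(a-2)^2 + (a-3)^2] =

Min[2a^2 - 10a + 13] =

Min[2a^2 - 10a] =

Min[a^2 - 5a].

(Adding a constant or multiplying by a positive constant does not change where the minimum is.) a^2 - 5a is a parabola that crosses the X axis at a=0 and a=5, and its minimum is at a=2.5.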

slide19

Is there always a solution???

So for the simplest case of two equations with 1 parameter

a = 2

a = 3

The “best” solution is a = 2.5, which makes sense: it is the average of 2 and 3.
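
The same result can be sketched in code. For equations of the form a = t1, a = t2, ..., the sum of squared errors is minimized by the mean of the targets (set the derivative 2*sum(a - t) to zero). The function names below are hypothetical.

```python
def sum_of_squares(a, targets):
    """Sum of squared errors e_k = a - t_k over all equations a = t_k."""
    return sum((a - t) ** 2 for t in targets)


def least_squares_constant(targets):
    """Value of a that minimizes sum_of_squares: the mean of the targets."""
    return sum(targets) / len(targets)


targets = [2, 3]
best_a = least_squares_constant(targets)   # 2.5
best_q = sum_of_squares(best_a, targets)   # 0.25 + 0.25 = 0.5
```

Any other value of a, for example 2.4 or 2.6, gives a larger sum of squares.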

slide20

Back to phylogeny

We have the D’s (“observed distances”), and we want to find the d’s (the distances predicted from the branch lengths) that minimize the expression

Q = Σ_{i<j} (D(i,j) - d(i,j))^2

slide21

Back to phylogeny

For each tree topology we get a different Q. The least squares (LS) method searches for the tree with the lowest Q.

slide22

Back to phylogeny

The general formula for LS:

Q = Σ_{i<j} w(i,j) (D(i,j) - d(i,j))^2

The w’s are weights that differ between different least squares methods.

slide23

Back to phylogeny

w’s used

Cavalli-Sforza and Edwards (1967): w(i,j) = 1

Fitch and Margoliash (1967): w(i,j) = 1/D(i,j)^2

Beyer et al. (1974): w(i,j) = 1/D(i,j)
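
The weighted criterion can be sketched as follows. The weights used here (1, 1/D^2 and 1/D) are the forms commonly attributed to these three methods and should be read as an assumption; the dictionary and function names are hypothetical. The observed D’s and predicted d’s are the Human, Chimp, Gorilla values from the earlier slides.

```python
WEIGHTS = {
    "Cavalli-Sforza and Edwards": lambda D: 1.0,
    "Fitch and Margoliash": lambda D: 1.0 / D ** 2,
    "Beyer et al.": lambda D: 1.0 / D,
}


def q_value(observed, predicted, weight):
    """Weighted sum of squared differences between D's and d's."""
    return sum(weight(observed[pair]) * (observed[pair] - predicted[pair]) ** 2
               for pair in observed)


observed = {("Human", "Chimp"): 0.3, ("Human", "Gorilla"): 0.4,
            ("Chimp", "Gorilla"): 0.5}
predicted = {("Human", "Chimp"): 0.55, ("Human", "Gorilla"): 0.71,
             ("Chimp", "Gorilla"): 0.66}

q = {name: q_value(observed, predicted, w) for name, w in WEIGHTS.items()}
```

With unit weights Q is 0.25^2 + 0.31^2 + 0.16^2 = 0.1842. The other two weightings penalize errors on small observed distances more heavily.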

slide24

Tree search

General heuristic searches are used.

No branch-and-bound method has been published so far.

The problem was shown to be NP-complete.

slide25

Minimum Evolution

The general formula for LS:

Q = Σ_{i<j} w(i,j) (D(i,j) - d(i,j))^2

For a given topology, Minimum Evolution (ME) estimates the branch lengths using LS. But unlike LS, it chooses the topology that results in the minimal sum of branch lengths.

slide29

The Newick tree format is used to represent trees as strings

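[Figure: tree with leaves A, B, C, D in which B and D are neighbours]

In Newick format: (A,C,(B,D));

Each pair of parentheses () encloses a monophyletic group, and the commas separate the members of the corresponding group. A Newick string ends with a semicolon.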
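
To make the format concrete, here is a minimal sketch of a parser that turns a Newick string into nested Python tuples. It is an illustration only and assumes leaf names without branch lengths, quotes or whitespace; the function names are hypothetical.

```python
def parse_newick(text):
    """Parse a simple Newick string (names only) into nested tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                        # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                    # consume ","
                children.append(parse_node())
            pos += 1                        # consume ")"
            return tuple(children)
        start = pos                         # otherwise read a leaf name
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


def leaves(tree):
    """List the leaf names of a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


tree = parse_newick("(A,C,(B,D));")
```

Each tuple corresponds to one pair of parentheses, so the inner tuple holds the group made of B and D.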

slide30

Neighbor-joining is based on Star decomposition

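[Figure: star decomposition of the taxa A, B, C, D, E. At each step the best pair to group together is shown in red: first C and B are joined into (C,B), then (C,B) and E are joined into ((C,B),E), leaving A and D.]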

slide31

Neighbor-joining

The Neighbour Joining method is used for reconstructing phylogenetic trees. Both the tree topology and the branch lengths are estimated. In each stage, the two nearest nodes of the tree (the term "nearest nodes" will be defined in the following paragraphs) are chosen and defined as neighbours in our tree. This is repeated until all of the nodes are paired together.

slide32

Neighbor-joining

The algorithm was originally published by Saitou and Nei (1987). In 1988 a correction to the paper was published by Studier and Keppler. The correction was related to the main theorem behind the algorithm. Studier and Keppler also suggested a slight change to the algorithm which reduced its time complexity to O(N^3). We will first describe the original algorithm, and then elaborate on the changes made by Studier and Keppler.

slide33

OTU’s and HTU’s

Reminder:

OTU’s = operational taxonomic units, or in other words – leaves of the tree.

HTU’s = hypothetical taxonomic units, or in other words – the internal nodes of the tree.

slide34

Neighbors, we are …

[Figure: four-taxon tree in which A and B share one internal node, and C and D share another]

What are neighbours? Neighbours are defined as a pair of OTU's that are connected through a single internal node.

A and B are neighbours,

C and D are neighbours,

But…

A and C are not neighbours…

slide35

Additive trees

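In an additive tree, the distance matrix exactly reflects the tree: the distance D(i,j) between any two leaves equals the sum of the branch lengths on the path connecting them.

[Figure: four-taxon tree with leaves A, B, C, D and internal nodes X and Y]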

slide36

Additive trees

The NJ theorem: the NJ algorithm recovers the true tree, if the tree is additive.

slide37

NJ is an approximation of the Minimum evolution

In the original article, Saitou and Nei defined the two nearest nodes as the pair of nodes that give the minimal sum of branches when placed in a tree.

slide38

NJ notations:

  • First of all – some notations:
  • D(i,j) is defined as the distance between leaves i and j (the observed distance which we have as an input from our distance matrix).
  • L(x,y) is defined as the sum of branch lengths between node X and node Y. L is used as a notation for distances between internal nodes, or an internal node to a leaf.
slide39

L(x,y) notation:

We distinguish between L(X,Y) and D(A,B).

D’s are given as input to the algorithm,

L’s should be inferred…

[Figure: four-taxon tree with leaves A, B, C, D and internal nodes X and Y]

slide40

NJ step:

  • In each round we join as neighbours all possible pairs of leaves and evaluate the sum of branches for each resultant tree. This means we compare the sum of branches when 1 and 2 are joined as neighbours, denoted as S(1,2), to the sum of branches when 1 and 3 are joined as neighbours, S(1,3), and so on. We look for the pair i and j for which S(i,j) is minimal, where i and j denote leaf numbers, and i<j.
  • This is why NJ is approximating ME (minimum evolution).
slide41

Computing S(1,2)

How can we evaluate S(1,2) from the input (the distance matrix)?

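[Figure: tree in which leaves 1 and 2 are joined at internal node X, leaves 3, 4 and 5 are joined at internal node Y, and X is connected to Y]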

slide42

Computing S(1,2)

S(1,2) = L(1,X)+L(2,X)+L(X,Y)+L(Y,3)+L(Y,4)+L(Y,5)


The problem is that we don’t know the L’s. We only know the D’s…

slide43


Computing S(1,2)

S(1,2) = L(1,X)+L(2,X)+L(X,Y)+L(Y,3)+L(Y,4)+L(Y,5)

S(1,2) = D(1,2)+L(X,Y)+L(Y,3)+L(Y,4)+L(Y,5)

Since our tree is additive, we can replace L(1,X)+L(2,X), with D(1,2).

slide44

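Computing L(X,Y) in terms of the D’s

[Figure: tree in which leaves 1 and 2 are joined at X, leaves 3, 4 and 5 are joined at Y, and X is connected to Y]

L(X,Y) = 1/(2(N-2)) * [ Σ_{k=3..N} (D(1,k) + D(2,k)) - (N-2)(L(1,X) + L(2,X)) - 2 Σ_{i=3..N} L(i,Y) ]

N denotes the number of leaves.

L(1,X) is counted N-2 times in the first sum (once in each D(1,k)), and -L(1,X) is counted N-2 times in the second term. So L(1,X) is canceled out, and the same holds for L(2,X).

slide45

Computing L(X,Y) in terms of the D’s

L(3,Y) is counted once in D(1,3) and once in D(2,3), and -L(3,Y) is counted 2 times in the last term. So L(3,Y) is canceled out, and the same holds for L(4,Y) and L(5,Y).

slide46

Computing L(X,Y) in terms of the D’s

L(X,Y) is counted N-2 times in the D(1,k) terms and N-2 times in the D(2,k) terms. So L(X,Y) is counted altogether 2(N-2) times. Dividing by 2(N-2) we get L(X,Y).

slide47

Computing L(X,Y) in terms of the D’s

We still have to replace the term Σ_{i=3..N} L(i,Y) by the D’s.

slide48

Computing L(X,Y) in terms of the D’s

In the sum Σ_{3<=i<j<=N} D(i,j), L(3,Y) is counted N-3 times: once in D(3,4), once in D(3,5), and so on up to D(3,N). Hence

Σ_{i=3..N} L(i,Y) = 1/(N-3) * Σ_{3<=i<j<=N} D(i,j)

slide49

Computing L(X,Y) in terms of the D’s

Substituting into S(1,2) = D(1,2) + L(X,Y) + Σ_{i=3..N} L(i,Y) gives

S(1,2) = 1/(2(N-2)) * Σ_{k=3..N} (D(1,k) + D(2,k)) + D(1,2)/2 + 1/(N-2) * Σ_{3<=i<j<=N} D(i,j)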

slide51

[Figure: the star tree on leaves 1 to 5, next to the tree in which leaves 1 and 2 are joined at node X]

Finding the best neighbor

So, we compute S(1,2), S(1,3), … , S(4,5) and join the two leaves i and j for which S(i,j) is minimal.

Let’s assume that S(1,2) is minimal in round 1…

We call the new node that joins 1 and 2, X.
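
The selection step can be sketched in Python. The sketch assumes the closed form for S(i,j) from the original Saitou and Nei formulation, S(i,j) = 1/(2(N-2)) Σ_k (D(i,k) + D(j,k)) + D(i,j)/2 + 1/(N-2) Σ_{k<l} D(k,l), where k and l run over the leaves other than i and j. The distance matrix is a hypothetical additive example (leaves numbered from 0), and the function names are not from the slides.

```python
from itertools import combinations


def s_value(D, i, j):
    """Sum of branch lengths S(i,j) when leaves i and j are joined."""
    n = len(D)
    others = [k for k in range(n) if k not in (i, j)]
    to_pair = sum(D[i][k] + D[j][k] for k in others)
    among_rest = sum(D[k][l] for k, l in combinations(others, 2))
    return to_pair / (2 * (n - 2)) + D[i][j] / 2 + among_rest / (n - 2)


def best_pair(D):
    """Pair (i, j), i < j, with the minimal S(i,j)."""
    return min(combinations(range(len(D)), 2), key=lambda p: s_value(D, *p))


# Hypothetical additive distances: leaves 0 and 1 are neighbours,
# leaves 3 and 4 are neighbours, leaf 2 hangs off the middle.
D = [
    [0, 3, 5, 5, 8],
    [3, 0, 6, 6, 9],
    [5, 6, 0, 6, 9],
    [5, 6, 6, 0, 5],
    [8, 9, 9, 5, 0],
]
pair = best_pair(D)
```

Note that the pair with the smallest raw distance here is (0, 1), with D = 3, while the pair with the smallest S is (3, 4). Both are true neighbours in the tree that generated these distances, which illustrates that NJ chooses by S(i,j) and not by D(i,j).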

slide52

Finding the best neighbor

[Figure: the tree with leaves 1 and 2 joined at X, next to the reduced star tree on the nodes 3, 4, 5 and 12]

For the next step of the algorithm, we need to create a distance table of (N-1)x(N-1). Let 12 denote the new node that joins 1 and 2. We define:

slide53

Branch lengths

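[Figure: tree with leaves 1 to 5 and internal nodes X, Y and Z; the branches computed at this step are drawn in red]

Only the branches in red are being computed.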

slide54

Branch lengths

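[Figure: star tree on the nodes 3, 4, 5 and 12, next to the tree in which 12 and 5 are joined at node X and 3 and 4 are joined at node Y]

If (12) and (5) are now joined, this is equivalent to joining (3) and (4). So we can already compute the branch lengths L((12),X), L(5,X), L(3,Y) and L(4,Y).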

slide56

Complexity of computing S(1,2)

This part requires O(N^2) computations

slide57

Complexity of the original NJ algorithm

Computing each S(i,j) takes O(N^2) computations.

There are O(N^2) pairs (i,j),

and N joining steps.

Altogether, the algorithm is O(N^5).

slide58

More things to know about the NJ algorithm

  • Studier and Keppler introduced a way to reduce the complexity of the algorithm from O(N^5) to O(N^3).
  • The NJ-theorems were not presented.
  • BioNJ is a close relative to NJ, but with a slightly better performance.
  • NJ constraints.
slide61

Minimum evolution

In minimum evolution branch lengths are computed by the LS method for each possible tree topology.

However, the criterion to choose among tree topologies is not the lowest sum-of-squares, but rather the minimum sum of branch lengths.

slide62

Molecular clocks

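Branch lengths measure the average number of replacements per position. A branch length is thus equal to the number of replacements per position per year, multiplied by the number of years.

Putting it another way: branch length = r × t, where r is the rate (replacements per position per year) and t is the time (years).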

slide63

Molecular clocks

[Figure: Human and Mouse tree]

Clearly, the time t from the root to the tips is the same for all sequences. However, the rate r can differ, and might depend on factors such as the DNA repair mechanisms, generation time, and much more.

slide64

Molecular clocks

[Figure: Human and Mouse tree drawn twice, WITH CLOCK and WITHOUT CLOCK]

A molecular clock is the assumption that the rate is approximately the same in all species. Clearly, this is not the general case, but it might be true, for example, when comparing very closely related species of ants. If the rate is the same, the branch lengths from the root to each tip should be the same too.

slide65

Two kinds of tree search methods

Methods like least-squares, maximum parsimony, minimum evolution and maximum likelihood have an explicit criterion which they try to maximize or minimize.

There are some other methods (UPGMA, WPGMA, NJ) that apply a direct algorithm that results in a tree. These methods are usually very fast, but their statistical justification is unclear. They are usually some kind of clustering algorithm.

slide66

Ultrametric

Trees which satisfy a molecular clock are called ultrametric.

When trees are ultrametric it is very easy to estimate the LS branch lengths (Farris 1969a).

slide67

UPGMA

UPGMA is one such direct method, receiving as input a distance matrix and giving as output an ultrametric tree.

It was suggested by Sokal and Michener (1958).

NOT TO BE USED, UNLESS YOU NEED A VERY FAST METHOD, AND YOU ARE SURE THE TREE IS ULTRAMETRIC!

slide68

UPGMA

The algorithm:

Input: a distance matrix D which is symmetric, i.e., D(i,j)=D(j,i).

Variables: for each group of species we keep a number which indicates how many species are in this group. n(i) will indicate the number of species in group i. Initially, n(i)=1 for every sequence.

slide69

UPGMA

  • The algorithm:
  • 1. Find the i and j that have the smallest D(i,j).
  • 2. Create a new group (ij) which has n(ij)=n(i)+n(j).
  • 3. Connect i and j to a new node (which corresponds to the new group (ij)). Give each of the two branches, connecting i to (ij) and j to (ij), a length of D(i,j)/2.
slide70

UPGMA

  • The algorithm:
  • 4. Compute the distance between the new group and all other groups (except for i and j) by using: D((ij),k) = [n(i)/(n(i)+n(j))] D(i,k) + [n(j)/(n(i)+n(j))] D(j,k)
slide71

UPGMA

  • The algorithm:
  • 5. Delete the columns and rows of the data (modified input) matrix that correspond to groups i and j, and add a column and row for group (ij).
  • 6. Go to step 1, unless there is only 1 item left in the data matrix.
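
The six steps can be sketched as a short Python function. This is an illustrative implementation, not an optimized one: it rescans all distances in every round, and it records branch lengths as node height minus child height so that the result stays ultrametric once groups contain more than one species. Step 4 is assumed to be the size-weighted average of D(i,k) and D(j,k). The input matrix is a hypothetical ultrametric example, and all names are made up for this sketch.

```python
from itertools import combinations


def upgma(names, matrix):
    """Return the merges as (child_i, child_j, branch_i, branch_j)."""
    size = {name: 1 for name in names}        # n(i) for every group
    height = {name: 0.0 for name in names}    # leaves sit at height 0
    dist = {frozenset((a, b)): matrix[i][j]
            for (i, a), (j, b) in combinations(enumerate(names), 2)}
    merges = []
    while len(size) > 1:
        pair = min(dist, key=dist.get)        # step 1: smallest D(i,j)
        i, j = sorted(pair)
        d_ij = dist.pop(pair)
        new = f"({i},{j})"                    # step 2: new group (ij)
        h = d_ij / 2                          # step 3: node height
        merges.append((i, j, h - height[i], h - height[j]))
        n_i, n_j = size.pop(i), size.pop(j)   # step 5: drop i and j
        for k in size:                        # step 4: update distances
            d_ik = dist.pop(frozenset((i, k)))
            d_jk = dist.pop(frozenset((j, k)))
            dist[frozenset((new, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        size[new] = n_i + n_j
        height[new] = h
    return merges                             # step 6: loop until 1 group


names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
merges = upgma(names, matrix)
```

For this input A and B are joined first with branches of length 1, then C is added at height 3, and finally D at height 5.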
slide72

Complexity

O(n^3), because it takes O(n^2) to find the minimum D(i,j) in a matrix and you have n iterations of that.

However, we can keep a record of the smallest entry in each row, and then finding the minimum goes down to O(n).

Thus, the overall time complexity is O(n^2).

slide73

An example

Distances based on immunological data of Sarich (1969).

slide74

The players

Canis familiaris

Common name = Dog.

The species = familiaris.

Genus = Canis. [First letter always in capital]

Family = Canidae. [First letter always in capital]

Order = Carnivora. [First letter always in capital]

Class = Mammalia. [First letter always in capital]

Phylum = Chordata. [First letter always in capital]

Kingdom = Metazoa [=Multi-cellular organism. First letter always in capital]

slide75

The players

Ursus americanus

Common name = bear.

The species = americanus.

Genus = Ursus .

Family = Ursidae.

Order = Carnivora.

slide76

The players

Procyon lotor

Common name = raccoon.

The species = lotor.

Genus = Procyon.

Family = Procyonidae.

Order = Carnivora.

slide77

The raccoon (דביבון)

  • Reddish-brown above and black or greyish below.
  • Bushy tail with 4-6 black or brown rings
  • Black mask outlined in white
  • Small ears
  • The feet and forepaws are dexterous
slide78

The raccoon (דביבון)

  • Native to the southern part of the Canadian provinces and most of the United States
  • Most common along stream edges, open forests and coastal marshes
slide79

The raccoon (דביבון)

  • Inhabit hollow trees and logs and often use the ground burrows of other animals for raising their young or for sleeping during the coldest part of the winter months.
  • An average of 4-5 young are born in April-May; the mother at first carries them by the nape of the neck like a cat; they are weaned by late summer.
  • Omnivorous, it feeds on grapes, nuts, grubs, crickets, small mammals, birds' eggs and nestlings.
  • Often seen washing their food, the raccoon is actually feeling for matter that should be rejected as wetting the paws enhances its sense of feel.
  • Winter is the raccoons’ greatest enemy when food is scarce.

HEBREW: Nape = “OREF” ; Grub = “Zachal” ; Nestling = “Gozal”

slide80

The players

Mustela nivalis

Common name = weasel.

Order = Carnivora.

In Hebrew (Samor)

slide81

The players

The color of the weasel is chocolate brown on its back side and white with brown spots on its underparts. The summer coat is about 1 cm in length. The winter coat, which is about 1.5 cm in length, turns to all white in northern populations and remains brown in the southern populations.

slide82

The players

The body of the least weasel is long and slender, with a long neck; a flat, narrow head; short limbs. This animal has large black eyes and large, round ears. The weasel's feet have five fingers with sharp claws. Breeding can occur throughout the year, but most of the breeding occurs in the spring and late summer. Gestation in the least weasel lasts from 34 - 37 days. Litters may range from 1 - 7.

slide83

The players

A higher number of offspring per litter can be found in northern populations. Newborns weigh from 1.1 g to 1.7 g and are wrinkled, pink, naked, blind, and deaf. After 49 - 56 days, they have reached their adult length. By week 6, the males are larger than the females. In 9 - 12 weeks family groups begin to break up, and in 12 - 15 weeks the weasels reach their adult mass.

slide84

The players

The young spend their time play fighting and play mating. Weasels watch the movement of their prey before they attack. When they kill, they go for the neck of the victim.

Distribution:

Europe, northern Africa, Asia, North America; introduced to New Zealand

Diet:

Rodents, birds

slide86

The players

In Hebrew: “Kelev-Yam”

Phoca vitulina

Common name = Harbor seal.

Order = Carnivora.

slide87

The players

Eumetopias jubatus

Common name = Steller sea lion.

Order = Carnivora.

slide88

The players

In Hebrew:

Arye-Yam

slide89

The players

Felis catus

Common name = cat.

Order = Carnivora.

slide90

The players

Pan troglodytes

Common name = chimpanzee.

Order = Primates.

slide93

Starting tree

The distance between seal and sea lion was 24, so each branch has a length of 12.

[Figure: node ss at height 12, joined to the leaves seal and sea lion by branches of length 12]

We call the parent node of seal and sea lion “ss”.

slide94

Removing the seal and sea-lion rows and columns,

and adding the ss row and columns

slide95

Computing dog-ss distance

Here, i=seal, j=sea lion, k = dog.

n(i)=n(j)=1.

D(ss,dog) = 0.5D(sea lion,dog) + 0.5D(seal,dog) = 49.

slide97

Starting tree

Distance between bear and raccoon was 26, so each branch has a length of 13.

[Figure: node ss at height 12 above seal and sea lion; node br at height 13 above bear and raccoon]

We call the parent node of bear and raccoon “br”.

slide98

Computing br-ss distance

Here, i=raccoon, j=bear, k = ss.

n(i)=n(j)=1. D(br,ss) = 0.5D(bear,ss)+0.5D(raccoon,ss)=37.5.

slide100

Starting tree

Distance between br and ss was 37.5, so the new node br-ss is at height 18.75. But this is the distance from br-ss to the leaves. The distance from br-ss to ss is 18.75-12 = 6.75. The distance from br-ss to br is 18.75-13 = 5.75.

[Figure: node brss joined to ss by a branch of length 6.75 and to br by a branch of length 5.75; ss is 12 above seal and sea lion, br is 13 above bear and raccoon]

slide101

Computing dog-(br-ss) distance

Here, i = br, j = ss, k = dog.

n(i)=n(j)=2. D( brss , dog ) = 0.5D( br , dog ) + 0.5D( ss , dog )=44.5.

slide103

Starting tree

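Distance between br-ss and weasel (w) was 39.5, so wbrss is placed at height 19.75. The distance from wbrss to br-ss is thus 19.75-18.75 = 1.

[Figure: tree with node heights wbrss = 19.75, br-ss = 18.75, br = 13, ss = 12; the leaves weasel, seal, sea lion, bear and raccoon are at height 0]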

slide104

Computing dog-wbrss distance

Here, i = br-ss, j = weasel, k = dog.

n(i)=4, n(j)=1. D( wbrss , dog ) = 0.8D( br-ss , dog ) + 0.2D( weasel , dog )=

44.5*8/10+51*2/10 = (356+102)/10=45.8
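
The arithmetic of this step can be reproduced with a small helper. The sketch uses only the numbers quoted above; the function name is hypothetical.

```python
def upgma_distance(n_i, d_ik, n_j, d_jk):
    """Size-weighted average distance from the merged group (ij) to k."""
    return (n_i * d_ik + n_j * d_jk) / (n_i + n_j)


# i = br-ss (4 species), j = weasel (1 species), k = dog
d_wbrss_dog = upgma_distance(4, 44.5, 1, 51)

# The node joining dog to wbrss sits at half that distance;
# wbrss itself is at height 19.75.
node_height = d_wbrss_dog / 2
branch_to_wbrss = node_height - 19.75
```

This gives a distance of 45.8, a node height of 22.9 and a branch of length 3.15 from the new node down to wbrss, up to floating-point rounding.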

slide106

Starting tree

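Distance between wbrss and dog was 45.8, so dwbrss is placed at height 22.9. The distance from dwbrss to wbrss is thus 22.9-19.75 = 3.15.

[Figure: tree with node heights dwbrss = 22.9, wbrss = 19.75, br-ss = 18.75, br = 13, ss = 12; the leaves weasel, seal, sea lion, bear, raccoon and dog are at height 0]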

slide108

Starting tree

Distance between dwbrss and cat was 89.833, so cdwbrss is placed at height 44.9165. The distance from cdwbrss to dwbrss is thus 44.9165-22.9 = 22.0165.

[Figure: tree with node heights cdwbrss = 44.9165, dwbrss = 22.9, wbrss = 19.75, br-ss = 18.75, br = 13, ss = 12; the leaves cat, weasel, seal, sea lion, bear, raccoon and dog are at height 0]

slide110

Starting tree

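Distance between cdwbrss and chimp was 144.2857, so THE ROOT is placed at height 72.14285. The distance from the root to cdwbrss is thus 72.14285-44.9165 = 27.22635.

[Figure: complete tree with node heights root = 72.14285, cdwbrss = 44.9165, dwbrss = 22.9, wbrss = 19.75, brss = 18.75, br = 13, ss = 12; the leaves cat, weasel, seal, sea lion, bear, raccoon, dog and chimp are at height 0]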

slide111

Problems with UPGMA, when the data is not clock-like

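Assume that this is the true tree:

[Figure: true tree on the taxa A, B, C, D with unequal branch lengths (labels 13, 10, 4, 4, 2, 2), followed by the corresponding “true” distance matrix]

In this case, B and C will be clustered first, which is wrong!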
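
The effect can be reproduced with a small hypothetical tree. The sketch assumes the rooted tree ((A:13,B:4):2,(C:4,D:10):2), which is one way of assigning the branch labels 13, 10, 4, 4, 2, 2 and is not confirmed as the tree on the slide. In this tree A and B are sister taxa, but their branches differ a lot in length.

```python
from itertools import combinations

# Assumed (hypothetical) non-clock-like tree ((A:13,B:4):2,(C:4,D:10):2).
leaf_branch = {"A": 13, "B": 4, "C": 4, "D": 10}
side = {"A": "left", "B": "left", "C": "right", "D": "right"}
internal = {"left": 2, "right": 2}


def tree_distance(x, y):
    """Sum of the branch lengths on the path between leaves x and y."""
    d = leaf_branch[x] + leaf_branch[y]
    if side[x] != side[y]:
        d += internal["left"] + internal["right"]
    return d


D = {pair: tree_distance(*pair) for pair in combinations("ABCD", 2)}
closest = min(D, key=D.get)   # the pair UPGMA would cluster first
```

The smallest entry of the distance matrix is D(B,C) = 12, so UPGMA starts by joining B and C, although B's true neighbour is A (D(A,B) = 17).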

slide112

Gene,

Volume 397, Issues 1-2, 1 August 2007, Pages 76-83

slide114

Networks

[Figure: network connecting the taxa a, b, c, d, e]

A network is sometimes used instead of a tree to represent an evolutionary history in which recombination occurred.

slide117

Known phylogenies

The best way to test different methods of phylogenetic reconstruction is by using trees that are known to be true from other sources…

Problem: known phylogenies are very rare.

Known phylogeny: laboratory animals, crop plants (and even those are often suspicious). Also, their evolutionary rate is very slow…

slide118

Known phylogenies

David Hillis and colleagues have created “experimental” phylogenies in the lab.

slide119

Known phylogenies

The first paper (1992) analyzed phylogeny reconstruction based on restriction sites analysis.

slide120

Known phylogenies

Later, bacteriophage T7 was used. It was subdivided into cultures in the presence of a mutagen. Then they sequenced the final cultures and gave the sequences as input to a few phylogenetic reconstruction methods. The tree output of these methods was then compared to the true tree.

slide121

Known phylogenies

In fact, they used restriction sites to infer the phylogeny, using MP, NJ, UPGMA and others.

All methods reconstructed the true tree.

slide122

Known phylogenies

They also compared outputs of ancestral sequence reconstruction, using MP.

97.3% of the ancestral states were correctly reconstructed.

Encouraging!

slide123

Known phylogenies

Criticism:

(1) The true tree was very easy to infer, because it was well balanced, and all the nodes are accompanied by numerous changes.

(2) Mutating using a single mutagen doesn’t reflect reality.