Introduction to Bioinformatics. Tutorial 4 Multiple Alignment and Phylogeny. ClustalW Input. Alignment format. Fast alignment?. Fast alignment options. Scoring matrix. Gap scoring. Input sequences. Phylogenetic trees. ClustalW Output (1). Input sequences. Pairwise alignment scores.
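Match strength in the ClustalW consensus line, in decreasing order: "*" (identical residues), ":" (conserved substitution), "." (semi-conserved substitution).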
So we hang the tree from where it joins
We know this is furthest
Number of mutations
Newick tree description
Why phylogenetic analyses? Mutations accumulate in the genomes of pathogens, especially viruses, during the spread of an infection. These mutations can be used to document the history of transmission events. Phylogenetic analysis of them can be used not only to reconstruct the history of a pathogen's spread through host populations but also to make predictions about its future progress.
The unsolved HIV/SIV relationship
One interesting case where phylogenetic tree building is useful is the unsolved relationship between HIV-1, HIV-2 and SIV. AIDS (acquired immunodeficiency syndrome) is caused by two different human viruses:
HIV-1, groups M and O
HIV-2, subtypes A to E
There are many related viruses in a variety of non-human primates. These related viruses are called SIV (simian immunodeficiency viruses).
The NJ (neighbor-joining) tree in our example is based on the polyprotein sequence from HIV-1, HIV-2 and SIV, with HTLV-1 as an outgroup. HTLV-1 (human T-lymphotropic virus type 1) is another human retroviral pathogen that originated from related simian viruses.
The calculation starts with the star:
The branch lengths between nodes 5 and 10 and between nodes 6 and 10 are calculated with these formulas:
In this case $L = 9$ and the new node is $x = 10$.
$r_i = r_5 = \sum_k d_{5,k}/(L-2) = 3.22406/(9-2) = 0.46058$
$r_j = r_6 = \sum_k d_{6,k}/(L-2) = 3.22758/(9-2) = 0.461083$
$d_{i,x} = d_{5,10} = (d_{5,6} + r_5 - r_6)/2 = (0.06088 + 0.46058 - 0.461083)/2 = 0.0301886$
$d_{j,x} = d_{6,10} = d_{5,6} - d_{5,10} = 0.06088 - 0.0301886 = 0.0306914$
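The branch-length step above can be written as a small Python function. This is a minimal sketch, not the ClustalW implementation: the function name `nj_branch_lengths` and its parameter names are chosen here for illustration, and the row sums are passed in directly because the full distance matrix is not shown in the tutorial.

```python
def nj_branch_lengths(d_ij, sum_i, sum_j, n_nodes):
    """Branch lengths from nodes i and j to their new parent node x.

    d_ij    -- distance between the two nodes being joined
    sum_i   -- sum of distances from node i to all other nodes
    sum_j   -- sum of distances from node j to all other nodes
    n_nodes -- number of nodes currently in the matrix (L in the text)
    """
    r_i = sum_i / (n_nodes - 2)
    r_j = sum_j / (n_nodes - 2)
    d_ix = (d_ij + r_i - r_j) / 2
    d_jx = d_ij - d_ix
    return d_ix, d_jx


# First join in the example: nodes 5 and 6, L = 9, new node 10.
d_5_10, d_6_10 = nj_branch_lengths(0.06088, 3.22406, 3.22758, 9)
print(round(d_5_10, 7), round(d_6_10, 7))
```

Calling the same function with the values of the next join, `nj_branch_lengths(0.125, 2.715455, 2.50096, 8)`, gives the branch lengths for nodes 3 and 4.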
Calculation of the new branches: in this case $L = 8$ and the new node is $x = 11$.
$r_i = r_3 = \sum_k d_{3,k}/(L-2) = 2.715455/(8-2) = 0.452576$
$r_j = r_4 = \sum_k d_{4,k}/(L-2) = 2.50096/(8-2) = 0.416827$
$d_{i,x} = d_{3,11} = (d_{3,4} + r_3 - r_4)/2 = (0.125 + 0.452576 - 0.416827)/2 = 0.080375$
$d_{j,x} = d_{4,11} = d_{3,4} - d_{3,11} = 0.125 - 0.080375 = 0.044625$
Calculation of the new branches: in this case $L = 7$ and the new node is $x = 12$.
$r_i = r_2 = \sum_k d_{2,k}/(L-2) = 2.252265/(7-2) = 0.450453$
$r_j = r_{11} = \sum_k d_{11,k}/(L-2) = 2.108208/(7-2) = 0.4216415$
$d_{i,x} = d_{2,12} = (d_{2,11} + r_2 - r_{11})/2 = (0.109705 + 0.450453 - 0.4216415)/2 = 0.069258$
$d_{j,x} = d_{11,12} = d_{2,11} - d_{2,12} = 0.109705 - 0.069258 = 0.040447$
In this case $L = 3$ and the new node is $x = 16$:
$r_{13} = 0.843684$
$d_{13,16} = 0.131758$
$d_{15,16} = 0.016648$
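The whole sequence of joins, from the star down to the last three nodes, can be sketched as one loop. The branch-length formulas are the ones used in the steps above. Two parts are assumptions taken from the standard neighbor-joining formulation, since the tutorial text does not show them: the pair to join is the one minimizing d_ij - r_i - r_j, and the distance from the new node x to any other node k is (d_ik + d_jk - d_ij)/2. The four-taxon matrix is a made-up toy example, not the HIV/SIV data.

```python
from itertools import combinations


def neighbor_joining(dist):
    """Return a list of (child, parent, branch_length) tuples.

    dist -- nested dict, dist[a][b] is the distance between nodes a and b;
            nodes are integers 1..N, new internal nodes are numbered N+1, ...
    """
    d = {a: dict(row) for a, row in dist.items()}  # work on a copy
    branches = []
    new = max(d) + 1
    while len(d) > 3:
        n = len(d)  # L in the text
        r = {a: sum(d[a].values()) / (n - 2) for a in d}
        # Assumed selection rule: join the pair with minimal d_ij - r_i - r_j.
        i, j = min(combinations(sorted(d), 2),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        d_ix = (d[i][j] + r[i] - r[j]) / 2
        d_jx = d[i][j] - d_ix
        branches += [(i, new, d_ix), (j, new, d_jx)]
        # Assumed update rule: distances from the new node to the rest.
        d[new] = {k: (d[i][k] + d[j][k] - d[i][j]) / 2
                  for k in d if k not in (i, j)}
        for k in d[new]:
            d[k][new] = d[new][k]
            del d[k][i], d[k][j]
        del d[i], d[j]
        new += 1
    # L = 3: the remaining three nodes meet in one final internal node.
    i, j, m = sorted(d)
    d_ix = (d[i][j] + d[i][m] - d[j][m]) / 2
    branches += [(i, new, d_ix),
                 (j, new, d[i][j] - d_ix),
                 (m, new, d[i][m] - d_ix)]
    return branches


# Toy additive distances (hypothetical): taxa 1 and 2 are neighbors, as are 3 and 4.
toy = {
    1: {2: 3, 3: 8, 4: 9},
    2: {1: 3, 3: 9, 4: 10},
    3: {1: 8, 2: 9, 4: 9},
    4: {1: 9, 2: 10, 3: 9},
}
tree = neighbor_joining(toy)
for child, parent, length in tree:
    print(f"{child} -> {parent}: {length}")
```

With nine input sequences this numbering scheme creates internal nodes 10 to 16, matching the node labels used in the calculation above.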
Because node 9 is the outgroup, the root will be placed between node 9 and the other nodes. The distance between node 9 and the first internal node is 0.563519.
This means that HIV-1 and HIV-2 have originated independently from two different SIV strains.
HIV-1 seems to be more closely related to SIV from chimpanzee.
There also seems to have been a cross-species transmission from human to MAN/MAC.
There must have been a cross-species transmission from chimpanzee SIV to human HIV-1.
HIV-2 (H2) is more closely related to SIV (S) from sooty mangabey than to HIV-1 (H1).
As one can see, the branch between H2-ROD A and the two SIV taxa has low support: only 56% of the trees have this topology. Therefore the transmission events from human to non-human primates are very uncertain.
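A support value like the 56% above can be understood as the share of replicate trees that contain a given grouping. The sketch below assumes the replicates are bootstrap trees and that each tree is represented simply as a set of clades (each clade a frozenset of taxon labels). The function name, the representation and the taxon labels are all hypothetical and chosen only for illustration.

```python
def clade_support(clade, replicate_trees):
    """Percentage of replicate trees that contain the given clade."""
    clade = frozenset(clade)
    hits = sum(1 for tree in replicate_trees if clade in tree)
    return 100.0 * hits / len(replicate_trees)


# Four hypothetical replicate trees, each written as a set of clades.
replicates = [
    {frozenset({"H2", "SIV-a", "SIV-b"}), frozenset({"SIV-a", "SIV-b"})},
    {frozenset({"H2", "SIV-a", "SIV-b"}), frozenset({"H2", "SIV-a"})},
    {frozenset({"H2", "SIV-a", "SIV-b"}), frozenset({"SIV-a", "SIV-b"})},
    {frozenset({"H2", "SIV-a"}), frozenset({"SIV-b", "H1"})},
]

support = clade_support({"SIV-a", "SIV-b"}, replicates)
print(f"{support:.0f}% of the replicate trees contain this clade")
```

A clade found in only about half of the replicates, as in this toy data, is one whose placement should be treated with caution.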
In this exercise you will perform a phylogenetic analysis of the human globin sequences. You will compare your results with the currently accepted view of the globin family, according to the following summary:
Myoglobin and the hemoglobins diverged from one another before the emergence of worms, about 800 million years ago.
The hemoglobins diverged into two families (the α-family and β-family) following a gene duplication, about 450 million years ago, which is before the emergence of mammals.
The α-family diverged into the zeta, theta and alpha genes, and the β-family diverged into the beta, gamma_G, gamma_A, delta and epsilon genes, all following a series of gene duplications.
The most recent duplication was that of gamma_G from gamma_A, which occurred around the separation of the simians (humans, chimpanzees, gorillas, etc.) from the prosimians (such as lemurs and lorises), about 55 million years ago.
(adapted from Graur and Li, 1999)