
Multiple Sequence Alignment (MSA) and Phylogeny


linus-ware

Presentation Transcript


  1. Multiple Sequence Alignment (MSA) and Phylogeny

  2. Clustal X

  3. Input: multiple sequence Fasta file

     >gi|21536452|ref|NP_002762.2| mesotrypsin preproprotein [Homo sapiens]
     MNPFLILAFVGAAVAVPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQ
     VRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECL
     ISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDSGGPVVCNGQL
     QGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS
     >gi|114051746|ref|NP_001040585.1| protease, serine, 2 [Macaca mulatta]
     MNPLLILAFVGVAVAAPFDDDDKIVGGYTCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAAHCYKTRIQ
     VRLGEHNIEVLEGTEQFINAAKIIRHPDYDRKTLNNDILLIKLSSPAVINARVSTISLPTAPPAAGAEAL
     ISGWGNTLSSGADYPDELQCLEAPVLSQAECEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVSNGQL
     QGIVSWGYGCAQKNRPGVYTKVYNYVDWIRDTIAANS
     >gi|6755891|ref|NP_035775.1| mesotrypsin [Mus musculus]
     MNALLILALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKTRIQ
     VRLGEHNINVLEGNEQFVNAAKIIKHPNFNRKTLNNDIMLLKLSSPVTLNARVATVALPSSCAPAGTQCL
     ISGWGNTLSFGVSEPDLLQCLDAPLLPQADCEASYPGKITGNMVCAGFLEGGKDSCQGDSGGPVVCNREL
     QGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
     >gi|6981422|ref|NP_036861.1| protease, serine, 2 [Rattus norvegicus]
     MRALLFLALVGAAVAFPVDDDDKIVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQ
     VRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCL
     ISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGEL
     QGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
     >gi|27819626|ref|NP_777115.1| pancreatic anionic trypsinogen [Bos taurus]
     MHPLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCGGSLINDQWVVSAAHCYQYHIQ
     VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASGSTECL
     . . .
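A multi-sequence FASTA file is a series of records, each starting with a `>` header line followed by one or more sequence lines. The snippet below is a minimal sketch of how such a file could be read in Python; it is not part of Clustal X, and the function name `parse_fasta` and the two toy records are hypothetical, chosen only for illustration.

```python
def parse_fasta(text):
    """Parse multi-FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical toy input (not the trypsin sequences from the slide).
example = """\
>seq1 hypothetical record A
MNPFLILAFV
GAAVAVPF
>seq2 hypothetical record B
MNPLLILAFV
"""

records = parse_fasta(example)
for header, seq in records:
    print(header.split()[0], len(seq))
```

Each record keeps its full header line (without the `>`), so the species name in square brackets remains available for labelling tree leaves later.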

  4. One of the options to get multiple sequence Fasta file

  5. One of the options to get multiple sequence Fasta file

  6. Input: multiple sequence Fasta file (the same sequences as on slide 3)

  7. Input: multiple sequence Fasta file (the same sequences as on slide 3)

  8. Step 1: Load the sequences

  9. Sequences and conservation view

  10. Step 2: Perform Alignment

  11. Sequences and conservation view

  12. Sequences and conservation view

  13. Step 3: Create tree

  14. Step 4: NJPlot

  15. Step 4: NJPlot

  16. The Newick tree format is used to represent trees as strings. For a tree in which A groups with C and B groups with D, the Newick format is: ((A,C),(B,D)); Each pair of parentheses () encloses a clade in the tree, and the comma separates the members of the corresponding clade. “;” is always the last character.
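To make the parenthesis-and-comma rule concrete, here is a minimal sketch of a Newick reader in Python. It assumes the simplest form of the format, with leaf names only and no branch lengths or bootstrap labels, so it would not read a real .ph file as is. The function names `parse_newick`, `leaves` and `clades` are hypothetical.

```python
def parse_newick(s):
    """Parse a simple Newick string (leaf names only) into nested tuples."""
    s = s.strip()
    if not s.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    s = s[:-1]
    pos = 0

    def node():
        nonlocal pos
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
            return tuple(children)
        # Otherwise read a leaf name up to the next delimiter.
        start = pos
        while pos < len(s) and s[pos] not in "(),":
            pos += 1
        return s[start:pos].strip()

    tree = node()
    if pos != len(s):
        raise ValueError("unexpected trailing characters")
    return tree


def leaves(tree):
    """List the leaf names under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def clades(tree):
    """Return the leaf set of every internal node (one per parenthesis pair)."""
    if isinstance(tree, str):
        return []
    found = [frozenset(leaves(tree))]
    for child in tree:
        found.extend(clades(child))
    return found


tree = parse_newick("((A,C),(B,D));")
print(tree)
print([sorted(c) for c in clades(tree)])
```

The three parenthesis pairs in `((A,C),(B,D));` give three clades: the whole tree, A with C, and B with D.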

  17. How robust is our tree?

  18. How robust is our tree?
      • We need some statistical way to estimate the confidence in the tree topology
      • But we don’t know anything about the tree topology distribution or parameters
      • The only data source we have is our data (MSA)
      • So, we must rely on our own resources: “pull up by your own bootstraps”

  19. Bootstrap (and jackknife)

  20. Jackknife 1. We create n (typically 100-1000) new MSAs (pseudo-data sets) by randomly sampling half of the characters (random sampling without replacement). We do not change the number of sequences, just the number of positions!

      POS: 52316
      1 : TATTT
      2 : CATTT
      3 : CACTT
      N : AACTT

      POS: 18745
      1 : TTTAT
      2 : TAACC
      3 : TAACC
      N : TGGGA

      POS: 18394
      1 : TTGTA
      2 : TAGAC
      3 : TAAAC
      N : TGAGG
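The jackknife resampling step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any particular program: the MSA is held as a list of equal-length strings, the toy alignment is made up, and `jackknife_msa` is a hypothetical name.

```python
import random


def jackknife_msa(msa, rng):
    """Sample half of the alignment columns without replacement.

    msa is a list of aligned sequences of equal length. The number of
    sequences is unchanged; only the number of positions is halved.
    """
    n_positions = len(msa[0])
    if any(len(seq) != n_positions for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    # random.sample draws distinct positions, i.e. without replacement.
    positions = rng.sample(range(n_positions), n_positions // 2)
    resampled = ["".join(seq[p] for p in positions) for seq in msa]
    return positions, resampled


# Hypothetical toy alignment with 10 positions and 4 sequences.
toy_msa = ["TATTTGACTA", "CATTTGACCA", "CACTTGATCA", "AACTTGGTCA"]

rng = random.Random(0)  # fixed seed so the run is repeatable
pseudo_sets = [jackknife_msa(toy_msa, rng) for _ in range(100)]

positions, resampled = pseudo_sets[0]
print(positions)
print(resampled)
```

The same column indices are applied to every sequence, so each pseudo-data set is still a valid alignment.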

  21. Jackknife 2. We reconstruct a tree from each data set, using the same method used for reconstructing the original tree.

      POS: 52316
      1 : TATTT
      2 : CATTT
      3 : CACTT
      N : AACTT

      POS: 18745
      1 : TTTAT
      2 : TAACC
      3 : TAACC
      N : TGGGA

      POS: 18394
      1 : TTGTA
      2 : TAGAC
      3 : TAAAC
      N : TGAGG

      [Figure: one tree over Sp1, Sp2, Sp3 and Sp4 reconstructed from each data set]

  22. Jackknife 3. For each node in our original tree, we count the number of times it appeared in the jackknife analysis. In 67% of the data sets, the node Sp1+Sp2 was found.

      [Figure: the original tree over Sp1, Sp2, Sp3 and Sp4 with support values 67% and 100% on its nodes, next to the jackknife trees]

  23. Bootstrap The same as jackknife, but instead of sampling K/2 positions, we sample K positions with replacement

  24. Bootstrap 1. Resample K positions n times

      12345 K
      1 : ATCTG…A
      2 : ATCTG…C
      3 : ACTTA…C
      N : ACCTA…T

      11244 K
      1 : AATTT…T
      2 : AATTT…G
      3 : AACTT…T
      N : AACTT…T

      47789…K
      1 : TTTAT…T
      2 : TAACC…G
      3 : TAACC…T
      N : TGGGA…T

      15578… K
      1 : AGGTA…T
      2 : AGGAC…G
      3 : AAAAC…A
      N : AAAGG…C
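Bootstrap resampling differs from the jackknife only in how the columns are drawn: K positions are sampled with replacement, so some columns appear more than once and others not at all. The sketch below assumes the MSA is a list of equal-length strings; the toy alignment and the name `bootstrap_msa` are hypothetical.

```python
import random


def bootstrap_msa(msa, rng):
    """Sample K alignment columns with replacement, where K is the MSA length."""
    n_positions = len(msa[0])
    if any(len(seq) != n_positions for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    # random.choices draws with replacement, so positions may repeat.
    # Sorting only makes the output easier to read; column order does not
    # affect the distances computed from the pseudo-data set.
    positions = sorted(rng.choices(range(n_positions), k=n_positions))
    resampled = ["".join(seq[p] for p in positions) for seq in msa]
    return positions, resampled


# Hypothetical toy alignment with K = 8 positions and 4 sequences.
toy_msa = ["ATCTGACA", "ATCTGACC", "ACTTAGTC", "ACCTAGTT"]

rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [bootstrap_msa(toy_msa, rng) for _ in range(100)]

positions, resampled = replicates[0]
print(positions)
print(resampled)
```

Every replicate has the same number of sequences and the same length K as the original alignment.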

  25. Bootstrap 2. Reconstruct a tree from each data set using the same method used for reconstructing the original tree.

      11244 K
      1 : AATTT…T
      2 : AATTT…G
      3 : AACTT…T
      N : AACTT…T

      47789…K
      1 : TTTAT…T
      2 : TAACC…G
      3 : TAACC…T
      N : TGGGA…T

      15578… K
      1 : AGGTA…T
      2 : AGGAC…G
      3 : AAAAC…A
      N : AAAGG…C

      [Figure: one tree over Sp1, Sp2, Sp3 and Sp4 reconstructed from each data set]

  26. Bootstrap 3. For each node in our original tree, we count the number of times it appeared in the bootstrap analysis.

      [Figure: the original tree over Sp1, Sp2, Sp3 and Sp4 with support values 67% and 100% on its nodes, next to the bootstrap trees]

      • The jackknife method is less general than bootstrap
      • Jackknife explores the data differently
      • Jackknife is easier to apply to complex sampling schemes
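The counting step can be sketched as follows. This is a simplified illustration, not what ClustalX does internally: trees are written as nested Python tuples, nodes are compared as rooted clades (real programs usually compare bipartitions of unrooted trees), and the three replicate trees are hypothetical, chosen so that the result matches the 67% and 100% values shown on the slide. The names `leaf_set`, `internal_clades` and `support` are also hypothetical.

```python
def leaf_set(node):
    """Return the set of leaf names under a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def internal_clades(tree):
    """Leaf sets of all internal nodes except the root."""
    found = set()

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.add(leaf_set(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def support(original, replicates):
    """Percentage of replicate trees that contain each node of the original."""
    counts = {clade: 0 for clade in internal_clades(original)}
    for rep in replicates:
        rep_clades = internal_clades(rep)
        for clade in counts:
            if clade in rep_clades:
                counts[clade] += 1
    return {clade: 100.0 * n / len(replicates) for clade, n in counts.items()}


# Hypothetical original tree and three hypothetical replicate trees.
original = ((("Sp1", "Sp2"), "Sp3"), "Sp4")
replicate_trees = [
    ((("Sp1", "Sp2"), "Sp3"), "Sp4"),
    ((("Sp1", "Sp2"), "Sp3"), "Sp4"),
    ((("Sp1", "Sp3"), "Sp2"), "Sp4"),
]

values = support(original, replicate_trees)
for clade, pct in values.items():
    print("+".join(sorted(clade)), round(pct))
```

With only three replicates the percentages are coarse; in practice n is typically 100 to 1000, as noted for the jackknife.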

  27. Step 3.5 - Bootstrap

  28. Bootstrap values on NJPlot. Note: ClustalX saves trees as .ph files; trees with bootstrap values are saved as .phb files. You might have to reopen the tree…
