1 / 34

Large-Scale Multiple Sequence Alignment

Large-Scale Multiple Sequence Alignment. Tandy Warnow Founder Professor of Computer Science The University of Illinois at Urbana-Champaign http:// tandy.cs.illinois.edu. 1kp: Thousand Transcriptome Project. N. Matasci iPlant. T. Warnow, S. Mirarab, N. Nguyen

patrina
Download Presentation

Large-Scale Multiple Sequence Alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Large-Scale Multiple Sequence Alignment Tandy Warnow Founder Professor of Computer Science The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

  2. 1kp: Thousand Transcriptome Project N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen UIUC UCSD UCSD J. Leebens-Mack U Georgia N. Wickett Northwestern G. Ka-Shu Wong U Alberta Plus many many other people… • Plant Tree of Life based on transcriptomes of ~1200 species • More than 13,000 gene families (most not single copy) Gene Tree Incongruence Challenge: Alignments and trees on > 100,000 sequences

  3. 1000-taxon models, ordered by difficulty (Liu et al., Science 324(5934):1561-1564, 2009)

  4. A C B D Re-aligning on a tree A B Decompose dataset C D Align subsets A B C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

  5. Obtain initial alignment and estimated ML tree Tree Use tree to compute new alignment Estimate ML tree on new alignment Alignment SATé and PASTA Algorithms Repeat until termination condition, and return the alignment/tree pair with the best ML score

  6. 1000 taxon models, ordered by difficulty, Liu et al., Science 324(5934):1561-1564, 2009 24 hour SATé-I analysis, on desktop machines (Similar improvements for biological datasets)

  7. 1000-taxon models ranked by difficulty SATé-2 better than SATé-1 SATé-1 (Liu et al., Science 2009): can analyze up to 8K sequences SATé-2 (Liu et al., Systematic Biology 2012): can analyze up to ~50K sequences

  8. PASTA: even better than SATé-2 PASTA: Mirarab, Nguyen, and Warnow, J Comp. Biol. 2015 • Simulated RNASim datasets from 10K to 200K taxa • Limited to 24 hours using 12 CPUs • Not all methods could run (missing bars could not finish)

  9. 1kp: Thousand Transcriptome Project N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen UIUC UCSD UCSD J. Leebens-Mack U Georgia N. Wickett Northwestern G. Ka-Shu Wong U Alberta Plus many many other people… • Plant Tree of Life based on transcriptomes of ~1200 species • More than 13,000 gene families (most not single copy) Gene Tree Incongruence Challenge: Alignments and trees on > 100,000 sequences

  10. 1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary

  11. 1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments.

  12. UPP UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles” Nguyen, Mirarab, and Warnow. Genome Biology, 2014. Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.

  13. UPP UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles” Nguyen, Mirarab, and Warnow. Genome Biology, 2014. Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences. Uses an ensemble of HMMs

  14. Simple idea (not UPP) • Select random subset of sequences, and build “backbone alignment” • Construct a Hidden Markov Model (HMM) on the backbone alignment • Add all remaining sequences to the backbone alignment using the HMM

  15. One Hidden Markov Model for the entire alignment?

  16. One Hidden Markov Model for the backbone alignment? HMM 1

  17. Or 2 HMMs? HMM 1 HMM 2

  18. Or 4 HMMs? HMM 1 HMM 2 HMM 3 HMM 4

  19. Or all 7 HMMs? HMM 1 m HMM 2 HMM 4 HMM 7 HMM 3 HMM 5 HMM 6

  20. UPP Algorithmic Approach • Select random subset of full-length sequences, and build “backbone alignment” with PASTA • Construct an “Ensemble of Hidden Markov Models” on the backbone alignment • Add all remaining sequences to the backbone alignment using the Ensemble of HMMs

  21. Evaluation • Simulated datasets (some have fragmentary sequences): • 10K to 1,000,000 sequences in RNASim – complex RNA sequence evolution simulation • 1000-sequence nucleotide datasets from SATé papers • 5000-sequence AA datasets (from FastTree paper) • 10,000-sequence Indelible nucleotide simulation • Biological datasets: • Proteins: largest BaliBASE and HomFam • RNA: 3 CRW datasets up to 28,000 sequences

  22. RNASim Million Sequences: alignmenterror • Notes: • We show alignment error • using average of SP-FN and SP-FP. • UPP variants have better alignment scores than PASTA. • (Not shown: Total Column Scores – PASTA more accurate than UPP) • No other methods tested could complete on these data

  23. RNASim Million Sequences: tree error • Using 12 processors: • UPP(Fast,NoDecomp) took 2.2 days, • UPP(Fast) took 11.9 days, and • PASTA took 10.3 days

  24. UPP is very robust to fragmentary sequences Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution. Performance on fragmentary datasets of the 1000M2 model condition

  25. UPP Running Time Wall-clock time used (in hours) given 12 processors

  26. What about BAli-Phy? • BAli-Phy (Redelings and Suchard): leading method for statistical co-estimation of alignments and trees: • Like Bayesian phylogeny estimation, it is expected to be the most rigorous and accurate technique for estimating trees and alignments!

  27. BAli-Phy: Better than PASTA! Alignment Accuracy (TCscore) MAFFT PASTA BAli-Phy 40% 30% 20% 10% 0% #Taxa: Total-ColumnScore 100 RNAsim(RNA) 100 Indelible(DNA) 200 200 Simulator Simulated datasets with 100 or 200sequences. *Averages over 10replicates

  28. But: BAli-Phy is limited to smalldatasets From www.bali-phy.org/README.html, 5.2.1. Too many taxa? “BAli-Phy is quite CPU intensive, and so we recommend using 50 or fewer taxa in order to limit the *me required to accumulate enough MCMC samples. (Despite this recommendation, data sets with more than 100 taxa have occasionally been known to converge.) We recommend initially pruning as many taxa as possible from your data set, then adding some back if the MCMC is not too slow.”

  29. A C B D Re-aligning on a tree A B Decompose dataset C D Align subsets: MAFFT A B C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

  30. A C B D Re-aligning on a tree A B Decompose dataset C D Align subsets: BAli-Phy?? A B C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

  31. Comparing default PASTA to PASTA+BAli-Phy on simulated datasets with 1000 sequences Decomposition to 100-sequence subsets, one iteration of PASTA+BAli-Phy

  32. Comparing UPP variants where the backbone alignment is computed using either default PASTA or PASTA+BAli-Phy Total ColumnScore 0.100 PAST A+BAli−Phy Better 0.075 PASTA+BAli−Phy ● data Indelible M2 RNAsim ● 0.050 ● ● ● ●● ● ● ● PASTABetter 0.025 ● 0.000 0.000 0.025 0.050 PASTA 0.075 0.100 Recall (SP−Score) Tree Error: Delta RF (FastTree−2) 1.000 0.015 ●r PAST A+BAli−Phy Better PASTA+BAli−Phy Bette ● 0.975 PASTA+BAli−Phy 0.010 ●● ● ● ● ● PASTA ●● ● 0.950 ● ● ● ● ● ● 0.005 ● PASTABetter PASTABetter ● ● 0.925 0.900 0.000 0.900 0.925 0.950 PASTA 0.975 1.000 0.000 0.005 0.010 PASTA+BAli−Phy 0.015 Results on 10,000-sequence datasets, backbone size1000

  33. Summary • Large-scale multiple sequence alignment (MSA) is achievable with good accuracy using divide-and-conquer plus iteration, and enable preferred MSA methods to be used on very large datasets • PASTA and UPP can provide good accuracy on large (million-sequence) datasets with high heterogeneity • Fragmentary sequences present additional challenges that UPP can address • PASTA and UPP at https://github.com/smirarab/ • PASTA+BAli-Phy at http://github.com/MGNute/pasta

  34. Acknowledgments PASTA and UPP: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD), undergrad: Keerthana Kumar (at UT-Austin) PASTA+BAli-Phy: Mike Nute (PhD student at UIUC) Current NSF grants:ABI-1458652 (multiple sequence alignment) Grainger Foundation (at UIUC), and UIUC TACC, UTCS, Blue Waters, and UIUC campus cluster PASTA, UPP, SEPP, and TIPP are available on github at https://github.com/smirarab/; see also PASTA+BAli-Phy at http://github.com/MGNute/pasta

More Related