Alex Zelikovsky Department of Computer Science Georgia State University


  1. Viral Quasispecies Reconstruction Based on Unassembled Frequency Estimation • Alex Zelikovsky, Department of Computer Science, Georgia State University • Joint work with Serghei Mangul, Irina Astrovskaya, Bassam Tork, Ion Mandoiu

  2. Outline • Introduction • ML Model • EM Algorithm • VSEM Algorithm • Experimental Results • Conclusions and future work ISBRA 2011, Central South University, Changsha, China

  3. 454 Pyrosequencing • Emulsion PCR • Single nucleotide addition • Natural nucleotides • DNA polymerase pauses until the complementary nucleotide is dispensed • Nucleotide incorporation triggers an enzymatic reaction that results in emission of light

  4. ML Model • Panel: bipartite graph • RIGHT: strings, with unknown frequencies • LEFT: reads, with observed frequencies • EDGES: probability that the read is emitted by the string; weights are calculated based on the mapping of the reads to the strings • [Figure: bipartite panel with reads R1-R4 on the left and strings S1-S3 on the right]

  5. ML estimates of string frequencies • The probability that a read is sampled from string j is proportional to its frequency f(j) • The ML estimate for f(j) is n(j)/(n(1) + ... + n(N)) • n(j): number of reads sampled from string j
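
A minimal sketch of the estimate on this slide, assuming the number of reads sampled from each string is known exactly. The function name `ml_frequencies` is hypothetical and not taken from the talk.

```python
def ml_frequencies(read_counts):
    """Return f(j) = n(j) / (n(1) + ... + n(N)) for every string j."""
    total = sum(read_counts)
    if total == 0:
        raise ValueError("at least one read is required")
    return [n / total for n in read_counts]


# Three strings that emitted 6, 3, and 1 reads respectively.
frequencies = ml_frequencies([6, 3, 1])
```

In practice n(j) is not observed directly, because a read can be compatible with several strings; the counts then have to be estimated rather than tallied.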

  6. EM algorithm • E-step: compute the expected number n(j) of reads that come from string j, assuming the current string frequencies f(j) are correct • M-step: for each string j, set the new value of f(j) to the fraction of all observed reads in the sample that originate from string j
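
The two steps can be sketched as follows. This is an illustrative implementation under assumptions, not the authors' code: `weights[i][j]` is taken to be the given edge weight between read i and string j, the frequencies start uniform, and the names `em_frequencies`, `max_iterations`, and `tolerance` are hypothetical.

```python
def em_frequencies(weights, read_counts, max_iterations=1000, tolerance=1e-9):
    """Estimate string frequencies from a read-by-string weight matrix.

    weights[i][j]  : weight of the edge between read i and string j (assumed given)
    read_counts[i] : observed count of read i
    """
    num_strings = len(weights[0])
    freqs = [1.0 / num_strings] * num_strings  # uniform start (an assumption)
    for _ in range(max_iterations):
        # E-step: split each read's count among strings in proportion
        # to weight * current frequency.
        expected = [0.0] * num_strings
        for row, count in zip(weights, read_counts):
            denom = sum(w * f for w, f in zip(row, freqs))
            if denom == 0.0:
                continue  # read is not explained by any string
            for j, (w, f) in enumerate(zip(row, freqs)):
                expected[j] += count * w * f / denom
        total = sum(expected)
        if total == 0.0:
            raise ValueError("no read is compatible with any string")
        # M-step: new frequency is the string's share of the expected reads.
        new_freqs = [n / total for n in expected]
        change = max(abs(a - b) for a, b in zip(new_freqs, freqs))
        freqs = new_freqs
        if change < tolerance:
            break
    return freqs


# Read 0 maps only to string 0; read 1 maps equally well to both strings.
weights = [[1.0, 0.0], [1.0, 1.0]]
estimated = em_frequencies(weights, [3, 1])
```

In this toy panel every read can be explained by string 0 alone, so the estimate for string 0 approaches 1 as the iterations proceed.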

  7. ML Model Quality • How well does the maximum likelihood model explain the reads? • Measured by the deviation between expected and observed read frequencies • expected read frequency: (formula not preserved in this transcript)
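
The formula for the expected read frequency did not survive in the transcript, so the sketch below rests on an assumed definition: each string spreads its frequency over its reads in proportion to the edge weights, and the deviation is the mean absolute difference between observed and expected read frequencies. Both choices, and the function names, are assumptions for illustration only.

```python
def expected_read_frequencies(weights, freqs):
    """Assumed definition: e(i) = sum_j f(j) * h(i, j) / sum_l h(l, j)."""
    num_reads = len(weights)
    num_strings = len(freqs)
    column_sums = [sum(weights[i][j] for i in range(num_reads))
                   for j in range(num_strings)]
    expected = []
    for i in range(num_reads):
        e = 0.0
        for j in range(num_strings):
            if column_sums[j] > 0.0:  # skip strings that emit no reads
                e += freqs[j] * weights[i][j] / column_sums[j]
        expected.append(e)
    return expected


def deviation(observed, expected):
    """Mean absolute difference between observed and expected frequencies."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


# One string that emits two reads equally often, but the reads were
# observed with frequencies 0.9 and 0.1.
expected = expected_read_frequencies([[1.0], [1.0]], [1.0])
gap = deviation([0.9, 0.1], expected)
```

A deviation near zero suggests the panel explains the reads well; a large deviation suggests something is missing from the model.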

  8. VSEM: Virtual String EM • Input: (incomplete) panel + virtual string with 0-weights • EM: ML estimates of string frequencies • Compute expected read frequencies • Update weights of reads in virtual string • EM: deviation between expected and observed read frequencies • Stop condition: if no, return to computing expected read frequencies; if yes, output string frequencies and reads
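
The loop in this flowchart can be sketched as below. The slide does not give the rule for updating the virtual string's weights or the exact stop condition, so both are assumptions here: a read's virtual weight grows by the amount the read is under-explained, and the loop stops once the mean absolute deviation falls below `epsilon`. The expected read frequency uses the same assumed definition as a column-normalised weighted sum. All names are hypothetical.

```python
def em(weights, counts, iterations=200):
    """Plain EM over a read-by-string weight matrix; returns string frequencies."""
    k = len(weights[0])
    freqs = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for row, count in zip(weights, counts):
            denom = sum(w * f for w, f in zip(row, freqs))
            if denom > 0.0:
                for j in range(k):
                    expected[j] += count * row[j] * freqs[j] / denom
        total = sum(expected)
        if total == 0.0:
            raise ValueError("no read is compatible with any string")
        freqs = [n / total for n in expected]
    return freqs


def expected_read_frequencies(weights, freqs):
    """Assumed definition: e(i) = sum_j f(j) * h(i, j) / sum_l h(l, j)."""
    sums = [sum(row[j] for row in weights) for j in range(len(freqs))]
    return [sum(f * row[j] / sums[j] for j, f in enumerate(freqs) if sums[j] > 0.0)
            for row in weights]


def vsem(weights, counts, rounds=50, epsilon=1e-6):
    """Return (frequencies including the virtual string last, virtual weights)."""
    total = sum(counts)
    observed = [c / total for c in counts]
    virtual = [0.0] * len(counts)  # virtual string starts with 0-weights
    for _ in range(rounds):
        panel = [row + [v] for row, v in zip(weights, virtual)]
        freqs = em(panel, counts)
        expected = expected_read_frequencies(panel, freqs)
        dev = sum(abs(o - e) for o, e in zip(observed, expected)) / len(counts)
        if dev < epsilon:  # assumed stop condition
            break
        # Assumed update rule: raise the virtual weight of under-explained reads.
        virtual = [max(0.0, v + o - e)
                   for v, o, e in zip(virtual, observed, expected)]
    return freqs, virtual


# Incomplete panel: the only known string explains read 0 but not read 1.
freqs, virtual = vsem([[1.0], [0.0]], [3, 1])
# Complete panel: each read is explained by its own string.
full_freqs, full_virtual = vsem([[1.0, 0.0], [0.0, 1.0]], [3, 1])
```

In the incomplete toy panel the virtual string absorbs the unexplained read, while in the complete panel its frequency stays at zero.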

  9-13. Example: 1st iteration • [Figure: incomplete panel and full panel, each drawn as a bipartite graph over reads R1-R4, strings S1-S3, and the virtual string VS]

  14. Example: last iteration • [Figure: incomplete panel and full panel, same layout as in the 1st iteration]

  15. VSEM: Virtual String EM • Decide if the panel is likely to be incomplete • Estimate total frequency of missing strings • Identify read spectrum emitted by missing strings

  16. ViSpA • ViSpA [Astrovskaya et al. 2011]: viral spectrum assembling tool for inferring viral quasispecies sequences and their frequencies from pyrosequencing shotgun reads • align reads • build a read graph: V: reads; E: overlaps between reads; each path: candidate sequence • filter based on ML frequencies
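
The read-graph step can be illustrated with a small sketch. It is not ViSpA's actual construction: it assumes error-free reads that are already aligned to a reference and given as `(start, sequence)` pairs, and it adds an edge only when the second read starts inside the first, extends past it, and agrees exactly on the overlap. The function name `build_read_graph` is hypothetical.

```python
def build_read_graph(reads):
    """reads: list of (start, sequence) pairs aligned to a common reference.

    Returns a dict mapping each read index to the indices of reads that can
    follow it on a path.
    """
    edges = {i: [] for i in range(len(reads))}
    for i, (start_i, seq_i) in enumerate(reads):
        end_i = start_i + len(seq_i)
        for j, (start_j, seq_j) in enumerate(reads):
            if i == j:
                continue
            # Read j must start strictly inside read i and extend beyond it.
            if start_i < start_j < end_i and start_j + len(seq_j) > end_i:
                overlap = end_i - start_j
                # Assumed rule: the two reads must agree on the whole overlap.
                if seq_i[-overlap:] == seq_j[:overlap]:
                    edges[i].append(j)
    return edges


reads = [(0, "ACGT"), (2, "GTAA"), (2, "CTAA")]
graph = build_read_graph(reads)
```

Here read 0 connects to read 1 because they share "GT" on the overlap, while read 2 disagrees on it and gets no edge. Paths through such a graph correspond to candidate sequences.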

  17. ViSpA-VSEM • [Flowchart: reads → ViSpA weighted assembler → assembled Qsps → removing duplicated & rare Qsps → Qsps library → VSEM (Virtual String EM) → stopping condition; NO: reads and weights go back to the ViSpA weighted assembler; YES: ViSpA ML estimator → viral spectrum + statistics]

  18. Simulation Setup and Accuracy Measures • Real quasispecies sequence data from [von Hahn et al. 2006] • 44 sequences (1739 bp long) from the E1E2 region of Hepatitis C virus • Error-free data was simulated by an in-house simulator • population sizes: 10, 20, 30, and 40 sequences • population distributions: geometric, skewed normal, uniform • Accuracy measures • Kullback-Leibler divergence • Correlation between real and predicted frequencies • Average prediction error
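
The three accuracy measures can be computed as in the sketch below. The slide does not define them precisely, so the choices here are assumptions: natural-log Kullback-Leibler divergence of the real distribution from the predicted one, Pearson correlation, and mean absolute difference as the average prediction error.

```python
import math


def kl_divergence(real, predicted):
    """Sum of p * ln(p / q); terms with p == 0 contribute nothing."""
    total = 0.0
    for p, q in zip(real, predicted):
        if p == 0.0:
            continue
        if q == 0.0:
            return math.inf  # predicted frequency of 0 for a string that exists
        total += p * math.log(p / q)
    return total


def pearson_correlation(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_prediction_error(real, predicted):
    """Mean absolute difference between real and predicted frequencies."""
    return sum(abs(p - q) for p, q in zip(real, predicted)) / len(real)


real = [0.5, 0.3, 0.2]
predicted = [0.4, 0.4, 0.2]
kl = kl_divergence(real, predicted)
corr = pearson_correlation(real, predicted)
err = average_prediction_error(real, predicted)
```

Lower divergence and error, and higher correlation, indicate predicted frequencies closer to the real ones.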

  19. Experimental Validation of VSEM • Detection of panel incompleteness • VSEM can detect 1% of missing strings • Improving quasispecies frequencies • Detection of reads emitted by missing strings • Correlation between predicted reads and reads emitted by missing strings exceeds 65%

  20. EM vs VSEM

  21-22. ViSpA vs ViSpA-VSEM • 100K reads from 10 QSPS • average length 300

  23. Conclusions & Future Work • Apply VSEM to RNA-Seq data • Assemble missing strings from the set of reads emitted by missing strings • Handle chimeric strings present in the panel

  24. Acknowledgments • NFS …

  25. Thank you very much
