
Developing Accessible Application Software for Individual de novo Genome Projects

Developing Accessible Application Software for Individual de novo Genome Projects. Vince Forgetta, PhD Candidate. Ken Dewar, PhD, Supervisor. Department of Human Genetics, McGill University, Montreal, Quebec, Canada. December 8th, 2011.


Presentation Transcript


  1. Developing Accessible Application Software for Individual de novo Genome Projects. Vince Forgetta, PhD Candidate. Ken Dewar, PhD, Supervisor. Department of Human Genetics, McGill University, Montreal, Quebec, Canada. December 8th, 2011

  2. Next-Gen Gap
     “Unfortunately, the software and computer hardware demands on these analyses are not much less than those of the large Genome Centers. From this perspective, the gap between large-scale genome centers and individual investigators may seem to be growing, not shrinking, as the next-generation platforms’ apparent promise of a ‘Genome Center in a box’ may have only been half delivered, providing data without a full suite of tools.” (Nature Methods 6, S2-S5 (2009))
     - Download Data
     - Learn *NIX
     - Install Software and Dependencies
     - Run Software … Wait? … Problems?
     Bacterial genome in < 1 week for ~$3000 (Genome Assembly)+

  3. Three Common Methodologies in de novo Genome Analysis
     • Display and analysis of genome annotations
     • Quality assessment of a genome assembly
     • Comparison and mining of genomic data from public repositories
     • One or more of these methodologies are used to address needs in three specific projects; the projects serve as a vehicle to develop software.

  4. Assembly Quality Assessment

  5. Assembly Analysis
     • Researchers should have easy access to their assembly to determine its quality and perform simple analyses.
     [Diagram labels: DNA Sequencing Centre, Assembly, Researcher]
     • Delays and limits on data access exist:
     - Viewers must be installed and have specific software (e.g. Linux) or hardware (e.g. RAM) requirements.
     - Assembly data (multiple GB) must be downloaded.

  6. Objective • Develop a simple assembly viewer that operates within a web browser, allowing a researcher to rapidly access and analyze their data.

  7. Method
     Parser/Converter: Uses Python to parse, analyze, and convert assembly data into web-accessible formats (HTML, JSON, JPG images), which are stored on sequencing centre servers.
     Interface: Uses a browser-based interface (HTML) to dynamically access data on the servers (JavaScript). Incorporates pre-existing web technologies (jQuery, Seadragon Deep Zoom, AJAX).
     Usage:
     - After genome assembly, the parser/converter is run on sequencing centre servers.
     - The researcher accesses the interface over the internet using a modern web browser.
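The parse-and-convert step can be illustrated with a minimal Python sketch. It is not the presenter's code: the slide does not say which assembly format is parsed, so FASTA input is assumed here, and the function names (`read_fasta`, `contig_summary`, `assembly_to_json`) and the JSON field names are hypothetical.

```python
import json


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_summary(name, seq):
    """Compute simple per-contig statistics (hypothetical field names)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    gc_percent = round(100.0 * gc / len(seq), 2) if seq else 0.0
    return {"name": name, "length": len(seq), "gc_percent": gc_percent}


def assembly_to_json(lines):
    """Convert FASTA contigs into a JSON document a browser could fetch."""
    contigs = [contig_summary(n, s) for n, s in read_fasta(lines)]
    return json.dumps({"contigs": contigs}, indent=2)


# Small made-up assembly used only for illustration.
example = """>contig1 example
ACGTACGT
GGCC
>contig2
ATAT
""".splitlines()

print(assembly_to_json(example))
```

Writing small JSON summaries ahead of time is one way a browser could fetch only what it needs instead of downloading the full assembly.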

  8. Performance
     Parser/Converter:
     • Multiple platforms (Windows/OS X/Linux).
     • Multi-processor support.
     • Low memory usage (< 250 MB per processor).
     User interface:
     • Client-side programming → decreased server load.
     • Data is downloaded on demand → suitable for users with limited bandwidth.
     • Sole system requirement: a modern web browser (Firefox, Opera, Google Chrome) → ease of installation.
     • Low memory usage (peaks at ~250 MB).

  9. The Interface
     • Dynamic charts: toggle axis value; identify points; summarize regions.
     • Assembly statistics; batch download of sequence and statistical data.
     • Table of contig/scaffold statistics: sort/filter by column; access to contig sequence/quality and read sequences.
     • Contig assembly: pan/zoom; identify position, read names, mismatches.
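The table operations and summary statistics listed on this slide can be sketched in a few lines of Python. This is an illustration only: the slide does not name the statistics the viewer reports, so N50 is an assumed example of a common assembly statistic, and `n50`, `filter_and_sort`, and the row fields are hypothetical.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, first reaches half the assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def filter_and_sort(rows, column, minimum=0, descending=True):
    """Keep rows whose value in `column` is at least `minimum`, then sort by it."""
    kept = [row for row in rows if row[column] >= minimum]
    return sorted(kept, key=lambda row: row[column], reverse=descending)


# Made-up contig table used only for illustration.
rows = [
    {"name": "contig1", "length": 100, "reads": 40},
    {"name": "contig2", "length": 300, "reads": 210},
    {"name": "contig3", "length": 250, "reads": 95},
    {"name": "contig4", "length": 50, "reads": 12},
]

print("N50:", n50(row["length"] for row in rows))
for row in filter_and_sort(rows, "length", minimum=100):
    print(row["name"], row["length"], row["reads"])
```

In a browser-based viewer the same sorting and filtering would more likely run client-side in JavaScript; Python is used here only to keep the sketch short.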

  10. Demo

  11. 3. Data Mining

  12. BLAST Pivot (blip.codeplex.com). Microsoft Research Summer Internship, Microsoft Biology Foundation, Redmond, Washington, USA. Mentor: Simon Mercer

  13. BLAST
      [Diagram: a query sequence (ACGTCACTGACTGACTAGCTAGCTAGCTAGCATCGATCGATCGATCGATCGATCGACGTAACTAGCACGACTGACTCT) is submitted to BLAST, at NCBI or locally, to ask: Species, Function, …?]

  14. Limitation
      [Diagram labels: Scientist + = (a single BLAST report, shown below); + = ~5000 genes, E. coli, Programmer]

      >gi|301326298|ref|ZP_07219671.1| TIM-barrel protein, nifR3 family [Escherichia coli MS 78-1]
      Length=321
      Score = 583.563 bits (1503), Expect = 8.65371E-165
      Identities = 280/281 (100%), Positives = 280/281 (100%), Gaps = 0/281 (0%)
      Frame = 0

      Query  1    MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC  60
                  MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC
      Sbjct  41   MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC  100

      Query  61   PAKKVNRKLAGSALLQYPDVVKSILTEVVNAVDVPVTLKIRTGWAPEHRNCEEIAQLAED  120
                  PAKKVNRKLAGSALLQYPDVVKSILTEVVN VDVPVTLKIRTGWAPEHRNCEEIAQLAED
      Sbjct  101  PAKKVNRKLAGSALLQYPDVVKSILTEVVNTVDVPVTLKIRTGWAPEHRNCEEIAQLAED  160

      Query  121  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA  180
                  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA
      Sbjct  161  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA  220

      Query  181  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR  240
                  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR
      Sbjct  221  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR  280

      Query  241  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA  281
                  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA
      Sbjct  281  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA  321
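To show why a text report like this one is hard to use across thousands of genes without programming, here is a minimal Python sketch that pulls the summary fields out of a single report. It is not how BL!P works (the slides do not show its implementation); the regular expressions assume the exact "Score = ... Expect = ..." and "Identities = ..." wording seen on this slide, and `summarize_hit` and its field names are hypothetical.

```python
import re

# Patterns written against the summary lines shown on the slide.
SCORE_RE = re.compile(
    r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\),\s*Expect\s*=\s*([\d.eE+-]+)"
)
IDENT_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def summarize_hit(report):
    """Extract score, E-value, and identity counts from one BLAST text report.

    Returns None when the expected summary lines are not found.
    """
    score = SCORE_RE.search(report)
    ident = IDENT_RE.search(report)
    if score is None or ident is None:
        return None
    matches, aligned = int(ident.group(1)), int(ident.group(2))
    if aligned == 0:
        return None
    return {
        "bit_score": float(score.group(1)),
        "raw_score": int(score.group(2)),
        "expect": float(score.group(3)),
        "identities": matches,
        "aligned_length": aligned,
        "percent_identity": round(100.0 * matches / aligned, 2),
    }


# Summary lines copied from the report on this slide.
report = (
    "Score = 583.563 bits (1503), Expect = 8.65371E-165 "
    "Identities = 280/281 (100%), Positives = 280/281 (100%), Gaps = 0/281 (0%)"
)

print(summarize_hit(report))
```

Running such a function over one report per gene would give a table with one row per hit, which is the kind of structured view a visual tool can then sort and filter.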

  15. BLAST in Pivot
      [Diagram: three numbered steps (1, 2, 3) in which several copies of a query sequence (ACGTCACTGACTGACTAGCTAGCTAGCTAGCATCGATCGATCGATCGATCGATCGACGTAACTAGCACGACTGACTCT), each marked "?", pass through BLAST and into Pivot.]

  16. E. coli ECD-227, Divergent Strain. ????? Species? Function? E. coli. E. coli ECD-227: Antibiotic Resistant! Acknowledgement: Moussa Diarra, Heidi Rempel

  17. Demo

  18. Conclusions • ContiGo: used by clients of the Genome Centre at McGill (release soon). • BL!P: >500 downloads (blip.codeplex.com).

  19. Acknowledgements
      E. coli ECD-227: H. Rempel, Andrew Metcalfe, M. S. Diarra
      BL!P/Microsoft: Simon Mercer, Xin-Yi Chua, Mauro Luigi Drago, Beatriz Diaz Acosta, Vivek Kumar, Bob Davidson, Mike Zyskowski, Xiaoji Chen, Bob Silverstein, Vikram Bapat, Jared Jackson, Wei Lu, The Pivot Team
      Ophiostoma novo-ulmi: Jan Kieleczawa, Michael Zianni, Robert Steen, Deborah Grove, Anoja Perera, Robert Lyons Jr., Sushmita Singh, Doug Bintzler, Scottie Adams, Deborah Grove, Gregory Grove, Robert Lyons Jr., Suzanne Genik, Chris Wright, Alvaro Hernandez, Sharon Bachman, Lorie Hetrick, Sushmita Singh, Nichole Peterson, Gary Leveque, Joana Dias, Clotilde Teiling, Tim Harkins
      C. difficile: Ken Dewar, Andre Dascal, Matthew Oughton, Joana Dias, Gary Leveque, Pascale Marquis, Corina Nagy, Amelie Villeneuve, Ivan Brukner, Mark Miller, Vivian Loo, Mike Mulvey, Dale Gerding, Maya Rupnik, Elaine Mardis, V. Magrini, M. Hickenbotham, K. Haub, C. Markovic, J. Nelson
