
Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Thesis defense by Ashish Nagavaram, graduate student, Computer Science and Engineering. Advisor: Dr. Gagan Agrawal. Committee: Dr. Rajiv Ramnath, Dr. Michael Freitas.


Presentation Transcript


  1. Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis • Thesis Defense: Ashish Nagavaram, Graduate student, Computer Science and Engineering • Advisor: Dr. Gagan Agrawal • Committee: Dr. Rajiv Ramnath, Dr. Michael Freitas

  2. Introduction • Cloud computing • Resources on demand • pay-as-you-go • Elasticity • Resource Allocation on the cloud • Dynamic resource allocation

  3. Motivation • Use the elasticity of the cloud to execute scientific applications • Over-provisioning and under-provisioning • Avoid wasting resources • No generalized scientific workflow exists for executing an application dynamically • Allocate resources during execution • Meet time constraints by using more resources

  4. Background: MassMatrix • Developed by Dr. Hua Xu and Dr. Michael Freitas at Ohio State University • A database search program for rapid characterization of proteins and peptides • Supports multiple data formats such as .mgf, .mzXML and raw data • The input database is in .fasta or .BAS format

  5. MassMatrix Application Flow (flowchart) • Theoretical protein database • Digest the sequence • Has the sequence been searched before? (yes: do not add it to the final result; no: full-scan search for matching peptides) • MS/MS data input file • Clear insignificant peptides • Statistical analysis to generate results • Results

  6. Contributions (1/2) • A framework for parallelizing the MassMatrix application • A dynamic workflow • Resources are allocated adaptively • QOS is achieved by parameter prediction • Gives the user control through a benefit function

  7. Contributions (2/2) • Allows the user to specify the time constraint within which the application should complete • “A cloud-based Dynamic Workflow for Mass spectrometry Data Analysis”, Ashish Nagavaram, Gagan Agrawal, Michael Freitas, Gaurang Mehta, 7th IEEE Conference on E-Science, Dec 2011

  8. Outline • Introduction • Motivation • Background • Parallelization of MassMatrix • Adaptive Resource allocation • Experimental Results • Parameter Prediction • Conclusion

  9. Parallel MassMatrix • Parallelize the full-scan search phase • Takes the longest time to execute • The rest of the phases are sequential • A split-merge approach is followed • The user can specify the number of splits • Splits are made based on specific tags • Index embedded in the file-split name • Other options also considered
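
The split step on this slide can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes the "specific tags" are the `BEGIN IONS` / `END IONS` markers that delimit spectra in .mgf files, and the function names and the `<stem>.split<index>.mgf` naming scheme are hypothetical.

```python
from pathlib import Path


def split_mgf(text: str, num_splits: int) -> list[str]:
    """Cut MGF text into at most num_splits contiguous chunks of whole spectra."""
    spectra, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "BEGIN IONS":        # assumed split tag: start of a spectrum
            current = [line]
        elif current is not None:
            current.append(line)
            if stripped == "END IONS":      # assumed split tag: end of a spectrum
                spectra.append("\n".join(current))
                current = None
    if not spectra:
        return []
    per_split = -(-len(spectra) // num_splits)  # ceiling division
    return [
        "\n".join(spectra[i:i + per_split]) + "\n"
        for i in range(0, len(spectra), per_split)
    ]


def write_splits(chunks: list[str], stem: str, out_dir: str) -> list[Path]:
    """Write each chunk to disk with its split index embedded in the file name."""
    paths = []
    for index, chunk in enumerate(chunks):
        path = Path(out_dir) / f"{stem}.split{index}.mgf"  # hypothetical naming scheme
        path.write_text(chunk)
        paths.append(path)
    return paths
```

Keeping the chunks contiguous means the index in the file name also records the order in which the pieces would need to be merged later.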

  10. Parallel MassMatrix (contd.) • Only the input file is split • Splitting the database leads to redundant results • Splitting both the input and the database has the same problem • The intermediate files are written to disk • Pointers are serialized • Written as comma-separated values • A Python script keeps polling the job queue to check whether the parallel phase has completed • Suspends the sequential phase until then
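
The polling behaviour described on this slide might look roughly like the sketch below. The slides do not show the actual script, so everything here is an assumption: `count_pending` is a hypothetical callable that reports how many parallel jobs are still queued or running (in a Condor setup it could wrap a queue query), and the 30-second interval is arbitrary.

```python
import time
from typing import Callable


def wait_for_parallel_phase(
    count_pending: Callable[[], int],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until no parallel jobs remain; return how many times we had to wait."""
    polls = 0
    while count_pending() > 0:   # parallel phase still running
        polls += 1
        sleep(poll_seconds)      # suspend before checking the queue again
    return polls


def run_workflow(count_pending, merge, sequential_phase):
    """Hold back the sequential phase until every split has finished."""
    wait_for_parallel_phase(count_pending)
    merged = merge()
    return sequential_phase(merged)
```

Passing `sleep` as a parameter is only a convenience for exercising the loop without real delays.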

  11. Parallel MassMatrix (contd.) • The intermediate files are read back in and re-indexed while merging • The merging process is complicated • Complex data structures (matrix of matrices) • Have to get inside each data structure to maximize them • Intermediate files are indexed among each other • While re-indexing, both the local and the global index are maintained • The data structures are also renumbered while merging
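
To make the local versus global index idea concrete, here is a deliberately simplified sketch. The real intermediate data is described as a matrix of matrices; this example assumes instead that each split yields a flat list of records whose position is its local index, which is an assumption made purely for illustration.

```python
def merge_with_global_index(split_results: list[list[object]]) -> list[dict]:
    """Merge per-split record lists (given in split order) and renumber them.

    Each merged entry keeps the split it came from and its local index,
    and gains a global index that is unique across all splits.
    """
    merged = []
    offset = 0  # number of records contributed by earlier splits
    for split_id, records in enumerate(split_results):
        for local_index, payload in enumerate(records):
            merged.append(
                {
                    "split": split_id,
                    "local": local_index,
                    "global": offset + local_index,
                    "payload": payload,
                }
            )
        offset += len(records)
    return merged
```

Because the offset depends on all earlier splits, the splits have to be merged in the order in which they were cut.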

  12. Parallel MassMatrix (contd.) • Intermediate files are merged in order of the split they process • Unnecessary intermediate files are not loaded back • Saves memory • Helps in case of large data files

  13. MassMatrix Flow (Parallel) (diagram) • Inputs: configuration file, input file, input database • Python script • split1, split2, ..., splitN, each processed by massmatrix • Merge • Sequential phase

  14. Experimental results (Parallelization) Experimental setup: • 8-core Intel Xeon node with 6 GB of DDR400 RAM • The theoretical database was 20 MB, in .fasta format • The code was run on 6 different datasets • Each had 50,000 records on average • Each was in .mgf format • Experiments were run for 1, 2, 4 and 8 splits • Run on a single node with 8 cores

  15. Experimental results (Parallelization) Execution times when datasets are run for 1, 2, 4 and 8 splits

  16. Experimental results (Parallelization) Execution times for datasets when run on 1, 2, 4 and 8 cores

  17. Background (Pegasus) • Used to help create an adaptive version of MassMatrix • A software system for managing workflows • Manages resources on local, grid and cloud platforms • Provides APIs to create workflows • Creates a DAG to represent dependencies • The DAG has an edge between two nodes if one depends on the other • Creates a plan for the execution of the application • Executes the application according to this plan
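
The DAG idea can be illustrated without Pegasus itself. The sketch below does not use the Pegasus API; it uses Python's standard `graphlib` module (Python 3.9+) to express the split, search, merge and sequential jobs as a dependency graph and derive one valid execution order. The job names are hypothetical.

```python
from graphlib import TopologicalSorter


def build_workflow(num_splits: int) -> dict[str, set[str]]:
    """Map each job to the set of jobs it depends on."""
    searches = [f"search_{i}" for i in range(num_splits)]
    graph: dict[str, set[str]] = {"split": set()}
    for job in searches:
        graph[job] = {"split"}              # each search needs the split output
    graph["merge"] = set(searches)          # merge waits for every search
    graph["sequential_phase"] = {"merge"}   # remaining phases run after the merge
    return graph


def execution_plan(graph: dict[str, set[str]]) -> list[str]:
    """Return one ordering that respects every dependency edge."""
    return list(TopologicalSorter(graph).static_order())
```

The search jobs have no edges between them, which is what allows a scheduler to run them in parallel.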

  18. Background (Condor) • Uses wrangler to start nodes in the cloud • New nodes are added to the cluster automatically • Uses Amazon private and public keys to identify the user • Configuration is specified in an XML file • Condor is the job scheduler used • Developed at the University of Wisconsin • Jobs are stored in a queue • Jobs are submitted from the queue to the cluster in FIFO order • Provides fault tolerance through checkpointing

  19. The Pegasus workflow Pegasus workflow showing the workflow of MassMatrix Application

  20. Parallel Pegasus workflow Pegasus workflow for parallel version of MassMatrix application

  21. Adaptive Resource Allocation • An approach for dynamic resource allocation • Decision based on rate of execution • Calculates number of additional resources to meet time constraint • Initial assumption that input is divided into equal splits • Decision made on the basis of execution time of initial N splits

  22. Adaptive Resource Allocation (contd.) • The code is initially run with N resources • In our case, N = 4 • Let $T_{\text{per\_split}}$ be the execution time of a single split • Let $T_{\text{constraint}}$ be the user-specified time constraint • Then $T_{\text{time\_constraint}} = T_{\text{constraint}} - (2 \times T_{\text{per\_split}})$ (1)

  23. Adaptive Resource Allocation (contd.) • Another N splits must have already started execution • Hence we do not consider them in the calculation • Hence, if we use N resources, the predicted execution time is $T_{\text{execution\_pred}} = T_{\text{per\_split}} \times (\text{split\_count} - 2 \times N)$ (2)

  24. Adaptive Resource Allocation (contd.) • Based on equations (1) and (2) we can calculate the number of nodes needed (equation not reproduced in this transcript) • $\text{Nodes}_{\text{required}}$ is the number of additional nodes that need to be spawned
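
Equations (1) and (2) can be turned into a small calculation. The final step below is an assumption, since the slide's formula for the required nodes is not in the transcript: it treats the value from equation (2) as the remaining work, divides it by the remaining time from equation (1), rounds up, and subtracts the N nodes already running. The function and argument names are hypothetical.

```python
import math


def additional_nodes(
    t_constraint: float, t_per_split: float, split_count: int, n_initial: int
) -> int:
    """Estimate how many extra nodes to spawn to meet the time constraint."""
    # Equation (1): time left once the first two rounds of N splits are accounted for.
    t_time_constraint = t_constraint - 2 * t_per_split
    # Equation (2): predicted time for the splits that have not started yet.
    t_execution_pred = t_per_split * (split_count - 2 * n_initial)
    if t_execution_pred <= 0:
        return 0  # nothing left beyond the splits already running
    if t_time_constraint <= 0:
        raise ValueError("time constraint cannot be met with this split time")
    # ASSUMPTION (not from the slides): total nodes = remaining work / remaining time.
    total_nodes = math.ceil(t_execution_pred / t_time_constraint)
    return max(0, total_nodes - n_initial)
```

For example, with a 600-second constraint, 100 seconds per split, 40 splits and N = 4, equation (1) gives 400 seconds, equation (2) gives 3200 seconds, and the sketch asks for 8 nodes in total, that is, 4 additional nodes. It ignores node start-up time and the fact that splits complete in whole rounds.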

  25. Adaptive Algorithm Algorithm showing the steps involved in calculating the additional resources needed to meet the time constraint

  26. Experimental Goals • Evaluate the efficiency of our system with different datasets • Show that the framework is effective • Calculates the additional nodes required • Meets the time constraints • Tested for different time constraints

  27. Experimental results (Adaptive) Experimental setup: • Cloud infrastructure: Amazon EC2 • A submit host is used to submit jobs to the cloud • Pegasus version 3.0.2 • Condor job scheduler version 7.5.6 • Results for 2 datasets and different time constraints

  28. Experimental Results (contd.) Results obtained when the algorithm is run for different time constraints on dataset1

  29. Experimental Results (contd.) Results obtained for dataset2 when run with the same time constraints

  30. Benefit function and Parameter prediction (QOS) Motivation: • Provide quality of service • Tradeoff between execution time and quality of results • Quality depends on the parameter values • Provide a way for the user to control the quality of results • Quality is defined as an equation in terms of the parameters • The user has the flexibility to decide which parameter is more important • Makes a prediction such that the execution time is as close as possible to the time constraint

  31. Benefit function and Parameter prediction (QOS) • Benefit function: an equation made of some or all parameters of the application • We use this equation to set the parameter importance • This is the minimal set of equations needed to obtain the required quality • The goal is to maximize this benefit function within the user-specified time constraint • Calculated for different parameter combinations • Decision made using tables constructed from data of previous executions • Hash tables are used

  32. Benefit function and Parameter prediction (QOS) • Tables contain parameter combination to execution time mappings and vice versa • Multiple datasets can be used for prediction • Parameters are mapped to average execution time • Reduces error percentage
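
The table-based prediction described on the last three slides could be sketched like this. It is an illustration under stated assumptions, not the thesis implementation: parameter combinations are plain tuples, the table is a Python dict (a hash table) mapping each combination to its average execution time across earlier runs, and the benefit function is supplied by the caller. The tie-break toward the slowest feasible combination is also an assumption, chosen to reflect the stated goal of staying close to the time constraint.

```python
from statistics import mean
from typing import Callable, Hashable, Iterable, Optional


def build_time_table(runs: Iterable[tuple[Hashable, float]]) -> dict:
    """Average the observed execution times per parameter combination.

    runs holds (parameter_combination, seconds) pairs from previous
    executions, possibly collected over several datasets.
    """
    grouped: dict = {}
    for params, seconds in runs:
        grouped.setdefault(params, []).append(seconds)
    return {params: mean(times) for params, times in grouped.items()}


def predict_parameters(
    time_table: dict, benefit: Callable[[Hashable], float], t_constraint: float
) -> Optional[Hashable]:
    """Pick the feasible combination with the highest benefit, or None."""
    feasible = {p: t for p, t in time_table.items() if t <= t_constraint}
    if not feasible:
        return None
    # Highest benefit first; among equals, prefer the time closest to the constraint.
    return max(feasible, key=lambda p: (benefit(p), feasible[p]))
```

A caller might weight the first parameter twice as heavily as the second with `benefit = lambda p: 2 * p[0] + p[1]`; those weights are only an example of how a user could express which parameter matters more.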

  33. Parameter prediction process

  34. Experimental Results • Experiments were conducted on a Linux desktop machine with 2 cores and 1 GB of memory • The tables are populated using two datasets, data1.mgf and data2.mgf • The parameter combinations are predicted for two other datasets, data3.mgf and data4.mgf

  35. Experimental Results Parameter prediction results when run for different benefit functions and constraints

  36. Experimental Results Parameter Prediction results for a different Benefit Function

  37. Conclusion • Presented a framework for dynamic execution of scientific workflows • A user-specified time constraint can be used to drive the allocation of resources • Effective dynamic allocation • Maximizing the benefit function • Parameter prediction within this value • Provides quality results based on user requirements

  38. Thank you
