Evaluating Distributed Platforms for Protein-Guided Scientific Workflow
XSEDE '14, July 13-18, 2014, Atlanta, GA, USA

Natasha Pavlovikj, Kevin Begcy, Sairam Behera, Malachy Campbell, Harkamal Walia, Jitender S. Deogun

University of Nebraska-Lincoln

Introduction

  • Gene expression and transcriptome analysis are among the main research focuses of many biologists and other scientists

  • The analysis of this so-called “big data” relies on a large and complex set of software tools

  • This increases the demand for powerful computational resources where the data can be stored and analyzed


Assembly Pipeline

  • Assembly of raw sequence data is a complex multi-stage process composed of preprocessing, assembling, and post-processing

  • An assembly pipeline simplifies the entire assembly process by automating its steps


blast2cap3

  • Assembling the filtered reads with multiple approaches produces highly redundant transcripts

  • The overlap-based assembly program CAP3 merges transcripts whose overlapping regions meet a specified identity

  • However, because most of the produced transcripts code for a protein, protein similarity should also be considered during the merging


blast2cap3

  • Blast2cap3 is a protein-guided assembly approach that first clusters the transcripts based on similarity to a common protein and then passes each cluster to CAP3

  • Blast2cap3 is a Python script written by Vince Buffalo of the Plant Sciences Department, UCD

  • A recent use of blast2cap3 on a wheat transcriptome assembly showed that it generates fewer artificially fused sequences and reduces the total number of transcripts by 8-9%


blast2cap3

  • The assembled transcripts are aligned with protein datasets closely related to the organism for which the transcripts are generated, and afterwards, transcripts sharing a common protein hit are merged using CAP3

  • The current implementation of blast2cap3 supports only serial execution
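
The clustering step described in these slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Vince Buffalo's implementation: it assumes the alignments are tab-separated with the transcript ID in the first column and the protein ID in the second, and it keeps only the first hit listed for each transcript. The function name `cluster_by_protein` and the sample IDs are hypothetical.

```python
from collections import defaultdict


def cluster_by_protein(hit_lines):
    """Group transcript IDs by the protein they hit.

    Assumes tab-separated lines whose first two columns are
    (transcript_id, protein_id). Only the first hit seen for each
    transcript is used, which is an illustrative simplification.
    """
    first_hit = {}
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        transcript_id, protein_id = fields[0], fields[1]
        first_hit.setdefault(transcript_id, protein_id)

    clusters = defaultdict(list)
    for transcript_id, protein_id in first_hit.items():
        clusters[protein_id].append(transcript_id)
    return dict(clusters)


# Hypothetical alignment lines (not taken from the study's data).
sample_hits = [
    "t1\tP100\t98.5",
    "t2\tP100\t97.0",
    "t3\tP200\t91.2",
    "t1\tP300\t80.0",  # a later hit for t1 is ignored
]
clusters = cluster_by_protein(sample_hits)

# Only clusters with two or more transcripts have anything to merge.
mergeable = {p: ts for p, ts in clusters.items() if len(ts) > 1}
print(mergeable)
```

In this sketch each entry of `mergeable` stands for one group of transcripts that would be handed to CAP3; how CAP3 is invoked is left out.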


Pegasus Workflow Management System

  • The modularity of blast2cap3 allows us to decompose the existing approach into multiple tasks, some of which can be run in parallel

  • The protein-guided assembly can be structured into a scientific workflow
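
One way to picture the decomposition is to distribute the protein clusters over a fixed number of independent tasks. The sketch below is an assumption about how such a split could be done, not the split used in the study; `split_into_tasks` and the cluster labels are hypothetical names.

```python
def split_into_tasks(clusters, n_tasks):
    """Distribute clusters round-robin across at most n_tasks lists."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    # Never create more tasks than there are clusters to process.
    tasks = [[] for _ in range(min(n_tasks, len(clusters)))]
    for index, cluster in enumerate(clusters):
        tasks[index % len(tasks)].append(cluster)
    return tasks


# Seven hypothetical clusters spread over three parallel tasks.
cluster_names = [f"cluster_{i}" for i in range(7)]
tasks = split_into_tasks(cluster_names, 3)
for task_number, task in enumerate(tasks):
    print(task_number, task)
```

Round-robin assignment keeps the number of clusters per task even, but it ignores cluster size, so task running times can still differ.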


Pegasus Workflow Management System

  • Pegasus WMS is a framework that automatically maps high-level scientific workflows organized as directed acyclic graphs (DAGs) onto a wide range of execution platforms, including clusters, grids, and clouds

  • Pegasus uses DAX (directed acyclic graph in XML) files to specify an abstract workflow

  • The abstract workflow describes all executable files and gives the logical names of the input files used by the workflow
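
To make the idea of an abstract workflow concrete, the sketch below builds a small DAX-style XML document with Python's standard library. It is deliberately simplified: the `adag`, `job`, `uses`, `child`, and `parent` element names follow the DAX layout as best recalled here, but the namespace, version, and other attributes a real DAX file needs are omitted, and real workflows would normally be generated with Pegasus's own API. The job IDs and executable names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_abstract_workflow(name, jobs, dependencies):
    """Build a simplified DAX-style XML tree.

    jobs: dict of job id -> (logical executable name, list of logical input files)
    dependencies: list of (parent id, child id) pairs
    """
    adag = ET.Element("adag", {"name": name})
    for job_id, (executable, inputs) in jobs.items():
        job = ET.SubElement(adag, "job", {"id": job_id, "name": executable})
        for file_name in inputs:
            # Files are referenced by logical name only; no physical paths here.
            ET.SubElement(job, "uses", {"name": file_name, "link": "input"})

    # Collect the parents of each child so every child gets one <child> element.
    parents_of = {}
    for parent_id, child_id in dependencies:
        parents_of.setdefault(child_id, []).append(parent_id)
    for child_id, parent_ids in parents_of.items():
        child = ET.SubElement(adag, "child", {"ref": child_id})
        for parent_id in parent_ids:
            ET.SubElement(child, "parent", {"ref": parent_id})
    return adag


# Hypothetical job and executable names; the input file names are the
# ones listed on the Experimental Data slide.
jobs = {
    "ID1": ("split_clusters", ["transcripts.fasta", "alignments.out"]),
    "ID2": ("run_cap3", []),
    "ID3": ("merge_results", []),
}
adag = build_abstract_workflow("blast2cap3", jobs, [("ID1", "ID2"), ("ID2", "ID3")])
print(ET.tostring(adag, encoding="unicode"))
```

The point of the abstract form is that nothing in it names a specific machine or file location; mapping those is left to the workflow system.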


blast2cap3 with Pegasus WMS

  • Each node represents a workflow task, while each edge represents a dependency between tasks

  • An archive contains all required prebuilt libraries and tools (Python, Biopython, CAP3)

  • The step of downloading and extracting this archive is defined as a task in the workflow

  • The Pegasus WMS implementation of blast2cap3 reduces the running time of the current serial implementation of blast2cap3 by more than 95%
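
The node-and-edge structure described on this slide can be illustrated with a small dependency graph. The shape below (fetch the tool archive, split the clusters, run several CAP3 tasks, merge the results) is an assumed fan-out and fan-in layout with hypothetical task names; the slides do not spell out the exact workflow graph. The sketch uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def build_workflow(n_parallel):
    """Return a mapping of task name -> set of tasks it depends on."""
    cap3_tasks = [f"cap3_{i}" for i in range(n_parallel)]
    graph = {"fetch_tools": set(), "split_clusters": {"fetch_tools"}}
    for task in cap3_tasks:
        graph[task] = {"split_clusters"}
    graph["merge_results"] = set(cap3_tasks)
    return graph


def execution_levels(graph):
    """Group tasks into levels; tasks within one level can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose parents are done
        levels.append(ready)
        sorter.done(*ready)
    return levels


levels = execution_levels(build_workflow(3))
for level in levels:
    print(level)
```

Only the middle level grows as the number of parallel tasks increases, which is where the reduction in running time over the serial version would come from in a layout like this.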


Execution Platforms

  • The resources that scientific workflows require can exceed the capabilities of the local computational resources

  • Scientific workflows are usually executed on distributed platforms, such as campus clusters, grids or clouds

  • Execution platforms used: Sandhills, OSG, and Amazon EC2


Sandhills: University of Nebraska Campus Cluster

  • Sandhills is one of the High Performance Computing (HPC) Clusters at the University of Nebraska – Lincoln Holland Computing Center (HCC)

  • Used by faculty and students

  • Sandhills was constructed in 2011 and it has 1440 AMD cores housed in a total of 44 nodes

  • Every new user account of HCC is required to be associated with a faculty or research group


OSG: Open Science Grid

  • OSG is a national consortium of geographically distributed academic institutions and laboratories that provide hundreds of computing and storage resources to OSG users

  • OSG is organized into Virtual Organizations (VOs)

  • OSG does not own any computing or storage resources, but allows users to use the resources contributed by the other members of the OSG and VOs

  • Every new user applies for an OSG certificate


Amazon EC2: Amazon Elastic Compute Cloud

  • Amazon Elastic Compute Cloud (Amazon EC2) is a large commercial Web-based service provided by Amazon.com

  • Users have access to virtual machine (VM) instances where they deploy VM images with customized software and libraries

  • Amazon EC2 is a scalable, elastic and flexible platform

  • Amazon EC2 users are billed hourly for the number and type of resources they use


Experiments

  • Investigate the behavior of the modified Pegasus WMS implementation of blast2cap3 when the workflow is composed of 30, 110, 210, 610, 1,010, and 2,010 tasks

  • Run the workflow multiple times on the different execution platforms to observe differences in workflow performance and in resource availability over time


Experiments

  • Compare the total workflow running time between different execution platforms

  • Examine the number of running versus the number of idle jobs over time for each workflow
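
The running-versus-idle comparison can be computed from per-job timestamps. The sketch below assumes each job record has a submit, start, and end time in seconds, and treats a job as idle while it is queued (submitted but not yet started). The record layout, the function name `running_and_idle`, and the sample numbers are all hypothetical and not taken from the study's logs.

```python
def running_and_idle(jobs, times):
    """Count running and idle jobs at each sample time.

    jobs: list of (submit, start, end) timestamps in seconds.
    A job is idle from submit until start and running from start until end.
    """
    counts = []
    for t in times:
        idle = sum(1 for submit, start, end in jobs if submit <= t < start)
        running = sum(1 for submit, start, end in jobs if start <= t < end)
        counts.append((t, running, idle))
    return counts


# Four hypothetical jobs, all submitted at time 0.
jobs = [(0, 0, 50), (0, 10, 40), (0, 30, 90), (0, 60, 100)]
counts = running_and_idle(jobs, times=range(0, 101, 20))
for t, running, idle in counts:
    print(t, running, idle)
```

A platform that starts jobs quickly shows a short idle curve; on an opportunistic platform the idle count can stay high while jobs wait for free slots.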


Experimental Data

  • Diploid wheat Triticum urartu dataset from NCBI

  • The assembled transcripts were generated using Velvet as a de novo assembler

  • These transcripts were aligned with protein datasets from organisms closely related to wheat (barley, Brachypodium, rice, maize, sorghum, Arabidopsis)

  • “transcripts.fasta”: 404 MB, 236,529 assembled transcripts

  • “alignments.out”: 155 MB, 1,717,454 protein hits


Comparing Running Time on Sandhills, OSG and Amazon EC2 for Workflows with Different Numbers of Tasks

  • [Figure slide; image not included in this transcript]


Comparing the Number of Running Jobs versus the Number of Idle Jobs Over Time for Workflows with Different Numbers of Tasks

  • [Six figure slides with this title; images not included in this transcript]


Cost Comparison of Different Execution Platforms

  • The most important difference between the commercial cloud and the academic distributed resources is the cost

  • Sandhills:

    • generally free resources

  • OSG:

    • completely free resources

  • Amazon EC2:

    • complex pricing model

    • 50 m1.large spot instances at $0.04 per hour, for a total cost of $122.84
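
The Amazon EC2 figures give an instance count, an hourly rate, and a total, but not the running time behind the total. As a rough reading only, the sketch below works backwards from the total, assuming a flat $0.04 per instance per hour and equal usage across all 50 instances. Spot prices vary over time and billing was by the hour, so the real usage may differ; `implied_usage` is a hypothetical helper name.

```python
def implied_usage(total_cost, price_per_instance_hour, n_instances):
    """Back out instance-hours and hours per instance from a total bill.

    Assumes a constant price and equal usage on every instance.
    """
    instance_hours = total_cost / price_per_instance_hour
    hours_per_instance = instance_hours / n_instances
    return instance_hours, hours_per_instance


instance_hours, hours_per_instance = implied_usage(122.84, 0.04, 50)
print(round(instance_hours), round(hours_per_instance, 2))  # prints: 3071 61.42
```

Under these assumptions the total corresponds to about 3,071 instance-hours, or roughly 61 hours of all 50 instances running together, presumably spread across the repeated experiment runs.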


Conclusion

  • Using more than 100 tasks in a workflow significantly reduces the running time on all execution platforms

  • Resource allocation on Sandhills and OSG is opportunistic, and resource availability changes over time

  • The results are almost constant when Amazon EC2 is used

  • No workflow failures were encountered on Sandhills or Amazon EC2


Conclusion

  • The predictability of the Amazon EC2 resources leads to better workflow running times when the cloud is used as the platform

  • For our blast2cap3 workflow, better running time and better usage of the allocated resources were achieved when Amazon EC2 was used

  • Given the cost of Amazon EC2, academic distributed systems can be a good alternative


Acknowledgments

  • University of Nebraska Holland Computing Center

  • Open Science Grid


Thank You

