
Clusters Are: Too Big, Too Small, Too Expensive and You Probably Don’t Own One Anyway





  1. Many Researchers Miss Valuable PTM Information for Lack of Compute Power

Post-translational modifications (PTMs) of proteins are a valuable source of biological insights, and mass spectrometry is one of the few techniques able to reliably prospect for and identify PTMs. Yet many valuable data sets are not searched for PTMs because of computational constraints on investigators who have no claim to significant time on a compute cluster. MR-Tandem adapts the popular X!Tandem peptide search engine to work in the Amazon Web Services (AWS) cloud, so that investigators can easily and inexpensively create their own temporary compute clusters on demand and quickly conduct searches that would otherwise be prohibitively time consuming.

Clusters Are: Too Big, Too Small, Too Expensive and You Probably Don't Own One Anyway

A compute cluster is never the right size: it is either overburdened or underutilized. The moment the parts are ordered it starts to become obsolete. The moment it is installed (usually quite a while after the parts are ordered), components begin to age and eventually fail. Everything must be backed up regularly and off site; in practice this doesn't always happen, especially as data volumes grow. Clusters are a pain to own. Yet the only thing worse than owning a cluster is not owning one, and having to negotiate for time on a shared cluster. What if you could have your own perfectly sized cluster that exists only when you need it, with your data backed up in three different data centers automatically?

Easily Launch Multiple Searches

AWS bills by the hour (1 minute of computation costs the same as 60), so it makes sense to keep a cluster running for multiple searches instead of bringing a cluster up and down for each one. MR-Tandem will accept a file containing a list of parameter files in place of an actual parameter file; each search is then executed in turn on the cluster before it is finally shut down.
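The batch behavior described above (a list file naming one parameter file per line, with each search run in turn on a single long-lived cluster) can be sketched as follows. The function names and the `run_search` callable here are illustrative assumptions, not the actual MR-Tandem API; only the list-of-parameter-files behavior comes from the poster.

```python
def read_search_list(list_path):
    """Return the parameter-file paths named in a batch list file,
    skipping blank lines and lines starting with '#'.
    (Hypothetical helper; the real script's parsing may differ.)"""
    paths = []
    with open(list_path) as f:
        for line in f:
            entry = line.strip()
            if entry and not entry.startswith("#"):
                paths.append(entry)
    return paths

def run_batch(list_path, run_search):
    """Run each search in turn; the cluster is assumed to stay up for
    the whole batch and is shut down only after the last search."""
    return [run_search(params) for params in read_search_list(list_path)]
```

Because AWS bills whole hours, running three 20-minute searches this way costs the same as running one of them on a freshly launched cluster.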
Works With Any Hadoop Cluster

MR-Tandem also works with Hadoop clusters other than AWS Elastic MapReduce. Operation is the same, except that your files are stored in HDFS instead of S3, and backups are your cluster administrator's business.

Enter the Cloud

MR-Tandem uses Amazon Web Services (AWS) Simple Storage Service (S3) and Elastic MapReduce (EMR) to create on-demand compute clusters and secure, redundant data stores. Unlike a traditional cluster, your up-front costs are zero and you never have to replace a burned-out node. Your data is automatically backed up across multiple data centers. You don't have to share your cluster with anybody, and you don't have to ask anybody for time on it. Your cluster can be just as big or small as you need it to be for the job at hand, and when the job is complete the cluster simply goes away. Some of the biggest names on the web (Yelp, Netflix, IMDb, etc.) use AWS; they decided AWS could manage their mission-critical compute and storage infrastructure better and more cost effectively than they could themselves. Is your lab going to do better than that?

MR-Tandem Drops In Where You Already Use X!Tandem

MR-Tandem uses exactly the same parameter files as your local copy of X!Tandem, so it's easy to switch back and forth between local and cloud-based X!Tandem searches. MR-Tandem places its results on your computer in the same place your local copy of X!Tandem would, and all file references in the results are in terms of your local file system. MR-Tandem uses a Python script on your local computer to spin up a cluster in AWS Elastic MapReduce. Any data files that aren't already in the cloud are compressed and transferred, your search (or searches) are performed, and results are copied back to your computer. Your data files and results are also stored in Amazon's S3 Simple Storage Service, which replicates them in three different data centers.

Parallel vs. Simultaneous Search

There are many systems that will launch multiple simultaneous searches on a cluster. While this is obviously faster than running them one at a time, it still doesn't address the problem of very thorough searches that take hours or even days. The table to the right shows the completion time for a 233 MB mzXML file searched against a 32.5 MB database with PTMs using MR-Tandem. The single-node case was run in traditional X!Tandem mode with two threads; the multinode cases are all single threaded. Note that as node counts become very high, the task of processing the database to create theoretical spectra begins to eclipse the task of comparing them to the actual spectra, and each node is doing essentially identical work. Adding more nodes doesn't help at that point.

Simple Setup

MR-Tandem is a Python script that manages your connection to AWS or any Hadoop cluster. A convenient installer for Windows provides the proper version of Python and the needed Python modules. By default, MR-Tandem provisions the Hadoop-enabled X!Tandem executable from a public S3 site. You can override this if needed, and code for building your own X!Tandem executable is in the Sashimi project on SourceForge. The MR-Tandem Python script can be found in the Insilicos Cloud Army (ICA) project: http://ica.svn.sourceforge.net/viewvc/ica/trunk

MR-Tandem: Parallel X!Tandem Using Hadoop MapReduce on Amazon Web Services
Brian S. Pratt, J. Jeffry Howbert, Natalie I. Tasman, Erik J. Nilsson
Insilicos LLC, Seattle WA

Why Hadoop?

MR-Tandem isn't the only parallel implementation of X!Tandem. Standard X!Tandem already supports threaded execution, which is useful on multicore processors. The X!!Tandem project uses MPI to spread that threading model across multiple compute nodes, and MR-Tandem uses Hadoop to do the same thing. All three yield identical results as long as the thread/node counts are the same. The problem with X!!Tandem is that MPI is notoriously brittle as cluster size goes up.
Specialized hardware is needed to satisfy MPI's intolerance of network latency, and a failure anywhere in the MPI ring causes the entire search to fail. In practice we find it very difficult to instantiate an MPI ring of more than about 10 nodes on AWS. MR-Tandem's use of Hadoop puts fault tolerance at the core of its design, and MR-Tandem will happily run on commodity hardware. If a worker node fails for any reason, its workload is transparently reassigned and the search completes despite the failure.

Other Insilicos Cloud Army Projects

Insilicos Cloud Army (ICA) was developed as a general-purpose framework for parallel computing using AWS. It manages the uploading of data files and computational code from a user's local computer to AWS, the distribution of work across an AWS cluster, and the compilation and return of results to the local computer. It has been adapted to support cluster communication using both MPI and Elastic MapReduce, and it can deploy either self-contained compute engines like X!Tandem or user-defined code. An example of another ICA project is Ensemble Cloud Army (ECA), which is designed for parallel computation of machine learning ensembles (e.g. ensembles of decision trees or neural networks). Ensembles achieve classification accuracy superior to that of a single model by running large numbers (hundreds to thousands) of base classifiers and aggregating their predictions via a voting process. The training and prediction processes of each base classifier are independent of the other base classifiers, so the ensemble as a whole is ideally suited to parallelization on a cluster. In the current implementation of ECA, machine learning computations are driven by scripts in the statistical language R. The figure below shows typical improvements in accuracy obtained with ECA on large datasets (covertype = 581,012 land usage records; Jones = 227,260 protein secondary structure records; 90 to 180 base classifiers, 3 to 60 cluster nodes). Speedups were qualitatively similar to those seen with MR-Tandem.

Acknowledgments

It all begins with X!Tandem, of course: http://www.thegpm.org/TANDEM/index.html
The X!!Tandem project provided insights and some code for expanding the X!Tandem threading model onto networked compute nodes: http://wiki.thegpm.org/wiki/X!!Tandem
The Ensemble Cloud Army project provided valuable insights into distributed computing on AWS: http://sourceforge.net/projects/ica/
This work was funded in part by NIH grant HG006091.
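The scaling plateau noted under "Parallel vs. Simultaneous Search" can be captured with a simple fixed-cost model: if every node must process the full database into theoretical spectra (a fixed per-node cost) while spectrum comparison divides evenly across nodes, completion time is roughly db_cost + compare_cost / n. The numbers below are purely illustrative, not the poster's measured timings.

```python
def completion_time(n_nodes, db_cost, compare_cost):
    """Toy model: every node repeats the database-to-theoretical-spectra
    work (db_cost), while the comparison work divides across nodes."""
    return db_cost + compare_cost / n_nodes

# Illustrative costs only: with a fixed 10-unit database pass and
# 600 units of divisible comparison work, speedup flattens out as
# the fixed pass comes to dominate.
times = {n: completion_time(n, db_cost=10.0, compare_cost=600.0)
         for n in (1, 10, 60, 240)}
```

Going from 60 to 240 nodes in this model buys far less than going from 10 to 60, matching the observation that beyond some node count every node is doing essentially identical work.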

  2. It’s just X!Tandem • No new algorithms • No new search settings • Existing multithread model running on networked nodes instead of onboard cores • So it’s actually just X!!Tandem • Except it’s Hadoop, not MPI

  3. Hadoop? • Hadoop is an open-source implementation of the MapReduce programming model invented at Google • Classically used for text search, but also a robust general framework for distributed computing • Designed for cheap commodity hardware, and lots of it • More scalable and fault tolerant than MPI
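For readers new to the model, the classic MapReduce illustration is word counting: a map step emits (key, value) pairs, a shuffle groups the pairs by key, and a reduce step combines each group. This is a single-process sketch of the idea Hadoop distributes across nodes; it is not MR-Tandem code.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values -- here, by summing counts."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["to be or not to be"])))
```

In MR-Tandem's setting the "values" are spectra and scoring work rather than word counts, but the same map/shuffle/reduce skeleton is what lets Hadoop reassign a failed node's share transparently.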

  4. Motivation • Data is just getting bigger, searches getting longer • Researchers short on compute power may shortchange themselves on search: • Overly selective database • Insufficient decoys • Too few PTMs

  5. The Problem With Clusters • Too small • Too big • Too expensive • You don’t own one

  6. Why Amazon? • At about $0.10 per hour per core you probably can’t run a cluster as cheaply as they can • Especially since you have no up front costs • Data and results stored in 3 data centers • Good enough for Netflix, probably good enough for the average lab • MR-Tandem will run on any Hadoop system, though
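To make the "about $0.10 per hour per core" figure concrete, here is back-of-the-envelope arithmetic, remembering that AWS bills by the whole hour, so partial hours round up. The rate, cluster size, and runtime are illustrative assumptions, not quoted prices.

```python
import math

def cluster_cost(nodes, cores_per_node, hours, rate_per_core_hour=0.10):
    """Estimate an on-demand cluster bill; fractional hours are
    rounded up because AWS bills by the whole hour."""
    billed_hours = math.ceil(hours)
    return nodes * cores_per_node * billed_hours * rate_per_core_hour

# e.g. a hypothetical 20-node, 2-core cluster running a 90-minute
# search bills 2 full hours: 20 * 2 * 2 * $0.10 = $8.00
example = cluster_cost(nodes=20, cores_per_node=2, hours=1.5)
```

With zero up-front cost, even a one-off thorough PTM search on a sizable cluster lands in single-digit dollars at this rate, which is the economic argument the slide is making.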

  7. Cloud Computing • Computing as a utility • Prediction: next wave of young investigators will never own clusters • Prediction: granting agencies will soon view compute power as a consumable – more like a reagent than a mass spec
