
Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases



Presentation Transcript


  1. Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases
  Vinay D. Shet, CMSC 838 Presentation
  Authors: Allison Waugh, Glenn A. Williams, Liping Wei, and Russ B. Altman

  2. Motivation
  • Biological databases are growing at a very high rate
    • The Protein Data Bank (PDB) grew from 5,811 entries to 12,110 in three years
  • Computational tools are required to access and analyze this data efficiently
  • Typical data analyses:
    • Linear scans across the database, checking every entry for features of interest
    • "All-versus-all" comparisons within the database
  • High-performance distributed computing resources can play an important role in these analyses
  • The authors use a distributed computing environment, LEGION, to enable large-scale analysis of the PDB

  3. Motivation
  • Similar to the evaluation of the threaded BLAST project
    • We ran threaded BLAST on a Sun SMP with 24 processors
  • The authors run a program called FEATURE over the LEGION framework
    • Can access hundreds of CPUs worldwide
    • Can spawn sequential instances of FEATURE on all of them (see the sketch below)
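Because each FEATURE instance is sequential, the overall scan is embarrassingly parallel: one job per PDB entry. The following is a minimal sketch of that pattern, assuming a hypothetical `feature` command-line tool and a local directory of PDB files; it illustrates the farm-out idea only and is not LEGION's actual interface.

```python
# Hedged sketch: farm one sequential FEATURE run per PDB entry across a
# pool of worker processes. The "feature" command, its flags, and the
# directory layout are illustrative assumptions.
import glob
import subprocess
from multiprocessing import Pool

PDB_DIR = "pdb_entries"   # assumed local directory of *.pdb files
N_WORKERS = 50            # the comprehensive scan in the paper used 50 CPUs

def scan_entry(pdb_path):
    """Run one sequential FEATURE instance on a single PDB entry."""
    result = subprocess.run(["feature", "--model", "calcium", pdb_path],
                            capture_output=True, text=True)
    return pdb_path, result.returncode

if __name__ == "__main__":
    entries = sorted(glob.glob(PDB_DIR + "/*.pdb"))
    with Pool(processes=N_WORKERS) as pool:
        for path, code in pool.imap_unordered(scan_entry, entries):
            if code != 0:
                print("run-time failure on", path)  # cf. the failures reported on slide 11
```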

  4. Talk Overview
  • Motivation
  • Background
    • LEGION
    • FEATURE
  • Methods
  • Experiments
  • Results
  • Discussion
  • Related work
  • Observations

  5. Background
  • LEGION (Worldwide Virtual Computer)
    • Metacomputing environment composed of geographically distributed, heterogeneous collections of workstations and supercomputers
    • Connects these resources to make up a single, worldwide, virtual computer
    • Coordinates large numbers of parallel jobs on a mixture of processors: SMPs, MPPs, and PCs on any network
    • Provides the software infrastructure so that a system of heterogeneous, geographically distributed, high-performance machines can interact seamlessly
    • No manual installation of binaries across multiple platforms (LEGION handles this automatically)

  6. Background
  • LEGION
    • LAM: an MPI implementation for workstation clusters
    • Legion supports transparent scheduling, data management, fault tolerance, site autonomy, a single file name space, efficient scheduling, comprehensive resource management, and a wide range of security options

  7. Background
  • FEATURE
    • A site characterization and recognition system
    • A site is a microenvironment distinguished by some structural or functional role
    • Identifies functional or structural sites of interest in a query protein

  8. Background
  • FEATURE
    • Measures spatial distributions of chemical and physical properties to create a statistical model of a microenvironment
    • Compares regions of the query protein with known sites and control non-sites, and assigns scores indicating the likelihood that a region is a site (a scoring sketch follows this slide)
    • Produces a list of potential site locations with corresponding scores
    • Has been used to recognize ion, ligand, and enzyme binding sites
    • FEATURE is a typical data-driven algorithm, requiring large data storage and efficient data analysis
    • Requires 12 hours on a single processor to evaluate 580 non-redundant PDB entries
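To make the "compare against sites and non-sites and assign a score" step concrete, here is a minimal sketch of a naive log-odds score over property counts in radial shells around a candidate location. The shell layout, property names, and smoothing are illustrative assumptions, not FEATURE's actual statistical model.

```python
# Toy log-odds site score over (shell, property) counts. Higher score
# means the query microenvironment looks more like known sites than
# like control non-sites. All numbers below are made up for illustration.
import math

def log_odds_score(query_counts, site_freq, nonsite_freq, eps=1e-6):
    """Sum count * log( P(cell | site) / P(cell | non-site) ) over all cells."""
    score = 0.0
    for cell, count in query_counts.items():
        p_site = site_freq.get(cell, eps)
        p_non = nonsite_freq.get(cell, eps)
        score += count * math.log(p_site / p_non)
    return score

# Usage with two radial shells and one property ("negative charge"):
query   = {("shell1", "neg_charge"): 3, ("shell2", "neg_charge"): 1}
site    = {("shell1", "neg_charge"): 0.8, ("shell2", "neg_charge"): 0.4}
nonsite = {("shell1", "neg_charge"): 0.1, ("shell2", "neg_charge"): 0.3}
print(log_odds_score(query, site, nonsite))  # positive => more site-like
```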

  9. Methods
  • FEATURE run on all protein entries in the May 2000 PDB
    • Searched for potential calcium binding sites
    • FEATURE has 90% sensitivity and 100% specificity for this task
  • Three experiments conducted:
    • Sequential scan of a PDB subset using a single processor
    • Comprehensive scan of the PDB using the LEGION system with 50 processors
    • Set of LEGION runs over a constant PDB subset with a varying number of processors
  • Input parameters to FEATURE and the statistical model for Ca remained constant

  10. Methods
  • Experiments
    • Sequentially scanned an arbitrary 726 proteins from the PDB
      • Runs made on a single-processor Sun E450 machine with a 300 MHz UltraSPARC CPU
    • Comprehensive scan of all proteins (10,996 total) in the PDB
      • Maximum number of processors: 50
      • FEATURE code compiled for various platforms so binaries can run on different machines across LEGION
    • Scanned a subset of proteins with a varying number of processors (a timing sketch follows this slide)
      • Arbitrarily selected 4,997 proteins for each run
      • Varied the number of processors over the values 20, 40, 60, and 80
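The third experiment fixes the workload and sweeps the worker count to expose the scaling curve. A hedged sketch of that measurement loop is below; the per-entry work function is a CPU-burning stand-in for one sequential FEATURE scan, not the real program.

```python
# Sketch: time a fixed workload at several worker counts (20/40/60/80,
# as in the paper). do_entry is only a stand-in for one FEATURE run.
import time
from multiprocessing import Pool

def do_entry(i):
    """Stand-in for scanning one protein; just burns some CPU."""
    return sum(j * j for j in range(50_000))

if __name__ == "__main__":
    workload = list(range(4997))            # 4,997 proteins, as in the paper
    for n_workers in (20, 40, 60, 80):      # processor counts from the paper
        start = time.perf_counter()
        with Pool(processes=n_workers) as pool:
            pool.map(do_entry, workload)
        print(f"{n_workers} workers: {time.perf_counter() - start:.2f} s")
```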

  11. Results
  • FEATURE reported six run-time failures due to non-standard PDB file formats during the sequential run
  • FEATURE also produced run-time assertion failures, illegal instructions, or segmentation faults during the second experiment

  12. Results

  13. Discussion
  • FEATURE performance deteriorates once the number of processors exceeds 60
  • The optimal maximum number is constrained by:
    • the client's process table, which keeps track of each LEGION process spawned
    • the amount of memory available to support spawned processes
  • Thus even if LEGION contains hundreds of nodes, users cannot exploit all of them (a sketch of capping in-flight jobs follows this slide)
  • LEGION also provides minimal fault tolerance (if any instance fails, the user must wait until everything has finished before re-spawning it)
  • The authors maintained a local copy of the database but concede that this is not a realistic situation because:
    • updates to the PDB occur frequently
    • it consumes a lot of disk space
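One way to respect the client-side limits described above is to cap the number of in-flight spawned jobs. The sketch below does this with a bounded thread pool, each thread holding at most one child process; the cap value, the `feature` command, and the file names are illustrative assumptions rather than LEGION's mechanism.

```python
# Hedged sketch: bound concurrent spawned jobs so the client's process
# table and memory are not exhausted. "feature" and the paths are toy
# placeholders; 60 mirrors where the paper sees performance deteriorate.
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 60

def spawn(pdb_path):
    """Each worker thread holds at most one child process at a time, so at
    most MAX_IN_FLIGHT instances ever sit in the client's process table."""
    return subprocess.run(["feature", pdb_path]).returncode

def scan(paths):
    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        return list(pool.map(spawn, paths))

if __name__ == "__main__":
    print(scan(["1abc.pdb", "2xyz.pdb"]))   # toy input paths
```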

  14. Related Work
  • Threaded BLAST and MPI BLAST
    • The authors' work is similar to threaded BLAST
    • MPI BLAST is a parallelized version of BLAST, so a single query can be split across multiple processors
    • FEATURE is not truly parallelized

  15. Observations
  • Running CPU-intensive tasks over many processors is clearly useful
  • However, LEGION does not scale well: performance degrades beyond 60 processors
  • The authors have not exploited true parallelism within FEATURE
  • It seems to me that there is a lot of potential to parallelize FEATURE, given that many candidate sites can be examined simultaneously
  • What would the performance gain be in a parallelized version?

  16. Questions
