
NetSolve

NetSolve. Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory http://www.cs.utk.edu/netsolve. Objectives. Harnessing vast computational resources on the network Hardware Software Convenient for scientific computing community


Presentation Transcript


  1. NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory http://www.cs.utk.edu/netsolve

  2. Objectives • Harnessing vast computational resources on the network, both hardware and software • Convenient for the scientific computing community • Reducing installation and programming overhead • Masking complexity related to distributed computing

  3. Computation-Sharing Models: Proxy Computing • Computation on the server (diagram labels: Client, Server, Data, Code)

  4. Computation-Sharing Models: Code Shipping • Computation on the client (diagram labels: Client, Server, Code, Data)

  5. Computation-Sharing Models: Remote Computation • Computation on the server (diagram labels: Client, Server, Data, Code)

  6. Design issues • Platform independence to accommodate heterogeneity • User friendly • Extensibility • Load balancing • Fault tolerance

  7. NetSolve Architecture “OS” Resources

  8. NetSolve Organization and Operation

  9. NetSolve Client Interface • Interfaces for C, Fortran, Java, Matlab, and Mathematica • Blocking call from Matlab:
       >> a = rand(100); b = rand(100,1);
       >> x = netsolve('ax = b', a, b);
     • Non-blocking call from Matlab:
       >> a = rand(100); b = rand(100,1);
       >> request = netsolve_nb('send', 'ax = b', a, b);
       >> x = netsolve_nb('probe', request);
       Not ready
       >> x = netsolve_nb('wait', request);
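The send / probe / wait pattern on this slide can be sketched in Python. This is only an analogy, not the NetSolve API: the names `send`, `probe`, `wait`, and `solve` are hypothetical, and a local thread pool stands in for the remote server.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: a local thread pool plays the role of the remote NetSolve server.
_pool = ThreadPoolExecutor(max_workers=4)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def send(problem, *args):
    """Submit a request and return immediately with a handle."""
    return _pool.submit(problem, *args)

def probe(request):
    """Return True if the result is ready, without blocking."""
    return request.done()

def wait(request):
    """Block until the result is available and return it."""
    return request.result()

if __name__ == "__main__":
    request = send(solve, [[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
    print("ready" if probe(request) else "Not ready")
    print(wait(request))
```

The client is free to do other work, or to send more requests, between `send` and `wait`.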

  10. NetSolve Wrappers • Problem description file for extensibility:
       @PROBLEM ipars
       @INCLUDE "ipars.h"
       @LIB /home/user/lib/libipars.a
       @DESCRIPTION Parallel Sub-Surface Flow Simulator
       @INPUT 2
       @OBJECT STRING CHAR model
       @OBJECT FILE CHAR infile
     • Compiled into wrappers around scientific libraries • XDR for platform-independent data transfer
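To illustrate what platform-independent transfer means for numeric data, here is a minimal sketch that encodes an array of doubles in the XDR wire layout for a variable-length array: a 4-byte big-endian element count followed by 8-byte big-endian IEEE 754 doubles. The function names are hypothetical, and this is not NetSolve's own marshalling code.

```python
import struct

def encode_double_array(values):
    """Encode doubles as a 4-byte big-endian count plus big-endian doubles."""
    count = len(values)
    return struct.pack(">I", count) + struct.pack(f">{count}d", *values)

def decode_double_array(payload):
    """Decode a payload produced by encode_double_array."""
    (count,) = struct.unpack_from(">I", payload, 0)
    return list(struct.unpack_from(f">{count}d", payload, 4))

if __name__ == "__main__":
    wire = encode_double_array([1.0, -2.5, 3.25])
    print(len(wire), decode_double_array(wire))
```

Because the byte order and field widths are fixed by the format rather than by the host, a little-endian client and a big-endian server decode the same values.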

  11. NetSolve Load Balancing • Assigning a task to the “best” machine • Establishing a performance model: network delay, server properties, task properties • Measuring and monitoring dynamic system states • Load balancing at a finer granularity • Parallelism through the non-blocking interface • Task migration
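A performance model of this kind can be sketched as an estimated completion time per server: transfer time plus compute time, with the task going to the server that minimizes it. The formula, field names, and numbers below are illustrative assumptions, not the model NetSolve actually uses.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    latency_s: float              # network delay to the server
    bandwidth_bytes_per_s: float  # network throughput to the server
    peak_flops: float             # server property: raw speed
    load: float                   # dynamic state: number of competing jobs

def estimated_time(server, data_bytes, flops):
    """Assumed model: transfer time plus compute time under the current load."""
    transfer = server.latency_s + data_bytes / server.bandwidth_bytes_per_s
    compute = flops * (1.0 + server.load) / server.peak_flops
    return transfer + compute

def pick_best(servers, data_bytes, flops):
    """Return the server with the smallest estimated completion time."""
    return min(servers, key=lambda s: estimated_time(s, data_bytes, flops))

if __name__ == "__main__":
    servers = [
        Server("far-fast", 0.05, 1e6, 1e9, 0.0),
        Server("near-slow", 0.001, 1e8, 1e8, 0.0),
    ]
    # Data-heavy task: the nearby server wins despite being slower.
    print(pick_best(servers, data_bytes=8e6, flops=1e8).name)
    # Compute-heavy task: the faster server wins despite the slow link.
    print(pick_best(servers, data_bytes=8e3, flops=1e10).name)
```

The two calls show why task properties belong in the model: the "best" machine depends on the ratio of data moved to work done.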

  12. NetSolve Fault Tolerance • Inter-server fault tolerance: fault tolerance among NetSolve servers • Intra-server fault tolerance: fault tolerance within a NetSolve server

  13. NetSolve Fault Tolerance: Inter-server Fault Tolerance • Performed by NetSolve agents • Basic approach: failure detection + task reallocation; overload detection + task migration • Introducing NetSolve storage servers: store checkpoints or any information related to fault tolerance (must be platform-independent); no reliance on the failed or overloaded server for task migration
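The basic approach, failure detection followed by task reallocation, can be sketched as a loop over a ranked list of candidate servers. Everything here is hypothetical: servers are modeled as plain callables, and `ServerFailure` stands in for whatever signal the agent uses to detect a failure.

```python
class ServerFailure(Exception):
    """Hypothetical signal that a server failed while running a task."""

def run_with_reallocation(task, servers):
    """Try each candidate server in order; reallocate the task on failure.

    Returns (result, failures), where failures lists the detected errors.
    """
    failures = []
    for server in servers:
        try:
            return server(task), failures
        except ServerFailure as exc:
            failures.append(str(exc))  # failure detected, move to next server
    raise RuntimeError("all candidate servers failed: " + "; ".join(failures))

def flaky_server(task):
    raise ServerFailure("flaky_server is unreachable")

def healthy_server(task):
    return task * 2

if __name__ == "__main__":
    result, failures = run_with_reallocation(21, [flaky_server, healthy_server])
    print(result, failures)
```

This sketch restarts the task from scratch on the next server; with checkpoints held on a separate storage server, the second server could instead resume from the last saved state.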

  14. NetSolve Fault Tolerance: Intra-server Fault Tolerance • Not a new problem • Could be invisible to NetSolve • Can take advantage of platform-specific features for fault tolerance • Possible integration with inter-server fault tolerance

  15. Diskless Checkpointing: Checksums and Reverse Computation • Diskless checkpointing eliminates the need for stable storage • N servers + a checkpointing server • At any point, consistent checkpoints are taken at the N servers (stored in memory) • A checksum of the checkpoints is stored at the checkpointing server • Rollback using reverse computation • State recovery using the checksum
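State recovery from a checksum can be sketched with XOR parity, one common choice for diskless checkpointing; the slide does not say which checksum is used, so the parity scheme and the function names are assumptions. If one of the N servers is lost, its checkpoint is the XOR of the checksum with the surviving checkpoints.

```python
from functools import reduce

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("checkpoints must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))

def make_checksum(checkpoints):
    """Parity held by the checkpointing server: XOR of all N checkpoints."""
    return reduce(xor_bytes, checkpoints)

def recover(surviving_checkpoints, checksum):
    """Rebuild the single lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, surviving_checkpoints, checksum)

if __name__ == "__main__":
    checkpoints = [b"abcd", b"wxyz", b"1234"]  # in-memory state of N = 3 servers
    checksum = make_checksum(checkpoints)
    # Server 1 fails; rebuild its state from servers 0 and 2 plus the checksum.
    print(recover([checkpoints[0], checkpoints[2]], checksum))
```

A single parity block tolerates the loss of one server at a time; it cannot recover two simultaneous failures.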

  16. Applications • MCell with NetSolve: large code, small data • Matlab with NetSolve: tradeoffs between parallelism and overhead • IPARS with NetSolve • ImageVision with NetSolve

  17. Integration with ScaLAPACK

  18. Integration with Condor

  19. Integration with Ninf

  20. Conclusion • An interesting infrastructure for sharing computational resources, both software and hardware • Convenience, performance, and reliability • Playground for fault tolerance, both general and specific
