Crystal Ball Panel

Al Geist, ORNL. SOS 7, March 6, 2003. ORNL Heterogeneous Distributed Computing Research.


Presentation Transcript


  1. SOS 7 Crystal Ball Panel: ORNL Heterogeneous Distributed Computing Research. Al Geist, ORNL, March 6, 2003.

  2. Look into the Future
     - Federated tera-clusters
     - Petascale systems
     - Adaptable software
     - HPC Linux
     - Fault tolerance
     - High-performance I/O
     (Eight Ball: "Reply Hazy, Try Again")

  3. Scalable Systems Software for Terascale Centers
     Problem areas: resource management, accounting & user management, system monitoring, system build & configure, job management.
     Participants: ORNL, ANL, LBNL, PNNL, SNL, LANL, Ames, IBM, Cray, Intel, Unlimited Scale, NCSA, PSC, SDSC.
     Goal: collectively (with labs, NSF centers, and industry) define standard interfaces between system components for interoperability, and create scalable, standardized management tools for efficiently running our large computing centers.
     Part of the DOE SciDAC effort: www.scidac.org/ScalableSystems

  4. Progress So Far on the Integrated Suite
     [Architecture diagram; working components and interfaces shown in bold on the slide.]
     Grid interfaces and meta services: Meta Scheduler, Meta Monitor, Meta Manager.
     Components: Accounting, Scheduler, System & Job Monitor, Node State Manager, Service Directory, Event Manager, Node Configuration & Build Manager, Allocation Management, Usage Reports, Validation & Testing, Process Manager, Job Queue Manager, Hardware Infrastructure Manager, Checkpoint/Restart.
     Infrastructure: authentication and communication (flagged "Important!" on the slide).
     Components talk through standard XML interfaces and can be written in any mixture of C, C++, Java, Perl, and Python; a sketch of such a message follows.
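
The slides don't specify the wire format, so here is a minimal sketch of what one such XML interface message could look like: a hypothetical registration that a component sends to the Service Directory. The element names, fields, and helper functions are illustrative assumptions, not the project's actual schema; the point is only that any language with an XML parser can join the suite.

```python
# Hypothetical XML component-interface message in the spirit of the suite.
# Element/attribute names are assumptions for illustration, not the real schema.
import xml.etree.ElementTree as ET

def build_registration(component: str, host: str, port: int) -> bytes:
    """Build the registration message a component might send to the Service Directory."""
    msg = ET.Element("register")
    ET.SubElement(msg, "component").text = component
    ET.SubElement(msg, "location", host=host, port=str(port))
    return ET.tostring(msg, encoding="utf-8")

def parse_registration(payload: bytes) -> dict:
    """Decode the same message on the directory side."""
    msg = ET.fromstring(payload)
    loc = msg.find("location")
    return {"component": msg.findtext("component"),
            "host": loc.get("host"),
            "port": int(loc.get("port"))}

if __name__ == "__main__":
    wire = build_registration("event-manager", "node042", 7620)
    print(wire.decode())             # what would travel over the authenticated channel
    print(parse_registration(wire))  # {'component': 'event-manager', ...}
```

Because the interface is plain XML carried over a common, authenticated communication layer, a Perl scheduler and a C++ process manager can interoperate without sharing any code.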

  5. Underneath It All: What Will the Scalable High-Performance OS Be?
     Rogue OS services and/or daemons are cited as a problem by existing computing centers.
     Wish list: single system image, adaptive OS, asymmetric kernels, a scalable file system.
     Candidates: Linux, a lightweight kernel (like Red, BG/L), the Scyld approach, other? See the Fast-OS effort.
     A sketch of one way to measure such interference follows.
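
One common way centers quantify rogue-daemon interference is a fixed-work-quantum probe: time many short, identical compute loops and treat the stretched outliers as detours taken by the OS. The sketch below assumes that idea; the loop size and outlier threshold are illustrative choices, not a standard benchmark.

```python
# Minimal OS-interference ("noise") probe: time identical fixed-work quanta
# and count the ones stretched far beyond the median. Loop size and the 10x
# threshold are illustrative assumptions.
import time

def probe(iterations=100_000, work=1_000):
    """Return elapsed nanoseconds for each fixed-work quantum."""
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter_ns()
        x = 0
        for i in range(work):        # fixed amount of pure compute
            x += i
        samples.append(time.perf_counter_ns() - t0)
    return samples

if __name__ == "__main__":
    s = sorted(probe())
    median = s[len(s) // 2]
    detours = sum(1 for v in s if v > 10 * median)   # quanta hit by interference
    print(f"median quantum: {median} ns, detours: {detours}, worst: {s[-1]} ns")
```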

  6. Scale Up and Fall Down
     [Figure: time vs. scale; system MTBF falls while checkpoint/restart time rises until the curves cross.]
     Fault tolerance is a serious issue when scaling to 100 TF and beyond; RAS becomes critical.
     Checkpointing eventually becomes ineffective (see the sketch below), so a fault-tolerance overhaul is needed.
     Needs: adaptive runtime, MPI fault tolerance, new FT paradigms.
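
A back-of-the-envelope calculation shows why the curves cross. Under an independent-failure model, system MTBF shrinks roughly as 1/N with node count while the cost of a checkpoint/restart cycle stays flat or grows; once that cost approaches the failure interval, no useful work gets done. The per-node MTBF and checkpoint cost below are illustrative assumptions, not measurements.

```python
# Sketch of the "Scale up and Fall Down" effect. Numbers are illustrative
# assumptions: 5-year per-node MTBF, 30-minute checkpoint+restart cycle.
NODE_MTBF_HOURS = 5 * 365 * 24   # each node fails about once every 5 years
CKPT_HOURS = 0.5                 # one checkpoint+restart cycle

for nodes in (100, 1_000, 10_000, 100_000):
    system_mtbf = NODE_MTBF_HOURS / nodes        # independent-failure approximation
    useful = max(0.0, system_mtbf - CKPT_HOURS)  # time left for science per interval
    print(f"{nodes:>7} nodes: system MTBF {system_mtbf:8.2f} h, "
          f"useful fraction {useful / system_mtbf:5.1%}")
```

At 100 nodes the checkpoint overhead is negligible; at 100,000 nodes the failure interval is shorter than the checkpoint itself and the useful fraction drops to zero, which is why the slide calls for adaptive runtimes and new FT paradigms.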

  7. Petascale Paths: General Purpose vs. Simple and Custom
     Software: a minimal OS with high performance but limited application support, versus a full OS tuned to the hardware that adapts on the fly; autonomic algorithms.
     Hardware: customized clusters for each group, versus a centralized general-purpose machine; "Internet in a box," or "out of the box."

  8. Big Science
     The final word: don't lose track of why we justify petascale systems.
     Science will ultimately be driven by computation, simulation, and modeling.
     Science drivers are key to success in HPC, and vice versa.
