
End to End Computing at ORNL


Presentation Transcript


  1. End to End Computing at ORNL Scott A. Klasky, Computing and Computational Science Directorate, Center for Computational Sciences

  2. Petascale data workspace / Massively parallel simulation

  3. Impact areas of R&D
  • Asynchronous I/O using data-in-transit (N×M) techniques (M. Parashar, Rutgers; M. Wolf, GT)
  • Workflow automation (SDM Center; A. Shoshani, LBNL; N. Podhorszki, UC Davis)
  • Dashboard front end (R. Barreto, ORNL; M. Vouk, NCSU)
  • High-performance metadata-rich I/O (C. Jin, ORNL; M. Parashar, Rutgers)
  • Logistical networking integrated into data analysis/visualization (M. Beck, UTK)
  • CAFÉ common data model (L. Pouchard, ORNL; M. Vouk, NCSU)
  • Real-time monitoring of simulations (S. Ethier, PPPL; J. Cummings, Caltech; Z. Lin, UC Irvine; S. Ku, NYU)
  • Visualization in the real-time monitor
  • Real-time data analysis

  4. Hardware architecture [Architecture diagram: Jaguar compute nodes, Jaguar I/O nodes, Lustre, HPSS, restart files, 40 G/s link (sockets later), simulation control, login nodes, NFS, DB, Lustre2, Ewok, Ewok-web, Ewok-sql, job control, Seaborg]

  5. Asynchronous petascale I/O for data in transit
  • High-performance I/O
  • Asynchronous
  • Managed buffers
  • Respect firewall constraints
  • Enable dynamic control with flexible M×N operations
  • Transform using a shared-space framework (Seine)
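The asynchronous approach amounts to handing output to managed buffers and letting a separate agent drain them while the simulation keeps computing. Below is a minimal single-process sketch of that buffering idea in Python, using a background thread in place of the real data-in-transit services and staging nodes; the class and method names are illustrative only.

    import queue
    import threading

    class AsyncWriter:
        """Drain managed buffers to disk on a background thread so the
        simulation never blocks on the filesystem."""

        def __init__(self, max_buffers=4):
            # Bounded queue = the pool of managed buffers.
            self._q = queue.Queue(maxsize=max_buffers)
            threading.Thread(target=self._drain, daemon=True).start()

        def submit(self, path, payload):
            # Blocks only when every buffer is full (backpressure), not on I/O.
            self._q.put((path, bytes(payload)))

        def _drain(self):
            while True:
                path, payload = self._q.get()
                with open(path, "wb") as f:   # the actual write happens off the critical path
                    f.write(payload)
                self._q.task_done()

        def flush(self):
            # Wait for all outstanding writes, e.g. at a checkpoint boundary.
            self._q.join()

    # Usage sketch: writer = AsyncWriter(); writer.submit("restart_0001.bin", state_bytes); writer.flush()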

  6. Workflow automation
  • Automate the data-processing pipeline, including transfer of simulation output to the e2e system, execution of conversion routines, image creation, and archival, using the Kepler workflow system
  • Requirements for petascale computing:
  • Easy to use
  • Dashboard front end
  • Autonomic
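Kepler expresses such a pipeline as a graph of actors; the sketch below shows the same transfer → convert → render → archive sequence in plain Python, with every stage a hypothetical placeholder rather than the project's actual workflow components.

    import shutil
    import subprocess
    from pathlib import Path

    STAGING = Path("/tmp/e2e-staging")   # illustrative stand-in for the e2e system's scratch area

    def transfer_to_e2e(src: Path) -> Path:
        # Transfer stage: copy simulation output to the end-to-end system.
        STAGING.mkdir(parents=True, exist_ok=True)
        dest = STAGING / src.name
        shutil.copy2(src, dest)
        return dest

    def convert_to_hdf5(src: Path) -> Path:
        # Conversion stage (placeholder): the real routine decodes the dump into HDF5.
        return src.with_suffix(".h5")

    def render_images(h5file: Path) -> None:
        # Image-creation stage (placeholder) feeding the dashboard front end.
        print(f"rendering plots for {h5file}")

    def archive(path: Path) -> None:
        # Archival stage (placeholder): the real pipeline pushes to HPSS.
        subprocess.run(["echo", "archive", str(path)], check=True)

    def run_pipeline(output_dir: Path) -> None:
        # One automated pass: transfer -> convert -> render -> archive for each dump.
        for restart in sorted(output_dir.glob("restart_*.bin")):
            staged = transfer_to_e2e(restart)
            h5 = convert_to_hdf5(staged)
            render_images(h5)
            archive(h5)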

  7. Dashboard front end
  • Desktop interface uses Asynchronous JavaScript and XML (AJAX); runs in a web browser
  • AJAX is a combination of technologies coming together in powerful ways (XHTML, CSS, DOM, XML, XSLT, XMLHttpRequest, and JavaScript)
  • The user's interaction with the application happens asynchronously, independent of communication with the server
  • Users can rearrange the page without clearing it
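On the server side, the dashboard only needs a small endpoint that the browser's XMLHttpRequest calls can poll and then redraw individual widgets from. Here is a minimal sketch using Python's standard library; the /status path and the returned fields are invented for illustration.

    import json
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status":
                self.send_error(404)
                return
            # The AJAX front end polls this and updates only the affected widgets.
            payload = json.dumps({
                "timestep": int(time.time()) % 10000,   # placeholder for real simulation state
                "wall_time_s": time.time(),
            }).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), StatusHandler).serve_forever()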

  8. High-performance metadata-rich I/O
  • Two-step process to produce files:
  • Step 1: Write out binary data + tags using parallel I/O on the XT3 (may or may not use files; could use asynchronous I/O methods). The tags contain the metadata information that is placed inside the files. The workflow transfers this information to Ewok (IB cluster with 160P).
  • Step 2: A service on Ewok decodes the files into HDF5 files and places the metadata into an XML file (one XML file for all of the data).
  • Cuts I/O overhead in GTC from 25% to <3%
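A rough sketch of the Step 2 decoding service on Ewok, assuming a simple illustrative tag format (a name/units dictionary plus a NumPy array per variable); the real tag encoding is whatever the simulation's I/O layer writes.

    import xml.etree.ElementTree as ET
    import h5py
    import numpy as np

    def decode_to_hdf5(tagged_arrays, h5_path, xml_path):
        """Turn (tag, array) pairs from the raw dump into an HDF5 file,
        and record the metadata for every variable in one XML index."""
        index = ET.Element("run_metadata")
        with h5py.File(h5_path, "w") as h5:
            for tag, data in tagged_arrays:
                h5.create_dataset(tag["name"], data=data)   # binary payload -> HDF5
                var = ET.SubElement(index, "variable", name=tag["name"])
                ET.SubElement(var, "units").text = tag.get("units", "")
                ET.SubElement(var, "shape").text = "x".join(map(str, data.shape))
                ET.SubElement(var, "min").text = str(float(data.min()))
                ET.SubElement(var, "max").text = str(float(data.max()))
        ET.ElementTree(index).write(xml_path)               # one XML file for all of the data

    # Usage with illustrative tags:
    # decode_to_hdf5([({"name": "potential", "units": "V"}, np.random.rand(64, 64))],
    #                "gtc_0001.h5", "gtc_0001.xml")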

  9. Logistical networking: High-performance, ubiquitous, and transparent data access over the WAN [Diagram: directory server, depots, portals; sites include Jaguar (Cray XT3) and the Ewok cluster, NYU, PPPL, MIT, UCI]

  10. CAFÉ common data model
  • Combustion (S3D), astrophysics (Chimera), and fusion (GTC/XGC) environments
  • Stores and organizes four types of information about a given run:
  • Provenance
  • Operational profile
  • Hardware mapping
  • Analysis metadata
  • A scientist can seamlessly find the input and output variables of a given run, and the units, average, min, and max values for a variable
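A sketch of how the four categories and the per-variable summaries might be represented; all field names here are invented for illustration and are not the actual CAFÉ schema.

    from dataclasses import dataclass, field

    @dataclass
    class VariableMetadata:
        name: str
        role: str            # "input" or "output"
        units: str
        average: float
        minimum: float
        maximum: float

    @dataclass
    class RunRecord:
        # The four categories of information stored about a run.
        provenance: dict = field(default_factory=dict)           # code version, inputs, who/when
        operational_profile: dict = field(default_factory=dict)  # wall time, I/O volume, failures
        hardware_mapping: dict = field(default_factory=dict)     # machine, node/core layout
        analysis_metadata: list[VariableMetadata] = field(default_factory=list)

        def find_variable(self, name: str) -> VariableMetadata:
            # "Seamlessly find" a variable's units and summary statistics for this run.
            return next(v for v in self.analysis_metadata if v.name == name)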

  11. Real-time monitoring of simulations
  • Using end-to-end technology, we have created a monitoring technique for fusion codes (XGC, GTC); Kepler is used to automate the steps:
  • The data-movement task will use the Rutgers/GT data-in-transit routines; metadata are attached during the movement from the XT3 to the IB cluster, with data going from N×M processors using the Seine framework
  • Data are archived into the High Performance Storage System; metadata are placed in a DB
  • Visualization and analysis services produce data/information that are placed on the AJAX front end
  • Data are replicated from Ewok to other sites using logistical networking
  • Users access the files from our metadata server at ORNL
  • Scientists need tools that let them see and manipulate their simulation data quickly. Archiving data, staging it to secondary or tertiary computing resources for annotation and visualization, staging it yet again to a web portal… these work, but we can accelerate the rate of insight for some scientists by allowing them to observe data during a run
  • The per-hour facilities cost of running a leadership-class machine is staggering. Computational scientists should have tools that allow them to constantly observe and adjust their runs during their scheduled time slice, as astronomers at an observatory or physicists at a beam source can
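The monitoring idea is simply to react to output as the run produces it rather than after the job ends. A minimal polling sketch of that loop follows; the file pattern and the downstream hook are hypothetical.

    import time
    from pathlib import Path

    def watch_run(output_dir: Path, poll_seconds: float = 5.0):
        """Yield each new output file as the simulation produces it, so the
        downstream steps (move, analyze, plot to the dashboard) can run
        during the job instead of after it."""
        seen: set[Path] = set()
        while True:
            for f in sorted(output_dir.glob("diag_*.bin")):
                if f not in seen:
                    seen.add(f)
                    yield f                  # hand off to the pipeline stage
            time.sleep(poll_seconds)

    # Usage sketch:
    # for new_file in watch_run(Path("/lustre/scratch/gtc_run")):
    #     run_pipeline_step(new_file)        # hypothetical downstream hook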

  12. Real-time monitoring
  • Typical monitoring: look at volume-averaged quantities; users typically run gnuplot/grace to monitor output
  • At the four key times shown, this quantity looks good, yet the code had one error that didn't appear in the typical ASCII output used to generate this graph
  • More advanced monitoring:
  • 5 seconds to move 300 MB and process the data; need to use an FFT for the 3-D data, and then process data + particles
  • 50 seconds (10 time steps) to move and process 8 GB of data, for 1/100 of the 30 billion particles
  • Demands low overhead: <3%!
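The sustained bandwidth those targets imply is easy to check; a small worked calculation using the figures from the slide:

    # Sustained bandwidth implied by the monitoring targets above.
    MB = 1024 ** 2
    GB = 1024 ** 3

    basic = 300 * MB / 5        # 300 MB moved and processed within 5 seconds
    advanced = 8 * GB / 50      # 8 GB (1/100 of the particles) within 50 seconds

    print(f"basic monitoring:    {basic / MB:6.1f} MB/s")     # ~60 MB/s
    print(f"advanced monitoring: {advanced / MB:6.1f} MB/s")  # ~164 MB/s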

  13. Contact Scott A. Klasky, Lead, End-to-End Solutions, Center for Computational Sciences, (865) 241-9980, klasky@ornl.gov (Klasky_E2E_0611)
