
On the Path to Petascale: Top Challenges to Scientific Discovery


Presentation Transcript


  1. On the Path to Petascale: Top Challenges to Scientific Discovery. Scott A. Klasky, NCCS Scientific Computing End-to-End Task Lead

  2. 1. Code Performance • From 2004 to 2008, computing power for codes like GTC will go up three orders of magnitude! • Two paths to petascale computing for most simulations: more physics and larger problems, or code coupling. • My personal definition of leadership-class computing: “a simulation that runs on >50% of the cores for >10 hours.” • One ‘small’ simulation at that scale will cost about $38,000 on a petaflop computer. • Science scales with processors: XGC and GTC fusion simulations will run on 80% of the cores for 80 hours (~$400,000/simulation). A rough cost sketch follows below.
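The dollar figures above can be roughly reproduced from a single assumed machine-hour charge rate. The sketch below is not from the presentation; the rate, the function name, and the scaling by core fraction are assumptions chosen so the “small” run lands near $38,000.

```python
# Hypothetical cost model for leadership-class runs (illustrative only).
# The charge rate is an assumption, not a published NCCS number.

MACHINE_HOUR_RATE = 7_600.0  # assumed $/hour for the full petaflop machine

def run_cost(core_fraction: float, hours: float) -> float:
    """Estimate the dollar cost of a run using a fraction of the machine."""
    return MACHINE_HOUR_RATE * core_fraction * hours

# "Small" leadership run: >50% of cores for >10 hours.
print(f"small run:   ${run_cost(0.5, 10):>10,.0f}")  # ~ $38,000
# XGC/GTC-scale run: 80% of cores for 80 hours.
print(f"XGC/GTC run: ${run_cost(0.8, 80):>10,.0f}")  # ~ $486,000 (slides quote ~$400,000)
```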

  3. Data Generated • Mean time to failure (MTTF) will be ~2 days. • Restarts contain the critical information needed to replay the simulation at different times. • Typical restart = 1/10 of memory, dumped every hour (the big three apps support this claim). • Analysis files are dumped every physical timestep, typically every 5 minutes of simulation. • Analysis output varies; for ITER-size simulations we estimate roughly 1 GB every 5 minutes. • Demand I/O < 5% of the calculation time. • The total simulation will potentially produce ≈1280 TB of restart data + 960 GB of analysis data. • Need > (16*1024 + 12)/(3600 * 0.05) ≈ 91 GB/sec. • Asynchronous I/O is needed! (The big three apps, combustion, fusion, and astro, allow buffering.) • That reduces the required I/O rate to (16*1024 + 12)/3600 ≈ 4.5 GB/sec, with lower overhead. • Get the data off the HPC machine and over to another system! • Produce HDF5 files on that other system (too expensive on the HPC system). The bandwidth arithmetic is sketched below.
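The bandwidth requirement comes from writing one restart (1/10 of memory, ~16 TB here) plus one hour of analysis output within 5% of an hour. The sketch below simply reproduces the slide’s own arithmetic.

```python
# Reproduce the slide's I/O bandwidth arithmetic.

restart_gb = 16 * 1024      # one restart dump: 1/10 of memory ~ 16 TB, in GB
analysis_gb_per_hour = 12   # ~1 GB every 5 minutes of simulation
hour_s = 3600
io_budget = 0.05            # I/O must stay under 5% of the calculation time

data_per_hour_gb = restart_gb + analysis_gb_per_hour  # 16,396 GB

# Synchronous I/O: the whole dump must fit inside the 5% budget.
sync_rate = data_per_hour_gb / (hour_s * io_budget)
print(f"synchronous I/O rate needed:  {sync_rate:.0f} GB/s")  # ~91 GB/s

# Asynchronous I/O with buffering: the dump can drain over the full hour.
async_rate = data_per_hour_gb / hour_s
print(f"asynchronous I/O rate needed: {async_rate:.1f} GB/s") # ~4.5 GB/s
```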

  4. Workflow Automation is desperately needed (along with high-speed data-in-transit techniques). • Need to integrate autonomics into workflows. • Need to make it easy for the scientists. • Need to make it fault tolerant and robust. A minimal fault-tolerance sketch follows.
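As a minimal illustration of the “fault tolerant/robust” requirement, the sketch below wraps a file-transfer step in retries with backoff. It is a toy under assumptions, not any particular workflow engine; the bbcp invocation, host names, and paths are placeholders.

```python
import subprocess
import time

def transfer_with_retry(src: str, dest: str, max_attempts: int = 5) -> bool:
    """Copy a file with retries and exponential backoff.

    A toy stand-in for the robustness a real workflow engine should provide;
    the bbcp command and the paths are placeholders, not a prescribed setup.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["bbcp", src, dest])
        if result.returncode == 0:
            return True
        wait = 2 ** attempt  # back off: 2, 4, 8, ... seconds
        print(f"transfer failed (attempt {attempt}/{max_attempts}), retrying in {wait}s")
        time.sleep(wait)
    return False

# Example: move a restart dump off the HPC machine to an analysis cluster.
# Host and file names are purely illustrative.
ok = transfer_with_retry("jaguar:/scratch/run42/restart_0100.bp",
                         "analysis-cluster:/data/run42/")
if not ok:
    print("giving up; flag this step for the scientist to inspect")
```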

  5. A few days in the life of a Sim Scientist. Day 1, morning. • 8:00 AM Get coffee, check to see if the job is running. • ssh into jaguar.ccs.ornl.gov (job 1). • ssh into seaborg.nersc.gov (job 2) (this one is running, yay!). • Run gnuplot to see if the run is going OK on seaborg. This looks OK. • 9:00 AM Look at data from an old run for post-processing. • Legacy code (IDL, Matlab) to analyze most data. • Visualize some of the data to see if there is anything interesting. • Is my job running on jaguar? I submitted this 4K-processor job 2 days ago! • 10:00 AM scp some files from seaborg to my local cluster. • Luckily I only have 10 files (which are only 1 GB/file). • 10:30 AM The first file appears on my local machine for analysis. • Visualize data with Matlab. Seems to be OK. • 11:30 AM See that the second file had trouble coming over. • scp the files over again… D’oh!

  6. Day 1, evening. • 1:00 PM Look at the output from the second file. • Oops, I had a mistake in my input parameters. • ssh into seaborg, kill the job. Emacs the input, resubmit the job. • ssh into jaguar, check status. Cool, it’s running. • bbcp 2 files over to my local machine (8 GB/file). • Gnuplot the data. This looks OK too, but I still need to see more information. • 1:30 PM Files are on my cluster. • Run Matlab on the HDF5 output files. Looks good. • Write down some information about the run in my notebook. • Visualize some of the data. All looks good. • Go to meetings. • 4:00 PM Return from meetings. • ssh into jaguar. Run gnuplot. Still looks good. • ssh into seaborg. My job still isn’t running… • 8:00 PM Are my jobs running? • ssh into jaguar. Run gnuplot. Still looks good. • ssh into seaborg. Cool, my job is running. Run gnuplot. Looks good this time!

  7. And later… • 4:00 AM Yawn… is my job on jaguar done? • ssh into jaguar. Cool, the job is finished. Start bbcp’ing files over to my work machine (2 TB of data). • 8:00 AM @@!#!@, bbcp is having trouble. Resubmit some of the bbcp transfers from jaguar to my local cluster. • 8:00 AM (next day) Oops, I still need to get the remaining 200 GB of data over to my machine. • 3:00 PM My data is finally here! • Run Matlab. Run EnSight. Oops… something’s wrong! Where did that instability come from? • 6:00 PM Finish screaming!

  8. Typical Monitoring • Look at volume-averaged quantities: at 4 key times this quantity looks good, yet the code had one error that did not show up in the typical ASCII output used to generate the graph. • Typically users run gnuplot/grace to monitor output. • Need metadata integrated into the high-performance I/O and into simulation monitoring. • More advanced monitoring: • 5 seconds to move 600 MB and process the data. • Really need to FFT the 3D data, then process the data + particles (a minimal FFT sketch follows below). • 50 seconds (10 time steps) to move & process data. • 8 GB for 1/100 of the 30 billion particles. • Demand low overhead, <5%!
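The “FFT the 3D data” step can be prototyped in a few lines. The sketch below assumes the monitoring block has already landed on the analysis machine as a 3D NumPy array; the array shape and the random stand-in data are assumptions, not the real 600 MB stream.

```python
import numpy as np

# Illustrative monitoring step: FFT a 3D field and report the strongest modes.
# Random stand-in data; in practice this would be the field streamed off the
# running simulation every few seconds.
field = np.random.rand(128, 128, 128)

spectrum = np.fft.fftn(field)
power = np.abs(spectrum) ** 2

# Ignore the DC component and list the few largest modes.
power_flat = power.copy()
power_flat[0, 0, 0] = 0.0
top = np.argsort(power_flat.ravel())[-5:][::-1]
for idx in top:
    k = tuple(int(i) for i in np.unravel_index(idx, power.shape))
    print(f"mode {k}: power {power_flat[k]:.3e}")
```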

  9. Parallel Data Analysis • Most applications use serial (scalar) data analysis tools: IDL, Matlab, NCAR Graphics. • Need techniques such as PCA (sketched below). • Need help, since analysis code is written quickly and changed often… no hardened versions exist… and maybe never will.
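PCA on simulation output can be sketched with plain NumPy. The example below assumes the data has been flattened into a (samples × features) matrix, a layout the slide does not specify, and uses an SVD-based decomposition on random stand-in data.

```python
import numpy as np

def pca(data: np.ndarray, n_components: int):
    """SVD-based PCA: return the leading components and their variances.

    `data` is assumed to be shaped (n_samples, n_features), e.g. time steps
    by grid points; that layout is an assumption, not from the slides.
    """
    centered = data - data.mean(axis=0)
    # Economy-size SVD; rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / (len(data) - 1)
    return vt[:n_components], explained[:n_components]

# Toy usage: 200 time steps of a 10,000-point field (random stand-in data).
snapshots = np.random.rand(200, 10_000)
components, variance = pca(snapshots, n_components=3)
print(components.shape, variance)
```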

  10. New Visualization Challenges • Finding the needle in the haystack. • Feature identification and tracking (a toy sketch follows below)! • Analysis of 5D+time phase space with ~1×10^12 particles! • Real-time visualization of codes during execution. • Visualization for debugging.
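Feature identification can be prototyped as thresholding plus connected-component labeling. The sketch below uses scipy.ndimage on a random stand-in field; the threshold and array size are assumptions, and this is only a toy for what real needle-in-the-haystack detection would require.

```python
import numpy as np
from scipy import ndimage

# Toy feature identification: threshold a 3D field and label connected blobs.
# Random stand-in data; a real field would come from the simulation output.
field = np.random.rand(128, 128, 128)
mask = field > 0.99                       # "interesting" regions (assumed threshold)

labels, n_features = ndimage.label(mask)  # connected-component labeling
sizes = ndimage.sum(mask, labels, index=range(1, n_features + 1))

print(f"found {n_features} candidate features")
print(f"largest feature spans {int(sizes.max())} cells")
```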

  11. Where is my data? • ORNL, NERSC, HPSS (at NERSC and ORNL), the local cluster, my laptop? • We need to keep track of multiple copies. • We need to query the data: query-based visualization methods. • Don’t want to have to distinguish between different disks and tapes. A toy replica-catalog sketch follows.
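One way to picture “keeping track of multiple copies” is a small replica catalog that can be queried by run and file name. The sketch below uses SQLite purely for illustration; the schema, site names, and paths are assumptions, not an existing NCCS service.

```python
import sqlite3

# Toy replica catalog: where does each file of a run live?
# Schema and entries are illustrative assumptions only.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE replicas (
    run TEXT, filename TEXT, site TEXT, path TEXT, on_tape INTEGER)""")

db.executemany(
    "INSERT INTO replicas VALUES (?, ?, ?, ?, ?)",
    [("gtc_run42", "restart_0100.bp",  "ORNL",  "/scratch/run42/", 0),
     ("gtc_run42", "restart_0100.bp",  "HPSS",  "/hpss/run42/",    1),
     ("gtc_run42", "analysis_0100.h5", "local", "/data/run42/",    0)])

# Query: all on-disk copies of a given file, without caring which disk or tape.
rows = db.execute(
    "SELECT site, path FROM replicas WHERE run=? AND filename=? AND on_tape=0",
    ("gtc_run42", "restart_0100.bp")).fetchall()
print(rows)
```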
