MS698: Implementing an Ocean Model

April 4, 2014
• Benchmark tests.
• Dealing with big data files (specifically big NetCDF files).
• Signell paper.
• Activity: nctoolbox in MATLAB.

Benchmark Tests

Activity Today, Part I: Analyze Benchmark Tests
• Plot the walltime vs. the number of processors.
• Make another figure of the speedup vs. the number of processors.
• Discuss the model performance when run in parallel:
  • Do you see a difference in speedup depending on how the model tiles were configured?
  • Does it seem worthwhile to run the model in parallel up to the number of processors tested (8)?
  • Based on these benchmarks, how would you set up a parallel run if you wanted to represent a long period of time?
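As a starting point for the benchmark analysis, here is a minimal Python sketch of the usual definitions: speedup is the one-processor walltime divided by the walltime on N processors, and parallel efficiency is the speedup divided by N. The walltimes below are hypothetical placeholders, not results from the class benchmarks, and keying them by tile counts (here called `ntile_i`, `ntile_j`) is an assumption about how the runs were configured.

```python
# Hypothetical walltimes in seconds, keyed by (ntile_i, ntile_j).
# These numbers are made up for illustration; substitute your own benchmark results.
walltimes = {
    (1, 1): 1200.0,
    (1, 2): 640.0,
    (2, 1): 660.0,
    (2, 2): 350.0,
    (1, 8): 210.0,
    (2, 4): 190.0,
    (4, 2): 195.0,
    (8, 1): 230.0,
}


def speedup_table(times):
    """Return rows of (ntile_i, ntile_j, nproc, walltime, speedup, efficiency)."""
    serial = times[(1, 1)]  # one-processor run is the reference
    rows = []
    # Sort by processor count first, then by tiling, so equal-count tilings sit together.
    for (ni, nj), t in sorted(times.items(), key=lambda kv: (kv[0][0] * kv[0][1], kv[0])):
        nproc = ni * nj
        speedup = serial / t
        efficiency = speedup / nproc
        rows.append((ni, nj, nproc, t, speedup, efficiency))
    return rows


rows = speedup_table(walltimes)
print("tiles  nproc  walltime  speedup  efficiency")
for ni, nj, nproc, t, s, e in rows:
    print(f"{ni}x{nj}    {nproc:5d}  {t:8.1f}  {s:7.2f}  {e:10.2f}")
```

Rows with the same processor count but different tilings can be compared directly to see whether the tile layout matters. An efficiency well below 1 at 8 processors would suggest diminishing returns.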

Themes of the paper
• Model output makes big data, on the terabyte scale. Data access is limited by bandwidth in many cases.
• But often you don't want or need the entire file or data set; you just need part of it.
• This matters especially in collaborative settings where different models are being combined or compared. It can be a pain to compare models (different grids, different timesteps, different variable names, different units…).
• Because the output data is "big", it may be spread across multiple files so that each file is smaller than 2 GB.
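To get a feel for why output ends up split across files, the rough arithmetic can be sketched in Python. The grid dimensions and variable counts below are hypothetical, and the estimate assumes 4-byte floats while ignoring file headers, coordinate variables, and staggered-grid differences.

```python
def bytes_per_record(nx, ny, nz, n3d_vars, n2d_vars, bytes_per_value=4):
    """Approximate size of one output time record (field data only)."""
    values_3d = n3d_vars * nx * ny * nz
    values_2d = n2d_vars * nx * ny
    return (values_3d + values_2d) * bytes_per_value


FILE_LIMIT = 2 * 1024**3  # 2 GB expressed in bytes

# Hypothetical grid: 400 x 300 horizontal points, 30 vertical levels,
# five 3-D variables and three 2-D variables per record.
record_bytes = bytes_per_record(nx=400, ny=300, nz=30, n3d_vars=5, n2d_vars=3)
records_per_file = FILE_LIMIT // record_bytes

print(f"one record  ~ {record_bytes / 1e6:.1f} MB")
print(f"records that fit under 2 GB: {records_per_file}")
```

With these made-up numbers, a file fills up after a few dozen time records, so a long simulation naturally produces many history files.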

Section 2 gives five pieces of advice
• Store data in a machine-independent, self-describing format (like NetCDF).
• Use CF (Climate and Forecast) conventions. This makes it easier for processing scripts to figure out the model grid, variables, etc. It is especially important for users who do not know the details of the model.
• Use and develop generic tools that work with CF-compliant data.
• Use OPeNDAP to distribute data. It lets the data be served over the internet so that users can request just the subset they need at one time.
• Use a THREDDS catalog. This lets you string a lot of data files together into one dataset.
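The subsetting idea behind OPeNDAP can be illustrated with a small Python sketch that builds a DAP2-style request URL, where each dimension of a variable gets an inclusive `[start:stride:stop]` index range. The server URL, the variable name `salt`, and its dimension order are hypothetical placeholders, not a real dataset.

```python
def dap_subset_url(base_url, variable, ranges, response="ascii"):
    """Build a DAP2 request URL for a hyperslab of one variable.

    ranges: one (start, stride, stop) tuple per dimension, zero-based,
    with an inclusive stop index.
    """
    hyperslab = "".join(f"[{start}:{stride}:{stop}]" for start, stride, stop in ranges)
    return f"{base_url}.{response}?{variable}{hyperslab}"


# Hypothetical dataset URL and variable; dimensions assumed to be
# (time, vertical level, y index, x index).
base = "http://example.com/thredds/dodsC/hypothetical/his.nc"
url = dap_subset_url(
    base,
    "salt",
    [(0, 1, 9),     # first ten time records
     (29, 1, 29),   # a single vertical level
     (50, 1, 50),   # a single y index
     (20, 1, 20)],  # a single x index
)
print(url)
```

The request asks for ten values rather than a whole multi-gigabyte file, which is the point of serving data this way. In practice, a client library usually builds these requests for you.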

Now for our example: How can we deal with big datasets?
• Use the MATLAB script in /export/home/ckharris/MODELS/ROMS/RIVERPLUME2/MS698…

We are going to use nctoolbox in MATLAB to analyze big NetCDF data on the vlab computers.
• Why are we going to use the vlab computers?
  • Because they have a new enough version of MATLAB and Java, and can access our model output.
  • This avoids the step of logging onto the cluster or poverty.
  • You might be able to do this from poverty as well.
  • You can use these tools on pacific, but running jobs interactively on the cluster requires some extra steps.
• Why do we want to use nctoolbox?
  • It gives us some useful tools for concatenating across history files.
  • It has some useful tools for analyzing ocean model data.


Activity for today
• Plot the walltime vs. the number of processors.
• Make another figure of the speedup vs. the number of processors.
• Discuss the model performance when run in parallel:
  • Do you see a difference in speedup depending on how the model tiles were configured?
  • Does it seem worthwhile to run the model in parallel up to the number of processors tested (8)?
  • Based on these benchmarks, how would you set up a parallel run if you wanted to represent a long period of time?
• Use nctoolbox to plot a timeseries of the data from the RIVERPLUME2 test case: /export/home/ckharris/MODELS/ROMS/RIVERPLUME2/MS698…
