Running CCSM

Tony Craig

CCSM Software Engineering Group

[email protected]


Outline

  • General review of CCSM

  • Setting up and running a simple case

  • Datasets

  • Production

  • Modifying source code

  • Errors

  • Tools

  • Performance


Review of CCSM

  • Five components / Ten models

    • Atmosphere(3) : atm, datm, latm

    • Ocean(2) : ocn, docn

    • Land(2) : lnd, dlnd

    • Ice(2+) : ice, ice (prescribed mode), ice (mixed layer ocean mode), dice

    • Coupler(1) : cpl

  • Communication via MPI between components and coupler only

  • Each component runs on multiple processors via MPI, OpenMP, MPI/OpenMP


Component parallelization

  • atm : MPI, OpenMP, or MPI/OpenMP

  • lnd : MPI, OpenMP, or MPI/OpenMP

  • ice : MPI only

  • ocn : MPI only

  • cpl : OpenMP only

  • The data models, datm, docn, dice, dlnd, and latm : serial only, 1 processor


Configurations

  • A = datm, dlnd, docn, dice, cpl

  • B = atm, lnd, ocn, ice, cpl

  • C = datm, dlnd, ocn, dice, cpl

  • D = datm, dlnd, docn, ice, cpl

  • F = atm, lnd, docn, ice (prescribed mode), cpl

  • G = latm, dlnd, ocn, ice, cpl

  • H = atm, dlnd, docn, dice, cpl

  • I = datm, lnd, docn, dice, cpl

  • K = atm, lnd, docn, dice, cpl

  • M = latm, dlnd, docn, ice (ml ocn mode), cpl


Resolutions

  • atm/lnd/datm/dlnd = T42, T31

  • ocn/ice/docn/dice = gx1v3, gx3, gx3v4

  • latm = T62

  • Scientifically validated combinations

    • B, T42_gx1v3 = b20.007 control run (test.a1 case)

    • B, T31_gx3v4 = paleo control run (test.a2 case)


“Available” configurations

[Figure: matrix of available configurations. Legend entries: supported (subject to change), b20.007 control, paleo control]


Platforms

  • IBM

  • SGI

  • Compaq*


Review of scripts

  • Main script (test.a1.run)

    • Sets primary ccsm environment variables

    • Calls $model.setup.csh

      • Gets input datasets

      • Builds components

    • Runs model

    • Archives

    • Harvests


Setting up a simple case

  • Use the GUI !!

    • The GUI modifies the scripts and creates a new case for you

    • Input $CASE, $CSMROOT, $CSMDATA, $EXEROOT

    • Input resolution

    • Input configuration (A-M)

    • Sets processor layout based on configuration (first guess)

    • Sets some batch environment variables

    • Works well in the NCAR environment; other sites require tuning after script generation


Setting up a simple case, without GUI

  • Create new case directory under scripts, copy over test.a1 files

  • Rename file test.a1.run to $CASE.run

    • Edit $CASE, $CSMROOT, $CSMDATA, $EXEROOT, $ARCROOT

    • Edit batch environment parameters

    • Edit $GRID

    • Edit $SETUPS

    • Edit $NTASKS, $NTHRDS


$NTASKS, $NTHRDS, batch

  • $NTASKS is the number of MPI tasks for each component

  • $NTHRDS is the number of OpenMP threads per MPI task

  • $NTASKS*$NTHRDS = total number of processors for each component

  • Tuning is required to get optimal load balance

  • Batch parameters should match the processors used; consistency is important

  • task_geometry (LoadLeveler) is very powerful


Component parallelization

  • atm : MPI, OpenMP, or MPI/OpenMP

  • lnd : MPI, OpenMP, or MPI/OpenMP

  • ice : MPI only, NTHRDS=1

  • ocn : MPI only, NTHRDS=1

  • cpl : OpenMP only, NTASKS=1

  • The data models, datm, docn, dice, dlnd, and latm : serial only, 1 processor, NTASKS=1, NTHRDS=1
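
The constraints above can be checked mechanically before a job is submitted. The following is a minimal sketch in Python, not part of the CCSM scripts; the function name `check_layout` and the set names are illustrative assumptions.

```python
# Parallelization constraints as listed on the slide above.
# The names below are illustrative assumptions, not CCSM identifiers.
MPI_ONLY = {"ice", "ocn"}                                # NTHRDS must be 1
OPENMP_ONLY = {"cpl"}                                    # NTASKS must be 1
SERIAL_ONLY = {"datm", "docn", "dice", "dlnd", "latm"}   # NTASKS=1, NTHRDS=1


def check_layout(setups, ntasks, nthrds):
    """Return a list of human-readable constraint violations."""
    problems = []
    for name, tasks, threads in zip(setups, ntasks, nthrds):
        if tasks < 1 or threads < 1:
            problems.append(f"{name}: NTASKS and NTHRDS must be at least 1")
        if name in MPI_ONLY and threads != 1:
            problems.append(f"{name}: MPI only, NTHRDS must be 1")
        if name in OPENMP_ONLY and tasks != 1:
            problems.append(f"{name}: OpenMP only, NTASKS must be 1")
        if name in SERIAL_ONLY and (tasks != 1 or threads != 1):
            problems.append(f"{name}: serial only, NTASKS=1 and NTHRDS=1")
    return problems


if __name__ == "__main__":
    # A layout with two deliberate mistakes: threaded ocn, multi-task cpl.
    for line in check_layout(["atm", "lnd", "ocn", "ice", "cpl"],
                             [8, 2, 40, 8, 2],
                             [4, 4, 2, 1, 4]):
        print(line)
```

An empty list means the layout respects every rule listed on the slide; it says nothing about whether the layout is well load-balanced.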


Main script configuration summary

  • B case

    MODELS ( atm lnd ocn ice cpl)

    SETUPS ( atm lnd ocn ice cpl)

    NTASKS ( 8 2 40 8 1)

    NTHRDS ( 4 4 1 1 4)

  • datm/dlnd/ocn/ice case

    MODELS ( atm lnd ocn ice cpl)

    SETUPS ( datm dlnd ocn ice cpl)

    NTASKS ( 1 1 64 16 1)

    NTHRDS ( 1 1 1 1 4)
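
As a worked example of the $NTASKS*$NTHRDS rule, the sketch below computes the processor count for each component in the two layouts above. It assumes that components do not share processors, so the total is a simple sum; the helper name is illustrative.

```python
def processors_per_component(models, ntasks, nthrds):
    """Processors for each component = NTASKS * NTHRDS."""
    return {m: t * h for m, t, h in zip(models, ntasks, nthrds)}


models = ["atm", "lnd", "ocn", "ice", "cpl"]

# B case from the slide above.
b_case = processors_per_component(models, [8, 2, 40, 8, 1], [4, 4, 1, 1, 4])

# datm/dlnd/ocn/ice case from the slide above.
data_case = processors_per_component(models, [1, 1, 64, 16, 1], [1, 1, 1, 1, 4])

print(b_case, "total:", sum(b_case.values()))
print(data_case, "total:", sum(data_case.values()))
```

Under that assumption the B case needs 92 processors (32 atm, 8 lnd, 40 ocn, 8 ice, 4 cpl) and the datm/dlnd/ocn/ice case needs 86. These totals are what the batch parameters should request.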


$RUNTYPE

  • Startup - initial startup of model using arbitrary initialization

    • set $CASE, $BASEDATE

  • Continue - continuation of case, bit-for-bit guaranteed, uses model restart files

    • set $CASE

  • Branch - start new case as a bit-for-bit continuation of another case, uses model restart files, requires continuous date

    • set $CASE, $REFCASE, $REFDATE

  • Hybrid - start new case, not bit-for-bit continuation, uses model initial files in atm and land, can change starting date

    • set $CASE,$BASEDATE,$REFCASE,$REFDATE
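
The variables required by each $RUNTYPE can be summarized as a lookup table. This is a sketch only: the checker function is hypothetical, and the case names in the usage lines are placeholders.

```python
# Variables that must be set for each $RUNTYPE, as listed on the slide above.
REQUIRED = {
    "startup": ["CASE", "BASEDATE"],
    "continue": ["CASE"],
    "branch": ["CASE", "REFCASE", "REFDATE"],
    "hybrid": ["CASE", "BASEDATE", "REFCASE", "REFDATE"],
}


def missing_settings(runtype, settings):
    """Return the required variable names that are unset or empty."""
    required = REQUIRED[runtype.lower()]
    return [name for name in required if not settings.get(name)]


if __name__ == "__main__":
    # Hypothetical branch run that forgot to set REFDATE.
    print(missing_settings("branch", {"CASE": "mycase", "REFCASE": "test.a1"}))
```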


Coupler namelist

  • stop_option : ndays, nmonths, newmonth, halfyear, newyear, newdecade

  • stop_n : integer (ndays, nmonths)

  • rest_freq : ndays, monthly, quarterly, halfyear, yearly

  • rest_n : integer (ndays)

  • diag_freq : daily, weekly, biweekly, monthly, quarterly, yearly, ndays

  • diag_n : integer (ndays)

  • info_bcheck : integer
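
The option lists above lend themselves to a quick sanity check before editing cpl.setup.csh. The sketch below is an assumption-laden illustration: it validates a plain Python dictionary, not a real namelist file, and it reads "integer (ndays)" as meaning that the count is needed only with the options named in parentheses. Keys are written in lower case; Fortran namelist names are case-insensitive.

```python
# Allowed values copied from the slide above.
STOP_OPTIONS = {"ndays", "nmonths", "newmonth", "halfyear", "newyear", "newdecade"}
REST_FREQS = {"ndays", "monthly", "quarterly", "halfyear", "yearly"}
DIAG_FREQS = {"daily", "weekly", "biweekly", "monthly", "quarterly", "yearly", "ndays"}


def check_coupler_settings(settings):
    """Return a list of problems found in a dict of coupler settings."""
    errors = []
    checks = [
        ("stop_option", STOP_OPTIONS, "stop_n", {"ndays", "nmonths"}),
        ("rest_freq", REST_FREQS, "rest_n", {"ndays"}),
        ("diag_freq", DIAG_FREQS, "diag_n", {"ndays"}),
    ]
    for key, allowed, count_key, needs_count in checks:
        value = settings.get(key)
        if value not in allowed:
            errors.append(f"{key}={value!r} is not one of {sorted(allowed)}")
        elif value in needs_count and not isinstance(settings.get(count_key), int):
            errors.append(f"{count_key} must be an integer when {key}={value!r}")
    if not isinstance(settings.get("info_bcheck"), int):
        errors.append("info_bcheck must be an integer")
    return errors


if __name__ == "__main__":
    production = {"stop_option": "newyear", "rest_freq": "yearly",
                  "diag_freq": "monthly", "info_bcheck": 0}
    print(check_coupler_settings(production))
```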


Data Sets

  • Types

    • Grid files, binary

    • Namelist input, ascii

    • Initial datasets, binary/netcdf

    • Restart datasets, binary

    • History datasets, netcdf

    • Log files, ascii

  • inputdata directory

    • This is usually pointed to by $CSMDATA


Data Flow, Input

[Diagram: input data flow among scripts/$CASE, the setup scripts, $CSMDATA (inputdata), $ARCROOT/restart, the Mass Store, and $EXEROOT]

  • Everything is copied to $EXEROOT

  • Tools and scripts attempt to automate most of the “get input files”

  • Main script variables include $CSMDATA, $LFSINP, $LMSINP, $MACINP, $RFSINP, $RMSINP


Data Flow, Output

  • Output files are moved out of $EXEROOT

  • Harvesting is a separate process

  • Writing of restart files coordinated by the coupler

  • Writing of history files is not coordinated between components, monthly average is default

  • Main script variables include $LMSOUT, $MACOUT, $RFSOUT

[Diagram: output data flow among the scripts, $EXEROOT, $ARCROOT (archiving), and the Mass Store (harvesting)]


Log Files

  • Each component produces a log file, $model.log.$LID

  • $LID is a system date stamp

  • Date stamps are the same on all log files for a run

  • Log files are written into the $EXEROOT/$model directories during execution

  • Log files are copied to $SCRIPTS/logs at the end of a run

  • There are separate stdout and stderr that sometimes contain output information


Archiving, ccsm_archive

  • Means moving model output to a separate area on a local disk, ccsm_archive

  • Local disk area is set by $ARCROOT in the main script

  • Benefits

    • Allows separation of running and harvesting

    • Mass storage availability does not prevent continued execution of the model

    • Allows users to run in volatile temporary space

    • Supports simple harvesting in a clustered machine environment (like nirvana)


Harvesting, $CASE.har

  • Means copying model output to the local mass store

  • Separate script in scripts/$CASE, $CASE.har

  • Typically submitted in batch, can also be run interactively

  • Submitted by main script after model run, off by default

  • Sources ccsm_joe for important environment variables

  • Harvests all files in $ARCROOT/{atm,lnd,ocn,ice,cpl}

  • Verifies accurate copy on mass store before removing

  • Can scp files to remote machines
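
The "verify before removing" step is the part of harvesting most worth understanding. The sketch below illustrates the idea with a local directory standing in for the mass store; the real harvester uses site-specific mass store tools, and the function names and checksum choice here are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256(path):
    """Checksum a file in 1 MiB chunks so large history files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def harvest_file(source, store_dir):
    """Copy source into store_dir; delete source only if the copy matches."""
    source = Path(source)
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    target = store_dir / source.name
    shutil.copy2(source, target)
    if sha256(source) != sha256(target):
        return False  # keep the original so the copy can be retried later
    source.unlink()
    return True
```

Because the original is removed only after the copy has been confirmed, a failed or interrupted harvest leaves the data in $ARCROOT for the next attempt.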


Exact Restart

  • CCSM can stop and restart exactly

  • The coupler controls the frequency of restart file writes

  • Restart files guarantee bit-for-bit continuity at a checkpoint boundary

  • rpointer files are updated in the scripts/$CASE directory after each run


Restart file management (1)

  • ccsm_archive

    • In scripts/$CASE

    • Called from main script after model run is complete, commented out by default

    • $ARCROOT/restart contains the latest full set of restart files

    • ccsm_archive copies full set of restart datasets into $ARCROOT/restart after each run

    • ccsm_archive then tars up that restart set into the $ARCROOT/restart.tars directory

    • These tar files can be large, regular clean up required


Restart file management (2)

  • ccsm_getrestart

    • In scripts/tools

    • Called from main script before model run starts, commented out by default

    • Copies the latest set of restart files from $ARCROOT/restart to the appropriate directories

  • To “back up” a model run to a previous model date

    • Assumes both ccsm_archive and ccsm_getrestart have been active in the main script

    • Delete all files in $ARCROOT/restart

    • Untar an $ARCROOT/restart.tars file into $ARCROOT/restart

    • Resubmit
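
The two file operations in the procedure above (empty $ARCROOT/restart, then untar an older restart set into it) can be sketched as follows. This is an illustration, not a CCSM tool: the function name is hypothetical, and it assumes the tar file stores the restart files at its top level, which may not match what ccsm_archive actually writes.

```python
import tarfile
from pathlib import Path


def restore_restart_set(arcroot, tar_name):
    """Replace the contents of <arcroot>/restart with one archived set."""
    restart_dir = Path(arcroot) / "restart"
    tar_path = Path(arcroot) / "restart.tars" / tar_name
    if not tar_path.is_file():
        # Check first, so nothing is deleted if the tar file is missing.
        raise FileNotFoundError(tar_path)
    restart_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: delete all files in $ARCROOT/restart.
    for entry in restart_dir.iterdir():
        if entry.is_file():
            entry.unlink()

    # Step 2: untar the chosen restart set into $ARCROOT/restart.
    with tarfile.open(tar_path) as archive:
        archive.extractall(restart_dir)

    return sorted(p.name for p in restart_dir.iterdir())
```

After this, resubmitting the main script with ccsm_getrestart active would pick up the restored set.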


Auto-Resubmit

  • RESUBMIT file in scripts/$CASE directory

    • contains a single integer

    • If the integer is >0, main script resubmits itself and decrements the integer

  • Runaway jobs

    • FIRST! set value in RESUBMIT file to 0

    • Attempt to kill running jobs
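
The RESUBMIT logic is a simple counter. The sketch below mirrors the behavior described above in Python; the real main script is a shell script, and the function name is an illustrative assumption.

```python
from pathlib import Path


def should_resubmit(resubmit_file):
    """Return True and decrement the counter if the integer in the file is > 0."""
    path = Path(resubmit_file)
    count = int(path.read_text().strip())
    if count > 0:
        path.write_text(f"{count - 1}\n")
        return True
    return False
```

A file containing 2 therefore yields two further submissions and then stops, which is why setting the value to 0 is the first step in halting a runaway job: it guarantees that killing the running job is not followed by another submission.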


Production

  • Modify coupler namelist in cpl.setup.csh, set run length and restart frequency, turn down diagnostic frequency, set info_bcheck to 0.

  • Run a startup, hybrid, or branch case $RUNTYPE

  • Transition to continue $RUNTYPE

  • Turn on archiving, harvesting, and ccsm_getrestart

  • Edit RESUBMIT file to initiate auto-resubmission


Monitoring a run

  • Monitor the batch jobs using llq, bjobs, qstat

  • Verify that runs complete successfully, check for timing information at the end of a log file

  • tail -f $EXEROOT/cpl/cpl.log*

  • If runs are not succeeding,

    • tail each log file

    • grep for ENDRUN in atm and lnd log files

    • Check stdout and stderr files for component messages or system messages

    • Look for core files in $EXEROOT/$model

    • Look for zero length files in $EXEROOT/$model

    • Check email
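
Several of the checks above can be gathered into one pass over $EXEROOT. The sketch below is a hypothetical helper, not a CCSM tool. It assumes log files follow the $model.log.$LID naming used elsewhere in this talk and that core files have names beginning with "core", which varies by system.

```python
from pathlib import Path


def scan_run_directory(exeroot, models=("atm", "lnd", "ocn", "ice", "cpl")):
    """Collect ENDRUN hits, zero-length files, and core files per model dir."""
    report = {"endrun": [], "zero_length": [], "core": []}
    for model in models:
        model_dir = Path(exeroot) / model
        if not model_dir.is_dir():
            continue
        for path in sorted(model_dir.iterdir()):
            if not path.is_file():
                continue
            if path.stat().st_size == 0:
                report["zero_length"].append(str(path))
            if path.name.startswith("core"):  # naming is an assumption
                report["core"].append(str(path))
            # ENDRUN is searched for in atm and lnd log files only.
            if model in ("atm", "lnd") and ".log." in path.name:
                if "ENDRUN" in path.read_text(errors="replace"):
                    report["endrun"].append(str(path))
    return report
```

The result is only a starting point; stdout, stderr, and batch system email still need to be read by hand.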


Modifying source code

  • Modifying files in the ccsm models directory is not recommended

  • Create directories under scripts/$CASE

    • src.atm, src.lnd, src.ocn, src.ice, src.cpl

    • Copy subset of model source code to these directories and modify it

    • Has highest priority with respect to build

  • Benefits include

    • Release source code remains unmodified and available

    • Allows implementation of case dependent code modifications
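
"Highest priority with respect to build" means that a file in a src.$model directory shadows the release file of the same name. The sketch below illustrates that first-match-wins lookup; it does not reproduce the actual CCSM build mechanism, and the function name and directory names in the comments are illustrative assumptions.

```python
from pathlib import Path


def resolve_source(filename, search_path):
    """Return the first existing copy of filename along an ordered search path."""
    for directory in search_path:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


# Hypothetical usage: the case directory is listed first, so a modified copy
# there wins over the unmodified release copy.
#   resolve_source("somefile.F90", ["scripts/mycase/src.atm", "<release source dir>"])
```

Only the files that are actually changed need to be copied into src.$model; everything else is still found in the release tree.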


Multiple Machine Support

  • Should run on blackforest, babyblue, and ute “out of the box”

  • “Other” machines include seaborg, nirvana, eagle, falcon, cheetah

  • Supported platforms are indicated in $OS, $SITE, $MACH, $ARCH environment variables in the main script

  • See also scripts/tools/test.a1.mods.$MACH for suggested changes to test.a1.run for “other” machines.


Running on a “New” Machine

  • Main script

    • Set batch queue commands

    • Add new $OS, $SITE, $MACH, $ARCH options

    • Set standard CCSM path names, $CSMROOT, …

    • Harvester submission issues

    • Set data movement variables, $LMSINP, …

  • Harvester script

    • May require modification

  • Tools

    • May need to modify ccsm_msread, ccsm_mswrite

  • Build

    • Modify models/bld/Macros.$OS file


ccsm_joe

  • Created by main script

  • Updated every time the main script runs

  • Case dependent

  • Records important ccsm environment variables

  • Can be “sourced” by other scripts to inherit ccsm environment variables


Interactive/Batch Issues

  • Can run main script interactively

  • Typically used to build and pre-stage initial data

  • Uncomment the “exit” command in the main script to stop the script before it starts CCSM execution

  • Batch environment highly site dependent

    • NQS

    • Loadleveler

    • LSF

    • PBS


Common Errors (1)

  • Model won’t build

    • Try rebuilding clean

    • Remove all obj directories, these are $OBJROOT/model/obj which is normally equivalent to $EXEROOT/model/obj

    • When rebuilding, make sure $SETBLD is true in main script

  • Model won’t continue due to restart problem

    • Determine cause of problem: quota, hardware, script, zero length files, rpointer problems

    • Fix if possible

    • Back up to latest “good” restart dataset

    • Rerun


Common Errors (2)

  • Ice model stops due to mp transport error

    • Double ndte in ice.setup.csh ice model namelist

    • Back up to latest “good” restart dataset

    • Run past previous stop date

    • Reset ndte value

  • Ocean model non-convergence

    • Add about 10% to the number of model timesteps/hour in ocn.setup.csh, DT_COUNT

    • Back up to latest “good” restart dataset

    • Run past previous stop date

    • Reset DT_COUNT

    • Non-convergence on first timestep is special case
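
Both workarounds above amount to a temporary change to a step count. The arithmetic is sketched below with hypothetical starting values; the actual ndte and DT_COUNT settings depend on the case and should be read from ice.setup.csh and ocn.setup.csh.

```python
def double_ndte(ndte):
    """Ice workaround: double ndte for the rerun."""
    return 2 * ndte


def bump_dt_count(dt_count, fraction=0.10):
    """Ocean workaround: add about 10% (at least one step) to DT_COUNT."""
    return dt_count + max(1, round(dt_count * fraction))


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(double_ndte(120))     # doubled value to use for the rerun
    print(bump_dt_count(20))    # roughly 10% more steps for the rerun
```

In both cases the original value is restored once the run has passed the date at which it previously stopped.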


Tools

  • Under scripts/tools

    • ccsm_getfile : hierarchical search for file

    • ccsm_getinput : hierarchical search for input file

    • ccsm_msread : copies a file from local mass store

    • ccsm_mswrite : copies a file to local mass store

    • ccsm_checkenvs : echoes ccsm environment variables, used to create ccsm_joe

    • ccsm_getrestart : copies restart files from $ARCROOT/restart to appropriate $EXEROOT and scripts/$CASE directories


Performance

  • This is complicated!

  • Issues

    • Performance of components and system as a function of resolution and configuration

    • Scalability of individual components, scaling efficiency of individual components

    • Task/Thread counts

    • Components sharing nodes, overloading nodes with multiple components, overloading threads, overloading tasks

    • Load balance of coupled system



CCSM Load Balancing

[Figure: load-balance timing diagram for the coupled system; timings in seconds per day]

  • Processors per component

    • ocean : 40

    • atm : 32

    • ice : 16

    • land : 12

    • cpl : 4

    • total : 104


Component/Hardware layout

  • Machine, set of nodes

  • Nodes, group of processors that share memory

  • Processors, individual computing elements

  • General rules

    • Do not oversubscribe processors, place only 1 MPI task or 1 thread on each processor

    • Minimize the number of nodes used for a given component and processor requirement

    • Multiple components can share a node as long as there is no oversubscription of processors

    • Test several decompositions, layouts, task/thread combinations to try to optimize performance
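
The first two rules above reduce to simple arithmetic. The sketch below is a simplified check that looks only at totals and ignores how tasks are actually placed on nodes; the function names and the 4-processor node size are assumptions for illustration.

```python
def nodes_needed(processors, processors_per_node):
    """Smallest number of nodes that can hold the given processor count."""
    return -(-processors // processors_per_node)  # ceiling division


def oversubscribed(components, processors_per_node, nodes):
    """True if total tasks*threads exceeds the processors available.

    components maps a component name to a (ntasks, nthrds) pair.
    """
    demand = sum(tasks * threads for tasks, threads in components.values())
    return demand > processors_per_node * nodes


if __name__ == "__main__":
    layout = {"atm": (8, 4), "lnd": (2, 4), "ocn": (40, 1),
              "ice": (8, 1), "cpl": (1, 4)}
    total = sum(t * h for t, h in layout.values())
    print(total, nodes_needed(total, 4))   # hypothetical 4-processor nodes
    print(oversubscribed(layout, 4, 22))
```

Passing this check is necessary but not sufficient: a layout can fit in total and still place two tasks on one processor if the batch geometry is wrong.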


Summary

  • CCSM is a complicated multi-executable climate model; expect there to be “spin-up” time

  • CCSM is a scientific research code

  • There are many possible components, configurations, platforms, and resolutions; we are unable to test everything

  • Users are responsible for validating their science

  • NCAR can help with software/configuration problems, [email protected]

  • Please report bugs, fixes, improvements, and ports to new hardware, so we can incorporate those changes! [email protected]

