Using the BYU SP-2

Learn about the BYU SP-2 system, its interactive nodes for login and testing, batch scheduling system, and parallel file system. Explore compiler options, libraries, and documentation. Understand job scheduling, backfill scheduling, and using LoadLeveler. Use sample scripts and commands for job management.

hortonj

Presentation Transcript


  1. Using the BYU SP-2

  2. Our System
     • Interactive nodes (2)
       • used for login, compilation & testing
       • marylou10.et.byu.edu
     • I/O and scheduling nodes (7)
       • used for the batch scheduling system and the parallel file system
     • Compute nodes (26)
       • 22 with 4 processors each
       • 4 with 16 processors each

  3. Compilers
     • xlc (C)
     • xlC (C++)
     • xlf (Fortran)
     • Parallel compilers
       • mpcc
       • mpCC
       • mpxlf
     • Optimization
       • -O5 -qarch=pwr3 -qtune=pwr3 -qhot
     • Libraries
       • -lblas, -lfftw, -llapack, -lessl
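As a rough illustration of how the pieces on this slide combine, the sketch below assembles one compile line from the parallel C compiler, the optimization flags, and one of the libraries. The source file name `thing.c` and the helper `compile_cmd` are hypothetical, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Flags and library taken from the slide; file names are hypothetical.
OPT_FLAGS="-O5 -qarch=pwr3 -qtune=pwr3 -qhot"
LIBS="-lessl"

# Print (do not run) the mpcc command for one C source file.
compile_cmd() {
    src=$1
    exe=${src%.c}          # strip the .c suffix to name the executable
    echo "mpcc $OPT_FLAGS -o $exe $src $LIBS"
}

compile_cmd thing.c
# prints: mpcc -O5 -qarch=pwr3 -qtune=pwr3 -qhot -o thing thing.c -lessl
```

Printing the line first makes it easy to check the flags on an interactive node before compiling for real.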

  4. Other Stuff
     • Documentation
       • http://www-1.ibm.com/servers/eserver/pseries/library/sp_books/
       • http://marylou.byu.edu
     • Launching parallel jobs
       • done through the batch scheduler
       • your job is a shell script that you hand to the batch scheduler for execution
       • can look at xloadl for help creating the script

  5. Batch job scheduler
     • Batch schedulers
       • PBS (Portable Batch System): open source
       • LoadLeveler: descendant of Condor
     • The process
       • user submits jobs to a queue
       • machines register with the scheduler, offering to run jobs of a certain class
       • scheduler allocates jobs to machines and tracks them
       • once started, jobs are scheduled by the kernel

  6. Scheduling parallel jobs
     • Jobs can ask for
       • number of nodes (1 CPU)
       • number of tasks per node (multiple CPUs)
       • non-shared nodes (multiple CPUs)
     • Mixing jobs can be bad
       • two I/O-intensive processes on a 2-CPU node can ruin performance for both
       • same for two RAM-intensive processes
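The three kinds of request above map onto LoadLeveler job command file keywords. The fragment below is a minimal sketch, assuming the `node`, `tasks_per_node`, and `node_usage` keywords with illustrative values (2 nodes, 4 tasks each); the helper `total_tasks` is hypothetical and only shows the arithmetic.

```shell
#!/bin/sh
# Illustrative directives (plain comments to the shell, read by LoadLeveler):
# @ node           = 2
# @ tasks_per_node = 4
# @ node_usage     = not_shared

# Number of tasks such a request would start: nodes * tasks per node.
total_tasks() {
    echo $(( $1 * $2 ))
}

total_tasks 2 4   # prints 8
```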

  7. Scheduling parallel jobs (2)
     • All nodes, processors, and resources given to a job stay allocated for the duration of the entire job
     • No dynamic adjustments, except by creating jobs with multiple steps
       • each step can have different requirements
       • each step can express a dependency on other steps
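A multi-step job might look like the sketch below. It assumes the `step_name` and `dependency` keywords and the `LOADL_STEP_NAME` environment variable; the step names, program names, and output files are hypothetical. The second step asks for different resources and starts only if the first exits with status 0.

```shell
#!/bin/ksh
# Step 1: serial preparation.
# @ step_name   = prepare
# @ job_type    = serial
# @ class       = short
# @ output      = prepare.out
# @ error       = prepare.err
# @ queue
# Step 2: parallel run, only after "prepare" returns 0.
# @ step_name   = solve
# @ dependency  = (prepare == 0)
# @ job_type    = parallel
# @ class       = short
# @ total_tasks = 4
# @ output      = solve.out
# @ error       = solve.err
# @ queue

# With no executable keyword, every step runs this same script,
# so branch on the step name. The commands are only echoed here.
run_step() {
    case "$1" in
        prepare) echo "would run: ./prepare_input" ;;
        solve)   echo "would run: ./solver" ;;
        *)       echo "unknown step: $1" ;;
    esac
}

# Outside LoadLeveler the variable is unset, so default to the first step.
run_step "${LOADL_STEP_NAME:-prepare}"
```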

  8. Scheduling parallel jobs (3)
     • Management must
       • allow some jobs to use the entire machine
       • allow short jobs to get started quickly; they should not have to wait weeks in the queue
     • Some very long jobs may be needed, but are to be avoided

  9. Backfill scheduling
     [Figure: chart of Jobs A, B, C, and D laid out on axes labelled "system" and "time"; other labels: "Job C", "10 nodes", "B A C D".]

  10. Backfill scheduling
      • Requires a real time limit to be set
      • A more accurate (shorter) estimate gives a better chance of running earlier
      • Short jobs can move through the system more quickly
      • Uses the system better by not wasting cycles while jobs wait
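The reason a shorter time limit helps can be sketched with a simplified backfill test. This is not how LoadLeveler or any particular scheduler implements it; it is a conservative rule of thumb in which a waiting job may jump ahead only if it fits in the nodes that are free now and its time limit expires before the highest-priority waiting job is due to start. The function name, argument order, and time units are all assumptions.

```shell
#!/bin/sh
# can_backfill FREE_NODES RESERVED_START NOW JOB_NODES JOB_LIMIT
# Prints "yes" if the job fits in the free nodes and would finish
# (by its own time limit) before the reserved start time, else "no".
can_backfill() {
    free=$1 reserved=$2 now=$3 nodes=$4 limit=$5
    if [ "$nodes" -le "$free" ] && [ $(( now + limit )) -le "$reserved" ]; then
        echo yes
    else
        echo no
    fi
}

can_backfill 4 120 0 2 60    # prints yes: 2 of 4 free nodes, ends at 60
can_backfill 4 120 0 2 180   # prints no: would still be running at 120
can_backfill 4 120 0 8 60    # prints no: needs more nodes than are free
```

The same job is accepted or rejected purely on its declared limit, which is why an inflated estimate keeps a short job waiting.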

  11. Using LoadLeveler
      • Graphical user interface: xloadl
      • Make a shell script with LoadLeveler keywords as shell comments

          # @ output = thing.log
          # @ error = thing.err
          # @ class = short
          # @ executable = thingx
          # @ node = 6,10
          # @ tasks_per_node = 4
          # @ requirements = (Adapter==hps_us)
          # @ queue

  12. Sample LoadLeveler Script

          #!/bin/ksh
          # @ job_type = parallel
          # @ input = /dev/null
          # @ output = $(Executable).$(Cluster).$(Process).out
          # @ error = $(Executable).$(Cluster).$(Process).err
          # @ initialdir = /gstudent/student_rt_y/directory
          # @ notify_user = student_rt_y@byu.edu
          # @ class = short
          # @ notification = complete
          # @ checkpoint = no
          # @ restart = no
          # @ requirements = (Arch == "power3")
          # @ blocking = unlimited
          # @ total_tasks = 4
          # @ network.MPI = switch,shared,US
          # @ queue

          ./your_exe_and_any_args

  13. Sample serial job

          #!/bin/ksh
          # @ job_type = serial
          # @ input = /dev/null
          # @ output = $(Executable).$(Cluster).$(Process).out
          # @ error = $(Executable).$(Cluster).$(Process).err
          # @ initialdir = /gstudent/student_rt_y
          # @ notify_user = student_rt_y@byu.edu
          # @ class = medium
          # @ notification = complete
          # @ checkpoint = no
          # @ restart = no
          # @ queue

          paupnew Hlav3ashort.paup

  14. LoadLeveler commands
      • llq: shows all jobs
        • can also use showq
      • llq -s JobID: shows why a job is not running
      • llclass: shows classes
      • llstatus: shows machines
      • llcancel JobID: cancels a job
      • llhold JobID: puts a job in hold state
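A typical session strings these commands together after submitting a script with `llsubmit`, which the list above does not mention. The sketch below assumes that `llsubmit` confirms with a message of the form `llsubmit: The job "m1015i.1627" has been submitted.`; that format, the job ID, the file name `thing.cmd`, and the helper `job_id_from_llsubmit` are all illustrative assumptions. The LoadLeveler calls are guarded so the script still runs on a machine without it.

```shell
#!/bin/sh
# Pull the job ID out of llsubmit's confirmation message (assumed format).
job_id_from_llsubmit() {
    sed -n 's/.*The job "\([^"]*\)" has been submitted.*/\1/p'
}

if command -v llsubmit >/dev/null 2>&1; then
    id=$(llsubmit thing.cmd | job_id_from_llsubmit)
    llq -s "$id"        # ask why the job is not running yet
    # llhold "$id"      # put it in hold state
    # llcancel "$id"    # or remove it from the queue
else
    # No LoadLeveler here: show the parsing on a sample message.
    echo 'llsubmit: The job "m1015i.1627" has been submitted.' | job_id_from_llsubmit
fi
```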

  15. Sample llq output

      bash-2.05a$ llq
      Id                       Owner      Submitted   ST PRI Class        Running On
      ------------------------ ---------- ----------- -- --- ------------ -----------
      m1015i.1127.0            mdt36      8/7   12:41 R  50  long         m1009i
      m1015i.1128.0            mdt36      8/7   12:41 R  50  long         m1019i
      m1015i.1497.0            jl447      8/12  16:25 R  50  long         m1012i
      m1015i.1544.0            to5        8/13  08:44 R  50  long         m1045i
      m1015i.1545.0            to5        8/13  08:44 R  50  long         m1045i
      …
      m1015i.1602.0            taskman    8/14  08:13 R  50  short        m1017i
      m1015i.1598.0            taskman    8/14  08:13 R  50  short        m1014i
      m1015i.1601.0            taskman    8/14  08:13 R  50  short        m1017i
      m1015i.1599.0            taskman    8/14  08:13 R  50  short        m1014i
      m1015i.1600.0            taskman    8/14  08:13 R  50  short        m1011i
      m1015i.1626.0            mendez     8/14  13:07 I  50  long
      m1015i.1625.0            cr66       8/14  12:40 I  50  medium
      m1015i.1513.0            jl447      8/13  07:08 I  50  long
      m1015i.1572.0            dvd        8/13  10:45 I  50  medium
      m1015i.1576.0            dvd        8/13  11:22 I  50  medium
      m1015i.1577.0            dvd        8/13  11:25 I  50  medium
      m1015i.1566.0            mdt36      8/13  08:51 I  50  long
      m1015i.1564.0            mdt36      8/13  08:50 I  50  long
      …
      m1015i.1612.0            taskman    8/14  08:27 I  50  short
      m1015i.1624.0            taskman    8/14  08:57 I  50  short
      m1015i.1623.0            taskman    8/14  08:57 I  50  short

      58 job step(s) in queue, 23 waiting, 0 pending, 35 running, 0 held, 0 preempted
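Output in this shape is easy to summarize with standard tools. The sketch below counts job steps per state by reading the ST value as the fifth whitespace-separated field, which matches the sample above; the helper name `count_states` is hypothetical, and the three sample rows are copied from the slide.

```shell
#!/bin/sh
# Count job steps per state in llq-style output read from stdin.
# Job lines are recognised by an ID of the form host.cluster.step,
# so the header, the dashed rule, and the summary line are skipped.
count_states() {
    awk '$1 ~ /^[^ ]+\.[0-9]+\.[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# On the SP-2 this would be: llq | count_states
count_states <<'EOF'
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
m1015i.1127.0            mdt36      8/7   12:41 R  50  long         m1009i
m1015i.1497.0            jl447      8/12  16:25 R  50  long         m1012i
m1015i.1626.0            mendez     8/14  13:07 I  50  long
EOF
# prints:
# I 1
# R 2
```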

  16. Sample showq output

      bash-2.05a$ showq
      ACTIVE JOBS--------------------
      JOBNAME        USERNAME  STATE    PROC   REMAINING  STARTTIME
      m1015i.1581.0  taskman   Running     1    18:39:00  Wed Aug 14 08:06:24
      m1015i.1582.0  taskman   Running     1    18:39:00  Wed Aug 14 08:06:24
      m1015i.1580.0  taskman   Running     1    18:39:00  Wed Aug 14 08:06:24
      …
      m1015i.1615.0  taskman   Running     1    21:33:42  Wed Aug 14 11:01:06
      m1015i.1613.0  taskman   Running     1    23:43:05  Wed Aug 14 13:10:29
      m1015i.1575.0  dvd       Running     4  2:15:10:38  Wed Aug 14 04:38:02
      m1015i.1127.0  mdt36     Running     8  2:23:14:21  Wed Aug  7 12:41:45
      …
      m1015i.1567.0  jar65     Running     4  9:04:07:44  Tue Aug 13 17:35:08
      m1015i.1569.0  jar65     Running     4  9:08:28:16  Tue Aug 13 21:55:40
      m1015i.1547.0  to5       Running     8  9:21:11:49  Wed Aug 14 10:39:13
      m1015i.1546.0  to5       Running     8  9:21:11:49  Wed Aug 14 10:39:13

      35 Active Jobs   150 of 184 Processors Active (81.52%)
                        26 of  34 Nodes Active (76.47%)

      IDLE JOBS----------------------
      JOBNAME        USERNAME  STATE    PROC     WCLIMIT  QUEUETIME
      m1015i.1513.0  jl447     Idle        2  5:00:00:00  Tue Aug 13 07:08:09
      m1015i.1572.0  dvd       Idle        8  3:00:00:00  Tue Aug 13 10:45:18
      …

      23 Idle Jobs

      NON-QUEUED JOBS----------------
      JOBNAME        USERNAME  STATE    PROC     WCLIMIT  QUEUETIME

      Total Jobs: 58   Active Jobs: 35   Idle Jobs: 23   Non-Queued Jobs: 0

  17. LoadLeveler environment
      • Normally the same as your login environment
      • Limits are set; use llclass -l to see values
        • ulimit -S -a
        • ulimit -H -a
      • Big heap requirements
        • -bmaxdata:0x80000000 up to 2 GB data (heap)
        • -q64 -bmaxdata:0x…. up to 8 EB
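The limits and the heap setting can be checked from a shell. The sketch below prints the soft and hard limits with the two `ulimit` calls from the slide and converts a `-bmaxdata` value from hexadecimal bytes to MiB; the helper `maxdata_mib` is hypothetical, and the commented link line uses made-up file names.

```shell
#!/bin/sh
# Soft limits (what the job gets by default), then hard limits (the ceiling).
ulimit -S -a
ulimit -H -a

# Convert a -bmaxdata value (bytes, written in hex) to MiB.
maxdata_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

maxdata_mib 0x80000000   # prints 2048, i.e. 2 GB of heap

# Hypothetical link line using that value:
#   xlc -bmaxdata:0x80000000 -o thingx thing.c
```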
