This update provides a comprehensive overview of the TSCC (Triton Shared Computing Cluster) as of May 2014, covering its heterogeneous architecture and hybrid condo/hotel usage model, including general computing and GPU nodes. The document details user participation, job statistics, operational improvements, anticipated hardware upgrades, and the resolution of key issues in node operation and job management.
Triton Shared Computing Cluster Project Update Jim Hayes jhayes@sdsc.edu 2014-05-21
TSCC To Date • A Look Under the Hood • Plans for the Near Future
Project Model • TSCC is a heterogeneous cluster with a hybrid usage model • General computing, GPU, large-memory nodes; h/w choices reviewed annually and expected to change • “Condo” nodes purchased by researchers, access shared among condo owners; “hotel” nodes provided by RCI project, access via cycle purchase • Operation and infrastructure costs paid by condo owners, subsidized by RCI project and hotel sales • All users have access to InfiniBand and 10GbE networking, home and parallel file systems • TSCC one-year production anniversary 2014-05-10
Participation • Condo • 15 groups w/169 users • 116 general compute nodes (5+ in pipeline), 20 GPU • Hotel • 192 groups w/410 users • 46 general compute nodes, 3 GPU, 4 PDAFM • Of 94 trial (90-day, 250 SU) accounts, 20 bought cycles • Classes
Jobs • ~1.54 million jobs since 5/10/2013 production • 800K hotel • 391K home • 42K condo • 98K glean • >6 million SUs spent
Job Stats – Node Count • Single-node job count is 1.53M, about 233x the 2-node count
Issues from Prior Meeting • Maui scheduler crashing/hanging/unresponsive – Fixed • Upgrade of Torque fixed communication problems • Duplicate job charges/hanging reservations – Managed • Less frequent post-upgrade; wrote scripts to detect/reverse them • Glean queue not working – Fixed • Gave up on Maui and wrote a script to manage the glean queue • X11 forwarding failure – Fixed • Problem was a missing IPv6 configuration file • Home filesystem unresponsive under write load – Fixed • Post-upgrade, the filesystem handles user write load gracefully • ZFS/NFS quotas broken; handled manually • User access to snapshots not working; restores via ticket
TSCC To Date • A Look Under the Hood • Plans for the Near Future
TSCC Rack Layout [diagram: racks tscc-0-n through tscc-7-n]
Networking [diagram: per-rack switches: InfiniBand x36, 1GbE x40, 10GbE x32]
Node Classes [diagram: racks house Hotel, Hotel GPU, Condo, Condo GPU, PDAFM, Administration, and Home F/S nodes]
Processors [diagram: node types: 2x8 2.6GHz Sandy Bridge, 4x8 2.5GHz Opteron, 2x6 2.3GHz Sandy Bridge, 2x6 2.6GHz Ivy Bridge, 2x8 2.2GHz Sandy Bridge]
Memory [diagram: node memory sizes: 32GB, 64GB, 128GB, 256GB, 512GB]
GPUs [diagram: GPU node configurations: 4x GTX 780 Ti, 4x GTX 780, 4x GTX 680]
InfiniBand Connectivity [diagram: rack0: ibswitch1, ibswitch2; rack1: ibswitch3, ibswitch4; plus ibswitch6, ibswitch7]
qsub Switches for IB Jobs • Only necessary for condo/glean jobs, because • Hotel nodes all in same rack (rack0) • Home nodes on same switch • To run on a single switch, specify switch property, e.g., • qsub -q condo -l nodes=2:ppn=8:ib:ibswitch3 • qsub -q hotel -l nodes=2:ppn=8:ib:ibswitch1 • To run in a single rack (IB switches interconnected), specify rack property, e.g., • qsub -q condo -l nodes=2:ppn=8:ib:rack1
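The same switch/rack properties can also be set inside a batch script rather than on the qsub command line. A minimal sketch, assuming a 2-node MPI condo job; the job name, walltime, and application name (my_mpi_app) are placeholders, not site-provided values:

```shell
#!/bin/sh
# Hypothetical TSCC batch script keeping a 2-node MPI job on one IB switch.
# The #PBS resource line mirrors the qsub flags shown above.
#PBS -q condo
#PBS -l nodes=2:ppn=8:ib:ibswitch3
#PBS -l walltime=01:00:00
#PBS -N ib-switch-demo

cd "$PBS_O_WORKDIR"        # run from the directory the job was submitted from
mpirun ./my_mpi_app        # both nodes sit on ibswitch3, minimizing IB hops
```

Submitting the script with plain `qsub script.sh` then picks up the embedded directives.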
Queues • All users have access to hotel, gpu-hotel, and pdafm queues • qsub [-q hotel] – max time 7d, total cores/user 176 • qsub -q gpu-hotel – max time 14d, total cores/user 36 • Non-GPU jobs may be run in gpu-hotel • qsub -q pdafm – max time 3d, total cores/user 96 • Condo owners have access to home, condo, glean queues • qsub -q home – unlimited time, cores • qsub -q condo – max time 8h, total cores/user 512 • qsub -q glean – unlimited time, max total cores/user 1024 • No charge to use, but jobs will be killed for higher-priority jobs • GPU owners have access to gpu-condo queue • qsub -q gpu-condo – max time 8h, total cores/user 84 • GPU node jobs allocated 1 GPU per 3 cores • Queue limits subject to change w/out notice
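Because glean jobs can be killed at any time for higher-priority work, a common pattern is to make the job resumable. A minimal sketch, assuming the application supports restarting from a checkpoint file; the application and file names are hypothetical:

```shell
#!/bin/sh
# Hypothetical glean-queue script: free cycles, but preemptible at any time.
# Assumes my_app can write checkpoint.dat and resume from it (illustrative).
#PBS -q glean
#PBS -l nodes=1:ppn=8

cd "$PBS_O_WORKDIR"
if [ -f checkpoint.dat ]; then
    ./my_app --resume checkpoint.dat   # continue after a preemption
else
    ./my_app                           # fresh start
fi
```

Resubmitting the same script after a preemption then picks up where the last checkpoint left off.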
Commands to Answer FAQs • “Why isn’t my job running?” • checkjob job_id, e.g., checkjob 1234567 • “BankFailure” indicates not enough SUs remain to run the job • “No Resources” may indicate a bad request (e.g., ppn=32 in hotel) • “What jobs are occupying these nodes?” • lsjobs --property=string, e.g., lsjobs --property=hotel • “How many SUs do I have left?” • gbalance -u login, e.g., gbalance -u jhayes • “Why is my balance so low?” • gstatement -u login, e.g., gstatement -u jhayes • “How much disk space am I using?” • df -h /home/login, e.g., df -h /home/jhayes
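The balance check can be scripted, e.g. to avoid a BankFailure before submitting. A minimal sketch; the "Balance: <n>" output format assumed here is illustrative, not the documented gbalance format, and the 500-SU threshold is arbitrary:

```shell
#!/bin/sh
# Hypothetical pre-submission check: pull the SU balance out of gbalance-style
# output and refuse to submit when it is too low.
sus_remaining() {
    # $1: captured gbalance output; prints the numeric balance field
    printf '%s\n' "$1" | awk '/Balance/ {print $2}'
}

# Demo on canned output; a real run would capture: gbalance -u "$USER"
sample="Balance: 1250"
left=$(sus_remaining "$sample")
if [ "$left" -lt 500 ]; then
    echo "only $left SUs left; not submitting"
else
    echo "$left SUs left; OK to qsub"
fi
```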
TSCC To Date • A Look Under the Hood • Plans for the Near Future
Hardware Selection/Pricing • Jump to Haswell processors in 4th quarter 2014 • Go back to vendors for fresh quotes • Original vendor agreements for fixed pricing expired 12/2013-1/2014 • Interim pricing on HP nodes $4,300 • GPU pricing still ~$6,300 • Final price depends on the GPU selected; many changes in NVIDIA offerings since January 2013
Participation • First Haswell purchase will be hotel expansion • Hotel getting increasingly crowded in recent months • 8 nodes definite, 16 if $$ available • Goal for coming year is 100 new condo nodes • Nominal “break even” point for cluster is ~250 condo nodes • Please help spread the word!
Cluster Operations • Adding I/O nodes by mid-June • Offload large I/O from login nodes w/out burning SUs • General s/w upgrade late June • Latest CentOS, application versions • Research user-defined web services over summer • oasis refresh toward end of summer • Automate InfiniBand switch/rack grouping • Contemplating transition from Torque/Maui to Slurm • Maui is no longer actively developed/supported • If we make the jump, we’ll likely use translation scripts to ease the transition