
Cray XT3 Experience so far


Presentation Transcript


  1. Cray XT3 Experience so far
     Horizon Grows Bigger
     Richard Alexander
     24 January 2006
     raa@cscs.ch

  2. Support Team
     Dominik Ulmer
     Hussein Harake
     Richard Alexander
     Davide Tacchella
     Neil Stringfellow
     Francesco Benvenuto
     Claudio Redaelli
     + Cray Support

  3. Short History
  • Cray turn-over on 8 July 05
  • Early users
    • CSCS staff
    • PSI staff
    • Very early adopters
  • Initially marked by great instability

  4. What is an XT3?
  • Massively parallel Opteron-based UP
  • Catamount job launcher on compute nodes
    • One executable
    • No sockets, no fork, no shared memory (see the sketch below)
  • Suse Linux on “service nodes”
    • I/O nodes, login nodes, system db, boot
    • “yod or pbs-mom” nodes
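
  A compute-node program under Catamount is one statically linked executable replicated across the allocated nodes; with no sockets, fork, or shared memory available, inter-node communication goes through MPI. A minimal sketch of such a program follows; the compiler wrapper and the yod launch line in the comment are illustrative assumptions, not commands taken from these slides.

    /* hello_xt3.c - minimal sketch of a Catamount compute-node program.
     * Assumed build/launch (illustrative only): compile with the site's
     * MPI compiler wrapper, then launch from a yod/login node with
     * something like "yod -sz 64 ./hello_xt3".
     * Catamount constraint: no fork(), no sockets, no shared memory. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* The same executable runs on every compute node; rank 0 reports. */
        if (rank == 0)
            printf("Running on %d compute nodes\n", size);

        MPI_Finalize();
        return 0;
    }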

  5. Current configuration
  • 12 cabinets (racks) with approximately 96 compute nodes per rack
  • 4 login nodes with a DNS rotary name
  • Software maintenance workstation (smw)
  • Suse Linux with 2.4 kernel
  • PBS/Pro
  • TotalView
  • Portland Group Compilers 6.0.5

  6. Changes
  • 6 Unicos/LC upgrades - today 1.03.14 (1.3)
  • Multiple hardware upgrades
    • Most important was raising the Seastar ASIC voltage from 1.5 V to 1.6 V
  • Multiple software upgrades
  • Multiple versions of Lustre (CFS), the high-performance file system
  • Multiple firmware upgrades - “portals”
  • We have been up for 1 week at a time!

  7. Acceptance Period
  • Began: 21 Nov 05
  • Functional tests
  • Performance tests
  • Stability test for 30 days (7am - 6pm)

  8. Cray XT3: Commissioning & Acceptance
  • Acceptance completed 22 December 2005
  • Put into production (i.e. opened to all users) on 18 January; slow ramp-up planned
  • Rebooted almost daily during acceptance, less often afterwards
  • Current usage:
    • >75% node utilization
    • Jobs using 64-128 nodes are “typical”
  • Two groups tell us they have done science they could not do before

  9. Cray XT3 Use Model: Fit the work to the hardware
  • Palu - 1100 compute nodes: batch only
  • Gele - 84 compute nodes: new users
    • compile, debug, scale
  • Test system - 56 nodes: test environment
    • Delivery planned March 06

  10. Open issues and next steps
  • High-speed network not 100% stable
  • High-speed file system Lustre is young and immature
  • Current batch system limited in functionality
  • Bugs lead to intermittent node failures

  11. High Speed Network
  • Stability improved significantly in the last month
  • Stability still not satisfactory; Cray analyzing problems
  • Most errors affect/abort single jobs
  • Occasional errors require a reboot (~1/week?)
  • No high-speed network => nothing works

  12. File system Lustre
  • Genuine parallel file system
  • Great performance
  • Very young and immature feature set
  • When Lustre is unhealthy (~once or twice a month) the system must be rebooted
  • Cray continuously improving Lustre; deadline for a stable version hard to predict
  • Real Lustre errors are difficult to differentiate from HSN problems

  13. Job Scheduling
  • PBS/Pro batch system
  • Current PBS cannot implement desired management policies (priority to large jobs, backfilling, reservations)
  • Cray is delivering a “special” version of PBS with a Tcl scheduler
    • Will implement priorities and backfilling (a backfill sketch follows below)
    • Testing begins February ’06
  • Roadmap for reservations still t.b.d.
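
  For context, backfilling lets a smaller job run ahead of a blocked high-priority job only when it is guaranteed to finish before that job's reserved start time, so the large job is not delayed. The sketch below only illustrates that policy decision; it is not the PBS or Tcl scheduler code, and all names in it are invented for the example.

    /* backfill_sketch.c - illustrative backfill check, not the PBS/Tcl scheduler. */
    #include <stdio.h>

    struct job {
        const char *name;
        int nodes;          /* nodes requested */
        double walltime_h;  /* requested walltime in hours */
    };

    /* A candidate may be backfilled if it fits in the free nodes and will
     * finish before the blocked top-priority job is due to start. */
    static int can_backfill(const struct job *candidate, int free_nodes,
                            double hours_until_top_job_starts)
    {
        return candidate->nodes <= free_nodes &&
               candidate->walltime_h <= hours_until_top_job_starts;
    }

    int main(void)
    {
        struct job small = { "64-node job", 64, 2.0 };

        /* Example: 128 nodes are idle and the large job at the head of the
         * queue cannot start for another 3 hours while nodes drain. */
        if (can_backfill(&small, 128, 3.0))
            printf("%s can be backfilled without delaying the large job\n",
                   small.name);
        return 0;
    }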

  14. Problems of the Day: Intermittent node failures
  • 3-15 times per day nodes die due to several bugs
  • Cray working on each of these bugs
  • Extremely difficult to diagnose
  • Currently nodes stay down until the next machine reboot
  • Single node reboot available March - April 2006??

  15. Near Term Future
  • Single node reboot
  • PBS scheduler - a beginning
  • Upgrade gele - the novice user platform
  • Bring up smallest system for systems work
  • Try out dual core - upgrade part of the system
  • Test Linux on the compute nodes

  16. Summary
  • System in production and available for development work
  • System gaining maturity, but still far from SX-5 or IBM Power-4 standards
  • Main current open issues:
    • Parallel file system maturity
    • Scheduling system
    • Node failures
  • More interruptions than we would like!
