
VMS Clusters: Advanced Concepts

VMS Clusters: Advanced Concepts. CETS2001 Seminar 1090 Sunday, September 9, 2001, 210A Keith Parris. Speaker Contact Info. Keith Parris E-mail: parris@encompasserve.org or keithparris@yahoo.com Web: http://www.geocities.com/keithparris/ and http://encompasserve.org/~kparris/



  1. VMS Clusters: Advanced Concepts CETS2001 Seminar 1090 Sunday, September 9, 2001, 210A Keith Parris

  2. Speaker Contact Info Keith Parris E-mail: parris@encompasserve.org or keithparris@yahoo.com Web: http://www.geocities.com/keithparris/ and http://encompasserve.org/~kparris/ Integrity Computing, Inc. 2812 Preakness Way Colorado Springs, CO 80916-4375 (719) 392-6696

  3. Topics to be covered • Large Clusters • Multi-Site Clusters • Disaster-Tolerant Clusters • Long-Distance Clusters • Performance-Critical Clusters • Recent Cluster Developments

  4. Large Clusters • Large by what metric? • Large in node-count • Large in CPU power and/or I/O capacity • Large in geographical span

  5. Large Node-Count Clusters • What is a “large” number of nodes for a cluster? • VMS Cluster Software SPD limit: • 96 VMS nodes total • 16 VMS nodes per CI Star Coupler

  6. Large Node-Count Clusters • What does a typical large node-count cluster configuration consist of? • Several core boot and disk servers • Lots & lots of workstation satellite nodes

  7. Large Node-Count Clusters • Why build a large node-count cluster? • Shared access to resources by a large number of users • Particularly workstation users • Easier system management of a large number of VMS systems • Managing two clusters is close to twice the work of managing one • Adding just 1 more node to an existing cluster is almost trivial in comparison

  8. Large Node-Count Clusters • Challenges in building a large node-count cluster: • System (re)booting activity • LAN problems • System management • Hot system files

  9. System (re)booting activity • Reboot sources: • Power failures • LAN problems • VMS or software upgrades • System tuning

  10. System (re)booting activity • Fighting reboot pain: • Power failures: • UPS protection • LAN problems: • Redundancy • Sub-divide LAN to allow “Divide and conquer” troubleshooting technique • Monitoring

  11. System (re)booting activity • Fighting reboot pain: • VMS or software upgrades; patches: • Try to target safe “landing zones” • Set up for automated reboots • System tuning: • Run AUTOGEN with FEEDBACK on regular schedule, pick up new parameters during periodic automated reboots

  12. System (re)booting activity • Factors affecting (re)boot times: • System disk throughput • LAN bandwidth, latency, and quality • Boot and disk server horsepower

  13. System disk throughput • Off-load work from system disk • Move re-directable files (SYSUAF, RIGHTSLIST, queue file, etc.) off system disk • Put page/swap files on another disk (preferably local to satellite nodes) • Install applications on non-system disks when possible • Dump files off system disk to conserve space • All this reduces write activity to system disk, making shadowset or mirrorset performance better

  14. System disk throughput • Avoid disk rebuilds at boot time: • Set ACP_REBLDSYSD=0 to prevent boot-time rebuild of system disk • While you’re at it, use MOUNT/NOREBUILD on all disks mounted in startup • But remember to set up a batch job to run $ SET VOLUME/REBUILD commands during off-hours, to free up disk blocks incorrectly left marked allocated when nodes crash (blocks which were in the node’s free-extent cache)

  15. System disk throughput • Faster hardware: • Caching in the disk controller • 10K rpm or 15K rpm magnetic drives • Solid-state disks • DECram disks (shadowed with non-volatile disks)

  16. System disk throughput • Multiple system disk spindles • Host-based volume shadowing allows up to 3 copies of each system disk • Controller-based mirroring allows up to 6 spindles in each mirrorset • Controller-based striping or RAID-5 allows up to 14 spindles in a storageset • And you can layer these

  17. System disk throughput • Multiple separate system disks for groups of nodes • Use “cloning” technique to replicate system disks and avoid doing “n” upgrades for “n” system disks • Consider throttling satellite boot activity to limit demand

  18. System disk “Cloning” technique • Create “Master” system disk with roots for all nodes. Use Backup to create Clone system disks. • Before an upgrade, save any important system-specific info from Clone system disks into the corresponding roots on Master system disk • Basically anything that’s in SYS$SPECIFIC:[*] • Examples: ALPHAVMSSYS.PAR, MODPARAMS.DAT, AGEN$FEEDBACK.DAT • Perform upgrade on Master disk • Use Backup to copy Master to Clone disks again.

  19. LAN bandwidth, latency, and quality • Divide LAN into multiple segments • Connect systems with switches or bridges instead of contention-based hubs • Use full-duplex links when possible

  20. LAN bandwidth, latency, and quality • Use faster LAN technology at concentration points like backbones and at servers: • e.g. if using Fast Ethernet for satellites, consider using Gigabit Ethernet for server LAN adapters • Provide redundant LANs for servers, backbone

  21. LAN bandwidth, latency, and quality • Try to avoid saturation of any portion of LAN hardware • Bridge implementations must not drop small packets under heavy loads • SCS Hello packets are small packets • If two in a row get lost, a node without redundant LANs will see a Virtual Circuit closure; if failure lasts too long, node will do a CLUEXIT bugcheck

  22. LAN bandwidth, latency, and quality • Riding through temporary LAN problems while you troubleshoot: • Raise RECNXINTERVAL parameter • Default is 20 seconds • It’s a dynamic parameter

  23. LAN bandwidth, latency, and quality • Where redundant LAN hardware is in place, use the LAVC$FAILURE_ANALYSIS tool from SYS$EXAMPLES: • It monitors and reports, via OPCOM messages, LAN component failures and repairs • Described in Appendix D of the OpenVMS Cluster Systems Manual • Workshop 1257: Network Monitoring for LAVCs • Tuesday 1:00 pm, Room 208A • Thursday 8:00 am, Room 208A

  24. LAN bandwidth, latency, and quality • VOTES: • Most configurations with satellite nodes give votes to disk/boot servers and set VOTES=0 on all satellite nodes • If the sole LAN adapter on a disk/boot server fails, and it has a vote, ALL satellites will CLUEXIT! • Advice: give at least as many votes to node(s) on the LAN as any single server has, or configure redundant LAN adapters

  25. LAN redundancy and Votes [Diagram: five nodes with votes 0, 0, 0, 1, 1]

  26. LAN redundancy and Votes [Diagram: five nodes with votes 0, 0, 0, 1, 1]

  27. LAN redundancy and Votes [Diagram: nodes with votes 0, 0, 0, 1, 1, divided into Subset A and Subset B] Which subset of nodes does VMS select as the optimal subcluster?

  28. LAN redundancy and Votes [Diagram: nodes with votes 0, 0, 0, 1, 1] One possible solution: redundant LAN adapters on servers

  29. LAN redundancy and Votes [Diagram: nodes with votes 1, 1, 1, 2, 2] Another possible solution: enough votes on LAN to outweigh any single server node
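
The vote arithmetic behind slides 24 to 29 can be sketched in Python. This is only an illustrative model, not the connection manager's actual algorithm: it assumes three satellites and two servers (the node names SAT1..SAT3, SRVA, SRVB are made up), that the two servers can still reach each other over a second interconnect after SRVB loses its only LAN adapter, and that the surviving subcluster is the fully connected subset holding the most votes, with ties broken by node count. The quorum formula, (EXPECTED_VOTES + 2) / 2 truncated, is the standard OpenVMS one.

```python
from itertools import combinations

def quorum(expected_votes):
    # Standard OpenVMS formula: (EXPECTED_VOTES + 2) / 2, truncated.
    return (expected_votes + 2) // 2

def best_subcluster(votes, links):
    """Return (nodes, total_votes) for the fully connected subset with most votes.

    votes: dict of node name -> votes.
    links: set of frozenset({a, b}) pairs that can still communicate.
    Ties are broken by node count (an assumption of this sketch).
    """
    nodes = sorted(votes)
    best, best_key = (), (-1, -1)
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            # Every pair in the subset must still have a working path.
            if all(frozenset(pair) in links for pair in combinations(subset, 2)):
                key = (sum(votes[n] for n in subset), len(subset))
                if key > best_key:
                    best, best_key = subset, key
    return set(best), best_key[0]

def scenario(satellite_votes, server_votes):
    """Build the hypothetical cluster with SRVB's sole LAN adapter failed."""
    satellites = ("SAT1", "SAT2", "SAT3")
    votes = {name: satellite_votes for name in satellites}
    votes.update({"SRVA": server_votes, "SRVB": server_votes})
    links = {frozenset(pair) for pair in combinations(votes, 2)}
    # SRVB can no longer reach the satellites; SRVA<->SRVB survives
    # over the assumed second interconnect.
    links -= {frozenset({"SRVB", sat}) for sat in satellites}
    return votes, links

for sat_votes, srv_votes in ((0, 1), (1, 2)):
    votes, links = scenario(sat_votes, srv_votes)
    survivors, total = best_subcluster(votes, links)
    print(sorted(survivors), total, "quorum:", quorum(sum(votes.values())))
```

Under these assumptions, with votes 0, 0, 0, 1, 1 the two servers outvote the group containing the satellites, so the satellites are the ones removed; with votes 1, 1, 1, 2, 2 the satellites plus the healthy server outvote the server pair.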

  30. Boot and disk server horsepower • MSCP-serving is done in interrupt state on Primary CPU • Interrupts from LAN Adapters come in on CPU 0 (Primary CPU) • Multiprocessor system may have no more MSCP-serving capacity than a uniprocessor • Fast_Path on CI may help

  31. Large Node-Count Cluster System Management • Console management software is very helpful for reboots & troubleshooting • If that’s not available, consider using console firmware’s MOP Trigger Boot function to trigger boots in small waves after a total shutdown • Alternatively, satellites can be shut down with auto-reboot and then MOP boot service can be disabled, either on an entire boot server, or for individual satellite nodes, to control rebooting
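
The idea of triggering boots "in small waves" can be sketched as a simple batching loop. This is a hypothetical illustration in Python: `trigger` stands in for whatever mechanism actually starts a satellite (console management software, a MOP trigger boot, or re-enabling boot service), and the wave size and pause are site-specific guesses, not recommended values.

```python
import time

def boot_in_waves(nodes, wave_size, pause_seconds, trigger, sleep=time.sleep):
    """Call trigger(node) for each node, wave_size nodes at a time.

    Pauses pause_seconds between waves so that boot and disk servers
    are not hit by every satellite at once. Returns the list of waves.
    """
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    waves = [nodes[i:i + wave_size] for i in range(0, len(nodes), wave_size)]
    for index, wave in enumerate(waves):
        for node in wave:
            trigger(node)  # hypothetical hook: start one satellite booting
        if index < len(waves) - 1:
            sleep(pause_seconds)  # no pause needed after the last wave
    return waves
```

A fixed pause is the simplest throttle; a more careful version might wait until the nodes of one wave have joined the cluster before starting the next.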

  32. Hot system files • Standard multiple-spindle techniques also apply here: • Disk striping (RAID-0) • Volume Shadowing (host-based RAID-1) • Mirroring (controller-based RAID-1) • RAID-5 array (host- or controller-based) • Consider solid-state disk for hot system files, such as SYSUAF, queue file, etc.

  33. High-Horsepower Clusters • Why build a large cluster, in terms of CPU and/or I/O capacity? • Handle high demand for same application(s) • Pool resources to handle several applications with lower overall costs and system management workload than separate clusters

  34. High-Horsepower Clusters • Risks: • “All eggs in one basket” • Hard to schedule downtime • Too many applications, with potentially different availability requirements • System tuning and performance • Which application do you optimize for? • Applications may have performance interactions

  35. High-Horsepower Clusters • Plan, configure, and monitor to avoid bottlenecks (saturation of any resource) in all areas: • CPU • Memory • I/O • Locking

  36. High-Horsepower Clusters • Generally easiest to scale CPU by first adding CPUs within SMP boxes • within limits of VMS or applications’ SMP scalability • Next step is adding more systems • But more systems implies less local locking • Local locking code path length and latency are much lower than remote (order-of-magnitude)

  37. High-Horsepower Clusters • Memory scaling is typically easy with 64-bit Alphas: Buy more memory • May require adding nodes eventually

  38. High-Horsepower Clusters • I/O scalability is generally achieved by using: • Multiple I/O adapters per system • More disk controllers, faster controllers, more controller cache • More disks; faster disks (solid-state) • Disk striping, mirroring, shadowing

  39. High-Horsepower Clusters • Challenges in I/O scalability: • CPU 0 interrupt-state saturation • Interconnect load balancing

  40. CPU 0 interrupt-state saturation • VMS receives interrupts on CPU 0 (Primary CPU) • If interrupt workload exceeds capacity of primary CPU, odd symptoms can result • CLUEXIT bugchecks, performance anomalies • VMS has no internal feedback mechanism to divert excess interrupt load • e.g. node may take on more trees to lock-master than it can later handle • Use MONITOR MODES/CPU=0/ALL to track CPU 0 interrupt state usage and peaks

  41. CPU 0 interrupt-state saturation • FAST_PATH capability can move some of interrupt activity to non-primary CPUs • Lock mastership workload can be heavy contributor to CPU 0 interrupt state • May have to control or limit this workload

  42. Interconnect load balancing • SCS picks path in fixed priority order: • Galaxy Shared Memory Cluster Interconnect (SMCI) • Memory Channel • CI • DSSI • LANs, based on: • Maximum packet size, and • Lowest latency • PORT_CLASS_SETUP tool available from CSC to allow you to change order of priority if needed • e.g. to prefer Gigabit Ethernet over DSSI
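
The fixed-priority selection described on slide 42 can be modelled in a few lines of Python. This is a sketch only: the interconnect labels are plain strings chosen for illustration, the LAN tie-break order (largest packet size first, then lowest latency) is an assumption about how the two criteria combine, and passing a reordered list merely imitates the effect the slide attributes to PORT_CLASS_SETUP.

```python
# Priority order as listed on the slide, highest first.
DEFAULT_PRIORITY = ["SMCI", "Memory Channel", "CI", "DSSI", "LAN"]

def pick_interconnect(available, priority=DEFAULT_PRIORITY):
    """Return the highest-priority interconnect class present in `available`."""
    for kind in priority:
        if kind in available:
            return kind
    return None

def pick_lan_path(paths):
    """Choose among LAN paths given as (name, max_packet_bytes, latency_us).

    Assumed rule: prefer the largest maximum packet size, then the
    lowest latency.
    """
    return max(paths, key=lambda p: (p[1], -p[2]))[0]

# Default order prefers DSSI over any LAN.
print(pick_interconnect({"DSSI", "LAN"}))
# A reordered list models preferring Gigabit Ethernet (a LAN) over DSSI.
print(pick_interconnect({"DSSI", "LAN"},
                        ["SMCI", "Memory Channel", "CI", "LAN", "DSSI"]))
```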

  43. Interconnect load balancing • CI Port Load Sharing code didn’t get ported from VAX to Alpha • MOVE_REMOTENODE_CONNECTIONS tool available from CSC to allow you to statically balance VMS$VAXcluster SYSAP connections across multiple CI adapters

  44. High-Horsepower Clusters • Locking performance scaling is generally done by: • Improving CPU speed (to avoid CPU 0 interrupt-state saturation) • Improving cluster interconnect performance (lower latency, higher bandwidth, and minimizing host CPU overhead) • Spreading locking workload across multiple systems

  45. High-Horsepower Clusters • Locking performance scaling • Check SHOW CLUSTER/CONTINUOUS with ADD CONNECTIONS, ADD REM_PROC and ADD CR_WAITS to check for SCS credit waits. If counts are present and increasing over time, increase the SCS credits at the remote end as follows:

  46. High-Horsepower Clusters • Locking performance scaling • For credit waits on VMS$VAXcluster SYSAP connections: • Increase CLUSTER_CREDITS parameter • Default is 10; maximum is 127

  47. High-Horsepower Clusters • Locking performance scaling • For credit waits on VMS$DISK_CL_DRVR / MSCP$DISK connections: • For VMS server node, increase MSCP_CREDITS parameter. Default is 8; maximum is 128. • For HSJ/HSD controller, lower MAXIMUM_HOSTS from default of 16 to actual number of VMS systems on the CI/DSSI interconnect
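
The "present and increasing over time" check for credit waits can be expressed as a small comparison of two or more counter snapshots. The Python below is a sketch under stated assumptions: the snapshots are entered by hand from SHOW CLUSTER output (nothing here reads the display), the (node, SYSAP) key layout is invented for the example, and the parameter mapping simply restates slides 46 and 47 for connections served by a VMS node.

```python
# Parameter to raise at the remote VMS node, per slides 46 and 47.
PARAMETER_FOR = {
    "VMS$VAXcluster": "CLUSTER_CREDITS",
    "VMS$DISK_CL_DRVR": "MSCP_CREDITS",
}

def rising_credit_waits(samples):
    """Return connections whose credit-wait count is nonzero and has grown.

    samples: list of dicts, oldest first, each mapping a
    (remote_node, local_sysap) tuple to its cumulative CR_WAITS count.
    """
    first, last = samples[0], samples[-1]
    return sorted(
        conn for conn, count in last.items()
        if count > 0 and count > first.get(conn, 0)
    )

def suggestion(conn):
    """Name the parameter to increase on the remote node, if known."""
    return PARAMETER_FOR.get(conn[1])
```

For example, two snapshots taken an hour apart in which only the ("NODEB", "VMS$VAXcluster") count grows would flag that connection and point at CLUSTER_CREDITS on NODEB. For an HSJ/HSD controller the remedy on slide 47 is different (lowering MAXIMUM_HOSTS), which this sketch does not model.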

  48. Multi-Site Clusters • Consist of multiple “Lobes” with one or more systems, in different locations • Systems in each “Lobe” are all part of the same VMS Cluster and can share resources • Sites typically connected by bridges (or bridge-routers; pure routers don’t pass SCS traffic)

  49. Multi-Site Clusters • Sites linked by: • DS-3/T3 (E3 in Europe) or ATM Telco circuits • Microwave link: DS-3/T3 or Ethernet • “Dark fiber” where available: • FDDI: 40 km with single-mode fiber; 2 km multi-mode fiber • Ethernet over fiber (10 Mb, Fast, Gigabit) • Fiber links between Memory Channel switches; up to 3 km • Dense Wave Division Multiplexing (DWDM), then ATM

  50. Multi-Site Clusters • Inter-site link minimum standards are in OpenVMS Cluster Software SPD: • 10 megabits per second minimum data rate • “Minimize” packet latency • Low SCS packet retransmit rate: • Less than 0.1% retransmitted. Implies: • Low packet-loss rate for bridges • Low bit-error rate for links
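
The 0.1% retransmit criterion is easy to check once packet counters are in hand. The Python below is a minimal sketch: the two counter values are assumed to have been read from whatever tool reports SCS packets sent and retransmitted on the inter-site path, and the function names are invented for the example.

```python
def retransmit_percent(sent, retransmitted):
    """Percentage of sent packets that had to be retransmitted."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * retransmitted / sent

def meets_limit(sent, retransmitted, limit_percent=0.1):
    """True if the retransmit rate is below the limit (0.1% per the slide)."""
    return retransmit_percent(sent, retransmitted) < limit_percent

# 500 retransmits in 1,000,000 packets is 0.05%, within the limit;
# 2,000 in 1,000,000 is 0.2%, which is not.
print(meets_limit(1_000_000, 500), meets_limit(1_000_000, 2_000))
```

Comparing counter deltas over a fixed interval, rather than totals since boot, gives a better picture of the link's current behavior.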
