
Riding the Dot-Com Wave: A Case Study in Extreme VMScluster Scalability






Presentation Transcript


  1. Riding the Dot-Com Wave: A Case Study in Extreme VMScluster Scalability CETS2001 Session 1255 Wednesday, Sept. 12, 2:45 pm, 303B Keith Parris

  2. Topics • Scale of workload and configuration growth • Changes made to scale the cluster • Challenges to extreme scalability and high availability • Surprises along the way • Lessons learned

  3. Hardware Growth • 1981: 1 Alpha Microsystems PC • 1983: 1 VAX 11/750 • 1993: 1 MicroVAX 4100 • 1996: 4 VAX 7700s, 1 8400 • 1997: 6 VAX 7700s, 2 VAX 7800s, 2 8400s • 1999: 18 GS-140s • 2001: 2 clusters; one with 12 GS-140s, the other with 3 GS-140s and 2 GS-160s

  4. Workload Growth Rate • As measured in yearly peak transaction counts • 1996-1997: 2X • 1997-1998: 2X • 1998-1999: 2X • 1999-2000: 3X • We’ll focus on these years

  5. Scaling the Cluster: Memory • Went from 1 GB to 20 GB on systems

  6. Scaling the Cluster: CPU • Upgraded VAX 7700 nodes by adding CPUs • Upgraded key nodes from VAX 7700 to VAX 7800 CPUs • Ported application from VAX to Alpha • Went from 2-CPU 8400s to 6-CPU GS-140s, then added 12-CPU GS-160s • From 200 MHz EV4 chips to 731 MHz EV67

  7. Scaling the Cluster: I/O • Went from JBOD to RAID, and raised number of members in RAID arrays over time • Added RMS Global Buffers • 3600 RPM magnetic disks to 5400 RPM to 7200 RPM to 10K RPM • Put hot files onto large arrays of Solid-State Disks • Upgraded from CMD controllers to HSJ40s; added writeback cache; upgraded to HSJ50s and doubled # of controllers; upgraded to HSJ80s
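
  For reference, the RMS Global Buffers change mentioned above is made per file with the DCL SET FILE command; a minimal sketch, assuming hypothetical device and file names and an arbitrary buffer count:

      $ ! Share 500 global buffers cluster-wide for a hot indexed file,
      $ ! so readers hit cached buckets instead of re-reading the disk
      $ SET FILE /GLOBAL_BUFFERS=500 DISK$DATA:[APP]CUSTOMER.IDX
      $ ! Confirm the new attribute
      $ DIRECTORY /FULL DISK$DATA:[APP]CUSTOMER.IDX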

  8. Scaling the Cluster: I/O • Changed from shadowsets of controller-based stripesets to use host-based RAID 0+1 arrays of controller-based mirrorsets • Avoided any single controller being a bottleneck by spreading RAID array members across multiple controllers • Provides faster time-to-repair for shadowset member failures • Provided faster cross-site shadow copy times

  9. Shadowsets of Stripesets • Volume shadowing thinks it sees large disks • Shadow copies and merges occur sequentially across entire surface • Failure of 1 member implies full shadow copy of stripeset to fix • [Diagram: host-based shadowset layered over two controller-based stripesets]

  10. Host-based RAID 0+1 arrays • Individual disks are combined first into shadowsets • Host-based RAID software combines the shadowsets into a RAID 0 array • Shadowset members can be spread across multiple controllers • [Diagram: host-based RAID 0+1 array striped across three host-based shadowsets]

  11. Host-based RAID 0+1 arrays • Shadow copies and merges occur in parallel on all shadowsets at once • Failure of 1 member requires full shadow copy of only that member to fix • [Diagram: host-based RAID 0+1 array striped across three host-based shadowsets]
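
  A minimal sketch of the shadowset layer described above, in DCL, with hypothetical device names: each two-member shadowset pairs disks behind different controllers, and the resulting DSAn: virtual units are then striped together by the host-based RAID software (whose binding commands are product-specific and omitted here):

      $ ! Each shadowset mirrors across two controllers ($1$... vs. $2$...)
      $ MOUNT /SYSTEM DSA1: /SHADOW=($1$DUA101:,$2$DUA201:) DATA_M1
      $ MOUNT /SYSTEM DSA2: /SHADOW=($1$DUA102:,$2$DUA202:) DATA_M2
      $ MOUNT /SYSTEM DSA3: /SHADOW=($1$DUA103:,$2$DUA203:) DATA_M3
      $ ! The host-based RAID software then combines DSA1:, DSA2:, DSA3: into one RAID 0 array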

  12. Scaling the Cluster: I/O & Locking • Implemented Fast_Path on CI • Tried Memory Channel and failed • CPU 0 saturation in interrupt state occurred when lock traffic moved from CI (with Fast_Path) to MC (without Fast_Path) • Went from 2 CI star couplers to 6 • Distributed lock traffic across CIs

  13. Scaling the Cluster: • Implemented Disaster-Tolerant Cluster • Effectively doubled hardware: CPUs, I/O Subsystems, Memory • Significantly improved availability • But relatively long inter-site distance added inter-site latency as a new potential factor in performance issues

  14. Scaling the Cluster: Datacenter space • Multi-site clustering and Volume Shadowing provided the opportunity to move to larger datacenters, without downtime, on 3 separate occasions

  15. Challenges: • Downtime cost $Millions per event • Longer downtime meant even larger risk • Had to resist initial pressure to favor quick fixes over any diagnostic efforts that would lengthen downtime • e.g. writing crash dump files

  16. Challenges: • Network focus in application design rather than Cluster focus • Triggered by history of adding node after node, connected by DECnet, rather than forming a VMS Cluster early on • Important functions assigned to specific nodes • Failover and load balancing problematic • Systems had to boot/reboot in specific order

  17. Challenges: • Web interface design provided quick time-to-market using screen scraping, but had fragile 3-process chain with link to Unix

  18. Challenges: • Fragile 3-process chain with link to Unix • Failure of Unix program, TCP/IP link, or any of the 3 processes on VMS caused all 3 VMS processes to die, incurring: • Process run-down and cleanup • Process creation and image activations for 3 new processes to replace the 3 which just died • Slowing response times could cause time-outs and initiate “meltdowns”

  19. Challenges: • Interactive system capacity requirements in an industry with historically batch-processing mentality: • Can’t run CPUs to 100% with interactive users like you can with overnight batch jobs

  20. Challenges: • Adding Solid-State Disks • Hard to isolate hot blocks • Ended up moving entire hot RMS files to SSD array

  21. Challenges: • Application program techniques which worked fine under low workloads failed at higher workloads • Closing files for backups • ‘Temporary’ infinite loop

  22. Challenges: • Standardization on Cisco network hardware and SNMP monitoring • Even on GIGAswitch-based inter-site cluster link

  23. Challenges: • Constant pressure to port to Unix: • Sun proponents continually told management: • “We will be off the VAX in 6 months” • Adversely affected VMS investments at critical times • e.g. RZ28D disks, star couplers

  24. Surprises Along the Way: • As more Alpha nodes were added, lock tree remastering activity caused “pauses” of 10 to 50 seconds every 2-3 minutes • Controlled with PE1=100 during workday
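
  PE1 is a dynamic system parameter, so the workday throttle mentioned above can be applied (and reversed overnight) from SYSGEN without a reboot; a minimal sketch, using the value from the slide:

      $ ! Limit remastering to small lock trees during the workday
      $ RUN SYS$SYSTEM:SYSGEN
      USE ACTIVE
      SET PE1 100
      WRITE ACTIVE
      EXIT
      $ ! Overnight, repeat with SET PE1 0 to allow normal remastering again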

  25. Surprises Along the Way: • Shadowing patch c. 1997 changed algorithm for selecting disk for read operations, suddenly sending ½ of the read requests to the other site, 130 miles (4-5 milliseconds) farther away • Subsequent patch kit allowed control of behavior with SHADOW_SYS_DISK SYSGEN parameter

  26. Surprises Along the Way: • VMS may allow a lock master node to take on so much lock-mastership workload that CPU 0 ends up saturated in interrupt state • This caused CLUEXIT bugchecks and performance anomalies • With the help of VMS Source Listings and advice from VMS Engineering, wrote programs to spread lock mastership of the hot files across a set of several nodes, and held them there using PE1

  27. Surprises Along the Way: • CI Load Sharing code never got ported from VAX to Alpha • Nodes crashing and rebooting changed assignments of which star couplers were used for lock traffic between pairs of nodes • Caused unpredictable performance anomalies • CSC and VMS Engineering came to the rescue with a program called MOVE_REMOTENODE_CONNECTIONS to set up a (static) load-balanced configuration

  28. Surprises Along the Way: • As disks grew larger, default extent sizes and RMS bucket sizes grew as files were CONVERTed onto the larger disks using the default optimization script • Data transfer sizes gradually grew by a factor of 14X over 4 years • Solid-state disks don't benefit from increased RMS bucket sizes like magnetic disks do • Fixed by manually selecting RMS bucket sizes for hot files on solid-state disks
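
  A minimal sketch of the manual bucket-size fix described above, assuming hypothetical file and device names; the FDL file's AREA sections carry the BUCKET_SIZE values that CONVERT applies:

      $ ! Capture the file's current attributes into an FDL file
      $ ANALYZE /RMS_FILE /FDL=HOTFILE.FDL DISK$DATA:[APP]HOTFILE.IDX
      $ ! Edit HOTFILE.FDL so each AREA uses a small bucket size, e.g.:
      $ !     AREA 0
      $ !         BUCKET_SIZE    2
      $ ! Rebuild the file onto the solid-state disk with those attributes
      $ CONVERT /FDL=HOTFILE.FDL /STATISTICS DISK$DATA:[APP]HOTFILE.IDX SSD$DISK:[APP]HOTFILE.IDX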

  29. Challenges Left in VMScluster Scalability and High Availability • Can’t enlarge disks or RAID arrays on-line • Can’t re-pack RMS files on-line • Can’t de-fragment open files (with DFO) • Disks are getting lots bigger but not as much faster • I/Os per second per gigabyte is actually falling

  30. Lessons Learned: • To provide good system performance one must gain knowledge of application behavior

  31. Lessons Learned: • High-availability systems require: • Best possible people to run them • Best available vendor support: • Remedial • Engineering

  32. Lessons Learned: • Many problems can be avoided entirely (or at least deferred) by providing “reserve” computing capacity • Avoids saturation conditions • Avoids error paths and other seldom-exercised code paths • Provides headroom for peak loads, and to accommodate rapid workload growth when procurement efforts have long lead times

  33. Lessons Learned: • Staff size must grow with workload growth and cluster size, but with VMS clusters, not at as high a rate • Staff size went from 1 person to 8 people (plus vendor HW/SW support) with 24X workload growth

  34. Lessons Learned: • Visibility of system workload and system performance is key, to: • Spot surges in workload • Identify bottlenecks as each new one arises • Provide quick turn-around of performance info into changes and optimizations • Overnight, and even mid-day
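
  A minimal sketch of that kind of continuous visibility using the standard MONITOR utility (class list, interval, and file names are illustrative only):

      $ ! Record CPU modes, distributed-lock, and disk activity through the day
      $ MONITOR MODES, DLOCK, DISK /INTERVAL=60 /NODISPLAY -
              /RECORD=PERF$DATA:WORKDAY.DAT
      $ ! Play the recording back later to spot surges and new bottlenecks
      $ MONITOR MODES, DISK /INPUT=PERF$DATA:WORKDAY.DAT /SUMMARY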

  35. Lessons Learned: • With present technology, some scheduled downtime will be needed: • to optimize performance • to do hardware upgrades & maintenance • You’re going to have to have some downtime: do you want to schedule some or just deal with it when it happens on its own? • Scheduled downtime helps prevent or minimize unscheduled downtime

  36. Lessons Learned: • Despite the redundancy within a cluster, a VMS Cluster viewed as a whole can be a Single Point of Failure • Solution: Use multiple clusters, with the ability to shift customer data quickly between them if needed • Can hide scheduled downtime from users

  37. Lessons Learned: • It was impossible to optimize system performance by system tuning alone • Deep knowledge of application program behavior had to be gained, by: • Code examination • Constant discussions with development staff • Observing system behavior under load

  38. Lessons Learned: • Application code improvements are often sorely needed, but their effect on performance can be hard to predict; they may actually hurt things, or make dramatic order-of-magnitude improvements • They are also often found due to serendipity or sudden inspiration, so it’s also hard to plan them or to predict when they might occur

  39. Lessons Learned: • Effect of hardware upgrades is easier to predict: double the hardware will double the cost, and will generally provide close to double the performance • Order-of-magnitude improvements are harder to obtain, and more expensive

  40. Success Factors: • Excellent people • Best technology • Quick procurement, preferably proactive • Top-notch vendor support • Services (CSC, MSE) • VMS Engineering; Storage Engineering

  41. Speaker Contact Info Keith Parris E-mail: parris@encompasserve.org or keithparris@yahoo.com Web: http://www.geocities.com/keithparris/ and http://encompasserve.org/~kparris/ Integrity Computing, Inc. 2812 Preakness Way Colorado Springs, CO 80916-4375 (719) 392-6696
