RAS - Reliability, Availability, Serviceability

lavada

Presentation Transcript


  1. RAS - Reliability, Availability, Serviceability
     • Product Support Engineering
     • VMware Confidential

  2. Module 2 Lessons
     • Lesson 1 – vCenter Server High Availability
     • Lesson 2 – vCenter Server Distributed Resource Scheduler
     • Lesson 3 – Fault Tolerance Virtual Machines
     • Lesson 4 – Enhanced vMotion Compatibility
     • Lesson 5 – DPM - IPMI
     • Lesson 6 – vApps
     • Lesson 7 – Host Profiles
     • Lesson 8 – Reliability, Availability, Serviceability (RAS)
     • Lesson 9 – Web Access
     • Lesson 10 – vCenter Server Update Manager
     • Lesson 11 – Guided Consolidation
     • Lesson 12 – Health Status
     VI4 - Mod 2-8 - Slide

  3. Module 2-8 Lessons
     • Lesson 1 – Overview of RAS
     • Lesson 2 – RAS objectives
     • Lesson 3 – Networking vProbs
     • Lesson 4 – Storage vProbs
     • Lesson 5 – VMFS vProbs
     • Lesson 6 – Migration vProb

  4. Introduction
     • The long-term goal of the ESX RAS project is to make ESX more Reliable, Available and Serviceable.
     • To do so, the VMkernel needs to detect, report, recover from, diagnose and repair or react to hardware and software problems which occur in the system.
     • ESX RAS 1.0 will focus on detecting asynchronous hardware and synchronous software observations and reporting them.

  5. RAS Objectives
     • The ESX RAS team's objective is to increase the reliability, availability and serviceability of the VMkernel. This includes:
       • Hardening VMkernel drivers (hardware errors): CPU, Memory, PCI(-X/Express), SCSI, Networking.
       • Hardening VMkernel facilities (software errors): SCSI, Networking, VMotion, DMotion, etc.
       • Developing a standardized method of reporting observations from software and hardware error handlers.
       • Developing a method to diagnose a given stream of observations down to one or more problems which may have caused them.
       • Developing a method for determining predictive failure of a given (sub-)system and feeding the analysis to consumers (DRS, DPM, FT, HA).
       • Gathering and writing service actions which correspond to the problem or set of problems which are possibly present.
       • Developing automated policies for certain problems which may be taken care of without user action.
       • Maintaining and improving the logging, coredump, and PSOD infrastructure in the VMkernel.

  6. RAS Terms
     • RAS: Reliability, Availability, Serviceability.
     • Reliability: The ability of a system to perform and maintain its functions in the face of hostile or unexpected circumstances.
     • Availability: The proportion of time a system is in a functioning condition.
     • Serviceability: The ability to debug or perform root cause analysis in pursuit of solving a problem with a product.
     • Hardening: Enhancing a (sub-)system so that it can detect, report and handle errors which may be encountered, whether hardware or software related. Handling may involve panicking and/or attempting recovery from a given error or stream of errors.
     • VProb: A VProb is an automatically generated problem report.
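The Availability definition above (the proportion of time a system is functioning) can be made concrete with a small calculation. This is a minimal sketch; the function name and the sample figures are hypothetical and do not come from the slides.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return the fraction of observed time the system was functioning."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical host: 8.76 hours of downtime across an 8760-hour year.
print(f"{availability(8751.24, 8.76):.3%}")
```

With these made-up figures the host is available about 99.9% of the time, often described as "three nines".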

  7. RAS Categories
     • The framework defines the following use cases for vSphere 4.0:
     • Each use case links to a KB article that describes where the error happened (e.g. the affected vmnic#, port group, vSwitch or storage path) and provides troubleshooting tips to fix the issue.
     • Networking
       • vprob.net.connectivity.lost
       • vprob.net.redundancy.lost
       • vprob.net.redundancy.degraded
       • vprob.net.e1000.tso6.notsupported
     • Storage
       • vprob.storage.connectivity.lost
       • vprob.storage.redundancy.lost
       • vprob.storage.redundancy.degraded

  8. RAS Categories
     • VMFS specific:
       • vprob.vmfs.nfs.server.disconnect
       • vprob.vmfs.nfs.server.restored
       • vprob.vmfs.heartbeat.timedout
       • vprob.vmfs.heartbeat.recovered
       • vprob.vmfs.heartbeat.unrecoverable
       • vprob.vmfs.lock.corruptiondisk
       • vprob.vmfs.resource.corruptiondisk
       • vprob.vmfs.volume.locked
     • Migration specific:
       • vprob.net.migrate.vmknic
     • The public KBs will be available at GA time.
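The identifiers on the two RAS Categories slides follow a dotted naming scheme, so a support script could group them by prefix. The sketch below is illustrative only: the function name and the "Unknown" fallback are assumptions, and the category labels simply mirror the slide headings.

```python
def categorize_vprob(vprob_id: str) -> str:
    """Map a vProb identifier to the slide category it is listed under."""
    # More specific prefixes come first: vprob.net.migrate.* is listed under
    # "Migration specific", not "Networking".
    prefixes = [
        ("vprob.net.migrate.", "Migration"),
        ("vprob.net.", "Networking"),
        ("vprob.storage.", "Storage"),
        ("vprob.vmfs.", "VMFS"),
    ]
    for prefix, category in prefixes:
        if vprob_id.startswith(prefix):
            return category
    return "Unknown"


for vprob_id in ("vprob.net.redundancy.lost", "vprob.net.migrate.vmknic"):
    print(vprob_id, "->", categorize_vprob(vprob_id))
```

Checking the most specific prefix first matters here because the migration vProb shares the `vprob.net.` prefix with the networking ones.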

  9. Networking VProb
     • vprob.net.connectivity.lost http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6122&communityID=2701
     • Connectivity to a physical network has been lost. All affected port groups are listed in the message.
     • Example message: Lost network connectivity on virtual switch "system". Physical NIC vmnic1 is down. Affected port groups: "cos", "VM Network".

  10. Networking VProb
      • vprob.net.redundancy.lost http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6097&communityID=2701
      • Only one physical NIC is currently connected. One more failure will result in a loss of connectivity.
      • Example message: Lost uplink redundancy on virtual switch "system". Physical NIC vmnic0 is down. Affected port groups: "cos", "VM Network".

  11. Networking VProb
      • vprob.net.redundancy.degraded http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6098&communityID=2701
      • One of the physical NICs in your NIC team has gone down. You still have n-1 NICs available.
      • Example message: Uplink redundancy degraded on virtual switch "vSwitch0". Physical NIC vmnic1 is down. 2 uplinks still up. Affected portgroups: "VM Network".
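The three connectivity and redundancy messages quoted on the preceding slides share a common shape, so their fields can be pulled out with one regular expression. This is a rough sketch written only against those three sample messages; the function name and the returned keys are hypothetical, and real message text may vary between builds.

```python
import re

# Matches the tail shared by the sample messages. Note that the samples spell
# both "port groups" and "portgroups", so the space is optional.
NET_VPROB_RE = re.compile(
    r'on virtual switch "(?P<vswitch>[^"]+)"\. '
    r'Physical NIC (?P<nic>\S+) is down\. '
    r'(?:(?P<uplinks>\d+) uplinks still up\. )?'
    r'Affected port ?groups: (?P<groups>.+?)\.?$'
)


def parse_net_vprob(message: str):
    """Return the fields of a networking vProb message, or None if no match."""
    match = NET_VPROB_RE.search(message)
    if match is None:
        return None
    uplinks = match.group("uplinks")
    return {
        "vswitch": match.group("vswitch"),
        "nic": match.group("nic"),
        "uplinks_up": int(uplinks) if uplinks else None,
        "portgroups": re.findall(r'"([^"]+)"', match.group("groups")),
    }


sample = ('Uplink redundancy degraded on virtual switch "vSwitch0". '
          'Physical NIC vmnic1 is down. 2 uplinks still up. '
          'Affected portgroups: "VM Network".')
print(parse_net_vprob(sample))
```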

  12. Networking VProb
      • vprob.net.e1000.tso6.notsupported (KB article) http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-7393
      • The guest e1000 driver is misbehaving and sending TSO IPv6 packets, which will be dropped. The vprob specifies the affected VM, and the KB article discusses ways to fix this.
      • Example message: "Guest-initiated IPv6 TCP Segmentation Offload (TSO) packets ignored. Manually disable TSO inside the guest operating system in virtual machine "XYZ", or use a different virtual adapter."

  13. Storage VProb
      • vprob.storage.connectivity.lost http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6099&communityID=2701
      • The connectivity to a specific device has been lost.
      • Example message: "Lost connectivity to storage device naa.60a9800043346534645a433967325334. Path vmhba35:C1:T0:L7 is down"

  14. Storage VProb
      • vprob.storage.redundancy.lost http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6120&communityID=2701
      • Only one path to a device remains, so you no longer have any redundancy.
      • Example message: "Lost path redundancy to storage device naa.60a9800043346534645a433967325334. Path vmhba35:C1:T0:L7 is down."

  15. Storage VProb
      • vprob.storage.redundancy.degraded http://communities.vmware.com/viewwebdoc.jspa?documentID=DOC-6099&communityID=2701
      • One of your paths to a device has been lost, but you still have n-1 paths remaining.
      • Example message: "Path redundancy to storage device naa.60a9800043346534645a433967325334 degraded. Path vmhba35:C1:T0:L7 is down. 3 remaining active paths."
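Each storage message above names a device (naa.*) and a failed path such as vmhba35:C1:T0:L7. The sketch below extracts both and splits the path into its parts, assuming the usual ESX runtime-name reading of adapter, channel, target and LUN. The function and type names are hypothetical, and the patterns are written only against the three sample messages.

```python
import re
from typing import NamedTuple, Optional, Tuple


class RuntimePath(NamedTuple):
    adapter: str   # e.g. "vmhba35"
    channel: int   # the C field
    target: int    # the T field
    lun: int       # the L field


PATH_RE = re.compile(r"(vmhba\d+):C(\d+):T(\d+):L(\d+)")
DEVICE_RE = re.compile(r"storage device (naa\.[0-9a-f]+)")


def parse_storage_vprob(message: str) -> Optional[Tuple[str, RuntimePath]]:
    """Return (device id, failed path) from a storage vProb message."""
    device = DEVICE_RE.search(message)
    path = PATH_RE.search(message)
    if device is None or path is None:
        return None
    adapter, channel, target, lun = path.groups()
    return device.group(1), RuntimePath(adapter, int(channel), int(target), int(lun))


sample = ("Lost path redundancy to storage device "
          "naa.60a9800043346534645a433967325334. Path vmhba35:C1:T0:L7 is down.")
print(parse_storage_vprob(sample))
```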

  16. VMFS vProb
      • vprob.vmfs.nfs.server.disconnect
      • vprob.vmfs.nfs.server.restored http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.volume.locked.htm
      • Example message: Lost connection to server nfs-server mount point /share, mounted as 1264e433-5854ee53-0000-000000000000 ("nfs-share")

  17. VMFS vProb
      • vprob.vmfs.heartbeat.timedout http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.heartbeat.combined.htm
      • VMFS Volume Connectivity Degraded 496befed-1c79c817-6beb-001ec9b60619 san-lun-100

  18. VMFS vProb
      • vprob.vmfs.heartbeat.recovered http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.heartbeat.combined.htm
      • VMFS Volume Connectivity Restored 496befed-1c79c817-6beb-001ec9b60619 san-lun-100

  19. VMFS vProb
      • vprob.vmfs.heartbeat.unrecoverable http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.heartbeat.combined.htm
      • VMFS Volume Connectivity Lost 496befed-1c79c817-6beb-001ec9b60619 san-lun-100
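The three heartbeat vProbs above report a volume as degraded, restored or lost. One way a monitoring script might track the latest state per volume is sketched below. The state labels are taken from the slide messages; the function name and the event tuple format are assumptions made for illustration.

```python
# State labels mirror the wording of the three heartbeat slide messages.
HEARTBEAT_STATE = {
    "vprob.vmfs.heartbeat.timedout": "degraded",
    "vprob.vmfs.heartbeat.recovered": "restored",
    "vprob.vmfs.heartbeat.unrecoverable": "lost",
}


def track_volume_states(events):
    """Return the most recent heartbeat state for each volume.

    `events` is an iterable of (vprob_id, volume_uuid) pairs in arrival order;
    identifiers that are not heartbeat vProbs are ignored.
    """
    states = {}
    for vprob_id, volume in events:
        state = HEARTBEAT_STATE.get(vprob_id)
        if state is not None:
            states[volume] = state
    return states


volume = "496befed-1c79c817-6beb-001ec9b60619"
events = [
    ("vprob.vmfs.heartbeat.timedout", volume),
    ("vprob.vmfs.heartbeat.recovered", volume),
]
print(track_volume_states(events))
```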

  20. VMFS vProb
      • vprob.vmfs.lock.corruptiondisk
      • vprob.vmfs.resource.corruptiondisk http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.corruptioncombined.htm
      • Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Corrupt lock detected at offset O
      • Volume 4976b16c-bd394790-6fd8-00215aaf0626 (san-lun-100) may be damaged on disk. Resource cluster metadata corruption detected

  21. VMFS vProb
      • vprob.vmfs.volume.locked http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.vmfs.volume.locked.htm
      • Volume on device naa.60060160b3c018009bd1e02f725fdd11:1 locked, possibly because remote host 10.17.211.73 encountered an error during a volume operation and couldn’t recover.

  22. Migration Specific
      • vprob.net.migrate.vmknic http://pseweb.vmware.com/twiki/bin/viewfile/Main/VmkernelRas?rev=1;filename=vprob.net.migrate.vmkernel.htm
      • The ESX advanced config option /Migrate/Vmknic is set to an invalid vmknic: vmk0. /Migrate/Vmknic specifies a vmknic that VMotion binds to for improved performance. Please update the config option with a valid vmknic or, if you don't want VMotion to bind to a specific vmknic, remove the invalid vmknic and leave the option blank.
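The rule in the message above (the option must name a valid vmknic or be left blank) can be sketched as a simple check. How ESX itself decides that a vmknic is invalid is not stated on the slide; this sketch assumes it means the name is not among the host's configured vmknics. The function name and its inputs are hypothetical.

```python
from typing import Iterable, Optional


def check_migrate_vmknic(option_value: str,
                         configured_vmknics: Iterable[str]) -> Optional[str]:
    """Return a warning string if /Migrate/Vmknic looks invalid, else None."""
    value = option_value.strip()
    if value == "":
        # Blank is allowed: VMotion is then not bound to a specific vmknic.
        return None
    if value in set(configured_vmknics):
        return None
    return f"/Migrate/Vmknic is set to an invalid vmknic: {value}"


# Hypothetical host whose only configured vmknic is vmk1.
print(check_migrate_vmknic("vmk0", ["vmk1"]))
```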

  23. Lesson 2-8 Summary
      • Understand what vProbs are
      • Learn how to troubleshoot vProbs

  24. Lesson 2-8 – Optional Lab 1
      • OPTIONAL
      • Lab 1 involves generating vProb scenarios
