
IBM Director 5.10 Topic : Improved Hardware Alerting (xSeries)

IBM Director 5.10 Topic : Improved Hardware Alerting (xSeries). Presenter’s Name : Rajat Jain (rjain@us.ibm.com) and Title. Basic Overview. Provide alerts for problems that occurred during POST Provide FRU numbers in Alerts Send a CIM Director alert when an ASR takes place

mdevore

Presentation Transcript


  1. IBM Director 5.10 Topic: Improved Hardware Alerting (xSeries). Presenter’s Name: Rajat Jain (rjain@us.ibm.com) and Title

  2. Basic Overview
     • Provide alerts for problems that occurred during POST
     • Provide FRU numbers in alerts
     • Send a CIM Director alert when an ASR takes place
     • Differentiate between normal and recovery alerts

  3. Provide alerts for problems that occurred during POST
     • The original requirement was to notify that the system was running with disabled or non-functioning CPUs.
     • An additional enhancement was made to monitor the system’s physical memory as well.
     • Details are provided on the following slides.

  4. Report CPU problems that may have occurred during POST
     • Upon every system restart, the last known configuration for the number of CPUs is compared with the current one. (CPU speed is used as an index.)
     • Alerts are generated for a degraded configuration. Detail is provided for the number of missing/disabled CPUs.
     • Special cases:
       • If the configuration is enhanced (e.g. addition of CPUs), then a “normal” alert is generated the first time only.
       • On scalable systems, an alert is generated every time a partition is reconfigured, for example when CPUs are added to or removed from a partition.
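
The restart-time comparison on this slide can be sketched as below. This is only an illustration, not IBM Director code: the function names, the `state` dictionary standing in for persistent storage, and the reading of "CPU speed is used as an index" as "CPUs are counted per speed" are all assumptions.

```python
from collections import Counter

def compare_cpu_config(last_known, current):
    """Compare two lists of CPU speeds (MHz), one entry per working CPU.

    Hypothetical helper: returns ("degraded", n) when n CPUs are missing,
    ("normal", n) when n CPUs were added, or None when nothing changed.
    """
    before, after = Counter(last_known), Counter(current)
    missing = sum((before - after).values())
    added = sum((after - before).values())
    if missing:
        return ("degraded", missing)
    if added:
        return ("normal", added)
    return None

def check_on_restart(state, current):
    """Run once per restart; `state` is an assumed persistent dict."""
    last_known = state.get("cpus")
    # Saving the new baseline is what makes an alert fire "the first time only".
    state["cpus"] = list(current)
    if last_known is None:
        return None  # assumption: no alert when there is no earlier baseline
    return compare_cpu_config(last_known, current)
```

For example, a four-way system that restarts with one CPU disabled would yield `("degraded", 1)` on that restart and nothing on the next one, because the baseline has been updated.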

  5. Report physical memory problems that may have occurred during POST
     • Upon every system restart, the last known configuration for the physical memory size is compared with the current one.
     • Alerts are generated for a degraded configuration. Detail is provided for the reduced memory size and the FRU part # of the memory DIMM.
     • Special cases:
       • If the configuration is enhanced (e.g. addition of DIMMs), then a “normal” alert is generated the first time only.
       • If memory DIMMs are replaced or reconfigured to provide the same total size, then no alert is generated. For example, replacing two 256 MB DIMMs with one 512 MB DIMM would not generate an alert.
       • On scalable systems, an alert is generated every time a partition is reconfigured, for example when DIMMs are added to or removed from a partition.
       • Hot-swap memory is not supported.
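
Because only the total size is compared, a DIMM swap that preserves the total is invisible to this check. A minimal sketch under that reading follows; `compare_memory` and its return values are hypothetical names chosen for illustration.

```python
def compare_memory(last_total_mb, current_total_mb):
    """Compare total physical memory (MB) across a restart.

    Hypothetical helper: returns ("degraded", lost_mb), ("normal", gained_mb),
    or None when the total size is unchanged.
    """
    if current_total_mb < last_total_mb:
        return ("degraded", last_total_mb - current_total_mb)
    if current_total_mb > last_total_mb:
        return ("normal", current_total_mb - last_total_mb)
    return None

# Two 256 MB DIMMs replaced by one 512 MB DIMM: same total, so no alert.
same_size_swap = compare_memory(sum([256, 256]), sum([512]))
```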

  6. Screen Captures

  7. Provide FRU numbers in alerts
     • The FRU number shall be included in the alert text for the CIM events associated with the following components:
       • Power supply alerts for RSA and IPMI systems
       • DASD backplane alerts for RSA and IPMI systems
       • Memory alerts (configuration downgraded or PFA) for any system where the getfru utility returns a FRU part # for memory.
     • Agents like DSA/ESA can parse the event text for the delimiters %FRU:1234567%
     • Sample screen captures are provided on the following page.
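
A consuming agent could pull the FRU number out of the event text with a simple pattern match on the `%FRU:...%` delimiters. The sketch below is an assumption about how such parsing might look; the function name and the sample event text are made up, and only the delimiter format comes from the slide.

```python
import re

# Matches the %FRU:<part number>% delimiters described on the slide.
FRU_PATTERN = re.compile(r"%FRU:([^%]+)%")

def extract_fru_numbers(event_text):
    """Return every FRU part number embedded in an alert's text."""
    return FRU_PATTERN.findall(event_text)

# Hypothetical event text, for illustration only.
sample = "Power supply 1 has failed. %FRU:1234567%"
fru_numbers = extract_fru_numbers(sample)
```

Text without the delimiters simply produces an empty list, so the same call is safe on alerts for components that carry no FRU number.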

  8. Send a CIM Director alert when an ASR takes place
     • If an Automatic Server Restart occurs on an IPMI system with a BMC (e.g. x346, x236, x336, x366), then CIM alerts are generated upon the next restart of the server:
       • Warning - The last system restart was due to the automatic server restart hardware.
       • Normal/Recovery - The last system restart was not due to the automatic server restart hardware.
     • Recovery shall only be possible upon the next system reboot.

  9. Differentiate between normal and recovery alerts
     • The original requirement was mainly to eliminate redundant normal alerts upon Director agent reboots.
     • Additional enhancements have been made:
       • Inband hardware alerts of all severities (warning and critical too) are also considered.
       • An alert is not re-reported, regardless of severity, if it has already been reported and its severity state is unchanged.
       • The reported state is persistent across reboots.
       • All inband hardware alerts are included.
     • If a new hardware component is added, only failures are reported during the first scan of the component. Example: when a new power supply is added and it is normal, no alerts are generated.
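
The suppression rule on this slide amounts to remembering the last reported severity per component and storing it somewhere that survives a reboot. The following is a rough sketch of that idea only: the `AlertFilter` class, the JSON file used for persistence, and the severity strings are assumptions, not the agent's actual mechanism.

```python
import json
import os

class AlertFilter:
    """Suppress alerts whose severity matches the last reported state."""

    def __init__(self, path):
        self.path = path  # assumed on-disk store so state survives reboots
        self.states = {}
        if os.path.exists(path):
            with open(path) as f:
                self.states = json.load(f)

    def should_report(self, component, severity):
        previous = self.states.get(component)
        if previous == severity:
            return False  # already reported, severity unchanged
        self.states[component] = severity
        with open(self.path, "w") as f:
            json.dump(self.states, f)
        # First scan of a new component: only failures are reported.
        if previous is None and severity == "normal":
            return False
        return True
```

Creating a second `AlertFilter` on the same file stands in for an agent restart: a still-critical power supply stays quiet, while its return to normal is reported as a recovery.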

  10. List of systems / hardware types that are affected

  11. Debugging Tips & Common Pitfalls

  12. List of Known Issues

  13. Questions & Answers
