The slides that follow (excepting this one) are meant for poster board display, to be arranged on a tri-fold poster as follows. rsense: pSeries Support Center Error Log Analysis. Dan Henderson, p and i Series Availability Lead. rsense: pSeries Support Center Error Log Decode/Analysis and Correlation.
1. pSeries Hardware Support Center Error Log Analysis and Coordination: 12 Years of rsense Daniel J. Henderson
p and i Series HW Availability Lead
4. Traditional HW Error Logging in IBM RS/6000 Systems HW Platform and device driver errors logged in OS error Log
Information Essential For repair logged in Customer/Servicer Readable Form
A Service Request Code Number for lookup in service publications
General Description of type of failure
FRU numbers telling what parts to replace for the failure
Detailed information, known as “sense data”, explaining the exact nature of the failure and the associated hardware state, logged in an ASCII hex format
Error Log Analysis programs in OS
Identified log entries and reported on SRC and FRU callouts sufficient to direct service repair, but
Very little decoding of Sense Data
Very little, if any, correlation of multiple errors in log to either
Modify the hardware action plan
or threshold recoverable errors
5. Sample AIX Log Entry
6. Support Center Error Log Analysis: dsense In a hardware support center, decoding of sense data originally proceeded manually to:
Determine pattern of errors across multiple systems to look for pervasive issues
Provide additional fault detection/isolation when original hardware action plan provided did not satisfy customer needs
dsense was created in the early ’90s to automate translating of hex bytes to give:
A human readable description of the pertinent data of each byte
A “Bottom-Line” analysis of each error log entry, creating a one-line description of each error
A Summarization of multiple errors using the one-line description to give a much more accurate picture of system behavior over multiple days and multiple log entries
dsense eventually shipped as part of AIX diagnostics to allow on the spot analysis of errors rather than waiting for data to be transmitted to a support center.
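The decode, bottom-line, and summarize steps described above can be sketched roughly as follows. This is a minimal illustration, not dsense itself: the byte layout in FIELD_MAP and the function names decode_sense, bottom_line, and summarize are all hypothetical, invented for this example.

```python
from collections import Counter

# Hypothetical byte layout, invented for illustration only.
# Real sense data formats are platform- and device-specific.
FIELD_MAP = {
    0: ("error class", {0x01: "recoverable", 0x02: "unrecoverable"}),
    1: ("component", {0x10: "memory", 0x20: "I/O bus"}),
}


def decode_sense(hex_string):
    """Translate ASCII hex sense data into a readable description per byte."""
    data = bytes.fromhex(hex_string)
    fields = {}
    for offset, (name, meanings) in FIELD_MAP.items():
        if offset < len(data):
            value = data[offset]
            fields[name] = meanings.get(value, f"unknown (0x{value:02X})")
    return fields


def bottom_line(hex_string):
    """Reduce one log entry to a one-line description."""
    fields = decode_sense(hex_string)
    return f"{fields.get('error class', '?')} {fields.get('component', '?')} error"


def summarize(entries):
    """Count identical one-line descriptions across many log entries."""
    return Counter(bottom_line(entry) for entry in entries)
```

Calling summarize on a list of hex strings returns a count per one-line description, which is the kind of multi-entry picture the slide describes.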
7. pSeries Environment Error Log Analysis Challenge: Rsense response In pSeries a single hardware platform can host multiple OS images and I/O virtualization.
On high-end systems a hardware management console consolidates error logs from multiple OS images to report basic error information for service, but not detailed support center information
Requirements for detailed support center analysis even greater than before
Rsense program created concurrent with pSeries to provide that level of support center analysis.
Functionality expanding with advances in partitioning and virtualization to provide
Cross OS and platform
Summarization of multiple logs
Correlation of log entries to modify parts replacement strategies
Thresholding of soft errors
Pervasive issue detection
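Thresholding of soft errors, one of the functions listed above, might look something like the sketch below. It assumes a simple sliding-window count per resource; the function name over_threshold, the default limit of 3, and the 24-hour window are assumptions made for illustration and are not taken from rsense.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def over_threshold(events, limit=3, window=timedelta(hours=24)):
    """Return resources with at least `limit` soft errors inside any window.

    `events` is an iterable of (timestamp, resource) tuples, assumed to be
    sorted by timestamp. The limit and window are illustrative defaults.
    """
    recent = defaultdict(deque)
    flagged = set()
    for timestamp, resource in events:
        queue = recent[resource]
        queue.append(timestamp)
        # Drop errors that have aged out of the window.
        while timestamp - queue[0] > window:
            queue.popleft()
        if len(queue) >= limit:
            flagged.add(resource)
    return flagged
```

A resource that logs a few recoverable errors spread over many days stays below the threshold, while the same number of errors within one window is flagged for attention.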
8. Same Log Filtered Through Rsense (Abbreviated, 1 of 3)
9. Same Log Filtered Through Rsense (Abbreviated, 2 of 3)
10. Same Log Filtered Through Rsense (Abbreviated, 3 of 3)
11. rsense internals
12. rsense Scripting Language
13. Sample rsense Customized Summary
14. Multiple Log Coordination: Application One Common hardware shared across two “systems” or operating system images.
Shared hardware unable to communicate error information directly
Any single OS instance unable to localize source of the fault
Error Log Coordination could determine if fault is with one node or the other, or the device in the middle
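The localization idea in this application can be illustrated with a deliberately simplified heuristic. The sketch assumes each log is a list of dictionaries with a "resource" key and that errors seen from both sides point at the shared device; the function name localize_fault and the record shape are hypothetical, and real coordination would also weigh timestamps and decoded sense data.

```python
def localize_fault(log_a, log_b, shared_device):
    """Guess where a fault lies from two OS images' logs.

    Simplified rule, assumed for illustration: if both images log errors
    against the shared device, suspect the device itself; if only one
    image does, suspect that image's side of the connection.
    """
    a_sees = any(entry["resource"] == shared_device for entry in log_a)
    b_sees = any(entry["resource"] == shared_device for entry in log_b)
    if a_sees and b_sees:
        return "shared device"
    if a_sees:
        return "node A"
    if b_sees:
        return "node B"
    return "no fault logged"
```

The point of the sketch is only that neither log alone can distinguish the three cases; the answer comes from reading both together.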
15. Multiple Log Coordination: Application Two Fault Encountered at IPL must be coordinated with previous run-time event
Graphically, previous example log for a system showed:
16. rsense in Product Engineering PFA and Data Mining In a support center, rsense scripts are easily written to mine called-home error log entries to investigate pervasive issues and to very quickly make ad-hoc studies.
Some advantages in using rsense over other scripting methods
In-depth analyses of sense data for many different platforms and devices have been written for pSeries.
Since rsense separates the decoding of the log format from the decoding of sense data, the same script can be written to analyze Linux, AIX, and service processor firmware logs
Built-in functions of rsense simplify the process of script writing
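The separation of log-format decoding from sense-data decoding can be pictured as in the sketch below. Both input layouts, the LogRecord type, and the rule that a first sense byte of 0x02 means "unrecoverable" are assumptions invented for this example; they do not reflect actual AIX, Linux, or firmware log formats, nor the rsense scripting language.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    """Common record that every format-specific parser produces."""
    timestamp: str
    resource: str
    sense: bytes


def parse_format_a(line):
    # Hypothetical "timestamp|resource|hex" layout.
    timestamp, resource, hex_sense = line.strip().split("|")
    return LogRecord(timestamp, resource, bytes.fromhex(hex_sense))


def parse_format_b(line):
    # Hypothetical whitespace-separated "key=value" layout.
    fields = dict(item.split("=", 1) for item in line.split())
    return LogRecord(fields["time"], fields["dev"], bytes.fromhex(fields["sense"]))


def is_unrecoverable(record):
    # Hypothetical rule: first sense byte 0x02 marks an unrecoverable error.
    return len(record.sense) > 0 and record.sense[0] == 0x02


def unrecoverable_resources(records):
    """Analysis step that works on records from any parser."""
    return sorted({r.resource for r in records if is_unrecoverable(r)})
```

Because the analysis function only sees LogRecord objects, adding a new log source means writing one more parser and leaving the analysis untouched.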
17. rsense Future rsense continues to be enhanced to meet support center needs
Possible future activities for consideration:
Parsing and decoding Service Action Event log of Hardware Management Console
Summer 2005
Support for Linux evlog as it becomes adopted for logging of pSeries
Support for analysis of system soft errors as these are called home in pSeries
Incorporation of rsense capabilities for decode of logs generated for xSeries product space