
Environmentals


Presentation Transcript


  1. Environmentals

  2. Environmentals • Environmentals • Diagnostics • Service actions • RAS

  3. Flowchart of hardware failure and actions (diagram): normal operation → application failure → diagnostics → specify the failed parts → CE call → service action → parts replacement → back to normal operation. Running diagnostics and identifying the failed parts are the Blue Gene administrators' work; the service action and parts replacement are the IBM CEs' work.

  4. Diagnostics

  5. What is “Diagnostics” for Blue Gene/P? • A set of hardware diagnostic test programs. • Test programs to locate hardware failures. • Diagnostics included for: • Memory subsystem • Compute logic • Instruction units • Floating point units • Torus and collective network • Internal and external chip communication • Global interrupts

  6. Running diagnostics • Launched from the Navigator or from a shell. • Diagnostics test cases are designed to test all aspects of the hardware. • Test cases are grouped into test buckets, based either on run time or on a specific type of hardware. • 4 test buckets for the Navigator (small, medium, large, complete) • 12 test buckets for the command line (small, medium, large, complete, servicecard, nodecard, linkcard, memory, ionode, multinode, power, gi*) * global interrupts • Diagnostics run time can be managed by choosing an appropriate test bucket. Details for each test case can be found in the Redbook “Blue Gene/P System Administration”, System Diagnostics section.

  7. Diagnostics from the Navigator • In the Blue Gene Navigator the Diagnostics link consists of three tabs: • Summary tab • Default page for diagnostics • Summary of the diagnostics results • Shows the status of current diag runs • “Configure New Diagnostics Run” button to launch diags • Locations tab • Provides a view of all the hardware that has had diags run on it • Can be filtered by “Filter Options”: Location, Hardware Status, Executed time, Hardware replacement status • Completed Runs tab • History of all completed diag runs • Can be filtered by “Filter Options”: Diagnostics Status, Run ended time, Target midplanes, racks, blocks (screenshot: Diagnostics home page)

  8. Submitting a diagnostics run via the Navigator • Click the “Configure New Diagnostics Run” button. 1. Select the midplanes to test. Tests are run separately on midplane blocks; several blocks are run simultaneously, depending on the size of the service node and the number of Ethernet I/O channels involved (i.e. the number of rows). 2. Select either a predefined test bucket or individually select tests to run. 3. Select the Run Options: Pset ratio override – use this option to specify a custom pset ratio for the run; useful if, for instance, not all I/O nodes are cabled. Stop on first error encountered – by default the diagnostics do not stop when a failure is detected; this option makes the diagnostics stop once the first failure is found. Save all output from diagnostics – by default diagnostics do not keep logs for a successful test; this option prevents the harness from deleting those logs. 4. Click the “Start Diagnostics Run” button to start.

  9. Viewing results via Navigator • Click on the Completed Runs tab. • Each run is displayed on a line. • The Log directory hyperlink opens the main diags.log file. • The End time hyperlink goes to the summary for that run.

  10. Viewing results via Navigator cont. • Each line represents the results for the given midplane location. • The log directory link goes to the main diags.log file. It is the same link as in the run summary page. • The location link goes to a summary for that midplane location. • There is a button to automatically collect the diagnostics logs into an archive for sending to IBM Support.

  11. Viewing results via Navigator cont. • Each line represents results for a particular test on this midplane location. • Result is the worst status of all the hardware tested in this midplane. This can be Passed, Marginal, Failed, or Unknown. • Run result is the status of the run. This can be Uninitialized, Running, Canceled, or Completed. • Each line has a set of counts for hardware status as determined by the test. • The Summary of failed tests hyperlink lists all test case failures at once. • Details are listed by location and include a brief analysis of why the result was determined. • A link to the log for the test is included if applicable. • A link to the RAS events that occurred between the test’s start and end times for the given location is also included. Most diagnostics are RAS-based, so this list should give good insight into the specifics of the failure. Keep in mind that the harness can deduce a hardware failure without RAS, so RAS events may not always be available.

  12. Running diagnostics via the command line • The diagnostics script to run is /bgsys/drivers/ppcfloor/bin/rundiags.py. • The --help option can be used to display the various command line options.

clappi@dd2sys1fen1:/bgsys/drivers/ppcfloor/bin> ./rundiags.py --help
Blue Gene Diagnostics Version 1.6 Build 3 running on Linux 2.6.16.21-0.8-ppc64 ppc64, dd2sys1fen1:9.5.45.45
Blue Gene Diagnostics initializing...
Usage: rundiags.py --midplanes midplane_list [OPTIONS ... ]
       rundiags.py --racks rack_list [OPTIONS ... ]
       rundiags.py --midplanes midplane_list --racks rack_list [OPTIONS ... ]
       rundiags.py --blocks block_list [OPTIONS ... ]
--midplanes x,y,z... - a list of midplanes on which to run diagnostics (eg. R00-M0,R00-M1,R01-M0)
--racks x,y,z... - a list of racks on which to run diagnostics. (eg. R00,R01,R10)
--blocks x,y,z... - a list of blocks on which to run diagnostics.
Note: the list of midplanes, racks, or blocks must not contain any whitespace and are comma separated. Either --midplanes, --racks, or --blocks must be specified.
Note: the --midplanes and --racks can both be specified together in the same command, but not with the --blocks. The --blocks switch must be specified in the absence of the other two.
OPTIONS (default values are displayed inside the [])
--tests x,y,z - run only the specified tests. Additive with --buckets. [Run all tests]
--buckets x,y,z - run only the tests in the specified buckets. Additive with --tests. [Run all tests]
--stoponerror - stop on the first error encountered. [false]
--sn x - name of the service node. [localhost]
--csport x - control system port. [32031]
--csuser x - control system user. [current user ==> clappi]
--mcserverport x - mcServer port. [1206]
--dbport x - DB2 port. [50001]
--dbname x - database name. [bgdb0]
--dbschema x - database schema. [bgpsysdb]
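For example (the midplane names here are illustrative), a run of only the memory and nodecard buckets on two midplanes could be launched from the service node like this:

./rundiags.py --midplanes R00-M0,R00-M1 --buckets memory,nodecard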

  13. Diagnostics output • The diagnostics will save all output into the diagnostics log directory for the run. • Each run gets a log directory under /bgsys/logs/BGP/diags whose name is based on the run ID. This is where the main diags.log file is stored, e.g. /bgsys/logs/BGP/diags/071002_133909_28920. • There is also a runscript_xxxxx.ksh file stored in the log directory. If this script is run, it will re-run this particular diagnostics run, including all options. • Each test per block gets a subdirectory for its logs. The name of the subdirectory is based on the block ID, test case name, and test start time, for example bpclbist__DIAGS_R11-M1_144119362.
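As an illustration only (this is not part of the shipped tooling), a short Python sketch such as the following could locate the most recent run directory and its per-test subdirectories, assuming the layout described above and that the run directory names sort chronologically:

import os

DIAG_ROOT = "/bgsys/logs/BGP/diags"   # diagnostics log root described above

# Run directories such as 071002_133909_28920 start with a date/time stamp,
# so a lexical sort is assumed to give chronological order.
runs = sorted(name for name in os.listdir(DIAG_ROOT)
              if os.path.isdir(os.path.join(DIAG_ROOT, name)))
if runs:
    latest = os.path.join(DIAG_ROOT, runs[-1])
    print("latest run directory:", latest)
    print("main log:", os.path.join(latest, "diags.log"))
    for sub in sorted(os.listdir(latest)):
        if os.path.isdir(os.path.join(latest, sub)):
            print("  per-test log directory:", sub)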

  14. Diags.log • The diags.log file contains all harness output, including test result summaries, harness error dumps, and exception traces. • Example test summary:

Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] svccardenv summary:
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] =================================================
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Run ID: 709281438572281
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Block ID: _DIAGS_R11-M1
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Start time: Sep 28 14:38:59
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] End time: Sep 28 14:39:20
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Testcase: svccardenv
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Passed: 0
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Marginal: 1
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] R11-M1-N01 (SN 42R7201YL10K708607R): Power module R11-M1-N01-P3 continuously failed to share current for the 1.8V domain while the total 1.8V current draw for the node card was above 30.0A. It has failed to share current properly at least 2 times during this diagnostic.
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Failed: 1
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] R11-M1-S (SN 42R7504YL10K711001P): R11-M1-S-P5 indicated driver faults. Driver fault byte is 0xFF. Faults: 0x01 -> forced fault, 0x08 -> over current, 0x10 -> over voltage, 0x40 -> incompatible V_SEL, 0x80 -> general fault).
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] 1 x (INFO:DIAGS:DIAG_COMMON:DIAG_8000) Diagnostic test Service_Card_Environmental passed.
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Unknown: 0
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Hardware status: failed
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] Internal test failure: false
Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] =================================================

• Each line in the log has a timestamp and severity level, as with all other BG/P logs. After these, the square brackets contain the name of the thread logging the message and the harness verbosity level for the message.
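A minimal parsing sketch (illustrative only, not shipped code) that splits a diags.log line into the fields described in the last bullet — timestamp, severity, thread name, harness verbosity level, and message — might look like the following; the exact field layout is an assumption based on the sample lines above:

import re

# Assumed line format: "<Mon DD HH:MM:SS> (<severity>) [<thread>, <verbosity>] <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3} {1,2}\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"\((?P<severity>\w)\) "
    r"\[(?P<thread>[^,\]]+), (?P<verbosity>\d+)\] "
    r"(?P<message>.*)$")

def parse_diag_line(line):
    """Return a dict of the fields above, or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None

sample = "Sep 28 14:39:22 (I) [mp_R11-M1_ta_svccardenv, 0] svccardenv summary:"
print(parse_diag_line(sample))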

  15. Service Actions

  16. What is a Service Action? • There will be occasions when you need to cycle power or replace hardware on your Blue Gene/P system. Any time service is to be performed on the Blue Gene/P hardware, the hardware needs to be prepared by generating a Service Action. • Service Actions can be generated from: • the command line • the Blue Gene Navigator

  17. Command line Service Action

  18. Command line Service Action cont. • Checks are done for conflicting service actions. • A rack service action does not allow you to power cycle the rack.
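The command-line screenshots for these two slides are not reproduced in this transcript. As a hypothetical example, following the same <command> <location> PREPARE/END pattern shown for ServiceClockCard on slide 46, preparing and then ending service on a single node card might look like this:

$ ServiceNodeCard R00-M0-N04 PREPARE
  ... perform the service (for example, replace the node card) ...
$ ServiceNodeCard R00-M0-N04 END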

  19. TBGPServiceAction Table (1 of 2) • SERVICEACTION field contains the current state of the service action. • STATUS field shows the current status of the service action. • INFOSERVICEACTION field provides status information related to the service action. Displayed from Navigator. • LOGFILENAMEPREPAREFORSERVICE field contains the fully qualified path name to the log file used when preparing the hardware for service. • LOGFILENAMEENDSERVICEACTION field contains the fully qualified path name to the log file used when ending the service action.

  20. TBGPServiceAction Table (2 of 2) SERVICEACTION field values: INITIALIZED, OPEN, PREPARE, END, CLOSED. STATUS field values: “I” – Initialized, “O” – Open, “P” – Prepared for service, “A” – Actively processing, “E” – Action ended in error, “C” – Closed, “F” – Forced closed, “S” – Service (hw only).
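When reading TBGPServiceAction directly, a small lookup table like the following (illustrative only; the codes are exactly those listed above) can translate the single-character STATUS values:

# Illustrative lookup for the TBGPServiceAction STATUS codes listed above.
STATUS_CODES = {
    "I": "Initialized",
    "O": "Open",
    "P": "Prepared for service",
    "A": "Actively processing",
    "E": "Action ended in error",
    "C": "Closed",
    "F": "Forced closed",
    "S": "Service (hw only)",
}

def describe_status(code):
    """Return the human-readable meaning of a STATUS column value."""
    return STATUS_CODES.get(code, "unknown status code: " + repr(code))

print(describe_status("P"))   # Prepared for service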

  21. Service Action Logs • Naming format syntax: <service action>-<location>-<timestamp>.log Example: ServiceNodeCard-R00-M0-N00-2007-10-03-13:32:25.log • Log files are stored in the /bgsys/logs/BGP directory. • Log files can be stored in a different directory by using the –log <path> optional parameter on the service action command. • Log file names are stored in the database entry for the service action.
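Illustrative only (not shipped tooling): the naming format above can be split back into its parts with a small parser; the regular expression is an assumption based on the example file name:

import re

# <service action>-<location>-<YYYY-MM-DD-HH:MM:SS>.log
LOG_NAME_RE = re.compile(
    r"^(?P<action>[A-Za-z]+)-(?P<location>.+)-"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2})\.log$")

def parse_service_log_name(name):
    """Split a service action log file name into action, location, and timestamp."""
    match = LOG_NAME_RE.match(name)
    return match.groupdict() if match else None

print(parse_service_log_name("ServiceNodeCard-R00-M0-N00-2007-10-03-13:32:25.log"))
# {'action': 'ServiceNodeCard', 'location': 'R00-M0-N00', 'timestamp': '2007-10-03-13:32:25'}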

  22. Prepare For Service Flow (1 of 3) • Check for conflicting service actions in progress. • Ensure that there are no open service actions in progress. • End any jobs and free any blocks that use the hardware being serviced. • Open service action. A new database entry is created and the Service Action ID is assigned. • Set the Service Action entry to PREPARE state and ACTIVE status. • Send RAS event indicating that the service action has been started

  23. Prepare For Service Flow (2 of 3) • Open a target set for the hardware to be serviced. • Set the hardware status to SERVICE in the database. • Using the appropriate ReadCardInfo API, obtain the current information for the hardware, primarily VPD. • Perform RAS analysis for hardware that has IBM supplied VPD. • Update the card’s VPD if RAS data is available. • Update database VPD and status.

  24. Prepare For Service Flow (3 of 3) • Prepare the hardware for service. • Power off the hardware (if required) • Set “intervention required” in LEDs • Send RAS event indicating that the hardware is ready to be serviced. • Set the Service Action entry to PREPARE state and PREPARED status. • Perform the required service action. • If a failure occurs: • The Service Action entry is set to PREPARE state and ERROR status. • Send RAS event indicating that the service action failed.
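Condensed purely as an illustration, the Prepare For Service flow on the three slides above records these TBGPServiceAction state/status transitions (the status letters are the codes from slide 20; this is a summary, not the actual control-system code):

# Illustrative summary of the TBGPServiceAction transitions described above.
PREPARE_TRANSITIONS = [
    ("service action opened, ID assigned",      ("PREPARE", "A")),  # ACTIVE
    ("hardware prepared, ready to be serviced", ("PREPARE", "P")),  # PREPARED
    ("failure at any point during preparation", ("PREPARE", "E")),  # ERROR
]

for event, (state, status) in PREPARE_TRANSITIONS:
    print("%-44s -> state=%s, status=%s" % (event, state, status))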

  25. End Service Action Flow (1 of 4) • Ensure that the service action is PREPARED. • Set the service action to END state and ACTIVE status in the database. • Open a target set for the hardware being serviced. • Make the hardware functional. • Initialize the hardware using the appropriate InitializeCards API. • Set the status for the serviced hardware to MISSING in the database.

  26. End Service Action Flow (2 of 4) • Update the database with information for the serviced hardware. • Using the appropriate ReadCardInfo API, obtain the current information for the hardware. • Validate the hardware information: VPD, ECIDs (Node and LinkChips), Node memory size, memory module size, and voltage. • Update database entry based on the validation results. • Set to ERROR status if invalid data received • Set to ACTIVE status if hardware was not replaced. • Set to SERVICE status if the hardware was replaced.

  27. End Service Action Flow (3 of 4) • Verify replaced hardware is functional • Run select diagnostic tests on replaced hardware (database status is SERVICE). • Service cards, link cards, node cards, and nodes • Update hardware status based on diagnostic results • Success: Set hardware status to ACTIVE and send RAS event indicating that the hardware is functional. • Failed: Set hardware status to ERROR
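As an illustration of the decision logic on the last two slides (a sketch only; the real control system works from the database and the diagnostics harness), the resulting hardware status could be expressed as:

# Illustrative only: post-service hardware status per the rules above.
def post_service_status(data_valid, was_replaced, diags_passed=None):
    """data_valid: ReadCardInfo data (VPD, ECIDs, memory, voltage) validated OK.
    was_replaced: the hardware was physically replaced during the service action.
    diags_passed: result of the verification diagnostics on replaced hardware
                  (None means the diagnostics have not run yet)."""
    if not data_valid:
        return "ERROR"      # invalid data received
    if not was_replaced:
        return "ACTIVE"     # hardware was not replaced
    if diags_passed is None:
        return "SERVICE"    # replaced hardware awaiting verification diagnostics
    return "ACTIVE" if diags_passed else "ERROR"

print(post_service_status(data_valid=True, was_replaced=True, diags_passed=True))   # ACTIVE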

  28. End Service Action Flow (4 of 4) • Validate the database configuration. • Check serviced hardware for a status of ERROR. • Fail service action if any hardware is found in ERROR status. • Send RAS event indicating that the service action is closed. • Close the service action. • If a failure occurs: • The Service Action entry is set to END state and ERROR status. • Send RAS event indicating that the service action failed.

  29. Close Service Action Flow • A service action can be forced closed if it has one of the following state and status combinations: • OPEN state, OPEN status • PREPARE state, ERROR status • PREPARE state, PREPARED status • END state, ERROR status • A service action with an ACTIVE status will be set to ERROR status if a failure occurs. The state is unchanged. • A RAS event is sent indicating that the service action was forced closed.
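Expressed as a small check (illustrative only; the status letters are the codes from slide 20), the force-close rule above is:

# State/status combinations listed above that allow a forced close.
FORCE_CLOSABLE = {
    ("OPEN",    "O"),   # OPEN state, OPEN status
    ("PREPARE", "E"),   # PREPARE state, ERROR status
    ("PREPARE", "P"),   # PREPARE state, PREPARED status
    ("END",     "E"),   # END state, ERROR status
}

def can_force_close(state, status):
    """Return True if a service action in (state, status) may be forced closed."""
    return (state, status) in FORCE_CLOSABLE

print(can_force_close("PREPARE", "P"))   # True
print(can_force_close("END", "A"))       # False: an ACTIVE action cannot be forced closed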

  30. Service Action Component Overview (diagram): the ServiceAction component and its per-hardware components — ServiceClockCard, ServiceRack, ServiceMidplane, ServiceNodeCard, ServiceLinkCard, ServiceBulkPowerModule, and ServiceFanModule.

  31. ServiceBulkPowerModule • There can not be multiple Bulk Power Module (BPM) service actions active within a rack. • A service action can not be started for a BPM if there is more than 1 failed BPM in the rack (a rack service action is required) unless the BPM to be serviced is one of the failed BPMs. • A service action can not be started for a BPM if there is an open rack or clock card service action that contains the BPM. • Software will attempt to turn off the BPM being serviced. If power can not be turned off, the service action will continue without error.

  32. ServiceFanModule • There can not be multiple fan module (FM) service actions active within a midplane. • A service action can not be started for a FM if there is more than 1 failed FM in the midplane (a midplane service action is required) unless the FM to be serviced is one of the failed FMs. • A service action can not be started for a FM if there is an open midplane service action that contains the FM. • Software does not power off the FM; instead, it attempts to flash the intervention LED. • The replacement of a FM does not have any effect on jobs using the midplane (BGL difference).

  33. ServiceNodeCard (1 of 2) • Multiple individual node card service actions (Rxx-My-Nzz) are allowed within midplane Rxx-My. • The Rxx-My-N option (used by ServiceNodeCard, a midplane, or a rack service action) allows any or all node cards within midplane Rxx-My to be serviced. • A new service action for Rxx-My-Nzz will conflict if there is an existing open service action for Rxx-My-Nzz, Rxx-My-N, Rxx-My, Rxx, or Rxx-K.
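A minimal sketch of that conflict rule (illustrative only, not the real conflict-checking code):

# Illustrative only: locations whose open service actions would conflict with a
# new node card service action, per the rule above.
def node_card_conflict_locations(location):
    """location is a node card location such as 'R23-M1-N05'."""
    rack, midplane, _nodecard = location.split("-")
    return {
        location,                      # the same node card:    Rxx-My-Nzz
        "%s-%s-N" % (rack, midplane),  # all-node-cards option: Rxx-My-N
        "%s-%s" % (rack, midplane),    # the midplane:          Rxx-My
        rack,                          # the rack:              Rxx
        rack + "-K",                   # the rack's clock card: Rxx-K
    }

print(sorted(node_card_conflict_locations("R23-M1-N05")))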

  34. ServiceNodeCard (2 of 2) • RAS analysis is done for all service actions that include NodeCards. • RAS analysis for Nodes is only done for a node card service action. • Diagnostic testing will be done to ensure that replaced NodeCards and Nodes are functional. Nodes will be set to ERROR status if the ECID value read from the card does not match the VPD value. • Nodes with differing memory size, memory module sizes, or voltages will be set to ERROR status. • A NodeCard containing Nodes with ERROR status will be set to ERROR status.

  35. ServiceLinkCard (1 of 2) • Multiple individual link card service actions (Rxx-My-Lz) are allowed within a midplane. • The Rxx-My-L option (used by a midplane or a rack service action) allows any or all link cards within midplane Rxx-My to be serviced. • A new service action for Rxx-My-Lz will conflict if there is an existing open service action for Rxx-My-Lz, Rxx-My, Rxx, or Rxx-K. • It is safe to service a LinkCard without having to perform a service action on the other link cards in the same row or column (BGL difference).

  36. ServiceLinkCard (2 of 2) • The midplane containing the link card being serviced is set to SERVICE status to prevent the job scheduler from running. • RAS analysis is done for all service actions that include a LinkCard. • Diagnostic testing will be done to ensure that the replaced LinkCard is functional. This includes cable verification. • LinkChips will be set to ERROR status if the ECID value read from the link chip does not match the VPD value. • A LinkCard containing LinkChips with ERROR status will be set to ERROR status.

  37. ServiceMidplane (1 of 2) • A midplane service action allows any hardware, with the exception of a bulk power module, associated with the midplane to be serviced. • NOTE: ServiceCards may be serviced (BGL difference). • A service action can not be started for Rxx-My if there are any open service actions within midplane Rxx-My or there is an open rack or clock card service action for Rxx, or Rxx-K. • A new service action for a node card, link card, fan module, rack, or clock card will conflict with an open midplane service action if that hardware is associated with the midplane being serviced.

  38. ServiceMidplane (2 of 2) • RAS analysis is done for the midplane, service card, all link cards, and all node cards. It is not done for the nodes within a node card. • Diagnostic testing will be done to ensure that replaced service cards, link cards, node cards, and nodes are functional. • It is safe to replace a service card using a midplane service action since all link cards and node cards are powered off. • Software does not power off the fan modules; instead, it attempts to flash the intervention LEDs. • The midplane service action uses the ServiceNodeCard Rxx-My-N and ServiceLinkCard Rxx-My-L support to service the node and link cards within the midplane.

  39. ServiceRack (1 of 2) • A rack service action allows all service actions which do not require tools to be performed. • The clock card can only be replaced by a ServiceClockCard service action. • At least one BPM must be plugged in at all times during a rack service action so that 5V persistent power is maintained to the clock card. • The BPMs are powered off via software (the breaker is not flipped). This allows the clock card to remain functional. • To end the service action, the CE must reseat one of the BPMs to provide enough power to the master service card. The remaining BPMs are powered back on via software. • The rack service action uses the ServiceMidplane Rxx-My support to service the service cards, node cards, link cards, and fan modules within those midplanes.

  40. ServiceRack (2 of 2) • A service action can not be started for a rack if there are open service actions within a rack. • A new request to service hardware within the rack will conflict with an open rack service action. • RAS analysis is done for the midplanes, service cards, link cards, and node cards in the rack. It is not done for the nodes within a node card. • Diagnostic testing will be done to ensure replaced service cards, link cards, node cards, and nodes are functional. • REMINDER: Do not use ServiceRack to power cycle a rack.

  41. ServiceClockCard (1 of 3) • ServiceClockCard is used to: • Prepare the rack so that the bulk power breaker can be manually turned off. • Service the specified clock card. • Once the breaker has been turned off, any component within the rack can be serviced. • Jobs are stopped and blocks are freed in all midplanes that are downstream from this rack in the clock tree. • The clock tree is defined in the TBGPClockCables table. • Downstream midplanes are set to SERVICE status to prevent the job scheduler from running. • The bulk power breaker must be turned on to repower the rack.

  42. ServiceClockCard (2 of 3) • Service action conflict rules are the same as for ServiceRack, PLUS: • A clock card service action can not be started if there are open service actions in any of the midplanes that are downstream from this rack in the clock tree. • A new service action can not be started if there is an open clock card service action upstream from the hardware to be serviced in the clock tree.

  43. ServiceClockCard (3 of 3) • RAS analysis is done for the clock card, midplanes, service cards, link cards, and node cards in the rack. It is not done for the nodes within a node card. • Diagnostic testing will be done to ensure replaced service cards, link cards, node cards, and nodes are functional. • The clock card, if replaced, is tested to verify that it is functional. If the clock card was not replaced, it is assumed to be functional. • Set the status of all midplanes that are downstream from this rack in the clock tree to ACTIVE status.

  44. Service Action from the Blue Gene Navigator (screenshots: the Service Actions home page and the Filter Options for the Service Actions history).

  45. Starting and ending a Service Action from the Navigator • Select the target hardware type. • Select the target hardware location. • If there are no jobs affected by the service action, click “Finish” to start the service action. The new service action will appear in the “Service Actions History”. • Click the “End Service Actions” button to end the target Service Action.

  46. Cycling power Prior to shutting off power on a rack, you need to properly prepare the rack. Preparing a rack to be powered down requires that you run ServiceClockCard; this can be done from either the Navigator or the command line. The command line syntax to start a service action on a clock card is:

$ ServiceClockCard Rxx-K PREPARE

The ServiceClockCard process terminates any jobs that are running on that specific rack and any jobs running on racks that contain downstream clocks (any clocks that rely on the clock being shut down for a clock signal). Once the ServiceClockCard command completes, you will receive a message indicating that the rack is ready for service. At this point you can power down the rack. To bring the rack back online, move the switch to the On position. Allow the rack sufficient time to power up, usually less than a minute, before ending the Service Action; a good indicator is to watch for the lights on the service cards to start blinking. To end the Service Action from the command line, use the following command:

$ ServiceClockCard Rxx-K END

When powering up a multiple-rack system, be sure to power up the rack with the master clock first, followed by any racks with secondary clocks, then racks that contain tertiary clocks, and finally all other racks. Follow the same scheme when ending the Service Actions. The bulk power breaker must be turned on to repower the rack.
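As a concrete single-rack example of the sequence above (the rack name R00 is illustrative):

$ ServiceClockCard R00-K PREPARE
  ... wait for the message that the rack is ready for service,
      then switch the bulk power breaker off and perform the work ...
  ... switch the breaker back on and allow the rack to power up
      (watch for the service card lights to start blinking) ...
$ ServiceClockCard R00-K END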

  47. RAS

  48. BG/P RAS Requirements • Improve the programming of RAS messages • RAS events should include a message id and brief description of the problem. • Users should be able to access additional details about an event including a detailed description about what happened along with the recommended actions to take, if any. • The diagnostics should utilize RAS events to report problems to simplify problem reporting and reduce the number of logs that users have to care about. • Minimize the time it takes to display a set of RAS events. • Application developers should be able to log events.

  49. Event Logs / Error codes (diagram): the control-system components involved — Compute and I/O Nodes (running CNK, Linux, and on-core diagnostics), off-core diagnostics, the Machine Controller with its Card Controllers and BareMetal layer, MC Server, and MMCS.

  50. Event Logs / Error codes (diagram): how RAS events flow to the control system. RAS events can be generated in various components and processes: 1a. the kernel (Linux and CNK) creates a RAS event and uses the mailbox (mbox) to deliver it; bgcns, the compute node service, funnels RAS events to the control system; 1b. the Machine Controller (MC) generates RAS events for node, link, and service cards and the chips on those cards; 1c. off-core diagnostics generate RAS events to report test results. 2. The MC reads and interprets events. 3. The event is sent to the registered listener (MC Server). 4. The event is sent to registered listeners; 4a. MMCS processes the event; 5a. the formatted message is persisted. 6. From the Navigator, users query RAS events and run diagnostics.
