
Toolkit for Event Analysis and Logging Education



  1. Dec 2011 Toolkit for Event Analysis and Logging Education

  2. Contents • Overview • Locations • Commands • Alerts and Connectors • Debug • References

  3. Overview

  4. Overview • Common HPC Event Analysis Framework • Combined best aspects and lessons learned from BlueGene ELA and Federation ELA • Addressed new p7 IH requirements • Common Event Repository • First release: CNM, Service Focal Point (HMC), PNSD, LL, GPFS (coming soon) • Analysis of Events to create Alerts • Rules based engine • Flexible alert delivery. For example, RMC and e-mail • Real-time Analysis and Historic Analysis • Real-time to be pro-active and react immediately to events • Historical allows for deeper debug on-site and off-site • Robust framework to prevent loss of alerts and events • Handles event flooding • Checkpoint/Shutdown/Restart • Open Source (pyteal.sourceforge.net) • Using ODBC • Python, C/C++, and Perl

  5. Detect Monitors Observation RAS Strategy TEAL Analyze Generic Analysis Custom Analysis Centralize Database Event Adapters Alert Generic filters, listeners Custom Data collection Find Data Mining Rules Historical analysis queries Manual Analysis Shouldn't be manual? Query, e-mail, RMC Resolve As enabled Recommended Actions Auto-Recovery Custom Maintenance package escape Correct Refine Get Data Debug Analyze Behavior Release new rules Fix Framework • Grayed-out boxes are future possibilities

  6. CNM … Connector Connector TEAL Concepts Alert Alert Analyzer Event Analyzer Alert Filters semaphore Monitor Event Alert Listeners Event Log (table in xCat DB) Alert Log (table in xCat DB) teal.conf teal.conf teal.conf

  7. P7-IH Usage • Output is to an alert database • Monitored by the administrator and operators • Various methods of monitoring will be described • Commands are used to query the database • Primary users are the administrator and operator • Runs on the EMS • Commands are issued via the EMS command line • SSRs may run commands under engineering direction • Event database may be collected to work on new analysis algorithms, or bugs

  8. CNM SFP GPFS LL PNSD HMC(s) P7-IH Implementation Store Events Event Log (table in cluster DB) Systems Alert Log (table in cluster DB) Analyzed Events TEAL Network Events to SFP e-mail, RMC, query SFP to TEAL Customer Notify Admin, Operator

  9. Locations • Points to a specific event location • Can be physical, logical or a mixture of both • Is hierarchical in nature • Simple - one type of item per level • Complex - multiple types of items per level • Operations • Scoping • Validation • Casting (platform specific) • XML-based description • /opt/teal/data/ibm/teal/xml/percs_location.xml • Can use it to remind yourself of the location formats

  10. Location Code Examples • Simple • Hierarchy innate in description • Example: <node>-<program>-<pid> (comp01-firefox-1234, comp01-vncserver-4567) • Complex • Compact ID • Optional Instance Values • Example: H:FR008-CG03-SN000-DR0 • Diagram labels: FR CG SN DR HB LL OM HF LR LD RM

  11. P7-IH Locations • Application • A:c250mgrs20-pvt.ppd.pok.ibm.com##teal.py##28327 • Expect this from PNSD and GPFS – apps in general • Job • J:z25c4s9.ppd.pok.ibm.com.1.3 • Expect this from LoadLeveler • Hardware (aka logical hardware) • H:FR008-CG03-SN000-DR0-HB1-OM27-LR22 • Expect this from ISNM • pSeries (aka service/physical) • P:U9125.F2C.0286C66 • Expect this from SFP
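The four location formats above all share a `<type>:<code>` shape, which can be sketched in a few lines of Python. This is an illustration only, not TEAL's own location classes: `LOCATION_TYPES`, `Location`, and `parse_location` are hypothetical names, and splitting hardware codes on `-` is an assumption based on the examples shown on the slide.

```python
from typing import NamedTuple

# Single-letter location types taken from the slide; descriptions are paraphrased.
LOCATION_TYPES = {
    "A": "application",
    "J": "job",
    "H": "hardware",
    "P": "pSeries",
}


class Location(NamedTuple):
    loc_type: str  # one of the keys in LOCATION_TYPES
    code: str      # everything after the first colon
    parts: list    # hardware codes split on '-', otherwise [code]


def parse_location(text: str) -> Location:
    """Split '<type>:<code>' and, for hardware only, break the code into levels."""
    loc_type, sep, code = text.partition(":")
    if not sep or loc_type not in LOCATION_TYPES:
        raise ValueError(f"unrecognized location: {text!r}")
    # Assumption: only hardware codes are '-'-separated hierarchies.
    parts = code.split("-") if loc_type == "H" else [code]
    return Location(loc_type, code, parts)


hw = parse_location("H:FR008-CG03-SN000-DR0-HB1-OM27-LR22")
app = parse_location("A:c250mgrs20-pvt.ppd.pok.ibm.com##teal.py##28327")
print(hw.parts)
print(app.code.split("##"))
```

The authoritative format definitions remain in /opt/teal/data/ibm/teal/xml/percs_location.xml, as noted on the Locations slide.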

  12. Commands

  13. TEAL EMS Command Line (/opt/teal/bin)

  14. Managing Alerts • Closing Alerts • tllsalert • tlchalert --id 1543 --state close • Querying Alerts • tllsalert -q "creation_time>2010-12-30 creation_time<2011-02-01" • tllsalert -q "event_loc=P" -f text • tllsalert -q "event_loc=H:FR007-CG03-SN016-DR0-HB0 event_scope=hub" • tllsalert --with-assoc -f text • Removing Alerts • tlrmalert --older-than 2011-01-01-12:00:00 • Can only remove alerts that are: • closed • not a duplicate • Can take a long time • Output options: csv, json, text, "brief"

  15. Managing Events • Listing events • tllsevent • tllsevent -q "src_loc=H:FR007-CG03-SN016-DR0-HB0 src_scope=hub" • tllsevent -e • tllsevent -q "time_logged=2011-04" • Removing Events • tlrmevent --older-than 2011-01-01-12:00:00 • Only removes events not associated with: • an alert • a checkpoint

  16. Cleaning Out the DB • Close (by resolving) any active alerts (tlchalert) • Remove all closed alerts (tlrmalert --older-than) • Remove all events not associated with an alert (tlrmevent --older-than)

  17. Checkpoints • tllschkpt
      CnmEventAnalyzer     R  35301
      PNSDEventAnalyzer    R  None
      LLEventAnalyzer      S  None
      SFPEventAnalyzer     R  None
      monitor_event_queue  R  35301
      MAX_event_rec_id        35301
      • tllschkpt -f text shows additional data • monitor_event_queue is last recovery type and start rec_id • GEAR based analyzers contain pool checkpoint information • Slide callouts: State when analyzer last checkpointed; Last event processed by the monitor; Maximum rec_id in event log

  18. Historic Analysis - Reanalyzing • User can set up a query for the criteria of interest • Filters and listeners in the configuration file for historic mode or all modes are executed • Choice of committing or not committing (default) the generated alerts • To capture all alerts produced, a file or print listener that does not specify any filters should be used • Time occurred or time logged can be used for analysis • Example: teal --historic --query="src_comp=CNM time_occurred>2011-02-01-10:00:00"

  19. TEAL historic and tlls* Options • rec_id (=,<,>,<=,>=) Can be a single value or a comma-separated list of ids • event_id (=) Can be a single value or a comma-separated list of ids • time_occurred (=,<,>,<=,>=) A single value in the format yyyy-mm-dd hh:mm:ss • time_logged (=,<,>,<=,>=) A single value in the format yyyy-mm-dd hh:mm:ss • src_comp (=) Can be a single value or a comma-separated list of values • src_loc_type:src_loc (=) The location is optional; if it is omitted, all events with the same location type are included • src_scope (=) Level to scope all source locations to. This is only valid if the reporting location type is specified • rpt_comp (=) Can be a single value or a comma-separated list of values • rpt_loc_type:rpt_loc (=) The location is optional; if it is omitted, all events with the same location type are included • rpt_scope (=) Level to scope all reporting locations to. This is only valid if the reporting
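The operator rules in this option list can be checked before a query string is handed to `teal --historic` or a `tlls*` command. The sketch below is a hypothetical helper (`build_query` and `ALLOWED_OPS` are not part of TEAL). It assumes that terms are joined by single spaces, as in the slide examples, and that `src_loc` and `rpt_loc` are written as `field=value`, as in the `tllsevent` example. It covers only the event fields listed here, not alert fields such as `alert_id` or `event_loc`.

```python
# Operators each field accepts, transcribed from the option list above.
COMPARISON = ("=", "<", ">", "<=", ">=")
EQUALS_ONLY = ("=",)
ALLOWED_OPS = {
    "rec_id": COMPARISON,
    "event_id": EQUALS_ONLY,
    "time_occurred": COMPARISON,
    "time_logged": COMPARISON,
    "src_comp": EQUALS_ONLY,
    "src_loc": EQUALS_ONLY,
    "src_scope": EQUALS_ONLY,
    "rpt_comp": EQUALS_ONLY,
    "rpt_loc": EQUALS_ONLY,
    "rpt_scope": EQUALS_ONLY,
}


def build_query(conditions):
    """Join (field, op, value) triples into one space-separated query string."""
    terms = []
    for field, op, value in conditions:
        if field not in ALLOWED_OPS:
            raise ValueError(f"unknown field: {field}")
        if op not in ALLOWED_OPS[field]:
            raise ValueError(f"{field} does not support {op!r}")
        terms.append(f"{field}{op}{value}")
    return " ".join(terms)


query = build_query([
    ("src_comp", "=", "CNM"),
    ("time_occurred", ">", "2011-02-01-10:00:00"),
])
print(query)  # src_comp=CNM time_occurred>2011-02-01-10:00:00
```

The timestamp above uses the hyphen-joined form from the historic-analysis example rather than the `yyyy-mm-dd hh:mm:ss` form listed in the options; which forms the commands accept should be confirmed against the TEAL command reference.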

  20. Sample output – csv and json • csv – good for reading into spreadsheets, or program parsing
      rec_id,event_id,time_occurred,time_logged,src_comp,src_loc,src_loc_type,rpt_comp,rpt_loc,rpt_loc_type,event_cnt,elapsed_time
      91455,BD700041,2011-02-09 15:06:19,2011-02-09 15:06:19,CNM,BB03-FR007-SN000-DR0-HB0-LD00,H,CNM,"TRMD",A,,
      • json – good for program parsing
      {"src_comp": "CNM", "rpt_loc_type": "A", "event_id": "BD700041", "src_loc_type": "H", "time_occurred": "2011-02-09 15:06:19", "rec_id": 91455, "event_cnt": null, "rpt_loc": "TRMD", "elapsed_time": null, "rpt_comp": "CNM", "time_logged": "2011-02-09 15:06:19", "src_loc": "BB03-FR007-SN000-DR0-HB0-LD00"}
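Both sample formats can be read with Python's standard library. The sketch below parses the two records from the slide, copied in as string constants with the closing quote of the JSON record written in ASCII. How the commands lay out multiple records (for example, one JSON object per line) is not shown on the slide, so only a single record is handled here.

```python
import csv
import io
import json

# Sample records copied from the slide.
CSV_TEXT = (
    "rec_id,event_id,time_occurred,time_logged,src_comp,src_loc,src_loc_type,"
    "rpt_comp,rpt_loc,rpt_loc_type,event_cnt,elapsed_time\n"
    "91455,BD700041,2011-02-09 15:06:19,2011-02-09 15:06:19,CNM,"
    'BB03-FR007-SN000-DR0-HB0-LD00,H,CNM,"TRMD",A,,\n'
)
JSON_TEXT = (
    '{"src_comp": "CNM", "rpt_loc_type": "A", "event_id": "BD700041", '
    '"src_loc_type": "H", "time_occurred": "2011-02-09 15:06:19", '
    '"rec_id": 91455, "event_cnt": null, "rpt_loc": "TRMD", '
    '"elapsed_time": null, "rpt_comp": "CNM", '
    '"time_logged": "2011-02-09 15:06:19", '
    '"src_loc": "BB03-FR007-SN000-DR0-HB0-LD00"}'
)

# csv: every value arrives as a string; empty columns become ''.
csv_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# json: numbers and nulls keep their types (int and None).
json_row = json.loads(JSON_TEXT)

print(csv_rows[0]["event_id"], repr(csv_rows[0]["event_cnt"]))
print(json_row["rec_id"], json_row["event_cnt"])
```

One practical difference is visible in the sample: the csv form loses the distinction between an empty value and a missing one, while the json form keeps `null`.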

  21. Alerts and Connectors

  22. FSP FSP FSP EMS CNM and TEAL SFP TEAL Listener Filter Network Events ISNM/CNM Analyzer Init Monitor Rules NM Event Alert

  23. Network Hardware Events • Events reported by the HFI, ISR or Optical Module: • HFI Events • HFI Down: reported for completeness of network status • Link Events • Link types are HFI-to-ISR links, Llocal (intra-drawer), Lremote (intra-SN), and D-link (inter-SN) • Port Down/Port Up • Threshold events: CRC, dropped flit, flit retry • Correctable/uncorrectable errors on port-level routing structures • Packet flow events, e.g. credit overflow, sender hang (informational) • Optical Module Events • Module-level events affect a single D port or two LR ports • Channel-level events affect a single D port, and may affect one or two LR ports depending on which channels are affected • Some OM events are thresholded by LNMC

  24. Frame Events • Reported directly to CNM by frame (BPA) firmware • ISNM uses these events for analysis only: BPA creates any serviceable events for the problems it detects; i.e., it suppresses network events caused by frame events • Sample frame events that may affect the ISR network: • CEC power dropped due to MCM Over Temperature • CEC DCCA errors • High ambient temperature • Diagram labels: BPA CNM FSP FSP FSP FSP

  25. Example CNM Alert
      >[c250mgrs52]>/opt/teal/bin/tllsalert -f text -q "alert_id=BD700025"
      rec_id : 9673
      alert_id : BD700025
      creation_time : 2011-08-16 15:15:11.146044
      severity : E
      urgency : S
      event_loc : FR052-CG03-SN000-DR0-HB1-OM12-LD12
      event_loc_type : H
      fru_loc : None
      recommendation : There is a problem with a D-Link. Record the alert ID. Record the location in the alert message. Contact IBM Service. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
      reason : D-link down between frame FR052 cage CG03 (superNode SN000 drawer DR0) hub HB1 port LD12 and frame FR052 cage CG06 (superNode SN003 drawer DR0) hub HB1 port LD15 (D Link Port Down)
      src_name : CnmEventAnalyzer
      state : 1
      raw_data : {"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },{ HFI_CAB,Symbolic Procedure,U78A9.001.20C1000-P1-T17-T6,,, },{ CBLCONT,Symbolic Procedure,U78A9.001.311B001-P1-T16-T5,,, },{ 52Y3020,FRU,U78A9.001.20C1000-P1-R2,YA193P203586,ABC123,TRMD },{ 52Y3020,FRU,U78A9.001.311B001-P1-R2,YA193P399669,ABC123,TRMD }","nbr_loc":"FR052-CG06-SN003-DR0-HB1-OM15-LD15","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B506"}

  26. CNM FRU list format in alerts raw_data :{"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },{ HFI_CAB,Symbolic Procedure,U78A9.001.20C1000-P1-T17-T6,,, },{ CBLCONT,Symbolic Procedure,U78A9.001.311B001-P1-T16-T5,,, },{ 52Y3020,FRU,U78A9.001.20C1000-P1-R2,YA193P203586,ABC123,TRMD },{ 52Y3020,FRU,U78A9.001.311B001-P1-R2,YA193P399669,ABC123,TRMD }","nbr_loc":"FR052-CG06-SN003-DR0-HB1-OM15-LD15","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B506"} • Multiple FRUs with each one contained in braces • Part number, FRU type, FRU location, ECID, CCIN
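The CNM `raw_data` shown above is a JSON object whose `fru_list` value is itself a string of brace-delimited entries, so it takes two steps to unpack. The sketch below is a hypothetical helper (`parse_cnm_fru_list` is not a TEAL function) applied to the sample from the slide. Each sample entry carries six comma-separated fields while the slide names five, so the fields are returned by position rather than mapped to names.

```python
import json
import re

# raw_data copied from the example CNM alert on the slide.
RAW_DATA = (
    '{"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },'
    "{ HFI_CAB,Symbolic Procedure,U78A9.001.20C1000-P1-T17-T6,,, },"
    "{ CBLCONT,Symbolic Procedure,U78A9.001.311B001-P1-T16-T5,,, },"
    "{ 52Y3020,FRU,U78A9.001.20C1000-P1-R2,YA193P203586,ABC123,TRMD },"
    '{ 52Y3020,FRU,U78A9.001.311B001-P1-R2,YA193P399669,ABC123,TRMD }",'
    '"nbr_loc":"FR052-CG06-SN003-DR0-HB1-OM15-LD15","nbr_typ":"H",'
    '"pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log",'
    '"encl_mtms":"9125-F2C/028B506"}'
)


def parse_cnm_fru_list(raw_data: str):
    """Return one list of stripped fields per '{ ... }' entry in fru_list."""
    fru_list = json.loads(raw_data)["fru_list"]
    return [
        [field.strip() for field in entry.split(",")]
        for entry in re.findall(r"\{([^{}]*)\}", fru_list)
    ]


frus = parse_cnm_fru_list(RAW_DATA)
for fields in frus:
    # Positions 0-2 match the slide's part number, FRU type, and FRU location.
    print(fields[0], "|", fields[1], "|", fields[2] or "(no location)")
```

The entries are listed in the order the alert's recommendation refers to ("replace the FRUs in the order listed"), so the list order is preserved.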

  27. Example CNM Compound Alert
      >[c250mgrs52]>/opt/teal/bin/tllsalert -f text -q "alert_id=BDFF0060" -w
      rec_id : 13304
      alert_id : BDFF0060
      creation_time : 2011-08-26 19:02:53.971854
      severity : W
      urgency : O
      event_loc : FR052-CG04-SN001-DR0
      event_loc_type : H
      fru_loc : None
      recommendation : A large number of HFI network links attached to a drawer are down without an accompanying power event. Contact IBM Service and report the alert ID. If a drawer lost power, then this is a secondary effect.
      reason : Drawer level event occurred on frame FR052 cage CG04 (superNode SN001 drawer DR0). (Suspicious Drawer)
      src_name : CnmEventAnalyzer
      state : 1
      raw_data : {"fru_list":"{ HFI_IDR,Isolation Procedure,,,, }","nbr_loc":"FR052-CG04-SN001-DR0-HB7-OM09-LD09","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B5D6"}
      Condition Alerts: []
      Condition Events: [32873,32874,32875,32876,32877,32878,32879,32880,32881,32882,32883,32884]
      Duplicate Alerts: []
      Suppression Alerts: []
      Suppression Events: []

  28. Example CNM Alert with suppression
      >[c250mgrs52]>/opt/teal/bin/tllsalert -f text -q "alert_id=BD700022" -w
      rec_id : 8507
      alert_id : BD700022
      creation_time : 2011-08-11 14:39:00.244292
      severity : E
      urgency : S
      event_loc : FR052-CG10-SN007-DR0-HB3-OM09-LD09
      event_loc_type : H
      fru_loc : None
      recommendation : There is a problem with a D-Link. Record the alert ID and call IBM Service. Log on to the Management Server. To isolate to the proper FRU, run Link Diags and perform the actions that it recommends. If no action is recommended, because Diags cannot isolate to the proper FRU, replace the FRUs in the order listed.
      reason : D Link Port Lane Width Change between frame FR052 cage CG10 (superNode SN007 drawer DR0) hub HB3 port LD09 and frame FR052 cage CG09 (superNode SN006 drawer DR0) hub HB3 port LD08 (D Link Port Lane Width Change)
      src_name : CnmEventAnalyzer
      state : 1
      raw_data : {"fru_list":"{ HFI_DDG,Isolation Procedure,,,, },{ HFI_CAB,Symbolic Procedure,U78A9.001.30CK001-P1-T14-T1,,, },{ CBLCONT,Symbolic Procedure,U78A9.001.312N005-P1-T14-T2,,, },{ 52Y3020,FRU,U78A9.001.30CK001-P1-R5,YA193P400322,ABC123,TRMD },{ 52Y3020,FRU,U78A9.001.312N005-P1-R5,YA193N035309,ABC123,TRMD }","nbr_loc":"FR052-CG09-SN006-DR0-HB3-OM08-LD08","nbr_typ":"H","pwr_enc":"78AC-100BC50052","eed_loc":"c250mgrs52:/var/opt/isnm/cnm/log","encl_mtms":"9125-F2C/028B5F6"}
      Condition Alerts: []
      Condition Events: [26388]
      Duplicate Alerts: [8511]
      Suppression Alerts: []
      Suppression Events: [26389,26390]

  29. Example CNM Event
      >[c250mgrs52]>/opt/teal/bin/tllsevent -f text -q "event_id=BD700025" -e
      rec_id : 22877
      event_id : BD700025 - D Link Port Down
      time_occurred : 2011-08-01 14:52:14
      time_logged : 2011-08-01 14:52:14.369687
      src_comp : CNM
      src_loc : FR052-CG07-SN004-DR0-HB0-OM14-LD14
      src_loc_type : H
      rpt_comp : CNM
      rpt_loc : c250mgrs52##cnmd
      rpt_loc_type : A
      event_cnt : None
      elapsed_time : None
      ext.eed_loc_info : c250mgrs52:/var/opt/isnm/cnm/log
      ext.encl_mtms : 9125-F2C/028B596
      ext.global_counter : None
      ext.isnm_raw_data : REG_BEGIN ISR_GLOBAL_COUNTER_REGISTER = 0x000005347ecda480 ISR_ID_REGISTER = 0x004800d01c000000 ISR_D14D15_FIR = 0x4000000000000000 D_PORT_14_SEND_NEIGHBOR_ID = 0x000800d01ee00000 OLL_LLD14_LINK_STATUS = 0xc1d6000100000000 REG_END
      ext.local_om1 : U78A9.001.30CM002-P1-R2-R1,52Y3020,YA193P407777,ABC122,TRMD
      ext.local_om2 :
      ext.local_planar : U78A9.001.30CM002-P1,74Y0601,YH10HA0BH002,ABC122,2E00
      ext.local_port : U78A9.001.30CM002-P1-T17-T7
      ext.local_torrent : U78A9.001.30CM002-P1-R2,52Y3020,YA193P407777,ABC123,TRMD
      ext.nbr_om1 : U78A9.001.30CK001-P1-R2-R4,52Y3020,YA193P399201,ABC123,TRMD
      ext.nbr_om2 :
      ext.nbr_planar : U78A9.001.30CK001-P1,74Y0601,YH10HA0BJ003,ABC123,2E00
      ext.nbr_port : U78A9.001.30CK001-P1-T15-T8
      ext.nbr_torrent : U78A9.001.30CK001-P1-R2,52Y3020,YA193P399201,ABC123,TRMD
      ext.neighbor_loc : H: FR052-CG04-SN006-DR0-HB0-OM11-LD11
      ext.pwr_ctrl_mtms : 78AC-100BC50052
      ext.recovery_file_path : /var/opt/isnm/cnm/log

  30. SFP Connector • Uses RMC and xCAT monitoring support • Retrieves batches of events from HMC • Diagram labels: FSP TEAL HMC
      HMCs configured:
      [c250mgrs14][/]> nodels hmc
      c250hmc05_a
      RMC set up:
      [c250mgrs14][/]> lscondresp
      Displaying condition with response information:
      Condition Response Node State
      "AllServiceableEvents_HB" "TealLogSfpEvent_HB" "c250mgrs14" "Active"

  31. SFP Event
      rec_id : 8490
      event_id : B1812A80
      time_occurred : 2011-04-20 09:57:41
      time_logged : 2011-04-20 09:58:46.187401
      src_comp : SFP
      src_loc : U9125.F2C.P7IH165
      src_loc_type : P
      rpt_comp : 7042CR5/KQZAAAT
      rpt_loc : c250hmc05.ppd.pok.ibm.com##AllServiceableEvents_B
      rpt_loc_type : A
      event_cnt : None
      elapsed_time : None
      ext.call_home : N
      ext.description : Platform firmware (0x81) reported an error.
      ext.fru_list : [['FSPSP04', 'ACT04219I Isolate procedure', '', '', '', ''], ['45D7208', 'ACT04216I FRU', 'U78A9.001.1122233-P1-R5', 'YH30HA022005', '', '2A3A'], ['FSPSP06', 'ACT04219I Isolate procedure', '', '', '', '']]
      ext.prob_num : 320
      ext.sfp_raw_data : {'FRURecentlyReplaced': ['No', 'No', 'No'], 'FRULogicControllingCECMachineSerialNumber': ['P7IH165', 'P7IH165', 'P7IH165'], 'HSCBiosName': 'KQZAAAT', 'CreatedTimeStamp': '04/20/2011 06:16:49', 'CECMachineModel': 'F2C', 'FDAdditionalMachine': ['9125-F2C-P7IH165'], 'EventType': 'open', 'SystemRefCode': 'B1812A80', 'CreatorID': 'E', 'FRUEnclosureMachineSerialNumber': ['P7IH165', 'P7IH165', 'P7IH165'], 'FRUEnclosureMachineTypeModel': ['9125-F2C', '9125-F2C', '9125-F2C'], 'DuplicateCount': '0', 'EventSeverity': '32', 'CECMachineType': '9125', 'SubsystemID': '129', 'FRULogicControllingCECMachineTypeModel': ['9125-F2C', '9125-F2C', '9125-F2C'], 'CalledHome': 'No', 'FRUReplacementPriority': ['80', '50', '25'], 'CECMachineSerialNumber': 'P7IH165', 'LastReportedTimeStamp': '04/20/2011 06:16:49', 'HSCBiosId': '7042CR5', 'PlatformLogID': '1346333000'}
      Slide callouts: Failing MTMS; Reporting HMC; FRU List; Problem # on HMC

  32. SFP Alert rec_id : 8040 alert_id : 14020079 creation_time : 2011-05-17 12:58:58.661058 severity : E urgency : N event_loc : U9458.100.BPCF007 event_loc_type : P fru_loc : None recommendation : reason : Power/Cooling subsystem & control (0x60) reported an error. src_name : SFPEventAnalyzer state : 1 raw_data : {'FRU List': [['IQYRISC', 'ACT04219I Isolate procedure', '', '', '', ''], ['PU_BOOK', 'ACT04216I FRU', 'U78A9.001.1122233', '', '', '']], 'SFP': 'c250hmc05.ppd.pok.ibm.com', 'Problem Number': 601}

  33. SFP FRU list format in alerts raw_data : {'FRU List': [['IQYRISC', 'ACT04219I Isolate procedure', '', '', '', ''], ['PU_BOOK', 'ACT04216I FRU', 'U78A9.001.1122233', '', '', '']], 'SFP': 'c250hmc05.ppd.pok.ibm.com', 'Problem Number': 601} • Multiple FRUs with each one contained in brackets • Part number, FRU type, FRU location, ECID, CCIN
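Unlike the CNM form, the SFP `raw_data` shown above uses single-quoted keys and strings, so it is not valid JSON as printed. In the sample it does read as a Python literal, which `ast.literal_eval` from the standard library can evaluate safely. The sketch below works only from the sample on the slide; whether every SFP alert is stored in this exact form is an assumption.

```python
import ast

# raw_data copied from the example SFP alert on the slide.
RAW_DATA = (
    "{'FRU List': [['IQYRISC', 'ACT04219I Isolate procedure', '', '', '', ''], "
    "['PU_BOOK', 'ACT04216I FRU', 'U78A9.001.1122233', '', '', '']], "
    "'SFP': 'c250hmc05.ppd.pok.ibm.com', 'Problem Number': 601}"
)

# literal_eval accepts only Python literals, so no code in the text is executed.
data = ast.literal_eval(RAW_DATA)

print("Reported by", data["SFP"], "as problem", data["Problem Number"])
for part, fru_type, location, *rest in data["FRU List"]:
    # Positions 0-2 match the slide's part number, FRU type, and FRU location.
    print(part, "|", fru_type, "|", location or "(no location)")
```

Each bracketed entry in the sample has six positions, so the remaining three are collected into `rest` rather than given names.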

  34. Loadleveler Connector • New support for Loadleveler 5.1 • DB table polling via TEAL connector daemon • Loadleveler must be configured to use the DB • Diagram labels: TLL_Raslog teal teal_ll
      [root@c250mgrs20-pvt log]# service teal_ll status
      [ OK ] loadleveler.py (pid 17583) is running...
      [c250mgrs14][/]> lssrc -s teal_ll
      Subsystem Group PID Status
      teal_ll 5701830 active

  35. Loadleveler Alert daemon down
      rec_id : 9
      alert_id : LL001000
      creation_time : 2011-05-19 13:26:34.559391
      severity : E
      urgency : N
      event_loc : z25c4s12.ppd.pok.ibm.com
      event_loc_type : A
      fru_loc : None
      recommendation : Call next level of support
      reason : LoadL_schedd on machine z25c4s12.ppd.pok.ibm.com is down.
      src_name : LLEventAnalyzer
      state : 1
      raw_data : which daemon and where
      LL alert_id: LL0010xx = Daemon Down; LL0020xx = job failures

  36. PNSD Connector • Multi-tiered configuration through service nodes using RMC and xCAT monitoring support • Uses pnsd_stat command to get statistics • May cause jitter on compute nodes, so it may not be enabled in all cases • Diagram labels: Svc Node Compute TEAL
      xcatmn2:~ # lscondresp
      Displaying condition with response information:
      Condition Response Node State
      "TealAnyNodePnsdStat_H" "TealLogPnsdEvent_H" "xcatmn2" "Active"

  37. PNSD Alert
      rec_id : 12
      alert_id : PNSD0001
      creation_time : 2011-01-26 23:03:40
      severity : E
      urgency : N
      event_loc : compute37##TealPnsdStat
      event_loc_type : A
      fru_loc : None
      recommendation : Call next level of support
      reason : Packet retransmit threshold has been exceeded on node compute37
      src_name : PNSDEventAnalyzer
      state : 1
      raw_data : 0.046
      Slide callouts: Reporting node; Raw percentage packets dropped/written
      PNSD alert_id: PNSD0001 = Retransmit threshold exceeded

  38. Installation

  39. Packaging TEAL • Multi-platform • AIX – installp • Linux – RPM • Base • Pipeline • Base services • Logging • DB access • Configuration • Locations • Rules engine • Common filters/listeners • Command line • xCAT extensions • Component • Connector Library/Program • Rules • Alert/Event Metadata • Extension Data Format • User specific Filters/Listeners • Configuration file Base ISNM GPFS PNSD LL Service Focal Point ….

  40. Configuration Files • Stanza-based • Used during startup (/etc/teal) • Separate files per package (teal.conf => base framework features) • Configures processing pipeline • Additional parameters for specialized function • Enabled in different modes
      [alert_listener.RmcAlertListener]
      class = ibm.teal.listener.rmc_alert_listener.RmcAlertListener
      enabled = false
      [alert_listener.FileAlertListener]
      class = ibm.teal.listener.file_alert_listener.FileAlertListener
      enabled = historic
      filters = DuplicateAlertFilter
      format = text
      file = /var/log/teal/cluster_alert.log
      mode = write
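The stanza syntax above happens to be readable by Python's standard `configparser`, which makes it easy to check which listeners a given mode would pick up before starting TEAL. The sketch below reads the two sample stanzas from the slide. Whether TEAL itself parses its files this way is not stated in the slides, `listeners_for` is a hypothetical helper, and treating `enabled = all` as "every mode" is an assumption (the slides show only `false`, `historic`, and `realtime`).

```python
import configparser

# The two listener stanzas from the slide.
SAMPLE = """
[alert_listener.RmcAlertListener]
class = ibm.teal.listener.rmc_alert_listener.RmcAlertListener
enabled = false

[alert_listener.FileAlertListener]
class = ibm.teal.listener.file_alert_listener.FileAlertListener
enabled = historic
filters = DuplicateAlertFilter
format = text
file = /var/log/teal/cluster_alert.log
mode = write
"""

# interpolation=None keeps any '%' in values from being treated specially.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)


def listeners_for(mode):
    """Names of alert_listener stanzas enabled for the given mode."""
    names = []
    for section in parser.sections():
        kind, _, name = section.partition(".")
        enabled = parser[section].get("enabled", "false")
        # Assumption: 'all' means enabled in every mode.
        if kind == "alert_listener" and enabled in (mode, "all"):
            names.append(name)
    return names


print(listeners_for("historic"))
print(listeners_for("realtime"))
```

With the sample stanzas, only the file listener is active in historic mode and nothing is active in realtime mode, which matches the `enabled` values shown on the slide.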

  41. Adding an e-mail listener • Add the definition where TEAL will pick it up: • Add to base configuration file (/etc/teal/teal.conf) • Add in a file in the configuration directory (/etc/teal/my.conf) • For temporary use: copy conf file(s) to own directory, modify and use during historic analysis (more often for writing out alerts)
      [alert_listener.SmtpAlertListener]
      class = ibm.teal.listener.smtp_alert_listener.SmtpAlertListener
      enabled = realtime
      filters = DuplicateAlertFilter
      server=ems1234.cluster.net
      to=jane_ibmer@ibm.com, joe_ibmer@ibm.com
      from=joe@cluster.net

  42. Directory Structure • Component rules & metadata • Location • Extended data def Code Libraries Start up configuration (default)

  43. When Things Go Wrong • /var/log/teal has TEAL logs (default) • On AIX look at the console (alog -t console -o) • Note the following: (These are important fields with their TEAL and SFP equivalents) • TEAL alert_id, SFP refcode • TEAL src_loc, SFP reporting MTMS • TEAL reason, SFP problem description • FRU list in TEAL and SFP • Specific alert data or range (text format) • /opt/teal/bin/tllsalert -f text -q "[query to narrow down]" • -f json or -f csv can be handier for grepping out certain records • -d to show duplicate alerts • -x to show duplicate relationships • Specific event data or range (text, with extended and raw data) • /opt/teal/bin/tllsevent -f text -e -q "[query to narrow down]" • -f json or -f csv can be handier for grepping out certain records • -x to show which alerts it is associated with

  44. When Things Go Wrong (continued) • Data dump • /opt/teal/sbin/tltab -d -p <path to dump file> • Restore: • /opt/teal/sbin/tltab -c # Drop and recreate the tables • /opt/teal/sbin/tltab -r -p <path to returned file> # Restore the tables with the user data • See TEAL on sourceforge (pyteal.sourceforge.net) • Look at service pack for known issues, hints/tips, etc. • http://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+High+Performance+Computing+Clusters+Service+Packs

  45. References • TEAL Sourceforge Project - http://pyteal.sourceforge.net • Command reference • Install/Configuration Instructions • Design Overview & other goodies • Mailing List • Problem Tickets • xCAT HPC Software Installation • http://sourceforge.net/apps/mediawiki/xcat/index.php?title=IBM_HPC_Stack_in_an_xCAT_Cluster • Loadleveler • GPFS • RSCT/RMC • Cluster Guide • https://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+HPC+Clustering+with+Power+775+-+Cluster+Guide • Cluster Service Pack readme • https://www.ibm.com/developerworks/wikis/display/hpccentral/IBM+High+Performance+Computing+Clusters+Service+Packs
