Mainframe Global and Workload Levels Statistical Exception Detection System, Based on MASF

Igor Trubin, Ph.D. and Linwood Merritt
Capital One Services, Inc.
igor.trubin@capitalone.com
Page 1

Introduction: Environment
• Capital One
• 6th largest card issuer in the United States
• Added to the S&P 500 in 1998
• Fortune 500 company starting in 2000
• Managed loans at $71.8 billion
• Accounts at 46.7 million
• CIO 100 Award “Master of the Customer Connection”
• Information Week “Innovation 100” Award winner
• ComputerWorld “Top 100 places to work in IT”
Page 2

Statistical Analysis of Mainframe Performance Data
• SEDS is the Statistical Exception Detection System, based on the Multivariate Adaptive Statistical Filtering (MASF) technique.
• SEDS automatically scans large volumes of performance data and identifies measurements that differ significantly from their expected values.
• MASF is an extension of Statistical Process Control (also called Quality Control), which was developed by Walter Shewhart of Bell Telephone Laboratories in the 1920s.
• The MASF procedure was designed by BGS Systems, Inc. and presented at CMG in 1995.
• SEDS was developed by this author and presented as the best paper at CMG 2002.
Page 3

Review of the Existing Tools
• SAS/QC (Quality Control)
• JMP from SAS
• BEZsystems for Oracle and Teradata
• Concord eHealth: DFN (Deviation From Normal)
• The Patrol Perform and Predict tool from BMC Software
The common output is control charts for monitoring variation in a process under statistical control.
Page 4

SEDS Structure
• Exception detectors for the most important metrics;
• SEDS database with the history of exceptions;
• Statistical process control daily profile chart generator;
• Exception server name list generator;
• Leader/Outsider server and workload detector, and detector of defective (runaway) processes; and
• Leaders/Outsiders bar chart generator.
Page 5

CPU Utilization Control Chart for Web Report
The full “7 days x 24 hours” adaptive filtering policy is applied: the average and the upper and lower statistical limits of a particular metric are calculated for each hour of each weekday over the past six months.
Page 6
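
The 7 x 24 policy described on this slide can be sketched roughly as follows. This is a minimal illustration, not the SEDS implementation: the function names `masf_limits` and `find_exceptions` are hypothetical, and the band width of three standard deviations (`n_sigma`) is an assumption, since the slides do not state which multiplier SEDS uses.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def masf_limits(history, n_sigma=3.0):
    """Return {(weekday, hour): (lower, average, upper)} from (timestamp, value) pairs.

    n_sigma is an assumed band width; the slides do not state one.
    """
    groups = defaultdict(list)
    for ts, value in history:
        # One reference set per hour of each weekday: 7 x 24 = 168 slots.
        groups[(ts.weekday(), ts.hour)].append(value)

    limits = {}
    for slot, values in groups.items():
        if len(values) < 2:  # stdev needs at least two samples
            continue
        avg = mean(values)
        band = n_sigma * stdev(values)
        limits[slot] = (avg - band, avg, avg + band)
    return limits


def find_exceptions(samples, limits):
    """Return (timestamp, value, 'upper' | 'lower') for samples outside their slot's limits."""
    found = []
    for ts, value in samples:
        slot = (ts.weekday(), ts.hour)
        if slot not in limits:
            continue
        lower, _, upper = limits[slot]
        if value > upper:
            found.append((ts, value, "upper"))
        elif value < lower:
            found.append((ts, value, "lower"))
    return found


# Usage: four past Mondays at 10:00 as history, then a new Monday reading.
monday_10 = datetime(2024, 1, 1, 10)
history = [(monday_10 + timedelta(weeks=i), v) for i, v in enumerate([40, 42, 38, 40])]
limits = masf_limits(history)
print(find_exceptions([(monday_10 + timedelta(weeks=4), 50)], limits))
```

Grouping by (weekday, hour) is what makes the limits adaptive: a value that is normal on Monday at 10:00 can still be an exception on Sunday at 03:00.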

SEDS against Unisys and Tandem Platforms Performance Data
SEDS works with hourly or daily performance data. The schemas of the “day” tables in ITRM for the Unisys and Tandem platforms are shown here. Good candidates for use in SEDS are marked in red.
Page 7

Examples of Captured Exceptions for Unisys and Tandem
The Unisys server had unusually low utilization, which might indicate disk or database performance problems. The Tandem server, in contrast, had two unusual spikes of CPU utilization that crossed the upper limit.
Page 8

Global Performance Data for MVS Platform
The schemas of the “day” tables in ITRM for the MVS platform are shown in the table. Good candidates for use in SEDS are marked in red.
• A set of nightly batch jobs:
  • dumps the remaining active accounting data,
  • consolidates the data,
  • processes the data in SAS, and
  • updates the ITRM PDB.
Page 9

Examples of Captured Exceptions for One of the Logical Partitions (LPAR)
Because this chart shows the utilization of one LPAR within a shared system rather than of the entire system, 100% is not a true threshold. SEDS instead provides a statistical threshold, which is more accurate and dynamic.
Page 10

BMC Visualizer MASF vs. SEDS
You can use BMC Visualizer to find other exceptions based on other filtering policies. For that, the BMC collector must be installed on the server, and BMC Visualizer must be used manually to capture any MASF exceptions. SEDS is preferable as an automated MASF chart generator. In addition, SEDS can automatically notify a performance analyst when a statistical exception occurs.
BMC Visualizer example: the System Hierarchy (spectrum) and Control charts
Page 11

Application Level SEDS for MVS Platform
One problem is that, from LPAR-level data alone, it is impossible to tell which workloads are responsible for an exception. However, the data collection process provides application-level data across all LPARs. In a stacked workload chart, it is difficult to find the application responsible for spikes in overall CPU usage. SEDS shows that Appl1 was responsible for the global maxima in the overall MIPS chart.
Page 12

Other Reasons to Generate a Workload Control Chart
1. To capture unusual behavior of a relatively small application that was not big enough to create a global exception.
2. To prove the stable behavior of any essential or critical application.
Page 13

Service Class/Period Type of Metrics under SEDS
- Hourly sum of the ended transaction count - TRANS
- Hourly sum of the average response per transaction - RESP (it shows values consistently larger than the average)
- Hourly sum of elapsed task duration - CPUsec (not always reported correctly for long-running servers)
ElapsedSec = (number of tasks) * 3600 seconds.
Page 14

Performance Status Automatic Recognition, Web Report and E-mail Notification
The report shows the number of applications or Service Classes with exceptions.
• Green in the Web table indicates no exceptions.
• Magenta indicates that the exceptions only exceeded the lower limit.
• Yellow means an exception occurred on a particular server or LPAR.
(NUP - NLOW), shown under the link to a MASF chart, is the severity or type of the exceptions, where NUP is the number of upper limit exceptions and NLOW is the number of lower limit exceptions during the previous day.
Page 15
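
The color rules on this slide could be expressed as a small lookup like the one below. It is only a sketch: `status_color` and `severity_label` are hypothetical names, and reading "yellow" as "at least one upper limit exception" is an assumption drawn from the fact that magenta is reserved for lower-limit-only days. Rendering the severity as the pair "(NUP - NLOW)" rather than as a subtraction is likewise an assumption.

```python
def status_color(nup, nlow):
    """Map the previous day's exception counts to a Web table color.

    nup  - number of upper limit exceptions
    nlow - number of lower limit exceptions
    """
    if nup == 0 and nlow == 0:
        return "green"    # no exceptions
    if nup == 0:
        return "magenta"  # exceptions only exceeded the lower limit
    return "yellow"       # assumed: at least one upper limit exception


def severity_label(nup, nlow):
    """Text shown under the chart link; displayed as a pair (an assumption)."""
    return f"({nup} - {nlow})"


# Usage: (server, NUP, NLOW) rows for the previous day; names are made up.
for server, nup, nlow in [("LPAR_A", 0, 0), ("LPAR_B", 0, 3), ("LPAR_C", 2, 1)]:
    print(server, status_color(nup, nlow), severity_label(nup, nlow))
```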

Links to the Workload Control Charts Page 16

Exception Database and “Extra Volume” Metric
ExtraVolume is a numeric estimate of the exception magnitude. For CPU utilization it is an ExtraTime.
• It is calculated as the area between the limit curve and the actual data curve (for periods when exceptions occurred).
• For CPU metrics, the physical meaning is the CPU time (or MIPS) the server used beyond the statistical limit.
The SEDS database keeps the history of exceptions; its structure is shown on the slide.
Page 17
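
As a rough illustration of the area calculation, the sketch below sums the gap between the actual value and the violated limit over the intervals where an exception occurred. The function name `extra_volume`, the rectangle (per-interval) approximation of the area, and the choice to fold upper and lower exceptions into one signed total are all assumptions; the slides only say that the area between the limit curve and the data curve is used.

```python
def extra_volume(actual, lower, upper, interval_hours=1.0):
    """Signed area between the data curve and the violated limit curve.

    actual, lower, upper are equal-length sequences, one value per interval.
    Intervals inside the limits contribute nothing. Values above the upper
    limit add a positive area; values below the lower limit add a negative one.
    """
    total = 0.0
    for value, lo, up in zip(actual, lower, upper):
        if value > up:
            total += (value - up) * interval_hours
        elif value < lo:
            total += (value - lo) * interval_hours
    return total


# Usage: three hourly CPU readings against flat limits of 30 and 80.
over = extra_volume([50, 90, 95], [30, 30, 30], [80, 80, 80])
under = extra_volume([50, 20, 50], [30, 30, 30], [80, 80, 80])
print(over, under)
```

With utilization in percent and hourly intervals, the result is in percent-hours; converting it to CPU time or MIPS would depend on the capacity of the server, which this sketch does not model.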

TOP LPAR Leaders/Outsiders Charts
• The system automatically calculates ExtraTime for the last day and records it in the SEDS database.
• This data is used to publish Leaders/Outsiders bar charts for the last day, last week and last month.
If a server showed a positive ExtraVolume for the previous day, more capacity was used on the server than in the past. If a server showed a negative ExtraVolume, less capacity was used than usual (not necessarily a good thing).
Page 18
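
One plausible way to pick the bars for such a chart is sketched below. The slides do not define the terms precisely, so treating "leaders" as the servers with the largest positive ExtraVolume and "outsiders" as those with the most negative ExtraVolume is an assumption, as are the function name `leaders_outsiders`, the `top_n` parameter, and the sample LPAR names.

```python
def leaders_outsiders(extra_by_server, top_n=5):
    """Split servers into top positive and top negative ExtraVolume lists.

    extra_by_server maps a server (or LPAR) name to its ExtraVolume for the
    chosen period. Servers with zero ExtraVolume appear in neither list.
    """
    positive = [(name, v) for name, v in extra_by_server.items() if v > 0]
    negative = [(name, v) for name, v in extra_by_server.items() if v < 0]
    leaders = sorted(positive, key=lambda item: item[1], reverse=True)[:top_n]
    outsiders = sorted(negative, key=lambda item: item[1])[:top_n]
    return leaders, outsiders


# Usage: hypothetical ExtraVolume totals for the last day.
sample = {"LPAR1": 12.5, "LPAR2": -3.0, "LPAR3": 0.0, "LPAR4": 40.0, "LPAR5": -8.0}
leaders, outsiders = leaders_outsiders(sample, top_n=2)
print(leaders)
print(outsiders)
```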

SUMMARY
• Statistical techniques can be used to automatically detect and report exceptions in resource utilization and service levels.
• The authors’ site previously used MASF techniques to track global and application level CPU, disk and memory exceptions for a large number of UNIX and WINTEL servers.
• The workload level analysis enabled the authors’ site to expand the scope of this process to encompass large mainframe class servers.
• Although the analysis of global exceptions at an LPAR level has limited value for a system that shares workloads across logical systems, a workload-oriented system allows for quick detection of exceptions and immediate drill-down capabilities for the Capacity Planner and Performance Analyst.
• The authors recommend that the reader evaluate and understand any built-in statistical processes within his/her product set and consider developing ways to notify appropriate analysts when exceptions occur.
Page 19

References
• Trubin, Igor, Ph.D. and Mclaughlin, Kevin, “Exception Detection System, Based on the Statistical Process Control Concept,” Proceedings of the Computer Measurement Group, 2001.
• Trubin, Igor, Ph.D., “Global and Application level Exception Detection System, Based on the MASF Technique,” Proceedings of the Computer Measurement Group, 2002.
Thanks!
Igor Trubin, IT Capacity Planning, Capital One Services, Inc.
igor.trubin@capitalone.com
Page 20
