
Ranking the Importance of Alerts for Problem Determination in Large Computer Systems



  1. Ranking the Importance of Alerts for Problem Determination in Large Computer Systems Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena NEC Laboratories America, Princeton

  2. Outline • Introduction • Motivation & Goal • System Invariants • Invariants extraction • Value propagation • Collaborative peer review mechanism • Rules & Fault model • Ranking alerts • Experiment result • Conclusion

  3. Motivation • Large & complex systems are deployed by integrating many heterogeneous components: servers, routers, storage & software from multiple vendors. • Hidden dependencies exist among these components. • Components produce log/performance data, and operators set many rules to check this data and trigger alerts, e.g. CPU% @Web > 70%. • Rule setting is independent & isolated, based on each operator's own system knowledge.

  4. Goal • Which alerts should we analyze first? Example: Alert 1: CPU% @Web > 70%, Alert 2: DiskUsg@Web > 150, Alert 3: CPU% @DB > 60%, Alert 4: Network@AP > 35k. • We introduce a "peer-review" mechanism to rank the importance of alerts (e.g. Alert 3 > Alert 4 > Alert 1 > Alert 2), so that operators can prioritize the problem determination process. • An alert rule that gets more consensus from its peers ranks higher, blending the system management knowledge of multiple operators.

  5. Alerts Ranking Process • Offline: 1. Extract invariants from the monitoring data of the large system (fully automated; invariants model: [ICAC 2006], [TDSC 2006], [TKDE 2007], [DSN 2006]). 2. Operators (with domain knowledge) define alert rules from domain information, e.g. Alert 1: CPU% @Web > 70%, Alert 2: DiskUsg@Web > 150, Alert 3: CPU% @DB > 60%, Alert 4: Network@AP > 35k. 3. Sort the alert rules by importance. • Online: 4. At the time alerts are received, rank the real alerts according to the sorted rules.

  6. System Invariants • Flow intensity: the intensity with which internal monitoring data reacts to the volume of user requests. • User requests flow through the system endlessly, and many internal measurements m1, m2, ..., mn collected at various points of the target system react to the volume of user requests accordingly. • We search for relationships among these internal measurements: do any constant relationships exist? • If modeled relationships continue to hold all the time, they can be regarded as invariants of the system.

  7. Invariant Examples • We check implicit relationships, not the real values of flow intensities, which are always changing. However, many relationships are constant! • Example: x and y are changing, but the equation y = f(x) is constant. • Example invariants: at a database server, packet volume V1 and SQL query number N1 satisfy V1 = f(N1); at a load balancer, incoming traffic I1 and outgoing traffic O1, O2, O3 satisfy I1 = O1 + O2 + O3.

  8. Automated Invariants Search • Monitor the target system and collect observation data over successive windows [t0-t1], [t1-t2], ..., [tk-tk+1]. • Pick any two measurements i, j and learn a model f_ij from a template in the model library. • Sequential validation: with each new window of data, test whether f_ij still holds. If yes, keep f_ij as an invariant candidate and update its confidence score (p0, p1, ..., pk); if no, drop f_ij as a variant.
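To make the search loop concrete, here is a minimal Python sketch of the sequential validation step. The helper names `learn` and `fits`, and the simple +1 confidence update, are illustrative assumptions rather than the paper's exact procedure.

```python
import itertools

def sequential_validation(windows, measurements, learn, fits):
    """Search invariant candidates among all measurement pairs and
    validate them sequentially over successive observation windows.

    windows      -- list of observation-data windows [t0-t1], [t1-t2], ...
    measurements -- names of the monitored flow intensities
    learn        -- learn(i, j, window) -> fitted model f_ij
    fits         -- fits(model, window) -> True if the model still holds
    """
    candidates = {}
    # Learn a model f_ij for every measurement pair on the first window.
    for i, j in itertools.combinations(measurements, 2):
        candidates[(i, j)] = {"model": learn(i, j, windows[0]), "score": 0}

    # Validate on each new window; drop variants, keep survivors.
    for window in windows[1:]:
        for pair in list(candidates):
            if fits(candidates[pair]["model"], window):
                candidates[pair]["score"] += 1   # confidence grows
            else:
                del candidates[pair]             # a variant, not an invariant
    return candidates  # surviving f_ij are the extracted invariants
```

Dropping a pair on its first failed window is the strictest possible policy; a real deployment might tolerate a few misses before discarding a candidate.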

  9. One Example in the Model Library • We use an AutoRegressive model with eXogenous input (ARX) to learn the relationship between two flow intensity measurements x and y. • Define y(t) + a1 y(t-1) + ... + an y(t-n) = b0 x(t-k) + ... + bm x(t-k-m). • Given a sequence of real observations, we learn the model parameters with the least mean squares (LMS) method by minimizing the prediction error. • A fitness function such as F = 1 - sqrt( sum_t (y(t) - yhat(t))^2 / sum_t (y(t) - ybar)^2 ), where yhat is the model's prediction and ybar is the mean of y, can be used to evaluate how well the learned model fits the real data.
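The sketch below fits one such relationship with batch least squares via numpy (standing in for the LMS estimator mentioned above) and computes the fitness score. The model orders n, m and the zero input delay (k = 0) are arbitrary choices for illustration.

```python
import numpy as np

def fit_arx(x, y, n=2, m=2):
    """Fit an ARX model y(t) = -a1*y(t-1) - ... - an*y(t-n)
                              + b0*x(t) + ... + bm*x(t-m)
    by batch least squares (orders n, m and zero delay are assumptions)."""
    T, start = len(y), max(n, m)
    rows, targets = [], []
    for t in range(start, T):
        past_y = [-y[t - i] for i in range(1, n + 1)]
        past_x = [x[t - i] for i in range(0, m + 1)]
        rows.append(past_y + past_x)
        targets.append(y[t])
    A, b = np.array(rows), np.array(targets)
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    y_hat = A @ theta
    # Normalized fitness score: near 1 for a good fit, <= 0 for a poor one.
    fitness = 1 - np.sqrt(np.sum((b - y_hat) ** 2) / np.sum((b - b.mean()) ** 2))
    return theta, fitness

# Toy usage: a near-linear relationship should score close to 1.
x = np.linspace(0, 10, 200)
y = 3 * x + np.random.normal(0, 0.01, 200)
theta, fitness = fit_arx(x, y)
print(round(fitness, 3))
```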

  10. Value Propagation with Invariants • With the extracted set of ARX-model invariants, a value can be propagated over multiple hops of the invariant network until the propagation converges. • Example: given invariants y = f(x) and z = g(y), a value of x determines z = g(f(x)); likewise u = h(x) and v = s(u) yield v = s(h(x)).
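A toy sketch of the propagation step, assuming each invariant is available as a one-argument function; the graph walk and the example functions below are illustrative, not from the paper.

```python
def propagate(source, value, invariants):
    """Propagate a value through the invariant network.

    invariants -- dict mapping (src, dst) -> callable dst = f(src)
    Returns every measurement reachable from `source` with its
    propagated (composed) value, e.g. z = g(f(x)) over two hops.
    """
    values = {source: value}
    frontier = [source]
    while frontier:
        u = frontier.pop()
        for (src, dst), f in invariants.items():
            if src == u and dst not in values:
                values[dst] = f(values[u])   # compose one more hop
                frontier.append(dst)
    return values

# Example: x -> y -> z and x -> u -> v, as on the slide (toy functions).
invariants = {
    ("x", "y"): lambda x: 2 * x,        # y = f(x)
    ("y", "z"): lambda y: y + 3,        # z = g(y)  => z = g(f(x))
    ("x", "u"): lambda x: x / 2,        # u = h(x)
    ("u", "v"): lambda u: u - 1,        # v = s(u)  => v = s(h(x))
}
print(propagate("x", 10, invariants))
# {'x': 10, 'y': 20, 'u': 5.0, 'v': 4.0, 'z': 23}
```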

  11. Rules and Fault Model • Each alert rule consists of a predicate and an action. • Fault model of a rule: the probability of fault occurrence as a function of the measurement value x. • Ideal model: a step function that jumps from 0 to 1 at the threshold xT. • Realistic model: a smooth curve that rises gradually around xT, so values above xT may still be false positives and values below xT may be false negatives.
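As a concrete stand-in for the realistic curve, the sketch below uses a logistic function; the paper does not prescribe this exact shape, so the logistic form and the `steepness` parameter are assumptions.

```python
import math

def fault_probability(x, x_T, ideal=False, steepness=0.05):
    """Probability of fault occurrence for measurement value x.

    ideal     -- step function: 0 below the threshold x_T, 1 at or above it
    otherwise -- a smooth logistic curve (one plausible 'realistic' model)
    """
    if ideal:
        return 1.0 if x >= x_T else 0.0
    return 1.0 / (1.0 + math.exp(-steepness * (x - x_T)))

print(fault_probability(75, x_T=70))              # realistic: ~0.56
print(fault_probability(75, x_T=70, ideal=True))  # ideal step: 1.0
```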

  12. Probability of Reporting a True Positive • Even a very small false positive rate leads to a large number of false positive reports. Example: if one measurement is checked every minute and its FP rate is 0.1%, then 60x24x365x0.1% ≈ 526 FP reports are generated per year; now imagine thousands of measurements! Example: in a real operation support system, 80% of reports are FPs. • Importance of an alert: the Probability of Reporting a True Positive (PRTP) generated by the observed value x.
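The slide's false-positive arithmetic, spelled out as a quick check (the 1000-measurement figure is just an illustrative scale):

```python
# One rule checked once per minute with a 0.1% false-positive rate:
checks_per_year = 60 * 24 * 365          # 525,600 checks per year
print(checks_per_year * 0.001)           # 525.6 -> about 526 FP reports/year

# With, say, 1000 such measurements the FP volume scales linearly:
print(1000 * checks_per_year * 0.001)    # about 525,600 FP reports/year
```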

  13. Local Context Mapping to Global Context • In a three-tier system (Web, AP, DB), the four rules have different local semantics, so we map them into one global context using invariants: CPU%Web = fa(Network@AP), CPU%Web = fb(CPU%@DB), CPU%Web = fc(DiskUsg@Web). • In the fault model of CPU%Web, each rule's local threshold becomes an equivalent value on the x-axis: xCPU@DB = fb(60), xT = 70 (the CPU%Web threshold itself), xDiskUsg@Web = fc(150), xNetwork@AP = fa(35k). • Since the PRTP curve is increasing, Prob(true|xCPU@DB) > Prob(true|xT) > Prob(true|xDiskUsg@Web) > Prob(true|xNetwork@AP), which ranks the alerts: Alert 3 > Alert 1 > Alert 2 > Alert 4.
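A minimal sketch of this mapping-and-ranking step. The invariant functions `fa`, `fb`, `fc` below are toy stand-ins (the real ones are learned ARX models), and the logistic PRTP curve repeats the assumption made for slide 11.

```python
import math

def prtp(x, x_T, steepness=0.05):
    # Assumed logistic fault model (see slide 11), not the paper's exact curve.
    return 1.0 / (1.0 + math.exp(-steepness * (x - x_T)))

def rank_alerts(alerts, to_global, x_T):
    """Map each alert's local threshold into the global context
    (here CPU%Web) and sort by PRTP under the reference fault model."""
    scored = {name: prtp(to_global[name](v), x_T) for name, v in alerts.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Toy invariants standing in for the learned fa, fb, fc on the slide.
alerts = {
    "Alert 2: DiskUsg@Web > 150": 150,
    "Alert 3: CPU%@DB > 60": 60,
    "Alert 4: Network@AP > 35k": 35000,
}
to_global = {
    "Alert 2: DiskUsg@Web > 150": lambda v: v / 2.5,    # fc: maps to 60
    "Alert 3: CPU%@DB > 60": lambda v: v * 1.4,         # fb: maps to 84
    "Alert 4: Network@AP > 35k": lambda v: v / 1000.0,  # fa: maps to 35
}
# Alert 1 (CPU%@Web > 70) is the reference rule itself, with PRTP(70) = 0.5.
print(rank_alerts(alerts, to_global, x_T=70))
# Alert 3 ranks first, then Alert 2, then Alert 4; Alert 1 sits between 3 and 2.
```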

  14. Local Context Mapping to Global Context • Choosing the fault model of Network@AP as the global context instead gives Prob(true|xCPU@DB) > Prob(true|xCPU@Web) > Prob(true|xDiskUsg@Web) > Prob(true|xT), so the alert ranking is unchanged: Alert 3 > Alert 1 > Alert 2 > Alert 4. • The ranking does not depend on which measurement is chosen as the global context.

  15. Alerts Ranking Process (recap) • Online: 4. At the time alerts are received, rank the real alerts according to the sorted rules.

  16. Ranking Alerts (Case I) • Case I: we receive only alerts, no monitoring data from components. • The alert rules were already sorted offline with the system invariants network plus the operators' knowledge & configuration, e.g. Alert 3, 7, 2, 6, 1, 9, 5, 4, 8. • When alerts are generated (here 5 alerts: 1, 2, 3, 5, 7), they are ranked in the order of their sorted rules: Alert 3, 7, 2, 1, 5.
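Case I reduces to ordering the received alerts by the offline rule ranking; a small sketch using the slide's numbers:

```python
def rank_alerts_case1(received, sorted_rules):
    """Case I: no monitoring data, so rank received alerts purely by
    the offline ordering of their rules (most important first)."""
    order = {rule: i for i, rule in enumerate(sorted_rules)}
    return sorted(received, key=order.get)

# Rule order and received alerts from the slide.
sorted_rules = [f"Alert {i}" for i in (3, 7, 2, 6, 1, 9, 5, 4, 8)]
received = {"Alert 1", "Alert 2", "Alert 3", "Alert 5", "Alert 7"}
print(rank_alerts_case1(received, sorted_rules))
# ['Alert 3', 'Alert 7', 'Alert 2', 'Alert 1', 'Alert 5']
```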

  17. Ranking Alerts (Case II) • Case II: we receive both alerts and monitoring data from components. • Number of Threshold Violations (NTV): for each alerting measurement, count how many of the other rules' thresholds, mapped into this measurement's fault model via invariants, the observed value exceeds. • Example: the observed value X(CPU%Web) gives NTV=3 while the observed value X(Network%AP) gives NTV=2, so the alert by CPU%Web is more important than the one from Network%AP.
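A sketch of the NTV comparison; the observed values and mapped thresholds below are toy numbers chosen to reproduce the slide's conclusion, not the experiment's data.

```python
def ntv(observed, mapped_thresholds):
    """Number of Threshold Violations: how many of the rules'
    thresholds, mapped into this measurement's fault model via
    invariants, the observed value exceeds."""
    return sum(observed > t for t in mapped_thresholds)

# Toy numbers reproducing the slide's NTV=3 vs NTV=2 comparison.
ntv_cpu_web = ntv(81.6, [63.6, 70.0, 77.0, 85.0, 90.0])        # -> 3
ntv_net_ap  = ntv(31478, [29540, 30000, 32726, 33212, 36316])  # -> 2
# Higher NTV wins: the CPU%Web alert outranks the Network%AP alert.
print(ntv_cpu_web, ntv_net_ap)
```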

  18. Index • Introduction • Motivation & Goal • System Invariants • Invariants extraction • Value propagation • Collaborative peer review mechanism • Rules & Fault model • Ranking alerts • Experiment result • Conclusion

  19. Experimental System • Flow intensities measured on the test bed include: the number of EJBs created at time t, the JVM processing time at time t, and the number of SQL queries at time t. [Figure: example invariants among the flow intensities measured at points A, B, C and D]

  20. Extracted Invariants [Figure: the extracted invariant network connecting measurements m1-m6]

  21. Thresholds of Measurements [Table: local thresholds T of rules 1-6 on measurements m1-m6]

  22. Thresholds of Measurements [Table: each rule's local threshold propagated through the invariant network to its equivalent values on all measurements m1-m6]

  23. Ranking Alerts with NTVs (1) [Table: the propagated threshold matrix, the observed value of each measurement m1-m6, and the resulting NTV of each alerting measurement]

  24. Ranking Alerts with NTVs (1) [Figure: alert ranking result for the first experiment]

  25. Ranking Alerts with NTVs (2) [Table: propagated thresholds, observed values, and NTVs for the second experiment; only the alerting measurements have NTVs]

  26. Ranking Alerts with NTVs (2) • Injected problem: an SCP file copy on the Web server.

  27. Conclusion • We introduced a peer-review mechanism to rank alerts from heterogeneous components. • It maps the local thresholds of various rules into their equivalent values in a global context, based on a system invariants network model. • The ranking supports operators in prioritizing the problem determination process.

  28. Thank You! • Questions?
