
Automated Cellular Root Cause Analysis

Automated Cellular Root Cause Analysis. Sayandeep Sen, Bell Labs India. Joint work with Sourjya Bhaumik & Rijin John.



Presentation Transcript


  1. Automated Cellular Root Cause Analysis Sayandeep Sen, Bell Labs India. Joint work with Sourjya Bhaumik & Rijin John

  2. Cellular Base Station Monitoring Every 15 minutes Cell sites Monitoring Centre Cell site

  3. Cellular Base Station Monitoring Performance counters Every 15 minutes Cell sites Monitoring Centre Performance counters Example: connected users, average signal strength, cell radius etc. Cell site

  4. Cellular Base Station Monitoring KPI Every 15 minutes Cell sites Monitoring Centre KPI: Key Performance Indicator Example: Call drop rate, Successful connection setup rate, Throughput Cell site

  5. Root cause analysis KPI KPI Cell sites Monitoring Centre Why did the KPI go below the threshold? Performance counters Cell site Manually

  6. Root Cause Analysis – Issues Manual debugging is inefficient KPI • Too many variables • ~300 parameters • 1 engineer per O(100) cell sites Time Parameter 1 Time Parameter N Time

  7. Root Cause Analysis – Issues Manual debugging is inefficient KPI • Too many variables • ~300 parameters • 1 engineer per O(100) cell sites Time Parameter 1 Sporadic parameter dips ??? Time Parameter N Time

  8. Root Cause Analysis – Issues Manual debugging is inefficient KPI • Too many variables • ~300 parameters • 1 engineer per O(100) cell sites Time Parameter 1 Sporadic parameter dips Time Multiple parameter interaction Parameter N Time

  9. Problem Statement Carry out automated (fast) root cause analysis that accounts for sporadic dips and multiple parameter interactions while ensuring human-readable output.

  10. Outline • Motivation • Problem statement • Approach • Insight, Mechanism, Customizations • Results • Ongoing work • Other work

  11. Key Intuition KPI-parameter relationship is dependent on other parameter values

  12. Key Intuition Call Success Conn. Req. Handoff rate

  13. Key Intuition Call Success X Threshold y Conn. Req. Conn. Req. > X & H/o =y Handoff rate

  14. Key Intuition Call Success KPI-parameter relationship is dependent on other parameter values Determine the rules for various parameter combination values using Regression trees X’ Conn. Req. y’ Conn. Req. > X’ & H/o =y’ Handoff rate

  15. Outline • Motivation • Problem statement • Approach • Insight, Mechanism, Customizations • Results • Ongoing work • Other work

  16. Regression trees Call Success Δ Form clusters of points Δ’ Δ” To minimize the sum of distance metric for sub-clusters

  17. Regression trees Call Success Δ Form clusters of points Δ’ Δ” To minimize the sum of distance metric for sub-clusters Distance metric: sum of Euclidean distance of points in a sub-cluster Provide human readable rule for each cluster
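The Δ on these slides is the spread of KPI values inside one cluster. The slide only says "sum of Euclidean distance of points in a sub-cluster", so the sketch below assumes the usual regression-tree choice: the sum of squared distances from the cluster's mean KPI. The function name `cluster_spread` and the sample values are illustrative, not from the talk.

```python
from statistics import fmean

def cluster_spread(kpi_values):
    """Delta for one cluster: sum of squared distances from the cluster mean.

    Assumption: squared distance to the mean stands in for the slide's
    'sum of Euclidean distance of points in a sub-cluster'.
    """
    if not kpi_values:
        return 0.0
    center = fmean(kpi_values)
    return sum((v - center) ** 2 for v in kpi_values)

# Made-up call-success values: three good intervals, two bad ones.
whole = [99.0, 99.2, 98.9, 95.0, 94.8]
left, right = whole[:3], whole[3:]

delta = cluster_spread(whole)                                 # one big cluster
delta_split = cluster_spread(left) + cluster_spread(right)    # Δ' + Δ''
```

Splitting the good and bad intervals apart makes Δ' + Δ'' much smaller than Δ, which is the quantity the tree tries to minimize.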

  18. Regression trees Call Success 1) Pick an axis 2) Calculate Δ Conn. Req.

  19. Regression trees Call Success 1) Pick an axis X 2) Calculate Δ 3) Pick pivot to divide points into two clusters Conn. Req.

  20. Regression trees Call Success Δ” 1) Pick an axis X 2) Calculate Δ 3) Pick pivot to divide points into two clusters Δ’ 4) Calculate Δ’+Δ” Conn. Req.

  21. Regression trees Call Success 1) Pick an axis Repeat for all pivots X X X X 2) Calculate Δ 3) Pick pivot to divide points into two clusters 4) Calculate Δ’+Δ” Conn. Req.

  22. Regression trees Repeat for all axes Call Success 1) Pick an axis Repeat for all pivots 2) Calculate Δ 3) Pick pivot to divide points into two clusters 4) Calculate Δ’+Δ” Conn. Req.

  23. Regression trees Repeat for all axes Call Success 1) Pick an axis Repeat for all pivots X 2) Calculate Δ 3) Pick pivot to divide points into two clusters 4) Calculate Δ’+Δ” 5) Pick pivot with minimum Δ’+Δ” Conn. Req. Conn.Req<X Conn.Req>=X

  24. Regression trees Repeat for all axes Call Success 1) Pick an axis Repeat for all pivots X 2) Calculate Δ 3) Pick pivot to divide points into two clusters 4) Calculate Δ’+Δ” 5) Pick pivot with minimum Δ’+Δ” Conn. Req. Conn.Req<X Conn.Req>=X Repeat for sub-clusters

  25. Regression trees Repeat for all axes Call Success 1) Pick an axis Repeat for all pivots X 2) Calculate Δ 3) Pick pivot to divide points into two clusters Handoff rate 4) Calculate Δ’+Δ” Y 5) Pick pivot with minimum Δ’+Δ” Conn. Req. Conn.Req<X Conn.Req>=X Repeat for sub-clusters Handoff Rate < Y Handoff Rate >= Y

  26. Regression trees Repeat for all axes Call Success 1) Pick an axis Repeat for all pivots X 2) Calculate Δ 3) Pick pivot to divide points into two clusters Handoff rate 4) Calculate Δ’+Δ” Y 5) Pick pivot with minimum Δ’+Δ” Conn. Req. Conn.Req<X Conn.Req>=X Repeat for sub-clusters Handoff Rate < Y Handoff Rate >= Y Select rules corresponding to low KPI values
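The five steps, the "repeat for sub-clusters" loop and the final rule selection can be sketched in a few lines of plain Python. This is a toy version under stated assumptions, not the authors' implementation: Δ is taken as the sum of squared distances from the cluster mean, recursion stops at a fixed depth, and the names `spread`, `best_split`, `grow` and all data values are made up for illustration.

```python
from statistics import fmean

def spread(kpis):
    """Delta: sum of squared distances from the cluster mean (assumed metric)."""
    if not kpis:
        return 0.0
    mean = fmean(kpis)
    return sum((k - mean) ** 2 for k in kpis)

def best_split(rows, kpis, names):
    """Steps 1-5: try every axis and pivot, keep the minimum Δ' + Δ''."""
    best = None
    for axis in range(len(names)):
        # Skipping the smallest value keeps both sub-clusters non-empty.
        for pivot in sorted({row[axis] for row in rows})[1:]:
            left = [k for row, k in zip(rows, kpis) if row[axis] < pivot]
            right = [k for row, k in zip(rows, kpis) if row[axis] >= pivot]
            cost = spread(left) + spread(right)
            if best is None or cost < best[0]:
                best = (cost, axis, pivot)
    return best

def grow(rows, kpis, names, depth=2, path=()):
    """Repeat for sub-clusters; return (rule, mean KPI, size) for each leaf."""
    split = best_split(rows, kpis, names) if depth > 0 else None
    if split is None or split[0] >= spread(kpis):
        return [(" AND ".join(path) or "TRUE", fmean(kpis), len(kpis))]
    _, axis, pivot = split
    leaves = []
    for op in ("<", ">="):
        side = [(row, k) for row, k in zip(rows, kpis)
                if (row[axis] < pivot) == (op == "<")]
        leaves += grow([row for row, _ in side], [k for _, k in side],
                       names, depth - 1,
                       path + (f"{names[axis]} {op} {pivot}",))
    return leaves

# Made-up samples: (connection requests, handoff rate) -> call success (%).
names = ["conn_req", "handoff_rate"]
rows = [(100, 0.1), (200, 0.4), (300, 0.2), (400, 0.5),
        (600, 0.1), (700, 0.4), (800, 0.2), (900, 0.5)]
kpis = [99.5, 99.4, 99.6, 99.5, 99.3, 95.0, 99.4, 94.8]

leaves = grow(rows, kpis, names)
# Select rules corresponding to low KPI values (98.5% threshold from the talk).
low_rules = [rule for rule, mean_kpi, _ in leaves if mean_kpi < 98.5]
```

On this toy data neither parameter alone isolates the bad intervals; the selected rule combines both, which is the multi-parameter interaction the slides describe.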

  27. Regression trees Call Success X Human readable Handoff rate Capture sporadic events due to time agnostic clustering Y Conn. Req. Conn.Req<X Conn.Req>=X Capture multiple variable interaction Handoff Rate < Y Handoff Rate >= Y

  28. Outline • Motivation • Problem statement • Approach • Insight, Mechanism, Customizations • Results • Ongoing work • Other work

  29. Regression trees – Issues • Distance metric oblivious of significance of KPI values • Curse of dimensionality

  30. Metric oblivious KPI value significance Call Success Conn. Req. Need big separation between good and bad values Handoff rate

  31. Metric oblivious KPI value significance Call Success Conn. Req. Bad 98.5% Call Success Handoff rate

  32. Metric oblivious KPI value significance Call Success 98.7 % 98.5% 98.6% Conn. Req. Bad 98.5% Call Success Handoff rate

  33. Metric oblivious KPI value significance Call Success Distinction between good and bad is small 98.7 % 98.5% 98.6% Conn. Req. Bad 98.5% Call Success Handoff rate Stratify KPI values

  34. Metric oblivious KPI value significance Call Success Distinction between good and bad is small 98.7 % 98.5% 98.6% Conn. Req. Bad 98.5% Call Success Handoff rate Multiply KPI value with custom step function

  35. Stratification of data Call Success Distinction between good and bad is small 98.7 % 98.5% 98.6% Conn. Req. Bad 98.5% Call Success Handoff rate Multiply KPI value with custom step function

  36. Stratification of data Call Success Distinction between good and bad is small 98.7 % 98.5% 98.6% Conn. Req. Bad Call Success Handoff rate

  37. Stratification of data Call Success Distinction between good and bad is small 98.7 % 98.5% 98.6% Conn. Req. Bad 98.5% Call Success Handoff rate
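A minimal sketch of the stratification idea: multiplying each KPI value by a step function pushes "bad" values far away from "good" ones, so the tree's distance metric sees a large gap instead of 0.1 percentage points. The step height `bad_weight=0.5` is a hypothetical choice, and treating values at or below 98.5% as bad is an assumption; the talk does not give the actual function.

```python
def stratify(kpi, threshold=98.5, bad_weight=0.5):
    """Multiply the KPI by a step function: 1.0 above threshold, bad_weight otherwise.

    Both parameters are illustrative; only the 98.5% threshold appears on the slides.
    """
    step = 1.0 if kpi > threshold else bad_weight
    return kpi * step

raw = [98.7, 98.6, 98.5]
stratified = [stratify(v) for v in raw]

gap_before = raw[1] - raw[2]                  # good vs bad before stratification
gap_after = stratified[1] - stratified[2]     # same pair after stratification
```

With these values the gap between the last good and the first bad sample grows from about 0.1 to about 49, so a split that separates them reduces Δ’+Δ” far more than any split among the good samples.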

  38. Regression trees – Issues • Distance metric oblivious of significance of KPI values • Stratify KPI values • Curse of dimensionality • Dimensionality reduction

  39. Curse of Dimensionality Call Success Handoff rate < X & Conn. Req. < Y Cell Radius > X & Allotted Power < Y Traffic Load Interference Traffic Load > X & Interference > Y

  40. Curse of Dimensionality Call Success Handoff rate < X & Conn. Req. < Y ~300 variables lead to 2^300 combinations; regression tree can be misled Cell Radius > X & Allotted Power < Y Traffic Load Interference Traffic Load > X & Interference > Y

  41. Dimensionality reduction • Preprocessing • Remove correlated, barely changing parameters etc. • Domain knowledge based filtering • Remove unrelated parameters, apply weights • Heuristics • Spike, Correlation, 3 more …
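The preprocessing bullet (remove correlated and barely changing parameters) could look roughly like the sketch below. The thresholds `min_rel_std` and `max_corr`, the function names, and the counter names are all hypothetical; the talk does not say how "barely changing" or "correlated" were defined.

```python
from statistics import fmean, pstdev

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def prefilter(params, min_rel_std=0.01, max_corr=0.95):
    """Drop barely changing parameters and ones redundant with a kept parameter."""
    kept = {}
    for name, series in params.items():
        scale = abs(fmean(series)) or 1.0
        if pstdev(series) / scale < min_rel_std:
            continue  # barely changing
        if any(abs(pearson(series, other)) > max_corr for other in kept.values()):
            continue  # strongly correlated with a parameter already kept
        kept[name] = series
    return kept

# Made-up counters for one cell site.
params = {
    "connected_users": [10, 20, 30, 40, 50],
    "connected_users_x2": [20, 40, 60, 80, 100],   # duplicate information
    "cell_radius_km": [5.0, 5.0, 5.0, 5.0, 5.0],   # never changes
    "avg_signal_dbm": [-90, -70, -95, -60, -85],
}
kept = prefilter(params)
```

Only `connected_users` and `avg_signal_dbm` survive here, which shrinks the number of axes the regression tree has to search.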

  42. Spike heuristic Values spike around same time Call Success Time Time
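One way to read the spike heuristic: keep a parameter if its unusual values occur at about the same time as the KPI's unusual values. The sketch below is an assumption about how that could be scored; the 2-sigma spike definition, the one-interval window and the function names are not from the talk.

```python
from statistics import fmean, pstdev

def spike_indices(series, k=2.0):
    """Indices whose value lies more than k standard deviations from the mean."""
    mean, std = fmean(series), pstdev(series)
    if std == 0:
        return []
    return [i for i, v in enumerate(series) if abs(v - mean) > k * std]

def spike_overlap(kpi, param, k=2.0, window=1):
    """Fraction of KPI spikes that have a parameter spike within `window` intervals."""
    kpi_spikes = spike_indices(kpi, k)
    param_spikes = spike_indices(param, k)
    if not kpi_spikes:
        return 0.0
    hits = sum(any(abs(i - j) <= window for j in param_spikes) for i in kpi_spikes)
    return hits / len(kpi_spikes)

# Made-up 15-minute series: the KPI dips at interval 5, load spikes at interval 6.
kpi = [99] * 5 + [90] + [99] * 6
load = [10] * 6 + [80] + [10] * 5
steady = [10, 11] * 6

load_score = spike_overlap(kpi, load)
steady_score = spike_overlap(kpi, steady)
```

A parameter like `load` that scores high would be kept for the regression tree, while `steady` would be a candidate for removal.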

  43. Correlation heuristic Call Success > 98.5 % Call Success <= 98.5 % Call Success Call Success Conn. Req. Conn. Req. Correlation changes significantly
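The correlation heuristic compares how a parameter relates to the KPI when the KPI is good (> 98.5%) versus bad (<= 98.5%). A rough sketch, assuming Pearson correlation and the absolute difference as the "change" measure; both are illustrative choices, as are the names and data.

```python
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def correlation_shift(param, kpi, threshold=98.5):
    """|corr(param, KPI | KPI good) - corr(param, KPI | KPI bad)|."""
    good = [(p, k) for p, k in zip(param, kpi) if k > threshold]
    bad = [(p, k) for p, k in zip(param, kpi) if k <= threshold]
    if len(good) < 2 or len(bad) < 2:
        return 0.0  # not enough samples on one side to compare
    r_good = pearson([p for p, _ in good], [k for _, k in good])
    r_bad = pearson([p for p, _ in bad], [k for _, k in bad])
    return abs(r_good - r_bad)

# Made-up data: connection requests are unrelated to call success while it is
# good, but track it closely once it turns bad.
conn_req = [100, 200, 300, 400, 500, 600, 700, 800]
call_success = [99.0, 99.2, 99.2, 99.0, 98.4, 97.9, 97.2, 96.5]

shift = correlation_shift(conn_req, call_success)
```

A large shift, as in this example, suggests the parameter matters specifically during KPI dips and is worth keeping.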

  44. Rule generation Stratify KPI data Data store Apply filters Regression tree Select rules Rule store

  45. Rule application Matching rules Rule store

  46. Outline • Motivation • Problem statement • Approach • Insight, Mechanism, Customizations • Results • Ongoing work • Other work

  47. Training & Verification Data • Analyzed 28 days of data from 217 cell sites • 2 countries, 2 OEMs • 317 parameters at 15-minute intervals • 80% of data to train and 20% to validate
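For scale: 28 days at one sample every 15 minutes is 28 × 96 = 2688 samples per parameter per cell site. The sketch below splits such a series 80/20. It assumes a chronological split (train on the earlier intervals, validate on the later ones); the talk does not say whether the split was chronological or random.

```python
INTERVALS_PER_DAY = 24 * 60 // 15   # 96 fifteen-minute intervals per day

def split_train_validate(samples, train_fraction=0.8):
    """Chronological split: first part for training, remainder for validation."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

# Stand-in for one cell site's time-ordered samples over 28 days.
samples = list(range(28 * INTERVALS_PER_DAY))
train, validate = split_train_validate(samples)
```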

  48. Find rules for all KPI dips Instances Instances KPI KPI Country #2 (60 cell sites) Country #1 (18 cell sites) Cell sites with at least 4 KPIs with more than 100 bad instances selected

  49. Rule Verification • Picked rules for 50 randomly selected KPI dips • Show rules to 15 RF engineers (Ongoing) • 80% of rules were actionable • For all the KPI dips, at least one actionable rule in the rule set

  50. Example rule set KPI dip: Call success rate < 98.5% 1) Total users in 5 to 10 km from base station > 63% 2) Total users in bad RSS region > 21% AND Total uplink load > 831 MB 3) Download Traffic < 500 Kbytes AND Total active users < 200
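To show how a stored rule set like this could be applied to a new KPI dip, the sketch below encodes the three example rules as predicates and returns the ones that match a set of counters. The counter key names and the sample values are hypothetical; only the thresholds come from the slide.

```python
# Each rule: (human-readable text, predicate over a dict of counters).
# The dictionary keys are made-up names for the counters on the slide.
RULES = [
    ("Total users in 5 to 10 km from base station > 63%",
     lambda c: c["users_5_10km_pct"] > 63),
    ("Total users in bad RSS region > 21% AND Total uplink load > 831 MB",
     lambda c: c["users_bad_rss_pct"] > 21 and c["uplink_load_mb"] > 831),
    ("Download Traffic < 500 Kbytes AND Total active users < 200",
     lambda c: c["download_traffic_kbytes"] < 500 and c["active_users"] < 200),
]

def matching_rules(counters):
    """Return the text of every rule whose condition holds for these counters."""
    return [text for text, predicate in RULES if predicate(counters)]

# Made-up counters from an interval where call success dropped below 98.5%.
dip = {
    "users_5_10km_pct": 40,
    "users_bad_rss_pct": 25,
    "uplink_load_mb": 900,
    "download_traffic_kbytes": 12000,
    "active_users": 450,
}
matches = matching_rules(dip)
```

Here only the second rule fires, pointing the engineer at poor signal coverage combined with heavy uplink load.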
