
Online System Problem Detection by Mining Console Logs

Wei Xu*, Ling Huang†, Armando Fox*, David Patterson*, Michael Jordan*. *UC Berkeley, †Intel Labs Berkeley.


Presentation Transcript


  1. Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang† Armando Fox* David Patterson* Michael Jordan* • *UC Berkeley † Intel Labs Berkeley

  2. Why console logs? • Detecting problems in large scale Internet services often requires detailed instrumentation • Instrumentation can be costly to insert & maintain • High code churn • Often combine open-source building blocks that are not all instrumented • Can we use console logs in lieu of instrumentation? + Easy for developer, so nearly all software has them – Imperfect: not originally intended for instrumentation

  3. Problems we are looking for • The easy case: rare messages (NORMAL vs. ERROR) • Harder but useful: abnormal sequences • Example log: receiving blk_1, received blk_1, receiving blk_2. What is wrong with blk_2?

  4. Overview and Contributions • Pipeline: free text logs (200 nodes) → parsing* → frequent pattern based filtering → PCA detection • Frequent pattern based filtering: dominant cases are marked OK; non-pattern events go on to PCA detection • PCA detection: normal cases are marked OK; real anomalies are marked ERROR • Accurate online detection with small latency • * Large-scale system problem detection by mining console logs (SOSP '09)

  5. Constructing event traces from console logs • Example messages: receiving blk_1, received blk_1, reading blk_1, receiving blk_2, received blk_2 • Parse: message type + variables • Group messages by identifiers (automatically discovered) • Group ~= event trace
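The parse-and-group step can be illustrated with a minimal Python sketch. It assumes simplified log lines and a hard-coded `blk_<digits>` identifier pattern; the slides say identifiers are discovered automatically, which this sketch does not attempt. The names `LOG_LINES`, `BLOCK_ID`, `parse`, and `build_traces` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical simplified log lines; real HDFS messages carry more fields.
LOG_LINES = [
    "receiving blk_1",
    "receiving blk_2",
    "received blk_1",
    "reading blk_1",
    "received blk_2",
]

# Assumption: an identifier is "blk_" followed by digits (hard-coded here,
# not automatically discovered as in the talk).
BLOCK_ID = re.compile(r"blk_\d+")

def parse(line):
    """Split a line into (message type, identifier)."""
    match = BLOCK_ID.search(line)
    if match is None:
        return line.strip(), None
    identifier = match.group(0)
    # The message type is the line with the variable part masked out.
    message_type = BLOCK_ID.sub("<blk>", line).strip()
    return message_type, identifier

def build_traces(lines):
    """Group message types by identifier, preserving arrival order."""
    traces = defaultdict(list)
    for line in lines:
        message_type, identifier = parse(line)
        if identifier is not None:
            traces[identifier].append(message_type)
    return dict(traces)

traces = build_traces(LOG_LINES)
for identifier, trace in traces.items():
    print(identifier, trace)
```

Each value in `traces` is one group of messages sharing an identifier, which is what the slide calls an event trace.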

  6. Online detection: When to make detection? • Example trace over time: receiving blk_1, received blk_1, reading blk_1, deleting blk_1, deleted blk_1 • Cannot wait for the entire trace: it can last an arbitrarily long time • How long do we have to wait? Long enough to keep correlations; a wrong cut = false positive • Difficulties: no obvious boundaries; inaccurate message ordering; variations in session duration

  7. Frequent patterns help determine session boundaries • Key Insight: Most messages/traces are normal • Strong patterns • “Make common paths fast” • Tolerate noise

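  8. Two stage detection overview • Pipeline: free text logs (200 nodes) → parsing → frequent pattern based filtering → PCA detection • Stage 1, frequent pattern based filtering: dominant cases are marked OK; non-pattern events go on to PCA detection • Stage 2, PCA detection: normal cases are marked OK; real anomalies are marked ERROR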
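As a rough illustration of how stage 1 might route sessions online, the Python sketch below marks an identifier as OK once its events match a known frequent pattern within a time limit, forwards it to stage 2 once the limit has passed without a match, and otherwise keeps waiting. The single pattern, the 10-second limit, and the names `PATTERN`, `TIMEOUT`, and `stage_one` are all assumptions made for the example, not values from the talk.

```python
# Hypothetical frequent pattern: the event set of one complete session.
PATTERN = frozenset({"receiving", "received", "reading"})
TIMEOUT = 10  # seconds; assumed value, in practice taken from a duration model

# Hypothetical timestamped events: (seconds, identifier, event type).
EVENTS = [
    (0, "blk_1", "receiving"), (1, "blk_1", "received"), (2, "blk_1", "reading"),
    (0, "blk_2", "receiving"),
    (95, "blk_3", "receiving"),
]

def stage_one(events, now):
    """Route each identifier to 'ok', 'to_pca', or 'pending' at time `now`."""
    first_seen, seen = {}, {}
    for timestamp, identifier, event in sorted(events):
        start = first_seen.setdefault(identifier, timestamp)
        # Only events inside the time limit count toward the pattern.
        if timestamp - start <= TIMEOUT:
            seen.setdefault(identifier, set()).add(event)
    decisions = {}
    for identifier, start in first_seen.items():
        if seen[identifier] == PATTERN:
            decisions[identifier] = "ok"        # dominant case, filtered out
        elif now - start > TIMEOUT:
            decisions[identifier] = "to_pca"    # non-pattern, goes to stage 2
        else:
            decisions[identifier] = "pending"   # still inside the wait window
    return decisions

decisions = stage_one(EVENTS, now=100)
print(decisions)
```

In this sketch a session that never completes its pattern is only decided after the timeout expires, so the wait time sets the detection latency for those sessions.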

  9. Stage 1 - Frequent patterns (1): Frequent event sets • Coarse cut by time • Find frequent item set • Refine time estimation • Repeat until all patterns found • Events that match no pattern go to PCA detection • Example trace over time: receiving blk_1, received blk_1, reading blk_1, error blk_1, deleting blk_1, deleted blk_1
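One pass of "coarse cut by time" followed by "find frequent item set" could look like the Python sketch below. It uses a brute-force subset count instead of an optimized frequent-itemset miner, and it leaves out the time refinement and the repeat loop. The sample events, the 10-second window, the 0.75 support threshold, and the function names are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical timestamped events: (seconds, identifier, event type).
EVENTS = [
    (0, "blk_1", "receiving"), (1, "blk_1", "received"), (2, "blk_1", "reading"),
    (0, "blk_2", "receiving"), (2, "blk_2", "received"), (3, "blk_2", "reading"),
    (1, "blk_3", "receiving"), (2, "blk_3", "received"), (90, "blk_3", "deleting"),
    (5, "blk_4", "receiving"), (6, "blk_4", "received"), (7, "blk_4", "reading"),
]

def coarse_cut(events, window):
    """Per identifier, keep the event types seen within `window` seconds of its first event."""
    first_seen, sessions = {}, {}
    for timestamp, identifier, event in sorted(events):
        start = first_seen.setdefault(identifier, timestamp)
        if timestamp - start <= window:
            sessions.setdefault(identifier, set()).add(event)
    return sessions

def frequent_event_sets(sessions, min_support):
    """Return every event set contained in at least `min_support` of the sessions."""
    alphabet = sorted(set().union(*sessions.values()))
    frequent = {}
    # Brute force over all subsets; fine for a handful of event types only.
    for size in range(1, len(alphabet) + 1):
        for candidate in combinations(alphabet, size):
            hits = sum(1 for s in sessions.values() if set(candidate) <= s)
            support = hits / len(sessions)
            if support >= min_support:
                frequent[frozenset(candidate)] = support
    return frequent

sessions = coarse_cut(EVENTS, window=10)
patterns = frequent_event_sets(sessions, min_support=0.75)
for event_set, support in sorted(patterns.items(), key=lambda kv: -len(kv[0])):
    print(sorted(event_set), support)
```

The late `deleting` event for `blk_3` falls outside the coarse window, so it does not prevent the shorter session pattern from being counted.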

  10. Stage 1 - Frequent patterns (2): Modeling session duration time • Plots: count vs. duration; Pr(X>=x) vs. duration • Assuming Gaussian? 99.95th percentile estimation is off by half; 45% more false alarms • Mixture distribution: power-law tail + histogram head
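To see why a Gaussian fit can misjudge a high percentile of heavy-tailed durations, and what a "histogram head + power-law tail" estimate might look like, here is a small Python sketch. It runs on synthetic, deterministic Pareto-shaped data rather than the HDFS durations, so it does not reproduce the figures on the slide. The 0.9 head/tail split, the Hill estimator for the tail exponent, and all function names are assumptions chosen for the example.

```python
import math
from statistics import NormalDist, mean, stdev

def synthetic_durations(n, alpha):
    """Deterministic heavy-tailed sample (Pareto quantiles on an even grid); illustrative only."""
    return [(1.0 - (i + 0.5) / n) ** (-1.0 / alpha) for i in range(n)]

def gaussian_quantile(data, p):
    """Quantile under the assumption that durations are normally distributed."""
    return NormalDist(mean(data), stdev(data)).inv_cdf(p)

def mixture_quantile(data, p, head_fraction=0.9):
    """Empirical head below the `head_fraction` quantile, power-law tail above it."""
    ordered = sorted(data)
    cut = int(head_fraction * len(ordered))
    if p < head_fraction:
        return ordered[int(p * len(ordered))]
    x_min = ordered[cut]
    tail = ordered[cut:]
    # Hill estimator of the tail exponent from samples at or above x_min.
    alpha = len(tail) / sum(math.log(x / x_min) for x in tail)
    # Invert the tail model Pr(X >= x) = (1 - head_fraction) * (x / x_min) ** (-alpha).
    return x_min * ((1.0 - p) / (1.0 - head_fraction)) ** (-1.0 / alpha)

ALPHA = 1.5   # assumed tail exponent of the synthetic data
P = 0.9995    # the 99.95th percentile mentioned on the slide
data = synthetic_durations(10_000, ALPHA)
true_quantile = (1.0 - P) ** (-1.0 / ALPHA)

print("true     ", true_quantile)
print("gaussian ", gaussian_quantile(data, P))
print("mixture  ", mixture_quantile(data, P))
```

On this synthetic sample the Gaussian estimate falls well below the true quantile while the mixture estimate stays close to it; a timeout set too low in this way would cut long but normal sessions short.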

  11. Stage 2 - Handling noise with PCA detection • Non-pattern events from stage 1 go to PCA detection: normal cases are marked OK; real anomalies are marked ERROR • More tolerant to noise • Principal Component Analysis (PCA) based detection
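The idea behind PCA-based detection can be sketched in plain Python: learn the dominant direction of normal event-count vectors, then score a new vector by how much of it lies outside that direction (the squared prediction error). This sketch keeps a single principal component found by power iteration and uses a fixed cutoff; the count vectors, the cutoff, and the function names are assumptions, and the slides do not specify how the threshold is chosen.

```python
# Hypothetical event count vectors, one per trace: [receiving, received, reading].
TRAINING = [[1, 1, 0], [2, 2, 1], [3, 3, 2], [1, 1, 0], [2, 2, 1], [3, 3, 2]]
THRESHOLD = 1.0  # assumed fixed cutoff, for illustration only

def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def top_component(rows, means, iterations=200):
    """Power iteration on X^T X to approximate the first principal direction."""
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    dim = len(means)
    # Start vector; assumed not orthogonal to the dominant direction.
    v = [1.0 / dim ** 0.5] * dim
    for _ in range(iterations):
        # w = X^T (X v), computed without forming the covariance matrix.
        proj = [sum(a * b for a, b in zip(row, v)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered)) for j in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def spe(vector, means, component):
    """Squared prediction error: squared length of the part outside the normal subspace."""
    y = [x - m for x, m in zip(vector, means)]
    score = sum(a * b for a, b in zip(y, component))
    residual = [a - score * b for a, b in zip(y, component)]
    return sum(r * r for r in residual)

means = column_means(TRAINING)
component = top_component(TRAINING, means)

for trace in ([2, 2, 1], [3, 1, 1]):
    error = spe(trace, means, component)
    print(trace, "ERROR" if error > THRESHOLD else "OK")
```

Here `[3, 1, 1]` has three `receiving` events but only one `received`, which breaks the correlation seen in the training vectors and yields a large residual.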

  12. Frequent pattern matching filters most of the normal events • Parsing*: 100% of events • Frequent pattern based filtering: 86% are dominant cases, marked OK; 14% are non-pattern events, passed to PCA detection • PCA detection: 13.97% are normal cases, marked OK; 0.03% are real anomalies, marked ERROR

  13. Evaluation setup • Hadoop file system (HDFS) • Experiment on Amazon’s EC2 cloud • 203 nodes x 48 hours • Running standard map-reduce jobs • ~24 million lines of console logs • 575,000 traces • ~ 680 distinct ones • Manual label from previous work • Normal/abnormal + why it is abnormal • “Eventually normal” – did not consider time • For evaluation only

  14. Frequent patterns in HDFS (Total events ~20 million) • Covers most messages • Short durations

  15. Detection latency • Categories shown: frequent pattern (matched), single event pattern, frequent pattern (timed out), non pattern events • Detection latency is dominated by the wait time

  16. Detection accuracy (Total traces = 575,319) • Ambiguity on "abnormal" • Manual labels: "eventually normal" • More than 600 false positives (FPs) in online detection come from traces with very long latency • E.g. a write session takes >500 sec to complete (99.99th percentile is 20 sec)

  17. Future work • Distributed log stream processing • Handle large scale cluster + partial failures • Clustering alarms • Allowing feedback from operators • Correlation on logs from multiple applications / layers

  18. Summary • Pipeline: free text logs (200 nodes, 24 million lines) → parsing → frequent pattern based filtering (dominant cases OK) → PCA detection of non-pattern events (normal cases OK; real anomalies ERROR) • http://www.cs.berkeley.edu/~xuw/ • Wei Xu <xuw@cs.berkeley.edu>
