
Detecting Large-Scale System Problems by Mining Console Logs

Wei Xu (徐葳)*, Ling Huang†, Armando Fox*, David Patterson*, Michael Jordan*. *UC Berkeley, †Intel Labs Berkeley. Jan 18, 2009, MSRA.





Presentation Transcript


  1. Detecting Large-Scale System Problems by Mining Console Logs Wei Xu (徐葳)* Ling Huang† Armando Fox* David Patterson* Michael Jordan* • *UC Berkeley † Intel Labs Berkeley • Jan 18, 2009, MSRA

  2. Why console logs? • Detecting problems in large-scale Internet services often requires detailed instrumentation • Instrumentation can be costly to insert and maintain • Services often combine open-source building blocks that are not all instrumented • High code churn • Can we use console logs in lieu of instrumentation? + Easy for developers, so nearly all software has them – Imperfect: not originally intended for instrumentation

  3. The ugly console log • “HODIE NATUS EST RADICI FRATER” – “today unto the Root a brother is born” * “that crazy Multics error message in Latin.” http://www.multicians.org/hodie-natus-est.html

  4. Result preview • Pipeline: Parse → Detect → Visualize • Input: 200 nodes, >24 million lines of logs • Output: abnormal log segments, summarized in a single-page visualization • Fully automatic process without any manual input

  5. Outline • Key ideas and general methodology • Methodology illustrated with Hadoop case study • Four step approach • Evaluation • An extension: supporting online detection • Challenges • Two stage detection • Evaluation • Related work and future directions

  6. Our approach and contribution • Pipeline: Parsing → Feature Creation → Machine Learning → Visualization • A general methodology for processing console logs automatically (SOSP ’09) • A novel method for online detection (ICDM ’09) • Validation on real systems • Working on production data from real companies

  7. Key insights for analyzing logs • Example: “receiving blk_1” followed by “received blk_1” is NORMAL; “receiving blk_2” with no matching “received” is an ERROR • Console logs are inherently structured • Determined by the log printing statement • The log contains the necessary information to create features • Identifiers • State variables • Correlations among messages (sequences)

  8. Outline • Key ideas and general methodology • Methodology illustrated with Hadoop case study • Four step approach • Evaluation • An extension: supporting online detection • Challenges • Two stage detection • Evaluation • Related work and future directions

  9. Step 1: Parsing • Basic idea: free text → semi-structured text • Example: Log.info(“Receiving block ” + blockId) yields the template “Receiving block (.*)” with variable [blockId]; the line “Receiving block blk_1” parses to Type: Receiving block, Variables: blockId(String)=blk_1 • Non-trivial in object-oriented languages • String representation of objects • Sub-classing • Our solution • Type inference on source code • Pre-compute all possible message templates
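The template-matching half of this step can be sketched as follows. The templates and variable names below are hypothetical stand-ins for those the type-inference pass would pre-compute from the source code:

```python
import re

# Hypothetical templates, as would be pre-computed from log printing
# statements (e.g. Log.info("Receiving block " + blockId)).
TEMPLATES = [
    ("Receiving block", re.compile(r"^Receiving block (\S+)$"), ("blockId",)),
    ("Received block", re.compile(r"^Received block (\S+) of size (\d+)$"),
     ("blockId", "size")),
]

def parse_line(line):
    """Turn a free-text log line into (message type, variable bindings),
    or return None for a line matching no known template."""
    for msg_type, pattern, var_names in TEMPLATES:
        m = pattern.match(line)
        if m:
            return msg_type, dict(zip(var_names, m.groups()))
    return None
```

For example, `parse_line("Receiving block blk_1")` yields `("Receiving block", {"blockId": "blk_1"})`, i.e. the message type plus the extracted variable.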

  10. Parsing: Scale parsing with map-reduce • 24 million lines of console logs = 203 nodes × 48 hours

  11. Parsing: Language specific parsers • Source code based parsers • Java (OO Language) • C / C++ (Macros and complex format strings) • Python (Scripting/dynamic languages) • Binary based parsers • Java byte code • Linux + Windows binaries using printf format strings

  12. Step 2: Feature creation – Message count vector • Example traces: receiving blk_1, received blk_1, received blk_1 … receiving blk_2, receiving blk_2, received blk_2 • Identifiers are widely used in logs • Variables that identify objects manipulated by the program • file names, object keys, user ids • Grouping by identifiers • Similar to execution traces • Identifiers can be discovered automatically

  13. Feature creation – Message count vector example • Numerical representation of these “traces” • Similar to the bag-of-words model in information retrieval • blk_1: [0 1 2 0 0 2 0 0 0 0 0 0 0 0 2 2] • blk_2: [0 0 1 2 0 0 2 0 0 0 0 0 0 0 2 1]
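A minimal sketch of building these message count vectors, assuming the parsing step has already turned each line into a (message type, identifier) pair; the message types and identifiers below are illustrative:

```python
from collections import Counter, defaultdict

# Parsed (message type, identifier) pairs -- illustrative data
EVENTS = [
    ("Receiving block", "blk_1"), ("Receiving block", "blk_2"),
    ("Received block", "blk_1"), ("Received block", "blk_2"),
    ("Received block", "blk_1"),
]
MSG_TYPES = ["Receiving block", "Received block"]  # vector dimensions

def message_count_vectors(events):
    """Group messages by identifier and count each message type --
    a bag-of-words over message types, one vector per identifier."""
    groups = defaultdict(Counter)
    for msg_type, ident in events:
        groups[ident][msg_type] += 1
    return {ident: [c[t] for t in MSG_TYPES] for ident, c in groups.items()}
```

With the data above, `message_count_vectors(EVENTS)` gives `{"blk_1": [1, 2], "blk_2": [1, 1]}` — one count vector per block, exactly the feature the PCA step consumes.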

  14. Step 3: Machine learning – PCA anomaly detection • Each trace becomes a count vector, e.g. [0 1 2 0 0 2 0 0 0 0 0 0 0 0 2 2] • Most of the vectors are normal • Detecting abnormal vectors • Principal Component Analysis (PCA) based detection • PCA captures normal patterns in these vectors • Based on correlations among dimensions of the vectors
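A sketch of PCA-based detection in this spirit, using only NumPy: vectors are scored by their squared residual outside the subspace spanned by the top principal components. The data and the choice of k here are illustrative, not the paper's exact detector:

```python
import numpy as np

def pca_anomaly_scores(X, k=1):
    """Score each vector by its squared residual after projecting onto the
    top-k principal components, which capture the normal correlations."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                    # basis of the "normal" subspace (d x k)
    residual = Xc - Xc @ P @ P.T    # part not explained by normal patterns
    return (residual ** 2).sum(axis=1)

# 99 normal traces ("receiving" count == "received" count) and one
# abnormal trace whose block was never received -- illustrative data
rng = np.random.default_rng(0)
counts = rng.integers(1, 10, size=99).astype(float)
X = np.vstack([np.column_stack([counts, counts]), [5.0, 0.0]])
scores = pca_anomaly_scores(X, k=1)
```

The abnormal trace breaks the receiving/received correlation, so its residual score dwarfs those of the normal vectors, even though its individual counts look unremarkable.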

  15. Evaluation setup • Experiment on Amazon’s EC2 cloud • 203 nodes × 48 hours • Running standard map-reduce jobs • ~24 million lines of console logs • ~575,000 HDFS blocks • 575,000 vectors • ~680 distinct ones • Manually labeled each distinct case • Normal/abnormal • Tried to learn why it was abnormal • Labels used for evaluation only

  16. PCA detection results • [Chart: detected anomalies, including some false positives] • How can we make the results easy for operators to understand?

  17. Step 4: Visualizing results with a decision tree • writeBlock # received exception ≥1 → ERROR • # Starting thread to transfer block # to # ≥3 → ERROR • #: Got exception while serving # to #:# ≥1 → OK • Unexpected error trying to delete block #. BlockInfo not found in volumeMap ≥1 → ERROR • addStoredBlock request received for # on # size # but it does not belong to any file ≥1 → ERROR • # starting thread to transfer block # to # ≥1 → OK • # Verification succeeded for # ≥1 → OK • Receiving block # src: # dest: # ≤2 → ERROR, ≥3 → OK
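A tree like this can be reproduced in spirit by fitting a shallow decision tree on the count vectors with the PCA detector's verdicts as labels; the data and feature names below are toy stand-ins, and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy message-count vectors: columns are counts of two message types per
# block; label 1 marks vectors the PCA detector flagged as abnormal.
X = np.array([[1, 2], [1, 2], [2, 4], [1, 2], [1, 0], [2, 0]])
y = np.array([0, 0, 0, 0, 1, 1])

# A shallow tree turns the detector's decision boundary into
# human-readable threshold rules over individual message types.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["Receiving block", "Received block"]))
```

The printed rules read like the slide's tree ("Received block ≤ threshold → abnormal"), which is why a tree makes a good one-page explanation of an otherwise opaque detector.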

  18. The methodology is general • Surveyed a number of software apps • Linux kernel, OpenSSH • Apache, MySQL, Jetty • Cassandra, Nutch • Our methods apply to virtually all of them • Same methodology worked on multiple features / ML algorithms • Another feature “state count vector” • Graph visualization • WIP: Analyzing production logs at Google

  19. Outline • Key ideas and general methodology • Methodology illustrated with Hadoop case study • Four step approach • Evaluation • An extension: supporting online detection • Challenges • Two stage detection • Evaluation • Related work and future directions

  20. What makes it hard to do online detection? • Pipeline: Parsing → Feature Creation → PCA Detection → Visualization • The message count vector feature is based on sequences, NOT individual messages

  21. Online detection: when to run detection? • Example trace over time: receiving blk_1 … received blk_1 … reading blk_1 … deleting blk_1, deleted blk_1 • Cannot wait for the entire trace • It can last an arbitrarily long time • How long do we have to wait? • Long enough to keep correlations • A wrong cut = a false positive • Difficulties • No obvious boundaries • Inaccurate message ordering • Variations in session duration

  22. Frequent patterns help determine session boundaries • Key Insight: Most messages/traces are normal • Strong patterns • “Make common paths fast” • Tolerate noise

  23. Two stage detection overview • Free-text logs (200 nodes) → Parsing → Frequent-pattern-based filtering • Dominant cases → OK • Non-pattern events → PCA Detection • Normal cases → OK • Real anomalies → ERROR
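The two stages can be sketched as a cheap filter in front of the PCA detector; the frequent patterns and the detector stub below are hypothetical:

```python
# Frequent event sets mined offline from historical sessions (hypothetical)
FREQUENT_PATTERNS = {
    frozenset({"Receiving block", "Received block"}),
    frozenset({"Receiving block", "Received block", "Verification succeeded"}),
}

def detect(session_events, pca_detect):
    """Stage 1: sessions matching a frequent pattern pass as OK immediately.
    Stage 2: everything else goes to the (more expensive) PCA detector."""
    if frozenset(session_events) in FREQUENT_PATTERNS:
        return "OK"
    return "ERROR" if pca_detect(session_events) else "OK"
```

With a stub stage-2 detector that flags everything (`lambda ev: True`), a complete session `{"Receiving block", "Received block"}` returns "OK" via the fast path, while an incomplete `{"Receiving block"}` falls through and returns "ERROR" — matching the flow in the diagram above.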

  24. Stage 1 – Frequent patterns (1): Frequent event sets • Coarse cut by time → find frequent item set → refine time estimation; repeat until all patterns are found • Example events: receiving blk_1, received blk_1, reading blk_1, error blk_1, deleting blk_1, deleted blk_1 • Remaining non-pattern events go to PCA detection

  25. Stage 1 – Frequent patterns (2): Modeling session duration • [Plots: duration histogram and tail probability Pr(X ≥ x) vs. duration] • Cannot assume Gaussian • The 99.95th percentile estimate is off by half • 45% more false alarms • Mixture distribution • Power-law tail + histogram head
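The slide's point — that a Gaussian fit badly underestimates a high duration percentile when durations are heavy-tailed — can be checked numerically. The Pareto tail index below is an arbitrary illustration, not fitted to the HDFS data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Heavy-tailed session durations: Pareto tail, as the slide observes
durations = rng.pareto(2.5, size=100_000) + 1.0

# Timeout cutoff at the 99.95th percentile under a Gaussian assumption...
mu, sigma = durations.mean(), durations.std()
gaussian_cutoff = mu + 3.29 * sigma   # z-score for the 99.95th percentile

# ...versus the true empirical 99.95th percentile of the sample
empirical_cutoff = np.quantile(durations, 0.9995)

# The Gaussian cutoff falls far short of the empirical one, so normal
# long-running sessions would be cut early and reported as false alarms.
```

On this synthetic data the Gaussian cutoff lands at a fraction of the empirical percentile, which is exactly why the authors model the tail with a power law instead.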

  26. Frequent pattern matching filters most of the normal events • Free-text logs (200 nodes, 100%) → Parsing* → Frequent-pattern-based filtering • Dominant cases (86%) → OK • Non-pattern events (14%) → PCA Detection • Normal cases (13.97%) → OK • Real anomalies (0.03%) → ERROR

  27. Frequent patterns in HDFS (Total events ~20 million) • Covers most messages • Short durations

  28. Detection latency • [Chart: latency by event class – frequent pattern (matched), single-event pattern, frequent pattern (timed out), non-pattern events] • Detection latency is dominated by the wait time

  29. Detection accuracy (total traces = 575,319) • Ambiguity in “abnormal” • Manual labels: “eventually normal” • >600 false positives in online detection due to very long latencies • E.g. a write session takes >500 sec to complete (the 99.99th percentile is 20 sec)

  30. Outline • Key ideas and general methodology • Methodology illustrated with Hadoop case study • Four step approach • Evaluation • An extension: supporting online detection • Challenges • Two stage detection • Evaluation • Related work and future directions

  31. Related work: system monitoring and problem diagnosis (ordered from heavier instrumentation / finer-granularity data to passive collection) • Source/binary instrumentation: aspect-oriented programming, assertions • (Passive) data collection: syslog, SNMP, Chukwa [Boulon08] • Configurable tracing frameworks: Dtrace, X-Trace [Fonseca07], VM-based tracing [King05] • Statistical debugging [Zheng06], AjaxScope [Kiciman07] • Numerical traces: SystemHistory [Cohen05], Fingerprints [Bodik10], [Lakhina04] • Path-based problem detection: Pinpoint [Chen04] • Replay-based: deterministic replay [Altekar09] • Events as time series: [Hellerstein02], [Lim08], [Yamanishi05] • Runtime predicates: D3S [Liu08] • Alternative logging: PeerPressure [Wang04], Heat-ray [Dunagan09], [Kandula09]

  32. Related work: log parsing • Rule-based and scripts • Microsoft Log Parser, Splunk, … • Frequent pattern based • [Vaarandi04] R. Vaarandi. A breadth-first algorithm for mining frequent patterns from event logs. In Proc. of INTELLCOMM, 2004. • Hierarchical • [Fu09] Q. Fu, et al. Execution anomaly detection in distributed systems through unstructured log analysis. In Proc. of ICDM, 2009. • [Makanju09] A. A. Makanju, et al. Clustering event logs using iterative partitioning. In Proc. of KDD ’09. • [Fisher08] K. Fisher, et al. From dirt to shovels: fully automatic tool generation from ad hoc data. In Proc. of POPL ’08. • Key benefit of using source code • Handles rare messages, which are likely to be more useful in problem detection

  33. Related work: modeling logs • As free text • [Stearley04] J. Stearley. Towards informatic analysis of syslogs. In Proc. of CLUSTER, 2004. • Visualize global trends • [CretuCiocarlie08] G. Cretu-Ciocarlie, et al. Hunting for problems with Artemis. In Proc. of WASL, 2008. • [Tricaud08] S. Tricaud. Picviz: finding a needle in a haystack. In Proc. of WASL, 2008. • A single event stream (time series) • [Ma01] S. Ma, et al. Mining partially periodic event patterns with unknown periods. In Proc. of ICDE, 2001. • [Hellerstein02] J. Hellerstein, et al. Discovering actionable patterns in event data, 2002. • [Yamanishi05] K. Yamanishi, et al. Dynamic syslog mining for network failure monitoring. In Proc. of ACM KDD, 2005. • Multiple streams / state transitions • [Tan08] J. Tan, et al. SALSA: analyzing logs as state machines. In Proc. of WASL, 2008. • [Fu09] Q. Fu, et al. Execution anomaly detection in distributed systems through unstructured log analysis. In Proc. of ICDM, 2009.

  34. Related work: techniques that inspired this project • Path-based analysis • [Chen04] M. Y. Chen, et al. Path-based failure and evolution management. In Proc. of NSDI, 2004. • [Fonseca07] R. Fonseca, et al. X-Trace: a pervasive network tracing framework. In Proc. of NSDI, 2007. • Anomaly detection and kernel methods • [Gartner03] T. Gartner. A survey of kernels for structured data. SIGKDD Explorations, 2003. • [Twining03] C. J. Twining, et al. The use of kernel principal component analysis to model data distributions. Pattern Recognition, 36(1):217–227, January 2003. • Vector space information retrieval • [Papineni01] K. Papineni. Why inverse document frequency? In Proc. of NAACL, 2001. • [Salton87] G. Salton, et al. Term weighting approaches in automatic text retrieval. Cornell technical report, 1987. • Using source code as alternative information • [Tan07] L. Tan, et al. /*icomment: bugs or bad comments?*/. In Proc. of ACM SOSP, 2007.

  35. Future directions: console log mining • Distributed log stream processing • Handle large scale cluster + partial failures • Integration with existing monitoring systems • Allows existing detection algorithm to apply directly • Logs from the entire software stack • Handling logs across software versions • Allowing feedback from operators • Providing suggestions on improving logging practice

  36. Beyond console logs: textual information in SW dev-operation • Paper

  37. Summary • Pipeline: Parsing → Feature Creation → Machine Learning → Visualization • A general methodology for processing console logs automatically (SOSP ’09) • A novel method for online detection (ICDM ’09) • Validation on real systems • http://www.cs.berkeley.edu/~xuw/ • Wei Xu <xuw@cs.berkeley.edu>

  38. Backup slides

  39. Rare messages are important • Receiving block blk_100 src: … dest: … • Receiving block blk_200 src: … dest: … • Receiving block blk_100 src: … dest: … • Received block blk_100 of size 49486737 from … • Received block blk_100 of size 49486737 from … • PendingReplicationMonitor timeout block blk_200

  40. Compare to natural language (semantics) based approaches • We use the console log solely as a trace of program execution • Semantic meanings are sometimes misleading • Does “Error” really mean it? • Missing link between the message and the actual program execution • Hard to completely eliminate manual search

  41. Compare to text based parsing • Existing work infers the structure of log messages without source code • Patterns in strings • Word weighting • Learning the “grammar” of logs • Pros – no source code analysis needed • Cons – unable to handle rare messages • 950 possible message types, only <100 appeared in our log • Per-operation features require highly accurate parsing

  42. Better logging practices • We do not require developers to change their logging habits • Use logging • Traces vs. error reporting • Distinguish your messages from others’ • printf(“%d”, i); is a bad example • Use a logging framework • Levels, timestamps, etc. • Use correct levels (think in a system context) • Avoid unnecessary WARNINGs and ERRORs

  43. Ordering of events • Message count vectors do not capture any ordering/timing information • Cannot guarantee the global ordering on distributed systems • Needs significant change to the logging infrastructure

  44. Too many alarms – what can operators do with them? • Operators see clusters of alarms instead of individual ones • Easy to cluster similar alarms • Most of the alarms are the same • Better understanding of problems • Deterministic bugs in the system • Provides more informative bug reports • Either fix the bug or the log • Transient failures • Clustered around single nodes/subnets/times • Better understanding of normal patterns

  45. Case studies • Surveyed a number of software apps • Linux kernel, OpenSSH • Apache, MySQL, Jetty • Hadoop, Cassandra, Nutch • Our methods apply to virtually all of them

  46. Log parsing process

  47. Step 1: Log parsing – challenges in object-oriented languages • LOG.info(“starting:” + transact); • “starting: xact 325 is PREPARING” → templates starting:(.*) and xact (.*) is (.*), variables [transact][Transaction] and [tid, state][int, String] (Participant.java:345) • “starting: xact 325 is STARTING at Node1:1000” → template xact (.*) is (.*) at (.*), variables [tid, state, node][int, String, Node] • The same logging statement can produce different message structures depending on the object’s runtime type

  48. Intuition behind PCA

  49. Another decision tree

  50. Existing work on textual log analysis • Frequent item set mining • R. Vaarandi. SEC – a lightweight event correlation tool. IEEE Workshop on IP Operations and Management, 2002. • Temporal properties analysis • Y. Liang, A. Sivasubramaniam, and J. Moreira. Filtering failure logs for a BlueGene/L prototype. In Proc. of IEEE DSN, 2005. • C. Lim, et al. A log mining approach to failure analysis of enterprise telephony systems. In Proc. of IEEE DSN, 2008. • Statistical modeling of logs • R. K. Sahoo, et al. Critical event prediction for proactive management in large-scale computer clusters. In Proc. of ACM KDD, 2003.
