Supporting System-Wide Similarity Queries for Networked System Management

www.nec-labs.com Supporting System-Wide Similarity Queries for Networked System Management • Songyun Duan (1), Hui Zhang (2), Guofei Jiang (2), Xiaoqiao Meng (1) • (1) IBM T.J. Watson Research Center, Hawthorne, NY • (2) NEC Laboratories America, Princeton, NJ, USA

Outline ■ Problem statement • Solution • Evaluation

Problem statement • Motivation • Large networked systems are the backbone of modern IT and Internet services. • System dynamics and complexity lead to management difficulties when unexpected events happen. • System administrators are overwhelmed with data in systems management. • Massive monitoring data from extensive instrumentation.

Problem statement • Goal: data analysis tools based on a simple and powerful query primitive for systems management • Makes information intuitive to data users (e.g., sysadmins) • Observation: similarity queries arise in various management tasks • Performance management: given a performance problem in time period T, did the system experience a similar problem in the past, when did it occur, and was a successful diagnosis reported for it? • Traffic management: what are the top-k <port, protocol> pairs that exhibit the most similar traffic patterns at an hourly time scale? • Network security • Workload management

Background • Preliminaries • For managing massive logs and/or monitoring data, existing systems support data navigation/browsing with visualization or SQL-like queries • Related Work • AT&T telecom data set visualization tools [keim1999] • UC Berkeley TelegraphCQ: continuous dataflow processing for an uncertain world [Chandrasekaran2003] • Yahoo Pig: web-scale log processing [olston2008] • Amazon AWS: GrepTheWeb, Hadoop on AWS [varia2008] • Facebook Hive: data warehousing using Hadoop [sarma2008] • Our work complements the above with an application focus on systems management. • Related work: [Bahl et al 2007][Kandula et al 2008][Mahimkar et al 2009]

System-wide Similarity Queries (S2Q) • Queries: ask about the similarity of one or more objects based on their time-based states • Nearest neighbor search (Sq, k; S): asks for the top-k states in S that are most similar to the target state Sq. • Range query (Sq, d; S): asks for all the states in S that are within distance d of the target state Sq, where d is a similarity threshold. • Challenges • Query processing efficiency & quality • Massive and noisy data continuously generated from many sources • Supporting integration of domain knowledge into query processing • Diverse management tasks, e.g., workload management, performance management, application management, & security
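The two query primitives on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the brute-force scan, and the inclusive threshold (distance <= d) are choices made here, not something the slides specify, and `dist` stands in for whatever similarity metric the framework supplies.

```python
import heapq

def nearest_neighbors(sq, k, states, dist):
    """Nearest neighbor search (Sq, k; S): the k states in `states`
    closest to the target state `sq` under the distance function `dist`."""
    return heapq.nsmallest(k, states, key=lambda s: dist(sq, s))

def range_query(sq, d, states, dist):
    """Range query (Sq, d; S): every state within distance d of `sq`.
    ASSUMPTION: the threshold is treated as inclusive."""
    return [s for s in states if dist(sq, s) <= d]

# Toy usage: states are plain numbers, distance is absolute difference.
toy_states = [1, 5, 9, 12]
toy_dist = lambda a, b: abs(a - b)
top2 = nearest_neighbors(10, 2, toy_states, toy_dist)
close = range_query(10, 2, toy_states, toy_dist)
```

A linear scan like this does not address the efficiency challenge listed on the slide; that is the role of the indexing component of the framework.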

Outline ■ Problem statement • Networked system management • System-wide similarity queries • Solution • Evaluation

S2Q framework • System modeling: data → information (state) • Similarity metrics: to compare managed objects • Indexing: for efficient retrieval • Similarity query primitives: nearest-neighbor query, range query • Task view interface: to express management tasks using similarity queries • Query plan formulation and execution

System Modeling • Monitoring data • Multi-dimensional time-series • <M_1^t, M_2^t, …, M_n^t>, t = 0, 1, 2, … • System logs / events • Not considered in this paper. • Design space • Raw data • Issues: large volume; measurement noise (reading errors, time synchronization, etc.). • Clustering-based techniques • K-Means, LAC, etc. • Issues: hard to decide the number K; curse of dimensionality. • Pairwise-dependency relationships • Bottom-up methodology • Studied in this paper.

System Modeling – pair-wise dependency relationships • Hypotheses • dependency relationships change significantly only when systems transition from one state to another. • Dependency relationships • statistical dependencies • statistical correlation of time-series from a pair of system metrics using some correlation metric. • Correlation metrics • Linear correlation • Covariance matrix structure based correlation.
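As a point of reference for the "linear correlation" option above, the sketch below computes a Pearson correlation over a sliding window of two metric time series. The window placement (the w samples ending at index t) and the function names are assumptions made for illustration; the slides do not say how the linear correlation was windowed.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no linear dependency signal
    return cov / math.sqrt(vx * vy)

def windowed_correlation(x, y, t, w):
    """Correlation of x and y over the w samples ending at index t (assumed window)."""
    if t - w + 1 < 0:
        raise ValueError("window extends before the start of the series")
    return pearson(x[t - w + 1 : t + 1], y[t - w + 1 : t + 1])

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 12]
r_up = windowed_correlation(x, y, t=5, w=4)
r_down = windowed_correlation(x, y[::-1], t=5, w=4)
```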

System Modeling – covariance matrix based dependency score • Let X and Y represent two time series. The dependency score of X and Y at a time point t is computed as follows: • Generate the auto-covariance matrices of X and Y around time t, respectively, • where X_{i,w} is a time series segment [X_i, X_{i+1}, …, X_{i+w-1}], and X'_{i,w} is the transpose of X_{i,w}. • Compute the dependency score of X and Y based on their auto-covariance matrices. • Decompose the covariance matrices using singular value decomposition (SVD). • The dependency score of X and Y is computed as the distance between the two subspaces spanned by the top-k principal components of X and Y, respectively. • Dependence score = ½ (||U_X^T u_Y|| + ||U_Y^T u_X||) • U_X, U_Y: top-k principal components of ACov_X and ACov_Y. • u_X, u_Y: the first principal components of ACov_X and ACov_Y. • Figure: SVD output of an example ACov_X
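A minimal NumPy sketch of the score above follows. One caveat matters: the auto-covariance formula itself did not survive in this transcript, so the code assumes it is the sum of outer products X_{i,w} X'_{i,w} over m consecutive segments starting at indices t-m+1 through t. That choice, the parameter names, and the defaults w=8, m=32, k=2 are all hypothetical.

```python
import numpy as np

def autocov_matrix(x, t, w, m):
    """ASSUMED form: sum of outer products of the m length-w segments
    X_{i,w} = [x_i, ..., x_{i+w-1}] for i = t-m+1, ..., t."""
    x = np.asarray(x, dtype=float)
    if t - m + 1 < 0 or t + w > len(x):
        raise ValueError("segments fall outside the series")
    acc = np.zeros((w, w))
    for i in range(t - m + 1, t + 1):
        seg = x[i : i + w]
        acc += np.outer(seg, seg)
    return acc

def dependency_score(x, y, t, w=8, m=32, k=2):
    """0.5 * (||U_X^T u_Y|| + ||U_Y^T u_X||), using the left singular
    vectors of each auto-covariance matrix as principal components."""
    Ux, _, _ = np.linalg.svd(autocov_matrix(x, t, w, m))
    Uy, _, _ = np.linalg.svd(autocov_matrix(y, t, w, m))
    UX, UY = Ux[:, :k], Uy[:, :k]   # top-k principal components
    ux, uy = Ux[:, 0], Uy[:, 0]     # first principal components
    return 0.5 * (np.linalg.norm(UX.T @ uy) + np.linalg.norm(UY.T @ ux))

# Toy usage: a series compared with a scaled copy of itself and with noise.
t_axis = np.arange(200)
x = np.sin(0.3 * t_axis) + 0.01 * t_axis
score_scaled = dependency_score(x, 2.0 * x, t=100)
score_noise = dependency_score(x, np.random.default_rng(0).normal(size=200), t=100)
```

Because the columns of U_X and U_Y are orthonormal, each norm is at most 1, so the score lies in [0, 1]. A scaled copy of a series has the same principal directions and scores close to 1.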

System Modeling – system-wide similarity graph • A dependency graph Gt = (V, Et) is generated at time t • V is the set of attributes of target system objects • Et is the set of dependency relationships between object attributes at t, computed with the covariance-matrix-based dependency metric. • Similarity metrics on two graphs Gt1 and Gt2 • One simple metric is the sum of edge weight differences, if dependency scores are used as the edge weights. • In the evaluation, we first prune away the edges whose weights are below a threshold (e.g., 0.9), and then calculate the distance between Gt1 and Gt2 as:
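The distance formula used in the evaluation is not reproduced in this transcript, so the sketch below implements only the "simple metric" named on the slide: prune edges below the threshold, then sum the absolute edge-weight differences. Representing a graph as a dict from attribute pairs to scores, and treating a pruned or missing edge as weight 0, are assumptions made here for illustration.

```python
def prune(graph, threshold=0.9):
    """Keep only edges whose dependency score reaches the threshold."""
    return {edge: w for edge, w in graph.items() if w >= threshold}

def graph_distance(g1, g2, threshold=0.9):
    """Sum of absolute edge-weight differences after pruning.
    ASSUMPTION: an edge absent from one graph counts as weight 0 there."""
    a, b = prune(g1, threshold), prune(g2, threshold)
    return sum(abs(a.get(e, 0.0) - b.get(e, 0.0)) for e in a.keys() | b.keys())

# Toy graphs: edges are (attribute, attribute) pairs with dependency scores.
g_t1 = {("cpu", "mem"): 0.95, ("cpu", "net"): 0.50}
g_t2 = {("cpu", "mem"): 0.92, ("mem", "disk"): 0.97}
d12 = graph_distance(g_t1, g_t2)
```

In the toy example the weak ("cpu", "net") edge is pruned, so the distance is |0.95 - 0.92| + 0.97, approximately 1.0.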

Dependence score computation: algorithm and performance • Figure: robustness to noise: (a) time-series, (b) dependence score • Figure: robustness to time delay: (a) time-series, (b) dependence score • Figure: streaming algorithm performance

Outline ■ Problem statement • Networked system management • System-wide similarity queries • Solution • S2Q framework • System modeling: related work and our solution • Similarity metric and index: related work and our solution • Evaluation • Task 1: fast diagnosis of repeated failures in IT systems • Task 2: automated application traffic profiling

Task I: fast diagnosis of repeated failures in IT systems • Goal: reuse past diagnosis efforts by locating similar diagnosed failure instances quickly. • 50%-90% of failures are recurrences of previous failures [Brodie et al 2005] • S2Q query formulation: • Q = most_similar(Sq, N; SU), which asks for the top-N failure states in SU that are most similar to the failure state Sq under diagnosis. • Experimental setting • Three-tier Web service testbed • JBoss application server, embedded Web server, and MySQL DB • Runs an auction service, RUBiS, modeled on eBay • Data: number of procedure invocations in Java beans of the application tier • Various failures injected to simulate different system states • Java exceptions, deadlock, memory leak, infinite loop, etc. • Using the AFPI tool from the Berkeley/Stanford ROC project

Task I: results • Dataset • 4120 * 105 • 80 distinct failure states • 40% as test instances • Evaluation metrics • Given a state S, return the N most similar instances • Precision = #matched / N • Recall = #matched / #historic_S • Schemes • S2Q • Raw-data: K-means clustering technique applied on the raw monitoring data • Figure: precision (y-axis) vs. recall (x-axis) curves for S2Q and raw-data, with points labeled N=1 and N=50
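The two evaluation metrics can be written out as a short Python helper. The reading of #historic_S as "the number of historic instances that share the query's failure state" is an assumption, and the labels in the usage example are made up for illustration; they are not results from the slides.

```python
def precision_recall(returned_labels, query_label, historic_count):
    """Precision = #matched / N and Recall = #matched / #historic_S.

    returned_labels: failure-state labels of the N returned instances
    query_label:     failure state of the query instance
    historic_count:  ASSUMED to be the number of historic instances
                     with the same failure state as the query
    """
    n = len(returned_labels)
    matched = sum(1 for label in returned_labels if label == query_label)
    precision = matched / n if n else 0.0
    recall = matched / historic_count if historic_count else 0.0
    return precision, recall

# Hypothetical query: 4 instances returned, 3 of them match, 6 exist in history.
p, r = precision_recall(
    ["deadlock", "memory leak", "deadlock", "deadlock"], "deadlock", 6
)
```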

Task II: automated application traffic profiling • Goal: automatic learning of application information from network traffic • Hypothesis 1: ports associated with an application show similar patterns over time and across the port group • Hypothesis 2: traffic through randomly used ports behaves like a noise signal • S2Q query formulation: • Q = within(Oq, d; O), where Oq is the state of the target traffic object, d is a similarity threshold, and O is the set of all traffic objects found in the monitoring data. • Experimental setting • Dartmouth campus-wide wireless network traffic in packets <SrcIP, DstIP, SrcPort, DstPort, protocol> • Aggregate the data by <port, protocol> combination, with flow statistics in 5-minute intervals • #packets, #bytes, & two entropy-related features on SrcIP and DstIP • No prior knowledge about application-port mappings • Output: a set of applications; each is represented as a group of <port, protocol> combinations
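The aggregation step in the experimental setting might look like the sketch below. Several details are assumptions, since the slide does not give them: packets are keyed by destination port, the entropy features are Shannon entropies of the SrcIP and DstIP distributions, and each packet record carries hypothetical `ts` (seconds) and `size` (bytes) fields in addition to the five listed header fields.

```python
import math
from collections import Counter, defaultdict

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def aggregate(packets, bin_seconds=300):
    """Group packets by (port, protocol, 5-minute bin) and compute
    (#packets, #bytes, SrcIP entropy, DstIP entropy) per group."""
    groups = defaultdict(list)
    for p in packets:
        # ASSUMPTION: the <port, protocol> key uses the destination port.
        key = (p["dst_port"], p["protocol"], int(p["ts"] // bin_seconds))
        groups[key].append(p)
    features = {}
    for key, pkts in groups.items():
        features[key] = (
            len(pkts),
            sum(p["size"] for p in pkts),
            shannon_entropy(p["src_ip"] for p in pkts),
            shannon_entropy(p["dst_ip"] for p in pkts),
        )
    return features

# Made-up packets: two clients querying one DNS server in the first bin.
sample = [
    {"ts": 10, "size": 60, "src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "dst_port": 53, "protocol": "UDP"},
    {"ts": 20, "size": 80, "src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "dst_port": 53, "protocol": "UDP"},
    {"ts": 400, "size": 70, "src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "dst_port": 53, "protocol": "UDP"},
]
feats = aggregate(sample)
```

The per-bin feature tuples for one <port, protocol> pair, ordered by bin index, form the multi-dimensional time series that the within(Oq, d; O) query would compare.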

Task II: results (I) • Figure: traffic patterns of <port, protocol> = <137, UDP> (NetBIOS Name Service) and <139, TCP> (NetBIOS Session Service)

Task II: results (II) • Data: one-day traffic trace at one sniffing point • Out of ~8000 <port, protocol> combinations in the trace, 15 applications were identified • Profiling results: port number (application) • 80 (HTTP), 53 (DNS), 137-139 (NetBIOS), 1214 (Kazaa), 5190 (AOL Messenger), 161 (SNMP), 0 (ICMP), 67-68 (DHCP), 1071 (BASQUARE-VOIP), 6699 (WinMX) • 6 major applications were verified by the data owner [Kotz2002]

Conclusions & Future Work • Main results • A framework for System-wide Similarity Query is described. • The framework is general for various target systems and systems management tasks. • A robust system modeling technique based on covariance matrix structures is proposed to characterize dependency between multiple time-series. • A graph-based system-wide similarity metric • A streaming algorithm for similarity score computation. • Two systems management applications are evaluated by applying the proposed S2Q methodology. • Future work • Extension and optimization on event & symbolic data. • Implementation in the MapReduce distributed computation framework. • Evaluate other management tasks, e.g., network security.
