Paradyn Week 2006 March 2006 UAB Dynamic Monitoring and Tuning in Multicluster Environment Genaro Costa, Anna Morajko, Paola Caymes Scutari, Tomàs Margalef and Emilio Luque Universitat Autònoma de Barcelona
Outline • Introduction • Multicluster Systems • Applications on Wide Systems • MATE • New Requirements • Design • Conclusions
Introduction System performance • New problems require more computation power. Performance is a key issue. • New wide systems are built over the available resources, and the user does not have total control over where the application will run. • It has become more difficult to reach high performance and efficiency on these wide systems.
Introduction (II) • To reach performance goals, users need to find and solve bottlenecks. • Dynamic monitoring and tuning is a promising approach. • Because system properties change dynamically, efficient resource use is hard to achieve even for expert users.
Multicluster Systems • New systems are built using existing resources. Examples are NOW and HNOW linked with multistage network interconnections. • Intra-cluster communications have different latencies than inter-cluster communications. • Generally, multiclusters are built of clusters (homogeneous or heterogeneous) interconnected by a WAN.
Multicluster Systems (II) • Each cluster can have its own scheduler and can be exposed either through a head node or through all of its nodes.
Applications on Wide Systems • Hierarchical Master/Worker Applications • Raise the possibility of performance bottlenecks • Load imbalance problems • Inefficient resource use • Non-deterministic inter-cluster bandwidth [Figure: Cluster A holds the Master and four Workers; Cluster B holds a Sub Master and four Workers. Annotations: "Common data are transmitted once"; "Sub Master exploits data locality".]
Applications on Wide Systems (II) • Hierarchical Master/Worker Applications • The master sees a sub master as a single node with high processing power. • Work distribution from master to sub master should be based on: • Available bandwidth • Computing power • These characteristics may change dynamically.
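The distribution criteria above can be illustrated with a minimal sketch. It is not MATE's policy: the `SubMaster` record, the `effective_rate` model (transfer and computation overlap, so the slower one limits throughput) and the proportional split are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SubMaster:
    """Hypothetical description of one sub master as seen by the master."""
    name: str
    compute_rate: float  # work units the cluster can process per second
    bandwidth: float     # bytes per second available on the inter-cluster link


def effective_rate(sm: SubMaster, bytes_per_unit: float) -> float:
    """Work units per second, limited by the slower of transfer and compute."""
    transfer_rate = sm.bandwidth / bytes_per_unit
    return min(sm.compute_rate, transfer_rate)


def distribute(total_units: int, sub_masters: list[SubMaster],
               bytes_per_unit: float) -> dict[str, int]:
    """Split total_units in proportion to each sub master's effective rate."""
    rates = {sm.name: effective_rate(sm, bytes_per_unit) for sm in sub_masters}
    total_rate = sum(rates.values())
    if total_rate <= 0:
        raise ValueError("no sub master can make progress")
    shares = {name: int(total_units * rate / total_rate)
              for name, rate in rates.items()}
    # Units lost to integer truncation go to the fastest sub master.
    fastest = max(rates, key=rates.get)
    shares[fastest] += total_units - sum(shares.values())
    return shares
```

Because bandwidth and computing power may change at run time, a tuning tool would re-measure both and call `distribute` again before each new batch of work.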
MATE • Monitoring, Analysis and Tuning Environment • Dynamic automatic tuning of parallel/distributed applications. [Figure labels: Problem / Solution; User; Application development; Source; Application; Execution; Performance data; Monitoring; Events; Performance analysis; Tuning; Tool; Modifications; DynInst; Instrumentation]
MATE (II) • Application Controller - AC • Dynamic Monitoring Library - DMLib • Analyzer [Figure: Machine 1 and Machine 2 run ACs and the application tasks (Task1, Task2, Task3), each task with DMLib; Machine 3 runs the Analyzer. Arrow labels: instr., modif., events.]
MATE (III) • Each tuning technique is implemented in MATE as a “tunlet”, a C/C++ library dynamically loaded into the Analyzer process. • measure points – what events are needed • performance model – how to determine bottlenecks and solutions • tuning actions/points/synchronization – what to change, where, when [Figure: the Analyzer hosts two tunlets on top of the DTAPI; each tunlet contains measure points, a performance model, and tuning point, action, sync.]
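The three parts of a tunlet can be sketched as follows. Real tunlets are C/C++ libraries written against the DTAPI, which is not shown in these slides; the Python class below only mirrors the structure, and every name in it (`Event`, `TuningAction`, `LoadBalanceTunlet`, `chunk_size`, the imbalance threshold) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """Hypothetical event record delivered from a measure point."""
    worker: str
    name: str       # e.g. "iteration_done"
    seconds: float  # measured duration


@dataclass
class TuningAction:
    """Hypothetical description of what to change, where, and when."""
    point: str    # variable to modify in the application
    scale: float  # factor applied to the current value
    sync: str     # synchronization condition for applying the change


class LoadBalanceTunlet:
    """Illustrative tunlet: measure points, performance model, tuning action."""

    measure_points = ["iteration_done"]  # which events are needed

    def __init__(self, imbalance_threshold: float = 1.5):
        self.threshold = imbalance_threshold
        self.times: dict[str, float] = {}

    def on_event(self, event: Event) -> None:
        """Keep the latest measured time per worker."""
        if event.name in self.measure_points:
            self.times[event.worker] = event.seconds

    def evaluate(self) -> Optional[TuningAction]:
        """Performance model: report imbalance when slowest/fastest is too high."""
        if len(self.times) < 2:
            return None
        slowest = max(self.times.values())
        fastest = min(self.times.values())
        if fastest > 0 and slowest / fastest > self.threshold:
            # Tuning action: shrink the work chunk at the next iteration.
            return TuningAction("chunk_size", fastest / slowest, "next_iteration")
        return None
```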
New Requirements • Transparent process tracking • The AC should follow application processes to any cluster. • Lower inter-cluster instrumentation communication overhead • Inter-cluster communications generally have higher latency and lower bandwidth than intra-cluster communications.
DESIGN Transparent process tracking • System Service • A machine or cluster can have MATE enabled as a daemon that detects the startup of new processes. [Figure labels: MATE Enabled Machine (three instances); AC; Taskn; DMLib; startup detection; attach; control; Analyzer subscription; AC receives Analyzer information]
DESIGN (II) Transparent process tracking • Application plug-in • The AC can be packaged together with the application binary. [Figure labels: Job submission; AC; Task; DMLib; Remote Machine (two instances); Dyninst; detects new 'Task'; create new 'Task'; create DMLib; AC control; Analyzer subscription]
DESIGN (III) Lower communication overhead • Smart event collection • A total application trace may generate too much overhead. • Event aggregation • Remote trace events should be aggregated into trace event abstractions, saving bandwidth. • Inter-cluster trace event routing
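Event aggregation can be sketched as below. The slides leave the semantics of aggregation open (it is listed as future work), so the choice of count, total and maximum per event name is an assumption, as are the `TraceEvent` and `AbstractEvent` record layouts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TraceEvent:
    """Hypothetical raw trace event produced inside one cluster."""
    task: str
    name: str
    duration: float


@dataclass
class AbstractEvent:
    """Hypothetical summary sent across the inter-cluster link."""
    name: str
    count: int
    total: float
    maximum: float


def aggregate(events):
    """Collapse raw events into one abstract event per event name."""
    durations = defaultdict(list)
    for event in events:
        durations[event.name].append(event.duration)
    # Sorted by name so the output order is deterministic.
    return [AbstractEvent(name, len(values), sum(values), max(values))
            for name, values in sorted(durations.items())]
```

With this scheme the number of records crossing the WAN depends on the number of distinct event names per reporting interval, not on the number of raw events.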
Analyzer Approaches • Centralized • Requires modifying tunlets so that they can distinguish the instrumentation data of local application processes. • Hierarchical • Requires splitting tunlets into local tunlets and global tunlets. • Distributed • Requires that tunlet instances located on different Analyzer instances cooperate to tune an application.
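The hierarchical split can be illustrated with a small sketch: a local tunlet sees only its own cluster's events and sends a compact summary upward, and a global tunlet reasons over the summaries. The class names, the summary fields and the "slowest cluster" decision are assumptions for illustration, not part of MATE.

```python
from statistics import mean


class LocalTunlet:
    """Runs inside one cluster and sees only that cluster's events."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.durations = []

    def on_event(self, duration):
        self.durations.append(duration)

    def summarize(self):
        """Abstract event sent upward: one small record instead of a raw trace."""
        mean_time = mean(self.durations) if self.durations else 0.0
        return {"cluster": self.cluster,
                "tasks": len(self.durations),
                "mean_time": mean_time}


class GlobalTunlet:
    """Runs in the global Analyzer and sees only per-cluster summaries."""

    def __init__(self):
        self.summaries = {}

    def on_summary(self, summary):
        self.summaries[summary["cluster"]] = summary

    def slowest_cluster(self):
        """Name of the cluster with the highest mean task time, if any."""
        if not self.summaries:
            return None
        worst = max(self.summaries.values(), key=lambda s: s["mean_time"])
        return worst["cluster"]
```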
DESIGN (IV) Lower communication overhead (II) • Centralized Analyzer Approach [Figure: Cluster A (Machines A1, A2, A3) and Cluster B (Machines B1, B2, B3) run the application tasks with their ACs; the figure also shows a single Analyzer and an Event Router.]
DESIGN (V) Local Performance Model Analysis • Hierarchical Analyzer Approach [Figure: Cluster A (Machines A1 to A4) and Cluster B (Machines B1, B2, B3) run the application tasks with their ACs; the figure also shows two LocalAnalyzers and one GlobalAnalyzer, with an "Abstract Events" label.]
DESIGN (VI) Distributed Monitoring, Analysis and Tuning Environment • Distributed Analyzer Approach [Figure: Cluster A (Machines A1, A2, A3) and Cluster B (Machines B1, B2, B3) run the application tasks with their ACs; the figure also shows two Analyzers, with a "Tunlet instances cooperation" label.]
Conclusions and future work • Conclusions • Interference of instrumentation information with inter-cluster communication should be minimal. • Process tracking enables MATE for multicluster systems. • The centralized Analyzer approach benefits the tunlet developer but does not scale. • The distributed Analyzer approach scales but requires a different model-based analysis.
Conclusions and future work (II) • Future Work • Development of new tunlets for the distributed and hierarchical Analyzer approaches. • Tuning based only on local instrumentation data. • Semantics of aggregation for instrumentation events. • Patterns of distributed tunlet cooperation. • Scenarios of distributed Analyzer cooperation in multiclusters.