Randy Schauer,  Anupam Joshi

Research partially supported by the ARL MSRC (GSA Contract GS00T99ALD0209) & Raytheon





A Probabilistic Approach to Distributed System Management

  • Why is the management of large-scale distributed systems a problem?

  • New High Performance Computing (HPC) clusters are already running over 100 TeraFLOPS (trillion floating-point operations per second) on a consistent basis; the PetaFLOP era is near.

  • Systems are becoming too large for system administrators to manage easily

BlueGene/L, 596 TFLOPS (LLNL, Livermore, CA)

  • How can this problem be solved?

  • The system must be able to manage aspects of its configuration without using a central image master, relying only on the knowledge of its peers

  • The system must be able to understand and evaluate its operating environment to catch issues before they become catastrophic problems

LNXI ATC (MJM), 53 TFLOPS (ARL MSRC, Aberdeen, MD)

  • How can we determine the correct configuration in a distributed system?

  • Large clusters require their commodity components to be tied together operationally through software configurations, which makes it impractical to model every possible configuration parameter accurately

  • Because the space of possible configurations is effectively unbounded and the optimal settings differ between environments, a statistical relational learning method is the preferred inference mechanism, specifically Markov Logic Networks

  • Markov Logic Networks provide a first-order predicate knowledge base with a weight attached to each formula, so an initial set of conditions can capture the rules needed to make informed decisions
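To make the idea of weighted formulas concrete, here is a minimal sketch of how a Markov Logic Network assigns probabilities: each possible world x gets a score proportional to exp(sum_i w_i * n_i(x)), where w_i is the weight of formula i and n_i(x) is the number of its true groundings in x. The atoms, formulas, and weights below are hypothetical illustrations, not taken from the poster, and the brute-force enumeration only works for toy cases.

```python
import itertools
import math

# Hypothetical ground atoms for a single node (illustrative names only).
ATOMS = ("PeersAgree", "RecentlyChanged", "ConfigCorrect")

# Weighted formulas: (weight, test that is True when the formula holds in a world).
# The weights are assumed values chosen for the example.
FORMULAS = [
    # PeersAgree => ConfigCorrect
    (2.0, lambda w: (not w["PeersAgree"]) or w["ConfigCorrect"]),
    # RecentlyChanged => not ConfigCorrect
    (0.5, lambda w: (not w["RecentlyChanged"]) or (not w["ConfigCorrect"])),
]


def world_score(world):
    """Unnormalized weight exp(sum_i w_i * n_i(x)) of one world.

    Each formula here has a single grounding, so n_i(x) is 0 or 1.
    """
    return math.exp(sum(weight for weight, holds in FORMULAS if holds(world)))


def probability(query, evidence):
    """P(query is True | evidence), by enumerating every possible world."""
    numerator = 0.0
    denominator = 0.0
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        world = dict(zip(ATOMS, values))
        if any(world[atom] != value for atom, value in evidence.items()):
            continue  # world contradicts the evidence
        score = world_score(world)
        denominator += score
        if world[query]:
            numerator += score
    return numerator / denominator
```

With these assumed weights, a node whose peers agree and whose configuration has not recently changed is judged correct with probability of about 0.88; adding the evidence that it recently changed lowers this to about 0.82. Real systems would use approximate inference rather than enumeration.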

  • File Access Permissions

  • Comparisons required to ensure proper permissions for both security and access include majority rule, most restrictive, and time-based differences

  • A statistical approach to solving this issue takes known factors into account and weights them as appropriate, allowing us to minimize uncertainty and determine the most valid option
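The three comparisons named above can be combined into a single weighted score. The following is only a sketch under stated assumptions: the weights, the function names, and the (mode, mtime) report format are hypothetical, and a simple linear score stands in for the Markov Logic Network inference the poster describes.

```python
from collections import Counter

# Hypothetical weights for the three comparisons; not taken from the poster.
WEIGHTS = {"majority": 0.6, "restrictive": 0.3, "recent": 0.1}


def restrictiveness(mode):
    """Fraction of the 9 rwx permission bits that are cleared (1.0 = most restrictive)."""
    return 1.0 - bin(mode & 0o777).count("1") / 9.0


def choose_mode(reports):
    """Pick a file mode from peer reports, each a (mode, mtime_seconds) tuple."""
    votes = Counter(mode for mode, _ in reports)
    newest = max(mtime for _, mtime in reports)
    oldest = min(mtime for _, mtime in reports)
    span = (newest - oldest) or 1  # avoid dividing by zero when all mtimes match

    def score(mode):
        # Most recent time at which any peer reported this mode.
        latest = max(mtime for m, mtime in reports if m == mode)
        return (WEIGHTS["majority"] * votes[mode] / len(reports)
                + WEIGHTS["restrictive"] * restrictiveness(mode)
                + WEIGHTS["recent"] * (latest - oldest) / span)

    return max(votes, key=score)
```

For example, if three peers report mode 0o644 and one peer reports a newer 0o666, the majority and restrictiveness terms outweigh recency under these assumed weights and 0o644 is chosen.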

  • Processor Heat Analysis

  • Determine whether a processor is overheating by comparing its temperature with the temperatures reported by neighboring nodes and by nodes in the same rack position in neighboring racks

  • Nodes toward the middle and top tend to get hotter than nodes toward the outside and bottom
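A peer-based temperature check of this kind might look like the sketch below. The (rack, slot) keying, the median comparison, and the 10 °C margin are all assumptions made for illustration; the poster does not specify how the comparison is computed.

```python
from statistics import median


def peer_temperatures(temps, rack, slot):
    """Temperatures of adjacent slots in the same rack and the same slot in adjacent racks.

    `temps` maps hypothetical (rack, slot) coordinates to degrees Celsius.
    """
    candidates = [(rack, slot - 1), (rack, slot + 1), (rack - 1, slot), (rack + 1, slot)]
    return [temps[key] for key in candidates if key in temps]


def is_overheating(temps, rack, slot, margin_c=10.0):
    """Flag a node whose temperature exceeds the median of its peers by more than margin_c."""
    peers = peer_temperatures(temps, rack, slot)
    if not peers:
        return False  # no peers to compare against, so make no claim
    return temps[(rack, slot)] - median(peers) > margin_c
```

Comparing against the same slot in neighboring racks helps account for position effects: a node near the top of a rack is expected to run warmer, so it is judged against peers that share that position rather than against a single global threshold.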

  • So, what have we learned so far?

  • The ability to diagnose and recover from performance and configuration issues without resorting to a centralized knowledge base is the next great stride in allowing systems to self-manage their reliability and stability

  • Preliminary results suggest this is a good approach to using logic for probabilistic model-based diagnosis

  • The results are promising, especially for such a radical change in the approach to system management, but further refinement is necessary to obtain statistically significant results before production deployment


