Randy Schauer, Anupam Joshi

A Probabilistic Approach to Distributed System Management

  • Why is the management of large-scale distributed systems a problem?
      • New High Performance Computing (HPC) clusters already run at over 100 TeraFLOPS (trillion floating-point operations per second) on a consistent basis; the PetaFLOP era is near.
      • Systems are becoming too large for system administrators to manage easily.

BlueGene/L, 596 TFLOPS (LLNL, Livermore, CA)

  • How can this problem be solved?
      • The system must be able to manage aspects of its configuration without using a central image master, relying only on the knowledge of its peers.
      • The system must be able to understand and evaluate its operating environment to catch issues before they become catastrophic problems.

LNXI ATC (MJM), 53 TFLOPS (ARL MSRC, Aberdeen, MD)

  • How can we determine the correct configuration in a distributed system?
      • Large clusters require their various commodity components to be tied together operationally through software configurations, which makes it impossible to accurately model all possible configuration parameters.
      • Because the possible configurations are infinite and the optimal settings differ between environments, a statistical relational learning method is the preferred inference mechanism, specifically Markov Logic Networks.
      • Markov Logic Networks provide a first-order predicate knowledge base with a weight applied to each formula, allowing for an initial set of conditions that capture the rules needed to make informed decisions.
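
To make the role of the formula weights concrete, here is a minimal sketch of how a Markov Logic Network scores possible worlds: each world gets probability proportional to exp(sum of the weights of the formulas it satisfies). The atoms, formulas, and weights below are hypothetical and chosen only for illustration; they are not taken from the authors' knowledge base, and brute-force enumeration is used only because the example is tiny.

```python
import math
from itertools import product

# Hypothetical ground atoms for a single node (illustrative names only).
ATOMS = ("peers_disagree", "misconfigured")

# Hypothetical weighted formulas: (weight, test that is True when the formula holds).
FORMULAS = [
    # peers_disagree => misconfigured
    (1.5, lambda w: (not w["peers_disagree"]) or w["misconfigured"]),
    # Weak prior that a node is configured correctly.
    (0.5, lambda w: not w["misconfigured"]),
]

def score(world):
    """Sum of the weights of all formulas satisfied in this world."""
    return sum(weight for weight, holds in FORMULAS if holds(world))

def all_worlds():
    """Enumerate every truth assignment to the ground atoms."""
    for values in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

def conditional(query, evidence):
    """P(query is True | evidence), computed by enumerating worlds."""
    num = den = 0.0
    for world in all_worlds():
        if any(world[atom] != value for atom, value in evidence.items()):
            continue  # world contradicts the evidence
        p = math.exp(score(world))  # unnormalized probability
        den += p
        if world[query]:
            num += p
    return num / den

if __name__ == "__main__":
    print(conditional("misconfigured", {"peers_disagree": True}))
    print(conditional("misconfigured", {"peers_disagree": False}))
```

Because the formulas are soft constraints, a world that violates one is made less probable instead of impossible, which is what lets the knowledge base hold rules that are usually but not always true.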

  • File Access Permissions
      • Comparisons required to ensure proper permissions for both security and access include majority rule, most restrictive, and time-based differences.
      • A statistical approach to solving this issue takes known factors into account and weights them as appropriate, allowing us to minimize uncertainty and determine the most valid option.
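
The three comparisons can be combined as a weighted vote, sketched below. This is a simplified stand-in for the statistical approach, not the authors' implementation: the weights are hypothetical, "most restrictive" is assumed to mean the mode with the fewest permission bits set, and "time-based" is assumed to mean the most recently modified copy.

```python
from collections import Counter

# Hypothetical weights for the three comparisons; real values would be learned or tuned.
WEIGHTS = {"majority": 0.5, "most_restrictive": 0.3, "most_recent": 0.2}

def choose_mode(reports, weights=WEIGHTS):
    """Pick a permission mode from peer reports.

    reports: list of (mode, mtime) pairs, one per peer, where mode is an
    integer such as 0o644 and mtime is a modification timestamp.
    Returns (chosen_mode, share_of_total_weight_supporting_it).
    """
    modes = [mode for mode, _ in reports]
    candidates = {
        # Mode reported by the largest number of peers.
        "majority": Counter(modes).most_common(1)[0][0],
        # Assumption: fewest permission bits set means most restrictive.
        "most_restrictive": min(modes, key=lambda m: bin(m).count("1")),
        # Mode on the most recently modified copy.
        "most_recent": max(reports, key=lambda r: r[1])[0],
    }
    scores = Counter()
    for rule, mode in candidates.items():
        scores[mode] += weights[rule]
    best_mode, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_mode, best_score / sum(weights.values())

if __name__ == "__main__":
    peers = [(0o644, 100), (0o644, 110), (0o644, 140), (0o600, 130), (0o666, 90)]
    mode, support = choose_mode(peers)
    print(oct(mode), support)
```

The returned support value gives a rough indication of how strongly the rules agree, so a low value could be treated as a signal to defer the change.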

  • Processor Heat Analysis
      • Determine whether a processor is overheating by comparing its temperature with the temperatures reported by neighboring nodes and by nodes in the same rack location in neighboring racks.
      • Nodes toward the middle and top tend to get hotter than nodes toward the outside and bottom.
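
The peer comparison described above can be sketched as follows. The data layout (a dictionary keyed by rack and slot), the use of the median, and the 10-degree margin are all assumptions made for illustration. Comparing against the same slot in neighboring racks helps account for position, since those nodes sit at the same height.

```python
from statistics import median

def peer_temps(temps, rack, slot):
    """Temperatures from adjacent slots in the same rack and from the
    same slot in adjacent racks. temps maps (rack, slot) -> temperature."""
    keys = [(rack, slot - 1), (rack, slot + 1), (rack - 1, slot), (rack + 1, slot)]
    return [temps[k] for k in keys if k in temps]

def is_overheating(temps, rack, slot, margin=10.0):
    """Flag a node whose temperature exceeds the median of its peers by
    more than margin (a hypothetical threshold)."""
    peers = peer_temps(temps, rack, slot)
    if not peers:
        return False  # no peers to compare against, so make no claim
    return temps[(rack, slot)] - median(peers) > margin

if __name__ == "__main__":
    # Three racks of three nodes; higher slots run slightly warmer.
    temps = {(r, s): 60.0 + 2.0 * s for r in range(3) for s in range(3)}
    temps[(1, 1)] = 85.0  # one node running hot
    print(is_overheating(temps, 1, 1), is_overheating(temps, 1, 2))
```

Using the median keeps a single hot neighbor from masking or triggering a detection on its own.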

  • So, what have we learned so far?
      • The ability to diagnose and recover from performance and configuration issues without resorting to a centralized knowledge base is the next great stride in allowing systems to self-manage their reliability and stability.
      • Preliminary results show this is a good approach to using logic for probabilistic model-based diagnosis.
      • The results are promising, especially for such a radical change in the approach to system management, but production deployment requires further refinement in order to obtain statistically significant results.

Research partially supported by the ARL MSRC (GSA Contract GS00T99ALD0209) & Raytheon