Research partially supported by the ARL MSRC (GSA Contract GS00T99ALD0209) & Raytheon

Randy Schauer, Anupam Joshi

A Probabilistic Approach to Distributed System Management

  • Why is the management of large-scale distributed systems a problem?
    • New High Performance Computing (HPC) clusters already sustain over 100 TeraFLOPS (trillion floating-point operations per second), and the PetaFLOPS era is near.
    • Systems are becoming too large for system administrators to manage easily.

BlueGene/L, 596 TFLOPS (LLNL, Livermore, CA)

  • How can this problem be solved?
    • The system must be able to manage aspects of its configuration without using a central image master, relying only on the knowledge of its peers.
    • The system must be able to understand and evaluate its operating environment to catch issues before they become catastrophic problems.

LNXI ATC (MJM), 53 TFLOPS (ARL MSRC, Aberdeen, MD)

  • How can we determine the correct configuration in a distributed system?
    • Large clusters tie their commodity components together operationally through software configurations, which makes it impractical to accurately model all possible configuration parameters.
    • Because the space of possible configurations is effectively unbounded and the optimal settings differ between environments, a statistical relational learning method is the preferred inference mechanism, specifically Markov Logic Networks.
    • Markov Logic Networks provide a first-order knowledge base with a weight attached to each formula, so an initial set of weighted rules can capture the conditions needed to make informed decisions.
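To make the weighted-formula idea concrete, here is a minimal sketch of inference in a tiny ground Markov Logic Network, where each possible world x has probability proportional to exp(sum of the weights of the formulas that x satisfies). The atoms, rules, and weights below are hypothetical and chosen only for illustration; they are not taken from the presentation, and a real deployment would learn the weights and use approximate inference rather than enumerating every world.

```python
import itertools
import math

# Hypothetical ground atoms describing a single node (illustrative only).
ATOMS = ["peers_disagree", "recently_changed", "misconfigured"]

# Hypothetical weighted formulas: (weight, predicate over a world).
# A world is a dict mapping each atom name to True/False.
FORMULAS = [
    # peers_disagree => misconfigured
    (1.5, lambda w: (not w["peers_disagree"]) or w["misconfigured"]),
    # recently_changed => misconfigured
    (0.8, lambda w: (not w["recently_changed"]) or w["misconfigured"]),
    # prior belief that a node is usually configured correctly
    (1.0, lambda w: not w["misconfigured"]),
]


def world_weight(world):
    """Unnormalized weight: exp of the summed weights of satisfied formulas."""
    return math.exp(sum(wt for wt, formula in FORMULAS if formula(world)))


def conditional_probability(query, evidence):
    """P(query is True | evidence), computed by enumerating all worlds."""
    numerator = 0.0
    denominator = 0.0
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        world = dict(zip(ATOMS, values))
        # Skip worlds that contradict the observed evidence.
        if any(world[atom] != value for atom, value in evidence.items()):
            continue
        weight = world_weight(world)
        denominator += weight
        if world[query]:
            numerator += weight
    return numerator / denominator


if __name__ == "__main__":
    observed = {"peers_disagree": True, "recently_changed": False}
    print(conditional_probability("misconfigured", observed))
```

Because the rules are soft, a node whose peers disagree with it is judged more likely, but not certain, to be misconfigured; changing a weight shifts that probability without rewriting the rule.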
  • File Access Permissions
    • Comparisons required to ensure proper permissions for both security and access include majority rule, most restrictive, and time-based differences.
    • A statistical approach to solving this issue takes known factors into account and weights them as appropriate, allowing us to minimize uncertainty and determine the most valid option.
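One way to picture combining these three comparisons is a weighted score per candidate permission mode. The sketch below is an assumption-laden illustration, not the authors' method: the function names, the default weights, the linear scoring rule, and the choice to treat more recently modified settings as more credible are all hypothetical.

```python
from collections import Counter


def restrictiveness(mode):
    """Number of the nine rwx permission bits that are NOT granted."""
    return 9 - bin(mode & 0o777).count("1")


def score_modes(reports, w_majority=1.0, w_restrictive=0.5, w_recent=0.5):
    """Score each candidate mode from peer reports.

    reports: list of (mode, mtime) tuples, one per peer (hypothetical format).
    Returns a dict mapping mode -> combined score.
    """
    counts = Counter(mode for mode, _ in reports)
    newest = max(mtime for _, mtime in reports)
    oldest = min(mtime for _, mtime in reports)
    span = (newest - oldest) or 1  # avoid dividing by zero when all times match

    scores = {}
    for mode, count in counts.items():
        majority = count / len(reports)           # majority rule
        restrictive = restrictiveness(mode) / 9   # most restrictive
        latest = max(t for m, t in reports if m == mode)
        recent = (latest - oldest) / span         # time-based difference (assumed: newer is better)
        scores[mode] = (w_majority * majority
                        + w_restrictive * restrictive
                        + w_recent * recent)
    return scores


def pick_mode(reports, **weights):
    """Return the mode with the highest combined score."""
    scores = score_modes(reports, **weights)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    peer_reports = [(0o644, 100), (0o644, 100), (0o644, 110), (0o666, 200)]
    print(oct(pick_mode(peer_reports)))
```

With the default weights the majority and more restrictive mode wins; raising `w_recent` lets a single recent change override the majority, which shows how the weighting, rather than any one rule, decides the outcome.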
  • Processor Heat Analysis
    • Determine whether a processor is overheating by comparing its temperature with the temperatures reported by neighboring nodes and by nodes in the same rack location in neighboring racks.
    • Nodes toward the middle and top tend to get hotter than nodes toward the outside and bottom.
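The peer comparison described here can be sketched as follows. This is a simplified, deterministic illustration under stated assumptions: nodes are addressed by hypothetical `(rack, slot)` integer pairs, the peers are the adjacent slots in the same rack plus the same slot in adjacent racks, and the 10 °C margin is an arbitrary placeholder rather than a value from the presentation.

```python
from statistics import median


def peer_temps(temps, rack, slot):
    """Temperatures of adjacent slots in the same rack and of the
    same slot in the neighboring racks (only those that exist)."""
    candidates = [
        (rack, slot - 1), (rack, slot + 1),   # neighbors within the rack
        (rack - 1, slot), (rack + 1, slot),   # same position, adjacent racks
    ]
    return [temps[key] for key in candidates if key in temps]


def is_overheating(temps, rack, slot, margin_c=10.0):
    """Flag a node whose temperature exceeds the median of its peers
    by more than margin_c degrees Celsius (margin is an assumed value)."""
    peers = peer_temps(temps, rack, slot)
    if not peers:
        return False  # nothing to compare against
    return temps[(rack, slot)] - median(peers) > margin_c


if __name__ == "__main__":
    # Three racks of three slots; higher slots run warmer by design.
    readings = {(r, s): 50.0 + 5.0 * s for r in range(3) for s in range(3)}
    readings[(1, 1)] = 75.0  # one node running hot
    print(is_overheating(readings, 1, 1))
```

Comparing against the same slot in neighboring racks partly accounts for the expected position effect, since a top-of-rack node is judged mainly against other top-of-rack nodes rather than against a fixed global threshold.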
  • So, what have we learned so far?
    • The ability to diagnose and recover from performance and configuration issues without resorting to a centralized knowledge base is the next great stride in allowing systems to self-manage their reliability and stability.
    • Preliminary results show this is a good approach to using logic for probabilistic model-based diagnosis.
    • The results are promising, especially for such a radical change in the approach to system management, but production deployment requires further refinement in order to obtain statistically significant results.

Research partially supported by the ARL MSRC (GSA Contract GS00T99ALD0209) & Raytheon
