
Configuring Resources for the Grid

Jerry Perez

Senior Administrator

Texas Tech University


Outline

  • What is a Job Manager?

  • Types of Job Managers

  • PBS Pro

  • SGE

  • LSF

  • Condor/Condor-DAGman

  • Rocks + Rolls (Quick overview)

What is a Job Manager?

  • A Job Management System is a software component that ensures:

  • Balanced use of cluster resources.

  • Fair allocation of those resources to users' jobs.

  • Sound decisions about which jobs to run, and when and where to run them.

What is a Job Manager?

Components of a Job Manager

  • Resource Management System

    • a process that maintains the current state of all the resources under its control, including the physical resources of the cluster and account information such as relative priorities and account balances.

  • Queuing System

    • a process that maintains the current state of jobs submitted but not completed.

  • Scheduler

    • a system that assigns jobs to resources.

Why do we need a Job Manager?

A Job Management System should always be used for a cluster that is:

  • Operated as a public resource.

  • Shared by a large number of users, or by users who don't know each other.

  • Built from a large number of nodes and processors, and running a large number of jobs.

  • Heterogeneous in terms of memory, speed, number of processors, software licenses, networking, and other features.

    Note: Most clusters are homogeneous with respect to hardware and software.

Types of Job Managers



PBS Pro

PBS Pro is made up of a number of components:

  • A server daemon and clients, such as the user commands.

  • The server manages a number of different objects, such as queues and jobs.

  • Each object consists of a number of data items, or attributes.

  • Scheduling is policy-based and, by default, operates in a FIFO/round-robin fashion.

  • Specific queues can be configured for priority queuing.

  • Queue and scheduler configuration is minimal.
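A PBS Pro job is typically described by a shell script whose `#PBS` directives set the job's attributes. The sketch below is illustrative only: the queue name `workq`, the resource request, and the application binary `my_app` are all assumptions, not part of the original slides.

```shell
#!/bin/sh
# Illustrative PBS Pro submission script (queue name and paths are assumptions).
#PBS -N demo_job            # job name (an attribute of the job object)
#PBS -q workq               # target queue (assumed queue name)
#PBS -l nodes=2:ppn=2       # request 2 nodes with 2 processors each
#PBS -l walltime=01:00:00   # run limit of 1 hour
cd "$PBS_O_WORKDIR"         # PBS sets this to the directory qsub was run from
./my_app                    # hypothetical application binary
```

Such a script would be submitted with `qsub script.sh`, after which `qstat` shows the job's state in the queue.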


PBS Pro Graphical User Interface


SGE – Sun Grid Engine

  • In SGE version 6, a single queue can span more than one execution host, providing a multiple-hosts-per-queue configuration.

  • Uses concept of SGE Master node controlling “pools” of compute clients.

  • Can manage up to 10,000 clients per SGE Master node.

  • SGE can provide Load Leveling on the fly.

  • Scheduling can be policy based or topologically based.

  • Addresses the “Backfill” problem. (More on that later.)

  • Queue optimization is not automatic. It requires “tuning”.
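SGE jobs are submitted much like PBS jobs, with `#$` directives embedded in a shell script. This is a hedged sketch: the queue name `all.q`, the parallel environment name `mpich`, and the binary `my_app` are assumptions about a particular site's configuration.

```shell
#!/bin/sh
# Illustrative SGE submission script (queue and PE names are assumptions).
#$ -N demo_job    # job name
#$ -q all.q       # target queue (assumed; an SGE 6 queue may span multiple hosts)
#$ -pe mpich 4    # request 4 slots from an assumed "mpich" parallel environment
#$ -cwd           # run the job from the submission directory
./my_app          # hypothetical application binary
```

The script would be submitted with `qsub script.sh`.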

SGE - Basic Cluster Configuration

  • Configured to reflect site dependencies and to influence batch system behavior.

  • Site dependencies include valid paths for programs such as mail or xterm.

  • A global configuration is provided for the Master Host as well as for every host in the grid engine system pool.

  • Can configure the system to use a configuration local to each host to override particular entries in the global configuration.
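The global and per-host configurations described above can be inspected and modified with SGE's `qconf` utility; the host name `node01` below is a hypothetical example.

```shell
qconf -sconf          # show the global cluster configuration
qconf -sconf node01   # show the local configuration overriding entries on host node01
qconf -mconf          # edit the global configuration in $EDITOR
qconf -mconf node01   # edit node01's local configuration
```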

SGE – Cluster Configuration GUI

SGE – Host Configuration GUI

LSF – Load Sharing Facility


  • Scheduling can be policy based or topologically based.

  • Queue optimization is not automatic. It requires “tuning”.

  • Topologically based scheduling can use load information to schedule jobs.

  • Addresses the “Backfill” problem.

  • Jobs in a backfill queue cannot be preempted: a job in a backfill queue may be running in a reserved job slot, and starting a new job in that slot could delay the start of the large parallel job. Consequently:

  • A backfill queue cannot be made preemptable.

  • A preemptive queue with higher priority than the backfill queue cannot preempt jobs in the backfill queue.

LSF - How backfilling works

  • LSF assumes that a job will run until its run limit expires.

  • Backfill scheduling works most efficiently when all the jobs in the cluster have a run limit.

  • Since jobs with a shorter run limit have a better chance of being scheduled as backfill jobs, users who specify appropriate run limits in a backfill queue are rewarded with improved turnaround time.
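A run limit can be attached to an individual job at submission time with `bsub -W`, which is what lets the scheduler fit the job into a reserved slot's idle window. The queue name `backfill` and the binary `my_app` are assumptions about the site's setup.

```shell
# Submit a short job with a 10-minute run limit to an assumed "backfill" queue.
bsub -q backfill -W 10 -n 4 ./my_app   # -W: run limit in minutes; -n: 4 processors
```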



LSF – Cluster Monitoring GUI


Condor

Condor provides:

  • A job queuing mechanism

  • A scheduling policy

  • A priority scheme

  • Resource monitoring

  • Resource management

How it works:

  • Users submit their serial or parallel jobs to Condor.

  • Condor places them into a queue.

  • Condor chooses when and where to run the jobs based upon a policy.

  • It carefully monitors their progress and informs the user upon completion.

  • Out of the box, Condor uses FIFO/round-robin scheduling.

  • It can also use attribute-based scheduling.

  • Condor can be used to build Grid-style computing environments that cross administrative boundaries.

  • Condor's "flocking" technology allows multiple Condor compute installations to work together.

  • Condor incorporates many of the emerging Grid-based computing methodologies and protocols.

  • For instance, Condor-G is fully interoperable with resources managed by Globus.
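The submission workflow above is driven by a submit description file passed to `condor_submit`. The file below is an illustrative sketch; the executable name and file names are assumptions.

```
# demo.sub -- illustrative Condor submit description file
universe   = vanilla     # universe for ordinary serial jobs
executable = my_app      # hypothetical application binary
arguments  = input.dat
output     = demo.out    # where the job's stdout is written
error      = demo.err    # where the job's stderr is written
log        = demo.log    # Condor's event log for this job
queue                    # place one copy of the job in the queue
```

It would be submitted with `condor_submit demo.sub`, and `condor_q` then shows the job in the queue.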


Condor-DAGMan

  • DAGMan (Directed Acyclic Graph Manager) is a meta-scheduler for Condor. It manages dependencies between jobs at a higher level than the Condor scheduler.

  • DAGMan is responsible for scheduling, recovery, and reporting for the set of programs submitted to Condor.
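Job dependencies are declared in a DAG input file, where each node names a Condor submit description file and `PARENT`/`CHILD` lines express the ordering. The job names and submit-file names below are illustrative.

```
# diamond.dag -- illustrative DAGMan input file
JOB  A  a.sub        # each JOB line names a Condor submit description file
JOB  B  b.sub
JOB  C  c.sub
JOB  D  d.sub
PARENT A CHILD B C   # B and C run only after A completes
PARENT B C CHILD D   # D runs only after both B and C complete
```

The DAG would be run with `condor_submit_dag diamond.dag`.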

Rocks + Rolls


The complexity of cluster management (e.g., determining if all nodes have a consistent set of software) often overwhelms part-time cluster administrators, who are usually domain application scientists.

Rocks is a complete clustering solution with a goal to help deliver the computational power of clusters to a wide range of scientific users.

Rocks + Rolls

  • Before you install Rocks, be sure you have decided what Rolls you wish to include in your installations.

  • You may install whichever Rolls you like; however, you can choose only one scheduler: LSF, SGE, PBS, or Condor.

  • Multiple schedulers running on the same cluster conflict over resources.

Rocks + Rolls

  • Required Rolls:

  • Base

  • Hpc

  • Kernel

  • Web-server

Rocks + Rolls

  • List of various rolls:

  • Area51 - system security related services and utilities

  • Ganglia - cluster monitoring system from UCB

  • Grid - Globus 4.0.1 (GT4)

  • Condor - Condor roll

  • Java - Sun Java SDK and JVM

  • Myrinet - Myricom’s Myrinet drivers and MPICH environments

  • Pbs - PBS job queueing system

  • Ninf - Ninf-G, a simple yet powerful client-server-based standard RPC mechanism

  • Sge - Sun Grid Engine job queueing system

  • Viz - support for building visualization clusters

  • LSF - comes with Platform Rocks
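On an installed front end, the Rolls that were selected at install time can be checked with the `rocks` command-line tool (assuming a Rocks version that provides it; output format varies by release):

```shell
# List the Rolls installed on this Rocks front end (requires a Rocks front end).
rocks list roll
```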

Rocks + Rolls

Thank You.

