Maximizing Goodput via Co-scheduling of CPU and Network Capacity

Miron Livny

Computer Sciences Department

University of Wisconsin-Madison

[email protected]

(joint work with Jim Basney)


Allocated CPU hours per user (6/21/98 - 9/3/98)

400,000 CPU hours in 73 days on 320 desktop machines of the UW-CS Condor pool (~17 hours per day per machine)
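The per-machine figure follows directly from the totals quoted on this slide. A quick check, with variable names chosen here only for illustration:

```python
# Figures quoted on the slide.
cpu_hours = 400_000
days = 73
machines = 320

# Average allocated hours per machine per day.
hours_per_machine_day = cpu_hours / (days * machines)
print(round(hours_per_machine_day, 1))  # 17.1
```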


Remote Execution Challenge

[Figure: diagram labeled Remote Resource (Memory, CPU, File System), Network, and Customer File System* (Executable, Checkpoint, Input Files, Output Files).]

*May be distributed.


How useful is the allocated Time?

[Figure: timeline of one allocation from Allocate to Preempt, with an X marker and the overhead intervals Placement, Periodic Ckpt (twice), Preempt Ckpt, Remote I/O, and Wait and See.]

Goodput = Allocation - Overhead


Goodput is the portion of the allocation time during which the application makes forward progress.

Overhead = Placement + Migration + Periodic Checkpoints + Remote I/O + Wait and See
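The definition above can be sketched as a small calculation. The class, field names, and sample numbers below are assumptions made for illustration; they are not taken from Condor.

```python
from dataclasses import dataclass


@dataclass
class Overheads:
    """Hours of one allocation lost to each overhead named on the slide."""
    placement: float = 0.0
    migration: float = 0.0
    periodic_checkpoints: float = 0.0
    remote_io: float = 0.0
    wait_and_see: float = 0.0

    def total(self) -> float:
        return (self.placement + self.migration + self.periodic_checkpoints
                + self.remote_io + self.wait_and_see)


def goodput(allocation_hours: float, overheads: Overheads) -> float:
    """Goodput = Allocation - Overhead, floored at zero."""
    return max(0.0, allocation_hours - overheads.total())


# Hypothetical numbers, chosen only to illustrate the arithmetic.
example = Overheads(placement=0.25, migration=0.25, periodic_checkpoints=0.5,
                    remote_io=0.5, wait_and_see=0.5)
print(goodput(10.0, example))  # 8.0
```

Flooring at zero is a choice of this sketch: an allocation consumed entirely by overhead contributes no goodput rather than a negative amount.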


Placement

  • What: Transfer executable and checkpoint data

  • How much: Known in advance.

    • Executable: usually small

    • Checkpoint: application memory image

      • Can be large! (100MB+)

      • May include cached input data and intermediate file data

  • When: Triggered by Resource Manager when CPU is allocated


Migration

  • What: Transfer Checkpoint Data to file system or a hot standby.

  • How much: Known in advance.

    • Workstation owner may limit time to migrate

    • Failure results in lost work

  • When: Initiated by workstation owner or triggered by Resource Manager to enforce priority order


Remote I/O

  • What: Application Input/Output data

    • Read input files.

    • Write intermediate results.

    • Read intermediate results.

    • Write final results.

  • How much: Application may know/tell.

  • When: Initiated by application read and write system calls during run.


Periodic Checkpoint

  • What: Transfer Checkpoint Data to file system.

  • How much: Known in advance.

  • When: Scheduled in advance by shadow.

    • Reduces risk in case of a failed migration.

    • No deadline.

    • All remote resources are available.


Wait and See

  • What: Suspend application when resource is revoked

    • Wait and See if resource will become available shortly.

    • Shortens migration time limit.

    • Consumes local resources.

  • When: Initiated by owner activity

  • How long: Upper bound set by resource owner.


High Throughput Computing Layers

  • Application

  • Application Agent

  • Customer Agent

  • Environment Agent

  • Owner Agent

  • Local Resource Management

  • Resource


Who Does What in the Condor Environment?

  • Matchmaker

    • Initiates allocations

    • Preempts (re-matches) to transfer allocation to higher priority customer.

  • Checkpoint Server(s)

    • Store checkpoints (may include data files).

  • File system (Unix, NFS, AFS)

    • Stores files.


Who does what?

  • Shadow: Application Resource Manager

    • Application-level scheduling

    • Acts as a proxy for the application in the submit environment.

  • Owner Agent: Controls opportunistic resource

    • Owner may preempt application at any time.

    • Owner controls preemption policy.


Approaches for Maximizing Goodput

  • Co-matching (scheduling) of network, server, and CPU resources (matchmaker).

  • Support high priority data transfers to/from checkpoint servers. (checkpoint server)

  • Localized checkpointing (shadow).


… Approaches for Maximizing Goodput

  • Plan in advance for pre-scheduled events (external scheduler).

  • Reduce size of data to be transferred (checkpoint server and remote resource).

  • Monitor system goodput (all).


Challenges

  • Develop an effective model of the network and I/O capabilities of a Condor pool.

  • Obtain the information needed to build such a model.

  • Add co-matching of ClassAds to the matchmaking framework.

  • Develop a multi-resource consumption based priority scheme.


Matchmaker Co-matching

  • Problem: Bursty matchmaking causes network or server saturation

    • increases placement and checkpoint costs

    • slow placement results in underutilized CPUs

    • results in failed migrations

  • Approach: Don’t allow new matches to exceed predefined usage thresholds
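The threshold approach above might look like the following sketch. It assumes each request declares how much data its placement will move; the `Request` type, its `transfer_mb` field, and the single megabyte budget are hypothetical and are not part of the Condor matchmaker.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    transfer_mb: float  # executable + checkpoint to place (hypothetical field)


def admit_matches(requests, in_flight_mb, threshold_mb):
    """Admit requests in order while projected transfer load stays within the threshold.

    Returns (admitted, deferred). Deferred requests wait for a later
    matchmaking cycle instead of saturating the network or server.
    """
    admitted, deferred = [], []
    load = in_flight_mb
    for req in requests:
        if load + req.transfer_mb <= threshold_mb:
            admitted.append(req)
            load += req.transfer_mb
        else:
            deferred.append(req)
    return admitted, deferred


pending = [Request("a", 100), Request("b", 300), Request("c", 50)]
admitted, deferred = admit_matches(pending, in_flight_mb=200, threshold_mb=400)
print([r.name for r in admitted], [r.name for r in deferred])  # ['a', 'c'] ['b']
```

A real matchmaker would presumably track thresholds per subnet and per checkpoint server; one global budget keeps the sketch short.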


… Matchmaker Co-matching

  • Application requests an allocation which provides the best possible goodput

    • large data and checkpoint files require high bandwidth to checkpoint server.

    • balance cost of application placement and checkpoint overheads with (estimated) allocation time.
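One way to picture this balance is to score each offered allocation by its estimated goodput. The cost model below (one placement transfer plus one final checkpoint, both at the offer's bandwidth) and the offer fields `alloc_s` and `bandwidth_mb_s` are assumptions made for this sketch.

```python
def estimated_goodput_s(alloc_s, exe_mb, ckpt_mb, bandwidth_mb_s):
    """Estimated useful seconds: allocation minus placement and one final checkpoint."""
    placement_s = (exe_mb + ckpt_mb) / bandwidth_mb_s
    vacate_ckpt_s = ckpt_mb / bandwidth_mb_s
    return max(0.0, alloc_s - placement_s - vacate_ckpt_s)


def best_offer(offers, exe_mb, ckpt_mb):
    """Pick the offer with the highest estimated goodput."""
    return max(
        offers,
        key=lambda o: estimated_goodput_s(
            o["alloc_s"], exe_mb, ckpt_mb, o["bandwidth_mb_s"]),
    )


# Hypothetical offers: a short allocation on a fast link versus a longer
# allocation behind a slow link to the checkpoint server.
offers = [
    {"name": "fast-short", "alloc_s": 3600, "bandwidth_mb_s": 10.0},
    {"name": "slow-long", "alloc_s": 7200, "bandwidth_mb_s": 0.05},
]
print(best_offer(offers, exe_mb=4, ckpt_mb=100)["name"])  # fast-short
```

With a 100 MB checkpoint, the slow link spends over an hour on transfers, so the nominally longer allocation yields less estimated goodput than the shorter one.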


… Matchmaker Co-matching

  • Best Fit vs. First Fit

    • Match lower priority requests with smaller network requirements first to increase cluster CPU utilization

    • Preempt one of these requests when you match a high priority request with a large network requirement.


Checkpoint Server support

  • Prioritize data streams

    • high priority: migration streams

    • low priority: checkpoint read and periodic checkpoint write streams

  • Schedule periodic checkpoints in advance to avoid bursts of network traffic.

  • Schedule graceful shutdowns in advance to avoid vacate failures.
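Stream prioritization of this kind can be sketched with a priority queue. The class name, stream kinds, and numeric priority values are illustrative assumptions; the slide specifies only that migration streams rank above checkpoint reads and periodic checkpoint writes.

```python
import heapq
import itertools

# Lower number = served first. The ordering follows the slide; the numeric
# values themselves are assumptions of this sketch.
PRIORITY = {
    "migration": 0,
    "checkpoint_read": 1,
    "periodic_checkpoint_write": 1,
}


class StreamQueue:
    """Serve pending transfer requests by priority, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserving arrival order

    def submit(self, kind, job_id):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), kind, job_id))

    def next_stream(self):
        _, _, kind, job_id = heapq.heappop(self._heap)
        return kind, job_id


q = StreamQueue()
q.submit("periodic_checkpoint_write", "job1")
q.submit("checkpoint_read", "job2")
q.submit("migration", "job3")
print(q.next_stream())  # ('migration', 'job3')
```

The migration stream is served first even though it arrived last, which is the point: a migration has a deadline set by the workstation owner, while periodic checkpoints do not.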


Shadow support

  • Choose most efficient data access method per file

    • Locate checkpoint and file servers

  • Schedule periodic checkpoints in advance.


Minimize Data Size

  • compress checkpoints.

  • only checkpoint changes (diffs).

  • data staging.

  • checkpoint staging.

    • write checkpoint to local file system and schedule transfer when resources are available
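Compression and diff-only checkpoints can be combined, as in this sketch: hash fixed-size blocks of the memory image and send only the changed blocks, compressed. The block size, function names, and the choice of SHA-256 and zlib are assumptions for illustration and do not describe Condor's checkpoint format.

```python
import hashlib
import zlib

BLOCK = 4096  # block size in bytes; an arbitrary choice for this sketch


def block_hashes(image: bytes):
    """Hash each fixed-size block of a checkpoint image."""
    return [hashlib.sha256(image[i:i + BLOCK]).digest()
            for i in range(0, len(image), BLOCK)]


def checkpoint_delta(prev_hashes, image: bytes):
    """Return {block_index: compressed_block} for blocks that differ from prev_hashes."""
    delta = {}
    for idx, digest in enumerate(block_hashes(image)):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * BLOCK
            delta[idx] = zlib.compress(image[start:start + BLOCK])
    return delta


# Hypothetical 16 KB image in which a single byte changes between checkpoints.
old = bytes(BLOCK * 4)
new = bytearray(old)
new[BLOCK + 10] = 1
delta = checkpoint_delta(block_hashes(old), bytes(new))
print(sorted(delta))  # [1]
```

Only block 1 is transferred. The sketch does not handle an image that shrinks between checkpoints; a real scheme would also need to record the new length.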


Goodput Measurements

  • Goodput/Allocation ratio measures health of the system

    • detect problem resources

    • detect overloaded subnets

    • measure QOS per application

  • Checkpoint transfer statistics measure network usage

    • success rate

    • throughput
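The goodput/allocation ratio could be monitored along these lines. The record layout, the 0.5 cutoff, and the workstation names are hypothetical; the slide names the ratio but not a threshold.

```python
def goodput_ratio(goodput_s, allocation_s):
    """Fraction of allocated time spent on forward progress."""
    return goodput_s / allocation_s if allocation_s > 0 else 0.0


def flag_problem_resources(records, min_ratio=0.5):
    """records: iterable of (resource, allocation_s, goodput_s) tuples.

    Returns the resources whose aggregate goodput/allocation ratio
    falls below min_ratio, sorted by name.
    """
    totals = {}
    for resource, alloc, good in records:
        a, g = totals.get(resource, (0.0, 0.0))
        totals[resource] = (a + alloc, g + good)
    return sorted(r for r, (a, g) in totals.items()
                  if goodput_ratio(g, a) < min_ratio)


# Hypothetical allocation records.
records = [
    ("ws1", 3600, 3400),
    ("ws2", 3600, 900),
    ("ws2", 1800, 300),
    ("ws3", 600, 550),
]
print(flag_problem_resources(records))  # ['ws2']
```

Grouping the same records by subnet or by application instead of by resource would give the overloaded-subnet and per-application QOS views listed above.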


Very Large Objects on the Network
