Maximizing Goodput via Co-scheduling of CPU and Network Capacity

Miron Livny

Computer Sciences Department

University of Wisconsin-Madison

[email protected]

(joint work with Jim Basney)


Allocated CPU Hours per User (6/21/98 - 9/3/98)

400,000 CPU hours in 73 days on 320 desktop machines of the UW-CS Condor pool (~17 hours per day per machine)
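The per-machine figure follows directly from the three numbers on the slide. A quick check, using only those quoted values (the variable names are illustrative):

```python
# Figures quoted on the slide.
cpu_hours = 400_000
days = 73
machines = 320

# Average allocated hours per machine per day.
hours_per_machine_day = cpu_hours / (days * machines)
print(f"{hours_per_machine_day:.1f} hours per day per machine")
```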


Remote Execution Challenge

[Diagram: a Remote Resource (CPU, Memory, File System) connected over the Network to the Customer File System*, which holds the Executable, Checkpoint, Input Files, and Output Files.]

*May be distributed.

How Useful Is the Allocated Time?

[Timeline diagram: an allocation running from Allocate to Preempt, marked with the intervals Placement, Periodic Ckpt (twice), Remote I/O, Wait and See, and Preempt Ckpt.]

Goodput = Allocation - Overhead

Goodput is the allocation time during which the application makes forward progress.

Overhead = Placement + Migration + Periodic Checkpoints + Remote I/O + Wait and See
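The two definitions can be combined into a small calculation. The sketch below is illustrative only: the `Overhead` class, its field names, the sample numbers, and the clamp at zero are assumptions, not part of Condor.

```python
from dataclasses import dataclass


@dataclass
class Overhead:
    """Overhead components named on the slide, all in seconds."""
    placement: float = 0.0
    migration: float = 0.0
    periodic_checkpoints: float = 0.0
    remote_io: float = 0.0
    wait_and_see: float = 0.0

    def total(self) -> float:
        return (self.placement + self.migration + self.periodic_checkpoints
                + self.remote_io + self.wait_and_see)


def goodput(allocation: float, overhead: Overhead) -> float:
    """Goodput = Allocation - Overhead (clamped at zero as an assumption)."""
    return max(0.0, allocation - overhead.total())


# Hypothetical one-hour allocation.
example = Overhead(placement=120, migration=90, periodic_checkpoints=180,
                   remote_io=60, wait_and_see=150)
print(goodput(3600, example))         # seconds of forward progress
print(goodput(3600, example) / 3600)  # goodput/allocation ratio
```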

Placement

  • What: Transfer executable and checkpoint data

  • How much: Known in advance.

    • Executable: usually small

    • Checkpoint: application memory image

      • Can be large! (100MB+)

      • May include cached input data and intermediate file data

  • When: Triggered by Resource Manager when CPU is allocated

Migration

  • What: Transfer Checkpoint Data to file system or a hot standby.

  • How much: Known in advance

    • Workstation owner may limit time to migrate

    • Failure results in lost work

  • When: Initiated by workstation owner or triggered by Resource Manager to enforce priority order

Remote I/O

  • What: Application Input/Output data

    • Read input files.

    • Write intermediate results.

    • Read intermediate results.

    • Write final results.

  • How much: Application may know/tell.

  • When: Initiated by application read and write system calls during run.

Periodic Checkpoint

  • What: Transfer Checkpoint Data to file system.

  • How much: Known in advance.

  • When: Scheduled in advance by shadow.

    • Reduces risk in case of a failed migration.

    • No deadline.

    • All remote resources are available.

Wait and See

  • What: Suspend application when resource is revoked

    • Wait and See if resource will become available shortly.

    • Shortens migration time limit.

    • Consumes local resources.

  • When: Initiated by owner activity

  • How long: Upper bound set by resource owner.


High Throughput Computing Layers

  • Application

  • Application Agent

  • Customer Agent

  • Environment Agent

  • Owner Agent

  • Local Resource Management

  • Resource

Who Does What in the Condor Environment?

  • Matchmaker

    • Initiates allocations

    • Preempts (re-matches) to transfer allocation to higher priority customer.

  • Checkpoint Server(s)

    • Store checkpoints (may include data files).

  • File system (Unix, NFS, AFS)

    • Stores files.

Who Does What?

  • Shadow: Application Resource Manager

    • Application-level scheduling

    • Acts as a proxy for the application in the submit environment.

  • Owner Agent: Controls opportunistic resource

    • Owner may preempt application at any time.

    • Owner controls preemption policy.

Approaches for Maximizing Goodput

  • Co-matching (scheduling) of network, server, and CPU resources (matchmaker).

  • Support high priority data transfers to/from checkpoint servers. (checkpoint server)

  • Localized checkpointing (shadow).

… Approaches

  • Plan in advance for pre-scheduled events (external scheduler).

  • Reduce size of data to be transferred (checkpoint server and remote resource).

  • Monitor system goodput (all).

Challenges

  • Develop an effective model of the network and I/O capabilities of a Condor pool.

  • Obtain the information needed to build such a model.

  • Add co-matching of ClassAds to the matchmaking framework.

  • Develop a multi-resource, consumption-based priority scheme.

Matchmaker Co-matching

  • Problem: Bursty matchmaking causes network or server saturation

    • increases placement and checkpoint costs

    • slow placement results in underutilized CPUs

    • results in failed migrations

  • Approach: Don’t allow new matches to exceed predefined usage thresholds
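The threshold approach can be sketched as an admission check that runs before each new match. Everything here is hypothetical: the `Request` type, its `transfer_mb` field, and the single shared threshold are assumptions for illustration, not the Condor matchmaker's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    transfer_mb: float  # executable + checkpoint to move at placement (hypothetical field)


def admit_matches(requests, committed_mb, threshold_mb):
    """Admit requests in order while committed network load stays within the threshold."""
    admitted, deferred = [], []
    for req in requests:
        if committed_mb + req.transfer_mb <= threshold_mb:
            committed_mb += req.transfer_mb
            admitted.append(req)
        else:
            # Deferred requests wait for a later matchmaking cycle.
            deferred.append(req)
    return admitted, deferred


requests = [Request("a", 100), Request("b", 300), Request("c", 50)]
admitted, deferred = admit_matches(requests, committed_mb=200, threshold_mb=400)
print([r.name for r in admitted], [r.name for r in deferred])
```

Deferring a match rather than rejecting it spreads a burst of placements over several cycles, which is the effect the slide is after.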

… Matchmaker Co-matching

  • Application requests an allocation which provides the best possible goodput

    • large data and checkpoint files require high bandwidth to checkpoint server.

    • balance cost of application placement and checkpoint overheads with (estimated) allocation time.
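One way to picture this balance is to estimate, for each candidate resource, the fraction of the allocation left after data transfers. This is a rough sketch under stated assumptions: the cost model (one placement read plus one checkpoint write of the same size), the offer names, and all numbers are invented for illustration.

```python
def estimated_goodput_fraction(data_mb, bandwidth_mb_s, allocation_s):
    """Fraction of the estimated allocation left after transfer overhead."""
    transfer_s = data_mb / bandwidth_mb_s
    overhead_s = 2 * transfer_s  # assumed: placement read plus one checkpoint write
    return max(0.0, allocation_s - overhead_s) / allocation_s


# Hypothetical offers: name -> (bandwidth to checkpoint server in MB/s,
#                               estimated allocation time in seconds).
offers = {"fast-link": (10.0, 1800), "slow-link": (1.0, 3600)}
checkpoint_mb = 200

best = max(offers, key=lambda name: estimated_goodput_fraction(checkpoint_mb, *offers[name]))
print(best)
```

With a 200 MB checkpoint, the shorter allocation on the faster link wins in this example because transfer overhead dominates on the slow link.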

… Matchmaker Co-matching

  • Best Fit vs. First Fit

    • Match lower priority requests with smaller network requirements first to increase cluster CPU utilization

    • Preempt one of these requests when you match a high priority request with a large network requirement.
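The two bullets above can be sketched as a pair of functions. This is an illustration, not the matchmaker's algorithm: the `Req` type, the "larger number means higher priority" convention, and the single bandwidth capacity are assumptions. The slide speaks of preempting one request; this sketch preempts as many low-priority requests as needed to make room.

```python
from dataclasses import dataclass


@dataclass
class Req:
    name: str
    priority: int    # larger = more important (assumed convention)
    net_mb_s: float  # network bandwidth the request needs


def match_small_first(pending, capacity):
    """Admit pending requests in order of increasing network need."""
    running, used = [], 0.0
    for r in sorted(pending, key=lambda r: r.net_mb_s):
        if used + r.net_mb_s <= capacity:
            running.append(r)
            used += r.net_mb_s
    return running


def preempt_for(high, running, capacity):
    """Preempt lowest-priority running requests until `high` fits."""
    running = sorted(running, key=lambda r: r.priority)
    preempted = []
    used = sum(r.net_mb_s for r in running)
    while (running and used + high.net_mb_s > capacity
           and running[0].priority < high.priority):
        victim = running.pop(0)
        preempted.append(victim)
        used -= victim.net_mb_s
    if used + high.net_mb_s <= capacity:
        running.append(high)
    return running, preempted


pending = [Req("a", 1, 2), Req("b", 1, 3), Req("c", 2, 4), Req("d", 1, 8)]
running = match_small_first(pending, capacity=10)
running, preempted = preempt_for(Req("h", 5, 6), running, capacity=10)
print([r.name for r in running], [r.name for r in preempted])
```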

Checkpoint Server Support

  • Prioritize data streams

    • high priority: migration streams

    • low priority: checkpoint read and periodic checkpoint write streams

  • Schedule periodic checkpoints in advance to avoid bursts of network traffic.

  • Schedule graceful shutdowns in advance to avoid vacate failures.
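Stream prioritization could be modeled as a priority queue on the checkpoint server. The sketch below uses Python's `heapq`; the `TransferQueue` class, the stream-kind names, and the numeric priorities are hypothetical, chosen only to mirror the two classes on the slide.

```python
import heapq
import itertools

# Lower number = served first. The two classes follow the slide; the numeric
# values and key names are illustrative.
PRIORITY = {"migration": 0, "checkpoint_read": 1, "periodic_checkpoint": 1}


class TransferQueue:
    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a class

    def submit(self, kind, job):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._order), kind, job))

    def next_transfer(self):
        _, _, kind, job = heapq.heappop(self._heap)
        return kind, job


q = TransferQueue()
q.submit("periodic_checkpoint", "job-1")
q.submit("checkpoint_read", "job-2")
q.submit("migration", "job-3")
order = [q.next_transfer() for _ in range(3)]
print(order)
```

The migration stream is served first even though it was submitted last, since a migration has a deadline and a failure loses work.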

Shadow Support

  • Choose most efficient data access method per file

    • Locate checkpoint and file servers

  • Schedule periodic checkpoints in advance.

Minimize Data Size

  • compress checkpoints.

  • only checkpoint changes (diffs).

  • data staging.

  • checkpoint staging.

    • write checkpoint to local file system and schedule transfer when resources are available
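The first two ideas, compression and checkpointing only the changes, can be combined. The sketch below is a toy model that treats a checkpoint as a byte string and diffs it page by page; the 4096-byte page size and the function names are assumptions, and real checkpoint formats are more involved.

```python
import zlib

PAGE = 4096  # assumed page size for the diff granularity


def changed_pages(old: bytes, new: bytes, page: int = PAGE):
    """Return {page_index: page_bytes} for pages that differ from the old image."""
    diff = {}
    for i in range(0, len(new), page):
        if new[i:i + page] != old[i:i + page]:
            diff[i // page] = new[i:i + page]
    return diff


def apply_diff(old: bytes, diff, new_len: int, page: int = PAGE) -> bytes:
    """Rebuild the new image from the old image plus the changed pages."""
    image = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for idx, data in diff.items():
        image[idx * page: idx * page + len(data)] = data
    return bytes(image)


# Four-page image in which a single byte on page 1 changes.
old = bytes(PAGE * 4)
modified = bytearray(old)
modified[PAGE + 10] = 7
new = bytes(modified)

diff = changed_pages(old, new)
compressed = {i: zlib.compress(p) for i, p in diff.items()}
restored = apply_diff(old, {i: zlib.decompress(c) for i, c in compressed.items()}, len(new))
print(sorted(diff), sum(len(c) for c in compressed.values()), restored == new)
```

Only the one changed page is transferred, and compressing it shrinks the payload further.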

Goodput Measurements

  • Goodput/Allocation ratio measures health of the system

    • detect problem resources

    • detect overloaded subnets

    • measure QOS per application

  • Checkpoint transfer statistics measure network usage

    • success rate

    • throughput
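A monitoring pass over allocation records might compute the goodput/allocation ratio per resource and flag the low ones. This is a minimal sketch: the record layout, the resource names, and the 0.5 cutoff are assumptions for illustration.

```python
from collections import defaultdict


def goodput_ratios(records):
    """records: iterable of (resource, allocated_s, goodput_s) tuples."""
    alloc = defaultdict(float)
    good = defaultdict(float)
    for resource, allocated_s, goodput_s in records:
        alloc[resource] += allocated_s
        good[resource] += goodput_s
    return {r: good[r] / alloc[r] for r in alloc if alloc[r] > 0}


def problem_resources(ratios, min_ratio=0.5):
    """Resources whose goodput/allocation ratio falls below the cutoff."""
    return sorted(r for r, ratio in ratios.items() if ratio < min_ratio)


# Hypothetical allocation log.
records = [("ws-01", 3600, 3300), ("ws-02", 3600, 900),
           ("ws-02", 1800, 450), ("ws-01", 1800, 1500)]
ratios = goodput_ratios(records)
print(ratios)
print(problem_resources(ratios))
```

The same aggregation keyed by subnet or by application would give the other two measurements on the slide.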



Very Large Objects on the Network
