
Maximizing Goodput via Co-scheduling Of CPU and Network Capacity

Maximizing Goodput via Co-scheduling Of CPU and Network Capacity. Miron Livny Computer Sciences Department University of Wisconsin-Madison miron@cs.wisc.edu (joint work with Jim Basney). Allocated CPU hours per user (6/21/98 - 9/3/98). 400,000 CPU hours in 73 days on





Presentation Transcript


  1. Maximizing Goodput via Co-scheduling of CPU and Network Capacity. Miron Livny, Computer Sciences Department, University of Wisconsin-Madison, miron@cs.wisc.edu (joint work with Jim Basney)

  2. Allocated CPU hours per user (6/21/98 - 9/3/98): 400,000 CPU hours in 73 days on 320 desktop machines of the UW-CS Condor pool (~17 hours per day per machine)
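
The "~17 hours per day per machine" figure on this slide follows from the other three numbers. A quick arithmetic check, using only the values quoted on the slide (the variable names are illustrative):

```python
# Values quoted on the slide.
cpu_hours = 400_000
days = 73
machines = 320

# Average CPU hours delivered per machine per day.
hours_per_machine_day = cpu_hours / (days * machines)
print(f"{hours_per_machine_day:.1f} hours per day per machine")  # prints "17.1 hours per day per machine"
```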

  3. Remote Execution Challenge [Diagram. Labels: Remote Resource: Memory, CPU, File System. Customer File System*: Executable, Checkpoint, Input Files, Output Files. Network.] *May be distributed.

  4. [Slide contains no transcribed text]

  5. How useful is the allocated time? [Timeline figure. Labels: Allocate, Preempt, X, Placement, Periodic Ckpt, Periodic Ckpt, Preempt Ckpt, Remote I/O, Wait and See.] Goodput = Allocation - Overhead

  6. Goodput is the allocation time during which the application makes forward progress. Overhead = Placement + Migration + Periodic Checkpoints + Remote I/O + Wait and See
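
The two definitions above (Goodput = Allocation - Overhead, with overhead as the sum of five components) can be sketched as a small calculation. This is a minimal illustration, not Condor code: the `Overhead` class, the `goodput` function, and the sample numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Overhead:
    """Overhead components named on the slide, all in seconds (hypothetical container)."""
    placement: float = 0.0
    migration: float = 0.0
    periodic_checkpoints: float = 0.0
    remote_io: float = 0.0
    wait_and_see: float = 0.0

    def total(self) -> float:
        return (self.placement + self.migration + self.periodic_checkpoints
                + self.remote_io + self.wait_and_see)


def goodput(allocation: float, overhead: Overhead) -> float:
    """Goodput = Allocation - Overhead, clamped at zero."""
    return max(0.0, allocation - overhead.total())


# Made-up numbers: a one-hour allocation with 600 s of total overhead.
example = goodput(3600.0, Overhead(120, 90, 200, 150, 40))
print(example)  # 3000.0
```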

  7. Placement
     • What: Transfer executable and checkpoint data
     • How much: Known in advance.
       • Executable: usually small
       • Checkpoint: application memory image
         • Can be large! (100MB+)
         • May include cached input data and intermediate file data
     • When: Triggered by Resource Manager when CPU is allocated

  8. Migration
     • What: Transfer checkpoint data to file system or a hot standby.
     • How much: Known in advance
     • Workstation owner may limit time to migrate
     • Failure results in lost work
     • When: Initiated by workstation owner or triggered by Resource Manager to enforce priority order
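
Because the checkpoint size is known in advance and the owner may limit the time to migrate, whether a migration can succeed reduces to a transfer-time estimate. The sketch below assumes a constant available bandwidth and decimal units (1 MB = 8 Mbit); the function names, bandwidths, and the 60-second limit are hypothetical. Only the 100 MB checkpoint size echoes the "100MB+" figure from the Placement slide.

```python
def transfer_seconds(size_mb: float, bandwidth_mbps: float) -> float:
    """Time to move size_mb megabytes at bandwidth_mbps megabits per second."""
    return size_mb * 8 / bandwidth_mbps


def migration_fits(size_mb: float, bandwidth_mbps: float, limit_s: float) -> bool:
    """True if the checkpoint can be written before the owner's time limit expires."""
    return transfer_seconds(size_mb, bandwidth_mbps) <= limit_s


# A 100 MB checkpoint with an assumed 60 s migration limit.
slow = migration_fits(100, 10, 60)    # 80 s needed at 10 Mbit/s: too slow
fast = migration_fits(100, 100, 60)   # 8 s needed at 100 Mbit/s: fits
print(slow, fast)  # False True
```

When the estimate exceeds the limit, the migration fails and the work since the last periodic checkpoint is lost.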

  9. Remote I/O
     • What: Application input/output data
       • Read input files.
       • Write intermediate results.
       • Read intermediate results.
       • Write final results.
     • How much: Application may know/tell.
     • When: Initiated by application read and write system calls during run.

  10. Periodic Checkpoint
     • What: Transfer checkpoint data to file system.
     • How much: Known in advance.
     • When: Scheduled in advance by shadow.
     • Reduces risk in case of a failed migration.
     • No deadline.
     • All remote resources are available.

  11. Wait and See
     • What: Suspend application when resource is revoked
     • Wait and see if resource will become available shortly.
     • Shortens migration time limit.
     • Consumes local resources.
     • When: Initiated by owner activity
     • How long: Upper bound set by resource owner.

  12. High Throughput Computing Layers [Layer diagram. Labels: Application, Application Agent, Customer Agent, Environment Agent, Owner Agent, Local Resource Management, Resource.]

  13. Who Does What in the Condor Environment?
     • Matchmaker
       • Initiates allocations
       • Preempts (re-matches) to transfer allocation to higher priority customer.
     • Checkpoint Server(s)
       • Store checkpoints (may include data files).
     • File system (Unix, NFS, AFS)
       • Stores files.

  14. Who does what?
     • Shadow: Application Resource Manager
       • Application-level scheduling
       • Acts as a proxy for the application in the submit environment.
     • Owner Agent: Controls opportunistic resource
       • Owner may preempt application at any time.
       • Owner controls preemption policy.

  15. Approaches for Maximizing Goodput
     • Co-matching (scheduling) of network, server and CPU resources. (matchmaker)
     • Support high priority data transfers to/from checkpoint servers. (checkpoint server)
     • Localized checkpointing. (shadow)

  16. … approach
     • Plan in advance for pre-scheduled events. (external scheduler)
     • Reduce size of data to be transferred. (checkpoint server and remote resource)
     • Monitor system goodput. (all)

  17. Challenges
     • Develop an effective model of the network and I/O capabilities of a Condor pool.
     • Obtain the information needed to build such a model.
     • Add co-matching of ClassAds to the matchmaking framework.
     • Develop a multi-resource consumption based priority scheme.

  18. Matchmaker Co-matching
     • Problem: Bursty matchmaking causes network or server saturation
       • increases placement and checkpoint costs
       • slow placement results in underutilized CPUs
       • results in failed migrations
     • Approach: Don’t allow new matches to exceed predefined usage thresholds
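
The threshold approach can be pictured as simple admission control: a new match is made only if the projected network usage stays within a predefined fraction of capacity. The following is a minimal sketch under that reading, not the Condor matchmaker's actual logic; `admit_matches`, the Mbit/s figures, and the 80% threshold are assumptions chosen for illustration.

```python
def admit_matches(requests, capacity_mbps, threshold_pct=80, in_use_mbps=0):
    """Admit (name, demand_mbps) requests in order while usage stays under the threshold.

    Requests that would push usage past the threshold are deferred to a later
    matchmaking cycle instead of being matched in the same burst.
    """
    budget = capacity_mbps * threshold_pct // 100
    admitted, deferred = [], []
    for name, demand in requests:
        if in_use_mbps + demand <= budget:
            admitted.append(name)
            in_use_mbps += demand
        else:
            deferred.append(name)
    return admitted, deferred


# Hypothetical burst of four requests against a 100 Mbit/s link (budget: 80 Mbit/s).
burst = [("a", 40), ("b", 30), ("c", 20), ("d", 5)]
admitted, deferred = admit_matches(burst, capacity_mbps=100)
print(admitted, deferred)  # ['a', 'b', 'd'] ['c']
```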

  19. … Matchmaker Co-matching
     • Application requests an allocation which provides the best possible goodput
       • large data and checkpoint files require high bandwidth to checkpoint server.
       • balance cost of application placement and checkpoint overheads with (estimated) allocation time.
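
One way to read "balance cost of placement and checkpoint overheads with (estimated) allocation time" is to rank candidate allocations by estimated goodput rather than by allocation length alone. The sketch below assumes overhead consists of one checkpoint transfer at placement and one at migration, at a constant bandwidth; the candidate names, allocation estimates, and bandwidths are invented for the example.

```python
def estimated_goodput(alloc_s, ckpt_mb, bandwidth_mbps):
    """Estimated allocation time minus one placement and one migration transfer."""
    transfer_s = ckpt_mb * 8 / bandwidth_mbps
    return max(0.0, alloc_s - 2 * transfer_s)


# Hypothetical candidates: (name, estimated allocation in seconds, Mbit/s to checkpoint server).
candidates = [("A", 1800, 100), ("B", 2000, 4)]
ckpt_mb = 100  # assumed checkpoint size

scores = {name: estimated_goodput(alloc, ckpt_mb, bw) for name, alloc, bw in candidates}
best = max(scores, key=scores.get)
print(scores, best)  # {'A': 1784.0, 'B': 1600.0} A
```

Here the shorter allocation wins because the slow link to the checkpoint server costs candidate B 400 seconds of transfer time.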

  20. … Matchmaker Co-matching
     • Best Fit vs. First Fit
     • Match lower priority requests with smaller network requirements first to increase cluster CPU utilization
     • Preempt one of these requests when you match a high priority request with a large network requirement.
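
The preemption step can be sketched as follows. This is one possible interpretation, not the policy from the talk: it assumes a single shared network budget, treats larger numbers as higher priority, and preempts lowest-priority matches until the new request fits. All names and numbers are hypothetical.

```python
def match_with_preemption(running, new_req, budget):
    """Try to match new_req = (name, priority, net_demand) within a network budget.

    running is a list of (name, priority, net_demand) tuples. Lower-priority
    matches are preempted, lowest first, until the new request fits.
    Returns (new_running, preempted_names, matched).
    """
    _, prio, net = new_req
    used = sum(r[2] for r in running)
    kept = sorted(running, key=lambda r: r[1], reverse=True)  # highest priority first
    preempted = []
    while used + net > budget and kept and kept[-1][1] < prio:
        victim = kept.pop()  # lowest-priority match still running
        preempted.append(victim[0])
        used -= victim[2]
    if used + net > budget:
        return list(running), [], False  # cannot fit: leave current matches untouched
    return kept + [new_req], preempted, True


running = [("low1", 1, 10), ("low2", 1, 15), ("mid", 5, 20)]
new_running, preempted, matched = match_with_preemption(running, ("high", 9, 30), budget=60)
print([r[0] for r in new_running], preempted, matched)  # ['mid', 'low1', 'high'] ['low2'] True
```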

  21. Checkpoint Server support
     • Prioritize data streams
       • high priority: migration streams
       • low priority: checkpoint read and periodic checkpoint write streams
     • Schedule periodic checkpoints in advance to avoid bursts of network traffic.
     • Schedule graceful shutdowns in advance to avoid vacate failures.
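
Stream prioritization of this kind is commonly built on a priority queue. The sketch below uses Python's standard `heapq` module to serve migration streams before checkpoint reads and periodic checkpoint writes, with first-come-first-served order inside each class. The `StreamQueue` class and the stream-kind names are illustrative assumptions, not part of any Condor checkpoint server API.

```python
import heapq
import itertools

# Lower number = served first. The two classes follow the slide.
PRIORITY = {
    "migration": 0,
    "checkpoint_read": 1,
    "periodic_checkpoint_write": 1,
}


class StreamQueue:
    """Hypothetical queue of pending transfers, ordered by stream priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a class

    def submit(self, kind, job_id):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), kind, job_id))

    def next_stream(self):
        _, _, kind, job_id = heapq.heappop(self._heap)
        return kind, job_id


q = StreamQueue()
q.submit("periodic_checkpoint_write", "j1")
q.submit("checkpoint_read", "j2")
q.submit("migration", "j3")
order = [q.next_stream() for _ in range(3)]
print(order[0])  # ('migration', 'j3')
```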

  22. Shadow support
     • Choose most efficient data access method per file
     • Locate checkpoint and file servers
     • Schedule periodic checkpoints in advance.

  23. Minimize Data Size
     • compress checkpoints.
     • only checkpoint changes (diffs).
     • data staging.
     • checkpoint staging.
     • write checkpoint to local file system and schedule transfer when resources are available
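
The first two ideas, compression and checkpointing only the changes, can be illustrated with the standard `zlib` module and a fixed-size block comparison. This is a sketch of the general technique rather than Condor's checkpoint format; the 4096-byte block size, the function names, and the sample data are assumptions.

```python
import zlib

BLOCK = 4096  # hypothetical block size


def changed_blocks(old: bytes, new: bytes, block: int = BLOCK) -> dict:
    """Return {block_index: data} for the blocks of `new` that differ from `old`."""
    diff = {}
    for i in range(0, len(new), block):
        if new[i:i + block] != old[i:i + block]:
            diff[i // block] = new[i:i + block]
    return diff


def apply_diff(old: bytes, diff: dict, new_len: int, block: int = BLOCK) -> bytes:
    """Rebuild the new checkpoint from the old one plus the changed blocks."""
    buf = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for idx, data in diff.items():
        buf[idx * block: idx * block + len(data)] = data
    return bytes(buf)


# Made-up 16 KiB memory image in which only one block changes between checkpoints.
old_image = bytes(4 * BLOCK)
new_image = bytearray(old_image)
new_image[2 * BLOCK: 2 * BLOCK + 4] = b"ABCD"
new_image = bytes(new_image)

diff = changed_blocks(old_image, new_image)
compressed = zlib.compress(new_image)
print(sorted(diff), len(compressed) < len(new_image))  # [2] True
```

Only block 2 would need to be sent in the diff case, and the compressed full image is smaller than the original because this sample data is highly repetitive.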

  24. Goodput Measurements
     • Goodput/Allocation ratio measures health of the system
       • detect problem resources
       • detect overloaded subnets
       • measure QOS per application
     • Checkpoint transfer statistics measure network usage
       • success rate
       • throughput
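
As a rough illustration of using the Goodput/Allocation ratio to detect problem resources, the sketch below aggregates per-resource allocation and goodput and flags resources whose ratio falls below a cutoff. The record layout, the function names, the 0.5 cutoff, and the sample data are assumptions made for the example.

```python
def goodput_ratios(records):
    """records: iterable of (resource, allocation_s, goodput_s) tuples."""
    totals = {}
    for resource, alloc, good in records:
        a, g = totals.get(resource, (0.0, 0.0))
        totals[resource] = (a + alloc, g + good)
    return {res: g / a for res, (a, g) in totals.items() if a > 0}


def flag_unhealthy(ratios, min_ratio=0.5):
    """Names of resources whose Goodput/Allocation ratio is below min_ratio."""
    return sorted(res for res, ratio in ratios.items() if ratio < min_ratio)


# Made-up allocation log for two machines.
log = [("m1", 1000, 900), ("m1", 1000, 700), ("m2", 2000, 400)]
ratios = goodput_ratios(log)
flagged = flag_unhealthy(ratios)
print(flagged)  # ['m2']
```

The same aggregation could be keyed by subnet or by application instead of by machine to cover the other two uses listed on the slide.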

  25. Very Large Objects on the Network
