What’s new in Condor? What’s coming up? Condor Week 2009
Release Situation
  • Stable Series
    • Current: Condor v7.2.2 (April 14 2009)
    • Last Year: Condor v7.0.1 (Feb 27th 2008)
  • Development Series
    • Current: Condor v7.3.0 (Feb 24 2009)
      • v7.3.1 “any day”
    • Last Year: Condor v7.1.0 (April 1st 2008)
  • How long is development taking?
    • v6.9 Series : ~ 18 months
    • v7.1 Series : ~ 12 months
    • v7.3 Series : plan says done in July 09
New Ports In 7.2.0 and Beyond
  • Full ports: Debian 5.0 x86 & x86_64
  • Also added condor_compile support for gfortran
Big new goodies in v7.0

Last Year's News

  • Virtual Machine Universe
  • Scalability Improvements
  • GCB Improvements
  • Privilege Separation
  • New Quill
  • “Crondor”
Big new goodies in v7.2
  • Job Router
  • Startd and Job Router hooks
  • DAGMan tagging and splicing
  • Green Computing started
  • GLEXEC
  • Concurrency Limits
Job Router
  • Automated way to let jobs run on a wider array of resources
    • Transform jobs into different forms
    • Reroute jobs to different destinations


What is “job routing”?

Original (vanilla) job:

    Universe = "vanilla"
    Executable = "sim"
    Arguments = "seed=345"
    Output = "stdout.345"
    Error = "stderr.345"
    ShouldTransferFiles = True
    WhenToTransferOutput = "ON_EXIT"

Routed (grid) job:

    Universe = "grid"
    GridType = "gt2"
    GridResource = "cmsgrid01.hep.wisc.edu/jobmanager-condor"
    Executable = "sim"
    Arguments = "seed=345"
    Output = "stdout"
    Error = "stderr"
    ShouldTransferFiles = True
    WhenToTransferOutput = "ON_EXIT"

[Diagram: the JobRouter consults its routing table (Site 1, Site 2), submits the routed job, and feeds the final status back to the original job.]
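The rewrite above can be sketched as a function over the job ad, using plain Python dicts in place of ClassAds. The attribute names come from the slide; the routing logic itself is illustrative, not the Job Router's actual implementation.

```python
# A minimal sketch of the Job Router's transformation step, using plain
# dicts in place of ClassAds. The route details are illustrative only.
def route_job(job_ad, grid_resource):
    routed = dict(job_ad)                  # work on a copy of the ad
    routed["Universe"] = "grid"            # retarget to the grid universe
    routed["GridType"] = "gt2"
    routed["GridResource"] = grid_resource
    routed["Output"] = "stdout"            # site-neutral output names
    routed["Error"] = "stderr"
    return routed

vanilla = {
    "Universe": "vanilla",
    "Executable": "sim",
    "Arguments": "seed=345",
    "Output": "stdout.345",
    "Error": "stderr.345",
    "ShouldTransferFiles": True,
    "WhenToTransferOutput": "ON_EXIT",
}
routed = route_job(vanilla, "cmsgrid01.hep.wisc.edu/jobmanager-condor")
```

Note the original ad is left untouched, so the final status can still be reported back to the vanilla job.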

Routing is just site-level matchmaking
  • With feedback from job queue
      • number of jobs currently routed to site X
      • number of idle jobs routed to site X
      • rate of recent success/failure at site X
  • And with power to modify job ad
      • change attribute values (e.g. Universe)
      • insert new attributes (e.g. GridResource)
      • add a “portal” grid proxy if desired


Startd Job Hooks
  • Users wanted to take advantage of Condor’s resource management daemon (condor_startd) to run jobs, but they had their own scheduling system.
    • Specialized scheduling needs
    • Jobs live in their own database or other storage rather than a Condor job queue


Job Router Hooks
  • Truly transform jobs, not just reroute them
    • E.g. stuff a job into a virtual machine (either VM universe or Amazon EC2)
  • Hooks invoked like startd ones


Our solution
  • Make a system of generic “hooks” that you can plug into:
    • A hook is a point during the life-cycle of a job where the Condor daemons will invoke an external program
    • Hook Condor to your existing job management system without modifying the Condor code
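The hook idea can be sketched like this: at a life-cycle point, the daemon invokes an external program, passes the job ad to it, and reads back the result. Passing the phase through an environment variable is an illustrative choice here, not Condor's actual hook interface.

```python
# A sketch of the hook mechanism: at a life-cycle point the daemon runs an
# external program, writes the job ad to its stdin, and reads the (possibly
# modified) ad back from stdout. Passing the phase in an environment
# variable is an illustrative choice, not Condor's actual interface.
import os
import subprocess

def run_hook(hook_path, phase, job_ad_text):
    env = dict(os.environ, HOOK_PHASE=phase)
    result = subprocess.run(
        [hook_path],
        input=job_ad_text,
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return result.stdout   # the hook may have rewritten the ad

# Stand-in hook: /bin/cat simply echoes the ad back unchanged.
ad = 'Cmd = "sim"\nArgs = "seed=345"\n'
out = run_hook("/bin/cat", "prepare", ad)
```

Because the hook is just an external program, an existing job management system can plug in here without any changes to the Condor code.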


Category Example

[Diagram: a DAG with a Setup node, three “Big job” nodes throttled by a category limit of Run <= 2, nine “Small job” nodes throttled by Run <= 5, and a Cleanup node.]

DAGMan Splicing

[Diagram: node A fans out into three spliced copies of a diamond DAG (X, Y, Z; nodes X+A … Z+D), which all feed into node B.]

Splicing creates one “in memory” DAG. No sub-DAGs means no extra condor_dagmans.

# Example use case
JOB A A.sub
JOB B B.sub
SPLICE X diamond.dag
SPLICE Y diamond.dag
SPLICE Z diamond.dag
PARENT A CHILD X Y Z
PARENT X Y Z CHILD B
# Notice scoping of node names!
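Conceptually, each SPLICE inlines the sub-DAG's nodes into the single in-memory DAG, scoping every node name with the splice name, which is where names like X+A in the diagram come from. A minimal sketch of that expansion:

```python
# A sketch of what SPLICE does conceptually: each splice inlines the
# sub-DAG's nodes into the one in-memory DAG, scoping every node name
# with the splice name (giving X+A, Y+A, ... as in the diagram).
def splice_nodes(parent_nodes, splices):
    """splices maps a splice name to the node names of its sub-DAG."""
    nodes = list(parent_nodes)
    for name, sub_nodes in splices.items():
        nodes += [f"{name}+{n}" for n in sub_nodes]
    return nodes

diamond = ["A", "B", "C", "D"]   # nodes of diamond.dag
all_nodes = splice_nodes(["A", "B"], {"X": diamond, "Y": diamond, "Z": diamond})
```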

Green Computing
  • The startd has the ability to place a machine into a low power state. (Standby, Hibernate, Soft-Off, etc.)
    • HIBERNATE, HIBERNATE_CHECK_INTERVAL
    • If all slots return non-zero, the machine is powered down; otherwise it keeps running.
  • Machine ClassAd contains all information required for a client to wake it up
    • Condor can wake it up; there is also a standalone tool.
    • This was NOT as easy as it should be.
  • Machines in “Offline State”
    • Stored persistently to disk
    • Lots of other uses
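The power-down rule above can be sketched as a vote across slots. The numeric state codes and the min() tie-break are assumptions for illustration, not Condor's documented semantics.

```python
# A sketch of the power-down rule described above: every slot evaluates
# HIBERNATE to a power state, and the machine powers down only if all
# slots return non-zero. The numeric state codes and the min() tie-break
# are assumptions for illustration.
def machine_power_state(slot_states):
    """slot_states: per-slot HIBERNATE results; 0 means 'stay running'."""
    if all(state != 0 for state in slot_states):
        return min(slot_states)   # assumed: shallowest requested state wins
    return 0                      # any slot voting 0 keeps the machine up

busy = machine_power_state([0, 4])   # one busy slot: machine stays up
idle = machine_power_state([4, 4])   # all slots idle: power down (state 4)
```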
Concurrency Limits
  • Limit job execution based on admin-defined consumable resources
    • E.g. licenses
  • Can have many different limits
  • Jobs say what resources they need
  • Negotiator enforces limits pool-wide


Concurrency Example
  • Negotiator config file
    • MATLAB_LIMIT = 5
    • NFS_LIMIT = 20
  • Job submit file
    • concurrency_limits = matlab,nfs:3
    • This requests 1 Matlab token and 3 NFS tokens
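The negotiator's bookkeeping for this example can be sketched as follows, using the MATLAB/NFS numbers above. The accounting here is deliberately simplified; the real counts are tracked pool-wide by the negotiator.

```python
# A sketch of pool-wide limit enforcement in the spirit of the negotiator,
# using the MATLAB/NFS numbers from the example above. The bookkeeping
# here is simplified; counts would really come from the whole pool.
limits = {"matlab": 5, "nfs": 20}
in_use = {"matlab": 4, "nfs": 18}

def parse_request(spec):
    """Parse 'matlab,nfs:3' into {'matlab': 1, 'nfs': 3}."""
    request = {}
    for part in spec.split(","):
        name, _, count = part.partition(":")
        request[name.strip()] = int(count) if count else 1
    return request

def can_start(spec):
    """True if every requested token fits under its pool-wide limit."""
    request = parse_request(spec)
    return all(in_use.get(name, 0) + count <= limits.get(name, 0)
               for name, count in request.items())

ok = can_start("matlab,nfs:2")        # 4+1 <= 5 and 18+2 <= 20
blocked = can_start("matlab,nfs:3")   # 18+3 exceeds the NFS limit of 20
```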


Other goodies in v7.2
  • ALLOW/DENY_CLIENT
  • Job queue backup on local disk
  • PREEMPTION_REQUIREMENTS and RANK can reference additional attributes in negotiator about group resource usage
  • Start on dynamic provisioning in the startd
  • $$([])
Dynamic Slot Partitioning

  • Divide slots into chunks sized for matched jobs
  • Readvertise remaining resources
  • Partitionable resources are cpus, memory, and disk
  • See Matt Farrellee’s talk
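The carve-and-readvertise step can be sketched like this. The resource names come from the slide; the fitting rule and slot representation are simplified assumptions.

```python
# A sketch of dynamic partitioning: carve a sub-slot sized to the job's
# request out of a partitionable slot and readvertise the remainder.
# Resource names come from the slide; the fitting rule is simplified.
RESOURCES = ("cpus", "memory", "disk")

def partition(slot, request):
    if any(request[r] > slot[r] for r in RESOURCES):
        return None, slot                           # job does not fit
    sub_slot = {r: request[r] for r in RESOURCES}
    remainder = {r: slot[r] - request[r] for r in RESOURCES}
    return sub_slot, remainder                      # remainder is readvertised

slot = {"cpus": 8, "memory": 16000, "disk": 100000}
sub, rest = partition(slot, {"cpus": 2, "memory": 4000, "disk": 10000})
```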


Dynamic Partitioning Caveats
  • Cannot preempt original slot or group of sub-slots
    • Potential starvation of jobs with large resource requirements
  • Partitioning happens once per slot each negotiation cycle
    • Scheduling of large slots may be slow


New Variable Substitution
  • $$(Foo) in submit file
    • Existing feature
    • Attribute Foo from machine ad substituted
  • $$([Memory * 0.9]) in submit file
    • New feature
    • Expression is evaluated and then substituted
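The difference between the two forms can be sketched as below. A real ClassAd evaluator is out of scope, so Python's eval stands in for it, and the machine-ad contents are made up.

```python
# A sketch contrasting the two substitution forms. A real ClassAd
# evaluator is out of scope; Python's eval stands in for it here, and
# the machine-ad contents are made up.
import re

machine_ad = {"Memory": 2048, "Name": "slot1@host"}

def substitute(template):
    def repl(match):
        body = match.group(1)
        if body.startswith("[") and body.endswith("]"):
            # $$([expr]): evaluate the expression against the machine ad
            return str(eval(body[1:-1], {}, machine_ad))
        # $$(Attr): plain attribute lookup
        return str(machine_ad[body])
    return re.sub(r"\$\$\(([^)]*)\)", repl, template)

plain = substitute("request_memory = $$(Memory)")           # lookup
evaled = substitute("request_memory = $$([Memory * 0.9])")  # evaluation
```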


More Info For Preemption
  • New attributes for these preemption expressions in the negotiator…
    • PREEMPTION_REQUIREMENTS
    • PREEMPTION_RANK
  • Used for controlling preemption due to user priorities



Terms of License

Any and all dates in these slides are relative from a date hereby unspecified in the event of a likely situation involving a frequent condition. Viewing, use, reproduction, display, modification and redistribution of these slides, with or without modification, in source and binary forms, is permitted only after a deposit by said user into PayPal accounts registered to Todd Tannenbaum ….

Already served (leftovers)
  • CCB – Condor Connection Broker
    • Dan Bradley’s presentation
  • Bring checkpoint/restart to Vanilla Job
    • Pete Keller’s presentation re DMTCP
  • Asynch notification of events to fill a hole in Condor’s web service API
    • Jungha Woo’s presentation
  • Grid Universe improvements
    • Xin Zhao’s presentation
Data “Drinks”

Wando Fishbowl Anyone?

Condor + Hadoop FS !
  • Lots of hard work by Faisal Khan
  • Motivation
    • Condor+HDFS = 2 + 2 = 5 !!!
    • A Synergy exists (next slide)
      • Hadoop as distributed storage system
      • Condor as cluster management system
    • Large number of distributed disks in a compute cluster
      • Managing disk as a resource

Condor + HDFS
  • Dhruba Borthakur’s talk
  • Synergy
    • Condor knows a lot about its cluster
      • Capability of individual machines in terms of available memory, CPU load, disk space etc.
      • Availability of JRE (Java Universe)
    • Condor can easily automate housekeeping jobs, e.g.
      • rebalancing data blocks
      • implementing user file quotas
Condor + HDFS
  • Synergy
    • Failover
      • High availability daemon in Condor
    • ClassAds
      • Let clients know the current IP of name server
      • Heartbeat
condor_hdfs daemon
  • Main integration point of HDFS within Condor
  • Configures HDFS cluster based on existing condor_config files
  • Runs under condor_master and can be controlled by existing Condor utilities
  • Publishes interesting parameters to the Collector, e.g. IP address, node type, disk activity
  • Currently deployed at UW-Madison
Condor + HDFS : Next Steps
  • NameNode failover
  • Block placement policies & management
  • Thinking about how Condor can steer jobs to the data
    • Via a ClassAd function used in the RANK expression?
  • Integrate with File Transfer Mechanism…
More Job Sandbox Options
  • Condor’s File Transfer mechanism
    • Currently moves files between submit and execute hosts (shadow and starter).
    • Next : Files can have URLs
      • HTTP
      • HDFS
    • How about Condor’s SPOOL?
  • Need to schedule movement? New Stork
    • Mehmet Balman’s presentation
Virtual Machine Sandboxing
  • We have the Virtual Machine Universe…
    • Great for provisioning
    • Nitin Narkhede’s presentation
  • … and now we are exploring different mechanisms to run a job inside a VM.
  • Benefits
    • Isolate the job from execute host.
    • Stage custom execution environments.
    • Sandbox and control the job execution.
One way to do it – via the Condor Job Router
  • Hard work by Varghese Mathew
  • Ordinary Jobs & VM Universe Jobs.
  • Job router – transform a job into a new form.
  • Job router hook picks them up, sets them up inside a VM job, and submits the VM job.
  • On completion, the job router hook extracts output from the VM and returns it to the original job.
Different Flavors
  • Script Inside VM
  • Starter Inside VM
  • Personal Condor Inside VM
  • VM joins the pool as an execute node
  • All different ways to bind a job to a specific virtual machine.
Speaking of VM Universe…
  • Adding VM Universe Support for
    • VMWare Server 2.x
    • KVM
      • Done via libvirt
      • New VM systems added to libvirt should be easy to support
    • VMWare ESX, ESXi
  • Thank you community for contributions!
Fast, quick, light jobs
  • Options to put a Condor job on a diet
  • Diet ideas:
    • Leave the luggage at home! No job file sandbox, everything in the job ad.
    • Don’t pay for strong semantic guarantees if you don’t need em. Define expectations on entry, update, completion.
  • Want to honor scheduling policy, however.
Some small side dishes

Julia, a spy who really knew her eggs…

  • Non-blocking communication via threads
    • Refer to Dan/Igor’s talk
    • Especially all the security session roundtrips
    • The USCMS scalability testbed needed 70 collectors to support ~20k dynamic machines; replaced with 1 collector with the threading code. 70:1, baby!!!!!
  • Configuration knob management
    • Think about:config in firefox
    • Hard-coded configurations now possible
  • Nested groups
Back to Green Computing
  • The startd has the ability to place a machine into a low power state (Standby, Hibernate, Soft-Off, etc.).
  • Machine ClassAd contains all information required for a client to wake it up
  • Machines in “Offline State”
    • Stored persistently to disk
  • NOW… have the matchmaker publish “match pressure” into these offline ads, enabling policies for auto-wakeup
Scheduling in Condor Today

[Diagram: many startds report to central managers (CMs) and are matched to multiple schedds.]

  • Distributed Ownership
  • Settings reflect 3 separate viewpoints:
    • Pool manager, Resource Owner, Job Submitter

But some sites want to use Condor like this:

[Diagram: many startds served by a single schedd.]

  • Just one submission point (schedd)
  • All resources owned by one entity
  • We can do better for these sites.
    • Policy configurations are complicated.
    • Some useful policies are missing because they are hard to do in a wide-area distributed system.
    • Today the dedicated “scheduler” only supports FIFO and a naive Best Fit algorithm.

So what to do?

[Diagram: a single schedd directly managing many startds.]

  • Give the schedd more scheduling options.
    • Examples: why can’t the schedd do priority preemption without the matchmaker’s help? Or move jobs from slow to fast claimed resources?