Fair Share Scheduling

Ethan Bolker

Mathematics & Computer Science UMass Boston

www.cs.umb.edu/~eb

Queen’s University

March 23, 2001

Acknowledgements

- Yiping Ding
- Jeff Buzen
- Dan Keefe
- Oliver Chen
- Chris Thornley

- Aaron Ball
- Tom Larard
- Anatoliy Rikun
- Liying Song

- www.bmc.com/patrol/fairshare
- www.cs.umb.edu/~eb/goalmode

- Queueing theory primer
- Fair share semantics
- Priority scheduling; conservation laws
- Predicting response times from shares
- analytic formula
- experimental validation
- applet simulation

- Implementation geometry

- Stream of jobs visiting a server (ATM, time shared CPU, printer, …)
- Jobs queue when server is busy
- Input:
- Arrival rate: λ jobs/sec
- Service demand: s sec/job

- Performance metrics:
- server utilization: u = λs (must be ≤ 1)
- response time: r = ??? sec/job (average)
- degradation: d = r/s

- r, d measure queueing delay
r ≥ s (d ≥ 1), unless parallel processing is possible

- Randomness really matters
r = s (d = 1) if arrivals scheduled (best case, no waiting)

r >> s for bulk arrivals (worst case, maximum delays)

- Theorem. If arrivals are Poisson and service is exponentially distributed (M/M/1) then
d = 1/(1 - u), r = s/(1 - u)

- Think: virtual server with speed 1-u

- Essential nonlinearity often counterintuitive
- at u = 95% average degradation is 1/(1-0.95) = 20,
- but 1 customer in 20 has no wait at all (5% idle time)

- A useful guide even when hypotheses fail
- accurate enough (± 30%) for real computer systems
- d depends only on u: many small jobs have same impact as few large jobs
- faster system → smaller s → smaller u; in r = s/(1-u) that is a double win: less service, less wait
- waiting costly, server cheap (telephones): want u near 0
- server costly (doctors): want u near 1 but scheduled

- Customers want good response times
- Decreasing u is expensive
- High end Unix offerings from HP, IBM, Sun offer fair share scheduling packages that allow an administrator to allocate scarce resources (CPU, processes, bandwidth) among workloads
- How do these packages behave?
- Model as a black box, independent of internals
- Limit study to CPU shares on a uniprocessor

- Multiple workloads, utilizations u1, u2, …
- U = Σ ui < 1
- If no workload prioritization then all degradations are equal: di = 1/(1-U)
- Share allocations are de facto prioritizations
- Study degradation vector V = (d1, d2, …)

- Suppose workload w has CPU share fw
- Normalize shares so that Σw fw = 1
- w gets fraction fw of CPU time slices when at least one of its jobs is ready for service
- Can it use more if competing workloads idle?
No: think share = cap

Yes: think share = guarantee

- Good for accounting (sell fraction of web server)
- Available now from IBM, HP, soon from Sun
- Straightforward (boring) - workloads are isolated
- Each runs on a virtual processor with speed *= f

- Dedicated system compared with share f:
- utilization: u becomes u/f (need f > u!)
- response time: r becomes r(1 - u)/(f - u)
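
When a share is a cap, the workload sees a virtual processor of speed f, so the M/M/1 formula applies with service demand s/f and utilization u/f. The sketch below checks that this agrees with the slide's expression r(1 - u)/(f - u). The function name and the sample numbers are illustrative assumptions.

```python
def capped_share_response(u, s, f):
    """Response times (dedicated, capped) for an M/M/1 workload.

    u: utilization on the dedicated system, s: service demand (sec/job),
    f: share of the CPU, treated as a hard cap.
    """
    if not 0 < f <= 1:
        raise ValueError("share must satisfy 0 < f <= 1")
    if not 0 <= u < f:
        raise ValueError("need f > u, otherwise the capped workload saturates")
    r_dedicated = s / (1.0 - u)
    # Virtual processor of speed f: service demand s/f, utilization u/f.
    r_capped = (s / f) / (1.0 - u / f)
    return r_dedicated, r_capped


r, r_cap = capped_share_response(u=0.3, s=1.0, f=0.5)
print(r, r_cap, r * (1 - 0.3) / (0.5 - 0.3))  # last two values should agree
```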

- Good for performance + economy (use otherwise idle resources)
- Shares make a difference only when there are multiple workloads
- Large share resembles high priority: share may be less than utilization
- Workload interaction is subtle, often unintuitive, hard to explain

[Diagram labels: OS; Performance Goals; response time; report; measure frequently; update; query; workload; Model; complex scheduling software; analytic algorithms; fast computation]

- Real system
- Complex, dynamic, frequent state changes
- Hard to tease out cause and effect

- Model
- Static snapshot, deals in averages and probabilities
- Fast, enlightening answers to “what if” questions

- Abstraction helps you understand real system
- Start with a study of priority scheduling

- Priority state: order workloads by priority (ties OK)
- two workloads, 3 states: 12, 21, [12]
- three workloads, 13 states:
- 123 (6 = 3! of these ordered states),
- [12]3 (3 of these),
- 1[23] (3 of these),
- [123] (1 state with no priorities)

- n wkls, f(n) states, n! ordered (simplex lock combos)
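
The counts 3 and 13 can be reproduced with a short recurrence: pick the k workloads tied at the top priority level, then order the remaining ones. This is a standard recurrence for these numbers rather than something stated on the slides; the function name `priority_state_count` is an illustrative assumption.

```python
from math import comb, factorial


def priority_state_count(n):
    """Number of ways to order n workloads by priority when ties are allowed."""
    counts = [1]  # zero workloads: one (empty) state
    for m in range(1, n + 1):
        # k workloads share the top priority level; the other m - k are
        # ordered below them in counts[m - k] ways.
        counts.append(sum(comb(m, k) * counts[m - k] for k in range(1, m + 1)))
    return counts[n]


for n in range(1, 6):
    print(n, "workloads:", priority_state_count(n), "states,",
          factorial(n), "of them fully ordered")
```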

- p(s) = prob( state = s ) = fraction of time in state s
- V(s) = degradation vector when state = s (measure this, or compute it using queueing theory)
- V = Σs p(s) V(s) (time average is a convex combination)
- Achievable region is convex hull of vectors V(s)

[Figure: achievable region for two workloads. Axes d1 and d2, with the line d1 = d2; points V(12) (wkl 1 high prio), V([12]) (no priorities) and V(21); the span between V(12) and V(21) is labeled "achievable region".]

[Figure: same axes and points, adding the point 0.5 V(12) + 0.5 V(21).]

[Figure: same axes and points. Note: u1 < u2, so wkl 2 effect on wkl 1 is large.]

- No Free Lunch Theorem. Weighted average degradation is constant, independent of priority scheduling scheme:
Σi (ui/U) di = 1/(1-U)

- Provable from some hypotheses
- Observable in some real systems
- Sometimes false: shortest job first minimizes average response time (printer queues, supermarket express checkout lines)
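
A quick numerical check of the theorem for two workloads. This sketch assumes the strict-priority degradations given elsewhere in this deck for V(1,2): 1/(1 - u_high) for the high-priority workload and 1/((1 - u_high)(1 - U)) for the low-priority one. The utilizations are illustrative values.

```python
def priority_degradations(u_high, u_low):
    """(d_high, d_low) for two workloads under strict priority (assumed formula)."""
    U = u_high + u_low
    d_high = 1.0 / (1.0 - u_high)
    d_low = 1.0 / ((1.0 - u_high) * (1.0 - U))
    return d_high, d_low


def weighted_average_degradation(u1, u2, d1, d2):
    """Sum of (ui/U) di over both workloads."""
    U = u1 + u2
    return (u1 / U) * d1 + (u2 / U) * d2


u1, u2 = 0.2, 0.6                                # illustrative utilizations
d1_hi, d2_lo = priority_degradations(u1, u2)     # state 12
d2_hi, d1_lo = priority_degradations(u2, u1)     # state 21

print("state 12:", weighted_average_degradation(u1, u2, d1_hi, d2_lo))
print("state 21:", weighted_average_degradation(u1, u2, d1_lo, d2_hi))
print("1/(1-U): ", 1.0 / (1.0 - (u1 + u2)))
```

Because both priority orders give the same weighted average, any time-average of the two states does as well.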

- For any proper subset A of workloads:
Imagine giving those workloads top priority.

Then we can pretend the other wkls don’t exist. In that case

Σi∈A (ui/U(A)) di = 1/(1-U(A))

When wkls in A have lower priorities they have

higher degradations, so in general

Σi∈A (ui/U(A)) di ≥ 1/(1-U(A))

- These 2^n - 2 linear inequalities determine the convex achievable region R
- R is a permutahedron: only n! vertices

[Figures: the achievable region for two workloads, built up step by step. Axes: d1 (workload 1 degradation) and d2 (workload 2 degradation).]

- Conservation law: (d1, d2) lies on the line (u1/U) d1 + (u2/U) d2 = 1/(1-U)
- Constraint resulting from workload 1: d1 ≥ 1/(1-u1)
- Workload 1 runs at high priority: V(1,2) = ( 1/(1-u1), 1/((1-u1)(1-U)) )
- Constraint resulting from workload 2: d2 ≥ 1/(1-u2), met with equality at V(2,1)
- Achievable region R: the part of the conservation line between V(1,2) and V(2,1)

- Degradation vector (d1, d2, d3) lies on the plane u1 d1 + u2 d2 + u3 d3 = C
- We know a constraint for each workload w: uw dw ≥ Cw
- Conservation applies to each pair of wkls as well: u1 d1 + u2 d2 ≥ C12
- Achievable region has one vertex for each priority ordering of workloads: 3! = 6 in all
- Hence its name: the permutahedron
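
The vertices and constraints can be enumerated directly. The sketch below is not from the slides: it assumes that under strict priority a workload's degradation is 1/((1 - U_above)(1 - U_through)), where U_above is the total utilization of higher-priority workloads and U_through adds its own. For two workloads this reduces to the deck's V(1,2). With that assumption the plane constant is C = U/(1-U) and each subset bound is U(A)/(1-U(A)). Function names and utilizations are illustrative.

```python
from itertools import combinations, permutations


def priority_vertex(order, u):
    """Degradation vector for strict priority `order` (highest first)."""
    d = [0.0] * len(u)
    above = 0.0
    for w in order:
        through = above + u[w]
        d[w] = 1.0 / ((1.0 - above) * (1.0 - through))  # assumed formula
        above = through
    return d


def violated_subsets(u, d, tol=1e-9):
    """Proper nonempty subsets A with sum(ui di) < U(A)/(1 - U(A))."""
    n = len(u)
    bad = []
    for size in range(1, n):
        for A in combinations(range(n), size):
            UA = sum(u[i] for i in A)
            if sum(u[i] * d[i] for i in A) < UA / (1.0 - UA) - tol:
                bad.append(A)
    return bad


u = [0.2, 0.3, 0.4]   # illustrative utilizations, U = 0.9
U = sum(u)
for order in permutations(range(len(u))):
    d = priority_vertex(order, u)
    on_plane = abs(sum(ui * di for ui, di in zip(u, d)) - U / (1.0 - U)) < 1e-9
    print(order, [round(x, 2) for x in d], "on plane:", on_plane,
          "violations:", violated_subsets(u, d))
```

Workloads are indexed from 0 in the code, so the order (0, 1, 2) corresponds to the vertex V(1,2,3).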

Three Workload Permutahedron

- 3! = 6 vertices (priority orders), for example V(1,2,3) and V(2,1,3)
- 2^3 - 2 = 6 edges (conservation constraints)
- Lies in the plane u1 d1 + u2 d2 + u3 d3 = C
- Axes: d1, d2, d3

Experimental evidence

4! = 24 vertices (ordered states)

2^4 - 2 = 14 facets (proper subsets)

(conservation constraints)

74 faces (states)

Simplicial geometry and transportation polytopes,

Trans. Amer. Math. Soc. 217 (1976) 138.

- Suppose f1 and f2 > 0 , f1 + f2 = 1
- Model: System operates in state
- 12 with probability f1
- 21 with probability f2
(independent of who is on queue)

- Average degradation vector:
V = f1 V(12) + f2 V(21)

- Reasonable modeling assumption: f1 = 1, f2 = 0 means workload 1 runs at high priority
- For arbitrary shares: workload priority order is
(1,2) with probability f1

(2,1) with probability f2 (probability = fraction of time)

- Compute average workload degradation: d1 = f1 (wkl 1 degradation at high priority) + f2 (wkl 1 degradation at low priority )
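
The two-workload model fits in a few lines. This sketch assumes the strict-priority degradations shown for V(1,2) in this deck (1/(1 - u_high) at high priority, 1/((1 - u_high)(1 - U)) at low priority); the function name and the sample values are illustrative.

```python
def two_workload_fair_share(u1, u2, f1):
    """Average degradations (d1, d2) when workload 1 has share f1, workload 2 has 1 - f1."""
    if not 0.0 <= f1 <= 1.0:
        raise ValueError("share f1 must lie in [0, 1]")
    f2 = 1.0 - f1
    U = u1 + u2
    d1_high, d1_low = 1.0 / (1.0 - u1), 1.0 / ((1.0 - u2) * (1.0 - U))
    d2_high, d2_low = 1.0 / (1.0 - u2), 1.0 / ((1.0 - u1) * (1.0 - U))
    # State 12 holds a fraction f1 of the time, state 21 a fraction f2.
    d1 = f1 * d1_high + f2 * d1_low
    d2 = f1 * d2_low + f2 * d2_high
    return d1, d2


u1, u2 = 0.2, 0.6   # illustrative utilizations
for f1 in (0.0, 0.25, 0.5, 0.75, 1.0):
    d1, d2 = two_workload_fair_share(u1, u2, f1)
    print(f"f1 = {f1:.2f}  d1 = {d1:.2f}  d2 = {d2:.2f}")
```

At f1 = 1 the result is V(12) and at f1 = 0 it is V(21); intermediate shares move linearly along the segment between them.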

prob(123) = f1 f2 f3 / [ (f1 + f2 + f3) (f2 + f3) (f3) ]

- Theorem: These n! probabilities sum to 1
- interesting identity generalizing adding fractions
- prove by induction, or by coupon collecting
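
The identity can be checked numerically by enumerating all orderings. In this sketch the function name `order_probability` and the share vectors are illustrative assumptions; the product it computes is the slide's prob(123) formula applied to an arbitrary ordering.

```python
from itertools import permutations


def order_probability(order, shares):
    """Probability that workloads are prioritized in `order` (highest first)."""
    p = 1.0
    for k, w in enumerate(order):
        # w is chosen next from the workloads not yet placed,
        # with probability proportional to its share.
        p *= shares[w] / sum(shares[x] for x in order[k:])
    return p


for shares in ([0.5, 0.5], [0.2, 0.3, 0.5], [1.0, 2.0, 3.0, 4.0]):
    orders = permutations(range(len(shares)))
    total = sum(order_probability(s, shares) for s in orders)
    print(len(shares), "workloads: probabilities sum to", round(total, 12))
```

Shares need not be normalized here, since only ratios of shares enter the product.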

- V = Σ over ordered states s of prob(s) V(s)
- O(n!), Ω(n!): good enough for n ≤ 9 (12)

- Screen captures on next slides are from www.bmc.com/patrol/fairshare
- Experiment with “what if” fair share modeling
- Watch a simulation
- Random virtual job generator for the simulation is the same one used to generate random real jobs for our benchmark studies

- Three workloads, each with utilization 0.32 jobs/second × 1.0 seconds/job = 0.32 = 32%
- CPU 96% busy, so average (conserved) response time is 1.0/(1 - 0.96) = 25 seconds
- Individual workload average response times depend on shares

- Shares: workload 1: 32.0, workload 2: 48.0 (workloads 1 and 2 sum to 80.0), workload 3: 20.0

- Normalized f3 = 0.20 means 20% of the time workload 3 (development) would be dispatched at highest priority

- During that time, workload priority order is (3,1,2) for 32/80 of the time, (3,2,1) for 48/80
- Probability( priority order is 312 ) = 0.20(32/80) = 0.08
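
The example can be worked end to end. The sketch below uses the slide's numbers (utilization 0.32 and service demand 1.0 sec/job for each workload, shares 32, 48, 20). The strict-priority degradation formula 1/((1 - U_above)(1 - U_through)) is an assumption that generalizes the deck's two-workload V(1,2); the function names are illustrative.

```python
from itertools import permutations

u = [0.32, 0.32, 0.32]        # utilizations of workloads 1, 2, 3
s = [1.0, 1.0, 1.0]           # service demands, seconds/job
shares = [32.0, 48.0, 20.0]   # shares of workloads 1, 2, 3


def order_probability(order, shares):
    """Probability of a priority order: pick each next workload by share."""
    p = 1.0
    for k, w in enumerate(order):
        p *= shares[w] / sum(shares[x] for x in order[k:])
    return p


def priority_vertex(order, u):
    """Degradations under strict priority `order` (assumed formula)."""
    d, above = [0.0] * len(u), 0.0
    for w in order:
        through = above + u[w]
        d[w] = 1.0 / ((1.0 - above) * (1.0 - through))
        above = through
    return d


def fair_share_response_times(u, s, shares):
    """Predicted response times: share-weighted average over all priority orders."""
    d = [0.0] * len(u)
    for order in permutations(range(len(u))):
        p = order_probability(order, shares)
        for w, dw in enumerate(priority_vertex(order, u)):
            d[w] += p * dw
    return [sw * dw for sw, dw in zip(s, d)]


r = fair_share_response_times(u, s, shares)
throughput = [ui / si for ui, si in zip(u, s)]
average = sum(x * ri for x, ri in zip(throughput, r)) / sum(throughput)
print("prob(312) =", order_probability((2, 0, 1), shares))
print("response times (sec):", [round(x, 2) for x in r])
print("throughput-weighted average (sec):", round(average, 6))
```

Workloads are indexed from 0 in the code, so the order 312 is written `(2, 0, 1)`.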

Three Transaction Workloads

- Formulas on previous slide
- Average predicted response time, weighted by throughput, is 25 seconds (as expected)
- Hard to understand intuitively
- Software helps

[Screen capture annotations: note change from 32%; jobs currently on run queue]

- Real CPU uses round robin scheduling to deliver time slices
- Short jobs never wait for long jobs to complete
- That resembles shortest job first, so response time conservation law fails
- At high utilization, simulation shows smaller response times than predicted by model
- Response time conservation law yields conservative predictions

- V = Σ over ordered states s of prob(s) V(s)
- Each s is a permutation of (1,2, … , n)
- Think of it as a vector in n-space
- Those n! vectors lie on the surface of a sphere
- For n large they are pretty densely packed
- Think of prob(s) as a discrete approximation to a probability distribution on the sphere
- V is an integral

- loop sampleSize times
choose a permutation s at random from the distribution determined by the shares

compute degradation vector V(s)

accumulate V += V(s) / sampleSize (s is already drawn with probability prob(s))

- sampleSize = 40000 works well independent of n!
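
A minimal Monte Carlo sketch of the sampling loop. Because each permutation is drawn with probability prob(s), the estimate here is the plain average of the sampled V(s). The strict-priority formula in `priority_vertex` is an assumption that generalizes the deck's two-workload V(1,2); the function names, seed, utilizations and shares are all illustrative.

```python
import random


def sample_order(shares, rng):
    """Draw a priority order: pick each next workload with probability
    proportional to its share among the workloads not yet placed."""
    remaining = list(range(len(shares)))
    order = []
    while remaining:
        weights = [shares[w] for w in remaining]
        w = rng.choices(remaining, weights=weights, k=1)[0]
        order.append(w)
        remaining.remove(w)
    return order


def priority_vertex(order, u):
    """Degradations under strict priority `order` (assumed formula)."""
    d, above = [0.0] * len(u), 0.0
    for w in order:
        through = above + u[w]
        d[w] = 1.0 / ((1.0 - above) * (1.0 - through))
        above = through
    return d


def estimate_degradations(u, shares, sample_size=40000, seed=0):
    """Monte Carlo estimate of the fair share degradation vector V."""
    rng = random.Random(seed)
    total = [0.0] * len(u)
    for _ in range(sample_size):
        d = priority_vertex(sample_order(shares, rng), u)
        for w, dw in enumerate(d):
            total[w] += dw
    return [t / sample_size for t in total]


u = [0.1, 0.1, 0.1, 0.2, 0.2, 0.2]          # illustrative utilizations
shares = [5.0, 10.0, 15.0, 20.0, 25.0, 25.0]  # illustrative shares
print([round(x, 2) for x in estimate_degradations(u, shares)])
```

The cost is proportional to sampleSize times n rather than n!, which is the point of the sampling approach.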

- Interpret shares as barycentric coordinates in the (n-1)-simplex
- Study the geometry of the map from the simplex to the (n-1)-dimensional permutahedron
- Easy when n=2: each is a line segment and map is linear

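[Figure: Mapping a triangle to a hexagon. The share triangle (labels f1 = 1, f1 = 0, f3 = 1, f3 = 0) is mapped by M onto the hexagon with vertices 123, 132, 312, 321, 231, 213. Hexagon labels: wkl 1 high priority, wkl 1 low priority, f1 = 0, f1 = 1, {23}.]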
