Outline for Today’s Lecture • Objective for today: • Finish the last lecture without preemption • Real-time scheduling • Beyond classic scheduling • Multiprocessor • Networks of workstations • Dynamic voltage scaling • Administrative: ??
Preemptive FCFS: Round Robin • Preemptive timeslicing is one way to improve fairness of FCFS. • If a job does not block or exit, force an involuntary context switch after each quantum Q of CPU time. • Preempted job goes back to the tail of the ready list. • With infinitesimal Q, round robin is called processor sharing. • Example: jobs with D = 3, 2, 1 and preemption overhead ε. FCFS completes them at times 3, 5, 6; round robin with quantum Q = 1 completes them at 6, 5, and 3 plus overhead. Either way R = (3 + 5 + 6 + ε)/3 = 4.67 + ε/3. In this case, R is unchanged by timeslicing. Is this always true?
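To make the arithmetic concrete, here is a minimal round-robin simulator, a sketch in Python (the function name and overhead parameter are mine; jobs are assumed to arrive together at time 0):

    def round_robin(demands, Q=1, overhead=0.0):
        """Simulate round robin for jobs that all arrive at time 0.
        demands: list of CPU demands D. Returns (completion times, mean R)."""
        remaining = list(demands)
        done = [0.0] * len(demands)
        t = 0.0
        while any(r > 0 for r in remaining):
            for i in range(len(remaining)):
                if remaining[i] <= 0:
                    continue
                run = min(Q, remaining[i])
                t += run + overhead          # quantum plus context-switch cost
                remaining[i] -= run
                if remaining[i] <= 0:
                    done[i] = t - overhead   # finished before the final switch
        return done, sum(done) / len(done)

    # The slide's jobs D = 3, 2, 1 with Q = 1: R = 14/3 = 4.67, same as FCFS.
    print(round_robin([3, 2, 1]))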
Evaluating Round Robin • Example: two jobs, D = 5 and D = 1. FCFS: R = (5 + 6)/2 = 5.5. Round robin with Q = 1: R = (2 + 6 + ε)/2 = 4 + ε/2. • Response time. RR reduces response time for short jobs. • For a given load, a job’s wait time is proportional to its D. • Fairness. RR reduces variance in wait times. • But: RR forces jobs to wait for other jobs that arrived later. • Throughput. RR imposes extra context switch overhead. • CPU is only Q/(Q+ε) as fast as it was before. • Degrades to FCFS with large Q. • Q is typically 5-100 milliseconds; ε is 1-5 μs in 1998.
Minimizing Response Time: SJF • Shortest Job First (SJF) is provably optimal if the goal is to minimize R. • Example: express lanes at the MegaMart. • Idea: get short jobs out of the way quickly to minimize the number of jobs waiting while a long job runs. • Intuition: longest jobs do the least possible damage to the wait times of their competitors. • Example: jobs with D = 3, 2, 1 run shortest-first and complete at times 1, 3, 6: R = (1 + 3 + 6)/3 = 3.33, versus 4.67 under FCFS.
SJF • In the preemptive case, this becomes shortest remaining time first. • In practice, we have to predict CPU service times (computation time until the next blocking). • Favors interactive jobs, which need fast response and repeatedly do short bursts of user interaction. • Favors jobs experiencing I/O bursts: soon to block, they get devices busy and get out of the CPU’s way. • Focus is on an average performance measure; some long jobs may starve under heavy load / constant arrival of new short jobs.
Behavior of SJF Scheduling • With SJF, best-case R is not affected by the number of tasks in the system. • Shortest jobs budge to the front of the line. • Worst-case R is unbounded, just like FCFS. • Since the queue is not “fair”, starvation exists - the longest jobs are repeatedly denied the CPU resource while other more recent jobs continue to be fed. • SJF sacrifices fairness to lower average response time.
SJF in Practice • Pure SJF is impractical: scheduler cannot predict D values. • However, SJF has value in real systems: • Many applications execute a sequence of short CPU bursts with I/O in between. • E.g., interactive jobs block repeatedly to accept user input. • Goal: deliver the best response time to the user. • E.g., jobs may go through periods of I/O-intensive activity. • Goal: request next I/O operation ASAP to keep devices busy and deliver the best overall throughput. • Use adaptive internal priority to incorporate SJF into RR. • Weather report strategy: predict future D from the recent past.
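The “weather report” prediction is usually implemented as an exponential average of observed burst lengths; this is a sketch under that standard scheme (the α value and names are illustrative):

    def predict_next_burst(prev_prediction, observed_burst, alpha=0.5):
        """Exponential average: tau(n+1) = alpha*t(n) + (1 - alpha)*tau(n).
        alpha near 1 tracks recent behavior; alpha near 0 trusts history."""
        return alpha * observed_burst + (1 - alpha) * prev_prediction

    # A job alternating short and long bursts: the prediction adapts smoothly.
    tau = 10.0
    for t in [2, 2, 12, 2, 2]:
        tau = predict_next_burst(tau, t)
        print(f"observed {t}, predict {tau:.2f} for the next burst")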
Considering I/O • In real systems, overall system performance is determined by the interactions of multiple service centers: jobs arrive at rate λ, alternate CPU bursts with I/O requests, and exit, with throughput λ until some center saturates. • A queue network has K service centers. Each job makes Vk visits to center k, demanding service Sk per visit. • Each job’s total demand at center k is Dk = Vk*Sk. • Forced Flow Law: Uk = λk Sk = λ Dk (arrivals/throughputs λk at different centers are proportional). • Easy to predict Xk, Uk, λk, Rk and Nk at each center: use the Forced Flow Law to predict the arrival rate λk at each center k, then apply Little’s Law to k. Then: R = Σ Vk*Rk.
Digression: Bottlenecks • It is easy to see that the maximum throughput X of a system is reached as 1/λ approaches Dk for the service center k with the highest demand Dk. • k is called the bottleneck center. • Overall system throughput is limited by λk when Uk approaches 1. • To improve performance, always attack the bottleneck center! • Example 1: CPU S0 = 1, I/O S1 = 4. This job is I/O bound. How much will performance improve if we double the speed of the CPU? Is it worth it? • Example 2: CPU S0 = 4, I/O S1 = 4. Demands are evenly balanced. Will multiprogramming improve system throughput in this case?
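The operational laws make bottleneck analysis a few lines of arithmetic; here is a minimal sketch (helper names are mine; demands are per-job Dk values as defined above):

    def bottleneck_analysis(demands):
        """Max throughput X is 1/max(Dk), reached when the bottleneck
        center's utilization Uk = X * Dk approaches 1 (Forced Flow Law)."""
        d_max = max(demands)
        x_max = 1.0 / d_max                          # saturation throughput
        bottleneck = demands.index(d_max)            # center with highest demand
        utilizations = [x_max * d for d in demands]  # Uk at saturation
        return x_max, bottleneck, utilizations

    # Example 1: CPU demand 1, I/O demand 4 (I/O bound): X maxes out at 0.25.
    print(bottleneck_analysis([1.0, 4.0]))
    # Doubling CPU speed leaves X unchanged: the disk is still the bottleneck.
    print(bottleneck_analysis([0.5, 4.0]))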
Two Schedules for CPU/Disk • 1. Naive Round Robin: CPU busy 25/37: U = 67%; Disk busy 15/37: U = 40%. • 2. Round Robin with SJF: CPU busy 25/25: U = 100%; Disk busy 15/25: U = 60%. • Result: a 33% performance improvement.
Multilevel Feedback Queue • Many systems (e.g., Unix variants) implement priority and incorporate SJF by using a multilevel feedback queue. • multilevel: a separate ready queue for each of N priority levels, indexed from high to low. Use RR on each queue; look at queue i-1 only if queue i is empty. • feedback: factor previous behavior into new job priority. The priority of CPU-bound jobs decays with system load and service received; I/O bound jobs waiting for CPU, jobs holding resources, and jobs with high external priority stay near the top. • GetNextToRun selects the job at the head of the highest priority non-empty queue: constant time, no sorting. A sketch follows below.
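A minimal sketch of such a scheduler (the number of levels, the demotion/boost rules, and all names are illustrative assumptions, not the exact Unix policy):

    from collections import deque

    NLEVELS = 4  # assumed number of priority levels (0 = highest)

    class MLFQ:
        def __init__(self):
            self.queues = [deque() for _ in range(NLEVELS)]

        def enqueue(self, job, level=0):
            self.queues[level].append((job, level))

        def get_next_to_run(self):
            """Constant time: scan a fixed number of levels, pop the head."""
            for q in self.queues:
                if q:
                    return q.popleft()
            return None

        def used_full_quantum(self, job, level):
            """Feedback: a CPU-bound job that burned its quantum is demoted."""
            self.enqueue(job, min(level + 1, NLEVELS - 1))

        def blocked_for_io(self, job, level):
            """Feedback: a job that blocked early is boosted toward the top."""
            self.enqueue(job, max(level - 1, 0))

    mlfq = MLFQ()
    mlfq.enqueue("editor")
    mlfq.enqueue("compiler")
    job, level = mlfq.get_next_to_run()   # -> ("editor", 0)
    mlfq.used_full_quantum(job, level)    # "editor" drops to level 1
    print(mlfq.get_next_to_run())         # -> ("compiler", 0)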
Real Time Schedulers • Real-time schedulers must support regular, periodic execution of tasks (e.g., continuous media). • e.g. Microsoft’s Rialto scheduler [Jones97] supports an external interface for: • CPU Reservations • “I need to execute for X out of every Y units.” • Scheduler exercises admission control at reservation time: application must handle failure of a reservation request. • Time Constraints • “Run this before my deadline at time T.”
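Admission control for CPU reservations can be as simple as a utilization check; an illustrative sketch (this is not Rialto's actual mechanism, which precomputes a scheduling plan, just the acceptance test in its simplest form):

    reservations = []  # accepted (X, Y) pairs: X units out of every Y

    def reserve(x, y):
        """Admit "X out of every Y units" only if total utilization stays
        at or below 1; the application must handle a rejected request."""
        if sum(xi / yi for xi, yi in reservations) + x / y <= 1.0:
            reservations.append((x, y))
            return True
        return False

    print(reserve(2, 10))  # True: 20% of the CPU reserved
    print(reserve(7, 10))  # True: 90% reserved
    print(reserve(2, 10))  # False: would exceed 100%, request must fail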
Assumptions • Tasks are periodic with constant interval between requests, Ti (request rate 1/Ti) • Each task must be completed before the next request for it occurs • Tasks are independent • Run-time for each task is constant (max), Ci • Any non-periodic tasks are special
[Task model figure: task t1 with run time C1 = 1 released every period T1; task t2 with run time C2 = 1 released every period T2.]
Definitions • Deadline is time of next request • Overflow at time t if t is deadline of unfulfilled request • Feasible schedule - for a given set of tasks, a scheduling algorithm produces a schedule so no overflow ever occurs. • Critical instant for a task - time at which a request will have largest response time. • Occurs when task is requested simultaneously with all tasks of higher priority
Rate Monotonic • Assign priorities to tasks according to their request rates, independent of run times. • Optimal in the sense that no other fixed priority assignment rule can schedule a task set which cannot be scheduled by rate monotonic: if a feasible (fixed) priority assignment exists for some task set, rate monotonic is feasible for that task set.
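A quick sufficient (not necessary) schedulability test for rate monotonic is the Liu–Layland utilization bound; a sketch using (C, T) pairs:

    def rm_schedulable(tasks):
        """tasks: list of (C, T) pairs. Liu-Layland: m periodic tasks are
        RM-schedulable if U = sum(C/T) <= m*(2**(1/m) - 1). Task sets above
        the bound may still be schedulable; a set below it always is."""
        m = len(tasks)
        u = sum(c / t for c, t in tasks)
        bound = m * (2 ** (1 / m) - 1)
        return u, bound, u <= bound

    # Two tasks with C = 1: U = 1/4 + 1/5 = 0.45 <= 0.828, so RM-feasible.
    print(rm_schedulable([(1, 4), (1, 5)]))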
Earliest Deadline First • Dynamic algorithm. • Priorities are assigned to tasks according to the deadlines of their current request. • With EDF there is no idle time prior to an overflow. • For a given set of m tasks, EDF is feasible iff C1/T1 + C2/T2 + … + Cm/Tm ≤ 1. • If a set of tasks can be scheduled by any algorithm, it can be scheduled by EDF.
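The EDF feasibility test is a one-liner; a sketch using the same (C, T) representation as above:

    def edf_feasible(tasks):
        """tasks: list of (C, T) pairs. EDF is feasible iff sum(C/T) <= 1."""
        return sum(c / t for c, t in tasks) <= 1

    # EDF can use the full processor: U = 1/2 + 2/4 = 1.0 is still feasible,
    # though it exceeds the two-task rate-monotonic bound of about 0.828.
    print(edf_feasible([(1, 2), (2, 4)]))  # -> True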
Proportional Share • Goals: to integrate real-time and non-real-time tasks, to police ill-behaved tasks, to give every process a well-defined share of the processor. • Each client i gets a weight wi. • Instantaneous share: fi(t) = wi / Σj∈A(t) wj, where A(t) is the set of active clients at time t. • Service time over [t0, t1] with fi constant: Si(t0, t1) = fi(t) (t1 - t0). • The set of active clients varies, so fi varies over time: Si(t0, t1) = ∫t0..t1 fi(t) dt.
Common Proportional Share Competitors • Weighted Round Robin - RR with quantum times proportional to share (e.g., four clients with quanta of 30, 20, 20, and 10). • Fair Share - adjustments to priorities to reflect share allocation (compatible with multilevel feedback algorithms, as in Linux).
Common Proportional Share Competitors • Fair Queuing • Weighted Fair Queuing • Stride scheduling • VT - Virtual Time: VTA(t) = WA(t) / SA, so a client's virtual time advances as it receives service, at a rate inversely proportional to its share. • VFT - Virtual Finishing Time: the VT a client would have after executing its next time quantum. • WFQ schedules the client with the smallest VFT. • EA, the service error of client A, never falls below -1. • Example: clients with shares 3, 2, 1 and current VTs 2/3, 2/2, 1/1 have VFTs 3/3, 3/2, and 2/1; the first (smallest VFT = 3/3) runs next.
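A minimal sketch of WFQ's pick-smallest-VFT rule with unit quanta (client names and shares are illustrative; ties break in dictionary order):

    def pick_by_vft(clients):
        """clients: dict name -> (service received W, share S).
        VFT = (W + 1) / S for a unit quantum; run the smallest VFT."""
        return min(clients, key=lambda n: (clients[n][0] + 1) / clients[n][1])

    clients = {"A": (0, 3), "B": (0, 2), "C": (0, 1)}  # shares 3 : 2 : 1
    for _ in range(6):
        winner = pick_by_vft(clients)
        w, s = clients[winner]
        clients[winner] = (w + 1, s)  # charge one quantum of service
        print(winner, end=" ")        # prints: A B A A B C, a 3:2:1 split
    print()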
Lottery Scheduling • Lottery scheduling [Waldspurger96] is another scheduling technique. • Elegant approach to periodic execution, priority, and proportional resource allocation. • Give Wp “lottery tickets” to each process p. • GetNextToRun selects the “winning ticket” randomly. • If ΣWp = N, then each process gets CPU share Wp/N... probabilistically, and over a sufficiently long time interval. • Flexible: tickets are transferable to allow application-level adjustment of CPU shares. • Simple, clean, fast. • Random choices are often a simple and efficient way to produce the desired overall behavior (probabilistically).
Example: List-based Lottery • A list of clients holding 10, 2, 5, 1, and 2 tickets, T = 20 in total. • Draw Random(0, 19) = 15 and walk the list accumulating ticket counts (10, 12, 17, ...); the first node where the running sum exceeds 15 - here the third - holds the winning ticket.
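A sketch of the list-based draw (Python's random module stands in for the scheduler's random number source; the ticket list mirrors the example):

    import random

    def hold_lottery(tickets):
        """tickets: per-client ticket counts. Walk the list, summing counts
        until the running sum exceeds the randomly drawn winning ticket."""
        winning = random.randrange(sum(tickets))  # uniform in [0, T-1]
        running = 0
        for client, count in enumerate(tickets):
            running += count
            if winning < running:
                return client

    # Clients hold 10, 2, 5, 1, 2 tickets: client 0 wins about half the draws.
    draws = [hold_lottery([10, 2, 5, 1, 2]) for _ in range(10000)]
    print([round(draws.count(c) / len(draws), 2) for c in range(5)])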
Linux Scheduling Policy • Runnable process with highest priority and timeslice remaining runs (SCHED_OTHER policy). • Dynamically calculated priority: • starts with the nice value; • plus a bonus or penalty reflecting whether the process is I/O or compute bound, computed by tracking sleep time vs. runnable time: sleep_avg is credited while the process sleeps and decremented by the timer tick while it runs.
Linux Scheduling Policy • Dynamically calculated timeslice: • The higher the dynamic priority, the longer the timeslice, ranging from 10 ms (low priority, less interactive) through 150 ms (default) to 300 ms (high priority, more interactive). • Recalculated every round, when “expired” and “active” swap. • Exception for expired interactive tasks: they go back on the active array unless there are starving expired tasks.
Linux task_struct • A process descriptor in kernel memory represents a process (allocated on process creation, deallocated on termination). • Linux: task_struct, located via the task pointer in the thread_info structure on the process’s kernel stack. • Fields include state, prio, policy, static_prio, sleep_avg, time_slice, …
Runqueue for O(1) Scheduler • Two priority arrays, “active” and “expired”, each an array of per-priority queues. • Higher priority queues (more I/O bound tasks) get 300 ms timeslices; lower priority queues (more CPU bound tasks) get 10 ms.
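A sketch of the core O(1) idea, two arrays of per-priority queues that swap when the active array drains (simplified: the real scheduler also keeps a bitmap of non-empty queues so the highest-priority task is found in constant time):

    from collections import deque

    NPRIO = 140  # priority levels in the 2.6-era O(1) scheduler

    class Runqueue:
        def __init__(self):
            self.active = [deque() for _ in range(NPRIO)]
            self.expired = [deque() for _ in range(NPRIO)]

        def enqueue_expired(self, task, prio):
            self.expired[prio].append(task)

        def get_next_to_run(self):
            """Scan a fixed number of levels: O(1) in the number of tasks."""
            for prio, q in enumerate(self.active):
                if q:
                    return q.popleft(), prio
            # Active array empty: swap arrays and start a new round.
            self.active, self.expired = self.expired, self.active
            for prio, q in enumerate(self.active):
                if q:
                    return q.popleft(), prio
            return None

    rq = Runqueue()
    rq.active[0].append("interactive")
    rq.enqueue_expired("batch", 30)
    print(rq.get_next_to_run())  # -> ('interactive', 0)
    print(rq.get_next_to_run())  # arrays swap -> ('batch', 30)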
Linux Real-time • No guarantees. • SCHED_FIFO • Static priority, effectively higher than SCHED_OTHER processes*. • No timeslice - it runs until it blocks or yields voluntarily. • FIFO ordering within the same priority level. • SCHED_RR • As above, but with a timeslice: RR within the same priority level. * Although their priority number ranges overlap
Beyond “Ordinary” Uniprocessors • Multiprocessors • Co-scheduling and gang scheduling • Hungry puppy task scheduling • Load balancing • Networks of Workstations • Harvesting Idle Resources - remote execution and process migration • Laptops and mobile computers • Power management to extend battery life, scaling processor speed/voltage to tasks at hand, sleep and idle modes.
Multiprocessor Scheduling • What makes the problem different? • Workload consists of parallel programs • Multiple processes or threads, synchronized and communicating • Latency defined as last piece to finish. • Time-sharing and/or Space-sharing (partitioning up the Mp nodes) • Both when and where a process should run
Architectures • Symmetric MP: processors, each with its own cache ($), sharing a bus to a common memory. • NUMA: nodes (each with a processor, cache, memory, and communication assist (CA)) connected by an interconnect, so memory access cost depends on which node's memory is touched. • Cluster: nodes with private memories communicating over a network.
Affinity Scheduling • Where (on which node) to run a particular thread during the next time slice? • Processor’s POV: favor processes which have some residual state locally (e.g. cache). • What is a useful measure of affinity for deciding this? • Least intervening time or intervening activity (number of processes run here since “my” last time). • Same place as last time “I” ran. • Possible negative effect on load-balance.
Linux Support for SMP • Every processor has its own private runqueue. • Locking - a spinlock protects each runqueue. • Load balancing - pulls tasks from the busiest runqueue into mine. • Affinity - the cpus_allowed bitmask constrains a process to a particular set of processors. • load_balance runs from schedule( ) when the runqueue is empty, or periodically, especially during idle. • Prefers to pull processes from the expired array that are not cache-hot, have high priority, and are allowed by affinity. A sketch follows below.
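An illustrative sketch of the pull-style balancing described above (simplified pseudologic in Python, not the kernel's actual load_balance; the imbalance threshold and task representation are assumptions):

    def load_balance(my_rq, all_rqs, my_cpu):
        """Pull a task from the busiest runqueue into mine, honoring the
        cpus_allowed affinity mask (cache-hotness checks omitted)."""
        busiest = max(all_rqs, key=len)
        if busiest is my_rq or len(busiest) - len(my_rq) < 2:
            return  # already balanced enough
        for task in list(busiest):
            if my_cpu in task["cpus_allowed"]:
                busiest.remove(task)
                my_rq.append(task)
                return

    rq0 = []
    rq1 = [{"name": "a", "cpus_allowed": {0, 1}},
           {"name": "b", "cpus_allowed": {1}},
           {"name": "c", "cpus_allowed": {0, 1}}]
    load_balance(rq0, [rq0, rq1], my_cpu=0)
    print([t["name"] for t in rq0])  # -> ['a']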
Processor Partitioning • Static or Dynamic • Process Control (Gupta) • Vary number of processors available • Match number of processes to processors • Adjusts # at runtime. • Works with task-queue or threads programming model • Suspend and resume are responsibility of runtime package of application • Impact on “working set”
Process Control Claims • Typical speed-up profile: speedup rises with the number of processes per application while parallelism grows and the working set fits in memory, peaks at a “magic point”, then falls off due to lock contention, memory contention, context switching, and cache corruption.
Co-Scheduling • John Ousterhout (Medusa OS) • Time-sharing model • Schedule related threads simultaneously. Why? • Common state and coordination • How? • Local scheduling decisions after some global initialization (Medusa) • Centralized (SGI IRIX)
Effect of Workload • Impact of communication and cooperation • Issues: • context switch • common state • lock contention • coordination
CM*’s Version • Matrix of S (slices) x P (processors) entries. • Allocate a new set of processes (task force) to a row with enough empty slots. • Schedule: round robin through the rows of the matrix. • If during a time slice this processor’s element is empty or not ready, run some other task force’s entry in this column, searching backward in time (for affinity reasons; a purely local “fall-back” decision). A sketch follows below.
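An illustrative sketch of the matrix discipline (the dimensions, readiness test, and names are assumptions for the example):

    S, P = 4, 3  # S time slices x P processors
    matrix = [[None] * P for _ in range(S)]

    def allocate(task_force):
        """Place a task force into a row with enough empty slots."""
        for row in matrix:
            free = [i for i, slot in enumerate(row) if slot is None]
            if len(free) >= len(task_force):
                for i, task in zip(free, task_force):
                    row[i] = task
                return True
        return False  # no row has room for the whole task force

    def pick(slice_idx, cpu, ready):
        """Run this row's entry for this column; if it is empty or not
        ready, fall back to earlier rows: a purely local decision."""
        for s in range(slice_idx, slice_idx - S, -1):
            task = matrix[s % S][cpu]
            if task is not None and ready(task):
                return task
        return None  # idle this slice

    allocate(["tf0.a", "tf0.b", "tf0.c"])
    allocate(["tf1.a", "tf1.b"])
    # Column 2 is empty in row 1, so processor 2 falls back to row 0's entry.
    print(pick(1, 2, ready=lambda t: True))  # -> 'tf0.c'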
Networks of Workstations • What makes the problem different? • Exploiting otherwise “idle” cycles. • Notion of ownership associated with workstation. • Global truth is harder to come by in wide area context
Harvesting Idle Cycles • Remote execution on an idle processor in a NOW (network of workstations) • Finding the idle machine and starting execution there. Related to load-balancing work. • Vacating the remote workstation when its user returns and it is no longer idle • Process migration
Issues • Why? • Which tasks are candidates for remote execution? • Where to find processing cycles? What does “idle” mean? • When should a task be moved? • How?
Motivation for Cycle Sharing • Load imbalances. Parallel program completion time is determined by the slowest thread, so speedup is limited. • Utilization. The trend from shared mainframes to networks of workstations moves us from scheduled cycles to statically allocated cycles. • “Ownership” model. • Heterogeneity.
Which Tasks? • Explicit submission to a “batch” scheduler (e.g., Condor), or transparent to the user. • Should be demanding enough to justify the overhead of moving elsewhere. Properties? • Proximity of resources. • Example: move query processing to the site of the database records. • Cache affinity.
Finding Destination • Defining “idle” workstations • Keyboard/mouse events? CPU load? • How timely and complete is the load information (given message transit times)? • Global view maintained by some central manager with local daemons reporting status. • Limited negotiation with a few peers • How binding is any offer of free cycles? • Task requirements must match machine capabilities