Chapter 10. Scheduling. Presented by Vladimir Yanovsky. The goals. Scheduling: Mapping of parallelism within the constraints of limited available parallel resources In general, we must sacrifice some parallelism to fit a program within the available resources
 
                
Chapter 10 Scheduling Presented by Vladimir Yanovsky
The goals • Scheduling: Mapping of parallelism within the constraints of limited available parallel resources • In general, we must sacrifice some parallelism to fit a program within the available resources • Our goal: Minimize the amount of parallelism sacrificed/maximize utilization of the resources
Lecture Outline • Straight line scheduling • Trace Scheduling • Loops: Kernel Scheduling (Software Pipelining) • Vector unit scheduling
Scheduling - Motivation • Transistor sizes have shrunk. This can be exploited by: • Several processors on the same silicon. • Multiple identical execution units. • The more parallelism the processor allows, the more important scheduling is.
Processor Types • Superscalar: multiple functional units controlled and scheduled by the hardware. • VLIW (Very Long Instruction Word): scheduled by the compiler.
VLIW vs Superscalar • Compatibility • Capability of run-time adjustments (branches & cache misses) • Design simplicity • Global view of the program
Scheduling – standard approach • Scheduling in VLIW and Superscalar architectures: • Receive a sequential stream of instructions • Reorder this sequential stream to utilize available parallelism • Reordering must preserve dependences • Our model for this talk is VLIW
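Reuse Constraints
• Need to execute: a = b + c + d + e
• One possible sequential stream:
    add a, b, c
    add a, a, d
    add a, a, e
• And another:
    add r1, b, c
    add r2, d, e
    add a, r1, r2
• The first stream reuses a and so serializes all three adds; the second needs two extra registers but lets the first two adds run in parallel.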
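The difference between the two streams can be made concrete by measuring the longest chain of true dependences in each. The sketch below is only an illustration: the helper name `chain_depth` and the tuple encoding of an `add` are assumptions, and every instruction is counted as one step.

```python
def chain_depth(instrs):
    """Longest chain of true (flow) dependences in a straight-line stream.

    instrs is a list of (dest, src1, src2) tuples, one per add instruction.
    """
    depth = {}      # depth of the latest definition of each register
    longest = 0
    for dest, src1, src2 in instrs:
        # An instruction sits one step after its deepest operand.
        d = 1 + max(depth.get(src1, 0), depth.get(src2, 0))
        depth[dest] = d
        longest = max(longest, d)
    return longest

# The two streams for a = b + c + d + e.
stream1 = [("a", "b", "c"), ("a", "a", "d"), ("a", "a", "e")]
stream2 = [("r1", "b", "c"), ("r2", "d", "e"), ("a", "r1", "r2")]

print(chain_depth(stream1), chain_depth(stream2))
```

With unit delays the first stream needs three dependent steps and the second only two, at the cost of the registers r1 and r2.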
Fundamental Problem • Fundamental conflict in scheduling: • If the original instruction stream takes into account available resources, it will create artificial dependences • If not, then there may not be enough resources to correctly execute the stream • Which should come first: register allocation or scheduling?
Processor Model • VLIW type • Processor contains a number of issue units • Each issue unit has an associated type and a delay • Purpose: to select a set of instructions for each cycle such that the number of instructions of each type is not greater than the number of execution units of this type
Straight Line Scheduling • Scheduling a basic block: receives a dependence graph G = (N, E, type, delay) • N: set of instructions in the code • E: (n1, n2) ∈ E iff n2 must wait for the completion of n1 due to a dependence • Each n ∈ N has a type, type(n), and a delay, delay(n).
Straight Line Scheduling • A correct schedule is a mapping, S, from vertices in the graph to nonnegative integers representing cycle numbers such that: • If (n1,n2) ∈ E, then S(n1) + delay(n1) ≤ S(n2), i.e. dependences are satisfied • Hardware constraints are satisfied. • The length of a schedule S, denoted L(S), is defined as: L(S) = max over n of (S(n) + delay(n)) • Goal of straight-line scheduling: find a shortest possible correct schedule.
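The two conditions for a correct schedule and the definition of L(S) translate directly into a checker. This is a minimal sketch: the function names, the dictionary encoding and the three-instruction block are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def schedule_length(S, delay):
    """L(S) = max over n of S(n) + delay(n)."""
    return max(S[n] + delay[n] for n in S)

def is_correct_schedule(S, edges, delay, typ, units):
    """Check the dependence and hardware constraints for schedule S."""
    # Dependence constraint: S(n1) + delay(n1) <= S(n2) for every edge.
    for n1, n2 in edges:
        if S[n1] + delay[n1] > S[n2]:
            return False
    # Hardware constraint: in each cycle, no more instructions of a
    # type are issued than there are units of that type.
    issued = Counter((S[n], typ[n]) for n in S)
    return all(count <= units[t] for (_, t), count in issued.items())

# Hypothetical block: a load feeding an add feeding a store.
delay = {"ld": 2, "add": 1, "st": 1}
typ = {"ld": "mem", "add": "int", "st": "mem"}
edges = [("ld", "add"), ("add", "st")]
units = {"mem": 1, "int": 1}

good = {"ld": 0, "add": 2, "st": 3}
bad = {"ld": 0, "add": 1, "st": 3}   # add issued before the load completes

print(is_correct_schedule(good, edges, delay, typ, units))
print(is_correct_schedule(bad, edges, delay, typ, units))
print(schedule_length(good, delay))
```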
List Scheduling • Use variant of topological sort: • Maintain a list of instructions which have no unscheduled predecessors in the graph • Schedule these instructions • This will allow other instructions to be added to the list • Repeat until all instructions are scheduled
List Scheduling • We maintain two arrays: • count determines for each instruction how many predecessors are still to be scheduled • earliest array maintains the earliest cycle on which the instruction can be scheduled. • Maintain a number of worklists which hold instructions to be scheduled for a particular cycle number. All their predecessors are scheduled.
List Scheduling - Initialization
for each n ∈ N do begin count[n] := 0; earliest[n] := 0 end
for each (n1,n2) ∈ E do begin
   count[n2] := count[n2] + 1;
   successors[n1] := successors[n1] ∪ {n2};
end
for i := 0 to MaxC - 1 do W[i] := ∅; //MaxC = max(delay) + 1
Wcount := 0; //The number of ready instructions
for each n ∈ N do
   if count[n] = 0 then begin //No dependences
      W[0] := W[0] ∪ {n}; Wcount := Wcount + 1;
   end
c := 0; // c is the cycle number
cW := 0; // cW is the number of the worklist for cycle c
instr[c] := ∅;
List Scheduling Algorithm
while Wcount > 0 do begin
   while W[cW] = ∅ do begin
      c := c + 1; instr[c] := ∅; cW := mod(cW+1,MaxC);
   end
   nextc := mod(c+1,MaxC); //next cycle
   while W[cW] ≠ ∅ do begin
      select and remove an arbitrary instruction x from W[cW];
      if there are free issue units of type(x) on cycle c then begin
         instr[c] := instr[c] ∪ {x}; Wcount := Wcount - 1;
         for each y ∈ successors[x] do begin
            count[y] := count[y] - 1;
            earliest[y] := max(earliest[y], c+delay(x));
            if count[y] = 0 then begin
               loc := mod(earliest[y],MaxC);
               W[loc] := W[loc] ∪ {y}; Wcount := Wcount + 1;
            end
         end
      end
      else W[nextc] := W[nextc] ∪ {x}; //x could not be scheduled
   end
   //For each unused unit insert a stall
end
• Priority: the "arbitrary" selection of x is the point where a priority heuristic can be applied.
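The list scheduling idea can be sketched in a few lines of Python. This is a simplified illustration rather than a transcription of the pseudocode: it scans one ready list per cycle instead of keeping MaxC cyclic worklists, it assumes every delay is at least 1, and the function name and example block are hypothetical.

```python
def list_schedule(nodes, edges, delay, typ, units):
    """Return a dict mapping each instruction to its issue cycle."""
    count = {n: 0 for n in nodes}        # unscheduled predecessors
    earliest = {n: 0 for n in nodes}     # earliest legal issue cycle
    successors = {n: [] for n in nodes}
    for n1, n2 in edges:
        count[n2] += 1
        successors[n1].append(n2)

    ready = [n for n in nodes if count[n] == 0]
    S = {}
    c = 0
    while ready:
        free = dict(units)               # issue units still free in cycle c
        deferred, newly_ready = [], []
        for x in ready:
            if earliest[x] <= c and free[typ[x]] > 0:
                free[typ[x]] -= 1
                S[x] = c
                for y in successors[x]:
                    count[y] -= 1
                    earliest[y] = max(earliest[y], c + delay[x])
                    if count[y] == 0:
                        newly_ready.append(y)
            else:
                deferred.append(x)       # x could not be scheduled this cycle
        ready = deferred + newly_ready
        c += 1
    return S

# Hypothetical block: two loads feed an add, which feeds a store.
nodes = ["ld1", "ld2", "add", "st"]
edges = [("ld1", "add"), ("ld2", "add"), ("add", "st")]
delay = {"ld1": 2, "ld2": 2, "add": 1, "st": 1}
typ = {"ld1": "mem", "ld2": "mem", "add": "int", "st": "mem"}
units = {"mem": 1, "int": 1}

print(list_schedule(nodes, edges, delay, typ, units))
```

With a single memory unit the second load slips to cycle 1, so the add cannot issue before cycle 3 and the store follows in cycle 4.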
Finding the critical path
for each n ∈ N do begin count[n] := 0; remaining[n] := delay(n); end
for each (n1,n2) ∈ E do begin
   count[n1] := count[n1] + 1; //count[n] = 0 iff nothing depends on n
   predecessors[n2] := predecessors[n2] ∪ {n1};
end
W := ∅;
for each n ∈ N do
   if count[n] = 0 then W := W ∪ {n}; //init: W holds instructions that nothing depends on
while W ≠ ∅ do begin
   select and remove an arbitrary instruction x from W;
   for each y ∈ predecessors[x] do begin
      count[y] := count[y] - 1;
      remaining[y] := max(remaining[y], remaining[x]+delay(y));
      if count[y] = 0 then W := W ∪ {y};
   end
end
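The same backward pass can be written in Python. The sketch below follows the pseudocode closely; the function name and the example graph are hypothetical.

```python
def remaining_path(nodes, edges, delay):
    """For each instruction, the delay of the longest path from it to the end."""
    count = {n: 0 for n in nodes}            # unprocessed successors
    remaining = {n: delay[n] for n in nodes}
    predecessors = {n: [] for n in nodes}
    for n1, n2 in edges:
        count[n1] += 1
        predecessors[n2].append(n1)

    # Start from the instructions that nothing depends on.
    work = [n for n in nodes if count[n] == 0]
    while work:
        x = work.pop()
        for y in predecessors[x]:
            count[y] -= 1
            remaining[y] = max(remaining[y], remaining[x] + delay[y])
            if count[y] == 0:
                work.append(y)
    return remaining

# Hypothetical block: two loads feed an add, which feeds a store.
nodes = ["ld1", "ld2", "add", "st"]
edges = [("ld1", "add"), ("ld2", "add"), ("add", "st")]
delay = {"ld1": 2, "ld2": 2, "add": 1, "st": 1}

print(remaining_path(nodes, edges, delay))
```

A list scheduler can use the resulting `remaining` values as priorities, picking the ready instruction with the largest value first.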
Problems of list scheduling • Previous basic block must complete before the next is started. • Cannot schedule loops.
Trace Scheduling • Exploits parallelism between several basic blocks. • Trace: a collection of basic blocks that form a single path through all or part of the program. • Applies to a CFG without loops
Trace Scheduling • [Figure: a trace containing j = j + 1, i = i + 2 and the split "if e1", before and after scheduling] • i = i + 2 is moved below the split, so fixup code is inserted
Trace Scheduling • Trace scheduling algorithm: 1. Select a trace based on profiling information. 2. Schedule the trace using the basic block scheduler, adding dependences from the splits/joins to the upstream/downstream instructions respectively. 3. Insert fixup code. 4. Remove the scheduled trace from the CFG. 5. If the CFG is not empty, go to step 1.
Trace & line scheduling - conclusions • Problem with line & trace scheduling: they cannot schedule loops effectively. Loops must be unrolled to give the scheduler more "meat" to work with. • Trace scheduling increases code size by inserting fixup code, which may lead to exponential code growth. • Up-to-date memory dependence information is needed to move memory accesses at all.
Kernel Scheduling • Moves instructions not only in space but also in time, across iterations. • Allows better exploitation of parallelism between loop iterations.
Kernel Scheduling problem • A kernel scheduling problem is a graph: G = (N, E, delay, type, cross), where cross(n1, n2), defined for each edge in E, is the number of iterations crossed by the dependence relating n1 and n2
Software Pipelining
• Example:
        ld r1,0
        ld r2,400
        fld fr1,c
    l0  fld fr2,a(r1)
    l1  fadd fr2,fr2,fr1
    l2  fst fr2,b(r1)
    l3  ai r1,r1,8
    l4  comp r1,r2
    l5  ble l0
• Delays: fld 2, fadd 3, fst (not given)
• A legal schedule:
         Load/Store          Integer       Floating Pt.
    l0:  fld fr2,a(r1)       ai r1,r1,8
                             comp r1,r2
         fst fr3,b-16(r1)    ble l0        fadd fr3,fr2,fr1
Software Pipelining
        ld r1,0
        ld r2,400
        fld fr1,c
    l0  fld fr2,a(r1)
    l1  fadd fr2,fr2,fr1
    l2  fst fr2,b(r1)
    l3  ai r1,r1,8
    l4  comp r1,r2
    l5  ble l0
• Kernel:
         Load/Store          Integer       Floating Pt.
    l0:  fld fr2,a(r1)       ai r1,r1,8
                             comp r1,r2
         fst fr3,b-16(r1)    ble l0        fadd fr3,fr2,fr1
• Schedule and iteration tables:
    S[l0] = 0; I[l0] = 0
    S[l1] = 2; I[l1] = 0
    S[l2] = 2; I[l2] = 1
    S[l3] = 0; I[l3] = 0
    S[l4] = 1; I[l4] = 0
    S[l5] = 2; I[l5] = 0
Software Pipelining
• A prolog and an epilog have to be generated to ensure correctness
• Prolog:
        ld r1,0
        ld r2,400
        fld fr1,c
    p1  fld fr2,a(r1); ai r1,r1,8
    p2  comp r1,r2
    p3  beq e1; fadd fr3,fr2,fr1
• Epilog:
    e1  nop
    e2  nop
    e3  fst fr3,b-8(r1)
Kernel Scheduling • A solution to the kernel scheduling problem is a pair of tables (S,I), where: • the schedule S maps each instruction n to a cycle within the kernel • the iteration I maps each instruction to an iteration offset from zero, such that: S[n1] + delay(n1) ≤ S[n2] + (I[n2] - I[n1] + cross(n1,n2)) · Lk(S) for each edge (n1,n2) in E, where Lk(S) = max over n of S[n] is the length of the kernel for S. • Another name for the kernel's length is II, the initiation interval
Kernel scheduling - intuition • S[n1] + delay(n1) ≤ S[n2] + (I[n2] - I[n1] + cross(n1,n2)) · Lk(S) • Instructions with I[n] = 0 run in the "current" iteration. • I[n] > 0 means that the instruction is delayed by I[n] iterations. • Even if n1 has a large delay, n2 can be moved to a later iteration instead of being forced into cycle S[n1] + delay(n1)
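The constraint can be checked mechanically against the S and I tables of the fld/fadd/fst loop. The sketch below rests on several assumptions that the slides do not state: the edge list is a partial, hand-written dependence graph; delays other than fld = 2 and fadd = 3 are taken to be 1; and the kernel length is passed in explicitly as L = 3, counting the three kernel cycles 0 to 2.

```python
def satisfies_kernel_constraints(S, I, edges, delay, L):
    """Check S[n1] + delay(n1) <= S[n2] + (I[n2] - I[n1] + cross) * L per edge.

    edges is a list of (n1, n2, cross) triples.
    """
    return all(
        S[n1] + delay[n1] <= S[n2] + (I[n2] - I[n1] + cross) * L
        for n1, n2, cross in edges
    )

# Tables from the software pipelining example.
S = {"l0": 0, "l1": 2, "l2": 2, "l3": 0, "l4": 1, "l5": 2}
I = {"l0": 0, "l1": 0, "l2": 1, "l3": 0, "l4": 0, "l5": 0}

# Assumed delays: only fld (l0) = 2 and fadd (l1) = 3 come from the slides.
delay = {"l0": 2, "l1": 3, "l2": 1, "l3": 1, "l4": 1, "l5": 1}

# Assumed partial dependence graph: fld -> fadd -> fst, ai -> comp -> ble,
# and ai -> fld carried into the next iteration (cross = 1).
edges = [("l0", "l1", 0), ("l1", "l2", 0), ("l3", "l4", 0),
         ("l4", "l5", 0), ("l3", "l0", 1)]

print(satisfies_kernel_constraints(S, I, edges, delay, L=3))

# Without delaying fst by one iteration, the fadd -> fst edge is violated.
I_no_delay = dict(I, l2=0)
print(satisfies_kernel_constraints(S, I_no_delay, edges, delay, L=3))
```

The second call shows the intuition from the slide: fst can share cycle 2 with fadd only because it is moved one iteration later.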
Resource Constraints • Resource usage constraint, for a loop with no recurrence: • #t: number of instructions in each iteration that must issue in a unit of type t; mt: number of units of type t • Lk(S) ≥ MAXt ⌈#t / mt⌉ • We can always find a schedule S such that Lk(S) = MAXt ⌈#t / mt⌉
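The resource bound is simple arithmetic: for each unit type, divide the number of instructions that need it by the number of units of that type, round up, and take the maximum. The sketch below applies this to the six-instruction fld/fadd/fst loop, assuming one unit per column of its schedule table (Load/Store, Integer, Floating Pt.); the function name and the type labels are hypothetical.

```python
from collections import Counter
from math import ceil

def resource_bound(typ, units):
    """Lower bound on kernel length from issue-unit usage alone."""
    need = Counter(typ.values())     # instructions per unit type
    return max(ceil(need[t] / units[t]) for t in need)

# Unit types of the loop body l0..l5 (assumed from the schedule table).
typ = {"l0": "mem", "l1": "fp", "l2": "mem",
       "l3": "int", "l4": "int", "l5": "int"}
units = {"mem": 1, "int": 1, "fp": 1}

print(resource_bound(typ, units))
```

The three integer instructions dominate, so no kernel shorter than three cycles is possible on this machine, which matches the three-cycle kernel shown for the example.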
Kernel Scheduling
for each instruction x in G in topological order do begin
   earlyS := 0; earlyI := 0;
   for each predecessor y of x in G do begin
      thisS := S[y] + delay(y); thisI := I[y];
      if thisS ≥ L then begin
         thisI := thisI + ⌊thisS / L⌋; //iterations crossed while waiting for y
         thisS := mod(thisS,L);
      end
      if thisI > earlyI or ((thisI = earlyI) and (thisS > earlyS)) then begin
         earlyI := thisI; earlyS := thisS;
      end
   end
   starting at cycle earlyS, find the first cycle c0 where the resource needed by x is available, wrapping to the beginning of the kernel if necessary;
   S[x] := c0;
   if c0 < earlyS then I[x] := earlyI + 1 //Wrapped over kernel
   else I[x] := earlyI;
end
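A Python sketch of this simple kernel scheduler for a loop without recurrences follows. Several details are assumptions made for illustration: the iteration offset is advanced by the number of whole kernel lengths in S[y] + delay(y), resources are tracked as a per-cycle count of used units, and the function name and data encoding are hypothetical.

```python
def kernel_schedule(order, preds, delay, typ, units, L):
    """Schedule instructions (given in topological order) into a kernel of L cycles."""
    S, I = {}, {}
    used = [dict.fromkeys(units, 0) for _ in range(L)]  # units taken per cycle
    for x in order:
        earlyS, earlyI = 0, 0
        for y in preds.get(x, []):
            ready = S[y] + delay[y]
            thisI = I[y] + ready // L    # whole kernels crossed waiting for y
            thisS = ready % L
            if (thisI, thisS) > (earlyI, earlyS):
                earlyI, earlyS = thisI, thisS
        # Find the first cycle at or after earlyS with a free unit, wrapping.
        for offset in range(L):
            c0 = (earlyS + offset) % L
            if used[c0][typ[x]] < units[typ[x]]:
                break
        else:
            raise ValueError("L is below the resource bound")
        used[c0][typ[x]] += 1
        S[x] = c0
        I[x] = earlyI + 1 if c0 < earlyS else earlyI   # +1 if wrapped
    return S, I

# Load, three dependent increments, store; all delays assumed to be 1.
order = ["l0", "l1", "l2", "l3", "l4"]
preds = {"l1": ["l0"], "l2": ["l1"], "l3": ["l2"], "l4": ["l3"]}
delay = dict.fromkeys(order, 1)
typ = {"l0": "mem", "l1": "int", "l2": "int", "l3": "int", "l4": "mem"}
units = {"mem": 2, "int": 3}

S, I = kernel_schedule(order, preds, delay, typ, units, L=1)
print(S, I)
```

With L = 1 every instruction lands in cycle 0 and each one is pushed one iteration later than its predecessor.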
Software Pipelining Example
    l0  ld a,x(i)
    l1  ai a,a,1
    l2  ai a,a,1
    l3  ai a,a,1
    l4  st a,x(i)
• 2 memory units, 3 integer units (Memory1, Memory2, Integer1, Integer2, Integer3).
• Schedule:
    l0: S=0; I=0
    l1: S=0; I=1
    l2: S=0; I=2
    l3: S=0; I=3
    l4: S=0; I=4
• II=1 is enough. Each instruction is pushed one iteration later than the one before it.
Register Pressure
• The same register a cannot be used in 4 different iterations running simultaneously.
• Need to store the register's value for each of the overlapping iterations and rename the registers cyclically after each iteration.
• The second issue can be solved by unrolling with renaming, though this increases code size:
    l0  ld a0,x(i)
    l1  ai a1,a0,1
    l2  ai a2,a1,1
    l3  ai a3,a2,1
    l4  st a3,x(i)
Prolog & Epilog • Current iteration when entering the kernel is 5. • I(Stage A)=0, that is we execute Stage A in the same iteration as initially. • I(Stage B) = 1, i.e. Stage B is always delayed to the next iteration. • Prolog: StageA1; StageB1,StageA2;StageC1,StageB2,StageA3…
Prolog & Epilog generation • Prolog: for k = 0 to max over n of I(n) - 1, lay out the kernel, replacing every n such that I(n) > k by a NO-OP • Epilog: for k = 1 to max over n of I(n), lay out the kernel, replacing every n such that I(n) < k by a NO-OP • Compact both using list scheduling.
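The two loops can be sketched directly. The code below is an illustration only: the kernel is encoded as a list of cycles, each a list of instruction labels, and the example reuses the fld/fadd/fst kernel with its I table. The function name and encoding are assumptions, and the final compaction step is left out.

```python
def prolog_epilog(kernel, I):
    """Lay out uncompacted prolog and epilog copies of the kernel."""
    max_i = max(I.values())

    def copy_kernel(keep):
        # One copy of the kernel with the filtered-out instructions as nops.
        return [[n if keep(n) else "nop" for n in cycle] for cycle in kernel]

    # Prolog copy k keeps only instructions with I(n) <= k.
    prolog = [copy_kernel(lambda n, k=k: I[n] <= k) for k in range(max_i)]
    # Epilog copy k keeps only instructions with I(n) >= k.
    epilog = [copy_kernel(lambda n, k=k: I[n] >= k)
              for k in range(1, max_i + 1)]
    return prolog, epilog

# Kernel cycles: (fld, ai), (comp), (fst, ble, fadd); only fst has I = 1.
kernel = [["l0", "l3"], ["l4"], ["l2", "l5", "l1"]]
I = {"l0": 0, "l1": 0, "l2": 1, "l3": 0, "l4": 0, "l5": 0}

prolog, epilog = prolog_epilog(kernel, I)
print(prolog)
print(epilog)
```

The single prolog copy drops only fst and the single epilog copy keeps only fst, which has the same shape as the p1 to p3 and e1 to e3 code shown for this loop.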
Recurrences • Given a recurrence (n1, n2, …, nk): Lk(S) ≥ (delay(n1) + … + delay(nk)) / (cross(n1,n2) + … + cross(nk,n1)) • The right-hand side is called the slope of the recurrence. The numerator is the number of cycles it takes to complete all the computations of the recurrence; the denominator is the number of iterations available to do this. • Over all recurrences c: Lk(S) ≥ MAXc ⌈slope(c)⌉
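The slope of a recurrence is total delay divided by total iterations crossed. The sketch below computes it for two hypothetical recurrences; the names, delays and cross values are invented for illustration, and rounding the bound up to a whole number of cycles is an assumption based on the kernel length being an integer.

```python
from math import ceil

def recurrence_slope(cycle_edges, delay):
    """Slope of one recurrence given as (n1, n2, cross) edges around the cycle."""
    total_delay = sum(delay[n1] for n1, _, _ in cycle_edges)
    total_cross = sum(cross for _, _, cross in cycle_edges)
    return total_delay / total_cross

def recurrence_bound(recurrences, delay):
    """Lower bound on kernel length from the steepest recurrence."""
    return max(ceil(recurrence_slope(r, delay)) for r in recurrences)

# Hypothetical delays and recurrences.
delay = {"a": 2, "b": 3, "p": 1, "q": 2}
rec1 = [("a", "b", 0), ("b", "a", 1)]   # 5 cycles over 1 iteration
rec2 = [("p", "q", 1), ("q", "p", 1)]   # 3 cycles over 2 iterations

print(recurrence_slope(rec1, delay), recurrence_slope(rec2, delay))
print(recurrence_bound([rec1, rec2], delay))
```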
Kernel Scheduling – General Case • 1. Compute MII as the maximum of the resource constraint and the maximum recurrence slope. 2. Set II = MII. 3. Remove an edge from every recurrence. 4. Schedule(II) using the simple kernel scheduling algorithm. 5. If this fails (the dependence of any removed edge is violated), increase II and go to step 4.
Kernel Scheduling - Conclusions • Handling control flow is difficult. May use hardware support for predicated execution, or handle the "control flow regions" as black boxes. • Increased register pressure; in practice this may limit kernel scheduling to single-basic-block inner loops anyway. • Benefits from unrolling with renaming.
Vector Unit Scheduling • A vector instruction involves the execution of many scalar instructions • Much of the benefit from the pipelining is already achieved • Still, something can be done
Chaining
• Chaining:
    vload t1, a
    vload t2, b
    vadd t3, t1, t2
    vstore t3, c
• Two load units
• Each operation takes 64 cycles
• 192 cycles without chaining
• 66 cycles with chaining
• Instructions must be close together in the instruction stream for the hardware to identify opportunities for chaining
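The two cycle counts can be reproduced with a small timing model. This is only a sketch under assumptions not stated on the slide: the two loads run in parallel and count as one stage, stages run back to back without chaining, and with chaining each later stage starts one cycle after the stage feeding it. The function names and the one-cycle chain latency are hypothetical.

```python
def unchained_cycles(stages, vector_cycles):
    """Each stage waits for the previous one to finish completely."""
    return stages * vector_cycles

def chained_cycles(stages, vector_cycles, chain_latency=1):
    """Each later stage starts chain_latency cycles after the one feeding it."""
    return vector_cycles + (stages - 1) * chain_latency

# Three dependent stages: (vload, vload) in parallel, then vadd, then vstore.
print(unchained_cycles(3, 64))
print(chained_cycles(3, 64))
```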
Chaining: rearranging
• 2 load pipes, 1 addition pipe, 1 multiplication pipe
• Original order:
    vload a,x(i)
    vload b,y(i)
    vadd t1,a,b
    vload c,z(i)
    vmul t2,c,t1
    vmul t3,a,b
    vadd t4,c,t3
• Rearranged:
    vload a,x(i)
    vload b,y(i)
    vadd t1,a,b
    vmul t3,a,b
    vload c,z(i)
    vmul t2,c,t1
    vadd t4,c,t3
Instruction fusion
    vload a,x(i)
    vload b,y(i)
    vadd t1,a,b
    vload c,z(i)
    vmul t2,c,t1
    vmul t3,a,b
    vadd t4,c,t3
Instruction fusion – cont.
• After fusion:
    vload a,x(i)
    vload b,y(i)
    vadd t1,a,b
    vmul t3,a,b
    vload c,z(i)
    vmul t2,c,t1
    vadd t4,c,t3