In Class 17 of EECS 583 at the University of Michigan, we reviewed key research topics including decoupled software pipelining. The midterm exam is scheduled for November 14, 2011, focusing on control flow analysis and various optimization techniques. Students were reminded of upcoming reading materials on automatic thread extraction and speculative parallel iteration chunk execution. It's essential to prepare by reviewing lecture notes and example problems. For any questions, office hours are available leading up to the exam.
EECS 583 – Class 17, Research Topic 1: Decoupled Software Pipelining. University of Michigan, November 9, 2011
Announcements + Reading Material • 2nd paper review due today • Should have been submitted to andrew.eecs.umich.edu:/y/submit • Next Monday – Midterm exam in class • Today’s class reading • “Automatic Thread Extraction with Decoupled Software Pipelining,” G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005. • Next class reading (Wednesday, Nov 16) • “Spice: Speculative Parallel Iteration Chunk Execution,” E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proceedings of the 2008 International Symposium on Code Generation and Optimization, April 2008.
Midterm Exam • When: Monday, Nov 14, 2011, 10:40–12:30 • Where • 1005 EECS • Uniquenames starting with A–H go here • 3150 Dow (our classroom) • Uniquenames starting with I–Z go here • What to expect • Open book/notes, no laptops • Apply techniques we discussed in class on examples • Reason about solving compiler problems – why things are done • A couple of thinking problems • No LLVM code • Reasonably long, but you should finish • The last 2 years’ exams are posted on the course website • Note – Past exams may not accurately predict future exams!!
Midterm Exam • Office hours between now and Monday if you have questions • Daya: Thurs and Fri 3–5pm • Scott: Wednesday 4:30–5:30, Fri 4:30–5:30 • Studying • Yes, you should study even though it’s open notes • There is lots of material that you have likely forgotten • Refresh your memory • No memorization required, but you need to be familiar with the material to finish the exam • Go through the lecture notes, especially the examples! • If you are confused on a topic, go through the reading • If still confused, come talk to me or Daya • Go through the practice exams as the final step
Exam Topics • Control flow analysis • Control flow graphs, Dom/pdom, Loop detection • Trace selection, superblocks • Predicated execution • Control dependence analysis, if-conversion, hyperblocks • Can ignore control height reduction • Dataflow analysis • Liveness, reaching defs, DU/UD chains, available defs/exprs • Static single assignment • Optimizations • Classical: Dead code elim, constant/copy prop, CSE, LICM, induction variable strength reduction • ILP optimizations - unrolling, renaming, tree height reduction, induction/accumulator expansion • Speculative optimization – like HW 1
Exam Topics - Continued • Acyclic scheduling • Dependence graphs, Estart/Lstart/Slack, list scheduling • Code motion across branches, speculation, exceptions • Software pipelining • DSA form, ResMII, RecMII, modulo scheduling • Make sure you can modulo schedule a loop! • Execution control with LC, ESC • Register allocation • Live ranges, graph coloring • Research topics • Can ignore these
Last Class • Scientific codes – Successful parallelization • KAP, SUIF, Parascope, gcc w/ Graphite • Affine array dependence analysis • DOALL parallelization • C programs • Not dominated by array accesses – classic parallelization fails • Speculative parallelization – Hydra, Stampede, Speculative multithreading • Profiling to identify statistical DOALL loops • But not all loops DOALL, outer loops typically not!! • This class – Parallelizing loops with dependences
What About Non-Scientific Codes???
• Scientific codes (FORTRAN-like):
  for(i=1; i<=N; i++)   // C
    a[i] = a[i] + 1;    // X
  → Independent Multithreading (IMT); example: DOALL parallelization
• General-purpose codes (legacy C/C++):
  while(ptr = ptr->next)        // LD
    ptr->val = ptr->val + 1;    // X
  → Cyclic Multithreading (CMT); example: DOACROSS [Cytron, ICPP 86]
(Figure: two-core execution schedules – IMT runs the independent C/X iterations on both cores at once; CMT threads the LD/X iterations across Core 1 and Core 2 in a cycle.)
Alternative Parallelization Approaches
  while(ptr = ptr->next)        // LD
    ptr->val = ptr->val + 1;    // X
• Pipelined Multithreading (PMT); example: DSWP [PACT 2004] – one core executes every LD iteration, the other every X iteration
• Cyclic Multithreading (CMT) – whole iterations alternate between the cores
(Figure: two-core execution schedules for PMT and CMT.)
Comparison: IMT, PMT, CMT
• lat(comm) = 1: IMT, PMT, and CMT all achieve 1 iter/cycle
• lat(comm) = 2: IMT and PMT still achieve 1 iter/cycle, but CMT drops to 0.5 iter/cycle – CMT pays the communication latency on every iteration, while PMT pays it only once to fill the pipeline
• PMT keeps recurrences thread-local → fast execution
• CMT communicates cross-thread dependences every iteration → wide applicability, but throughput is sensitive to communication latency
(Figure: two-core execution schedules for IMT, PMT, and CMT under both communication latencies.)
Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP
• Example: 197.parser decomposes into a pipeline – Find English Sentences → Parse Sentences (95%) → Emit Results
• Techniques: Decoupled Software Pipelining and PS-DSWP (speculative DOALL middle stage)
Decoupled Software Pipelining (DSWP) [MICRO 2005]
  A: while(node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;
• Build the dependence graph (register, control, intra-iteration, and loop-carried dependences), collapse strongly connected components to form the DAG_SCC, then partition it into pipeline stages: Thread 1 runs the A/D recurrence, Thread 2 runs B and C, connected by a communication queue
• Inter-thread communication latency is a one-time cost
Implementing DSWP
(Figure: loop L1 plus auxiliary code Aux, with the corresponding dataflow graph (DFG). Edge legend: register, control, memory; intra-iteration, loop-carried.)
Optimization: Node Splitting to Eliminate Cross-Thread Control
(Figure: DFG partitioned into threads L1 and L2; splitting a node removes a control dependence that would otherwise cross threads. Edge legend: register, control, memory; intra-iteration, loop-carried.)
Optimization: Node Splitting to Reduce Communication
(Figure: DFG partitioned into threads L1 and L2; splitting a node reduces the number of values communicated between threads. Edge legend: register, control, memory; intra-iteration, loop-carried.)
Constraint: Strongly Connected Components
• Consider a dependence cycle split across threads: communication flows in both directions, which eliminates the pipelined/decoupled property
• Solution: collapse each strongly connected component into a single node and partition the acyclic DAG_SCC instead
(Figure: example DFG containing a dependence cycle. Edge legend: register, control, memory; intra-iteration, loop-carried.)
2 Extensions to the Basic Transformation • Speculation • Break statistically unlikely dependences • Form better-balanced pipelines • Parallel Stages • Execute multiple copies of certain “large” stages • Stages that contain inner loops are perfect candidates
Why Speculation?
  A: while(cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;
• The early-exit test in A makes B, C, and D control-dependent on A, while A in turn depends on C’s cost and D’s node – the whole loop collapses into one large SCC in the DAG_SCC, leaving nothing to pipeline
• But these dependences are predictable: the loop almost always continues, so speculating the exit branch breaks them and restores a multi-stage pipeline
(Figure: dependence graph and DAG_SCC before and after removing the predictable dependences. Edge legend: register, control, intra-iteration, loop-carried; communication queue.)
Execution Paradigm
• Stages execute speculatively ahead of the loop exit; when misspeculation is detected, recovery squashes the affected work and reruns the iteration (in the figure, iteration 4)
(Figure: DAG_SCC pipeline stages D, B, C, A with communication queues and a misspeculation-recovery rerun of iteration 4.)
Understanding PMT Performance
• The iteration time t_i of a stage is at least as large as its longest dependence recurrence
• Finding the longest recurrence is NP-hard, and large loops make the problem difficult in practice
• Throughput is set by the slowest thread: if every stage takes 1 cycle/iter the pipeline sustains 1 iter/cycle, but one 2-cycle/iter stage drops it to 0.5 iter/cycle, leaving idle time on the faster core
(Figure: two-core schedules of stages A, B, C illustrating the balanced and unbalanced cases.)
Selecting Dependences To Speculate
  A: while(cost < T && node)
  B:   ncost = doit(node);
  C:   cost += ncost;
  D:   node = node->next;
• Speculating the predictable dependences splits the DAG_SCC into four pipeline stages: Thread 1 (D), Thread 2 (B), Thread 3 (C), Thread 4 (A)
(Edge legend: register, control, intra-iteration, loop-carried; stages connected by communication queues.)
Detecting Misspeculation
  Thread 1:  A1: while(TRUE)
             D:    node = node->next;
                   produce({0,1}, node);
  Thread 2:  A2: while(TRUE)
             D2:   node = consume(0);
             B:    ncost = doit(node);
                   produce(2, ncost);
  Thread 3:  A3: while(TRUE)
             B3:   ncost = consume(2);
             C:    cost += ncost;
                   produce(3, cost);
  Thread 4:  A:  B4: cost = consume(3);
                 C4: node = consume(1);
                 if (!(cost < T && node)) FLAG_MISSPEC();
• Instead of consuming the branch outcome from Thread 4 each iteration, Threads 1–3 speculate that the loop keeps iterating (while(TRUE)); Thread 4 recomputes the true exit condition from the consumed values and flags misspeculation when it fails
Breaking False Memory Dependences
• False memory dependences would otherwise serialize the stages; memory versioning breaks them – each speculative iteration writes into its own memory version (e.g., Memory Version 3), and the oldest version is committed by the recovery thread
(Figure: dependence graph over D, B, A, C with the false memory dependence removed by versioning. Edge legend: register, control, intra-iteration, loop-carried; communication queue.)
Adding Parallel Stages to DSWP
  while(ptr = ptr->next)        // LD, 1 cycle
    ptr->val = ptr->val + 1;    // X, 2 cycles
• Communication latency = 2 cycles
• Replicating the slow X stage across two cores balances the pipeline
• Throughput – DSWP: 1/2 iteration/cycle; DOACROSS: 1/2 iteration/cycle; PS-DSWP: 1 iteration/cycle
(Figure: three-core schedule with LD iterations on Core 1 and X iterations alternating between Cores 2 and 3.)
Thread Partitioning
  p = list; sum = 0;
  A: while (p != NULL) {
  B:   id = p->id;
  E:   q = p->inner_list;
  C:   if (!visited[id]) {
  D:     visited[id] = true;
  F:     while (foo(q))
  G:       q = q->next;
  H:     if (q != NULL)
  I:       sum += p->value;
       }
  J:   p = p->next;
     }
(Figure: PDG with profile weights on the nodes – 10 for the outer-loop statements, 50 for the inner-loop statements F and G, 5 and 3 elsewhere; the sum accumulation in I is marked as a reduction.)
Thread Partitioning
• Merging invariants
  • No cycles
  • No loop-carried dependence inside a DOALL node
(Figure: PDG after merging, with node weights 20, 10, 15, 5, 100, 5, and 3.)
Thread Partitioning
• Nodes that cannot go into a DOALL stage are treated as sequential
(Figure: PDG with node weights 20, 10, and 15 treated as sequential; remaining weight 113.)
Thread Partitioning
• Final partition: a sequential stage (weight 45) and a parallel stage (weight 113)
• Modified MTCG [Ottoni, MICRO’05] to generate code from the partition
Discussion Point 1 – Speculation • How do you decide what dependences to speculate? • Look solely at profile data? • What about code structure? • How do you manage speculation in a pipeline? • Traditional definition of a transaction is broken • Transaction execution spread out across multiple cores
Discussion Point 2 – Pipeline Structure • When is a pipeline a good/bad choice for parallelization? • Is pipelining good or bad for cache performance? • Is DOALL better/worse for cache? • Can a pipeline be adjusted when the number of available cores increases/decreases?