When All Else Fails, Guess: The Use of Speculative Multithreading for High-Performance Computing

HPCC SEMINARS 12/12/2001
David J. Lilja
Presented By: Baris Kazar

Outline
I. Introduction
II. Speculative Multithreading
  A. Maybe Dependences
  B. Programming Model
III. The Super-Threaded Processor
  A. Architecture
  B. Compiler Support
  C. Performance Evaluation
  D. Coarse-Grained Speculative Multithreading
IV. Related Work
V. Current Work
VI. Conclusions

I. Introduction
• Numeric Application Programs
  • Research has primarily focused on these
  • 15% of the overall market for large systems
• Non-Numeric Application Programs
  • Operate on irregularly structured data
  • On-line transaction processing, file serving, data mining, web serving, etc.
• The Speculative Multithreading Execution Model
  • Combines compiler-directed thread-level speculation on control dependences with run-time verification of data dependences.
  • Fine-grained parallelism.

Reduce Texec = n x CPI x Tclock (n: # of instructions; CPI: clocks/instruction)
• Reduce the processor’s cycle time, Tclock
  • By scaling a given processor design to a faster technology
  • Obstacle: not all signal propagation delays on a chip scale down at the same rate for any particular technology.
• Make each individual instruction in the processor’s ISA do more work per cycle
  • Complex, requiring more levels of logic, leading to an increase in Tclock
  • Decreasing n by altering the ISA may not improve performance
  • The origin of this tradeoff is RISC vs. CISC
• Reduce CPI
  • Needs more work to be done in each clock period
  • Then, n increases, leading to an increase in Tclock
• Alternative: do not change each individual instruction
  • Increase the # of instructions that are executing simultaneously
  • Parallel execution: Texec = n x Tclock / IPC
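The two forms of the execution-time formula above can be checked with a small worked example. This is only an illustrative sketch: the function names and the sample numbers are assumptions, not values from the talk. It shows that the two forms agree when IPC = 1/CPI, and that raising IPC at a fixed Tclock shrinks Texec proportionally.

```c
#include <assert.h>

/* First form from the slide: Texec = n x CPI x Tclock.
   n is the instruction count, tclock the cycle time (any time unit). */
double texec_cpi(double n, double cpi, double tclock)
{
    return n * cpi * tclock;
}

/* Parallel-execution form from the slide: Texec = n x Tclock / IPC.
   IPC (instructions per cycle) is the reciprocal of CPI. */
double texec_ipc(double n, double ipc, double tclock)
{
    assert(ipc > 0.0);
    return n * tclock / ipc;
}
```

For instance, with n = 1000 instructions and Tclock = 1 time unit, CPI = 2 and IPC = 0.5 both give Texec = 2000, while IPC = 4 gives Texec = 250.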

Texec = n x Tclock / IPC
• IPC can increase much faster than Tclock
• Fortran programs have been heavily parallelized
• For C, C++, and Java programs, compilers must make conservative assumptions about dependences that can only be resolved at run time, which leads to sequential code.
• A loop must have a sufficient number of iterations to gain from parallelization.
• Example: the m88ksim program of the SPECint95 benchmark suite contains a while-do loop that is hard to parallelize with conventional techniques.

Example Program

    while (funct_units[i].class != ILLEGAL_CLASS) {
        if (f->class == funct_units[i].class) {
            if (minclk > funct_units[i].busy) {
                minclk = funct_units[i].busy;
                j = i;
                if (minclk == 0)
                    break;
            }
        }
        i++;
    }

• The variable “minclk” introduces a potential dependence between iterations.
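To experiment with the loop above outside of m88ksim, it can be wrapped in a self-contained function. The sketch below is hypothetical scaffolding: the struct layout, the sentinel value of ILLEGAL_CLASS, the initial value of minclk, and the function name are all assumptions, and the field is named unit_class rather than class so the code also compiles as C++. Only the loop body follows the slide.

```c
#include <assert.h>
#include <limits.h>

#define ILLEGAL_CLASS (-1)      /* assumed sentinel marking the end of the table */

struct funct_unit {             /* assumed layout: only the fields the loop uses */
    int unit_class;
    int busy;
};

/* Sequential reference version of the search loop.
   Returns the index of the least-busy unit of wanted_class, or -1 if none. */
int find_least_busy(const struct funct_unit *units, int wanted_class)
{
    int minclk = INT_MAX;       /* assumed initial value */
    int j = -1;
    int i = 0;

    assert(units != 0);
    while (units[i].unit_class != ILLEGAL_CLASS) {
        if (units[i].unit_class == wanted_class) {
            /* minclk is read here and possibly written below, so each
               iteration may depend on the one before it. */
            if (minclk > units[i].busy) {
                minclk = units[i].busy;
                j = i;
                if (minclk == 0)
                    break;      /* an idle unit cannot be beaten */
            }
        }
        i++;
    }
    return j;
}
```

Whether an iteration actually writes minclk depends on the data, which is why the dependence cannot be resolved at compile time.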

II. Speculative Multithreading
A. Maybe Dependences
• “Maybe dependences” are dependences detected by the compiler for which it cannot conclusively determine whether the dependence will actually exist at run time.
• With our speculative parallelization model, the compiler may assume that maybe dependences will not actually occur at run time.
• Given the appropriate hardware support, this parallel execution model allows the system to speculate on control dependences and to perform run-time checks on data dependences.
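A small example may help make the idea of a maybe dependence concrete. In the sketch below (all names are hypothetical), the compiler cannot know the contents of idx, so it cannot tell whether one iteration writes a location that a later iteration reads. The second function is a software stand-in for the kind of run-time check the talk assigns to hardware; it only looks for read-after-write dependences and is not the mechanism used by the processor.

```c
#include <assert.h>
#include <stddef.h>

/* Iteration i reads a[i] and writes a[idx[i]].
   Whether iterations depend on each other is unknown until idx is known. */
void scatter_increment(int *a, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[idx[i]] = a[i] + 1;
}

/* Illustrative run-time check: returns 1 if some iteration i writes a
   location that a later iteration reads (idx[i] > i), otherwise 0. */
int has_flow_dependence(const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(idx[i] < n);     /* indices must stay inside the array */
        if (idx[i] > i)
            return 1;
    }
    return 0;
}
```

With idx = {0, 1, 2} every iteration touches only its own element, so the speculation that no dependence exists would succeed; with idx = {1, 0, 2} iteration 0 writes the element that iteration 1 reads, so it would fail.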

B. Programming Model (speculative multithreading execution model)
[Figure: thread pipelining across Thread i, Thread i+1, and Thread i+2. Each thread passes through four stages: CONTINUATION (values needed to fork the next thread), TARGET STORE (forward addresses of maybe dependences), COMPUTATION (forward addresses and computed data as needed), and WRITE-BACK. Fork arrows start each successor thread; Sync arrows connect the stages of neighboring threads.]
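The overlap that thread pipelining buys can be estimated with a toy timing model. The sketch below rests on simplifying assumptions that are not taken from the talk: thread t+1 is forked as soon as thread t finishes its continuation stage, write-back stages complete in thread order, every other synchronization is ignored, and the stage durations are made-up inputs. The names are hypothetical.

```c
#include <assert.h>

enum stage { CONTINUATION, TARGET_STORE, COMPUTATION, WRITE_BACK, NUM_STAGES };

/* Fills finish[t][s] with the finish time of stage s in thread t.
   dur[s] is the assumed duration of stage s, identical for all threads. */
void schedule(int nthreads, const int dur[NUM_STAGES],
              int finish[][NUM_STAGES])
{
    assert(nthreads > 0);
    for (int t = 0; t < nthreads; t++) {
        /* A thread starts when its predecessor finishes continuation. */
        int start = (t == 0) ? 0 : finish[t - 1][CONTINUATION];
        for (int s = 0; s < NUM_STAGES; s++) {
            /* Write-back waits for the predecessor's write-back. */
            if (s == WRITE_BACK && t > 0 &&
                finish[t - 1][WRITE_BACK] > start)
                start = finish[t - 1][WRITE_BACK];
            finish[t][s] = start + dur[s];
            start = finish[t][s];
        }
    }
}
```

With durations {1, 1, 4, 1} and three threads, the last write-back finishes at time 9 under this model, compared with 21 if the three threads ran back to back.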

Reminder: Example Program

    while (funct_units[i].class != ILLEGAL_CLASS) {
        if (f->class == funct_units[i].class) {
            if (minclk > funct_units[i].busy) {
                minclk = funct_units[i].busy;
                j = i;
                if (minclk == 0)
                    break;
            }
        }
        i++;
    }

• The variable “minclk” introduces a potential dependence between iterations.

Re-written Example Program

    /* Continuation Stage */
    L1: i_1 = i;
        store_ts(&i, i_1 + 1);
        fork L1;

    /* Target-Store-Address-Generation Stage */
        allocate_ts(&minclk);
        wait_tsag_done;
        release_tsag_done;

    /* Computation Stage */
        if (funct_units[i_1].class == ILLEGAL_CLASS) {
            abort_future;               /* to check end of the loop */
            i = i_1;
            goto L2;
        }
        if (f->class == funct_units[i_1].class) {
            if (minclk > funct_units[i_1].busy) {
                store_ts(&minclk, funct_units[i_1].busy);
                j = i_1;
                /* if minclk is 0, break to terminate search */
                if (minclk == 0) {
                    abort_future;
                    i = i_1;
                    goto L2;
                }
            } else
                release_ts(&minclk);
        } else
            release_ts(&minclk);
        stop;

    /* Write-back stage */
    /* -> performed automatically after stop */
    /* END OF THREAD PIPELINING */
    L2: /* continue */

III. The Super-Threaded Processor
A. Architecture: Unidirectional Ring w/ 4 Thread Processing Units
[Figure: a shared instruction cache at the top and a shared data cache at the bottom. Between them are four thread processing units, each consisting of a superscalar core with its own registers, PC, execution unit, communication unit, and dependence buffer.]
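One consequence of a unidirectional ring is that a value can only travel toward successor units, so the cost of reaching a given unit depends on direction. The sketch below illustrates this with hypothetical helper names; the unit count of 4 comes from the slide, everything else is an assumption made for illustration.

```c
#include <assert.h>

#define NUM_TPUS 4              /* four thread processing units, per the slide */

/* Next unit on the ring; the last unit wraps around to unit 0. */
int ring_successor(int tpu)
{
    assert(tpu >= 0 && tpu < NUM_TPUS);
    return (tpu + 1) % NUM_TPUS;
}

/* Number of hops needed to forward a value from src to dst when
   traffic may only flow in the successor direction. */
int ring_hops(int src, int dst)
{
    assert(src >= 0 && src < NUM_TPUS);
    assert(dst >= 0 && dst < NUM_TPUS);
    return (dst - src + NUM_TPUS) % NUM_TPUS;
}
```

Forwarding from unit 0 to unit 1 takes one hop, while reaching unit 0 from unit 1 takes three, which fits an execution model where each thread forwards values to the thread it forked.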

B. Compiler Support
• Needed for:
  • Partitioning threads into appropriate stages
  • Reordering instructions to maximize the amount of overlap among executing threads
• A compiler infrastructure is being developed
• The initial version of this compiler consists of:
  • A modified version of the SUIF compiler as the front end
  • An enhanced version of the GCC compiler as the back end
• The new compiler will have new techniques for pointer alias analysis, data dependence analysis, integrated interprocedural dataflow analysis, and sophisticated loop parallelization.

C. Performance Evaluation
• SIMCA: SImulator for Multi-threaded Computer Architecture
  • Based on the SimpleScalar simulator
  • The fork instruction is an OS-level fork operation
  • Effectively creates a new copy of the SimpleScalar simulator to use as the simulated thread processing unit for the newly forked thread
• Benchmarks:
  • 3 SPECint95: compress, ijpeg, m88ksim
  • 3 SPECfp92: alvinn, hydro2d, ear
  • 2 UNIX utilities: wc, cmp
• # of instructions issued/cycle = 32 (held constant)
• # of thread units varied
• 16 int and fp ALUs

D. Coarse-Grained Speculative Multithreading
• Developed an execution-driven simulator to evaluate this architecture
  • The SimpleScalar simulator is replaced with native machine code to execute the computation stage
  • This change eliminates the ability to obtain clock-cycle-accurate performance estimates of the superthreaded processor
  • However, it allows quick simulation.
• Developed special C language library functions to execute programs on standard, off-the-shelf multiprocessor systems.
  • An 8-processor SGI shared-memory multiprocessor is used

IV. Related Work
• HEP, Horizon, Tera: machines without a data cache
  • Only tolerate long memory delays
• Alewife: with a data cache
• XIMD, Elementary Multithreading, M-Machine, Multiscalar
  • Support synchronization and communication between threads
• Multiscalar, SPSM
  • Allow speculative execution of threads
  • Multiscalar further allows speculation on data dependences
[Figure: level of compiler support required, on a scale from None to Complete, with the systems listed in the order ATLAS, Multiscalar, Superthreaded, Raw.]

V. Current Work & Problems
• Currently, the source code is compiled with -O0, i.e. no optimization
• Trying to compile the source code with -O3, i.e. optimization turned on
  • The compiler moves the computation-stage code elsewhere, and the resulting program either has a crashing thread or runs forever.
  • Side effects are the suspected cause
  • The special instructions should be described to the compiler properly
  • If so, the md (machine description) of the GCC compiler has to be modified

VI. Conclusions
• Summarized recent work on the Super-threaded Architecture.
• This architecture allows the compiler to make the most aggressive assumptions possible about maybe dependences, which are dependences that cannot be determined completely at compile time using only static information.
• This architecture also supports speculation on control dependences by allowing threads to be initiated before it is known whether they should actually be executed.
• Extended the superthreaded execution model to conventional off-the-shelf shared-memory multiprocessors

Additional Reference:
• The Super-threaded Processor Architecture, Jenn-Yuan Tsai, Jian Huang, Christoffer Amlo, David J. Lilja, Pen-Chung Yew, IEEE Transactions on Computers, Special Issue on Multithreaded Architectures and Systems, September 1999
  • More example programs
  • More details about thread pipelining
  • More memory buffer details (i.e. memory buffer bandwidth)
    • 2 memory buffer ports per thread processing unit provide the best price/performance ratio.
  • More details about communication bandwidth and latency
    • For BW: 2 requests/cycle is sufficient for a 4-TPE, 8-issue STA processor
  • More details about data cache bandwidth and delays
    • For BW: a 4-way interleaved cache with 2 read ports and 1 write port provides sufficient bandwidth.

THANK YOU
