Carmelo Acosta Sriram Vajapeyam Alex Ramirez Mateo Valero UPC-Barcelona

CDE: A Compiler-driven, Dependence-centric, Eager-executing architecture for the billion transistor era
Carmelo Acosta, Sriram Vajapeyam, Alex Ramirez, Mateo Valero
UPC-Barcelona

Motivation
• Entering the billion transistor era
  • How to use the available hardware to increase performance
  • Keep cost and complexity under control
• Obtain a true general-purpose architecture
  • Do not limit high performance to a single application class
• Clustered architectures seem the way to go
  • Avoid excessive dependence on the compiler
  • Avoid impossible communication delays
  • Avoid complex interconnection networks
• Hierarchical program partitioning
  • Both in the compiler and the hardware

Outline
• Motivation
• The CDE architecture
• Hierarchical program partitioning
  • Epochs
  • Selective Eager Execution
  • Dependence clusters
• Hierarchical architecture
  • Epoch Processing Core (EPC)
  • Processing Elements (PE)
• Program execution
• Related work
• Summary and conclusions

The CDE architecture
• The way CDE obtains performance
  • Rely on the compiler for code partitioning
    • Hierarchical program view
    • Matching hierarchical hardware
  • Use both run-time and compile-time speculation to keep the transistors occupied
• How to achieve it
  • The Dependence Cluster (DC) is the basic execution unit
    • Larger than one instruction
    • Larger virtual instruction window
    • Reduces communication
    • Amortizes speculation costs
      • Commit, squash, and redo an entire DC

Hierarchical program partitioning
• Horizontal control epochs
  • Large code segments
    • Loops, functions, hyperblock-like
  • Limit the scope of compiler optimizations
    • Trace scheduling
    • Selective eager execution
• Vertical dependence clusters
  • Chains of dependent instructions
  • Localize communications

Epochs

a) Source code

    NODE *xlygetvalue(NODE *sym)
    {
        register NODE *fp,*ep;

        /* check the environment list */
        for (fp = xlenv; fp; fp = cdr(fp))
            for (ep = car(fp); ep; ep = cdr(ep))
                if (sym == car(car(ep)))
                    return (cdr(car(ep)));

        /* return the global value */
        return (getvalue(sym));
    }

b) SuperScalar code

    [ 0] 0x12001e09c: ldq t0, -21056(gp)
    [ 1] 0x12001e0a0: beq t0, 0x12001e0e4
    [ 2] 0x12001e0a4: ldq t2, 8(t0)
    [ 3] 0x12001e0a8: beq t2, 0x12001e0dc
    [ 4] 0x12001e0ac: ldq t4, 8(t2)
    [ 5] 0x12001e0b0: ldq t4, 8(t4)
    [ 6] 0x12001e0b4: xor a0, t4, t4
    [ 7] 0x12001e0b8: beq t4, 0x12001e0ec
    [ 8] 0x12001e0bc: ldq t2, 16(t2)
    [ 9] 0x12001e0c0: beq t2, 0x12001e0dc
    [10] 0x12001e0c4: ldq t6, 8(t2)
    [11] 0x12001e0c8: ldq t6, 8(t6)
    [12] 0x12001e0cc: xor a0, t6, t6
    [13] 0x12001e0d0: beq t6, 0x12001e0ec
    [14] 0x12001e0d4: ldq t2, 16(t2)
    [15] 0x12001e0d8: bne t2, 0x12001e0ac
    [16] 0x12001e0dc: ldq t0, 16(t0)
    [17] 0x12001e0e0: bne t0, 0x12001e0a4
    [18] 0x12001e0e4: ldq v0, 16(a0)
    [19] 0x12001e0e8: ret zero, (ra), 1
    [20] 0x12001e0ec: ldq t2, 8(t2)
    [21] 0x12001e0f0: ldq v0, 16(t2)
    [22] 0x12001e0f4: ret zero, (ra), 1

c) Control Epoch

[Figure: the instructions of listing b) arranged into dependence clusters DC#0 through DC#10.]

Eager execution
• Traditional trace-scheduling
  • Bet on one direction
  • Optimize frequent case
  • Generate fix-up code for infrequent case
• Eager-execution
  • Remove the branch
  • Optimize each separate case
  • Squash the incorrect trace
[Figure: a hard-to-predict branch handled two ways. Left: optimized trace + fix-up code. Right: remove branch and execute both paths.]
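
The eager-execution bullets can be illustrated in software. The following is a minimal Python sketch, not the CDE mechanism itself: `eager_execute`, `found`, and `keep_searching` are hypothetical names, and copying a dictionary merely stands in for whatever speculative state the hardware would hold. Both arms of the removed branch run on private copies, and the copy produced by the wrong arm is discarded once the condition is known.

```python
def eager_execute(state, condition, path_if_true, path_if_false):
    """Run both arms of a removed branch, then keep only the correct one."""
    # Each arm works on its own copy, so neither can disturb committed state.
    spec_true = path_if_true(dict(state))
    spec_false = path_if_false(dict(state))
    # The condition resolves late; the losing copy is squashed (discarded).
    return spec_true if condition(state) else spec_false


def found(s):
    s["v0"] = "value of sym"   # stands in for the "symbol found" exit
    return s


def keep_searching(s):
    s["ep"] = "next ep"        # stands in for advancing to the next list cell
    return s


before = {"sym": 7, "key": 9}
after = eager_execute(before, lambda s: s["sym"] == s["key"],
                      found, keep_searching)
```

Here the comparison fails, so `after` carries only the effect of `keep_searching`, and `before` is left untouched. The price of skipping the prediction is that the work done by `found` is thrown away, which is why the slides call the scheme selective.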

Dependence clusters
• Essentially a set of dependent instructions
• May have dependencies with other DCs in the same Epoch
• The compiler balances
  • Inter-DC dependencies
    • Localize communication within a DC
  • ILP
    • Place independent instructions in a different DC
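
To make the trade-off concrete, here is a small Python sketch of one possible way to form dependence chains. It is an assumption for illustration only: the slides do not describe the compiler's actual partitioning algorithm, and the function name `cluster` and the greedy rule (append an instruction to a DC only if it consumes the value produced by that DC's last instruction) are hypothetical. It tracks register read-after-write dependences only and ignores memory and control dependences. The input is a straight-line fragment, instructions [2] to [8], of the Alpha listing from the Epochs slide.

```python
def cluster(instrs):
    """Greedily group instructions into chains of register RAW dependences.

    instrs: list of (index, destination register or None, source registers).
    Returns (clusters, inter_dc), where inter_dc holds the
    (producer, consumer) pairs whose value must cross between DCs.
    """
    clusters = []       # each DC is a list of instruction indices
    home = {}           # instruction index -> id of the DC that holds it
    last_writer = {}    # register -> index of its most recent producer
    inter_dc = set()
    for idx, dst, srcs in instrs:
        producers = [last_writer[r] for r in srcs if r in last_writer]
        # Extend a chain only if this instruction consumes its current tail.
        target = None
        for p in producers:
            if clusters[home[p]][-1] == p:
                target = home[p]
                break
        if target is None:          # no chain to extend: open a new DC
            target = len(clusters)
            clusters.append([])
        clusters[target].append(idx)
        home[idx] = target
        # Every producer living in another DC implies a communication.
        for p in producers:
            if home[p] != target:
                inter_dc.add((p, idx))
        if dst is not None:
            last_writer[dst] = idx
    return clusters, inter_dc


# Instructions [2]..[8] of the listing; branches have no destination.
fragment = [
    (2, "t2", ("t0",)),        # ldq t2, 8(t0)
    (3, None, ("t2",)),        # beq t2, ...
    (4, "t4", ("t2",)),        # ldq t4, 8(t2)
    (5, "t4", ("t4",)),        # ldq t4, 8(t4)
    (6, "t4", ("a0", "t4")),   # xor a0, t4, t4
    (7, None, ("t4",)),        # beq t4, ...
    (8, "t2", ("t2",)),        # ldq t2, 16(t2)
]
dcs, comms = cluster(fragment)
```

On this fragment the sketch yields three clusters, `[2, 3]`, `[4, 5, 6, 7]` and `[8]`, with two values crossing between DCs (from [2] to [4] and from [2] to [8]). Merging clusters would remove those communications at the cost of serializing independent work, which is the balance the compiler has to strike.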

Hierarchical architecture partitioning
• Epoch Processing Core
  • Quickly sequences through control epochs
  • Epoch level speculation
• Mesh of MIPS R2000-like Processing Elements
  • Execute individual Dependence Clusters
[Figure: block diagram with labels PE and EPC.]

Epoch Processing Core (EPC)
• Fetches and processes epochs one at a time
• Speculatively branches to the next epoch
  • Epoch level sequencing
  • Epoch level speculation
• Renames live-ins and live-outs of each epoch
  • Out of order epoch execution
• Dispatches the DCs to the PE grid
  • Coupled with the required data about the epoch
    • Renaming of live-ins and live-outs
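
Renaming at epoch granularity can be pictured with a short Python sketch. This is an assumption layered on the slide, which does not give the mechanism: `EpochRenamer`, its `rename` method, and the integer tags are all hypothetical. The idea shown is ordinary register renaming applied once per epoch rather than once per instruction: live-ins read the tag that currently holds each register, and every live-out receives a fresh tag.

```python
import itertools


class EpochRenamer:
    """Hypothetical per-epoch renamer: one tag per live-in or live-out."""

    def __init__(self):
        self._fresh = itertools.count()   # source of unused tags
        self.current = {}                 # architectural register -> tag

    def rename(self, live_in, live_out):
        ins = {}
        for reg in live_in:
            # A register never written by an earlier epoch gets an initial tag.
            if reg not in self.current:
                self.current[reg] = next(self._fresh)
            ins[reg] = self.current[reg]
        outs = {}
        for reg in live_out:
            # A fresh tag per live-out keeps epochs from overwriting each other.
            self.current[reg] = next(self._fresh)
            outs[reg] = self.current[reg]
        return ins, outs


renamer = EpochRenamer()
ins_a, outs_a = renamer.rename(live_in=["t0"], live_out=["t2"])
ins_b, outs_b = renamer.rename(live_in=["t0", "t2"], live_out=["t2"])
```

The second epoch reads `t2` through the tag the first epoch writes, yet writes its own `t2` under a different tag. Because only true dependences remain between the two, a later epoch can start before an earlier one finishes, which is what out-of-order epoch execution requires.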

Processing Elements (PE)
• MIPS R2000-like
  • In-order
  • Single-issue
  • Short pipeline
• Local register file
  • Intra-DC dependencies
• Communications manager
  • Inter-DC dependencies
[Figure: PE block diagram with a communications manager (Comms.), a five-stage pipeline (F D E M W), and a register file.]

Program execution (Cycle 0)
• The EPC fetches, processes, and renames the epoch, then starts its execution.
[Figure: the control epoch with dependence clusters DC#0 through DC#10.]

Program execution (Cycle 1)
• Initial EPC-to-PE communication delay.

Program execution (Cycle 2)
• DCs #0, #7 and #8 start execution on their respective PEs.

Program execution (Cycle 3)
• Each PE continues its execution as statically scheduled by the compiler.

Program execution (Cycle 4)
• DCs #1 and #2 start execution on their respective PEs.

Program execution (Cycle 5)
• Instruction [0] of DC#0, in its M stage, produces register t0. The value is bypassed to the next instruction ([1], in its EX stage) and sent to DCs #1 and #2.

Program execution (Cycle 6)
• DCs #1’ and #2’ (the next instances of DCs #1 and #2) start execution. Register t0 arrives at DCs #1 and #2.
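
The stage labels quoted for cycle 5 follow from the in-order, single-issue pipeline of each PE. The Python sketch below reproduces them under stated assumptions: the five stages F, D, E, M, W from the PE slide, one instruction entering the pipeline per cycle, and no stalls. The helper `stage_at` is a hypothetical name and not part of the presentation.

```python
STAGES = ("F", "D", "E", "M", "W")   # pipeline stages from the PE slide


def stage_at(cycle, dc_start_cycle, position):
    """Stage occupied at `cycle` by the instruction at `position` in its DC.

    Assumes an ideal pipeline: the DC's first instruction is fetched at
    dc_start_cycle and each later one follows a cycle behind, with no stalls.
    Returns None if the instruction is not in the pipeline at that cycle.
    """
    offset = cycle - dc_start_cycle - position
    if 0 <= offset < len(STAGES):
        return STAGES[offset]
    return None


# DC#0 starts at cycle 2 and holds instruction [0] followed by [1].
first = stage_at(cycle=5, dc_start_cycle=2, position=0)
second = stage_at(cycle=5, dc_start_cycle=2, position=1)
```

With DC#0 starting at cycle 2, `first` is `"M"` and `second` is `"E"`, matching the "0-M" and "1-EX" labels on the cycle 5 slide.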

Related Work
• RAW
  • Not hierarchical HW
  • Exploits basic block parallelism
• GPA
  • Grid of ALUs
  • High instruction fetch requirements
  • Exploits hyperblock parallelism
• Multiscalar
  • Horizontal but not vertical code partitioning
  • SuperScalar branch treatment
• ILDP
  • Hardware-only approach
  • Dynamic steering of dependent instructions to PEs
  • Depends on an accumulator-based ISA
• Trace Processors
  • Hardware-only approach
  • Dynamic paths are captured in traces

Implementation considerations
• Low complexity architecture based on regularity
  • Epoch Processing Core
  • Grid of PEs
  • Communication network
• High performance due to far-ahead speculation
  • Large virtual instruction window
• Strong dependence on the compiler
  • Code partitioning, DC communication
  • Epochs limit the scope of optimizations

Solving multiple problems at once
• CDE can also behave in a polymorphic way
• Exploiting ILP
  • Far-ahead speculation through Epoch speculation
• Exploiting TLP
  • Multi-threaded Epoch Processing Core
  • Distribute the PEs among all running threads
• Exploiting DLP
  • No need to re-dispatch a DC to the PEs
  • Simply re-start the DC with new data

Summary and conclusions
• Hierarchical partitioning
  • Epoch speculation keeps transistors occupied
  • Eager execution works around difficult branches
  • DC helps to keep complexity at bay
    • Amortizes cost of speculation (squash, commit)
• Scalable performance with more PEs
  • Increasing wire delays may limit scalability
  • Rely on the compiler to minimize communication
• Design in its initial stages
  • Lots of unanswered questions
    • Especially regarding the memory hierarchy
  • Feedback is welcome!
