This lecture discusses systolic computing, crossbar topology, logic emulation, and virtual wires in reconfigurable computing. It covers common systolic structures, algorithmic mapping, and examples of mapping computations into hardware structures.
CprE / ComS 583 – Reconfigurable Computing
Prof. Joseph Zambreno, Department of Electrical and Computer Engineering, Iowa State University
Lecture #12 – Systolic Computing
Recap – Multi-FPGA Systems
• Crossbar topology (routing devices A–D connect FPGAs W–Z):
  • Devices A–D are routing only
  • Gives predictable performance
  • Potential waste of resources for near-neighbor connections
Recap – Logic Emulation
• Emulation takes a sizable amount of resources
• Compilation time can be large due to FPGA compiles
Recap – Virtual Wires
• Overcome pin limitations by multiplexing pins and signals
• Schedule when communication will take place
Outline
• Recap
• Introduction and Motivation
• Common Systolic Structures
• Algorithmic Mapping
• Mapping Examples
  • Finite impulse response
  • Matrix-vector product
  • Banded matrix-vector product
  • Banded matrix multiplication
Systolic Computing

sys·to·le (sǐs’tə-lē) n. – the rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole [Greek systolē, from systellein, to contract, from syn- + stellein, to send] – sys·tol·ic (sǐs-tõl’ǐk) adj.

“Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory.” [Kung, 1982]
Systolic Architectures
• Goal – a general methodology for mapping computations into hardware (spatial computing) structures
• Composition:
  • Simple compute cells (e.g. add, sub, max, min)
  • Regular interconnect pattern
  • Pipelined communication between cells
  • I/O at boundaries
Motivation
• Effectively utilize VLSI
• Reduce the “Von Neumann bottleneck”
• Target compute-intensive applications
• Reduce design cost
  • Simplicity
  • Regularity
• Exploit concurrency
• Local communication
  • Short wires (small delay, less area)
  • Scalable
Why Study?
• Original motivation – a specialized accelerator for a given application
• Model and goals are a close match to reconfigurable computing:
  • Target algorithms match
  • Well-developed theory, techniques, and solutions
• One big difference – Kung’s approach targeted custom silicon (not a reconfigurable fabric)
  • Compute elements needed to be more general
Common Systolic Structures
• One-dimensional linear array (cells F1, F2, …, FN)
• Two-dimensional mesh (cells F1,1 through FM,N)
Hexagonal Array
• Each cell communicates with its six nearest neighbors
• Squared-up representation: the same array drawn on a rectangular grid
Binary Tree
• H-Tree representation: the same tree laid out recursively in the plane
Mapping Approach
• Allocate PEs
• Schedule computation
  • Schedule PEs
  • Schedule data flow
• Optimize
• Available transformations:
  • Preload repeated values
  • Replace feedback loops with registers
  • Internalize data flow
  • Broadcast common input
Example – Finite Impulse Response
• A Finite Impulse Response (FIR) filter is a type of digital filter
• Finite – response to an impulse eventually settles to zero
• Requires no feedback

for (i=1; i<=n; i++)
  for (j=1; j<=k; j++)
    y[i] += w[j] * x[i+j-1];
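The loop nest above translates directly into a sequential C function. This is a minimal 0-based sketch (function and parameter names are ours), useful as a reference against which the systolic mappings can be checked:

```c
#include <assert.h>

/* Sequential FIR, a direct 0-based translation of the slide's loop nest:
   y[i] = sum over j of w[j] * x[i+j].  x must hold n+k-1 samples. */
void fir(int n, int k, const int *w, const int *x, int *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0;                  /* accumulate one output at a time */
        for (int j = 0; j < k; j++)
            y[i] += w[j] * x[i + j];
    }
}
```

Note the memory behavior that motivates the systolic versions: each output re-reads k weights and k samples, so the sequential form performs O(k) memory accesses per output.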
FIR Attempt #1
• Parallelize the outer loop: one PE per output; PE i receives wj and xi+j-1 each step and accumulates yi

for (i=1; i<=n; i++) in parallel
  for (j=1; j<=k; j++) sequential
    y[i] += w[j] * x[i+j-1];
FIR Attempt #1 (cont.)
• Broadcast common inputs (the same wj goes to every PE; the x stream is skewed across PEs)

for (i=1; i<=n; i++) in parallel
  for (j=1; j<=k; j++) sequential
    y[i] += w[j] * x[i+j-1];
FIR Attempt #1 (cont.)
• Retime to eliminate broadcast (x values now travel through the array, one PE per cycle)

for (i=1; i<=n; i++) in parallel
  for (j=1; j<=k; j++) sequential
    y[i] += w[j] * x[i+j-1];
FIR Attempt #1 (cont.)
• Broadcast common values

for (i=1; i<=n; i++) in parallel
  for (j=1; j<=k; j++) sequential
    y[i] += w[j] * x[i+j-1];
FIR Attempt #2
• Parallelize the inner loop: one PE per tap, all contributing to the same output yi

for (i=1; i<=n; i++) sequential
  for (j=1; j<=k; j++) in parallel
    y[i] += w[j] * x[i+j-1];
FIR Attempt #2 (cont.)
• Internalize data flow (partial sums for yi pass from PE to PE instead of converging on one accumulator)

for (i=1; i<=n; i++) sequential
  for (j=1; j<=k; j++) in parallel
    y[i] += w[j] * x[i+j-1];
FIR Attempt #2 (cont.)
• Allocation schedule (x values and partial y results stream through the array in opposite directions)

for (i=1; i<=n; i++) sequential
  for (j=1; j<=k; j++) in parallel
    y[i] += w[j] * x[i+j-1];
FIR Attempt #2 (cont.)
• Preload repeated values (each weight wj is loaded into its PE once instead of streaming in)

for (i=1; i<=n; i++) sequential
  for (j=1; j<=k; j++) in parallel
    y[i] += w[j] * x[i+j-1];
FIR Attempt #2 (cont.)
• Broadcast common values (the current sample xi goes to all PEs at once)

for (i=1; i<=n; i++) sequential
  for (j=1; j<=k; j++) in parallel
    y[i] += w[j] * x[i+j-1];
FIR Attempt #2 (cont.)
• Retime to eliminate broadcast

for (i=1; i<=n; i++) sequential
  for (j=1; j<=k; j++) in parallel
    y[i] += w[j] * x[i+j-1];
FIR Summary
• Systolic (weights w1…wk preloaded, sample xi streamed through a multiply-add chain producing yi):
  • Memory bandwidth per output – 2
  • O(1) cycles per output
  • O(k) hardware
• Sequential:
  • Memory bandwidth per output – 2k+1
  • O(k) cycles per output
  • O(1) hardware
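The final design (weights preloaded, input broadcast, partial sums pipelined through registers) can be simulated cycle by cycle to confirm it matches the loop nest. This is a sketch in transposed direct form; the register naming, tap cap, and warm-up handling are our own assumptions, not from the slides:

```c
#define MAXK 16  /* assumed upper bound on the number of taps */

/* Cycle-by-cycle simulation of the systolic FIR in the summary figure:
   each PE holds a preloaded weight, the current sample is broadcast to
   all multipliers, and partial sums advance one adder per cycle.
   x holds n+k-1 samples; after k-1 warm-up cycles, one y per cycle. */
void systolic_fir(int n, int k, const int *w, const int *x, int *y) {
    int r[MAXK] = {0};                     /* pipeline registers between adders */
    for (int t = 0; t < n + k - 1; t++) {
        int xin = x[t];                    /* broadcast sample */
        int out = w[k - 1] * xin + r[0];   /* last tap feeds the output adder */
        for (int j = 0; j < k - 1; j++)    /* register update; reads old r[j+1] */
            r[j] = w[k - 2 - j] * xin + ((j + 2 < k) ? r[j + 1] : 0);
        if (t >= k - 1)                    /* first valid output after warm-up */
            y[t - k + 1] = out;
    }
}
```

The simulation makes the bandwidth claim on the slide concrete: every cycle consumes one sample and (after warm-up) emits one output, i.e. two memory accesses per result.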
Example – Matrix-Vector Product

for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    y[i] += a[i][j] * x[j];
Matrix-Vector Product (cont.)
• The aij values enter skewed, one element per PE per cycle:
  t = 1: a11
  t = 2: a21, a12
  t = 3: a31, a22, a13
  t = 4: a41, a32, a23, a14
• x1, x2, x3, x4, …, xn are held at the PEs; outputs emerge in sequence: y1 at t = n, y2 at t = n+1, y3 at t = n+2, y4 at t = n+3
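The staggered schedule above can be checked with a small simulation of the linear array: x[j] is preloaded into PE j, a[i][j] reaches PE j at cycle i+j (0-based), and the partial sum for row i hops one PE per cycle. A sketch under our own 0-based indexing and an assumed fixed maximum dimension:

```c
#define MAXN 8  /* assumed maximum array dimension */

/* Linear-array matrix-vector product: the partial sum for y[i] moves
   left to right, picking up a[i][j] * x[j] at PE j on cycle t = i + j. */
void systolic_mv(int n, int a[][MAXN], const int *x, int *y) {
    int acc[MAXN] = {0};                   /* partial sum held at each PE */
    for (int t = 0; t < 2 * n - 1; t++) {
        for (int j = n - 1; j >= 0; j--) { /* right-to-left: read old acc[j-1] */
            int i = t - j;                 /* row whose element reaches PE j now */
            if (i < 0 || i >= n) continue;
            int in = (j == 0) ? 0 : acc[j - 1];
            acc[j] = in + a[i][j] * x[j];
            if (j == n - 1)                /* y[i] leaves the array at t = i+n-1 */
                y[i] = acc[j];
        }
    }
}
```

Iterating PEs right-to-left within a cycle models the simultaneous register update: each PE reads the neighbor value from the previous cycle before it is overwritten.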
Banded Matrix-Vector Product
• Matrix a is banded: nonzero entries lie within p diagonals on one side of the main diagonal and q on the other, a band of width p+q-1

for (i=1; i<=n; i++)
  for (j=1; j<=p+q-1; j++)
    y[i] += a[i][i+j-q] * x[i+j-q];
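For a band of width p+q-1, only columns near the diagonal contribute to each output. A 0-based sketch (dense storage for simplicity; the bounds clipping and the column expression c = i + j - (q-1) are our reading of the slide's intent):

```c
#define MAXN 8  /* assumed maximum matrix dimension */

/* Banded matrix-vector product: for row i, only the band columns
   c = i + j - (q-1), j = 0 .. p+q-2, are touched; columns falling
   outside the matrix are skipped where the band is clipped at the edges. */
void banded_mv(int n, int p, int q, int a[][MAXN], const int *x, int *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0;
        for (int j = 0; j < p + q - 1; j++) {
            int c = i + j - (q - 1);       /* column index within the band */
            if (c < 0 || c >= n) continue; /* band clipped at matrix edges */
            y[i] += a[i][c] * x[c];
        }
    }
}
```

The inner loop runs p+q-1 times regardless of n, which is why the systolic array for this problem needs only p+q-1 PEs.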
Banded Matrix-Vector Product (cont.)
• Each PE has ports ain, yin/yout, and xin/xout
• The band diagonals enter skewed, one new element per cycle:
  t = 2: a11
  t = 3: a12, a21
  t = 4: a22, a31
  t = 5: a23, a32
• The x stream (x1 at t = 1, x2 at t = 3, x3 at t = 5) and the partial y results move through the array in opposite directions, with new values entering on alternate cycles
Banded Matrix Multiplication
Banded Matrix Multiplication (cont.)
• Hexagonal array of cells F with input ports ain, bin, cin and output ports aout, bout, cout; each cell computes cout = cin + ain × bin
• The a band (a11, a12, a21, a31, …) and b band (b11, b12, b13, b21, b22, b23, …) enter skewed from two directions, and partial results flow out along the third
• Output schedule: c11 at t = 4; c21, c12 at t = 5; c31, c32 at t = 6; c41, c22, c14 at t = 7
Summary
• Systolic structures are good for computation-bound problems
• Model costs in VLSI systems:
  • Minimize the number of memory accesses
  • Emphasize local interconnections (long wires are bad)
• Candidate algorithms:
  • Make multiple use of input data (e.g. n inputs, O(n³) computations)
  • Concurrency
  • Simple control flow, simple processing elements