SIMD Architectures

SIMD Architectures Laxmi Narayan Bhuyan http://www.cs.ucr.edu/~bhuyan

Data Parallel Model • Operations can be performed in parallel on each element of a large regular data structure, such as an array • 1 Control Processsor broadcast to many PEs (see Ch. 1, Fig. 1-25, page 45 of [CSG99]) • When computers were large, could amortize the control portion of many replicated PEs • Condition flag per PE so that can skip • Data distributed in each memory • Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE • Data parallel programming languages lay out data to processor

Data Parallel Model • Vector processors have similar ISAs, but no data placement restriction • SIMD led to Data Parallel Programming languages • Advancing VLSI led to single chip FPUs and whole fast µProcs (SIMD less attractive) • SIMD programming model led to Single Program Multiple Data (SPMD) model • All processors execute identical program • Data parallel programming languages still useful, do communication all at once: “Bulk Synchronous” phases in which all communicate after a global barrier

SIMD Programming – High-Performance Fortran (HPF) • Single Program Multiple Data (SPMD) • FORALL Construct similar to Fork: FORALL (I=1:N), A(I) = B(I) + C(I), END FORALL • Data Mapping in HPF 1. To reduce interprocessor communication 2. Load balancing among processors http://www.npac.syr.edu/hpfa/ http://www.crpc.rice.edu/HPFF/

How does an SIMD computer work? • A Host computer is necessary to do the I/O operations • The user program is loaded into the control memory • The data is distributed to all the memory modules • The control unit decodes the instn and executes it if it is a scalar instn. If it is a vector instn, it broadcasts the control signals to the PEs to do the executions • Before broadcasting the control signals, the CU broadcasts an enable vector which will enable the PEs

Masking and Data Routing Mechanisms • A,B,C – working registers • Si = status (1 active, 0 inactive) • Ri – Data routing register • Di – holds address • Ii – Index register

Example

Matrix Multiplication

N * N Mesh

The Illiac IV Architecture • Distributed memory architecture • 64 PEs connected as an 8X8 2-D mesh with end around connection • LDB: Local Data Buffer • 64, 64-bit each • PEM: 2K X 64 bits memory

The Illiac IV Network

Maspar MP-1 Architecture • Configuration with 1K-16K PEs are available • Each PE has a 4-bit ALU, 1-bit logic unit, a 64-bit mantissa unit, a 16-bit exponent unit, communication input and output ports • Each PE has 40 32-bit registers available to the programmer • Each processor board has 1024 PEs arranges as 64 PE clusters (PECs) with 16 PEs per cluster • Each PEC is a chip connected to 8 neighbors via an octagonal mesh • Another network, called Multistage Crossbar Network, with three router stages gives a function of 1024X1024 crossbar for routing from any PEC to another PEC

SIMD Architectures

SIMD Architectures

Presentation Transcript

Streaming SIMD Extension (SSE)

Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping

SIMD Processor Extensions

SIMD

The SIMD Review

SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures

Intel SIMD architecture

Intel SIMD architecture

Bottlenecks of SIMD

Intel SIMD architecture

Streaming SIMD Extensions

Intel SIMD architecture

Intel SIMD architecture

What is the SIMD?

Intel SIMD architecture