A. Sazegari AltiVec Technical Lead

Introduction • AltiVec™ is an extension to the PowerPC Instruction Set Architecture • Designed to extend Apple’s leadership position in multimedia processing (AltiVec is a trademark of Motorola, Inc.)

What You’ll Learn • About the AltiVec Architecture • Its performance potential • AltiVec programming

AltiVec Technology • Vector/SIMD technology • Fixed-length vector operands (packed data) • Single Instruction Multiple Data • RISC-style instruction set • Optimized for digital signal processing • Elevates multimedia to first-class data type • Useful wherever data-parallelism exists

AltiVec Architecture • New Vector Register File: • 32 new 128-bit wide registers • New data-types: • Packed byte, halfword, and word integers • Packed IEEE single-precision floats • Saturation Arithmetic capability • 160 new PowerPC instructions

PowerPC Architecture [Block diagram: Instruction Stream, Branch Unit, IU, and FPU; register files GRF (32 bits) and FPRF (64 bits); Memory.]

AltiVec Architecture [Block diagram: Instruction Stream, Branch Unit, IU, FPU, and the new Vector Unit; register files GRF (32 bits), FPRF (64 bits), and Vector Register File (128 bits); Memory.]

Programming Model • Separate Vector Register File • More space for coefficients, variables, etc. • More names for scheduling • Wider for more parallelism • No interference with FP or integer
[Diagram: three register files of 32 registers each: General Register File GPR0-GPR31 (32 bits), Floating-Point Register File FPR0-FPR31 (64 bits), and Vector Register File VR0-VR31 (128 bits); also shown are the branch registers (Cond, Count, Link), Time, VRSave, XER, FPSCR, and VSCR.]

Vector Data Types • One vector (128 bits) holds: • 16 signed or unsigned integer bytes, or • 8 signed or unsigned integer halfwords, or • 4 signed or unsigned integer words, or • 4 IEEE single-precision floating-point numbers
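The four views of one 128-bit vector can be pictured with a plain C union. This is only a sketch in portable C: `vec128_model` and `elements_per_vector` are hypothetical names used for illustration, not AltiVec data types, and the sketch assumes a 4-byte `float`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for one 128-bit vector register. */
typedef union {
    uint8_t  bytes[16];     /* 16 integer bytes */
    uint16_t halfwords[8];  /* 8 integer halfwords */
    uint32_t words[4];      /* 4 integer words */
    float    floats[4];     /* 4 single-precision floats (assumes 4-byte float) */
} vec128_model;

/* Number of elements of the given size (in bytes) that fit in one vector. */
int elements_per_vector(int element_size_bytes) {
    assert(element_size_bytes == 1 || element_size_bytes == 2 ||
           element_size_bytes == 4);
    return (int)sizeof(vec128_model) / element_size_bytes;
}
```

Halving the element size doubles the number of elements one instruction can process.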

Simple SIMD Example
T = vec_adds (A, B); // vector signed short T, A, B
vaddshs T, A, B
[Diagram: the eight halfwords of VRA and VRB are added pairwise into VRT.]
• 8 halfword additions in one instruction • Saturation arithmetic (clamp to max or min on overflow)
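To make the saturation behaviour concrete, here is a scalar model in portable C of what the single vector instruction computes for eight signed halfwords. The function names `saturate_s16` and `adds_s16x8_model` are hypothetical; real AltiVec code would simply call `vec_adds`.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate sum to the signed 16-bit range. */
int16_t saturate_s16(int32_t value) {
    if (value > INT16_MAX) return INT16_MAX;
    if (value < INT16_MIN) return INT16_MIN;
    return (int16_t)value;
}

/* Scalar model of an 8 x halfword saturating add: one loop iteration per
   element, where the vector unit handles all eight at once. */
void adds_s16x8_model(const int16_t a[8], const int16_t b[8], int16_t t[8]) {
    assert(a && b && t);
    for (int i = 0; i < 8; i++) {
        t[i] = saturate_s16((int32_t)a[i] + (int32_t)b[i]);
    }
}
```

With ordinary modulo arithmetic, 32767 + 1 would wrap to -32768; with saturation it stays at 32767, which is usually what audio and pixel data need.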

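Vector Dot Product [Diagram: vec_msum( ) multiplies the 16 element pairs of VRA1 and VRB1 and adds each group of four products, together with the matching word of VRC1, into one of the four words of VRT1; vec_sums( ) then sums those four words (now operand A2), together with VRB2, into a single result in VRT2.]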
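The two-stage structure of the dot product can be sketched in portable C for unsigned bytes. The names `msum_u8_model` and `dot_product_u8x16_model` are hypothetical, and this is a scalar model of the data flow rather than AltiVec code.

```c
#include <assert.h>
#include <stdint.h>

/* Stage 1, modelled on the multiply-sum step: multiply 16 byte pairs and add
   each group of four products to the matching 32-bit accumulator word. */
void msum_u8_model(const uint8_t a[16], const uint8_t b[16],
                   const uint32_t c[4], uint32_t t[4]) {
    assert(a && b && c && t);
    for (int w = 0; w < 4; w++) {
        uint32_t sum = c[w];
        for (int k = 0; k < 4; k++) {
            sum += (uint32_t)a[4 * w + k] * (uint32_t)b[4 * w + k];
        }
        t[w] = sum;
    }
}

/* Stage 2, modelled on the sum-across step: add the four partial sums.
   The real vec_sums saturates; that cannot trigger here because the total
   for one pair of vectors is at most 16 * 255 * 255. */
uint32_t dot_product_u8x16_model(const uint8_t a[16], const uint8_t b[16]) {
    const uint32_t zero[4] = {0, 0, 0, 0};
    uint32_t partial[4];
    msum_u8_model(a, b, zero, partial);
    return partial[0] + partial[1] + partial[2] + partial[3];
}
```

For longer arrays, the accumulator input of stage 1 lets the partial sums be carried from one pair of vectors to the next, so stage 2 is needed only once at the end.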

Arithmetic Operations • Add, Subtract, Average • Multiply, Multiply-add, Multiply-sum • Logicals (and, andc, or, nor, xor) • Rotates and shifts • Compares • Convert float <-> fixed (scaled) • ÷ and √ via Newton-Raphson refinement of reciprocal estimate
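The Newton-Raphson refinement mentioned in the last bullet can be sketched for division. Each step applies the standard update x' = x(2 - d·x), which roughly doubles the number of correct bits in the reciprocal. In this scalar C sketch the rough starting estimate is passed in as a parameter (on AltiVec it would come from an estimate instruction); `refine_reciprocal` and `divide_model` are hypothetical names.

```c
#include <assert.h>

/* One Newton-Raphson step toward 1/d, starting from the given estimate. */
float refine_reciprocal(float d, float estimate) {
    return estimate * (2.0f - d * estimate);
}

/* Model of n / d computed as n * (refined reciprocal of d). */
float divide_model(float n, float d, float rough_estimate, int steps) {
    assert(steps >= 0);
    float r = rough_estimate;
    for (int i = 0; i < steps; i++) {
        r = refine_reciprocal(d, r);
    }
    return n * r;
}
```

Starting from 0.3 as an estimate of 1/3, the relative error falls from 0.1 to 0.01, then 0.0001, then about 1e-8, so a few multiply-add style steps replace a divide.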

Vector Permute
T = vec_perm (A, B, C);
[Diagram: the bytes of VRA are numbered 00-0F and the bytes of VRB 10-1F. Each byte of the control vector VRC (17 18 0D 0E 0F 1E 01 00 12 11 10 0A 14 14 14 14) selects the source byte that is copied to the same position of VRT.]
• Arbitrary bytewise data reorganization • Small table-lookup
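A scalar model of the permute in portable C may help: each control byte is an index into the 32 bytes formed by A followed by B. The name `perm_model` is hypothetical; the sketch uses only the low five bits of each control byte.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a bytewise permute: control byte i picks which of the
   32 source bytes (a[0..15] followed by b[0..15]) lands in t[i]. */
void perm_model(const uint8_t a[16], const uint8_t b[16],
                const uint8_t c[16], uint8_t t[16]) {
    assert(a && b && c && t);
    for (int i = 0; i < 16; i++) {
        unsigned index = c[i] & 0x1F;  /* only the low 5 bits are used */
        t[i] = (index < 16) ? a[index] : b[index - 16];
    }
}
```

Loading A and B with a 32-entry table and C with indices turns the same operation into 16 parallel table lookups, which is the "small table-lookup" use noted above.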

Compare and Select
vec_cmpeq( ): bytewise compare of VRA1 with VRB1
VRA1      C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1
VRB1      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
VRT1/C2   00 FF FF FF 00 00 00 00 FF 00 FF FF 00 FF 00 00
vec_sel( ): the compare result (C2) selects each byte from A2 (mask 00) or B2 (mask FF)
VRA1/A2   C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1
VRB2      9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A
VRT2      C1 9A 9A 9A 1A 1A C1 1A 9A C1 9A 9A 1A 9A 1A C1
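The compare-then-select idiom replaces a per-element branch with a mask. Below is a scalar sketch in portable C; `cmpeq_u8_model` and `sel_model` are hypothetical names for models of the two operations, not AltiVec code.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a bytewise equality compare: FF where equal, 00 where not. */
void cmpeq_u8_model(const uint8_t a[16], const uint8_t b[16], uint8_t mask[16]) {
    assert(a && b && mask);
    for (int i = 0; i < 16; i++) {
        mask[i] = (a[i] == b[i]) ? 0xFF : 0x00;
    }
}

/* Model of a bitwise select: take bits of b where the mask bit is 1,
   bits of a where it is 0. */
void sel_model(const uint8_t a[16], const uint8_t b[16],
               const uint8_t mask[16], uint8_t t[16]) {
    assert(a && b && mask && t);
    for (int i = 0; i < 16; i++) {
        t[i] = (uint8_t)((a[i] & (uint8_t)~mask[i]) | (b[i] & mask[i]));
    }
}
```

Applied to the slide's data, the pair replaces every zero byte of A with 9A and leaves the other bytes untouched, with no branches.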

Other AltiVec Instructions • Load and Store (vector or scalar element) • Pack, Unpack, and Merge elements • Splat (element or literal replication) • Bitwise vector shifts • Double-vector bytewise shifts

Data Stream Prefetch • Software directed prefetch into cache • 4 simultaneous streams • Independent and asynchronous • Can be non-contiguous
[Diagram: a stream in memory made up of blocks 1, 2, 3, ..., N; block size = 0-32 vectors; 0-256 blocks; stride = ±32 KBytes.]
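The three stream parameters (block size, block count, stride) describe a strided walk through memory. The portable C sketch below only computes which addresses such a stream would cover; it does not issue any prefetch, it assumes that successive blocks start one stride apart, and `stream_block_address` and `stream_total_bytes` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define VECTOR_BYTES 16  /* one 128-bit vector */

/* Start address of block k (0-based), assuming successive blocks are one
   stride apart; a negative stride walks downward through memory. */
int64_t stream_block_address(int64_t base, int32_t stride_bytes, int block_index) {
    assert(block_index >= 0);
    return base + (int64_t)stride_bytes * block_index;
}

/* Total bytes a stream touches: blocks x vectors per block x 16 bytes. */
int64_t stream_total_bytes(int vectors_per_block, int block_count) {
    assert(vectors_per_block >= 0 && vectors_per_block <= 32);
    assert(block_count >= 0 && block_count <= 256);
    return (int64_t)block_count * vectors_per_block * VECTOR_BYTES;
}
```

When the stride is larger than the block size in bytes, the blocks are non-contiguous, for example one column strip of a 2-D image.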

Typical Implementation • ALL instructions fully-pipelined with single-cycle throughput • Simple ops: 1 cycle latency • Compound ops: 3–4 cycle latency • Dual AltiVec instruction issue • One arithmetic, one “permute” • No restriction on issue with scalar instructions

AltiVec vs. MMX • Both SIMD, but AltiVec: • Does everything MMX does, plus • Twice the SIMD parallelism • 4x the register namespace • 8x the register storage space • No mode switch or use overhead • Permute • Richer set of DSP instructions

AltiVec Performance • Peak Performance • Multimedia “kernels” • DSP benchmarks • Performance based on cycle-accurate simulator with real memory effects included • Performance stated relative to optimized PowerPC scalar code

Peak Performance • Vector operations at 400 MHz: • Integer • 12.8 billion arithmetic ops/sec • + 6.4 billion byte crossbar ops/sec • Floating-point • 3.2 gigaflops • + 1.6 billion FP crossbar ops/sec
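One way to reproduce these figures is sketched below in C. It rests on assumptions the slide does not state: that a vector holds 16 byte or 4 float elements, that a multiply-add style instruction counts as two operations per element, that a crossbar (permute) instruction counts as one operation per element, and that one instruction of each kind completes every cycle. `peak_ops_per_second` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Peak rate = clock x elements per vector x operations counted per element. */
uint64_t peak_ops_per_second(uint64_t clock_hz, unsigned lanes,
                             unsigned ops_per_lane) {
    assert(lanes > 0 && ops_per_lane > 0);
    return clock_hz * lanes * ops_per_lane;
}
```

Under these assumptions, 400 MHz x 16 bytes x 2 gives 12.8 billion integer ops/sec, 400 MHz x 16 x 1 gives 6.4 billion byte crossbar ops/sec, 400 MHz x 4 floats x 2 gives 3.2 gigaflops, and 400 MHz x 4 x 1 gives 1.6 billion FP crossbar ops/sec.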

Multimedia Kernels • Video and Audio • 11.4x Discrete Cosine Transform (DCT) • 16.1x* Motion estimation (* by ∑|A-B|) • 12.5x Quantization • 9.6x RGB -> YCbCr (CCIR601) • 3.6x Inverse FFT (FP) • 4.9x Windowing (FP)

Multimedia Kernels • Image Processing • 6.2x Bilinear interpolation • 1.1cy/px Separable convolution • 2.2cy/px RGB to YUV • 1.3cy/px Median Filter (3x3)

Multimedia Kernels • Graphics • 6.2x Vector-matrix multiply (FP) • 17.5x Buffer accumulation • 6.6x Line clipping • 6.3x Bezier curves

Communication Kernels • Modems and Telephony • 2.5x CRC-32 • 10.5x 64-QAM Demodulator • 7.6x Linear prediction • 9.3x Real 13-tap FIR • 30.7x Autocorrelation • 12.5x GSM Module 4.2.11

Miscellaneous DSP Kernels • Miscellaneous • 2.5 to 20x Parallel table lookup • 10.0x Sorting • 5.8x Associative search • 16.0x Galois field multiply • 4.0x Gamma Correction • 12.0cy/block Haar Transform (wavelet)

DSP Benchmarks • Results from an independent DSP benchmarking firm indicate AltiVec on integer DSP algorithms (FIR, FFT, etc.) is: • Twice as fast as the world’s fastest DSP (TMS320C6201) per clock, and four times faster including frequency • 2 to 5 times faster than Pentium™ II per clock (but µP would still be 35% smaller)

AltiVec Tools • Programming Model and ABI • Compilers and assemblers • Motorola’s MCC CodeWarrior plug-in • Apple’s MrC and PPCASM in MPW and MW • Metrowerks C/C++ • Emulator/Trace generator • MacsBug • Cycle-accurate simulator • Performance profiler

Programming in C • 11 new fundamental packed data types • AltiVec operators • Parse like function calls • Specific operators -> assembly instructions • Generic operators are type-sensitive • sizeof(), a=b, &a, *p, etc. • Compiler does register allocation, inlining, code scheduling, etc.

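C Program Example
zero = ( vector unsigned long ) ( 0 );   // zero = vec_xor ( zero, zero );
// Shift x right by 11 bits: whole bytes first (vec_sro), then the remaining bits (vec_srl)
shiftFactor = vec_splat_u8 ( 11 );
z = vec_sro ( x, shiftFactor );
z = vec_srl ( z, shiftFactor );
// 128-bit add of z and y: move the per-word carries up one word and repeat until none remain
do {
    carry = vec_addc ( z, y );
    z = vec_add ( z, y );
    y = vec_sld ( carry, zero, 4 );
} while ( !vec_all_eq ( y, zero ) );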
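The carry-propagation loop in this example can be followed with a scalar model in portable C. Here a 128-bit value is four 32-bit words with word 0 the most significant; `add_128_model` is a hypothetical name, and the inner loops stand in for the single vector operations named in the comments.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the 128-bit add loop: result = z_in + y_in (mod 2^128). */
void add_128_model(const uint32_t z_in[4], const uint32_t y_in[4], uint32_t z[4]) {
    uint32_t y[4], carry[4];
    int pending;
    assert(z_in && y_in && z);
    for (int i = 0; i < 4; i++) { z[i] = z_in[i]; y[i] = y_in[i]; }
    do {
        for (int i = 0; i < 4; i++) {
            uint32_t sum = z[i] + y[i];         /* vec_add: per-word modulo add */
            carry[i] = (sum < z[i]) ? 1u : 0u;  /* vec_addc: per-word carry out */
            z[i] = sum;
        }
        /* vec_sld ( carry, zero, 4 ): each carry moves to the next more
           significant word; the carry out of word 0 is dropped. */
        y[0] = carry[1]; y[1] = carry[2]; y[2] = carry[3]; y[3] = 0;
        pending = (y[0] | y[1] | y[2]) != 0;    /* !vec_all_eq ( y, zero ) */
    } while (pending);
}
```

Most additions finish in one or two passes; the worst case, a carry rippling through every word, takes four.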

Vector Shifts
The ‘shiftFactor’ vector is populated in two sections, one for “vector shift right by octet” (vsro) and one for “vector shift right” (vsr):
bit #           ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 ||
used by             || <------ vsro -------> || <---- vsr ----> ||
vsro is based on the permute crossbar and shifts by whole bytes; vsr is a 0 to 7 bit shift. Used sequentially, the two instructions shift a vector register right (or left, using the corresponding left-shift instructions) by 0 to 127 bits, as specified in bits 121:127 of ‘shiftFactor’. In this example the field holds 0001 011: one byte plus three bits, an 11-bit shift.
bit #           ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 ||
shiftFactor =   ... ||  0  |  0  |  0  |  1  ||  0  |  1  |  1  ||
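The byte-then-bit split can be sketched in portable C. A 128-bit value is modelled as 16 bytes with byte 0 the most significant; the function names are hypothetical, and each function is a scalar model of one of the two shift stages rather than AltiVec code.

```c
#include <assert.h>
#include <stdint.h>

/* Stage 1 (models the shift by octet): shift right by whole bytes. */
void shift_right_bytes(const uint8_t x[16], unsigned nbytes, uint8_t z[16]) {
    assert(nbytes < 16);
    for (int i = 15; i >= 0; i--) {
        z[i] = (i >= (int)nbytes) ? x[i - (int)nbytes] : 0;
    }
}

/* Stage 2 (models the bit shift): shift all 128 bits right by 0..7 bits. */
void shift_right_bits(const uint8_t x[16], unsigned nbits, uint8_t z[16]) {
    assert(nbits < 8);
    for (int i = 15; i >= 0; i--) {
        unsigned from_left = (i > 0) ? x[i - 1] : 0;  /* bits arriving from the byte above */
        z[i] = (uint8_t)((x[i] >> nbits) | (from_left << (8 - nbits)));
    }
}

/* Full shift by 0..127 bits: the upper four bits of the 7-bit factor give
   the byte count, the lower three the bit count. */
void shift_right_128(const uint8_t x[16], unsigned shift_factor, uint8_t z[16]) {
    assert(shift_factor < 128);
    shift_right_bytes(x, (shift_factor >> 3) & 0xF, z);
    shift_right_bits(z, shift_factor & 0x7, z);
}
```

For a shift factor of 11, the first stage moves the data by one byte and the second by three bits, matching the bit pattern 0001 011 shown above.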

AltiVec at Apple • Mac OS (blockmove, etc.) • QuickDraw • QTML (codecs, rasterizers…) • Media source code library • g4@apple.com

AltiVec Summary • Major architectural extension will make future PowerPCs great media processors • Early programming tools available now • Development systems 2H98 (Now) • AltiVec based systems in 1H99
