A. Sazegari AltiVec Technical Lead


Presentation Transcript


  1. A. Sazegari AltiVec Technical Lead

  2. Introduction • AltiVec™ is an extension to the PowerPC Instruction Set Architecture • Designed to extend Apple’s leadership position in multimedia processing • (AltiVec is a trademark of Motorola, Inc.)

  3. What You’ll Learn • About the AltiVec Architecture • Its performance potential • AltiVec programming

  4. AltiVec Technology • Vector/SIMD technology • Fixed-length vector operands (packed data) • Single Instruction Multiple Data • RISC-style instruction set • Optimized for digital signal processing • Elevates multimedia to first-class data type • Useful wherever data-parallelism exists

  5. AltiVec Architecture • New Vector Register File: • 32 new 128-bit wide registers • New data-types: • Packed byte, halfword, and word integers • Packed IEEE single-precision floats • Saturation Arithmetic capability • 160 new PowerPC instructions

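  6. PowerPC Architecture [Block diagram: the instruction stream feeds the Branch Unit, the integer unit (IU), and the floating-point unit (FPU); the IU works on the GRF over a 32-bit path to memory, the FPU on the FPRF over a 64-bit path to memory]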

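  7. AltiVec Architecture [Block diagram: the same Branch Unit, IU, and FPU as on the previous slide, plus a Vector Unit fed by the same instruction stream; the Vector Unit works on the Vector Register File over a 128-bit path to memory, alongside the 64-bit FPRF path and the 32-bit GRF path]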

  8. Programming Model • Separate Vector Register File • More space for coefficients, variables, etc. • More names for scheduling • Wider for more parallelism • No interference with FP or integer • [Register diagram: branch registers (Cond, Count, Link), Time, Time, and VRSave; General Reg. File GPR0 to GPR31 (32 bits wide) with XER; Floating-Point Register File FPR0 to FPR31 (64 bits wide) with FPSCR; Vector Register File VR0 to VR31 (128 bits wide) with VSCR; 32 registers in each file]

  9. Vector Data Types • One vector (128 bits) holds: • 16 signed or unsigned integer bytes, or • 8 signed or unsigned integer halfwords, or • 4 signed or unsigned integer words, or • 4 IEEE single-precision floating-point numbers
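As a rough illustration of slide 9, the four views of one 128-bit register can be modeled in portable C with a union. This is only a sketch: the name `vec128_model` and the helper `lanes_for` are hypothetical, and real AltiVec code would use the compiler's `vector` types rather than a union.

```c
#include <stdint.h>

/* Hypothetical portable model of one 128-bit vector register.
   Every member overlays the same 16 bytes of storage. */
typedef union {
    uint8_t  u8[16];   /* 16 unsigned bytes                  */
    int8_t   s8[16];   /* 16 signed bytes                    */
    uint16_t u16[8];   /* 8 unsigned halfwords               */
    int16_t  s16[8];   /* 8 signed halfwords                 */
    uint32_t u32[4];   /* 4 unsigned words                   */
    int32_t  s32[4];   /* 4 signed words                     */
    float    f32[4];   /* 4 IEEE single-precision floats     */
} vec128_model;

/* Number of lanes in one vector for a given element size in bytes. */
static int lanes_for(int element_bytes) {
    return (int)sizeof(vec128_model) / element_bytes;
}
```

The lane count is simply 16 bytes divided by the element size, which is where the 16/8/4 figures on the slide come from.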

  10. Simple SIMD Example • C: T = vec_adds (A, B); // vector signed short T, A, B • Assembly: vaddshs T, A, B • [Diagram: the eight halfword lanes of VRA and VRB are added pairwise into VRT] • 8 halfword additions in one instruction • Saturation arithmetic (clamp to max or min on overflow)
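What the single instruction on slide 10 computes can be sketched in plain scalar C. The function names below are hypothetical and this is a model of the lane-by-lane behavior, not the AltiVec intrinsic itself: each of the eight signed halfword sums is clamped to the 16-bit range instead of wrapping.

```c
#include <stdint.h>

/* Saturating add of two signed 16-bit values: clamp instead of wrapping. */
static int16_t sat_add_s16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Scalar model of a saturating add over 8 signed halfword lanes. */
static void model_vec_adds_s16(const int16_t a[8], const int16_t b[8],
                               int16_t t[8]) {
    for (int i = 0; i < 8; i++)
        t[i] = sat_add_s16(a[i], b[i]);
}
```

The vector unit does all eight lane additions at once; the loop here only spells out the per-lane rule.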

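  11. Vector Dot Product [Diagram: vec_msum( ) multiplies the 16 element pairs of VRA1 and VRB1 and adds each group of four products, together with the matching element of VRC1, into four partial sums in VRT1; VRT1 is then used as VRA2, and vec_sums( ) adds the four partial sums, together with VRB2, into a single total in VRT2]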
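The two-step dot product on slide 11 can be sketched as a scalar model. The assumptions here are mine: unsigned byte inputs, 32-bit accumulators that wrap, and hypothetical function names. The real vec_sums works on signed words with saturation, which this sketch deliberately leaves out.

```c
#include <stdint.h>

/* Model of the multiply-sum step: 16 byte products, four per 32-bit lane,
   each group added to the matching accumulator lane in c. */
static void model_msum_u8(const uint8_t a[16], const uint8_t b[16],
                          const uint32_t c[4], uint32_t t[4]) {
    for (int lane = 0; lane < 4; lane++) {
        uint32_t acc = c[lane];
        for (int k = 0; k < 4; k++)
            acc += (uint32_t)a[4 * lane + k] * b[4 * lane + k];
        t[lane] = acc;                     /* wraps modulo 2^32 */
    }
}

/* Model of the sum-across step: the four lanes plus one extra term.
   Assumption: no saturation, unlike the real signed instruction. */
static uint32_t model_sum_across(const uint32_t a[4], uint32_t extra) {
    return a[0] + a[1] + a[2] + a[3] + extra;
}

/* Dot product of two 16-byte vectors using the two steps above. */
static uint32_t dot16_u8(const uint8_t a[16], const uint8_t b[16]) {
    const uint32_t zero[4] = {0, 0, 0, 0};
    uint32_t partial[4];
    model_msum_u8(a, b, zero, partial);
    return model_sum_across(partial, 0);
}
```

For longer arrays the first step would be repeated with the running partial sums fed back in as the accumulator, and the sum-across step run once at the end.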

  12. Arithmetic Operations • Add, Subtract, Average • Multiply, Multiply-add, Multiply-sum • Logicals (and, andc, or, nor, xor) • Rotates and shifts • Compares • Convert float <—> fixed (scaled) • ÷ and √ via Newton-Raphson refinement of reciprocal estimate
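The last bullet of slide 12 refers to the standard Newton-Raphson step for a reciprocal, y' = y * (2 - x * y), which roughly doubles the number of correct bits per step. The sketch below is scalar C with hypothetical function names; the starting estimate is passed in by the caller, standing in for whatever a hardware estimate instruction would supply.

```c
/* One Newton-Raphson refinement of an estimate of 1/x. */
static float refine_reciprocal(float x, float estimate) {
    return estimate * (2.0f - x * estimate);
}

/* Apply the refinement a fixed number of times to a starting estimate. */
static float reciprocal_nr(float x, float estimate, int steps) {
    for (int i = 0; i < steps; i++)
        estimate = refine_reciprocal(x, estimate);
    return estimate;
}
```

A quotient a / b is then a multiplied by the refined reciprocal of b. Square root is usually handled the same way, starting from a reciprocal square root estimate.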

  13. Vector Permute • T = vec_perm (A, B, C); • [Diagram: VRA and VRB together form a 32-byte table with bytes numbered 00 to 1F; each byte of the control vector VRC selects the table byte that lands in the corresponding byte of VRT. Example VRC = 17 18 0D 0E 0F 1E 01 00 12 11 10 0A 14 14 14 14] • Arbitrary bytewise data reorganization • Small table-lookup
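The permute on slide 13 behaves like a 32-entry byte table lookup. The scalar model below (hypothetical name `model_vec_perm`) treats A as table entries 0 to 15 and B as entries 16 to 31, and uses the low five bits of each control byte as the index.

```c
#include <stdint.h>

/* Scalar model of a bytewise permute: each control byte picks one byte
   from the 32-byte table formed by a (entries 0-15) and b (entries 16-31). */
static void model_vec_perm(const uint8_t a[16], const uint8_t b[16],
                           const uint8_t c[16], uint8_t t[16]) {
    for (int i = 0; i < 16; i++) {
        unsigned idx = c[i] & 0x1F;        /* only the low 5 bits are used */
        t[i] = (idx < 16) ? a[idx] : b[idx - 16];
    }
}
```

With A holding bytes 00 to 0F and B holding 10 to 1F, as in the slide, the result is simply a copy of the control vector, which makes the example easy to check by eye.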

  14. Compare and Select
      Step 1: VRT1 = vec_cmpeq (VRA1, VRB1)
      VRA1      C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1
      VRB1      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      VRT1/C2   00 FF FF FF 00 00 00 00 FF 00 FF FF 00 FF 00 00
      Step 2: VRT2 = vec_sel (VRA2, VRB2, VRC2)
      VRA1/A2   C1 00 00 00 1A 1A C1 1A 00 C1 00 00 1A 00 1A C1
      VRB2      9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A 9A
      VRT2      C1 9A 9A 9A 1A 1A C1 1A 9A C1 9A 9A 1A 9A 1A C1
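The compare-then-select pattern on slide 14 replaces a per-element branch with a mask. A scalar sketch with hypothetical function names: the compare produces FF where the bytes match and 00 elsewhere, and the select takes bits from the second operand wherever the mask is set.

```c
#include <stdint.h>

/* Bytewise equality compare: 0xFF where equal, 0x00 where different. */
static void model_cmpeq_u8(const uint8_t a[16], const uint8_t b[16],
                           uint8_t mask[16]) {
    for (int i = 0; i < 16; i++)
        mask[i] = (a[i] == b[i]) ? 0xFF : 0x00;
}

/* Bitwise select: bits of b where mask is 1, bits of a where mask is 0. */
static void model_sel_u8(const uint8_t a[16], const uint8_t b[16],
                         const uint8_t mask[16], uint8_t t[16]) {
    for (int i = 0; i < 16; i++)
        t[i] = (uint8_t)((a[i] & ~mask[i]) | (b[i] & mask[i]));
}
```

Run on the slide's data, this replaces every 00 byte of the source with 9A and leaves the other bytes untouched.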

  15. Other AltiVec Instructions • Load and Store (vector or scalar element) • Pack, Unpack, and Merge elements • Splat (element or literal replication) • Bitwise vector shifts • Double-vector bytewise shifts

  16. Data Stream Prefetch • Software directed prefetch into cache • 4 simultaneous streams • Independent and asynchronous • Can be non-contiguous • [Diagram: a stream of blocks 1, 2, 3, … N in memory; block size = 0-32 vectors, block count = 0-256 blocks, stride = ±32 KBytes]
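The three stream parameters on slide 16 are typically handed to the prefetch instruction as one packed 32-bit control word. The layout used below (block size in the 5 bits above bit 24, block count in the next 8 bits, signed stride in the low 16 bits) is my assumption from memory rather than something the slides state, and `dst_control_word` is a hypothetical helper; check the layout against the architecture manual before relying on it.

```c
#include <stdint.h>

/* Pack stream parameters into a 32-bit control word.
   ASSUMED layout: size (5 bits) << 24 | count (8 bits) << 16 | stride (16 bits).
   A size of 32 or a count of 256 wraps to 0 under this packing. */
static uint32_t dst_control_word(unsigned block_size_vectors,
                                 unsigned block_count,
                                 int stride_bytes) {
    return ((uint32_t)(block_size_vectors & 0x1F) << 24)
         | ((uint32_t)(block_count & 0xFF) << 16)
         | ((uint32_t)stride_bytes & 0xFFFF);
}
```

For example, a stream of 8 blocks of 4 vectors each, spaced 64 bytes apart, packs to 0x04080040 under this assumed layout.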

  17. Typical Implementation • ALL instructions fully-pipelined with single-cycle throughput • Simple ops: 1 cycle latency • Compound ops: 3–4 cycle latency • Dual AltiVec instruction issue • One arithmetic, one “permute” • No restriction on issue with scalar instructions

  18. AltiVec vs. MMX • Both SIMD, but AltiVec: • Does everything MMX does, plus • Twice the SIMD parallelism • 4x the register namespace • 8x the register storage space • No mode switch or use overhead • Permute • Richer set of DSP instructions

  19. AltiVec Performance • Peak Performance • Multimedia “kernels” • DSP benchmarks • Performance based on cycle-accurate simulator with real memory effects included • Performance stated relative to optimized PowerPC scalar code

  20. Peak Performance • Vector operations at 400MHz: • Integer • 12.8 billion arithmetic ops/sec • + 6.4 billion byte crossbar ops/sec • Floating-point • 3.2 gigaflops • + 1.6 billion FP crossbar ops/sec
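The figures on slide 20 are consistent with a simple lanes-times-clock calculation. The breakdown below is my reading, not something the slide states: arithmetic operations are counted as two per lane per cycle (a multiply plus an add), crossbar (permute) operations as one per lane per cycle, with 16 byte lanes or 4 float lanes per vector. The function name is hypothetical.

```c
/* Peak operations per second for a given clock, lane count,
   and operations counted per lane per cycle. */
static double peak_ops_per_sec(double clock_hz, int lanes, int ops_per_lane) {
    return clock_hz * lanes * ops_per_lane;
}
```

Under those assumptions, 400 MHz × 16 lanes × 2 gives the 12.8 billion integer figure and 400 MHz × 4 lanes × 2 gives 3.2 gigaflops; the crossbar figures are the same products with one operation per lane.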

  21. Multimedia Kernels • Video and Audio • 11.4x Discrete Cosine Transform (DCT) • 16.1x* Motion estimation (* by ∑|A-B|) • 12.5x Quantization • 9.6x RGB -> YCbCr (CCIR601) • 3.6x Inverse FFT (FP) • 4.9x Windowing (FP)

  22. Multimedia Kernels • Image Processing • 6.2x Bilinear interpolation • 1.1cy/px Separable convolution • 2.2cy/px RGB to YUV • 1.3cy/px Median Filter (3x3)

  23. Multimedia Kernels • Graphics • 6.2x Vector-matrix multiply (FP) • 17.5x Buffer accumulation • 6.6x Line clipping • 6.3x Bezier curves

  24. Communication Kernels • Modems and Telephony • 2.5x CRC-32 • 10.5x 64-QAM Demodulator • 7.6x Linear prediction • 9.3x Real 13-tap FIR • 30.7x Autocorrelation • 12.5x GSM Module 4.2.11

  25. Miscellaneous DSP Kernels • Miscellaneous • 2.5 to 20x Parallel table lookup • 10.0x Sorting • 5.8x Associative search • 16.0x Galois field multiply • 4.0x Gamma Correction • 12.0cy/block Haar Transform (wavelet)

  26. DSP Benchmarks • Results from an independent DSP benchmarking firm indicate AltiVec on integer DSP algorithms (FIR, FFT, etc.) is: • Twice as fast as the world’s fastest DSP (TMS320C6201) per clock, and four times faster including frequency • 2 to 5 times faster than Pentium™ II per clock (but µP would still be 35% smaller)

  27. AltiVec Tools • Programming Model and ABI • Compilers and assemblers • Motorola’s MCC CodeWarrior plug-in • Apple’s MrC and PPCASM in MPW and MW • Metrowerks C/C++ • Emulator/Trace generator • MacsBug • Cycle-accurate simulator • Performance profiler

  28. Programming in C • 11 new fundamental packed data types • AltiVec operators • Parse like function calls • Specific operators map directly to assembly instructions • Generic operators are type-sensitive • sizeof(), a=b, &a, *p, etc. • Compiler does register allocation, inlining, code scheduling, etc.

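  29. C Program Example
      zero = ( vector unsigned long ) ( 0 );   // zero = vec_xor ( zero, zero );
      shiftFactor = vec_splat_u8 ( 11 );       // every byte = 11 = 1 octet + 3 bits
      z = vec_sro ( x, shiftFactor );          // shift right by whole octets
      z = vec_srl ( z, shiftFactor );          // shift right by the remaining bits
      do {                                     // 128-bit add: z = z + y
          carry = vec_addc ( z, y );           // carry out of each word
          z = vec_add ( z, y );                // wordwise sum, modulo 2^32
          y = vec_sld ( carry, zero, 4 );      // move carries up one word
      } while ( !vec_all_eq ( y, zero ) );     // repeat until no carries remain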
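The do/while loop on slide 29 builds a 128-bit addition out of four 32-bit lanes by feeding the carries back in until none remain. A scalar sketch of that loop follows; the function name is hypothetical, and word 0 is taken as the most significant word, matching AltiVec's big-endian element order.

```c
#include <stdint.h>

/* Scalar model of the carry-propagation loop: z += y over 128 bits.
   z[0] and y_in[0] are the most significant words. */
static void model_add128(uint32_t z[4], const uint32_t y_in[4]) {
    uint32_t y[4] = { y_in[0], y_in[1], y_in[2], y_in[3] };
    int pending;
    do {
        uint32_t carry[4];
        for (int i = 0; i < 4; i++) {
            uint32_t sum = z[i] + y[i];
            carry[i] = (sum < z[i]);   /* like vec_addc: carry out per word */
            z[i] = sum;                /* like vec_add: sum modulo 2^32     */
        }
        /* like vec_sld(carry, zero, 4): each carry moves one word upward;
           the carry out of the top word is dropped. */
        y[0] = carry[1];
        y[1] = carry[2];
        y[2] = carry[3];
        y[3] = 0;
        pending = (y[0] | y[1] | y[2]) != 0;
    } while (pending);
}
```

The loop normally finishes after one or two passes; only a carry that ripples through every word needs the full four.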

  30. Vector Shifts
      This ‘shiftFactor’ vector is populated in 2 sections, one for “vector shift right by octet” (vsro) and one for “vector shift right” (vsr):
      bit #       ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 ||
      used by         || <------ vsro -------> || <---- vsr ----> ||
      vsro is based on the permute crossbar and shifts by whole bytes; vsr is a 0 to 7 bit shift. Used sequentially, the two instructions shift a vector register right (or left, with the corresponding left-shift instructions) by 0 to 127 bits, as specified in bits 121:127 of ‘shiftFactor’:
      bit #       ... || 121 | 122 | 123 | 124 || 125 | 126 | 127 ||
      shiftFactor = ... ||  0  |  0  |  0  |  1  ||  0  |  1  |  1  ||
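The split described on slide 30 can be sketched as a scalar model of a 128-bit right shift: the upper four bits of the 7-bit shift count give a byte shift (the vsro part) and the lower three give a bit shift (the vsr part). The function name is hypothetical and byte 0 is taken as the most significant byte.

```c
#include <stdint.h>

/* Scalar model of a 0-127 bit right shift of a 128-bit value.
   v[0] is the most significant byte; shift is the 7-bit count
   that would sit in bits 121:127 of the shift vector. */
static void model_shift_right_128(const uint8_t v[16], unsigned shift,
                                  uint8_t t[16]) {
    int bytes = (int)((shift >> 3) & 0xF);   /* bits 121:124, the vsro part */
    unsigned bits = shift & 0x7;             /* bits 125:127, the vsr part  */
    uint8_t tmp[16];

    /* Step 1: shift right by whole bytes, filling with zeros. */
    for (int i = 0; i < 16; i++)
        tmp[i] = (i >= bytes) ? v[i - bytes] : 0;

    /* Step 2: shift right by 0-7 bits, pulling bits from the byte above. */
    for (int i = 15; i >= 0; i--) {
        unsigned hi = (i > 0) ? tmp[i - 1] : 0;
        t[i] = (uint8_t)((tmp[i] >> bits) | (hi << (8 - bits)));
    }
}
```

With the slide's value of 11 (binary 0001 011), the byte step moves the data one byte and the bit step moves it three more bits.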

  31. AltiVec at Apple • Mac OS (blockmove, etc.) • QuickDraw • QTML (codecs, rasterizers…) • Media source code library • g4@apple.com

  32. AltiVec Summary • Major architectural extension will make future PowerPCs great media processors • Early programming tools available now • Development systems 2H98 (Now) • AltiVec based systems in 1H99
