
CS 201 Advanced Topics SIMD, x86-64, ARM


Presentation Transcript


  1. CS 201 Advanced Topics: SIMD, x86-64, ARM

  2. Vector instructions (MMX/SSE/AVX)

  3. Background: IA32 Floating Point. What does this have to do with SIMD? The Floating Point Unit (x87 FPU) is hardware to add, multiply, and divide IEEE floating point numbers, with 8 80-bit registers organized as a stack (st0-st7). Operands are pushed onto the stack, and operators can pop results off into memory. History: the 8086 was the first computer to implement IEEE FP, via the separate 8087 FPU (floating point unit); the 486 merged the FPU and integer unit onto one chip (instruction decoder and sequencer, integer unit, FPU, memory interface).

  4. FPU Data Register Stack. FPU register format (extended precision): bit 79 is the sign s, bits 78-64 the exp, bits 63-0 the frac. FPU registers: 8 registers that logically form a shallow stack; the top is called %st(0), with %st(1), %st(2), %st(3), ... below it. The stack grows down, and when you push too many values, the bottom values disappear.

  5. Simplified FPU operation. "load" instruction: pushes a number onto the stack. "storep" instruction: pops the top element from the stack and stores it in memory. Unary operation: "neg" = pop the top element, negate it, push the result onto the stack. Binary operations: "addp", "multp" = pop the top two elements, perform the operation, push the result onto the stack. Stack operation is similar to Reverse Polish Notation: a b + = push a, push b, add (pop a & b, add, push result).

  6. Example calculation: x = (a-b)/(-b+c)
  load c
  load b
  neg
  addp
  load b
  load a
  subp
  divp
  storep x

  7. FPU instructions. Large number of floating point instructions and formats: ~50 basic instruction types, e.g. load (fld*), store (fst*), add (fadd), multiply (fmul), sin (fsin), cos (fcos), tan (ftan), etc. Sample instructions:
  fldz: push 0.0 (load zero)
  flds Addr: push M[Addr] (load single precision real)
  fmuls Addr: %st(0) <- %st(0)*M[Addr] (multiply)
  faddp: %st(1) <- %st(0)+%st(1); pop (add and pop; after the pop, %st(0) has the result)

  8. FPU instruction mnemonics Precision “s” single precision “l” double precision Operand order Default Op1 <op> Op2 “r” reverse operand order (i.e. Op2 <op> Op1) Stack operation “p” pop a single value from stack upon completion

  9. Floating Point Code Example. Compute the inner product of two vectors: single precision arithmetic, a common computation.

  float ipf(float x[], float y[], int n)
  {
      int i;
      float result = 0.0;
      for (i = 0; i < n; i++) {
          result += x[i] * y[i];
      }
      return result;
  }

      pushl %ebp              # setup
      movl %esp,%ebp
      pushl %ebx
      movl 8(%ebp),%ebx       # %ebx=&x
      movl 12(%ebp),%ecx      # %ecx=&y
      movl 16(%ebp),%edx      # %edx=n
      fldz                    # push +0.0
      xorl %eax,%eax          # i=0
      cmpl %edx,%eax          # if i>=n done
      jge .L3
  .L5:
      flds (%ebx,%eax,4)      # push x[i]
      fmuls (%ecx,%eax,4)     # st(0)*=y[i]
      faddp                   # st(1)+=st(0); pop
      incl %eax               # i++
      cmpl %edx,%eax          # if i<n repeat
      jl .L5
  .L3:
      movl -4(%ebp),%ebx      # finish
      movl %ebp, %esp
      popl %ebp
      ret                     # st(0) = result

  10. Inner Product Stack Trace.
  Initialization: 1. fldz -> %st(0) = 0.0
  Iteration 0: 2. flds (%ebx,%eax,4) -> %st(1) = 0.0, %st(0) = x[0]; 3. fmuls (%ecx,%eax,4) -> %st(1) = 0.0, %st(0) = x[0]*y[0]; 4. faddp -> %st(0) = 0.0+x[0]*y[0]
  Iteration 1: 5. flds (%ebx,%eax,4) -> %st(1) = x[0]*y[0], %st(0) = x[1]; 6. fmuls (%ecx,%eax,4) -> %st(1) = x[0]*y[0], %st(0) = x[1]*y[1]; 7. faddp -> %st(0) = x[0]*y[0]+x[1]*y[1]
  Serial, sequential operation.

  11. Motivation for SIMD. Multimedia, graphics, scientific, and security applications require a single operation across large amounts of data: frame differencing for video encoding, image fade-in/fade-out, sprite overlay in games, matrix computations, encryption/decryption. Algorithm characteristics: access data in a regular pattern, operate on short data types (8-bit, 16-bit, 32-bit), and have an operating paradigm in which data streams through fixed processing stages (data-flow operation).

  12. Natural fit for SIMD instructions. SIMD: Single Instruction, Multiple Data, also known as vector instructions. Before SIMD: one instruction per data location. With SIMD: one instruction over multiple sequential data locations; execution units must support "wide" parallel execution. Examples in many processors: Intel x86 MMX, SSE, AVX; AMD 3DNow!

  13. Example: scaling and accumulating color channels. As three scalar operations:
  R = R + XR * 1.08327
  G = G + XG * 1.89234
  B = B + XB * 1.29835
  As one vector operation: [R, G, B] = [R, G, B] + [XR, XG, XB] * [1.08327, 1.89234, 1.29835]. Likewise, the scalar sequence
  R = R + X[i+0]
  G = G + X[i+1]
  B = B + X[i+2]
  becomes the single vector operation [R, G, B] = [R, G, B] + X[i:i+2].

  14. Example: vectorizing a loop. Scalar loop:
  for (i = 0; i < 64; i += 1)
      A[i+0] = A[i+0] + B[i+0];
  Unrolled 4-way:
  for (i = 0; i < 64; i += 4) {
      A[i+0] = A[i+0] + B[i+0];
      A[i+1] = A[i+1] + B[i+1];
      A[i+2] = A[i+2] + B[i+2];
      A[i+3] = A[i+3] + B[i+3];
  }
  As one vector operation per iteration:
  for (i = 0; i < 64; i += 4)
      A[i:i+3] = A[i:i+3] + B[i:i+3];

  15. SIMD in x86. MMX (MultiMedia eXtensions): Pentium, Pentium II. SSE (Streaming SIMD Extensions) (1999): Pentium 3. SSE2 (2000), SSE3 (2004): Pentium 4. SSSE3 (2006), SSE4 (2007): Intel Core. AVX (2011): Intel Sandy Bridge, Ivy Bridge.

  16. General idea. SIMD (single-instruction, multiple data) vector instructions: new data types, registers, and operations; parallel operation on small (length 2-8) vectors of integers or floats. Example: a "4-way" multiply or add operates on all four lanes at once.

  17. MMX (MultiMedia eXtensions). MMX re-uses the FPU registers for SIMD execution of integer ops: it aliases the FPU registers st0-st7 as MM0-MM7 and treats them as 8 randomly accessible 64-bit data registers. Each register is partitioned based on the data type of the vector, and a single operation is applied in parallel to the individual parts: 8 byte additions (PADDB), 4 short or word additions (PADDW), or 2 int or dword additions (PADDD). How many different partitions are there for a vectored add? Why not new registers? Intel wanted to avoid adding CPU state: the change does not impact context switching, and the OS does not need to know about MMX. Drawback: you can't use the FPU and MMX at the same time.

  18. SSE (Streaming SIMD Extensions) Larger, independent registers MMX doesn't allow use of FPU and SIMD simultaneously 8 128-bit data registers separate from FPU New hardware registers (XMM0-XMM7) New status register for flags (MXCSR) Vectored floating point supported MMX only for vectored integer operations SSE adds support for vectored floating point operations 4 single precision floats Streaming support Prefetching and cacheability control in loading/storing operands Additional integer operations for permutations Shuffling, interleaving

  19. SSE2. Adds more data types and instructions: vectored double-precision floating point operations (2 double precision floats) and full support for vectored integer types over the 128-bit XMM registers: 16 single-byte elements, 8 word elements, 4 double-word elements, or 2 quad-word elements.

  20. SSE3 and SSE4. SSE3: horizontal vector operations, i.e. operations within a vector (e.g. min, max), to speed up DSP and 3D ops, plus complex arithmetic. All x86-64 chips have SSE3. SSE4: video encoding accelerators, including sum of absolute differences (frame differencing), horizontal minimum search (motion estimation), and conditional copying; graphics building blocks, including dot product; 32-bit vector integer operations on 128-bit registers; dword multiplies; vector rounding.

  21. Feature summary Integer vectors (64-bit registers) (MMX) Single-precision vectors (SSE) Double-precision vectors (SSE2) Integer vectors (128-bit registers) (SSE2) Horizontal arithmetic within register (SSE3/SSSE3) Video encoding accelerators (H.264) (SSE4) Graphics building blocks (SSE4)

  22. Intel Architectures over time (focus: floating point).
  x86-16: 8086, 286.
  x86-32: 386, 486, Pentium; Pentium MMX (MMX); Pentium III (SSE: 4-way single precision fp); Pentium 4 (SSE2: 2-way double precision fp); Pentium 4E (SSE3).
  x86-64 / em64t: Pentium 4F; Core 2 Duo (SSE4).

  23. SSE3 Registers. All are caller saved, and each is 128 bits wide. %xmm0 holds the floating point return value; %xmm0 through %xmm7 pass arguments #1 through #8; %xmm8 through %xmm15 are additional registers.

  24. SSE3 Registers. Different data types and associated instructions, all in the 128-bit registers. Integer vectors: 16-way byte, 8-way short, 4-way int. Floating point vectors: 4-way single (float), 2-way double. Floating point scalars: single, double (held in the least significant bits of the register).

  25. SSE3 Instruction Names. Instructions come in packed (vector) and single-slot (scalar) forms: single precision addps (packed) and addss (scalar); double precision addpd (packed) and addsd (scalar).

  26. SSE3 Instructions: Examples. Single precision 4-way vector add: addps %xmm0, %xmm1 (adds all four lanes of %xmm0 into %xmm1). Single precision scalar add: addss %xmm0, %xmm1 (adds only the low lane).

  27. SSE3 Basic Instructions. Moves: usual operand forms (reg → reg, reg → mem, mem → reg), with packed versions to load a vector from memory. Arithmetic: packed and scalar variants of the basic operations.

  28. x86-64 FP Code Example. Compute the inner product of two vectors: single precision arithmetic, using SSE3 instructions.

  float ipf(float x[], float y[], int n)
  {
      int i;
      float result = 0.0;
      for (i = 0; i < n; i++)
          result += x[i]*y[i];
      return result;
  }

  ipf:
      xorps %xmm1, %xmm1          # result = 0.0
      xorl %ecx, %ecx             # i = 0
      jmp .L8                     # goto middle
  .L10:                           # loop:
      movslq %ecx,%rax            # icpy = i
      incl %ecx                   # i++
      movss (%rsi,%rax,4), %xmm0  # t = y[icpy]
      mulss (%rdi,%rax,4), %xmm0  # t *= x[icpy]
      addss %xmm0, %xmm1          # result += t
  .L8:                            # middle:
      cmpl %edx, %ecx             # i:n
      jl .L10                     # if < goto loop
      movaps %xmm1, %xmm0         # return result
      ret

  29. SSE3 Conversion Instructions Conversions • Same operand forms as moves

  30. Detecting if it is supported.
  mov eax, 1
  cpuid               ; supported since Pentium
  test edx, 00800000h ; 00800000h (bit 23) MMX
                      ; 02000000h (bit 25) SSE
                      ; 04000000h (bit 26) SSE2
  jnz HasMMX

  31. Detecting if it is supported.
  #include <stdio.h>
  #include <string.h>
  #define cpuid(func,ax,bx,cx,dx)\
      __asm__ __volatile__ ("cpuid":\
          "=a" (ax), "=b" (bx), "=c" (cx), "=d" (dx) : "a" (func));
  int main(int argc, char* argv[])
  {
      int a, b, c, d, i;
      char x[13];
      int* q;
      for (i = 0; i < 13; i++) x[i] = 0;
      q = (int *) x;
      /* 12 char string returned in 3 registers */
      cpuid(0, a, q[0], q[2], q[1]);
      printf("str: %s\n", x);
      /* Bits returned in all 4 registers */
      cpuid(1, a, b, c, d);
      printf("a: %08x, b: %08x, c: %08x, d: %08x\n", a, b, c, d);
      printf(" bh * 8 = cache line size\n");
      printf(" bit 0 of c = SSE3 supported\n");
      printf(" bit 25 of c = AES supported\n");
      printf(" bit 0 of d = On-board FPU\n");
      printf(" bit 4 of d = Time-stamp counter\n");
      printf(" bit 26 of d = SSE2 supported\n");
      printf(" bit 25 of d = SSE supported\n");
      printf(" bit 23 of d = MMX supported\n");
  }
  http://thefengs.com/wuchang/courses/cs201/class/11/cpuid.c

  32. Detecting if it is supported.
  mashimaro <~> 12:43PM % cat /proc/cpuinfo
  processor       : 0
  vendor_id       : GenuineIntel
  cpu family      : 6
  model           : 15
  model name      : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
  stepping        : 11
  cpu MHz         : 2393.974
  cache size      : 4096 KB
  physical id     : 0
  siblings        : 4
  core id         : 0
  cpu cores       : 4
  apicid          : 0
  initial apicid  : 0
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 10
  wp              : yes
  flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
  bogomips        : 4791.08
  clflush size    : 64
  cache_alignment : 64
  address sizes   : 36 bits physical, 48 bits virtual
  power management:

  33. AVX2. Intel codename Haswell (2013), Broadwell (2014). Expansion of most integer AVX instructions to 256 bits. "Gather" support to load data from non-contiguous memory. 3-operand FMA (fused multiply-add) operations at full precision (a + b*c): these speed up dot products, matrix multiplications, and polynomial evaluations via Horner's rule (see the DEC VAX POLY instruction, 1977), and speed up software-based division and square root operations (so dedicated hardware for these operations can be removed).

  34. Programming SIMD Store data contiguously (i.e. in an array) Define total size of vector in bytes 8 bytes (64 bits) for MMX 16 bytes (128 bits) for SSE2 and beyond Define type of vector elements For 128 bit registers 2 double 4 float 4 int 8 short 16 char SIMD instructions based on each vector type

  35. Example: SIMD via macros/libraries. Rely on compiler intrinsics or library calls for SSE acceleration: intrinsics embed in-line assembly into the program, or call into library functions compiled with SSE. Adding two 128-bit vectors containing 4 floats:
  // Microsoft-specific compiler intrinsic function
  __m128 _mm_add_ps(__m128 a, __m128 b);
  __m128 a, b, c;
  // intrinsic function
  c = _mm_add_ps(a, b);
  With a = {2, 4, 6, 8} and b = {1, 2, 3, 4}, the four lane-wise adds produce c = {3, 6, 9, 12}.
  http://msdn.microsoft.com/en-us/library/y0dh78ez.aspx

  36. Example: SIMD in C. Adding two vectors (SSE). Must pass the compiler hints about your vector: the size of the vector in bytes (i.e. vector_size(16)) and the type of each vector element (i.e. float). Compile with gcc -msse2.
  // vector of four single floats
  typedef float v4sf __attribute__ ((vector_size(16)));
  union f4vector {
      v4sf v;
      float f[4];
  };
  void add_vector()
  {
      union f4vector a, b, c;
      a.f[0] = 1; a.f[1] = 2; a.f[2] = 3; a.f[3] = 4;
      b.f[0] = 5; b.f[1] = 6; b.f[2] = 7; b.f[3] = 8;
      c.v = a.v + b.v;
  }
  http://thefengs.com/wuchang/courses/cs201/class/11/add_nosse.c
  http://thefengs.com/wuchang/courses/cs201/class/11/add_sse.c

  37. Examples: SSE in C Measuring performance improvement using rdtsc http://thefengs.com/wuchang/courses/cs201/class/11

  38. Vector Instructions. Starting with version 4.1.1, gcc can autovectorize to some extent (-O3 or -ftree-vectorize): no speed-up is guaranteed, and support is very limited; icc is as of now much better. For highest performance, vectorize yourself using intrinsics (intrinsics = a C interface to vector instructions).

  39. AES. AES-NI: announced 2008, added to Intel Westmere processors and beyond (2010); separate from MMX/SSE/AVX. AESENC/AESDEC perform one round of an AES encryption/decryption flow: one single-byte substitution step, one row-wise permutation step, one column-wise mixing step, and addition of the round key (the order depends on whether one is encrypting or decrypting). There are 10 rounds per block for 128-bit keys, 12 rounds per block for 192-bit keys, and 14 rounds per block for 256-bit keys. Speeds up AES from 28 cycles per byte to 3.5 cycles per byte. Software support from security vendors is widespread. http://software.intel.com/file/24917

  40. x86-64

  41. x86-64 History 64-bit version of x86 architecture Developed by AMD in 2000 First processor released in 2003 Adopted by Intel in 2004 Features 64-bit registers and instructions Additional integer registers Adoption and extension of Intel’s SSE No-execute bit Conditional move instruction (avoiding branches) http://www.x86-64.org/

  42. 64-bit registers. From IA-32: %ah/%al are 8 bits, %ax is 16 bits, %eax is 32 bits. Now %rax is 64 bits, and each smaller register aliases the low bits of the next larger one: %rax (bits 63-0) contains %eax (bits 31-0), which contains %ax (bits 15-0), which contains %ah (bits 15-8) and %al (bits 7-0).

  43. More integer registers r8 – r15 Denoted %rXb - 8 bits %rXw - 16 bits %rXd - 32 bits %rX - 64 bits where X is from 8 to 15 Within gdb ‘info registers’

  44. x86-64 Integer Registers. Twice the number of registers, each accessible as 8, 16, 32, or 64 bits: %rax/%eax, %rbx/%ebx, %rcx/%ecx, %rdx/%edx, %rsi/%esi, %rdi/%edi, %rsp/%esp, %rbp/%ebp, plus the new %r8/%r8d through %r15/%r15d.

  45. More vector registers XMM0-XMM7 128-bit SSE registers prior to x86-64 XMM8-XMM15 Additional 8 128-bit registers

  46. 64-bit instructions. All 32-bit instructions have quad-word equivalents; use the suffix 'q' to denote them: movq $0x4,%rax and addq %rcx,%rax. Exception for stack operations (pop, push, call, ret, enter, leave): these are implicitly 64 bit, the 32-bit versions are not valid, and 32-bit values are zero-extended to 64 bits.

  47. Modified calling convention. Previously: function parameters were pushed onto the stack, with frame pointer management and update; a lot of memory operations and overhead! x86-64: use registers to pass function parameters. %rdi, %rsi, %rdx, %rcx, %r8, %r9 are used for the argument build; %xmm0-%xmm7 for floating point arguments; use the stack if there are more than 6 parameters. Avoid frame management when possible: simple functions do not incur frame management overhead. The kernel interface also uses registers for parameters: %rdi, %rsi, %rdx, %r10, %r8, %r9. Callee-saved registers: %rbp, %rbx, and %r12 through %r15. All references to the stack frame go via the stack pointer, which eliminates the need to update %ebp/%rbp.

  48. x86-64 Integer Registers.
  %rax: return value    %r8: argument #5
  %rbx: callee saved    %r9: argument #6
  %rcx: argument #4     %r10: callee saved
  %rdx: argument #3     %r11: used for linking
  %rsi: argument #2     %r12: C: callee saved
  %rdi: argument #1     %r13: callee saved
  %rsp: stack pointer   %r14: callee saved
  %rbp: callee saved    %r15: callee saved

  49. x86-64 Long Swap.
  void swap(long *xp, long *yp)
  {
      long t0 = *xp;
      long t1 = *yp;
      *xp = t1;
      *yp = t0;
  }
  swap:
      movq (%rdi), %rdx
      movq (%rsi), %rax
      movq %rax, (%rdi)
      movq %rdx, (%rsi)
      ret
  Operands are passed in registers: first (xp) in %rdi, second (yp) in %rsi; 64-bit pointers. No stack operations are required (except ret): all local information is held in registers, avoiding the stack entirely.

  50. x86-64 Locals in the Red Zone.
  /* Swap, using local array */
  void swap_a(long *xp, long *yp)
  {
      volatile long loc[2];
      loc[0] = *xp;
      loc[1] = *yp;
      *xp = loc[1];
      *yp = loc[0];
  }
  swap_a:
      movq (%rdi), %rax
      movq %rax, -24(%rsp)
      movq (%rsi), %rax
      movq %rax, -16(%rsp)
      movq -16(%rsp), %rax
      movq %rax, (%rdi)
      movq -24(%rsp), %rax
      movq %rax, (%rsi)
      ret
  Avoiding a stack pointer change: the compiler manages the stack frame without changing %rsp, allocating a window beyond the stack pointer. Layout relative to %rsp: return pointer at %rsp, unused at -8, loc[1] at -16, loc[0] at -24.
