Lecture 18

SSE Instructions and MP4

Ryan Chmiel

- SSE Instruction Overview
- Code examples using SSE instructions
- MP4

- Streaming SIMD Extensions (SSE) define a new architecture for floating-point operations
- Introduced with the Pentium III in early 1999
- Pentium III includes floating point, MMX technology, and XMM registers

- Use eight new 128-bit wide general-purpose registers (XMM0 - XMM7)
- Operate on IEEE-754 single-precision 32-bit real numbers
- Support packed and scalar operations on the new packed single precision floating point data types

- Packed instructions operate vertically on all four pairs of floating-point data elements in parallel
- instructions have suffix ps for single-precision, e.g., addps

- Scalar instructions operate on the least-significant data elements of the two operands
- instructions have suffix ss for single-precision, e.g., addss
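
To make the packed/scalar distinction concrete, here is a minimal sketch in plain C that models the two forms of add. It is a scalar model for illustration only, not SSE code: the names model_addps and model_addss are made up for this example, and index 0 stands for the least-significant element of a register.

```c
#include <stddef.h>

/* Models addps dst, src: adds all four element pairs.
   Index 0 is the least-significant element. Hypothetical helper. */
static void model_addps(float dst[4], const float src[4]) {
    for (size_t i = 0; i < 4; ++i)
        dst[i] += src[i];
}

/* Models addss dst, src: adds only element 0.
   Elements 1..3 of dst are left unchanged. Hypothetical helper. */
static void model_addss(float dst[4], const float src[4]) {
    dst[0] += src[0];
}
```

With the same inputs, the packed form changes all four elements while the scalar form changes only the lowest one.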

- Data Movement
- movss, movups, movaps, etc.

- Arithmetic Instructions
- addps, mulss, divps, sqrtss, etc.

- Logical Instructions (only operate on packed data)
- xorps, orps, andnps, etc.

- Compare Instructions
- cmpss, cmpps, etc.

- Integer-Real Conversion Instructions
- cvtps2pi, cvtsi2ss, etc.

- Shuffle/Rearrange Instructions
- shufps, unpcklps, etc.

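Initial values, most-significant element first (each instruction below starts from these values):

XMM1    4.0  3.0  2.0  1.0
XMM2    5.0  6.0  7.0  8.0

- movups xmm3, xmm1        xmm3 = 4.0 3.0 2.0 1.0
- movss xmm1, xmm2         xmm1 = 4.0 3.0 2.0 8.0
- movss xmm3, [RealOne]    xmm3 = 0.0 0.0 0.0 1.0
- addps xmm1, xmm2         xmm1 = 9.0 9.0 9.0 9.0
- subss xmm2, xmm1         xmm2 = 5.0 6.0 7.0 7.0
- xorps xmm2, xmm2         xmm2 = 0.0 0.0 0.0 0.0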
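
The two movss results above differ in their upper elements: the register-to-register form leaves the upper three elements of the destination alone, while the load from memory zeroes them. A minimal C sketch of that difference follows. The function names are hypothetical, and index 0 is the least-significant element.

```c
/* Models movss xmm, xmm: copies element 0 only.
   The upper three elements of dst are preserved. Hypothetical helper. */
static void model_movss_reg(float dst[4], const float src[4]) {
    dst[0] = src[0];
}

/* Models movss xmm, [mem]: loads one float into element 0
   and clears elements 1..3 to zero. Hypothetical helper. */
static void model_movss_load(float dst[4], const float *mem) {
    dst[0] = *mem;
    dst[1] = 0.0f;
    dst[2] = 0.0f;
    dst[3] = 0.0f;
}
```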

- The shufps instruction takes two floating-point numbers from each operand and combines them into a new four-element value
- The destination operand (xmmreg1) contributes to the lower two places in the result, and the source operand contributes to the upper two
- Usage:
- shufps xmmreg1, xmmreg2/mem128, imm8
- The first operand must be an xmm register
- The second operand can be either an xmm register or a memory location
- The third operand is a bit mask of four two-bit numbers that specifies which values from each operand you’ll be choosing. Each two-bit number is the index of an element within a register:

         127                  0
XMMREG    11    10    01    00

- The low two fields of the mask (bits 0-3) pick elements of the first operand for the lower two places of the result; the high two fields (bits 4-7) pick elements of the second operand for the upper two places

- That makes no sense - show me some examples!

Initial values, most-significant element first (each instruction below starts from these values):

XMM1    1.0  2.0  3.0  4.0
XMM2    5.0  6.0  7.0  8.0

- shufps xmm1, xmm2, 11001100b    xmm1 = 5.0 8.0 1.0 4.0
- shufps xmm1, xmm2, 01111001b    xmm1 = 7.0 5.0 2.0 3.0
- shufps xmm2, xmm1, 01111001b    xmm2 = 3.0 1.0 6.0 7.0
- shufps xmm1, xmm1, 10001100b    xmm1 = 2.0 4.0 1.0 4.0
- shufps xmm2, xmm2, 10010011b    xmm2 = 6.0 7.0 8.0 5.0
- shufps xmm1, xmm1, 00111001b    xmm1 = 4.0 1.0 2.0 3.0
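
The selection rule behind these examples can be sketched as a small C model. This is a scalar illustration under stated assumptions, not the instruction itself: model_shufps is a made-up name, and index 0 is the least-significant element, so the register shown as "1.0 2.0 3.0 4.0" is the array {4.0, 3.0, 2.0, 1.0}.

```c
#include <stdint.h>

/* Models shufps dst, src, imm8. Hypothetical helper.
   Mask bits 1:0 and 3:2 index into dst for result elements 0 and 1.
   Mask bits 5:4 and 7:6 index into src for result elements 2 and 3.
   A temporary is used so that dst and src may be the same array. */
static void model_shufps(float dst[4], const float src[4], uint8_t imm8) {
    float out[4];
    out[0] = dst[(imm8 >> 0) & 3];
    out[1] = dst[(imm8 >> 2) & 3];
    out[2] = src[(imm8 >> 4) & 3];
    out[3] = src[(imm8 >> 6) & 3];
    for (int i = 0; i < 4; ++i)
        dst[i] = out[i];
}
```

Calling model_shufps(x, x, 0x39) (mask 00111001b) rotates the four elements one place toward element 0, which is the trick used to walk through a register one value at a time.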

- The SSE instruction documentation is available as part of the Intel x86 Instruction Set Reference, found on the Resources page of the ECE 291 website
- Each instruction has a diagram that shows how the instruction manipulates the data and stores the result
- Unfortunately the reference is in alphabetical order, not in order of instruction type
- As previously mentioned, look for –ps and –ss suffixes to determine which instructions are SSE instructions
- Instructions suffixed with –pd and –sd operate on double-precision values and belong to the SSE2 instruction set, which was introduced with the Pentium 4.

- You cannot push and pop xmm registers the way you can with general-purpose registers. This means that if you’re using xmm0 in a function and you call another function that also uses xmm0, the value will be overwritten. Watch out for this!
- For some of the SSE instructions, such as the arithmetic and logical instructions, if you specify a memory location as the second operand, that memory location must be on a 16-byte boundary (its address must be a multiple of 16, so its last hex digit is 0). If it isn’t, you will get a general protection fault at runtime. To avoid this, first move the value at this memory location into an xmm register and use that xmm register as the second operand.
- To avoid more GPFs, use movups (move unaligned packed single-precision) instead of movaps (move aligned packed single-precision). movaps checks for a 16-byte boundary as mentioned above and will crash your program if the address does not lie on one.
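
The 16-byte boundary rule comes down to a test on the low four bits of the address. A minimal C sketch of that test follows. The helper name is_aligned16 is an assumption made for this example.

```c
#include <stdint.h>

/* Returns 1 if p lies on a 16-byte boundary, i.e. the low four
   bits of its address are all zero. Hypothetical helper. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p & 0xF) == 0;
}
```

An address that passes this test is safe for movaps and for memory operands of the arithmetic and logical instructions; any other address needs movups.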

Variable1  dd 4.5, 32.0, -16.123, 291.0
Variable2  dd 0.0
…
        movups  xmm0, [Variable1]
        xorps   xmm1, xmm1
        mov     ecx, 4
.Loop:
        addss   xmm1, xmm0
        shufps  xmm0, xmm0, 00111001b
        loop    .Loop
        movss   [Variable2], xmm1

- What does this do?

- It sums the four numbers stored in [Variable1] and stores the result to [Variable2]
- Here’s the main part of the code again - is this the most efficient way to perform this operation?
.Loop:
        addss   xmm1, xmm0
        shufps  xmm0, xmm0, 00111001b
        loop    .Loop
        movss   [Variable2], xmm1

- A. Yes
- B. No
- C. I don’t know
- D. I don’t care

- At least you’re honest… but B is the correct answer
- You have four numbers to add, and the addps instruction can add four pairs of floating point numbers at once
- Solution:
- Line up the four values making two pairs and add both pairs in parallel
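- Line up the two results into one pair and add that pair
        movups  xmm0, [Variable1]       ; xmm0 = v3 v2 v1 v0
        movups  xmm1, xmm0
        shufps  xmm1, xmm1, 00001110b   ; xmm1 = x x v3 v2 (upper two values do not matter)
        addps   xmm1, xmm0              ; xmm1 = x x v3+v1 v2+v0
        movups  xmm0, xmm1
        shufps  xmm0, xmm0, 00000001b   ; xmm0 = x x x v3+v1
        addss   xmm0, xmm1              ; lowest element = v3+v1+v2+v0
        movss   [Variable2], xmm0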

- What is wrong with the first approach?
- It does not take advantage of parallelism: it adds one number at a time, so the same code could just as well have been written with the regular FPU instructions

- What is the benefit of the second approach?
- It executes two adds and two shuffles instead of four of each every time the code is run. It does need extra register-to-register moves, but their cost is far outweighed by the removal of the other four instructions.
- It does not contain any loops or jumps
- This will cut down on total program running time

- Moral of the story: exploit parallelism whenever you can!
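
The two ways of summing the four values can be compared in a scalar C model. This is only a sketch of the order of operations, with hypothetical function names; it does not measure speed, and index 0 is the least-significant element.

```c
/* First approach: four scalar adds, rotating the register each time.
   Mirrors the addss / shufps 00111001b loop. Hypothetical helper. */
static float hsum_loop(const float v[4]) {
    float x[4] = {v[0], v[1], v[2], v[3]};
    float sum = 0.0f;                    /* xorps xmm1, xmm1 */
    for (int i = 0; i < 4; ++i) {
        sum += x[0];                     /* addss xmm1, xmm0 */
        float t = x[0];                  /* rotate toward element 0 */
        x[0] = x[1];
        x[1] = x[2];
        x[2] = x[3];
        x[3] = t;
    }
    return sum;
}

/* Second approach: one packed add forms two partial sums,
   then one scalar add combines them. Hypothetical helper. */
static float hsum_pairs(const float v[4]) {
    float low  = v[2] + v[0];            /* element 0 after addps */
    float high = v[3] + v[1];            /* element 1 after addps */
    return high + low;                   /* addss */
}
```

Because the two versions add the values in a different order, their results can differ in the last bits for inputs that are not exactly representable.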

        movups    xmm0, [Vector]
        movups    xmm1, xmm0
        mulps     xmm1, xmm1
        movups    xmm2, xmm1
        shufps    xmm2, xmm2, 00111001b
        addss     xmm1, xmm2
        shufps    xmm2, xmm2, 00111001b
        addss     xmm1, xmm2
        sqrtss    xmm1, xmm1
        unpcklps  xmm1, xmm1
        unpcklps  xmm1, xmm1
        divps     xmm0, xmm1
        movups    [Vector], xmm0

- So now what does this do?

        movups    xmm0, [Vector]          ; 0.0 Vz Vy Vx
        movups    xmm1, xmm0
        mulps     xmm0, xmm0              ; 0.0 Vz*Vz Vy*Vy Vx*Vx
        movups    xmm2, xmm0
        shufps    xmm2, xmm2, 00111001b   ; xxxxxxx 0.0 Vz*Vz Vy*Vy
        addss     xmm0, xmm2              ; xxxxxxx xxxxxxx xxxxxxx Vx*Vx+Vy*Vy
        shufps    xmm2, xmm2, 00111001b   ; xxxxxxx xxxxxxx 0.0 Vz*Vz
        addss     xmm0, xmm2              ; xxxxxxx xxxxxxx xxxxxxx Vx*Vx+Vy*Vy+Vz*Vz
        sqrtss    xmm0, xmm0              ; xxxxxxx xxxxxxx xxxxxxx sqrt
        unpcklps  xmm0, xmm0              ; xxxxxxx xxxxxxx sqrt sqrt
        unpcklps  xmm0, xmm0              ; sqrt sqrt sqrt sqrt
        divps     xmm1, xmm0              ; 0.0 Vz/sqrt Vy/sqrt Vx/sqrt
        movups    [Vector], xmm1

- It normalizes a vector and overwrites the vector with its normalization
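
The least obvious step in this routine is the pair of unpcklps instructions that copies the square root into all four elements. A minimal C model of that step follows, with a hypothetical function name and index 0 as the least-significant element.

```c
/* Models unpcklps dst, src: interleaves the low two elements of
   dst and src. Result, from element 0 up: dst[0], src[0], dst[1], src[1].
   A temporary is used so that dst and src may be the same array.
   Hypothetical helper. */
static void model_unpcklps(float dst[4], const float src[4]) {
    float out[4] = {dst[0], src[0], dst[1], src[1]};
    for (int i = 0; i < 4; ++i)
        dst[i] = out[i];
}
```

Applying it to a register with itself duplicates element 0 into elements 0 and 1; applying it a second time fills all four elements with the same value, ready for the packed divide.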

- For some reason, running make on their programs is taking many people around five minutes
- A few can’t get it to work at all - make times out and just sits there
- When I do the same thing it takes 15-20 seconds
- This doesn’t make any sense!
- We’re looking into the problem and hope to have it fixed ASAP
- Now, onto the writeup