Lecture 18

SSE Instructions and MP4

Ryan Chmiel

Lecture outline

  • SSE Instruction Overview

  • Code examples using SSE instructions

  • MP4

Streaming SIMD Extensions

  • Streaming SIMD Extensions (SSE) define a new SIMD (single instruction, multiple data) architecture for floating point operations

  • Introduced in Pentium III in March 1999

    • Pentium III includes floating point, MMX technology, and XMM registers

  • Use eight new 128-bit wide registers (XMM0 - XMM7), separate from the x87/MMX registers

  • Operate on IEEE-754 single-precision 32-bit real numbers

  • Support packed and scalar operations on the new packed single precision floating point data types

Streaming SIMD Extensions

  • Packed instructions operate vertically on all four pairs of floating point data elements in parallel

    • instructions have suffix ps for single-precision, e.g., addps

  • Scalar instructions operate on the least-significant data elements of the two operands

    • instructions have suffix ss for single-precision, e.g., addss
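The packed/scalar distinction can be modeled in plain C. This is a minimal sketch, not SSE code: xmm_t, addps and addss are hypothetical helpers named after the instructions they imitate, and e[0] stands for the least-significant element of a register.

```c
#include <stddef.h>

/* Plain-C model of one XMM register; e[0] is the least-significant element. */
typedef struct { float e[4]; } xmm_t;

/* Models "addps dst, src": four independent additions, one per element. */
xmm_t addps(xmm_t dst, xmm_t src) {
    for (size_t i = 0; i < 4; i++)
        dst.e[i] += src.e[i];
    return dst;
}

/* Models "addss dst, src": only element 0 is added; dst keeps elements 1-3. */
xmm_t addss(xmm_t dst, xmm_t src) {
    dst.e[0] += src.e[0];
    return dst;
}
```

Calling addps on two registers changes all four elements of the result, while addss changes only e[0] and passes the other three elements of the destination through untouched.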

Categories of SSE Instructions

  • Data Movement

    • movss, movups, movaps, etc.

  • Arithmetic Instructions

    • addps, mulss, divps, sqrtss, etc.

  • Logical Instructions (only operate on packed data)

    • xorps, orps, andnps, etc.

  • Compare Instructions

    • cmpss, cmpps, etc.

  • Integer-Real Conversion Instructions

    • cvtps2pi, cvtsi2ss, etc.

  • Shuffle/Rearrange Instructions

    • shufps, unpcklps, etc.

SSE Instruction Examples

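Each example starts from these register values (most-significant element on the left):

XMM1 = 4.0 3.0 2.0 1.0
XMM2 = 5.0 6.0 7.0 8.0

  • movups xmm3, xmm1 ; xmm3 = 4.0 3.0 2.0 1.0

  • movss xmm1, xmm2 ; xmm1 = 4.0 3.0 2.0 8.0

  • movss xmm3, [RealOne] ; xmm3 = 0.0 0.0 0.0 1.0

  • addps xmm1, xmm2 ; xmm1 = 9.0 9.0 9.0 9.0

  • subss xmm2, xmm1 ; xmm2 = 5.0 6.0 7.0 7.0

  • xorps xmm2, xmm2 ; xmm2 = 0.0 0.0 0.0 0.0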
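The two movss examples behave differently: the register-to-register form leaves the upper three elements of the destination alone, while the load from memory clears them. A minimal plain-C sketch of that difference follows; xmm_t, movss_reg and movss_load are hypothetical names for illustration, and e[0] is the least-significant element.

```c
/* Plain-C model of one XMM register; e[0] is the least-significant element. */
typedef struct { float e[4]; } xmm_t;

/* Models "movss xmm, xmm": copy element 0 only; dst keeps elements 1-3. */
xmm_t movss_reg(xmm_t dst, xmm_t src) {
    dst.e[0] = src.e[0];
    return dst;
}

/* Models "movss xmm, [mem]": load one float into element 0, zero the rest. */
xmm_t movss_load(const float *mem) {
    xmm_t r = {{*mem, 0.0f, 0.0f, 0.0f}};
    return r;
}
```

With the register values from the examples, movss_reg reproduces 4.0 3.0 2.0 8.0 and movss_load of a variable holding 1.0 reproduces 0.0 0.0 0.0 1.0.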


SHUFPS

  • shufps (shuffle packed single-precision) takes two floating point numbers from each operand and combines the four into a new value

  • The destination operand (xmmreg1) contributes to the lower two places in the result, and the source operand contributes to the upper two

  • Usage:

    • shufps xmmreg1, xmmreg2/mem128, imm8

    • The first operand must be an xmm register

    • The second operand can either be an xmm register or a memory location

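    • The third operand is an 8-bit mask of four two-bit selectors. Each selector is the index (00b to 11b) of the element to copy; the low two selectors choose from xmmreg1 and the high two choose from xmmreg2/mem128. Element indices within a register:

      bit 127              bit 0
      XMMREG   11   10   01   00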

  • That makes no sense - show me some examples!

SHUFPS Examples

XMM1 = 1.0 2.0 3.0 4.0
XMM2 = 5.0 6.0 7.0 8.0

  • shufps xmm1, xmm2, 11001100b ; xmm1 = 5.0 8.0 1.0 4.0

  • shufps xmm1, xmm2, 01111001b ; xmm1 = 7.0 5.0 2.0 3.0

  • shufps xmm2, xmm1, 01111001b ; xmm2 = 3.0 1.0 6.0 7.0

  • shufps xmm1, xmm1, 10001100b ; xmm1 = 2.0 4.0 1.0 4.0

  • shufps xmm2, xmm2, 10010011b ; xmm2 = 6.0 7.0 8.0 5.0

  • shufps xmm1, xmm1, 00111001b ; xmm1 = 4.0 1.0 2.0 3.0
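The way the mask is decoded can be sketched in plain C. This is an illustrative model only: xmm_t and the shufps function are hypothetical helpers, and e[0] is the least-significant element (the rightmost value in the examples above).

```c
#include <assert.h>

/* Plain-C model of one XMM register; e[0] is the least-significant element. */
typedef struct { float e[4]; } xmm_t;

/* Models "shufps dst, src, imm8".
   Mask bits 1:0 and 3:2 select the two low result elements from dst;
   mask bits 5:4 and 7:6 select the two high result elements from src. */
xmm_t shufps(xmm_t dst, xmm_t src, unsigned imm8) {
    xmm_t r;
    assert(imm8 <= 0xFFu);            /* the mask is an 8-bit immediate */
    r.e[0] = dst.e[(imm8 >> 0) & 3];
    r.e[1] = dst.e[(imm8 >> 2) & 3];
    r.e[2] = src.e[(imm8 >> 4) & 3];
    r.e[3] = src.e[(imm8 >> 6) & 3];
    return r;
}
```

For the first example, the mask 11001100b is 0xCC: the selectors are 00 and 11 for the low half (elements 0 and 3 of XMM1, giving 4.0 and 1.0) and 00 and 11 for the high half (elements 0 and 3 of XMM2, giving 8.0 and 5.0), which reads 5.0 8.0 1.0 4.0 from left to right.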

SSE Instruction Reference

  • Available as part of the Intel x86 Instruction Set Reference found on the Resources page of ECE 291 website

  • Each instruction has a diagram that shows how the instruction manipulates the data and stores the result

  • Unfortunately the reference is in alphabetical order, not in order of instruction type

  • As previously mentioned, look for –ps and –ss suffixes to determine which instructions are SSE instructions

  • Instructions suffixed with –pd and –sd operate on double-precision values and belong to the SSE2 instruction set, which was introduced with the Pentium 4.

SSE Instruction Caveats

  • You cannot push and pop xmm registers like you can do to general purpose registers. This means if you’re using xmm0 in a function, and you call another function that also uses xmm0, the value will be overwritten. Watch out for this!

  • For some of the SSE instructions, such as the arithmetic and logical instructions, if you specify a memory location as the second operand, that memory location must be on a 16-byte boundary (its address must end in the hex digit 0, i.e. its low four bits must be zero). If it doesn’t, you will get a general protection fault at runtime. To avoid this, always move the value at this memory location into an xmm register and use that xmm register as the second operand.

  • To avoid more GPFs, use movups (move unaligned packed single-precision) instead of movaps (move aligned packed single-precision). movaps checks for a 16-byte boundary as mentioned above and will crash your program if the address does not lie on one.
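The 16-byte boundary rule comes down to a test on the low four bits of the address. A small C sketch of that test follows; is_aligned16 and align_up16 are hypothetical helper names used only for illustration.

```c
#include <stdint.h>

/* Returns 1 when addr lies on a 16-byte boundary (low four bits are zero). */
int is_aligned16(uintptr_t addr) {
    return (addr & 0xF) == 0;
}

/* Rounds addr up to the next 16-byte boundary (unchanged if already aligned). */
uintptr_t align_up16(uintptr_t addr) {
    return (addr + 15) & ~(uintptr_t)15;
}
```

An address such as 1230h passes the check, while 1234h does not and rounds up to 1240h. Data that fails the check is what movaps and memory operands of the arithmetic instructions would fault on.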

SSE Coding Example 1

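Variable1 dd 4.5, 32.0, -16.123, 291.0

movups xmm0, [Variable1]
xorps xmm1, xmm1
mov ecx, 4

.sumloop:                        ; label name assumed
addss xmm1, xmm0
shufps xmm0, xmm0, 00111001b
loop .sumloop                    ; repeat for all four elements (branch assumed)

movss [Variable2], xmm1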

  • What does this do?

SSE Coding Example 1

  • It sums the four numbers stored in [Variable1] and stores the result to [Variable2]

  • Here’s the main part of the code again - is this the most efficient way to perform this operation?


    .sumloop:                        ; label name assumed
    addss xmm1, xmm0
    shufps xmm0, xmm0, 00111001b
    loop .sumloop                    ; repeat for all four elements (branch assumed)

    movss [Variable2], xmm1

    • A. Yes

    • B. No

    • C. I don’t know

    • D. I don’t care

SSE Coding Example 1

  • At least you’re honest… but B is the correct answer

  • You have four numbers to add, and the addps instruction can add four pairs of floating point numbers at once

  • Solution:

    • Line up the four values making two pairs and add both pairs in parallel

    • Line up the two results into one pair and add that pair

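      ; Variable1 holds a, b, c, d in memory order, so xmm0 = d c b a
      movups xmm0, [Variable1]
      movups xmm1, xmm0
      shufps xmm1, xmm1, 00001110b ; xmm1 = x x d c (upper two values do not matter)
      addps xmm1, xmm0             ; xmm1 = x x b+d a+c
      movups xmm0, xmm1
      shufps xmm0, xmm0, 00000001b ; xmm0 = x x x b+d
      addss xmm0, xmm1             ; low element = (b+d) + (a+c)
      movss [Variable2], xmm0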
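Both versions of the sum can be traced step by step with a plain-C model. This is a sketch for checking the shuffle masks, not SSE code: xmm_t, shufps, addps, sum_loop and sum_pairwise are hypothetical helpers, and e[0] is the least-significant element.

```c
#include <stddef.h>

/* Plain-C model of one XMM register; e[0] is the least-significant element. */
typedef struct { float e[4]; } xmm_t;

/* Models "shufps dst, src, imm8": low two results from dst, high two from src. */
xmm_t shufps(xmm_t dst, xmm_t src, unsigned imm8) {
    xmm_t r;
    r.e[0] = dst.e[(imm8 >> 0) & 3];
    r.e[1] = dst.e[(imm8 >> 2) & 3];
    r.e[2] = src.e[(imm8 >> 4) & 3];
    r.e[3] = src.e[(imm8 >> 6) & 3];
    return r;
}

/* Models "addps dst, src": element-wise addition of all four elements. */
xmm_t addps(xmm_t dst, xmm_t src) {
    for (size_t i = 0; i < 4; i++)
        dst.e[i] += src.e[i];
    return dst;
}

/* First approach: four scalar adds, rotating the register after each one. */
float sum_loop(const float v[4]) {
    xmm_t x0 = {{v[0], v[1], v[2], v[3]}};
    float acc = 0.0f;                 /* low element of xmm1 after xorps */
    for (int i = 0; i < 4; i++) {
        acc += x0.e[0];               /* addss xmm1, xmm0 */
        x0 = shufps(x0, x0, 0x39);    /* shufps xmm0, xmm0, 00111001b */
    }
    return acc;
}

/* Second approach: one packed add of two pairs, then one scalar add. */
float sum_pairwise(const float v[4]) {
    xmm_t x0 = {{v[0], v[1], v[2], v[3]}};
    xmm_t x1 = shufps(x0, x0, 0x0E);  /* shufps xmm1, xmm1, 00001110b */
    x1 = addps(x1, x0);               /* low two: v[2]+v[0], v[3]+v[1] */
    x0 = shufps(x1, x1, 0x01);        /* shufps xmm0, xmm0, 00000001b */
    return x0.e[0] + x1.e[0];         /* addss xmm0, xmm1 */
}
```

The two functions add the same four values in a different order, so for inputs that are not exactly representable the results may differ in the last bits; that is a general property of floating point addition rather than of SSE.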

SSE Coding Example 1

  • What is wrong with the first approach?

    • It does not take advantage of parallelism - this code can be written using the regular FPU instructions

  • What is the benefit of the second approach?

    • It saves two add and two shuffle instructions each time the code is run. It does, however, add a move instruction, but this addition is far outweighed by the removal of the other four instructions.

    • It does not contain any loops or jumps

    • This will cut down on total program running time

  • Moral of the story: exploit parallelism whenever you can!

SSE Coding Example 2

movups xmm0, [Vector]
movups xmm1, xmm0
mulps xmm1, xmm1
movups xmm2, xmm1
shufps xmm2, xmm2, 00111001b
addss xmm1, xmm2
shufps xmm2, xmm2, 00111001b
addss xmm1, xmm2
sqrtss xmm1, xmm1
unpcklps xmm1, xmm1
unpcklps xmm1, xmm1
divps xmm0, xmm1
movups [Vector], xmm0

  • So now what does this do?

SSE Coding Example 2

movups xmm0, [Vector]         ; 0.0 Vz Vy Vx
movups xmm1, xmm0
mulps xmm0, xmm0              ; 0.0 Vz*Vz Vy*Vy Vx*Vx
movups xmm2, xmm0
shufps xmm2, xmm2, 00111001b  ; xxxxxxx 0.0 Vz*Vz Vy*Vy
addss xmm0, xmm2              ; xxxxxxx xxxxxxx xxxxxxx Vx*Vx+Vy*Vy
shufps xmm2, xmm2, 00111001b  ; xxxxxxx xxxxxxx 0.0 Vz*Vz
addss xmm0, xmm2              ; xxxxxxx xxxxxxx xxxxxxx Vx*Vx+Vy*Vy+Vz*Vz
sqrtss xmm0, xmm0             ; xxxxxxx xxxxxxx xxxxxxx sqrt
unpcklps xmm0, xmm0           ; xxxxxxx xxxxxxx sqrt sqrt
unpcklps xmm0, xmm0           ; sqrt sqrt sqrt sqrt
divps xmm1, xmm0              ; 0.0 Vz/sqrt Vy/sqrt Vx/sqrt
movups [Vector], xmm1

  • It normalizes a vector (divides each component by the vector's length) and overwrites the vector with the result
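The least obvious step in this example is the pair of unpcklps instructions, which copies the square root from the lowest element into all four. A plain-C sketch of that step follows; xmm_t, unpcklps and broadcast_low are hypothetical names for illustration, and e[0] is the least-significant element.

```c
/* Plain-C model of one XMM register; e[0] is the least-significant element. */
typedef struct { float e[4]; } xmm_t;

/* Models "unpcklps dst, src": interleave the low two elements of dst and src. */
xmm_t unpcklps(xmm_t dst, xmm_t src) {
    xmm_t r = {{dst.e[0], src.e[0], dst.e[1], src.e[1]}};
    return r;
}

/* Two self-unpacks in a row copy element 0 into every element. */
xmm_t broadcast_low(xmm_t x) {
    x = unpcklps(x, x);      /* e0 e0 e1 e1 (low to high) */
    return unpcklps(x, x);   /* e0 e0 e0 e0 */
}
```

After the first unpack the low two elements both hold the original e[0]; the second unpack interleaves those two copies with themselves, so all four elements match and divps can divide every component by the same length. A single shufps of the register with itself and a mask of 00000000b would give the same broadcast.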


MP4

  • For some reason it is taking many people around five minutes to build their programs with make

  • A few can’t get it to work at all - make times out and just sits there

  • When I do the same thing it takes 15-20 seconds

  • This doesn’t make any sense!

  • We’re looking into the problem and hope to have it fixed ASAP

  • Now, onto the writeup
