Vector Unit Assembly


Overview

  • Architecture Review
  • VU0 Macro Mode Instruction Set
  • Building a Vector Library

Review

  • The PlayStation 2 has two vector units that are similar but not identical
  • VU0 is the CPU’s alternate processing unit
  • VU1 is the GS’s alternate processing unit
  • Each unit has a direct pipeline to its respective processor
  • Vector units are designed for 4D vectors of 32-bit floats
  • VU0 and VU1 each have access to 32 float registers and 16 integer registers
  • Float registers are not like PC registers; they are 128 bits in size (a PC register is 32 bits)
  • 128 bits can hold 4 float values at once (a 4D vector)
  • Integer registers are typically used as loop counters and address calculators




(Diagram: VU0 bus connections, including the shared bus)

  • VU0 has two bus lines
  • One bus is dedicated to the CPU
  • The other bus is used to communicate with all other devices
  • VU0 has 4 KB of cache
Vector Unit Processing Speed
  • The graph shows the processing speed of some vector-math-intensive functions
  • 200,000 calls were made to each function
Macro and Micro Modes
  • Vector Unit Zero (VU0) has two modes
  • Micro mode lets the vector unit act as an independent processor
    • A mini program is uploaded and executed in parallel with the main CPU
  • Macro mode lets the CPU offload heavy vector computation directly, with low overhead
    • Most popular method, hands down.
Micro Mode
  • Once uploaded, the micro program executes independently of the CPU
    • This means we must time our execution so that the CPU fetches the result only after the Vector Unit has completed the program
    • Micro mode can cause serious stalls and timing issues, since execution time is nearly impossible to predict
Macro Mode
  • Macro mode is a much easier way to execute fast math routines
  • Assembly can be written as inline instructions, telling the compiler to offload the math to VU0
  • Notes
    • Just because it’s in assembly does not mean it will be faster
    • Switching CPU focus has its overhead
Assembly Structure
  • Assembly routines typically follow a specific pattern
    • Load the variable data/addresses into registers
    • Apply vector computations to those registers
    • Store the result back to a variable address
  • The overhead of using assembly is in the load and store
  • Make sure the computation stage improves performance enough to offset the load/store overhead
Vector Unit MIPS Instructions
  • Coprocessor Transfer Instructions
    • Store / Load
  • Coprocessor Branch Instructions
  • Macro (primitive) calculation instructions
    • Add / Subtract / Multiply / Divide / etc.
  • Micro subroutine execution instructions

(VU Macro Instructions)

  • Adding two vectors using the EE Core (CPU)

// (Vec4T *v0, Vec4T *v1, Vec4T *v2)
v2->x = v0->x + v1->x;
v2->y = v0->y + v1->y;
v2->z = v0->z + v1->z;
v2->w = v0->w + v1->w;


  • Adding two vectors using the VU0

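// (Vec4T *v0, Vec4T *v1, Vec4T *v2)
asm __volatile__ (
    "lqc2      vf05, 0x0(%0)\n"        // load v0 into vf05
    "lqc2      vf06, 0x0(%1)\n"        // load v1 into vf06
    "vadd.xyzw vf07, vf05, vf06\n"     // add all four fields
    "sqc2      vf07, 0x0(%2)\n"        // store the result to v2
    : // No Output
    : "r" (v0), "r" (v1), "r" (v2)
    : "memory");                       // sqc2 writes through v2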



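  • Cross product using the EE Core (CPU): notice how we must use a temp, because cross may point to the same vector as v1 or v2

// (Vec4T *v1, Vec4T *v2, Vec4T *cross)
Vec4T temp;
temp.x = v1->y * v2->z - v1->z * v2->y;
temp.y = v1->z * v2->x - v1->x * v2->z;
temp.z = v1->x * v2->y - v1->y * v2->x;
temp.w = 0.0f;  // match the VU0 version, which sets w = 0
VectorCopy(&temp, cross);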



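  • Cross product using the VU0

// (Vec4T *v1, Vec4T *v2, Vec4T *cross)
asm __volatile__ (
    "lqc2        vf05, 0x0(%0)\n"       // load v1 into vf05
    "lqc2        vf06, 0x0(%1)\n"       // load v2 into vf06
    "vopmula.xyz ACC, vf05, vf06\n"     // first
    "vopmsub.xyz vf06, vf06, vf05\n"    // - second
    "vsub.w      vf06, vf00, vf00\n"    // w = 0
    "sqc2        vf06, 0x0(%2)\n"       // store the result to cross
    : // No Output
    : "r" (v1), "r" (v2), "r" (cross)
    : "memory");                        // sqc2 writes through cross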



Vector Outer Product

  • The vopmula instruction performs an outer product
  • The result is stored in the special-purpose ACC register

(Diagram: the X, Y, and Z fields of VF05 and VF06)

For Next Time

Read Chapters 7.3.2 – 7.4.2

Read Chapter 9.3