Vector Unit Assembly


Overview

  • Architecture Review

  • VU0 Macro Mode Instruction Set

  • Building a Vector Library


Review

  • PlayStation 2 has two vector units that are similar but not identical

  • VU0 is the CPU’s alternate processing unit

  • VU1 is the GS’s alternate processing unit

  • Each unit has a direct pipeline to its respective processor

  • Vector units are designed for 4D vectors of 32-bit floats


  • VU0 and VU1 each have access to 32 float registers and 16 integer registers

  • Float registers are not like PC registers; they are 128 bits in size (PC is 32-bit)

  • 128 bits can hold 4 float values at once (a 4D vector)

  • Integer registers are typically used as loop counters and address calculators
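
The 128-bit register size can be mirrored on the C side. The following is a minimal sketch, assuming that Vec4T (the type used in the later code slides, whose definition is never shown) is four packed 32-bit floats named x, y, z and w. The 16-byte alignment and the helper names vec4_bits and vec4_is_aligned are illustrative assumptions, chosen because quadword transfers work on 16-byte units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: four packed 32-bit floats, aligned like a 128-bit register.
   The slides use Vec4T but do not show its definition. */
typedef struct Vec4T {
    _Alignas(16) float x;   /* C11 alignment specifier on the first member */
    float y, z, w;
} Vec4T;

/* Size of the type in bits (assumes 8-bit bytes);
   128 means it fills one VU float register exactly. */
static size_t vec4_bits(void)
{
    return sizeof(Vec4T) * 8;
}

/* Returns 1 when v sits on a 16-byte (quadword) boundary, 0 otherwise. */
static int vec4_is_aligned(const Vec4T *v)
{
    assert(v != NULL);
    return ((uintptr_t)v % 16u) == 0;
}
```

With this layout, a Vec4T declared as a local or global variable is placed on a 16-byte boundary by the compiler; memory obtained from a general-purpose allocator may need an explicit aligned allocation.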




[Diagram: VU0 bus connections, showing the shared bus]







  • VU0 has two bus lines

  • One bus is dedicated to the CPU

  • The other bus is used to communicate with all other devices

  • VU0 has 4 KB of cache ($)

Vector Unit Processing Speed

  • The graph shows some vector-math intensive function calls

  • 200K calls were made to each function

Macro and Micro Modes

  • Vector Unit Zero (VU0) has two modes

  • Micro mode is a mode that allows your vector processor to act as an independent CPU

    • A mini program is uploaded and executed in parallel to the main CPU

  • Macro mode allows your CPU to directly offload heavy vector computation with low overhead

    • Most popular method, hands down.

Micro Mode

  • When uploaded, the micro program executes independently of the CPU

    • This means we must time our execution so that the CPU fetches the result only after the Vector Unit has completed the program

    • Micro mode causes serious stalls and timing issues, since execution time is nearly impossible to determine

Macro Mode

  • Macro mode is a much easier method of executing fast math functionality

  • Assembly can be used as inline instructions, telling the compiler to offload the math to VU0

  • Notes

    • Just because it’s in assembly does not mean it will be faster

    • Switching CPU focus has its overhead

Assembly Structure

  • There is typically a specific method to writing assembly routines

    • Load the variable data/addresses to registers

    • Apply vector computations to those registers

    • Store the result back into a variable address

  • Overhead of using assembly is in the load and store

  • Make sure that the computation stage will improve performance enough to offset the load/store overhead
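
The load / compute / store pattern, and why the computation stage has to pay for the transfers, can be sketched in portable C. This is an emulation for illustration only, not VU0 code: VReg, vreg_load, vreg_store, vreg_add, vec4_add3 and the g_transfers counter are hypothetical names, and Vec4T is assumed to be four floats. The counter simply tallies how many loads and stores each routine performs.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } Vec4T;  /* assumed layout */
typedef struct { float f[4]; } VReg;         /* stand-in for one 128-bit VF register */

static int g_transfers = 0;  /* number of emulated loads and stores so far */

/* Emulated quadword load (the role lqc2 plays in macro mode). */
static VReg vreg_load(const Vec4T *src)
{
    assert(src != NULL);
    VReg r = {{ src->x, src->y, src->z, src->w }};
    g_transfers++;
    return r;
}

/* Emulated quadword store (the role sqc2 plays). */
static void vreg_store(Vec4T *dst, VReg r)
{
    assert(dst != NULL);
    dst->x = r.f[0]; dst->y = r.f[1]; dst->z = r.f[2]; dst->w = r.f[3];
    g_transfers++;
}

/* Emulated four-lane add (the role vadd.xyzw plays). */
static VReg vreg_add(VReg a, VReg b)
{
    VReg r;
    for (int i = 0; i < 4; i++)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* out = a + b: two loads and one store for a single add. */
static void vec4_add(const Vec4T *a, const Vec4T *b, Vec4T *out)
{
    vreg_store(out, vreg_add(vreg_load(a), vreg_load(b)));
}

/* out = a + b + c: the intermediate sum stays in a register,
   so two adds cost three loads and one store. */
static void vec4_add3(const Vec4T *a, const Vec4T *b, const Vec4T *c, Vec4T *out)
{
    VReg sum = vreg_add(vreg_load(a), vreg_load(b));
    sum = vreg_add(sum, vreg_load(c));
    vreg_store(out, sum);
}
```

Summing three vectors with two vec4_add calls performs six transfers in this model, while vec4_add3 performs four for the same arithmetic. Keeping intermediates in registers is one way to make the computation stage large enough to offset the load/store cost.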

Vector Unit MIPS Instructions

  • Coprocessor Transfer Instructions

    • Store / Load

  • Coprocessor Branch Instructions

  • Macro (primitive) calculation instructions

    • Add / Subtract / Multiply / Divide / etc.

  • Micro subroutine execution instructions

    (VU Macro Instructions)


  • Adding two vectors using the EE Core (CPU)

    // (Vec4T *v0, Vec4T *v1, Vec4T *v2)
    v2->x = v0->x + v1->x;
    v2->y = v0->y + v1->y;
    v2->z = v0->z + v1->z;
    v2->w = v0->w + v1->w;



  • Adding two vectors using the VU0

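    // (Vec4T *v0, Vec4T *v1, Vec4T *v2)
    asm __volatile__ (
        "lqc2      vf05, 0x0(%0)     \n"   // load v0 into vf05
        "lqc2      vf06, 0x0(%1)     \n"   // load v1 into vf06
        "vadd.xyzw vf07, vf05, vf06  \n"   // vf07 = vf05 + vf06 (all four lanes)
        "sqc2      vf07, 0x0(%2)     \n"   // store vf07 to v2
        : // No Output
        : "r" (v0), "r" (v1), "r" (v2)
        : "memory"                         // sqc2 writes through v2
    );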




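  • Cross product using the EE Core (CPU). Notice how we must use a temp, because cross may point to the same vector as v1 or v2

    // (Vec4T *v1, Vec4T *v2, Vec4T *cross)
    Vec4T temp;
    temp.x = v1->y * v2->z - v1->z * v2->y;
    temp.y = v1->z * v2->x - v1->x * v2->z;
    temp.z = v1->x * v2->y - v1->y * v2->x;
    temp.w = 0.0f;   // w = 0, so the copy below never reads an uninitialized value
    VectorCopy(&temp, cross);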
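
Why the temp matters can be shown with a short sketch in plain C. It assumes Vec4T is four floats, and the function names cross_no_temp and cross_with_temp are hypothetical. A struct assignment stands in for VectorCopy, whose definition is not shown in the slides.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } Vec4T;  /* assumed layout */

/* Writes straight into *cross. If cross points at v1 or v2, later lines
   read components that earlier lines have already overwritten. */
static void cross_no_temp(const Vec4T *v1, const Vec4T *v2, Vec4T *cross)
{
    assert(v1 != NULL && v2 != NULL && cross != NULL);
    cross->x = v1->y * v2->z - v1->z * v2->y;
    cross->y = v1->z * v2->x - v1->x * v2->z;  /* v1->x is stale when cross == v1 */
    cross->z = v1->x * v2->y - v1->y * v2->x;
}

/* Computes into a local first, so the inputs stay intact until the final copy. */
static void cross_with_temp(const Vec4T *v1, const Vec4T *v2, Vec4T *cross)
{
    assert(v1 != NULL && v2 != NULL && cross != NULL);
    Vec4T temp;
    temp.x = v1->y * v2->z - v1->z * v2->y;
    temp.y = v1->z * v2->x - v1->x * v2->z;
    temp.z = v1->x * v2->y - v1->y * v2->x;
    temp.w = 0.0f;
    *cross = temp;  /* stands in for VectorCopy(&temp, cross) */
}
```

Calling either function with cross equal to v1, for example with v1 = (1, 0, 0) and v2 = (0, 1, 0), shows the difference: cross_with_temp produces (0, 0, 1), while cross_no_temp clobbers v1->x in its first line and ends with a zero z component.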



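  • Cross product using the VU0

    // (Vec4T *v1, Vec4T *v2, Vec4T *cross)
    asm __volatile__ (
        "lqc2        vf05, 0x0(%0)     \n"   // load v1 into vf05
        "lqc2        vf06, 0x0(%1)     \n"   // load v2 into vf06
        "vopmula.xyz ACC,  vf05, vf06  \n"   // first outer product, into ACC
        "vopmsub.xyz vf06, vf06, vf05  \n"   // ACC - second outer product
        "vsub.w      vf06, vf00, vf00  \n"   // w = 0
        "sqc2        vf06, 0x0(%2)     \n"   // store the result to cross
        : // No Output
        : "r" (v1), "r" (v2), "r" (cross)
        : "memory"                           // sqc2 writes through cross
    );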



Vector Outer Product

  • The vopmula instruction performs an outer product

  • The result is stored in the special-purpose ACC register
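
How the two-step outer product yields a cross product can be sketched with a plain C emulation. This is my reading of the instruction behavior, not code from the slides: vopmula writes the first set of products into ACC, and its companion instruction vopmsub subtracts a second set of products from ACC. The names emu_opmula, emu_opmsub, emu_cross and g_acc are hypothetical, and Vec4T is assumed to be four floats.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } Vec4T;  /* assumed layout */

static Vec4T g_acc;  /* stand-in for the ACC register */

/* Emulates vopmula.xyz ACC, fs, ft: the first half of the cross product. */
static void emu_opmula(const Vec4T *fs, const Vec4T *ft)
{
    assert(fs != NULL && ft != NULL);
    g_acc.x = fs->y * ft->z;
    g_acc.y = fs->z * ft->x;
    g_acc.z = fs->x * ft->y;
}

/* Emulates vopmsub.xyz fd, fs, ft: fd = ACC - products of fs and ft.
   Results go through a local so fd may alias fs or ft; fd->w is untouched. */
static void emu_opmsub(Vec4T *fd, const Vec4T *fs, const Vec4T *ft)
{
    assert(fd != NULL && fs != NULL && ft != NULL);
    Vec4T r;
    r.x = g_acc.x - fs->y * ft->z;
    r.y = g_acc.y - fs->z * ft->x;
    r.z = g_acc.z - fs->x * ft->y;
    fd->x = r.x; fd->y = r.y; fd->z = r.z;
}

/* Mirrors the register usage of the VU0 cross product sequence. */
static void emu_cross(const Vec4T *v1, const Vec4T *v2, Vec4T *out)
{
    assert(v1 != NULL && v2 != NULL && out != NULL);
    Vec4T vf05 = *v1;                 /* lqc2 vf05, 0x0(%0) */
    Vec4T vf06 = *v2;                 /* lqc2 vf06, 0x0(%1) */
    emu_opmula(&vf05, &vf06);         /* ACC = first products */
    emu_opmsub(&vf06, &vf06, &vf05);  /* vf06 = ACC - second products */
    vf06.w = 0.0f;                    /* vsub.w vf06, vf00, vf00 */
    *out = vf06;                      /* sqc2 vf06, 0x0(%2) */
}
```

For v1 = (1, 2, 3) and v2 = (4, 5, 6), the first step leaves (12, 12, 5) in the emulated ACC and the second step subtracts (15, 6, 8), giving the cross product (-3, 6, -3).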

For Next Time

Read Chapters 7.3.2 – 7.4.2

Read Chapter 9.3
