
Vector Unit Assembly



Overview

  • Architecture Review

  • VU0 Macro Mode Instruction Set

  • Building a Vector Library


Review

  • The PlayStation 2 has two vector units that are similar but not identical

  • VU0 is the CPU’s alternate processing unit

  • VU1 is the GS’s alternate processing unit

  • Each unit has a direct pipeline to its respective processor

  • Vector units are designed for 4D vectors of 32-bit components


  • VU0 and VU1 each have access to 32 float registers and 16 integer registers

  • Float registers are not like PC registers; they are 128 bits in size (a PC register is 32 bits)

  • 128 bits can hold 4 float values at once (a 4D vector)

  • Integer registers are typically used as loop counters and address calculators
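The register layout above maps naturally onto a four-float C struct. The following is a minimal sketch, assuming a C11 compiler and a hypothetical definition of `Vec4T` (the slides use that type name later but never show its definition). The 16-byte alignment is included because quadword loads such as `lqc2` expect quadword-aligned addresses.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical definition: the slides use Vec4T but never define it.
   Four 32-bit floats fill one 128-bit VU float register exactly. */
typedef struct {
    alignas(16) float x;  /* aligning the first member aligns the whole struct */
    float y;
    float z;
    float w;
} Vec4T;

/* Compile-time checks that the struct really is one aligned quadword. */
static_assert(sizeof(float) == 4, "expected 32-bit floats");
static_assert(sizeof(Vec4T) == 16, "Vec4T should be 128 bits");
static_assert(alignof(Vec4T) == 16, "Vec4T should be quadword aligned");

/* Number of float lanes in one register-sized vector. */
static size_t vec4_lane_count(void) {
    return sizeof(Vec4T) / sizeof(float);
}
```

If any of the static assertions fails on a given compiler, the struct would not line up with a single float register and the layout would need adjusting.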





[Diagram: VU0 bus layout with the shared bus]







  • VU0 has two bus lines

  • One bus is dedicated to the CPU

  • The other bus is used to communicate with all other devices

  • VU0 has 4 KB of cache ($)

Vector Unit Processing Speed

  • The graph shows some vector-math intensive function calls

  • 200K calls were made to each function
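A measurement like this can be structured as a simple timing loop. The sketch below is not the benchmark behind the graph. It is a hedged illustration in portable C, where `vec4_scale`, `time_vec4_scale`, and the `Vec4T` layout are all hypothetical names chosen for the example.

```c
#include <assert.h>
#include <time.h>

typedef struct { float x, y, z, w; } Vec4T;  /* hypothetical layout */

/* Example workload (hypothetical): scale every component by s. */
static void vec4_scale(const Vec4T *v, float s, Vec4T *out) {
    out->x = v->x * s;
    out->y = v->y * s;
    out->z = v->z * s;
    out->w = v->w * s;
}

/* Calls the workload `calls` times and returns the elapsed CPU time
   in seconds, as reported by the standard clock() function. */
static double time_vec4_scale(long calls, Vec4T *out) {
    const Vec4T v = {1.0f, 2.0f, 3.0f, 4.0f};
    clock_t start = clock();
    for (long i = 0; i < calls; ++i) {
        vec4_scale(&v, 2.0f, out);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_vec4_scale(200000, &result)` mirrors the 200K calls mentioned above. An optimizing compiler may collapse a loop this simple, so a real measurement would need a workload whose result depends on every iteration.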

Macro and Micro Modes

  • Vector Unit Zero (VU0) has two modes

  • Micro mode allows the vector unit to act as an independent processor

    • A mini program is uploaded and executed in parallel to the main CPU

  • Macro mode allows your CPU to directly offload heavy vector computation with low overhead

    • Most popular method, hands down.

Micro Mode

  • When uploaded, the micro program executes independently of the CPU

    • This means that we must time our execution so that the CPU fetches the result only after the Vector Unit has completed the program

    • Micro mode can cause serious stalls and timing issues, since execution time is nearly impossible to predict

Macro Mode

  • Macro mode is a much easier method of executing fast math functionality

  • Assembly can be used as inline instructions, telling the compiler to offload the math to VU0

  • Notes

    • Just because it’s in assembly does not mean it will be faster

    • Switching CPU focus has its overhead

Assembly Structure

  • Assembly routines typically follow a fixed pattern

    • Load the variable data/addresses to registers

    • Apply vector computations to those registers

    • Store the result back into a variable address

  • Overhead of using assembly is in the load and store

  • Make sure that the computation stage will improve performance enough to offset the load/store overhead
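The three stages above can be sketched in portable C. This is an illustration only: `VfReg`, `load_quad`, `store_quad`, and `vec4_madd` are hypothetical stand-ins for VU registers and quadword transfers, not real VU0 code, and the `Vec4T` layout is assumed.

```c
#include <assert.h>
#include <string.h>

typedef struct { float x, y, z, w; } Vec4T;  /* hypothetical layout */
typedef struct { float lane[4]; } VfReg;     /* stand-in for one 128-bit register */

/* Stage 1 helper: copy a whole vector from memory into a "register". */
static VfReg load_quad(const Vec4T *src) {
    VfReg r;
    memcpy(r.lane, src, sizeof r.lane);
    return r;
}

/* Stage 3 helper: copy a "register" back out to memory. */
static void store_quad(Vec4T *dst, VfReg r) {
    memcpy(dst, r.lane, sizeof r.lane);
}

/* Stage 2 helpers: per-lane arithmetic on register values. */
static VfReg mul_lanes(VfReg a, VfReg b) {
    for (int i = 0; i < 4; ++i) a.lane[i] *= b.lane[i];
    return a;
}

static VfReg add_lanes(VfReg a, VfReg b) {
    for (int i = 0; i < 4; ++i) a.lane[i] += b.lane[i];
    return a;
}

/* out = a * b + c per component: three loads, two computations, one store. */
static void vec4_madd(const Vec4T *a, const Vec4T *b, const Vec4T *c, Vec4T *out) {
    assert(a && b && c && out);
    VfReg ra = load_quad(a);                         /* 1. load    */
    VfReg rb = load_quad(b);
    VfReg rc = load_quad(c);
    VfReg result = add_lanes(mul_lanes(ra, rb), rc); /* 2. compute */
    store_quad(out, result);                         /* 3. store   */
}
```

The more arithmetic that happens between the loads and the store, the better the transfer cost is amortized, which is the trade-off the last two bullets describe.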

Vector Unit MIPS Instructions

  • Coprocessor Transfer Instructions

    • Store / Load

  • Coprocessor Branch Instructions

  • Macro (primitive) calculation instructions

    • Add / Subtract / Multiply / Divide / etc.

  • Micro subroutine execution instructions

    (VU Macro Instructions)


  • Adding two vectors using the EE Core (CPU)

    // (Vec4T *v0, Vec4T *v1, Vec4T *v2)


    v2->x = v0->x + v1->x;

    v2->y = v0->y + v1->y;

    v2->z = v0->z + v1->z;

    v2->w = v0->w + v1->w;



  • Adding two vectors using the VU0

    // (Vec4T *v0, Vec4T *v1, Vec4T *v2)


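    asm __volatile__ (
        "lqc2      vf05, 0x0(%0)\n"     // vf05 = *v0
        "lqc2      vf06, 0x0(%1)\n"     // vf06 = *v1
        "vadd.xyzw vf07, vf05, vf06\n"  // vf07 = vf05 + vf06, all four fields
        "sqc2      vf07, 0x0(%2)\n"     // *v2 = vf07
        : // no outputs
        : "r" (v0), "r" (v1), "r" (v2)
        : "memory");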
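One way to gain confidence in an inline-assembly routine like this is to compare its output against the plain C version on the same inputs. The sketch below assumes a `Vec4T` layout and introduces two hypothetical helpers, `vec4_add_ref` and `vec4_nearly_equal`. The tolerance is a choice for the example, not a value from the slides.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { float x, y, z, w; } Vec4T;  /* hypothetical layout */

/* Plain C reference: the same component-wise add as the EE Core version. */
static void vec4_add_ref(const Vec4T *v0, const Vec4T *v1, Vec4T *v2) {
    assert(v0 && v1 && v2);
    v2->x = v0->x + v1->x;
    v2->y = v0->y + v1->y;
    v2->z = v0->z + v1->z;
    v2->w = v0->w + v1->w;
}

/* Absolute difference of two floats, written out to avoid needing libm. */
static float abs_diff(float a, float b) {
    float d = a - b;
    return d < 0.0f ? -d : d;
}

/* True when every component of a and b differs by at most eps. */
static bool vec4_nearly_equal(const Vec4T *a, const Vec4T *b, float eps) {
    return abs_diff(a->x, b->x) <= eps &&
           abs_diff(a->y, b->y) <= eps &&
           abs_diff(a->z, b->z) <= eps &&
           abs_diff(a->w, b->w) <= eps;
}
```

On hardware, a test would run the VU0 routine and `vec4_add_ref` on the same inputs and pass both results to `vec4_nearly_equal`.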




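  • Notice how we must use a temp: each component of the cross product reads other components of both inputs, so writing straight into cross would corrupt the result if cross points at v1 or v2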

    // (Vec4T *v1, Vec4T *v2, Vec4T *cross)


    Vec4T temp;

    temp.x = v1->y * v2->z - v1->z * v2->y;

    temp.y = v1->z * v2->x - v1->x * v2->z;

    temp.z = v1->x * v2->y - v1->y * v2->x;

    temp.w = 0.0f; // w = 0, so no uninitialized value is copied

    VectorCopy(&temp, cross);
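The need for the temporary shows up as soon as the output pointer aliases an input. The sketch below is a self-contained illustration under assumed definitions: the `Vec4T` layout and the body of `VectorCopy` are hypothetical, and `vec4_cross_no_temp` is a deliberately unsafe variant added only for contrast.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } Vec4T;  /* hypothetical layout */

/* Hypothetical helper matching the (source, destination) call in the slide. */
static void VectorCopy(const Vec4T *src, Vec4T *dst) {
    *dst = *src;
}

/* Safe version: results go to a temp first, so cross may alias v1 or v2. */
static void vec4_cross(const Vec4T *v1, const Vec4T *v2, Vec4T *cross) {
    assert(v1 && v2 && cross);
    Vec4T temp;
    temp.x = v1->y * v2->z - v1->z * v2->y;
    temp.y = v1->z * v2->x - v1->x * v2->z;
    temp.z = v1->x * v2->y - v1->y * v2->x;
    temp.w = 0.0f;
    VectorCopy(&temp, cross);
}

/* Unsafe version, for contrast: writes directly into cross. */
static void vec4_cross_no_temp(const Vec4T *v1, const Vec4T *v2, Vec4T *cross) {
    assert(v1 && v2 && cross);
    cross->x = v1->y * v2->z - v1->z * v2->y;
    cross->y = v1->z * v2->x - v1->x * v2->z;  /* v1->x is already overwritten if cross == v1 */
    cross->z = v1->x * v2->y - v1->y * v2->x;  /* so are v1->x and v1->y here */
    cross->w = 0.0f;
}
```

With `a = (1, 2, 3)` and `b = (4, 5, 6)`, `vec4_cross(&a, &b, &a)` leaves the correct result `(-3, 6, -3)` in `a`, while `vec4_cross_no_temp(&a, &b, &a)` does not.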



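// (Vec4T *v1, Vec4T *v2, Vec4T *cross)

asm __volatile__ (
    "lqc2        vf05, 0x0(%0)\n"     // vf05 = *v1
    "lqc2        vf06, 0x0(%1)\n"     // vf06 = *v2
    "vopmula.xyz ACC, vf05, vf06\n"   // first half of the outer product, into ACC
    "vopmsub.xyz vf06, vf06, vf05\n"  // ACC minus the second half, into vf06
    "vsub.w      vf06, vf00, vf00\n"  // w = 0
    "sqc2        vf06, 0x0(%2)\n"     // *cross = vf06
    : // No output
    : "r"(v1), "r"(v2), "r"(cross)
    : "memory");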



Vector Outer Product

  • The vopmula instruction performs the first (multiply) half of an outer product

  • The result is stored into the special-purpose ACC register

  • [Diagram: the X, Y and Z fields of VF05 and VF06]
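To make the two-step outer product concrete, the instructions can be emulated in portable C. The sketch below reflects my reading of the `vopmula` and `vopmsub` semantics rather than an authoritative definition. `opmula`, `opmsub`, and `cross_vu_style` are hypothetical names, and the `Vec4T` layout is assumed.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } Vec4T;  /* hypothetical layout */

/* Emulates vopmula.xyz ACC, fs, ft: the multiply half of the outer product. */
static void opmula(Vec4T *acc, const Vec4T *fs, const Vec4T *ft) {
    acc->x = fs->y * ft->z;
    acc->y = fs->z * ft->x;
    acc->z = fs->x * ft->y;
}

/* Emulates vopmsub.xyz fd, fs, ft: fd = ACC minus the other half.
   Works on a copy so that fd may alias fs or ft. */
static void opmsub(Vec4T *fd, const Vec4T *acc, const Vec4T *fs, const Vec4T *ft) {
    Vec4T r = *fd;  /* the w field is left untouched */
    r.x = acc->x - fs->y * ft->z;
    r.y = acc->y - fs->z * ft->x;
    r.z = acc->z - fs->x * ft->y;
    *fd = r;
}

/* Mirrors the instruction sequence of the VU0 cross product. */
static void cross_vu_style(const Vec4T *v1, const Vec4T *v2, Vec4T *cross) {
    assert(v1 && v2 && cross);
    Vec4T vf05 = *v1;                   /* lqc2 vf05 */
    Vec4T vf06 = *v2;                   /* lqc2 vf06 */
    Vec4T acc = {0.0f, 0.0f, 0.0f, 0.0f};
    opmula(&acc, &vf05, &vf06);         /* vopmula.xyz ACC, vf05, vf06  */
    opmsub(&vf06, &acc, &vf06, &vf05);  /* vopmsub.xyz vf06, vf06, vf05 */
    vf06.w = 1.0f - 1.0f;               /* vsub.w: vf00.w is 1, so w = 0 */
    *cross = vf06;                      /* sqc2 vf06 */
}
```

For `v1 = (1, 2, 3)` and `v2 = (4, 5, 6)`, ACC holds `(12, 12, 5)` after the first step, and the subtraction yields `(-3, 6, -3)`, matching the scalar cross product.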


For Next Time

Read Chapters 7.3.2 – 7.4.2

Read Chapter 9.3