
Vector Unit Assembly

Overview



  • Architecture Review

  • VU0 Macro Mode Instruction Set

  • Building a Vector Library

Review



  • The PlayStation 2 has two vector units that are similar but not identical

  • VU0 is the CPU’s alternate processing unit

  • VU1 is the alternate processing unit for the GS (Graphics Synthesizer)

  • Each unit has a direct pipeline to its respective processor

  • The vector units are designed for 4D vectors of 32-bit floats



  • VU0 and VU1 each have access to 32 float registers and 16 integer registers

  • Float registers are not like PC registers; they are 128 bits wide (a typical PC register is 32 bits)

  • 128 bits can hold 4 float values at once (one 4D vector)

  • Integer registers are typically used as loop counters and for address calculations
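
A 128-bit register holding four floats maps naturally onto a 16-byte C struct. The sketch below assumes a C11 compiler; the name Vec4T is borrowed from the later slides, but this particular layout is an assumption, not the course's actual definition. The 16-byte alignment matters because quadword loads and stores such as lqc2/sqc2 expect 16-byte-aligned addresses.

```c
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical layout for a 4D vector: four 32-bit floats packed into
   16 bytes and aligned to 16 bytes, matching one 128-bit VU register. */
typedef struct {
    alignas(16) float x;
    float y;
    float z;
    float w;
} Vec4T;

/* Compile-time checks: the struct must fill exactly one quadword. */
_Static_assert(sizeof(Vec4T) == 16, "Vec4T must be 128 bits");
_Static_assert(alignof(Vec4T) == 16, "Vec4T must be quadword aligned");
```

With this layout, an array of Vec4T keeps every element on a quadword boundary, so any element can be handed to a quadword load.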





[Figure: bus diagram, labeled "shared bus"]







  • VU0 has two bus lines

  • One bus is dedicated to the CPU

  • The other bus is used to communicate with all other devices

  • VU0 has 4 KB of cache


Vector Unit Processing Speed

  • The graph shows some vector-math intensive function calls

  • 200K calls were made to each function


Macro and Micro Modes

  • Vector Unit Zero (VU0) has two modes

  • Micro mode lets the vector unit act as an independent processor

    • A mini program is uploaded and executed in parallel with the main CPU

  • Macro mode allows your CPU to directly offload heavy vector computation with low overhead

    • Most popular method, hands down.


Micro Mode

  • Once uploaded, the micro program executes independently of the CPU

    • This means that we must time our execution so that the CPU fetches the result only after the Vector Unit has completed the program

    • Micro mode can cause serious stalls and timing issues, since execution time is nearly impossible to predict


Macro Mode

  • Macro mode is a much easier method of executing fast math functionality

  • Assembly can be used as inline instructions, telling the compiler to offload the math to VU0

  • Notes

    • Just because it’s in assembly does not mean it will be faster

    • Switching CPU focus has its own overhead


Assembly Structure

  • Assembly routines typically follow a fixed three-step pattern

    • Load the variable data/addresses to registers

    • Apply vector computations to those registers

    • Store the result back into a variable address

  • Overhead of using assembly is in the load and store

  • Make sure that the computation stage will improve performance enough to offset the load/store overhead
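
The trade-off above can be modeled in portable C, with a local variable standing in for a VU register. This is only a sketch under that assumption: the Vec4 type and the vec4_sum function are hypothetical names, and nothing here runs on VU0. The point is the shape of the routine: the accumulator is set up once, all the computation happens on it, and the result is stored once, so the fixed load/store cost is spread over n additions instead of being paid per addition.

```c
#include <stddef.h>

/* Hypothetical 4D vector type used only for this illustration. */
typedef struct { float x, y, z, w; } Vec4;

/* Sum n vectors into *out.
   The accumulator plays the role of a register: it is initialized
   once, updated n times, and written back to memory once. */
void vec4_sum(const Vec4 *v, size_t n, Vec4 *out)
{
    Vec4 acc = { 0.0f, 0.0f, 0.0f, 0.0f };   /* stage 1: set up the "register" */

    for (size_t i = 0; i < n; ++i) {         /* stage 2: compute */
        acc.x += v[i].x;
        acc.y += v[i].y;
        acc.z += v[i].z;
        acc.w += v[i].w;
    }

    *out = acc;                              /* stage 3: single store */
}
```

A routine that instead stored the partial sum after every addition would pay the store cost n times, which is the situation the last bullet warns about.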


Vector Unit MIPS Instructions

  • Coprocessor Transfer Instructions

    • Store / Load

  • Coprocessor Branch Instructions

  • Macro (primitive) calculation instructions

    • Add / Subtract / Multiply / Divide / etc.

  • Micro subroutine execution instructions

    (VU Macro Instructions)



  • Adding two vectors using the EE Core (CPU)

    // (Vec4T *v0, Vec4T *v1, Vec4T *v2)


    v2->x = v0->x + v1->x;

    v2->y = v0->y + v1->y;

    v2->z = v0->z + v1->z;

    v2->w = v0->w + v1->w;




  • Adding two vectors using the VU0

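    // (Vec4T *v0, Vec4T *v1, Vec4T *v2)
    // v0, v1 and v2 must be 16-byte aligned for lqc2/sqc2

    asm __volatile__ (
        "lqc2      vf05, 0x0(%0)    \n"   // load v0 into vf05
        "lqc2      vf06, 0x0(%1)    \n"   // load v1 into vf06
        "vadd.xyzw vf07, vf05, vf06 \n"   // vf07 = vf05 + vf06 on all four fields
        "sqc2      vf07, 0x0(%2)    \n"   // store vf07 to v2
        :                                 // no outputs
        : "r" (v0), "r" (v1), "r" (v2)
        : "memory");                      // the asm writes memory through v2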





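  • Notice how we must use a temp for the cross product: each output component reads input components that an earlier line would already have overwritten if cross pointed at v1 or v2

    // (Vec4T *v1, Vec4T *v2, Vec4T *cross)

    Vec4T temp;
    temp.x = v1->y * v2->z - v1->z * v2->y;
    temp.y = v1->z * v2->x - v1->x * v2->z;
    temp.z = v1->x * v2->y - v1->y * v2->x;
    temp.w = 0.0f;  // set w explicitly so the copy does not read an uninitialized field
    VectorCopy(&temp, cross);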
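
The need for the temporary shows up as soon as the output pointer aliases an input. The sketch below is a portable C illustration with hypothetical names (Vec4, cross_direct, cross_temp); it is not the course's code. cross_direct writes straight into the output and goes wrong when the output is also the first operand, while cross_temp computes into a local first.

```c
/* Hypothetical 4D vector type used only for this illustration. */
typedef struct { float x, y, z, w; } Vec4;

/* Writes each component directly into *out.
   Unsafe when out aliases a or b: out->x is overwritten before the
   later lines read a->x or b->x. */
void cross_direct(const Vec4 *a, const Vec4 *b, Vec4 *out)
{
    out->x = a->y * b->z - a->z * b->y;
    out->y = a->z * b->x - a->x * b->z;
    out->z = a->x * b->y - a->y * b->x;
}

/* Computes into a local temporary, then copies.
   Safe even when out points at a or b. */
void cross_temp(const Vec4 *a, const Vec4 *b, Vec4 *out)
{
    Vec4 t;
    t.x = a->y * b->z - a->z * b->y;
    t.y = a->z * b->x - a->x * b->z;
    t.z = a->x * b->y - a->y * b->x;
    t.w = 0.0f;
    *out = t;
}
```

For example, with a = (1, 0, 0) and b = (0, 1, 0), calling cross_temp(&a, &b, &a) leaves a = (0, 0, 1), whereas cross_direct(&a, &b, &a) clears a.x first and ends with a.z = 0.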




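  • The same cross product using the VU0

    // (Vec4T *v1, Vec4T *v2, Vec4T *cross)

    asm __volatile__ (
        "lqc2        vf05, 0x0(%0)    \n"   // load v1 into vf05
        "lqc2        vf06, 0x0(%1)    \n"   // load v2 into vf06
        "vopmula.xyz ACC,  vf05, vf06 \n"   // first: ACC = first set of products
        "vopmsub.xyz vf06, vf06, vf05 \n"   // - second: vf06 = ACC - second set of products
        "vsub.w      vf06, vf00, vf00 \n"   // w = 0 (vf00.w is 1, so 1 - 1)
        "sqc2        vf06, 0x0(%2)    \n"   // store the result to cross
        :                                   // No Output
        : "r"(v1), "r"(v2), "r"(cross)
        : "memory");                        // the asm writes memory through cross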



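Vector Outer Product

  • The vopmula instruction performs the first half of an outer (cross) product

  • The result is stored into the special-purpose ACC register

  • The vopmsub instruction then subtracts the second set of products from ACC to complete the outer product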
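
The arithmetic behind the vopmula/vopmsub pair can be modeled in portable C. This is a sketch of the per-field arithmetic only; the names Vec4, opmula, opmsub and cross_model are hypothetical, the ACC register is represented as an ordinary value, and the w field is simply returned as 0 to mirror the vsub.w step in the VU0 routine.

```c
/* Hypothetical 4D vector type used only for this illustration. */
typedef struct { float x, y, z, w; } Vec4;

/* Models "vopmula.xyz ACC, fs, ft":
   ACC.x = fs.y * ft.z, ACC.y = fs.z * ft.x, ACC.z = fs.x * ft.y */
static Vec4 opmula(Vec4 fs, Vec4 ft)
{
    Vec4 acc = { fs.y * ft.z, fs.z * ft.x, fs.x * ft.y, 0.0f };
    return acc;
}

/* Models "vopmsub.xyz fd, fs, ft":
   fd.x = ACC.x - fs.y * ft.z, and likewise for y and z */
static Vec4 opmsub(Vec4 acc, Vec4 fs, Vec4 ft)
{
    Vec4 fd = { acc.x - fs.y * ft.z,
                acc.y - fs.z * ft.x,
                acc.z - fs.x * ft.y,
                0.0f };
    return fd;
}

/* Cross product v1 x v2, using the same operand order as the VU0
   routine: opmula(v1, v2) followed by opmsub(v2, v1). */
Vec4 cross_model(Vec4 v1, Vec4 v2)
{
    Vec4 acc = opmula(v1, v2);
    return opmsub(acc, v2, v1);
}
```

Swapping the operands in the second step is what turns two "rotated multiply" passes into the familiar difference of products, for example x = v1.y * v2.z - v2.y * v1.z.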


For Next Time

Read Chapters 7.3.2 – 7.4.2

Read Chapter 9.3
