Vector Unit Assembly

[email protected]


Overview

  • Architecture Review

  • VU0 Macro Mode Instruction Set

  • Building a Vector Library


Review

  • PlayStation 2 has two vector units that are similar but not identical

  • VU0 is the CPU’s alternate processing unit

  • VU1 is the GS’s alternate processing unit

  • Each unit has a direct pipeline to its respective processor

  • Vector units are designed for 4D vectors with 32-bit components


Review

  • VU0 and VU1 each have access to 32 float registers and 16 integer registers

  • Float registers are not like PC registers; they are 128 bits wide (a PC float register is 32 bits)

  • 128 bits hold four 32-bit float values at once (a 4D vector)

  • Integer registers are typically used as loop counters and address calculators
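
To make the register layout concrete, here is a minimal C sketch of a 4D vector type sized to fill one 128-bit float register. The name Vec4T matches the type used in the later slides, but the deck never shows its definition, so the field layout, the 16-byte alignment (written with the GCC/Clang attribute syntax), and the helper names vec4_bits and vec4_is_aligned are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: four 32-bit floats, matching the x/y/z/w field
   access used in the later slides. Not taken from the deck itself. */
typedef struct Vec4T {
    float x, y, z, w;
} __attribute__((aligned(16))) Vec4T;

/* Compile-time checks: one Vec4T fills exactly one 128-bit register. */
_Static_assert(sizeof(float) * 8 == 32, "this sketch expects 32-bit floats");
_Static_assert(sizeof(Vec4T) * 8 == 128, "Vec4T should be 128 bits");
_Static_assert(_Alignof(Vec4T) == 16, "Vec4T should be 16-byte aligned");

/* Hypothetical helper: size of the vector type in bits. */
size_t vec4_bits(void)
{
    return sizeof(Vec4T) * 8;
}

/* Hypothetical helper: nonzero if v sits on a 16-byte boundary,
   which is what a whole-register (quadword) load or store expects. */
int vec4_is_aligned(const Vec4T *v)
{
    return ((uintptr_t)v & 15u) == 0;
}
```

Keeping the type aligned means a whole vector can be moved to or from a float register in a single quadword transfer rather than four scalar ones.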


Review

[Diagram: CPU CORE and VU0 joined by a dedicated bus, with a shared bus to SYS RAM; labels I$ 4KB and D$ 4KB]

  • VU0 has two bus lines

  • One bus is dedicated to the CPU

  • The other bus is used to communicate with all other devices

  • VU0 has 4KB of instruction memory (I$) and 4KB of data memory (D$)


Vector Unit Processing Speed

  • The graph shows some vector-math intensive function calls

  • 200K calls were made to each function
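
The measurement code behind the graph is not part of the deck. As a rough, portable sketch of how such a test might be set up, the harness below times a fixed number of calls to a vector routine using the standard C clock() function. The names VecOp, scalar_add and time_calls are hypothetical, and the timer actually used on the console is unknown.

```c
#include <time.h>

/* Assumed vector type; the deck does not show its definition. */
typedef struct { float x, y, z, w; } Vec4T;

/* Any routine with the three-pointer signature used in the slides. */
typedef void (*VecOp)(Vec4T *v0, Vec4T *v1, Vec4T *v2);

/* Plain C component-wise add, standing in for a routine under test. */
void scalar_add(Vec4T *v0, Vec4T *v1, Vec4T *v2)
{
    v2->x = v0->x + v1->x;
    v2->y = v0->y + v1->y;
    v2->z = v0->z + v1->z;
    v2->w = v0->w + v1->w;
}

/* Calls op `calls` times and returns the elapsed CPU time in seconds. */
double time_calls(VecOp op, long calls, Vec4T *a, Vec4T *b, Vec4T *out)
{
    clock_t start = clock();
    for (long i = 0; i < calls; ++i)
        op(a, b, out);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(scalar_add, 200000, &a, &b, &out) would mirror the 200K-call setup; passing a VU0 version of the same routine gives the number to compare against.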


Macro and Micro Modes

  • Vector Unit Zero (VU0) has two modes

  • Micro mode lets the vector unit act as an independent processor

    • A mini program is uploaded and executed in parallel to the main CPU

  • Macro mode allows your CPU to directly offload heavy vector computation with low overhead

    • Most popular method, hands down.


Micro Mode

  • Once uploaded, the micro program executes independently of the CPU

    • This means we must time our execution so that the CPU fetches the result only after the Vector Unit has completed the program

    • Micro mode can cause serious stalls and timing issues, since execution time is nearly impossible to predict


Macro Mode

  • Macro mode is a much easier way to execute fast math routines

  • VU0 instructions can be written as inline assembly, so the CPU offloads the math to VU0

  • Notes

    • Just because it’s in assembly does not mean it will be faster

    • Switching CPU focus has its overhead


Assembly Structure

  • Assembly routines typically follow a three-step pattern

    • Load the variable data/addresses to registers

    • Apply vector computations to those registers

    • Store the result back into a variable address

  • The overhead of using assembly lies in the load and store

  • Make sure that the computation stage will improve performance enough to offset the load/store overhead
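
The load / compute / store shape can be sketched in plain C, with local variables standing in for vector registers. The routine below is a hypothetical example (VectorScaleAdd is not from the deck, and Vec4T's layout is assumed). It does two arithmetic operations per component for each load/store pair, which is the kind of ratio that helps the computation stage pay for the transfers.

```c
/* Assumed vector type; the deck does not show its definition. */
typedef struct { float x, y, z, w; } Vec4T;

/* Hypothetical routine: out = a * s + b, written in three stages. */
void VectorScaleAdd(const Vec4T *a, float s, const Vec4T *b, Vec4T *out)
{
    /* 1. Load: copy the operands into locals ("registers"). */
    Vec4T ra = *a;
    Vec4T rb = *b;

    /* 2. Compute: all arithmetic works on the local copies only. */
    Vec4T r;
    r.x = ra.x * s + rb.x;
    r.y = ra.y * s + rb.y;
    r.z = ra.z * s + rb.z;
    r.w = ra.w * s + rb.w;

    /* 3. Store: write the result back to memory once. */
    *out = r;
}
```

Because every input is loaded before anything is stored, the routine also stays correct when out points at a or b.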


Vector Unit MIPS Instructions

  • Coprocessor Transfer Instructions

    • Store / Load

  • Coprocessor Branch Instructions

  • Macro (primitive) calculation instructions

    • Add / Subtract / Multiply / Divide / etc.

  • Micro subroutine execution instructions

    (VU Macro Instructions)


EEVectorAdd

  • Adding two vectors using the EE Core (CPU)

    void EEVectorAdd(Vec4T *v0, Vec4T *v1, Vec4T *v2)
    {
        // One scalar add per component: v2 = v0 + v1
        v2->x = v0->x + v1->x;
        v2->y = v0->y + v1->y;
        v2->z = v0->z + v1->z;
        v2->w = v0->w + v1->w;
    }


VectorAdd

  • Adding two vectors using the VU0

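    void VectorAdd(Vec4T *v0, Vec4T *v1, Vec4T *v2)
    {
        asm __volatile__ (
            "lqc2      vf05, 0x0(%0)     \n"   // load v0 into vf05
            "lqc2      vf06, 0x0(%1)     \n"   // load v1 into vf06
            "vadd.xyzw vf07, vf05, vf06  \n"   // add all four components at once
            "sqc2      vf07, 0x0(%2)     \n"   // store the sum to v2
            : // no outputs
            : "r" (v0), "r" (v1), "r" (v2)
            : "memory"                         // the asm writes through v2
        );
    }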


EECrossProduct

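  • Notice that we must use a temp: the output may point at one of the inputs, and each component of the cross product reads other components of both inputs

    void EECrossProduct(Vec4T *v1, Vec4T *v2, Vec4T *cross)
    {
        Vec4T temp;

        temp.x = v1->y * v2->z - v1->z * v2->y;
        temp.y = v1->z * v2->x - v1->x * v2->z;
        temp.z = v1->x * v2->y - v1->y * v2->x;
        temp.w = 0.0f;  // avoid copying an uninitialized w

        VectorCopy(&temp, cross);
    }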
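
To show why the temp matters, the sketch below compares a version that writes straight into the output with one that goes through a temporary, then calls both with the output aliasing the first input. The function names CrossNoTemp and CrossWithTemp are hypothetical, Vec4T's layout is assumed, and a plain struct assignment stands in for the deck's VectorCopy.

```c
/* Assumed vector type; the deck does not show its definition. */
typedef struct { float x, y, z, w; } Vec4T;

/* Writes directly into cross: gives a wrong result if cross aliases
   v1 or v2, because x is overwritten before y and z have read it. */
void CrossNoTemp(Vec4T *v1, Vec4T *v2, Vec4T *cross)
{
    cross->x = v1->y * v2->z - v1->z * v2->y;
    cross->y = v1->z * v2->x - v1->x * v2->z;
    cross->z = v1->x * v2->y - v1->y * v2->x;
    cross->w = 0.0f;
}

/* Computes into a temporary first, so aliasing is harmless. */
void CrossWithTemp(Vec4T *v1, Vec4T *v2, Vec4T *cross)
{
    Vec4T temp;
    temp.x = v1->y * v2->z - v1->z * v2->y;
    temp.y = v1->z * v2->x - v1->x * v2->z;
    temp.z = v1->x * v2->y - v1->y * v2->x;
    temp.w = 0.0f;
    *cross = temp;  /* stands in for VectorCopy(&temp, cross) */
}
```

With v1 = (1, 2, 3) and v2 = (4, 5, 6), the correct cross product is (-3, 6, -3). Calling CrossWithTemp(&v1, &v2, &v1) produces that, while CrossNoTemp(&v1, &v2, &v1) produces (-3, 30, -135), since the y and z lines read the already-overwritten components.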


CrossProduct

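void CrossProduct(Vec4T *v1, Vec4T *v2, Vec4T *cross)
{
    asm __volatile__ (
        "lqc2        vf05, 0x0(%0)      \n"   // load v1 into vf05
        "lqc2        vf06, 0x0(%1)      \n"   // load v2 into vf06
        "vopmula.xyz ACC,  vf05, vf06   \n"   // ACC = first set of products
        "vopmsub.xyz vf06, vf06, vf05   \n"   // vf06 = ACC - second set of products
        "vsub.w      vf06, vf00, vf00   \n"   // w = 0
        "sqc2        vf06, 0x0(%2)      \n"   // store the result to cross
        : // no outputs
        : "r" (v1), "r" (v2), "r" (cross)
        : "memory"                            // the asm writes through cross
    );
}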


Vector Outer Product

  • The vopmula instruction performs an outer product

  • The result is stored into the special-purpose ACC register

[Diagram: VF05 (xyz) and VF06 (xyz) combined into ACC (xyz)]
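
As a rough illustration of the two-instruction cross product, the sketch below emulates vopmula and vopmsub in plain C. The component pairing follows my understanding of the VU instruction descriptions (vopmula writes the first three products to ACC, vopmsub subtracts the second three). The emu_ function names are hypothetical, Vec4T's layout is assumed, and the emulation ignores any VU-specific floating-point behavior.

```c
/* Assumed vector type; the deck does not show its definition. */
typedef struct { float x, y, z, w; } Vec4T;

/* Emulates: vopmula.xyz ACC, fs, ft */
void emu_vopmula_xyz(Vec4T *acc, const Vec4T *fs, const Vec4T *ft)
{
    acc->x = fs->y * ft->z;
    acc->y = fs->z * ft->x;
    acc->z = fs->x * ft->y;
}

/* Emulates: vopmsub.xyz fd, fs, ft (reads ACC; fd may alias fs or ft) */
void emu_vopmsub_xyz(Vec4T *fd, const Vec4T *acc,
                     const Vec4T *fs, const Vec4T *ft)
{
    Vec4T r = *fd;                      /* the w lane is left untouched */
    r.x = acc->x - fs->y * ft->z;
    r.y = acc->y - fs->z * ft->x;
    r.z = acc->z - fs->x * ft->y;
    *fd = r;
}

/* Mirrors the instruction sequence of the CrossProduct slide. */
void emu_cross(const Vec4T *v1, const Vec4T *v2, Vec4T *cross)
{
    Vec4T vf05 = *v1;                            /* lqc2 vf05, 0x0(%0) */
    Vec4T vf06 = *v2;                            /* lqc2 vf06, 0x0(%1) */
    Vec4T acc = {0.0f, 0.0f, 0.0f, 0.0f};

    emu_vopmula_xyz(&acc, &vf05, &vf06);         /* first products     */
    emu_vopmsub_xyz(&vf06, &acc, &vf06, &vf05);  /* minus second set   */
    vf06.w = 0.0f;                               /* vsub.w: w = 0      */
    *cross = vf06;                               /* sqc2 vf06, 0x0(%2) */
}
```

For v1 = (1, 2, 3) and v2 = (4, 5, 6), ACC holds (12, 12, 5) after the first step and the final result is (-3, 6, -3, 0), matching the scalar formula.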


For Next Time

Read Chapters 7.3.2 – 7.4.2

Read Chapter 9.3

