Vector Units and Quaternions


Jim Van Verth

Red Storm Entertainment

jimvv@redstorm.com

About This Talk
• Will discuss how to do quaternion math on PS2
• Assume that you already know and want to use quaternions
• Assume that you already know something about how the VU works
About Me
• Lead engineer at Red Storm Entertainment
• Not a quaternion god
• Not a vector unit god
• Not really familiar with VCL
• Just a 3D guy trying to get by…
• Most examples written in macro mode (VU0)
• Easy to translate to micro mode
• Examples that would be faster in micro mode are discussed separately
Matrices on PS2
• PS2 is really well set up to do matrices
• Multiplies are highly parallel
• Not so good for quaternions
Matrix Multiply
• This is what we’re up against
• Takes 4/7 cycles to transform a point
• Takes 16/19 cycles to concatenate matrices (9/12 cycles for a 3x3 matrix)

    vmulax ACC, vf2, vf1x

Why Quaternions?
• Quaternions take up less space: 4 floats vs. 9 (best case)
• Quaternions interpolate well
• Avoid floating point drift (normalize vs. Gram-Schmidt orthogonalization)
Quaternions on VU
• Fit very well
• Four floats, aligned to 16-byte boundary
• Work just like homogeneous point
• Make sure stored (x,y,z,w) not (w,x,y,z)
Quaternion Multiplication
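• If quaternion is (x, y, z, w) or (v, w), then the product of (v_1, w_1) and (v_2, w_2) has $w = w_1 w_2 - v_1 \cdot v_2$ and $v = w_1 v_2 + w_2 v_1 + v_1 \times v_2$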
• All standard vector operations
• Add, scale, dot product, cross product
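As a plain reference for the product rule, here is a minimal Python sketch. It assumes quaternions are stored as (x, y, z, w) tuples, matching the VU layout recommended earlier; the function name `quat_mul` is illustrative only and does not come from the talk.

```python
def quat_mul(q1, q2):
    """Multiply two quaternions stored as (x, y, z, w) tuples."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    # Scalar part: w1*w2 - v1 . v2
    w = w1 * w2 - (x1 * x2 + y1 * y2 + z1 * z2)
    # Vector part: w1*v2 + w2*v1 + v1 x v2
    x = w1 * x2 + w2 * x1 + (y1 * z2 - z1 * y2)
    y = w1 * y2 + w2 * y1 + (z1 * x2 - x1 * z2)
    z = w1 * z2 + w2 * z1 + (x1 * y2 - y1 * x2)
    return (x, y, z, w)


# Basis check: i * j = k, and j * i = -k (the product is not commutative).
i = (1.0, 0.0, 0.0, 0.0)
j = (0.0, 1.0, 0.0, 0.0)
ij = quat_mul(i, j)
ji = quat_mul(j, i)
```

The dot product feeds only the w component and the cross product feeds only xyz, which is why the two can be interleaved on a four-wide unit.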
Quaternion Mult on PS2
• Interleaves dot product and rest via accumulator
• Takes advantage of linearity of cross product
• Cycle count: 8/11
• Less than matrix!

    vmul vf3, vf1, vf2
    vopmula.xyz acc, vf1, vf2
    vopmsub.xyz vf3, vf2, vf1
    vsubaz.w acc, vf3, vf3z
    vmsubax.w acc, vf0, vf3x
    vmsuby.w vf3, vf0, vf3y

$w = w_1 w_2 - v_1 \cdot v_2$

$v = w_1 v_2 + w_2 v_1 + v_1 \times v_2$

Vector Rotation
• Formula for vector rotation: $p' = q\,p\,q^{-1}$
• Two mults takes 16 cycles, plus the inverse
• Can do better
Vector Rotation, Take Two
• If q = (v, w) is normalized, then can do: $p' = (v \cdot p)\,v + w^2 p + 2w\,(v \times p) + v \times (v \times p)$
• This is faster than two straight multiplies on serial processor
• Faster on vector processor, too!
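The expanded rotation formula can be checked with a short Python sketch. It assumes a unit quaternion q = (v, w) stored as (x, y, z, w); the helper names `cross`, `dot` and `rotate` are illustrative, not part of the talk.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def rotate(q, p):
    """Rotate 3-vector p by unit quaternion q = (x, y, z, w)."""
    v, w = q[:3], q[3]
    vxp = cross(v, p)        # v x p
    vxvxp = cross(v, vxp)    # v x (v x p)
    vdp = dot(v, p)          # v . p
    # p' = (v.p) v + w^2 p + 2w (v x p) + v x (v x p)
    return tuple(vdp * v[k] + w * w * p[k] + 2.0 * w * vxp[k] + vxvxp[k]
                 for k in range(3))


# 90 degrees about the z axis should take the x axis to the y axis.
half = math.radians(90.0) / 2.0
q = (0.0, 0.0, math.sin(half), math.cos(half))
p_rot = rotate(q, (1.0, 0.0, 0.0))
```

The four terms split naturally into "build the pieces" (two cross products, one dot product, w squared, 2w) and "add them together", which is the shape of the VU code.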
Vector Rotation on VU
• p in vf1, q in vf2
• First part builds all the pieces
• Second part adds ‘em all together
• Cycles: 13/16
• Better than straight multiply
• Worse than matrix

    vmul vf11, vf1, vf2
    vopmula.xyz acc, vf2, vf1
    vopmsub.xyz vf5, vf1, vf2
    vmul.w vf6w, vf2w, vf2w
    vadd.w vf7w, vf2w, vf2w
    vmulax.w accw, vf0w, vf11x
    vopmula.xyz acc, vf2, vf5
    vopmsub.xyz vf3, vf5, vf2

$p' = (v \cdot p)\,v + w^2 p + 2w\,(v \times p) + v \times (v \times p)$

Full Transforms
• Combination of translation vector t, quat r, 3 scale factors s
• Once again, want to transform point
• Basic formula: $p' = r\,(s \ast p)\,r^{-1} + t$, where $s \ast p$ scales p componentwise
Point Transformation
• p in vf1, q in vf2
• Scale in vf3
• Translation in vf4
• Takes four extra cycles for scale (including stalls), one extra for translation
• Cycle count: 18/21

    vmul vf1, vf1, vf3
    vmul vf11, vf1, vf2
    vopmula.xyz acc, vf2, vf1
    vopmsub.xyz vf5, vf1, vf2
    vmul.w vf6w, vf2w, vf2w
    vadd.w vf7w, vf2w, vf2w
    vmulax.w accw, vf0w, vf11x
    vopmula.xyz acc, vf2, vf5
    vopmsub.xyz vf3, vf5, vf2

Transform Concatenation
• Look at formula:
• Have to transform point and multiply two quaternions and multiply scales
Transform Concatenation
• Takes 8 cycles for quat multiply, 18 for transform, 1 for scale
• Have three stall cycles available
• Bottom line: 24/27 cycles
• Much slower than matrix multiplication
• Not recommended
Matrix Conversion
• Quat-vector transformation not as efficient as matrix-vector transformation (13 cycles vs. 4)
• To do multiple points, want to convert quaternion to a 4x4 matrix
Matrix Conversion
• Corresponding 4x4 matrix to normalized quat q = (x,y,z,w) is:
• Not obvious how to do this efficiently
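For reference, a minimal Python sketch of the conversion. The row-vector convention (p' = p·M) is an assumption, chosen because the macro-mode slides describe the multiply as a series of row vector multiplies; the column-vector form is the transpose. The names `quat_to_matrix` and `transform` are illustrative only.

```python
import math


def quat_to_matrix(q):
    """4x4 rotation matrix for unit quaternion q = (x, y, z, w), row-vector form."""
    x, y, z, w = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y + w * z), 2 * (x * z - w * y), 0.0],
        [2 * (x * y - w * z), 1 - 2 * (x * x + z * z), 2 * (y * z + w * x), 0.0],
        [2 * (x * z + w * y), 2 * (y * z - w * x), 1 - 2 * (x * x + y * y), 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def transform(p, m):
    """Row vector times matrix: p' = p * M, with p = (x, y, z, 1)."""
    return tuple(sum(p[r] * m[r][c] for r in range(4)) for c in range(4))


# 90 degrees about z takes the x axis to the y axis.
half = math.radians(90.0) / 2.0
m = quat_to_matrix((0.0, 0.0, math.sin(half), math.cos(half)))
p_out = transform((1.0, 0.0, 0.0, 1.0), m)
```

Many products (xy, wz and so on) appear twice with different signs, which is where the duplicate work in a naive conversion comes from.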
Matrix Conversion
• Two approaches
• One works well in macro mode
• One in micro mode
• uses Lower instructions to achieve better parallelism
Matrix Conversion (macro)
• Idea: matrix is built from two other matrices
Matrix Conversion (macro)
• Simplification: matrix multiply is series of row vector multiplies
• Create right matrix, generate left matrix via accumulator tricks
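The two-matrix idea can be checked numerically with a Python sketch. The first three columns of the right matrix follow the vf1 to vf4 register layout listed on the stage-one slide; its last column and the whole left matrix are reconstructions (inferred from the sign pattern of the multiply/subtract instructions), not taken verbatim from the talk. All function names are illustrative.

```python
import math


def right_matrix(q):
    """Right factor; rows correspond to vf1..vf4 (last column is assumed)."""
    x, y, z, w = q
    return [[w, z, -y, -x],
            [-z, w, x, -y],
            [y, -x, w, -z],
            [x, y, z, w]]


def left_matrix(q):
    """Left factor; every entry is a signed component of q (assumed layout)."""
    x, y, z, w = q
    return [[w, z, -y, x],
            [-z, w, x, y],
            [y, -x, w, z],
            [-x, -y, -z, w]]


def mat_mul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


# Any unit quaternion will do for the check.
n = math.sqrt(30.0)
q = (1.0 / n, 2.0 / n, 3.0 / n, 4.0 / n)
m = mat_mul(left_matrix(q), right_matrix(q))
```

Because each row of the product is a signed sum of right-matrix rows weighted by quaternion components, the left matrix never has to be stored: the weights can be applied directly from the quaternion register.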
Matrix Conversion (macro)
• Look at one row in matrix multiply:

vmulax ACC, vf5, vf1x

• Or could just do:

vmulaw ACC, vf8, vf1w

• Is linear, so order doesn’t matter
Matrix Conversion (macro)
• Idea: all values we need for left matrix are in quaternion
• Load accumulator with mula by w value (always positive)
• vmadd or vmsub to multiply by positive or negative value and accumulate

vmulaw.xyz acc, vf2, vf5w

vmsubz.xyz vf13, vf1, vf5z

Matrix Conversion (macro)
• More simplification:
• Last row of Mq always (0,0,0,1), don’t compute!
• Last column always 0 too, don’t compute!
• Last row of Rq just the quat in VU format
• Just build:

Matrix Conversion (macro)
• Stage one: build right matrix, clear right column of result

    vsuby.z vf1, vf0, vf4
    vsubz.x vf2, vf0, vf4
    vsubx.y vf3, vf0, vf4
    vmr32.w vf12, vf0
    vmr32.w vf13, vf0
    vmr32.w vf14, vf0

    vf1 = (w, z, -y, ~)
    vf2 = (-z, w, x, ~)
    vf3 = (y, -x, w, ~)
    vf4 = (x, y, z, w)
Matrix Conversion (macro)
• Stage two: matrix multiply to get first three rows, clear bottom row
• Note: accumulate only on xyz (w already cleared)
• Cycles: 25/28

    vmulaw.xyz acc, vf1, vf4w
    vmsubay.xyz acc, vf3, vf4y
    vmulaw.xyz acc, vf2, vf4w
    vmsubz.xyz vf13, vf1, vf4z
    vmulaw.xyz acc, vf3, vf4w
    vmsubx.xyz vf14, vf2, vf4x
    vmove.xyzw vf15, vf0
Matrix Conversion (micro)
• Lots of duplicate calculations in matrix
• Idea: calculate only what we need, use shifting and accumulator tricks to parallelize efficiently
• Devised by Colin Hughes of SCEE
Matrix Conversion (micro)
• Three parts: calculate elements; clear matrix; shift, add and copy into place
• 16/19 cycles

    mula acc, vf1, vf1            loi SQRT_2
    muli vf3, vf1, I              mr32.w vf24, vf0
    opmula acc, vf3, vf3          move vf27, vf0
    msubw vf5, vf3, vf3w          mr32.w vf26, vf0
    maddw vf6, vf3, vf3w          mr32.w vf25, vf0
    msubax.yz acc, vf4, vf2x      nop
    msuby.z vf26, vf4, vf2y       mr32 vf3, vf5
    msubay.xz acc, vf4, vf2y      mr32 vf7, vf6
    msubz.y vf25, vf4, vf2z       mr32.y vf24, vf5
    msubz.x vf24, vf4, vf2z       mr32.x vf26, vf5
    addy.z vf24, vf0, vf6y        mr32.z vf25, vf3
    addx.y vf26, vf0, vf6x        mr32.x vf25, vf7
Matrix Conversion
• If you’re converting a quaternion and going to use it immediately, can make some assumptions
• Don’t create bottom row (just use vf0)
• Don’t clear right column (just use xyz)
• Saves four cycles in macro mode case
Transform to Matrix
• Use one of the quaternion matrix techniques
• Scale first three rows by each scale factor
• Replace last row with translation
• Results:
• 29/32 for macro mode
• 20/23 for micro mode
Normalization
• Need to normalize quaternion to keep it useful for rotation
• (Also avoids floating point drift)
• Fortunately PS2 has reciprocal square root instruction
• Unfortunately it takes a while
Normalization
• Compute dot product
• Compute 1/length
• Scale quaternion
• With stalls, takes 24/27 cycles

    vmul vf2, vf1, vf1
    vrsqrt Q, vf0w, vf2w
    vwaitq
    vmulq vf1, vf1, Q
Normalization
• Another approach
• From “The Inner Product”, March 2002 Game Developer by Jonathan Blow
• Approximate 1/x via Newton-Raphson iteration
• First iteration takes (looks like) 4/7 cycles on VU0
• Second iteration takes as long as RSQRT
• Recommend: if x > 0.91521198, use approx
• Otherwise use RSQRT
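A rough Python sketch of the approximation idea: a textbook Newton-Raphson step for 1/√x, started from the guess 1, which is exact for an already-normalized quaternion. This is not Jonathan Blow's tuned version (his article uses adjusted constants, and the 0.91521198 threshold comes from there); the function names are illustrative.

```python
def rsqrt_near_one(x, iterations=1):
    """Approximate 1/sqrt(x) for x close to 1 using Newton-Raphson steps."""
    y = 1.0  # initial guess, exact when x == 1
    for _ in range(iterations):
        # Newton-Raphson step for f(y) = 1/y^2 - x
        y = y * (3.0 - x * y * y) * 0.5
    return y


def normalize_approx(q, iterations=1):
    """Renormalize a nearly-unit quaternion without a true square root."""
    s = sum(c * c for c in q)          # squared length
    k = rsqrt_near_one(s, iterations)  # approximate 1/length
    return tuple(c * k for c in q)


drifted = (0.0, 0.0, 0.7, 0.72)        # squared length 1.0084
fixed = normalize_approx(drifted)
fixed_len_sq = sum(c * c for c in fixed)
```

With the guess fixed at 1, the first step reduces to (3 - x)/2: one subtract and one multiply, which is why a single iteration is so much cheaper than RSQRT when the quaternion has only drifted slightly.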
Interpolation
• This is where it’s at
• It would be great if it was fast
• Um, well…
Interpolation
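• First look at spherical linear interp: $\mathrm{slerp}(q, r, t) = \dfrac{\sin((1-t)\theta)\,q + \sin(t\theta)\,r}{\sin\theta}$, where $\cos\theta = q \cdot r$
• That’s a lot of sines
• Could precompute θ, 1/sin θ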
• But at least 28 cycles for one of the other sines
• We (RSE) don’t use slerp anyway
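For comparison with the cheaper options, a minimal Python sketch of slerp on (x, y, z, w) tuples. The small-angle fallback to a plain linear blend and its 1e-6 cutoff are assumptions added to keep the sketch safe to run; the function name is illustrative.

```python
import math


def slerp(q, r, t):
    """Spherical linear interpolation between unit quaternions q and r."""
    cos_theta = sum(a * b for a, b in zip(q, r))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if sin_theta < 1e-6:
        # Degenerate angle: fall back to a plain linear blend.
        return tuple((1.0 - t) * a + t * b for a, b in zip(q, r))
    a_w = math.sin((1.0 - t) * theta) / sin_theta
    b_w = math.sin(t * theta) / sin_theta
    return tuple(a_w * a + b_w * b for a, b in zip(q, r))


# Halfway between identity and 90 degrees about z is 45 degrees about z.
q_id = (0.0, 0.0, 0.0, 1.0)
r_z90 = (0.0, 0.0, math.sin(math.radians(45.0)), math.cos(math.radians(45.0)))
mid = slerp(q_id, r_z90, 0.5)
```

Per call there is one arccosine and three sines; θ and 1/sin θ depend only on the two endpoints, so they can be precomputed, but the two sines that depend on t cannot.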
Interpolation
• Lerp, then: $\mathrm{lerp}(q, r, t) = (1-t)\,q + t\,r$
• Is simply (q in vf1, r in vf2, t in vf3w)

    vmsubaw acc, vf1, vf3w

• Need to normalize afterwards
• Makes 30/33 cycles
Interpolation
• Not quite that simple
• Problem: if q•r < 0, interpolation will take long way around sphere
• Need to negate one quat
• Gives the same orientation, but the interpolation takes the short route
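The sign check combines with lerp and normalization as in this Python sketch. It uses a true square root for clarity rather than any approximation, and the name `nlerp` is illustrative.

```python
import math


def nlerp(q, r, t):
    """Normalized lerp between unit quaternions, taking the short route."""
    d = sum(a * b for a, b in zip(q, r))
    if d < 0.0:
        # -r is the same orientation as r, but lies on q's side of the sphere.
        r = tuple(-c for c in r)
    blended = tuple((1.0 - t) * a + t * b for a, b in zip(q, r))
    length = math.sqrt(sum(c * c for c in blended))
    return tuple(c / length for c in blended)


# r is "90 degrees about z" written with flipped sign, so q . r < 0.
q = (0.0, 0.0, 0.0, 1.0)
r = (0.0, 0.0, -math.sin(math.radians(45.0)), -math.cos(math.radians(45.0)))
mid = nlerp(q, r, 0.5)
```

Without the negation the halfway point would correspond to a rotation the long way around; with it, the result is 45 degrees about z, the same as the slerp midpoint for these inputs.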
Linear Interpolation
• Compute dot product
• Check for negative
• Interpolate
• Takes 43/46 cycles

    vmul vf4, vf1, vf2
    vnop
    vnop
    vnop
    cfc2 t0, $16
    and t0, t0, 0x0002
    vmsubaw acc, vf2, vf3w
    b Finish
Finish:
    vmsubw vf1, vf1, vf3w
Linear Interpolation
• There’s more we can do
• Jonathan Blow’s article, again
• Use spline to correct error in lerp
• More investigation needed
• Initial results: takes about 24-26 more cycles
• Looks faster than slerp, more accurate than lerp
How We’re Using All This
• A bit research-y at the moment
• VU0-based math library
• Optimization in specific routines
• In particular, concatenation and interpolation for bones animation
• More memory savings: store quat as 4.12 fixed-point shorts
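A small Python sketch of the 4.12 storage idea. It assumes signed 16-bit shorts with 12 fractional bits (scale 4096) and little-endian byte order; the talk does not spell out these details, and the function names are illustrative.

```python
import struct

FRACTION_BITS = 12
SCALE = 1 << FRACTION_BITS  # 4096 steps per unit


def pack_quat(q):
    """Pack (x, y, z, w) floats into four signed 16-bit 4.12 values (8 bytes)."""
    ints = [max(-32768, min(32767, round(c * SCALE))) for c in q]
    return struct.pack("<4h", *ints)


def unpack_quat(data):
    """Unpack four 4.12 shorts back into floats."""
    return tuple(i / SCALE for i in struct.unpack("<4h", data))


q = (0.0, 0.0, 0.38268343, 0.92387953)
packed = pack_quat(q)
restored = unpack_quat(packed)
```

This halves the footprint from 16 bytes to 8. Each component is off by at most half a step (1/8192), so the restored quaternion should be renormalized before it is used for rotation.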
Conclusions
• Quaternions useful on PS2
• Cheaper to concatenate (alone)
• Convert to matrix to transform
• Use linear interpolation
• Check out Jonathan Blow’s article
References
• Shoemake, Ken, “Animating Rotation with Quaternion Curves,” Computer Graphics, Vol. 19, No. 3 (July 1985).
• EE Core Instruction Set Manual
• VU User’s Manual
• Sony newsgroups
• Blow, Jonathan, “Hacking Quaternions,” Game Developer, Vol. 9, No. 3 (March 2002). [get updated source from www.gdmag.com/code.htm]