Takagi Factorization on a GPU using CUDA
Gagandeep S. Sachdev, Vishay Vanjani & Mary W. Hall
School of Computing, University of Utah

What is Takagi Factorization?
• A special form of the singular value decomposition (SVD).
• Applicable to complex symmetric matrices.
• Mathematically: A = U s U^T
• U = left singular (Takagi vector) matrix
• s = non-negative diagonal matrix of singular values
• U^T = transpose of U

How is it different from SVD?
• SVD is defined as A = U s V*.
• Left and right singular vectors are different matrices.
• Computing Takagi vectors is not trivial.
• Takagi factorization takes advantage of symmetry:
• Lower computational complexity
• Lower storage requirements

Applications:
• Grunsky inequalities
• Computation of the near-best uniform polynomial
• Rational approximation of a high-degree polynomial on a disk
• Complex independent component analysis problems
• Complex coordinate rotation method
• Optical potential method
• Nuclear magnetic resonance
• Diagonalization of mass matrices of Majorana fermions

Jacobi Method
• For a given pair (p, q), form a 2x2 symmetric matrix [a c; c b] from the elements a_pp, a_pq = a_qp, and a_qq.
• Diagonalize this matrix with a plane rotation J = [cos θ  sin θ; −sin θ  cos θ], chosen such that:
  J^T [a c; c b] J = [d_p 0; 0 d_q]
• Apply the transformation to the pth and qth rows and columns.
• Apply such a unitary transform to all possible pairs:
• NC2 = N(N−1)/2 combinations.
• This finishes one sweep.
• Keep performing sweeps until the matrix converges.
• Convergence measure: sum of absolute values of the off-diagonal elements.
• Convergence is quadratic in nature.
• Complexity: O(N^3) per sweep.

Parallel Jacobi Algorithm
• N/2 unitary transforms can proceed in parallel.
• Order does not alter the final result.
• Parallel threads cannot share a common index.
• This is done (N−1) times for one sweep.
• All row updates need to finish before column updates:
• To prevent race conditions, since one thread's column entries lie in other threads' rows.
• Computation is now done on the full symmetric matrix.
• Need a mechanism to generate unique N/2 pairs for (N−1) rounds:
• Cyclic chess-tournament scheduling.

[Figure: round-robin pairings for N = 10 across threads Tid = 0..4 — iteration 0: (0,9) (1,8) (2,7) (3,6) (4,5); iteration 1: (1,9) (2,0) (3,8) (4,7) (5,6)]

GPU Execution
• Copy complex matrix A from host to GPU.
• Kernel: initialize auxiliary data structures — real diagonal matrix D and complex Takagi vector matrix U.
• Kernel: calculate threshold.
• Kernel: diagonalize and do row updates. -------- Host synchronization --------
• Kernel: column and unitary matrix updates. -------- Host synchronization --------
• Repeat the two update kernels (N−1) times per sweep; loop over sweeps until convergence is reached.
• Kernel: make diagonal elements non-negative and sort.
• Copy U and D from GPU.

Other Algorithms
• Divide-and-conquer algorithm
• Twisted factorization method
• Both involve tridiagonalization followed by diagonalization.
• Complexity: O(N^2)
• Problems:
• Not as parallelizable as Jacobi.
• Jacobi method is more stable, e.g., on ill-conditioned matrices.

Related Work
• Fast SVD on the 7900 GTX — a generation before CUDA-enabled GPUs (2005).
• Recently, SVD was implemented on a GPU using CUDA:
• Implemented only for real matrices.
• Bidiagonalization by Householder transforms.
• Diagonalization by the implicitly shifted QR algorithm.
• Complexity: O(m·n^2)
• High storage requirements.
• Matlab and serial C implementations available.

Results
[Figure: speedup of GPUs and Pthreads over serial CPU for different matrix sizes]

Conclusion
• Jacobi method is the most stable and accurate parallel algorithm.
• Good speedups achieved over the serial and Pthreads implementations.
• Nvidia Fermi architecture gave much better speedups:
• Non-coalesced reads are cached.
• Much higher FLOP performance.
• Pthreads implementation reaches a speedup of 1.64x over the serial CPU version.
• NVIDIA GTX 260 reaches a speedup of 11.69x.
• NVIDIA GTX 480 (Fermi) reaches a speedup of 30.80x.
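The cyclic chess-tournament scheduling from the Parallel Jacobi section can be sketched in serial C (the helper name `pair_for_round` is illustrative, not from the original code); it reproduces the pairings shown in the figure, e.g. (0,9)…(4,5) in iteration 0 and (1,9),(2,0)… in iteration 1 for N = 10:

```c
/* Round-robin ("circle method") scheduling: element n-1 stays fixed
 * while the remaining n-1 elements rotate, yielding n/2 disjoint
 * (p, q) pairs in each of the n-1 rounds (n even).
 * Writes pair k (0 <= k < n/2) of round r (0 <= r < n-1) into *p, *q. */
void pair_for_round(int n, int r, int k, int *p, int *q)
{
    int m = n - 1;               /* size of the rotating set */
    if (k == 0) {                /* fixed element meets one rotating element */
        *p = r;
        *q = m;
    } else {                     /* remaining pairs close the circle */
        *p = (r + k) % m;
        *q = (r - k + m) % m;
    }
}
```

Each GPU thread Tid = k can read its own (p, q) pair for the current round, so no two threads ever share a row or column index.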