Takagi Factorization on a GPU using CUDA
Gagandeep S. Sachdev, Vishay Vanjani & Mary W. Hall
School of Computing, University of Utah

What is Takagi Factorization?
• A special form of the singular value decomposition (SVD), applicable to complex symmetric matrices.
• Mathematically: A = U s U^T, where
• U = the matrix of left singular (Takagi) vectors
• s = a non-negative diagonal matrix holding the singular values
• U^T = the transpose of U

How is it different from SVD?
• The SVD is defined as A = U s V*; the left and right singular vectors are different matrices.
• Computing the Takagi vectors from a standard SVD is not trivial.
• Takagi factorization takes advantage of symmetry: less computational complexity and less storage.

Jacobi Method
• For a given pair (p, q), form a 2x2 symmetric matrix from a = A_pp, b = A_qq, c = A_pq.
• Diagonalize this matrix with a rotation J such that

  J^T [ a  c ] J = [ d1  0 ],   J = [  cos θ  sin θ ]
      [ c  b ]     [ 0  d2 ]        [ -sin θ  cos θ ]

• Apply the transformation to rows and columns p and q. [Diagram: only the p-th and q-th rows and columns of A are touched.]
• Apply such a unitary transform to all possible pairs: C(N,2) = N(N-1)/2 combinations. This finishes one sweep.
• Keep performing sweeps until the matrix converges; convergence is measured by the sum of absolute values of the off-diagonal elements.
• Convergence is quadratic in nature; overall complexity is O(N^3).

Applications
• Grunsky inequalities
• Computation of the near-best uniform polynomial
• Rational approximation of a high-degree polynomial on a disk
• Complex independent component analysis problems
• Complex coordinate rotation method
• Optical potential method
• Nuclear magnetic resonance
• Diagonalization of mass matrices of Majorana fermions

Parallel Jacobi Algorithm
• N/2 parallel unitary transforms are possible, since the order of rotations does not alter the final result; parallel threads cannot share a common index.
• This is done (N-1) times for one sweep.
• All row updates need to finish before column updates, to prevent race conditions: a thread's column entries lie in rows owned by other threads.
• Computation is now done on the full symmetric matrix.
• A mechanism is needed to generate unique N/2 pairs (N-1) times: cyclic chess-tournament scheduling, as sketched in the code after this section.

[Diagram: round-robin pairings for N = 10, one thread (Tid 0 to 4) per pair. Iteration 0: (0,9), (1,8), (2,7), (3,6), (4,5). Iteration 1: (1,9), (2,0), (3,8), (4,7), (5,6).]

GPU Execution
1. Copy the complex matrix A from the host to the GPU.
2. Kernel: initialize auxiliary data structures, the real diagonal matrix D and the complex Takagi vector matrix U.
3. Kernel: calculate the threshold.
4. Kernel: diagonalize and do row updates. (Host synchronization.)
5. Kernel: do column and unitary matrix updates. (Host synchronization.)
6. Repeat steps 4 and 5 (N-1) times, once per set of pairs; if convergence is not reached, return to step 3 for another sweep.
7. Kernel: make the diagonal elements non-negative and sort.
8. Copy U and D from the GPU.
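The poster only names its kernels by function, so the following is a minimal CUDA sketch of one parallel Jacobi round, simplified to a real symmetric matrix; the authors' complex-symmetric Takagi rotation would replace the angle computation below. All names (tournament_pair, rotate_rows, rotate_cols) are illustrative, not the actual kernel names, and n is assumed even.

// Cyclic chess-tournament scheduling: round r in [0, n-2] and slot
// t in [0, n/2-1] give a unique pair (p, q); no two slots in the same
// round share an index, so the n/2 rotations are independent.
// (For n = 10 this reproduces the pairings in the diagram above.)
#include <cmath>
#include <cuda_runtime.h>

__host__ __device__ void tournament_pair(int n, int r, int t, int *p, int *q)
{
    if (t == 0) { *p = r % (n - 1); *q = n - 1; return; }
    *p = (r + t) % (n - 1);
    *q = (r - t + (n - 1)) % (n - 1);
}

// Phase 1 ("diagonalize and row updates"): each thread owns one pair (p, q),
// computes the rotation that zeroes A[p][q], saves (cos, sin) for phase 2,
// and applies J^T from the left, touching only its own rows p and q.
__global__ void rotate_rows(double *A, double *cs, double *sn, int n, int r)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n / 2) return;
    int p, q;
    tournament_pair(n, r, t, &p, &q);
    double theta = 0.5 * atan2(2.0 * A[p * n + q],
                               A[q * n + q] - A[p * n + p]);
    double c = cos(theta), s = sin(theta);
    cs[t] = c; sn[t] = s;
    for (int j = 0; j < n; ++j) {
        double ap = A[p * n + j], aq = A[q * n + j];
        A[p * n + j] = c * ap - s * aq;
        A[q * n + j] = s * ap + c * aq;
    }
}

// Phase 2 ("column and unitary matrix updates"): runs only after ALL row
// updates have finished, because columns p and q cross rows owned by other
// threads. Applies J from the right and accumulates the vectors, U <- U J.
__global__ void rotate_cols(double *A, double *U, const double *cs,
                            const double *sn, int n, int r)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n / 2) return;
    int p, q;
    tournament_pair(n, r, t, &p, &q);
    double c = cs[t], s = sn[t];
    for (int i = 0; i < n; ++i) {
        double ap = A[i * n + p], aq = A[i * n + q];
        A[i * n + p] = c * ap - s * aq;
        A[i * n + q] = s * ap + c * aq;
        double up = U[i * n + p], uq = U[i * n + q];
        U[i * n + p] = c * up - s * uq;
        U[i * n + q] = s * up + c * uq;
    }
}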
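Continuing the same file, a host-side driver ties the two kernels into the sweep structure of the flow above. The convergence sum is computed on the host here purely for brevity; the poster computes a threshold in a kernel.

#include <vector>

void jacobi_sweeps(double *dA, double *dU, int n, double eps)
{
    double *dcs, *dsn;                        // per-pair (cos, sin) buffers
    cudaMalloc(&dcs, sizeof(double) * (n / 2));
    cudaMalloc(&dsn, sizeof(double) * (n / 2));
    std::vector<double> hA(n * n);
    int threads = 128, blocks = (n / 2 + threads - 1) / threads;
    double off = eps + 1.0;
    while (off > eps) {                       // one sweep per iteration
        for (int r = 0; r < n - 1; ++r) {     // (n-1) rounds of n/2 pairs
            rotate_rows<<<blocks, threads>>>(dA, dcs, dsn, n, r);
            cudaDeviceSynchronize();          // all rows before any column
            rotate_cols<<<blocks, threads>>>(dA, dU, dcs, dsn, n, r);
            cudaDeviceSynchronize();
        }
        cudaMemcpy(hA.data(), dA, sizeof(double) * n * n,
                   cudaMemcpyDeviceToHost);
        off = 0.0;                            // sum of |off-diagonal| entries
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (i != j) off += fabs(hA[i * n + j]);
    }
    cudaFree(dcs); cudaFree(dsn);
}

The host synchronization between the two launches is the correctness requirement the poster emphasizes: within each kernel a thread writes only the rows (phase 1) or only the columns (phase 2) of its own pair, so no further locking is needed.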
Other Algorithms
• Divide and conquer algorithm.
• Twisted factorization method.
• Both involve tridiagonalization followed by diagonalization; complexity O(N^2).
• Problems: they are not as parallelizable as Jacobi, and the Jacobi method is more stable (for example, on ill-conditioned matrices).

Related work
• Fast SVD on the 7900 GTX, a generation before CUDA-enabled GPUs (2005).
• Recently, SVD was implemented on a GPU using CUDA, but only for real matrices: bidiagonalization by Householder transforms, then diagonalization by the implicitly shifted QR algorithm. Complexity O(m n^2), with high storage requirements.
• Matlab and serial C implementations are available.

Results
[Figure: speedup of the GPUs and of Pthreads over the serial version for different matrix sizes.]

Conclusion
• The Jacobi method is the most stable and accurate parallel algorithm.
• Good speedups were achieved over the serial and Pthreads implementations.
• The Nvidia Fermi architecture gave much better speedups: non-coalesced reads are cached, and FLOP performance is much higher.
• The Pthreads implementation reaches a speedup of 1.64x over the serial CPU version.
• The NVIDIA GTX 260 reaches a speedup of 11.69x.
• The NVIDIA GTX 480 (Fermi) reaches a speedup of 30.80x.
