

Parallel Out-of-Core "Tall Skinny" QR
James Demmel, Mark Hoemmen, et al.
{demmel, mhoemmen}@eecs.berkeley.edu
EECS (Electrical Engineering and Computer Sciences), UC Berkeley; Berkeley Par Lab (Parallel Computing Laboratory)



Why out-of-core?
• Can solve huge problems without a huge machine
• Workstation as a cluster alternative:
• On a cluster, job-queue wait counts as "run time"
• Instant turnaround for the compile-run-debug cycle
• Full(-speed) access to graphical development tools
• Easier to get a grant for a < $5000 workstation than for a > $10M cluster!
• Frees precious, power-hungry DRAM for other tasks
• Great for embedded hardware: cell phones, robots, …

Applications of the TSQR algorithm
• Orthogonalization step in iterative methods for sparse systems: a bottleneck for new Krylov methods we are developing
• Block-column QR factorization: for QR of matrices stored in any 2-D block-cyclic format; currently a bottleneck in ScaLAPACK
• Solving linear systems: more flops, but less data motion, than LU; works for sparse and dense systems
• Solving least-squares optimization problems

Implementation and benchmark
• Linear tree with multithreaded BLAS and LAPACK (Intel MKL)
• Fixed the number of blocks at 10
• Varied the number of threads (only 2 cores available here, on an Itanium 2)
• Varied the number of rows (1000 to 10000) and the number of columns (100 to 1000) per block
• Compared with standard (in-core) multithreaded QR, which read the input matrix from disk using the same blocks as TSQR and wrote the Q factor to disk, just like TSQR

Preliminary results
(The poster's performance plots are not reproduced in this transcript; the Conclusions summarize them.)

Why parallel out-of-core?
• Less compute time per block frees processor time for other tasks
• The total amount of I/O is ideally independent of the number of cores
• With enough flops per datum, reads and writes can be hidden behind computation
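The linear-tree reduction used in the benchmark above is short to state in code. The following NumPy sketch is our own illustration, not the poster's implementation: it keeps only the running R factor in memory, whereas a real out-of-core code would read each block from disk and stream the implicit Q factors back out.

```python
import numpy as np

def tsqr_linear(blocks):
    """Sequential ("linear tree") TSQR.

    blocks: iterable of m_i x n row blocks (m_i >= n) of a tall
    skinny matrix A. Returns the n x n R factor of A. The Q factor
    stays implicit in the per-step factorizations, which an
    out-of-core code would write to disk instead of discarding.
    """
    it = iter(blocks)
    _, R = np.linalg.qr(next(it))            # factor the first block
    for A_i in it:
        # Stack the running R on top of the next block and re-factor;
        # only an n x n triangle plus one block is ever in core.
        _, R = np.linalg.qr(np.vstack([R, A_i]))
    return R
```

Since A = QR implies RᵀR = AᵀA, the result can be checked against an in-core factorization without ever forming Q.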
What does out-of-core want?
• For performance: non-blocking I/O operations, and a QoS guarantee on disk bandwidth per core
• For tuning: knowing the ideal disk bandwidth per core, being able to measure the actual disk bandwidth used per core, and awareness of I/O-buffering implementation details
• For ease of coding: automatic "(un)pickling" (that is, (un)packing of structures)

Some TSQR reduction trees (figure caption)
Top left: TSQR on a binary tree with 4 processors. Top right: TSQR on a linear tree with 1 processor. Bottom: TSQR on a "hybrid" tree with 4 processors. In all three diagrams, steps progress from left to right. The input matrix A is divided into blocks of rows. At each step, the blocks in core are highlighted in grey. Each grey box indicates a QR factorization; multiple blocks in a factorization are stacked. Arrows show data dependencies.

"Tall Skinny" QR (TSQR) algorithm
• For matrices with many more rows than columns (m >> n)
• Algorithm: divide the matrix into block rows, then perform a reduction over the blocks with QR factorization as the reduction operator; the result is R, and Q is stored implicitly in the reduction tree
• Can use any reduction tree: a binary tree minimizes the number of messages in the parallel case, a linear tree minimizes bandwidth costs in the sequential (out-of-core) case, and other trees optimize for other cases
• Possible parallelism: in the reduction tree (one processor per local factorization), and in the local factorizations (e.g., multithreaded BLAS)

Conclusions
• Out-of-core TSQR competes favorably with standard QR if the blocks are large enough and "square enough"
• Some benefit from multithreaded BLAS, but not much; this suggests trying a hybrid tree
• The development process was painful: data must be serialized manually
• Open question: how do we measure disk bandwidth per core?
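As a companion to the reduction-tree description above, here is a serial NumPy sketch (our own, purely illustrative) of the binary-tree reduction. In a parallel code, the factorizations within each level are independent and would run concurrently on different processors.

```python
import numpy as np

def tsqr_binary(blocks):
    """Binary-tree TSQR reduction, run serially for illustration.

    Each leaf block is factored locally; R factors are then merged
    pairwise, level by level, until a single n x n R remains. In
    the parallel case each level's QRs are independent tasks.
    """
    Rs = [np.linalg.qr(B)[1] for B in blocks]     # leaf factorizations
    while len(Rs) > 1:
        merged = []
        for i in range(0, len(Rs) - 1, 2):
            # Merge two n x n R factors by factoring their 2n x n stack.
            merged.append(np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]))[1])
        if len(Rs) % 2:                           # odd block: carry up a level
            merged.append(Rs[-1])
        Rs = merged
    return Rs[0]
```

The same RᵀR = AᵀA check as for the linear tree verifies the result; a hybrid tree, as the conclusions suggest, would mix linear runs within a node with a binary merge across nodes.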
