

Accelerating off-core matrix computation: the CSCF LU Decomposition

Timothy Blattner and Shujia Zhou

May 18, 2011



Acknowledgement

This project is sponsored by Lockheed Martin. We would like to thank Joseph Swartz, Sara Hritz, Michael Bellor, Sarah Sellman, and Kang Edward for sharing their insight on the CSCF code, and Milt Halem for his guidance.



Outline

  • Goal

  • Introduction

  • Design

  • Algorithm

  • Current Implementation

  • Results

  • Future Direction



Goal

  • Determine whether the use of GPGPU-based co-processors provides enough computational acceleration to reduce the overall time-to-solution.

  • Develop a generic off-load acceleration model



Introduction

  • Lower-upper Decomposition

    • Used to solve a system of linear equations

    • Solves Ax = LUx = b

  • Slab-based solution

    • The algorithm uses forward elimination, moving through the slabs from left to right

      • Followed by backward substitution, moving through the slabs from right to left

    • This is done using a triple-buffer scheme (see the sketch after this list):

      • Buffer one is the right hand side slab

      • Buffer two is the slab being read in

      • Buffer three is the slab for the current computation

    • Each element is a double-precision complex value
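To make the triple-buffer sweep concrete, here is a minimal host-side sketch in C. The routine names (read_slab, eliminate_slab) and the buffer sizes are stand-ins, not the actual CSCF routines, and unlike this synchronous sketch the real code overlaps the disk read with the computation.

    #include <stdlib.h>
    #include <complex.h>

    /* Hypothetical stand-ins for the real disk I/O and per-slab elimination. */
    static void read_slab(double complex *buf, int slab)                  { (void)buf; (void)slab; }
    static void eliminate_slab(double complex *slab, double complex *rhs) { (void)slab; (void)rhs; }

    int main(void) {
        const int    n_slabs    = 800;                 /* e.g. 1,000,000 rows / 1,250 columns */
        const size_t slab_elems = 1250UL * 10000UL;    /* toy slab size for the sketch */

        /* Buffer 1: right-hand-side slab; buffer 2: slab being read in;
           buffer 3: slab for the current computation. */
        double complex *rhs     = malloc(slab_elems * sizeof *rhs);
        double complex *reading = malloc(slab_elems * sizeof *reading);
        double complex *working = malloc(slab_elems * sizeof *working);

        read_slab(working, 0);                         /* prime the pipeline */
        for (int s = 0; s < n_slabs; ++s) {
            if (s + 1 < n_slabs)
                read_slab(reading, s + 1);             /* the real code overlaps this read with compute */
            eliminate_slab(working, rhs);              /* forward elimination on the current slab */
            double complex *tmp = working;             /* rotate: the read-ahead slab becomes current */
            working = reading;
            reading = tmp;
        }
        /* A backward-substitution sweep over the slabs, right to left, would follow. */
        free(rhs); free(reading); free(working);
        return 0;
    }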



Introduction

  • The computation is solved using a series of FORTRAN routines

    • The routines Update_slab and Factor_slab use:

      • BLAS

        • ZGEMM

        • ZTRSM

          • ZGEMM and ZTRSM accelerated on GPU
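As an illustration of what off-loading these two routines looks like, the sketch below expresses one panel update with cuBLAS. It assumes the cublas_v2 API and device-resident operands; the dimensions, names, and the specific ZTRSM configuration (left side, unit lower-triangular) are illustrative, not taken from Update_slab or Factor_slab.

    #include <cublas_v2.h>
    #include <cuComplex.h>

    /* Illustrative panel update: solve B := L^{-1} * B (ZTRSM), then update the
       trailing block C := C - A * B (ZGEMM).  d_L (k x k, unit lower-triangular),
       d_A (m x k), d_B (k x n) and d_C (m x n) are assumed to be device-resident. */
    void panel_update(cublasHandle_t h, int m, int n, int k,
                      const cuDoubleComplex *d_L, const cuDoubleComplex *d_A,
                      cuDoubleComplex *d_B, cuDoubleComplex *d_C)
    {
        const cuDoubleComplex one       = make_cuDoubleComplex( 1.0, 0.0);
        const cuDoubleComplex minus_one = make_cuDoubleComplex(-1.0, 0.0);

        /* Triangular solve against the k x n panel B. */
        cublasZtrsm(h, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                    CUBLAS_OP_N, CUBLAS_DIAG_UNIT,
                    k, n, &one, d_L, k, d_B, k);

        /* Rank-k update of the trailing m x n block. */
        cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &minus_one, d_A, m, d_B, k, &one, d_C, m);
    }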



Technical Challenge

  • The size of the GPU buffer is less than the size of one of the CPU slab buffers

    • Example:

      • 1 million unknowns

      • Contains 1 million rows and 1,250 columns per slab

      • Total of 800 slabs

      • Each slab is ~18 GB in size (see the arithmetic note after this list)

      • Largest GPU memory available is 6 GB (Tesla C2070)

  • Matrices are oblong

    • Number of rows is much larger than the number of columns
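For reference, the ~18 GB slab size quoted above follows directly from the element size: 1,000,000 rows × 1,250 columns × 16 bytes per double-precision complex element = 20 × 10^9 bytes ≈ 18.6 GiB per slab, roughly three times the 6 GB available on a Tesla C2070.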



Current Design

  • Solve ZTRSM and ZGEMM on the GPU

    • CUBLAS

      • CUDA-optimized versions of the BLAS routines

    • Domain Decomposition of ZGEMM into GPU buffers

    • A*B = C

[Slide diagram: A and C are streamed between a large A_CPU buffer in host memory and a smaller A_GPU buffer on the device; each COPY transfers a chunk equal to the size of the A_GPU buffer, repeated N times to cover the full matrices.]
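A minimal sketch of this domain decomposition is shown below: row panels of A and C are staged through fixed-size device buffers while B stays resident on the GPU. It assumes the cublas_v2 API and column-major storage; the function name, the panel size, and the decision to keep B on the device are assumptions for illustration.

    #include <cublas_v2.h>
    #include <cuComplex.h>

    /* Illustrative decomposition of C (m x n) = A (m x k) * B (k x n) into row
       panels of at most panel_rows rows, so each transfer fits the fixed device
       buffers d_A and d_C.  B (k x n) is assumed small enough to stay resident
       on the GPU.  Column-major storage throughout; all names are illustrative. */
    void panelled_zgemm(cublasHandle_t h, int m, int n, int k, int panel_rows,
                        const cuDoubleComplex *h_A,   /* host A, leading dimension m      */
                        const cuDoubleComplex *d_B,   /* device-resident B, leading dim k */
                        cuDoubleComplex *h_C,         /* host C, leading dimension m      */
                        cuDoubleComplex *d_A,         /* device buffer >= panel_rows * k  */
                        cuDoubleComplex *d_C)         /* device buffer >= panel_rows * n  */
    {
        const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
        const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

        for (int r = 0; r < m; r += panel_rows) {
            int pm = (m - r < panel_rows) ? (m - r) : panel_rows;   /* rows in this panel */

            /* COPY: stage the next row panel of A from the CPU buffer into the GPU buffer. */
            cublasSetMatrix(pm, k, sizeof(cuDoubleComplex), h_A + r, m, d_A, pm);

            /* C_panel = A_panel * B, computed entirely on the GPU. */
            cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                        pm, n, k, &one, d_A, pm, d_B, k, &zero, d_C, pm);

            /* Copy the finished row panel of C back into the CPU buffer. */
            cublasGetMatrix(pm, n, sizeof(cuDoubleComplex), d_C, pm, h_C + r, m);
        }
    }

Keeping B resident is possible because the slabs are oblong: B is only columns × columns, so it fits in GPU memory even when A and C do not.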



Algorithm

  • Four Phases:

    • Phase 1: Baseline Benchmark

    • Phase 2: Decompose GEMM

    • Phase 3: Square Matrix Decomposition

      • Demonstrates effectiveness of GPU on square matrices, and potentially utilizes Fermi’s concurrent kernel execution

    • Phase 4: Asynchronous Memory Copy

      • Opens the possibility of overlapping PCI Express transfers, potentially reducing the impact of that bottleneck (see the sketch below)
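A minimal sketch of the Phase 4 pattern, assuming page-locked host memory and two CUDA streams; the chunk layout, names, and two-buffer ping-pong are illustrative rather than the planned implementation.

    #include <cuda_runtime.h>
    #include <cuComplex.h>

    /* Illustrative ping-pong pattern: the host-to-device copy for chunk i+1 is issued
       on the alternate stream so it can overlap the work consuming chunk i.  The host
       buffer must be page-locked (cudaMallocHost) for the copies to be truly
       asynchronous; chunk sizes and the two-buffer layout are illustrative. */
    void stream_chunks(const cuDoubleComplex *h_src, size_t chunk_elems, int n_chunks,
                       cuDoubleComplex *d_buf[2], cudaStream_t stream[2])
    {
        for (int i = 0; i < n_chunks; ++i) {
            int s = i % 2;   /* alternate between the two device buffers / streams */
            cudaMemcpyAsync(d_buf[s], h_src + (size_t)i * chunk_elems,
                            chunk_elems * sizeof(cuDoubleComplex),
                            cudaMemcpyHostToDevice, stream[s]);
            /* ... the cuBLAS call consuming d_buf[s] would be enqueued on the same
               stream here (via cublasSetStream), keeping copy and compute ordered. */
        }
        cudaDeviceSynchronize();   /* wait for all outstanding copies and work */
    }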



Current Test Platform

  • NVIDIA GTX 460

    • 336 cores

    • 1 GB GDDR5

  • Intel Q6600

    • 4 cores

    • 2.4 GHz

  • 4 GB DDR2 – 800

  • 1 TB 7200 RPM Disk



Current Implementation

  • The baseline benchmark (Phase 1) and Phase 2 are complete for 10,000 unknowns

  • Implementation in Fortran

    • Uses the Fortran-to-C wrappers provided by NVIDIA for CUBLAS (see the sketch below)
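For context, such wrappers are thin C thunks that accept Fortran's pass-by-reference arguments and forward them to cuBLAS. The hand-rolled sketch below is only an approximation: the symbol name cscf_zgemm_, the trailing-underscore name mangling, the global handle, and the handling of only 'N'/'T' transpose flags are assumptions; NVIDIA's shipped fortran.c wrappers differ in detail.

    #include <cublas_v2.h>
    #include <cuComplex.h>

    /* Sketch of a Fortran-callable thunk around cuBLAS ZGEMM.  Fortran passes every
       argument by reference, so scalars are dereferenced before the call.  Device
       pointers are assumed to be passed straight through; a real wrapper would also
       manage allocation and host/device transfers. */
    static cublasHandle_t g_handle;   /* created once at start-up (not shown) */

    void cscf_zgemm_(const char *transa, const char *transb,
                     const int *m, const int *n, const int *k,
                     const cuDoubleComplex *alpha,
                     const cuDoubleComplex *A, const int *lda,
                     const cuDoubleComplex *B, const int *ldb,
                     const cuDoubleComplex *beta,
                     cuDoubleComplex *C, const int *ldc)
    {
        cublasOperation_t ta = (*transa == 'N' || *transa == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
        cublasOperation_t tb = (*transb == 'N' || *transb == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
        cublasZgemm(g_handle, ta, tb, *m, *n, *k, alpha, A, *lda, B, *ldb, beta, C, *ldc);
    }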



Results (Speedup)

Speedup Analysis (CPU time vs GPU time)



Results (PCI Express Time)

Slabs = Number of Rows / Number of Columns
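For the 1-million-unknown case described earlier, this gives 1,000,000 rows / 1,250 columns = 800 slabs.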



Results (PCI Express Impact)



Results (Wall Time)



Future Direction

  • Execute Phase 1 and 2 for up to 1 million unknowns

    • Run on a Tesla C2070 on the UMBC Bluegrit cluster

  • Implement Phase 3 and benchmark

  • Implement Phase 4 and benchmark

  • Investigate Factor_slab routine for speedup

