
CUBLAS Library




  1. CUBLAS Library Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

  2. What is the CUBLAS Library? • BLAS • Basic Linear Algebra Subprograms • A library specification for basic linear algebra operations • Divided into three levels • Implementations include MKL BLAS, CUBLAS, C++ AMP BLAS, … • CUBLAS • A high-level implementation of BLAS on top of the NVIDIA CUDA runtime • Runs on a single GPU or multiple GPUs • Supports CUDA streams

  3. Three Levels of BLAS • Level 1: vector-vector operations of the form y ← αx + y • Level 2: matrix-vector operations of the form y ← αAx + βy • Level 3: matrix-matrix operations of the form C ← αAB + βC

  4. Why do we need CUBLAS? • Full support for all 152 standard BLAS routines • Supports single-precision, double-precision, complex and double-complex data types • Supports CUDA streams • Fortran bindings • Support for multiple GPUs and concurrent kernels • Very efficient

  5. Why do we need CUBLAS?

  6. Getting Started • Basic preparation • Install the CUDA Toolkit • Include cublas_v2.h • Link cublas.lib • Some basic tips • Every CUBLAS function needs a handle • All CUBLAS calls must be made between cublasCreate() and cublasDestroy() • Every CUBLAS function returns a cublasStatus_t to report the state of execution • Matrices are stored in column-major order • References • http://cudazone.nvidia.cn/cublas/ • CUDA Toolkit 5.0 CUBLAS Library.pdf

  7. CUBLAS Data Types • cublasHandle_t • cublasStatus_t • CUBLAS_STATUS_SUCCESS • CUBLAS_STATUS_NOT_INITIALIZED • CUBLAS_STATUS_ALLOC_FAILED • CUBLAS_STATUS_INVALID_VALUE • CUBLAS_STATUS_ARCH_MISMATCH • CUBLAS_STATUS_MAPPING_ERROR • CUBLAS_STATUS_EXECUTION_FAILED • CUBLAS_STATUS_INTERNAL_ERROR

  8. CUBLAS Data Types • cublasOperation_t • CUBLAS_OP_N (the non-transpose operation) • CUBLAS_OP_T (the transpose operation) • CUBLAS_OP_C (the conjugate transpose operation)

  9. CUBLAS Data Types • cublasFillMode_t • CUBLAS_FILL_MODE_LOWER (the lower part of the matrix is filled) • CUBLAS_FILL_MODE_UPPER (the upper part of the matrix is filled) • cublasSideMode_t • CUBLAS_SIDE_LEFT (the matrix is on the left side of the equation) • CUBLAS_SIDE_RIGHT (the matrix is on the right side of the equation)

  10. CUBLAS Data Types • cublasPointerMode_t • CUBLAS_POINTER_MODE_HOST (scalars are passed by reference on the host) • CUBLAS_POINTER_MODE_DEVICE (scalars are passed by reference on the device) • cublasAtomicsMode_t • CUBLAS_ATOMICS_NOT_ALLOWED (atomics are not used) • CUBLAS_ATOMICS_ALLOWED (atomics may be used for faster, but not bit-wise reproducible, results)

  11. Example Code
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h" // header required by CUBLAS
#define M 6
#define N 5
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1)) // 1-based column-major array index

static __inline__ void modify(cublasHandle_t handle, float *m, int ldm, int n,
                              int p, int q, float alpha, float beta)
{
    cublasSscal(handle, n-q+1, &alpha, &m[IDX2F(p,q,ldm)], ldm); // scale row p from column q onward
    cublasSscal(handle, ldm-p+1, &beta, &m[IDX2F(p,q,ldm)], 1);  // scale column q from row p onward
}

  12. Example Code
int main(void){
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
    int i, j;
    float *devPtrA;
    float *a = 0;
    a = (float*)malloc(M*N*sizeof(*a)); // allocate the array on the host
    if (!a) {
        printf("host memory allocation failed");
        return EXIT_FAILURE;
    }

  13. Example Code
    for (j = 1; j <= N; j++) { // initialize the array
        for (i = 1; i <= M; i++) {
            a[IDX2F(i,j,M)] = (float)((i-1)*M+j);
        }
    }
    cudaStat = cudaMalloc((void**)&devPtrA, M*N*sizeof(*a)); // allocate memory on the device
    if (cudaStat != cudaSuccess) {
        printf("device memory allocation failed");
        return EXIT_FAILURE;
    }
    stat = cublasCreate(&handle); // initialize the CUBLAS context

  14. Example Code
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf("CUBLAS initialization failed\n");
        return EXIT_FAILURE;
    }
    stat = cublasSetMatrix(M, N, sizeof(*a), a, M, devPtrA, M); // copy data from host to device
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf("data download failed");
        cudaFree(devPtrA);
        cublasDestroy(handle);
        return EXIT_FAILURE;
    }
    modify(handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);
    stat = cublasGetMatrix(M, N, sizeof(*a), devPtrA, M, a, M); // copy data from device back to host

  15. Example Code
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf("data upload failed");
        cudaFree(devPtrA);
        cublasDestroy(handle);
        return EXIT_FAILURE;
    }
    cudaFree(devPtrA);     // free device memory
    cublasDestroy(handle); // shut down the CUBLAS context
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            printf("%7.0f", a[IDX2F(i,j,M)]);
        }
    }
    return EXIT_SUCCESS;
}

  16. Matrix Multiply • Uses a level-3 function • Function prototype • cublasStatus_t cublasSgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const float *alpha, const float *A, int lda, const float *B, int ldb, const float *beta, float *C, int ldc) • Computes C = α op(A) op(B) + β C, where op(X) is X, Xᵀ or Xᴴ depending on transa/transb

  17. Matrix Multiply

  18. Matrix Multiply
int MatrixMulbyCUBLAS(float *A, float *B, int HA, int WB, int WA, float *C){
    float *d_A, *d_B, *d_C;
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_A, WA*HA*sizeof(float)));
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_B, WB*WA*sizeof(float)));
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_C, WB*HA*sizeof(float)));
    CUDA_SAFE_CALL(cudaMemcpy(d_A, A, WA*HA*sizeof(float), cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMemcpy(d_B, B, WB*WA*sizeof(float), cudaMemcpyHostToDevice));
    cublasStatus_t status;
    cublasHandle_t handle;
    status = cublasCreate(&handle);
    if (status != CUBLAS_STATUS_SUCCESS) {
        printf("CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }

  19. Matrix Multiply
    int devID;
    cudaDeviceProp props;
    CUDA_SAFE_CALL(cudaGetDevice(&devID));
    CUDA_SAFE_CALL(cudaGetDeviceProperties(&props, devID));
    printf("Device %d: \"%s\" with Compute %d.%d capability\n",
           devID, props.name, props.major, props.minor);
    const float alpha = 1.0f;
    const float beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                &alpha, d_B, WB, d_A, WA, &beta, d_C, WB); // level-3 function
    CUDA_SAFE_CALL(cudaMemcpy(C, d_C, WB*HA*sizeof(float), cudaMemcpyDeviceToHost));
    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}

  20. The Result

  21. Some New Features • The handle to the CUBLAS library context is initialized using the cublasCreate function and is explicitly passed to every subsequent library function call. This gives the user more control over the library setup when using multiple host threads and multiple GPUs. • The scalars α and β can be passed by reference on the host or the device, instead of only being allowed to be passed by value on the host. This change allows library functions to execute asynchronously using streams even when α and β are generated by a previous kernel.

  22. Some New Features • When a library routine returns a scalar result, it can be returned by reference on the host or the device, instead of only by value on the host. This change allows library routines to be called asynchronously when the scalar result is generated and returned by reference on the device, resulting in maximum parallelism.

  23. Stream • Stream • Concurrent execution between host and device • Overlap of data transfer and kernel execution • Available on devices of compute capability 1.1 or higher • Hides data-transfer time • Rules • Functions in the same stream execute sequentially • Functions in different streams may execute concurrently • References • http://cudazone.nvidia.cn/ • CUDA C Programming Guide.pdf

  24. Parallelism with Streams • Create and set the stream to be used by each CUBLAS routine • Call cudaStreamCreate() to create the different streams • Call cublasSetStream() to set the stream to be used by each individual CUBLAS routine • Use the asynchronous transfer function • cudaMemcpyAsync()

  25. Parallelism with Streams
start = clock();
for (int i = 0; i < nstreams; i++) {
    cudaMemcpy(d_A, A, WA*HA*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, WB*WA*sizeof(float), cudaMemcpyHostToDevice);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                &alpha, d_B, WB, d_A, WA, &beta, d_C, WB);
    cudaMemcpy(C, d_C, WB*HA*sizeof(float), cudaMemcpyDeviceToHost);
}
end = clock();
printf("GPU without streams time: %.2f seconds.\n", (double)(end-start)/CLOCKS_PER_SEC);

  26. Parallelism with Streams
start = clock();
for (int i = 0; i < nstreams; i++) {
    cudaMemcpyAsync(d_A, A, WA*HA*sizeof(float), cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(d_B, B, WB*WA*sizeof(float), cudaMemcpyHostToDevice, streams[i]);
    cublasSetStream(handle, streams[i]);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                &alpha, d_B, WB, d_A, WA, &beta, d_C, WB);
    cudaMemcpyAsync(C, d_C, WB*HA*sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
}
end = clock();
printf("GPU with streams time: %.2f seconds.\n", (double)(end-start)/CLOCKS_PER_SEC);

  27. The Result

  28. Review • What is the core functionality of BLAS and CUBLAS? • What are the advantages of CUBLAS? • Why is the handle important in CUBLAS? • How do you perform matrix multiplication using CUBLAS? • How is a matrix stored in CUBLAS? • How do you use CUBLAS with streams? • What can we do with CUBLAS in our research?
