
Micro Benchmarking CUDA Final Project Presentation


Presentation Transcript


  1. Steven Sparks, Kibeom Kim. Micro Benchmarking CUDA Final Project Presentation

  2. Micro Benchmarking - Objectives • Measure the texture cache parameters • Latency • Capacity • Associativity • Measure the memory bandwidth • Texture memory (global and cache) • Normal global memory • Shared memory

  3. Measuring Cache Size

  4. Measuring Cache Size (9600M GT) [plot: access time (ms) vs. n, the number of floats accessed]

  5. Texture Cache • "Texture caches are used to save bandwidth and power only - in contrast to CPU caches which are also essential for lowering latency"http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=10 • "For example, some programs can exceed the maximum theoretical memory bandwidth of the underlying global memory through judicious use of the texture memory cache. While the latency of texture cache reference is generally the same as DRAM"  • http://www.ddj.com/architect/218100902

  6. Stride

  7. Compact Maximum Count [plot: maximum compact count (320, 160, 80, 40, 20) vs. stride]

  8. Cache Model

  9. Cache Model

  10. Cache Model

  11. Cache Model

  12. L1 cache of 9600M GT • Size: 640 * sizeof(float) = 2.5 KB • Associativity: 20 • Associative offset: 32 * sizeof(float) = 128 bytes • Block size: Size / 80 = 32 bytes
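
  These numbers are mutually consistent under the usual set-associative model. A minimal host-side sanity check (plain C compiled by nvcc; the variable names and the derivation of the set count are our own, not taken from the slides):

      #include <assert.h>
      #include <stdio.h>

      int main(void)
      {
          /* Measured parameters for the 9600M GT texture L1 cache. */
          const int cacheBytes    = 640 * sizeof(float);   /* 2560 B = 2.5 KB */
          const int associativity = 20;                    /* ways per set    */
          const int blockBytes    = 32;                    /* cache line size */

          /* Derived quantities. */
          const int numLines  = cacheBytes / blockBytes;      /* 80 lines                   */
          const int numSets   = numLines / associativity;     /* 4 sets                     */
          const int setStride = cacheBytes / associativity;   /* 128 B "associative offset" */

          assert(numLines == 80 && numSets == 4 && setStride == 128);
          printf("lines=%d sets=%d setStride=%d bytes\n", numLines, numSets, setStride);
          return 0;
      }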

  13. Cache 2D Locality and Associativity • power of 2 x power of 2 • 4 byte (RGBA) x power of 2

  14. Measuring Total Access Time
      __global__ void CheckSet(int stride, int num, int N)
      {
          float temp1;
          __shared__ float temp2;
          int j0 = stride * (num - 1);                    // highest index touched
          for (int i = count; i >= 0; --i)                // count: compile-time repeat constant
          {
              for (int index = j0; index >= 0; index -= stride)
              {
                  temp1 = tex1Dfetch(texRef, index);      // one texture read per stride step
              }
          }
          temp2 = temp1;                                  // shared-memory write keeps the
                                                          // compiler from removing the loops
      }

  15. Measuring Loop Overhead Time
      __global__ void CheckSet(int stride, int num, int N)
      {
          float temp1, temp3 = 1.0f;
          __shared__ float temp2;
          int j0 = stride * (num - 1);
          for (int i = count; i >= 0; --i)
          {
              for (int index = j0; index >= 0; index -= stride)
              {
                  temp1 = temp3;                          // register copy replaces the texture read
              }
          }
          temp2 = temp1;
      }

  16. Measuring Exact Access Time • Texture Access Function - Register Access Function = • (Loop + Texture Access) - (Loop + Register Access) = • Texture Access - Register Access
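
  A minimal host-side sketch of this subtraction using CUDA events. It assumes the two kernels above are compiled under distinct names (CheckSet / CheckSetOverhead is our renaming, since both slides call them CheckSet), that texRef is already bound, and that a single-thread launch matches the single-threaded measurement:

      // Returns (loop + texture access) - (loop + register access) in milliseconds.
      float measure_access_ms(int stride, int num, int N)
      {
          cudaEvent_t start, stop;
          cudaEventCreate(&start);
          cudaEventCreate(&stop);

          cudaEventRecord(start);
          CheckSet<<<1, 1>>>(stride, num, N);             // texture-access kernel (slide 14)
          cudaEventRecord(stop);
          cudaEventSynchronize(stop);
          float texMs;
          cudaEventElapsedTime(&texMs, start, stop);

          cudaEventRecord(start);
          CheckSetOverhead<<<1, 1>>>(stride, num, N);     // register-access kernel (slide 15)
          cudaEventRecord(stop);
          cudaEventSynchronize(stop);
          float loopMs;
          cudaEventElapsedTime(&loopMs, start, stop);

          cudaEventDestroy(start);
          cudaEventDestroy(stop);
          return texMs - loopMs;                          // texture access - register access
      }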

  17. Binary Search Example
      stride : 32. num : 256. time : 45.483139 ms. It is not Compact
      stride : 32. num : 128. time : 45.382069 ms. It is not Compact
      stride : 32. num : 64. time : 45.726257 ms. It is not Compact
      stride : 32. num : 32. time : 45.369858 ms. It is not Compact
      stride : 32. num : 16. time : 33.000290 ms. It is Compact
      stride : 32. num : 24. time : 45.356770 ms. It is not Compact
      stride : 32. num : 20. time : 33.172470 ms. It is Compact
      stride : 32. num : 22. time : 45.362503 ms. It is not Compact
      stride : 32. num : 21. time : 45.405190 ms. It is not Compact
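
  The trace above can be driven by a short host-side binary search. A minimal sketch, assuming an is_compact(stride, num) helper that times the two kernels, prints one line of the trace, and compares the difference against a known-compact baseline (that helper and its threshold are our assumptions):

      // Returns the smallest num that is no longer compact, given that
      // lo is known to be compact and hi is known to be non-compact.
      int first_non_compact(int stride, int lo, int hi)
      {
          while (hi - lo > 1)
          {
              int mid = (lo + hi) / 2;
              if (is_compact(stride, mid))
                  lo = mid;                // still fits: search the upper half
              else
                  hi = mid;                // first miss is at or below mid
          }
          return hi;
      }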

  18. Parallelization - Design Considerations • Texture cache exists on each multiprocessor • Test each set on a different multiprocessor • The number of concurrent tests must be equal to or lower than the number of multiprocessors • Since we cannot measure the time for each block to execute, we can only determine whether all sets are compact • If compactness fails, we do not know which sets are compact

  19. Parallelization - Implementation • Two loops in the single-threaded algorithm are candidates for parallelization • Both have a fixed stride • The first loop starts with a stride of one and a length of one and keeps doubling the length until a non-compact set is found • The second loop uses some stride and finds the smallest length that makes a non-compact set • Only one function is needed • The function needs to take any stride as an input • An input parameter determines whether to double the length or increase it by one • Returns the first length that makes a non-compact set

  20. Parallelization - Host is_compact • The single-threaded is_compact simply creates a cudaArray of length: •     stride * (length - 1) + 1 • It then binds a texture to this array and measures the time the kernel takes to execute • The multi-threaded version receives a list of lengths and a list of strides • Loops through the lists and finds the longest cudaArray needed • Creates the cudaArray of that length • Supplies the kernel with the lists • Times the execution of the kernel (see the sketch below)
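
  A minimal sketch of that multi-block host driver, assuming texRef is declared at file scope as texture<float, 1, cudaReadModeElementType> and using linear device memory bound with cudaBindTexture (the slides mention a cudaArray, but tex1Dfetch in the check_set kernel on the next slide reads from linear memory, so this sketch takes the linear-memory binding path):

      // Launches one block per (stride, length) pair and times the whole batch in ms.
      float is_compact_batch(const int* strides, const int* lengths, int numSets)
      {
          // The longest index any set touches decides the texture size.
          int maxElems = 0;
          for (int s = 0; s < numSets; ++s)
          {
              int needed = strides[s] * (lengths[s] - 1) + 1;
              if (needed > maxElems) maxElems = needed;
          }

          float* d_data;
          cudaMalloc(&d_data, maxElems * sizeof(float));
          cudaMemset(d_data, 0, maxElems * sizeof(float));
          cudaBindTexture(0, texRef, d_data, maxElems * sizeof(float));

          int *d_strides, *d_lengths;
          cudaMalloc(&d_strides, numSets * sizeof(int));
          cudaMalloc(&d_lengths, numSets * sizeof(int));
          cudaMemcpy(d_strides, strides, numSets * sizeof(int), cudaMemcpyHostToDevice);
          cudaMemcpy(d_lengths, lengths, numSets * sizeof(int), cudaMemcpyHostToDevice);

          cudaEvent_t start, stop;
          cudaEventCreate(&start);
          cudaEventCreate(&stop);
          cudaEventRecord(start);
          check_set<<<numSets, 1>>>(d_strides, d_lengths);   // one block per multiprocessor/set
          cudaEventRecord(stop);
          cudaEventSynchronize(stop);

          float ms;
          cudaEventElapsedTime(&ms, start, stop);

          cudaUnbindTexture(texRef);
          cudaFree(d_data); cudaFree(d_strides); cudaFree(d_lengths);
          cudaEventDestroy(start); cudaEventDestroy(stop);
          return ms;
      }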

  21. Parallelization - Kernel
      • Each block looks at a different stride and length
      • Each block will have a different maxIndex
      __global__ void check_set(int* stride, int* length)
      {
          int i, index;
          float temp;
          // get the stride and length for this set
          int S = stride[blockIdx.x];
          int N = length[blockIdx.x];
          // determine the max index that is to be accessed by this set
          int maxIndex = S * (N - 1) + 1;
          __shared__ float finish;
          temp = 0;
          index = 0;
          for (i = 0; i < LOOP_COUNT; i++)
          {
              // call on texture memory
              temp += tex1Dfetch(texRef, index);
              // update with index of next memory access
              index += S;
              if (index >= maxIndex)
              {
                  index = 0;
              }
          } // for (i = 0; i < LOOP_COUNT; i++)
          finish = temp;
      } // check_set

  22. Parallelization - First non-compact length • Keeping the stride fixed, increase the length until a non-compact set is found • Start with a batch of sets equal to the number of multiprocessors • If all are compact, start with the next length and try again • When a non-compact set is found, switch to binary search (see the sketch below)
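
  A minimal sketch of that batched scan, reusing the is_compact_batch and first_non_compact helpers sketched above; the multiprocessor count and the compactness threshold are assumptions for illustration only:

      #define NUM_SM 4   // multiprocessor count of the 9600M GT (assumption for this sketch)

      // Scans lengths in batches of NUM_SM until some set stops being compact,
      // then hands the bracketing interval to the binary search.
      int find_non_compact_length(int stride, float compactBaselineMs)
      {
          int base = 1;
          for (;;)
          {
              int strides[NUM_SM], lengths[NUM_SM];
              for (int s = 0; s < NUM_SM; ++s)
              {
                  strides[s] = stride;
                  lengths[s] = base + s;               // NUM_SM consecutive lengths per launch
              }

              float ms = is_compact_batch(strides, lengths, NUM_SM);
              if (ms > compactBaselineMs * 1.2f)       // assumed threshold: some set missed the cache
                  return first_non_compact(stride, base - 1, base + NUM_SM - 1);

              base += NUM_SM;                          // whole batch compact: try the next lengths
          }
      }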

  23. Measuring Cache Size (9600M GT) [plot: access time (ms) vs. n, the number of floats accessed]

  24. L1 cache of 9600M GT • Size: 640 * sizeof(float) = 2.5 KB • Associativity: 20 • Associative offset: 32 * sizeof(float) = 128 bytes • Block size: Size / 80 = 32 bytes

  25. Memory Bandwidth - Objectives • Determine the maximum rate of transfer of data from memory to processing units • Limit overhead • Analyze texture memory - global and cache • Analyze global memory • Analyze shared memory • Accuracy top priority • Speed low priority

  26. Memory Bandwidth - Algorithm • Access the memory as much as possible • Increase the number of threads accessing the memory until saturation occurs

  27. Memory Bandwidth - Design Considerations • Compiler optimizations • Each value read from memory is added to a temp variable, and before the kernel exits the sum is written to shared memory • This keeps the compiler from removing the memory accesses and loops • Loop overhead • Could have measured an idle loop and subtracted it from the measurements • This made very little difference in accuracy due to the high number of iterations

  28. Memory Bandwidth - Design

  29. Memory Bandwidth - Design • Outputs to a CSV file • Number of threads ranges from 32 to 512 in intervals of 32 • Number of blocks ranges from 1 to max blocks • GTX 260 - 512 blocks • GeForce 9300 - 128 blocks • The CSV output contains every combination of the above and the number of bytes transferred per second (GB/s) • Throughput calculated as: Throughput = (1000 * noThreads * noBlocks) / (accessTimer * 1073741824)
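
  A minimal sketch of the sweep described above (a sketch, not the authors' code): it times the texture-bandwidth kernel from the next slide for every thread/block combination and writes one CSV row each. LOOP_COUNT's value and the output file name are our assumptions; throughput here is computed explicitly from bytes moved (LOOP_COUNT four-byte reads per thread), whereas the slide states its formula in terms of accessTimer:

      #include <stdio.h>

      #define LOOP_COUNT 100000     // assumed iteration constant, as used in the kernels
      #define MAX_BLOCKS 512        // GTX 260 configuration from this slide

      void bandwidth_sweep(void)
      {
          FILE* csv = fopen("bandwidth.csv", "w");
          fprintf(csv, "threads,blocks,GBps\n");

          cudaEvent_t start, stop;
          cudaEventCreate(&start);
          cudaEventCreate(&stop);

          for (int threads = 32; threads <= 512; threads += 32)
          {
              for (int blocks = 1; blocks <= MAX_BLOCKS; ++blocks)
              {
                  cudaEventRecord(start);
                  bandwidth_kernel<<<blocks, threads>>>();   // texture version (next slide)
                  cudaEventRecord(stop);
                  cudaEventSynchronize(stop);

                  float ms;
                  cudaEventElapsedTime(&ms, start, stop);

                  // Bytes moved: every thread performs LOOP_COUNT 4-byte reads.
                  double bytes = (double)threads * blocks * LOOP_COUNT * sizeof(float);
                  double gbps  = bytes / (ms / 1000.0) / 1073741824.0;
                  fprintf(csv, "%d,%d,%f\n", threads, blocks, gbps);
              }
          }

          cudaEventDestroy(start);
          cudaEventDestroy(stop);
          fclose(csv);
      }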

  30. Texture Memory Bandwidth - Implementation
      __global__ void bandwidth_kernel()
      {
          float temp = 0;
          __shared__ float finished;
          int i;
          for (i = LOOP_COUNT; i != 0; --i)
          {
              temp += tex1Dfetch(texRef, 0);
          }
          finished = temp;
      }
      • Accesses one single address multiple times

  31. Global Memory Bandwidth - Implementation
      __global__ void global_kernel(float* data)
      {
          float temp = 0;
          __shared__ float finished;
          int i;
          float* ptr = &data[threadIdx.x];
          for (i = LOOP_COUNT; i != 0; --i)
          {
              temp += *ptr;
          }
          finished = temp;
      }
      • Each thread accesses the same memory location multiple times
      • Uses a pointer to that address to save the offset calculation

  32. Shared Memory Bandwidth - Implementation
      __global__ void shared_kernel(float* data)
      {
          __shared__ float finished;
          __shared__ float data_s;
          float temp = 0;
          int i;
          if (threadIdx.x == 0)
          {
              data_s = data[threadIdx.x];
          }
          __syncthreads();
          for (i = LOOP_COUNT; i != 0; --i)
          {
              temp += data_s;
          }
          finished = temp;
      }
      • Thread 0 loads the global value into shared memory
      • Each thread accesses shared memory multiple times

  33. Memory Bandwidth - Results • Shared memory has lower latency than global memory but slightly lower measured bandwidth • Texture memory has the lowest bandwidth

  34. Memory Bandwidth - Results

  35. Memory Bandwidth - Results

  36. Memory Bandwidth - Results

  37. Memory Bandwidth - Results • GTX 260 memory specifications from the NVIDIA website • Measured result: ~107 GB/s
