
Paraprox: Pattern-Based Approximation for Data Parallel Applications

Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. University of Michigan, March 2014. University of Michigan Electrical Engineering and Computer Science.


Presentation Transcript


  1. Paraprox: Pattern-Based Approximation for Data Parallel Applications Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke University of Michigan March 2014 University of Michigan Electrical Engineering and Computer Science Compilers Creating Custom Processors

  2. Approximate Computing • 100% accuracy is not always necessary • Less work • Better performance • Lower power consumption • There are many domains where approximate output is acceptable

  3. Data Parallelism is Everywhere • Domains: Financial Modeling, Medical Imaging, Physics Simulation, Audio Processing, Machine Learning, Games, Image Processing, Statistics, Video Processing • Mostly regular applications • Work on large data sets • Exact output is not required for operation → Good opportunity for automatic approximation

  4. Approximating KMeans


  9. Approximating KMeans Approximating alone is not enough; we need a way to control the output quality.

  10. Approximate Computing • Ask the programmer to do it • Not easy / practical • Hard to debug • Automatic approximation • One solution does not fit all • Paraprox: pattern-based approximation • Pattern-specific approximation methods • Provides knobs to control the output quality

  11. Common Patterns • Map: Signal Processing, Physics, … • Partitioning: Image Processing, Finance, … • Reduction: Machine Learning, Physics, … • Scatter/Gather: Machine Learning, Search, … • Stencil: Image Processing, Physics, … • Scan: Statistics, … M. McCool et al., "Structured Parallel Programming: Patterns for Efficient Computation," Morgan Kaufmann, 2012.

  12. Paraprox Parallel Program (OpenCl/CUDA) Paraprox Approximation Methods Pattern Detection Runtime system Approximate Kernels Tuning Parameters

  13. Common Patterns

  14. Approximate Memoization BlackScholes

  15. Approximate Memoization • Identify candidate functions • Find the table size • Check the quality • Determine qi for each input • Fill the table • Execution

  16. Candidate Functions • Pure functions do not: • read or write any global or static mutable state. • call an impure function. • perform I/O. • In CUDA/OpenCL: • No global/shared memory access • No thread ID dependent computation
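The purity criteria on this slide can be illustrated with a small Python stand-in (the actual Paraprox analysis works on CUDA/OpenCL kernels; these function names are hypothetical examples, not from the paper):

```python
import math

# Pure: the output depends only on the arguments -- no global or static
# mutable state, no impure calls, no I/O. A valid memoization candidate.
def gaussian(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

counter = 0

# Impure: reads and writes module-level state, so the result cannot be
# reproduced from the arguments alone -- not a memoization candidate.
def stamped(x):
    global counter
    counter += 1
    return x + counter
```

Calling `gaussian` twice with the same arguments always yields the same value, which is exactly what makes a precomputed lookup table sound; `stamped` returns a different value on each call.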

  17. Table Size [Plot: output quality vs. speedup for lookup tables of 16K, 32K, and 64K entries]

  18. How Many Bits per Input? Table size = 32KB → 15 address bits split across inputs A, B, C. Inputs that do not need high precision get fewer bits. Explored allocations (A, B, C → output quality): (5, 5, 5) → 95.2%; (6, 4, 5) → 96.5%; (4, 6, 5) → 91.3%; (5, 6, 4) → 95.4%; (5, 4, 6) → 91.2%; (6, 5, 4) → 95.1%; (4, 7, 4) → 95.4%; (5, 7, 3) → 95.8%.
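A minimal sketch of the indexing scheme behind this slide: each input is quantized to a per-input bit budget, and the concatenated indices address a precomputed table. This is an illustrative Python model with hypothetical helper names, not Paraprox's actual CUDA/OpenCL transform:

```python
def quantize(value, lo, hi, bits):
    """Map value in [lo, hi) to an integer index with 2**bits levels."""
    levels = 1 << bits
    frac = (value - lo) / (hi - lo)
    return min(levels - 1, max(0, int(frac * levels)))

def build_table(func, ranges, bit_budget):
    """Precompute func over every quantized input combination."""
    table = {}
    def rec(idx, key):
        if idx == len(ranges):
            # Evaluate func at the midpoint of each quantized cell
            args = [lo + (k + 0.5) * (hi - lo) / (1 << b)
                    for k, (lo, hi), b in zip(key, ranges, bit_budget)]
            table[key] = func(*args)
            return
        for k in range(1 << bit_budget[idx]):
            rec(idx + 1, key + (k,))
    rec(0, ())
    return table

def lookup(table, ranges, bit_budget, *inputs):
    """Approximate func(*inputs) by a quantized table read."""
    key = tuple(quantize(v, lo, hi, b)
                for v, (lo, hi), b in zip(inputs, ranges, bit_budget))
    return table[key]
```

Giving one input more bits shrinks its quantization error at the expense of the others, which is the tradeoff the quality numbers on the slide explore.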

  19. Common Patterns

  20. Tile Approximation [Plot: value difference between each element and its neighbors]

  21. Stencil/Partitioning • Nine loads per output element: C = Input[i][j], W = Input[i][j-1], E = Input[i][j+1], NW = Input[i-1][j-1], N = Input[i-1][j], NE = Input[i-1][j+1], SW = Input[i+1][j-1], S = Input[i+1][j], SE = Input[i+1][j+1] • Paraprox looks for global/texture/shared load accesses to arrays with affine addresses • Control the output quality by changing the number of accesses per tile


  24. Stencil/Partitioning Final approximation: all nine loads (NW, N, NE, W, C, E, SW, S, SE) are replaced by the center value C = Input[i][j], so one load serves the whole tile. • Paraprox looks for global/texture/shared load accesses to arrays with affine addresses • Control the output quality by changing the number of accesses per tile
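The tile approximation on these slides can be sketched for a 3x3 mean filter. This is a sequential Python stand-in for the CUDA/OpenCL transform, with hypothetical function names: the exact version reads all nine neighbors, the approximate one replaces every neighbor load with the tile's center element, so each output costs one load instead of nine.

```python
def mean3x3_exact(img, i, j):
    # Exact 3x3 mean filter: nine loads per output element
    return sum(img[i + di][j + dj]
               for di in (-1, 0, 1) for dj in (-1, 0, 1)) / 9.0

def mean3x3_approx(img, i, j):
    # Tile approximation: the center value stands in for all nine
    # neighbors, so the mean collapses to the center load itself.
    return img[i][j]
```

The quality knob the slide mentions is the number of real accesses kept per tile: keeping more of the nine loads moves the result back toward the exact filter, at the cost of more memory traffic.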

  25. Common Patterns

  26. Scan / Prefix Sum • Prefix sum • Uses: cumulative histogram, list ranking, … • Data-parallel implementation: divide the input into smaller subarrays and compute the prefix sum of each subarray in parallel

  27. Data Parallel Scan Example: an input of sixteen 1s split into four subarrays. Phase I: scan each subarray in parallel → 1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4, and collect each subarray's sum → 4 4 4 4. Phase II: scan the sums → 4 8 12 16. Phase III: add each scanned sum to the following subarray → 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16.

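The three-phase scan on this slide can be sketched sequentially (on the GPU, Phase I's per-block scans run in parallel; the function name here is illustrative):

```python
def blocked_scan(xs, block):
    blocks = [xs[i:i + block] for i in range(0, len(xs), block)]

    # Phase I: inclusive scan inside each block (parallel on the GPU)
    scanned = []
    for b in blocks:
        acc, out = 0, []
        for v in b:
            acc += v
            out.append(acc)
        scanned.append(out)

    # Phase II: scan of the per-block sums
    sums, acc = [], 0
    for out in scanned:
        acc += out[-1]
        sums.append(acc)

    # Phase III: add the previous block's scanned sum as an offset
    result = []
    for k, out in enumerate(scanned):
        offset = sums[k - 1] if k > 0 else 0
        result.extend(v + offset for v in out)
    return result
```

Running this on the slide's example, `blocked_scan([1] * 16, 4)`, reproduces the output row 1 through 16.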

  29. Scan Approximation [Plot: approximation error across output elements 0 … N]

  30. Evaluation

  31. Experimental Setup • Compiler: Clang 3.3 • GPU: NVIDIA GTX 560 • CPU: Intel Core i7 • Benchmarks: NVIDIA SDK, Rodinia, … [Diagram: Clang-based framework — AST visitor, pattern detection, action generator, rewrite driver — transforms CUDA code into approximate kernels]

  32. Runtime System • Checks output quality against a quality target, trading quality for speedup • Related quality-monitoring runtimes: Green [PLDI 2010], SAGE [MICRO 2013]
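A hedged sketch of a quality-directed tuning loop in the style this slide attributes to Green and SAGE: measure output quality on sample inputs and pick the most aggressive approximation knob that still meets the target. All names and the quality metric here are illustrative assumptions, not Paraprox's actual runtime API:

```python
def tune(run_exact, run_approx, inputs, target, knobs):
    """Pick the most aggressive knob whose sampled quality >= target.

    knobs must be sorted from least to most aggressive; quality is
    measured as 1 - mean relative error (one common choice).
    """
    chosen = knobs[0]
    for knob in knobs:
        exact = [run_exact(x) for x in inputs]
        approx = [run_approx(x, knob) for x in inputs]
        errs = [abs(a - e) / abs(e) if e else abs(a - e)
                for a, e in zip(approx, exact)]
        quality = 1.0 - sum(errs) / len(errs)
        if quality >= target:
            chosen = knob          # this knob still meets the target
        else:
            break                  # more aggressive knobs only get worse
    return chosen
```

The early break assumes quality degrades monotonically with knob aggressiveness, which is the usual premise of such one-dimensional tuning loops.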

  33. Speedups for Both CPU and GPU [Chart: per-benchmark speedups at a 90% quality target; up to 7.9x; geometric mean speedups shown for CPU and GPU]

  34. One Solution Does Not Fit All! [Chart: Paraprox vs. loop perforation speedups]

  35. We Have Control on Output Quality


  37. Distribution of Errors


  39. Conclusion • Manual approximation is not easy or practical; we need tools for approximation • One approximation method does not fit all applications • Using pattern-based approximation, we achieved a 2.6x speedup while maintaining 90% of the output quality

  40. Paraprox: Pattern-Based Approximation for Data Parallel Applications Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke University of Michigan March 2014 University of Michigan Electrical Engineering and Computer Science Compilers creating custom processors
