Paraprox: Pattern-Based Approximation for Data Parallel Applications

Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke
University of Michigan
March 2014

University of Michigan, Electrical Engineering and Computer Science
Compilers Creating Custom Processors

Approximate Computing
  • 100% accuracy is not always necessary
  • Less work:
    • Better performance
    • Lower power consumption
  • There are many domains where approximate output is acceptable

Data Parallelism is Everywhere

Financial Modeling, Medical Imaging, Physics Simulation, Audio Processing, Machine Learning, Games, Image Processing, Statistics, Video Processing, …

  • Mostly regular applications
  • Work on large data sets
  • Exact output is not required for operation

Good opportunity for automatic approximation

Approximating KMeans

Approximation alone is not enough; we need a way to control the output quality.

Approximate Computing
  • Ask the programmer to do it
    • Not easy / practical
    • Hard to debug
  • Automatic Approximation
    • One solution does not fit all
    • Paraprox: Pattern-based Approximation
    • Pattern-specific approximation methods
    • Provide knobs to control the output quality
Common Patterns

  • Map: Signal Processing, Physics, …
  • Partitioning: Image Processing, Finance, …
  • Reduction: Machine Learning, Physics, …
  • Scatter/Gather: Machine Learning, Search, …
  • Stencil: Image Processing, Physics, …
  • Scan: Statistics, …

M. McCool et al. “Structured Parallel Programming: Patterns for Efficient Computation.” Morgan Kaufmann, 2012.

Paraprox

Parallel program (OpenCL/CUDA) → Paraprox (Pattern Detection, then Approximation Methods) → Approximate Kernels → Runtime System with Tuning Parameters

Approximate Memoization

  • Identify candidate functions
  • Find the table size
  • Check the quality
  • Determine qi (bits) for each input
  • Fill the table
  • Execution
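
The kernel that these steps produce can be pictured with a minimal CUDA sketch. Everything below is an illustrative assumption rather than Paraprox's generated code: the three-input pure function, the [0, 1] input range, the uniform 5 bits per input (a 2^15-entry table), and all names.

```cuda
#include <cuda_runtime.h>

#define BITS 5                               // assumed: bits per input
#define ENTRIES (1 << (3 * BITS))            // 3 inputs x 5 bits = 15-bit index (32K entries)

__device__ float d_table[ENTRIES];           // filled on the host during the "Fill the Table" step

__device__ __forceinline__ unsigned quantize(float x) {
    // Map x from an assumed [0, 1] range to a BITS-bit integer index.
    int q = __float2int_rd(x * ((1 << BITS) - 1));
    return (unsigned)min(max(q, 0), (1 << BITS) - 1);
}

// Approximate replacement for out[i] = f(a[i], b[i], c[i]), where f is a pure function.
__global__ void memoized_kernel(const float* a, const float* b, const float* c,
                                float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned idx = (quantize(a[i]) << (2 * BITS)) |
                   (quantize(b[i]) << BITS) |
                    quantize(c[i]);
    out[i] = d_table[idx];                   // one table lookup instead of recomputing f
}
```

The host fills d_table once by evaluating the original function at each quantized input combination (the "Fill the Table" step), so every kernel invocation afterwards costs a single load.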

Candidate Functions
  • Pure functions do not:
    • read or write any global or static mutable state.
    • call an impure function.
    • perform I/O.
  • In CUDA/OpenCL:
    • No global/shared memory access
    • No thread ID dependent computation
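
As a hedged illustration of these criteria (the function names and the array are made up), the first device function below qualifies as a candidate while the second does not:

```cuda
// Candidate: pure -- depends only on its arguments, touches no global/shared
// memory, performs no I/O, and does not use the thread ID.
__device__ float distance2d(float x, float y) {
    return sqrtf(x * x + y * y);
}

__device__ float g_lut[256];                 // some global state

// Not a candidate: reads global memory and its result depends on the thread ID.
__device__ float biased(float x) {
    return x + g_lut[threadIdx.x % 256];
}
```
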
Table Size

[Figure: output quality vs. speedup for 16K, 32K, and 64K lookup tables]

How Many Bits per Input?

Table Size = 32KB, giving a 15-bit address split across inputs A, B, and C:

  • (5, 5, 5) bits → 95.2% output quality
  • (6, 4, 5) → 96.5%
  • (4, 6, 5) → 91.3%
  • (5, 6, 4) → 95.4%
  • (5, 4, 6) → 91.2%
  • (6, 5, 4) → 95.1%
  • (4, 7, 4) → 95.4%
  • (5, 7, 3) → 95.8%

Inputs that do not need high precision get fewer bits.
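
A non-uniform split keeps the 15-bit address, and therefore the same table size, while shifting precision toward the inputs that need it. A small sketch with an assumed helper name, using the (6, 4, 5) split from the list above:

```cuda
// Pack three already-quantized inputs into one 15-bit table index with a
// non-uniform split: 6 bits for A, 4 for B, 5 for C (6 + 4 + 5 = 15).
__device__ __forceinline__ unsigned table_index(unsigned qa, unsigned qb, unsigned qc) {
    return (qa << 9) | (qb << 5) | qc;       // A in the top bits, C in the bottom bits
}
```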

Tile Approximation

[Figure: difference with neighboring tiles]

Stencil/Partitioning

A 3×3 stencil reads nine elements around Input[i][j]:

C  = Input[i][j]      W  = Input[i][j-1]      E  = Input[i][j+1]
NW = Input[i-1][j-1]  N  = Input[i-1][j]      NE = Input[i-1][j+1]
SW = Input[i+1][j-1]  S  = Input[i+1][j]      SE = Input[i+1][j+1]

NW N NE
W  C  E
SW S  SE

  • Paraprox looks for global/texture/shared load accesses to the arrays with affine addresses
  • Control the output quality by changing the number of accesses per tile

Successive approximation levels replace neighbor loads with values already loaded from nearby cells: first the bottom row (SW, S, SE) reuses the middle row (W, C, E); next the top row (NW, N, NE) reuses it as well; at the most aggressive level every access collapses to the center element C. Each level cuts the number of loads per tile, as sketched below.

  • Paraprox looks for global/texture/shared load accesses to the arrays with affine addresses
  • Control the output quality by changing the number of accesses per tile
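
A minimal CUDA sketch of the first approximation level, assuming a 3×3 box filter and illustrative names: the bottom-row loads SW, S, and SE reuse the already-loaded W, C, and E, cutting loads per output element from nine to six.

```cuda
__global__ void box3x3_approx(const float* in, float* out, int width, int height) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (i < 1 || j < 1 || i >= height - 1 || j >= width - 1) return;

    float C  = in[i * width + j];
    float W  = in[i * width + (j - 1)];
    float E  = in[i * width + (j + 1)];
    float NW = in[(i - 1) * width + (j - 1)];
    float N  = in[(i - 1) * width + j];
    float NE = in[(i - 1) * width + (j + 1)];
    // Approximation: reuse the middle row instead of loading row i+1.
    float SW = W, S = C, SE = E;

    out[i * width + j] = (NW + N + NE + W + C + E + SW + S + SE) / 9.0f;
}
```
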
Scan/Prefix Sum
  • Prefix sum: cumulative histogram, list ranking, …
  • Data parallel implementation:
    • Divide the input into smaller subarrays
    • Compute the prefix sum of each subarray in parallel
Data Parallel Scan

Example: an inclusive scan of sixteen 1s, split into four subarrays of four elements each.

  • Phase I: scan each subarray in parallel → each subarray becomes 1, 2, 3, 4 with a subarray sum of 4
  • Phase II: scan the four subarray sums → 4, 8, 12, 16
  • Phase III: add the scanned sum of the preceding subarrays to each element → 1, 2, 3, …, 16
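
A minimal CUDA sketch of the three phases (deliberately unoptimized: one thread per block does the block-local work so the phases stay easy to follow; kernel names are illustrative):

```cuda
#define BLOCK 4   // subarray size, matching the 16-element example above

// Phase I: inclusive scan of each subarray; also record each subarray's total.
__global__ void scan_blocks(const int* in, int* out, int* block_sums, int n) {
    if (threadIdx.x != 0) return;
    int base = blockIdx.x * BLOCK, running = 0;
    for (int k = 0; k < BLOCK && base + k < n; ++k) {
        running += in[base + k];
        out[base + k] = running;
    }
    block_sums[blockIdx.x] = running;
}

// Phase II: scan the per-subarray totals (a single thread suffices here).
__global__ void scan_sums(int* block_sums, int num_blocks) {
    if (blockIdx.x == 0 && threadIdx.x == 0)
        for (int k = 1; k < num_blocks; ++k)
            block_sums[k] += block_sums[k - 1];
}

// Phase III: add the total of all preceding subarrays to each element.
__global__ void add_offsets(int* out, const int* block_sums, int n) {
    int i = blockIdx.x * BLOCK + threadIdx.x;
    if (i < n && blockIdx.x > 0)
        out[i] += block_sums[blockIdx.x - 1];
}
```

For the sixteen-element example this would run as scan_blocks<<<4, BLOCK>>>, then scan_sums<<<1, 1>>>, then add_offsets<<<4, BLOCK>>>.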

Scan Approximation

[Figure: output elements 0 to N]

Experimental Setup
  • Clang 3.3
  • GPU
    • NVIDIA GTX 560
  • CPU
    • Intel Core i7
  • Benchmarks
    • NVIDIA SDK, Rodinia, …

Implementation flow: CUDA source → AST Visitor → Pattern Detection → Action Generator → Rewrite Driver → Approximate Kernels

Runtime System

The runtime performs quality checking against a quality target, trading output quality for speedup; the quality-checking approach is similar to Green [PLDI 2010] and SAGE [MICRO 2013].
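
A hedged host-side sketch of such a control loop; the knob abstraction, the placeholder quality model, and all names are assumptions for illustration, not Paraprox's runtime:

```cuda
#include <cstdio>

// Stand-in for launching an approximate kernel at a given aggressiveness level
// and comparing a sample of its output against the exact output.
static float measure_quality(int knob) {
    return 1.0f - 0.03f * knob;              // placeholder: quality drops as the knob grows
}

// Pick the most aggressive setting that still meets the quality target.
static int tune(float target, int max_knob) {
    for (int knob = max_knob; knob > 0; --knob)
        if (measure_quality(knob) >= target)
            return knob;
    return 0;                                // 0 = exact execution as the fallback
}

int main() {
    printf("selected knob: %d\n", tune(0.90f, 10));
    return 0;
}
```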

Speedups for Both CPU and GPU

[Figure: speedup on CPU and GPU with a 90% output-quality target; geometric mean reported, peak of 7.9x]

One Solution Does Not Fit All!

[Figure: Paraprox vs. loop perforation]

Conclusion
  • Manual approximation is not easy or practical.
  • We need tools for automatic approximation.
  • One approximation method does not fit all applications.
  • Using pattern-based approximation, Paraprox achieves a 2.6x speedup while maintaining 90% of the output quality.