CS 395 Last Lecture: Summary, Anti-summary, and Final Thoughts


Presentation Transcript



CS 395 Last Lecture: Summary, Anti-summary, and Final Thoughts



Summary (1) Architecture

  • Modern architecture designs are driven by energy constraints

  • Shortening latencies is too costly, so we use parallelism in hardware to increase potential throughput

  • Some parallelism is implicit (out-of-order superscalar execution), but it has limits

  • Other parallelism is explicit (vectorization and multithreading) and relies on software to unlock it



Summary (2) Memory

  • Memory technologies trade off energy and cost for capacity, with SRAM registers on one end and spinning platter hard disks on the other

  • Locality (relationships between memory accesses) can help us get the best of all cases

  • Caching is the hardware-only solution to capturing locality, but software-driven solutions exist too (memcache for files, etc.)



Summary (3) Software

  • Want to fully occupy your hardware?

    • Express locality (tiling)

    • Vectorize (compiler or manual)

    • Multithread (e.g. OpenMP)

    • Accelerate (e.g. CUDA, OpenCL)

  • Take the cost into consideration. Unless you’re optimizing in your free time, your time isn’t free.
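As a minimal sketch of the multithreading item above, assuming OpenMP and a loop with independent iterations (the function and data are illustrative, not from the lecture):

```c
#include <stddef.h>

/* Dot product whose loop is split across threads by OpenMP. The
   reduction clause gives each thread a private partial sum that is
   combined at the end. Compiled without -fopenmp, the pragma is
   ignored and the loop simply runs serially. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The contiguous, unit-stride accesses also leave the loop body in a form the compiler can vectorize, covering two checklist items at once.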



Research Perspective (2010)

  • Can we generalize and categorize the most important, generally applicable GPU Computing software optimizations?

    • Across multiple architectures

    • Across many applications

  • What kinds of performance trends are we seeing from successive GPU generations?

  • Conclusion – GPUs aren’t special, and parallel programming is getting easier



Application Survey

  • Surveyed the GPU Computing Gems chapters

  • Studied the Parboil benchmarks in detail

    Results:

  • Eight (for now) major categories of optimization transformations

    • Performance impact of individual optimizations on certain Parboil benchmarks included in the paper



1. (Input) Data Access Tiling

[Figure: data in DRAM reaches local accesses either via implicit copy into a hardware cache or explicit copy into a software-managed scratchpad]
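On a CPU, the explicit-copy path can be sketched as loop tiling with a small staging buffer standing in for the scratchpad (the buffer size and function name are assumptions, not from the slides):

```c
#include <string.h>

#define TILE 4

/* Explicit data access tiling: each tile of the input is copied into a
   small local buffer once, and all further accesses in that tile hit
   the buffer -- the software analogue of staging data in a GPU
   scratchpad. */
long sum_tiled(const int *in, int n) {
    int tile[TILE];
    long total = 0;
    for (int t = 0; t < n; t += TILE) {
        int len = (n - t < TILE) ? n - t : TILE;
        memcpy(tile, in + t, (size_t)len * sizeof(int)); /* explicit copy */
        for (int i = 0; i < len; i++)                    /* local access  */
            total += tile[i];
    }
    return total;
}
```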



2. (Output) Privatization

  • Avoid contention by aggregating updates locally

  • Requires storage resources to keep copies of data structures

[Figure: private per-thread results are merged into local results, then into global results]
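A minimal sketch of privatization, assuming a histogram as the contended output (the thread count and bin count are illustrative; the threads are simulated by a loop):

```c
#include <string.h>

#define BINS 4
#define NTHREADS 2

/* Privatized histogram: each "thread" accumulates into its own private
   copy, so no bin is ever contended; the private copies are merged
   into the global result in a single pass at the end. */
void histogram_private(const int *data, int n, int *global) {
    int priv[NTHREADS][BINS];
    memset(priv, 0, sizeof priv);
    for (int t = 0; t < NTHREADS; t++)        /* each thread's share      */
        for (int i = t; i < n; i += NTHREADS)
            priv[t][data[i] % BINS]++;        /* uncontended private write */
    memset(global, 0, BINS * sizeof(int));
    for (int t = 0; t < NTHREADS; t++)        /* single merge step        */
        for (int b = 0; b < BINS; b++)
            global[b] += priv[t][b];
}
```

The trade-off named on the slide is visible in `priv`: storage grows with the number of threads times the size of the output structure.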



Running Example: SpMV

[Figure: sparse matrix-vector multiplication Ax = v, with A stored as Row/Col/Data arrays]
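The Row/Col/Data arrays in the figure are the compressed sparse row (CSR) layout; a serial sketch of v = Ax over that layout (the function name is assumed):

```c
/* CSR SpMV: Row[i]..Row[i+1] delimits row i's slice of the Col and
   Data arrays, so each output v[i] is a dot product of that slice
   with the dense vector x. */
void spmv_csr(const int *Row, const int *Col, const double *Data,
              const double *x, double *v, int nrows) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = Row[i]; j < Row[i + 1]; j++)
            sum += Data[j] * x[Col[j]];
        v[i] = sum;
    }
}
```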





3. “Scatter to Gather” Transformation

[Figure: the SpMV example Ax = v redrawn to illustrate the scatter-to-gather transformation on the Row/Col/Data arrays]
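The transformation can be sketched by contrasting a scatter loop over nonzeros (COO-style, contended writes to v) with a gather loop over rows (CSR-style, private writes). Both functions are assumed illustrations, not code from the lecture:

```c
/* Scatter form: iterate over nonzeros; each one updates its output
   row, so a parallel version needs atomics on v. Caller zeroes v. */
void spmv_scatter(const int *row, const int *col, const double *val,
                  int nnz, const double *x, double *v) {
    for (int k = 0; k < nnz; k++)
        v[row[k]] += val[k] * x[col[k]];    /* contended write */
}

/* Gather form: iterate over outputs; each v[i] reads everything it
   needs, so writes are private and no atomics are required. */
void spmv_gather(const int *Row, const int *Col, const double *Data,
                 const double *x, double *v, int nrows) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = Row[i]; j < Row[i + 1]; j++)
            sum += Data[j] * x[Col[j]];
        v[i] = sum;
    }
}
```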




4. Binning

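The slide leaves the details to its figure; as an assumed 1-D illustration, binning groups inputs by key with a counting pass and an offset scan, so a later pass can visit one bin instead of scanning every input:

```c
#define NBINS 3

/* Counting-sort style binning: count elements per bin, prefix-sum the
   counts into bin start offsets, then place each element into its
   bin's slice of the output array. */
void bin_by_range(const int *vals, int n, int width,
                  int *starts, int *binned) {
    int counts[NBINS] = {0};
    int cursor[NBINS];
    for (int i = 0; i < n; i++)
        counts[vals[i] / width]++;            /* histogram pass */
    starts[0] = 0;
    for (int b = 1; b < NBINS; b++)           /* exclusive scan */
        starts[b] = starts[b - 1] + counts[b - 1];
    for (int b = 0; b < NBINS; b++)
        cursor[b] = starts[b];
    for (int i = 0; i < n; i++)               /* placement pass */
        binned[cursor[vals[i] / width]++] = vals[i];
}
```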



5. Regularization (Load Balancing)



6. Compaction



7. Data Layout Transformation


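The slides give no code here; the canonical instance of this transformation on GPUs is converting an array of structures (AoS) to a structure of arrays (SoA), so that neighboring threads read contiguous, coalescable memory. The types below are assumed for illustration:

```c
#define N 8

struct ParticleAoS  { float x, y; };              /* fields interleaved */
struct ParticlesSoA { float x[N]; float y[N]; };  /* fields contiguous  */

/* AoS -> SoA layout transformation: afterwards, a loop (or a GPU warp)
   that touches only the x field reads one contiguous run of memory
   instead of stride-2 slots. */
void aos_to_soa(const struct ParticleAoS *in, struct ParticlesSoA *out,
                int n) {
    for (int i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
    }
}
```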



8. Granularity Coarsening

[Figure: timelines comparing 4-way parallel execution, with more redundant work, against 2-way parallel execution, where most of the work is essential]

  • Parallel execution often requires redundant work and coordination work

    • Merging multiple threads into one allows results to be reused, reducing redundancy



How much faster do applications really get each hardware generation?



Unoptimized Code Has Improved Drastically

  • Orders of magnitude speedup in many cases

  • Hardware does not solve all problems

    • Coalescing (lbm)

    • Heavily contended atomics (bfs)



Optimized Code Is Improving Faster than “Peak Performance”

  • Caches capture locality that scratchpads can't capture efficiently (spmv, stencil)

  • Increased local storage capacity enables extra optimization (sad)

  • Some benchmarks need atomic throughput more than flops (bfs, histo)



Optimization Still Matters

  • Hardware never changes algorithmic complexity (cutcp)

  • Caches do not solve layout problems for big data (lbm)

  • Coarsening still makes a big difference (cutcp, sgemm)

  • Many artificial performance cliffs are gone (sgemm, tpacf, mri-q)



Stuff we haven’t covered

  • There are good tools for profiling code beyond simple timing (cache misses, etc.). If you can't find out why a particular piece of code is taking so long, look into hardware performance counters.

  • Patterns and practice

    • We covered some of the major optimization patterns, but only the basic ones; many optimization patterns are algorithmic.



Fill Out Evaluations!

