

CS 395 Last Lecture: Summary, Anti-summary, and Final Thoughts


Summary (1) Architecture

  • Modern architecture designs are driven by energy constraints

  • Shortening latencies is too costly, so we use parallelism in hardware to increase potential throughput

  • Some parallelism is implicit (out-of-order, superscalar processing) but has limits

  • Other parallelism is explicit (vectorization and multithreading) and relies on software to unlock it


Summary (2) Memory

  • Memory technologies trade off energy and cost against capacity, with SRAM registers at one end and spinning-platter hard disks at the other

  • Locality (relationships between memory accesses) can help us get the best of all cases

  • Caching is the hardware-only solution to capturing locality, but software-driven solutions exist too (memcache for files, etc.)


Summary (3) Software

  • Want to fully occupy your hardware?

    • Express locality (tiling)

    • Vectorize (compiler or manual)

    • Multithread (e.g. OpenMP)

    • Accelerate (e.g. CUDA, OpenCL)

  • Take the cost into consideration. Unless you’re optimizing in your free time, your time isn’t free.
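The checklist above can be sketched in plain C. The example below is an assumption for illustration (saxpy is not from the slides): the loop body has independent iterations, so the compiler can vectorize it, and an OpenMP pragma supplies the multithreading (it is simply ignored when compiled without OpenMP support).

```c
#include <stddef.h>

/* Hypothetical saxpy-style loop: independent iterations make it
 * auto-vectorizable; the pragma adds multithreading when OpenMP
 * is enabled and is harmlessly ignored otherwise. */
void saxpy(size_t n, float a, const float *x, float *y) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   /* no cross-iteration dependences */
}
```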


Research Perspective (2010)

  • Can we generalize and categorize the most important, generally applicable GPU Computing software optimizations?

    • Across multiple architectures

    • Across many applications

  • What kinds of performance trends are we seeing from successive GPU generations?

  • Conclusion – GPUs aren’t special, and parallel programming is getting easier


Application Survey

  • Surveyed the GPU Computing Gems chapters

  • Studied the Parboil benchmarks in detail

    Results:

  • Eight (for now) major categories of optimization transformations

    • The performance impact of individual optimizations on certain Parboil benchmarks is included in the paper


1. (Input) Data Access Tiling

[Diagram: two paths from DRAM to local accesses: an implicit copy through a hardware cache, versus an explicit copy into scratchpad memory.]
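A CPU-side C sketch of the tiling idea (the matrix transpose here is an illustrative assumption, not from the slides): the loops are restructured so that each small tile is fully processed while it is still resident in fast memory, the same reuse the slide's explicit copy to scratchpad achieves.

```c
#include <stddef.h>

#define TILE 32   /* tile edge; tune to the cache in practice */

/* Tiled matrix transpose: all accesses in the inner two loops stay
 * inside one TILE x TILE block, capturing locality in the cache. */
void transpose_tiled(size_t n, const float *in, float *out) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; ++i)
                for (size_t j = jj; j < jj + TILE && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}
```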


2. (Output) Privatization

  • Avoid contention by aggregating updates locally

  • Requires storage resources to keep copies of data structures

[Diagram: private results are aggregated into local results, which are then aggregated into global results.]
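A sequential C sketch of privatization (the histogram and the fixed worker count are illustrative assumptions): instead of all workers updating one shared histogram and contending for it, each worker fills a private copy, and the copies are reduced into the global result afterwards.

```c
#include <stddef.h>
#include <string.h>

#define BINS 8

/* Privatized histogram: per-worker private copies avoid contention on
 * the shared result; a reduction step merges them. The loops over w
 * model what would run in parallel. Assumes nworkers <= 4. */
void histogram_privatized(const unsigned char *data, size_t n,
                          size_t nworkers, unsigned *global_hist) {
    unsigned priv[4][BINS];                    /* one private copy per worker */
    memset(priv, 0, sizeof priv);
    memset(global_hist, 0, BINS * sizeof *global_hist);

    for (size_t w = 0; w < nworkers; ++w)      /* "parallel" phase */
        for (size_t i = w; i < n; i += nworkers)
            priv[w][data[i] % BINS]++;         /* contention-free update */

    for (size_t w = 0; w < nworkers; ++w)      /* reduction phase */
        for (int b = 0; b < BINS; ++b)
            global_hist[b] += priv[w][b];
}
```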


Running Example: SpMV

[Diagram: sparse matrix-vector multiply Ax = v, with matrix A stored as Row, Col, and Data arrays (CSR format).]


5. Regularization (Load Balancing)



7. Data Layout Transformation
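The body of this slide was not preserved. One common instance of a data layout transformation (an assumption here, not taken from the slide) is converting an array of structures to a structure of arrays, so that neighboring threads or vector lanes read contiguous memory:

```c
#include <stddef.h>

/* Hypothetical AoS -> SoA conversion: in AoS, one particle's fields are
 * adjacent; in SoA, the same field across all particles is adjacent,
 * which favors coalesced/vectorized access. */
struct ParticleAoS { float x, y, z; };   /* array of structures */

struct ParticlesSoA {                    /* structure of arrays */
    float *x, *y, *z;
};

void aos_to_soa(const struct ParticleAoS *in, size_t n,
                struct ParticlesSoA *out) {
    for (size_t i = 0; i < n; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```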




8. Granularity Coarsening

[Diagram: over time, a 4-way parallel execution repeats redundant work in every thread, while a 2-way parallel execution merges pairs of threads so the essential work is computed once and shared.]

  • Parallel execution often requires redundant work and coordination work

    • Merging multiple threads into one allows results to be reused, reducing redundancy
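A sequential C sketch of granularity coarsening (the particular computation is an illustrative assumption): the fine-grained version would compute the common term once per output element; the coarsened loop produces two outputs per "thread" and computes it once per pair.

```c
#include <stddef.h>

/* Stand-in for a term each fine-grained thread would recompute. */
static float expensive_common(float a) { return a * a + 1.0f; }

/* Coarsened loop: one iteration ("thread") handles two outputs and
 * reuses the shared term, halving the redundant work. */
void coarsened(const float *a, const float *b, float *out, size_t npairs) {
    for (size_t p = 0; p < npairs; ++p) {
        float common = expensive_common(a[p]);  /* computed once, not twice */
        out[2 * p]     = common + b[2 * p];
        out[2 * p + 1] = common + b[2 * p + 1];
    }
}
```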



Unoptimized Code Has Improved Drastically

  • Orders of magnitude speedup in many cases

  • Hardware does not solve all problems

    • Coalescing (lbm)

    • Highly contentious atomics (bfs)


Optimized Code Is Improving Faster than “Peak Performance”

  • Caches capture locality that scratchpad memory can’t capture efficiently (spmv, stencil)

  • Increased local storage capacity enables extra optimization (sad)

  • Some benchmarks need atomic throughput more than flops (bfs, histo)


Optimization Still Matters

  • Hardware never changes algorithmic complexity (cutcp)

  • Caches do not solve layout problems for big data (lbm)

  • Coarsening still makes a big difference (cutcp, sgemm)

  • Many artificial performance cliffs are gone (sgemm, tpacf, mri-q)


Stuff We Haven’t Covered

  • There are good tools for profiling code beyond simple timing (cache misses, etc.). If you can’t find why a particular piece of code is taking so long, look into hardware performance counters.

  • Patterns and practice

    • Some of the major patterns of optimization we covered, but only the basic ones. Many optimization patterns are algorithmic.


Fill Out Evaluations!

