CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts
Summary (1) Architecture
  • Modern architecture designs are driven by energy constraints
  • Shortening latencies is too costly, so we use parallelism in hardware to increase potential throughput
  • Some parallelism is implicit (e.g., out-of-order superscalar execution) but has limits
  • Other parallelism is explicit (vectorization and multithreading) and relies on software to unlock it
Summary (2) Memory
  • Memory technologies trade off energy and cost for capacity, with SRAM registers on one end and spinning platter hard disks on the other
  • Locality (relationships between memory accesses) can help us get the best of all cases
  • Caching is the hardware-only solution to capturing locality, but software-driven solutions exist too (memcache for files, etc.)
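Not from the slides, but a concrete illustration of the locality point above: the two functions below compute the identical sum, yet the row-major traversal uses every element of each cache line it fetches, while the column-major one touches one element per line. The array name and sizes are arbitrary choices for the sketch.

```c
#include <assert.h>

#define N 512

static double a[N][N];

/* Fill the matrix so both traversals below have work to do. */
static void fill(double val) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = val;
}

/* Row-major sum: consecutive accesses touch consecutive addresses, so
 * every cache line fetched from DRAM is fully used before eviction. */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major sum: successive accesses are N * sizeof(double) bytes
 * apart, so each cache line yields one element before the next miss --
 * the same arithmetic, but far worse spatial locality. */
static double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Both return the same answer; only the memory traffic differs, which is exactly the "relationships between memory accesses" that caching exploits.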
Summary (3) Software
  • Want to fully occupy your hardware?
    • Express locality (tiling)
    • Vectorize (compiler or manual)
    • Multithread (e.g. OpenMP)
    • Accelerate (e.g. CUDA, OpenCL)
  • Take the cost into consideration. Unless you’re optimizing in your free time, your time isn’t free.
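As a minimal sketch of the "express locality (tiling)" bullet above (illustrative code, not from the course): a matrix transpose restructured into TILE x TILE blocks so that both the strided reads and the strided writes stay inside a cache-resident working set. The tile size of 32 is an arbitrary example; a real tuning pass would pick it per machine.

```c
#include <assert.h>
#include <string.h>

#define N 256
#define TILE 32   /* illustrative tile edge; tune so a block fits in cache */

/* Naive transpose: the writes to dst stride N doubles apart, so each
 * write misses a different cache line. */
void transpose_naive(double *dst, const double *src) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dst[j * N + i] = src[i * N + j];
}

/* Tiled transpose: process TILE x TILE blocks so the lines touched by
 * the strided writes are reused before they are evicted. */
void transpose_tiled(double *dst, const double *src) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j * N + i] = src[i * N + j];
}
```

The same outer-loop structure is what an OpenMP `parallel for` or a CUDA thread-block decomposition would parallelize, which is why tiling usually comes first in the checklist.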
Research Perspective (2010)
  • Can we generalize and categorize the most important, generally applicable GPU Computing software optimizations?
    • Across multiple architectures
    • Across many applications
  • What kinds of performance trends are we seeing from successive GPU generations?
  • Conclusion – GPUs aren’t special, and parallel programming is getting easier
Application Survey
  • Surveyed the GPU Computing Gems chapters
  • Studied the Parboil benchmarks in detail

Results:

  • Eight (for now) major categories of optimization transformations
    • Performance impact of individual optimizations on certain Parboil benchmarks included in the paper
1: (Input) Data Access Tiling

[Diagram: data in DRAM reaches the compute units one of two ways — an implicit copy through the hardware cache, or an explicit copy into scratchpad memory; in either case, subsequent reads become fast local accesses.]

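On a GPU the explicit-copy path means staging a tile of input into `__shared__` scratchpad memory; the sketch below is a CPU-side analogue of that pattern (not the course's code), staging each tile plus a one-element halo into a small local buffer before a 3-point stencil reads from it.

```c
#include <assert.h>

#define N 1024
#define TILE 128

/* 3-point average stencil with explicit data access tiling: each tile
 * (plus one halo element per side) is copied into a small local buffer
 * first, the same way a CUDA kernel stages a tile into shared memory. */
void stencil_tiled(double *out, const double *in) {
    double buf[TILE + 2];               /* tile + halo: the "scratchpad" */
    for (int base = 0; base < N; base += TILE) {
        /* explicit copy: DRAM -> scratchpad (zero-pad at the edges) */
        for (int k = -1; k <= TILE; k++) {
            int idx = base + k;
            buf[k + 1] = (idx < 0 || idx >= N) ? 0.0 : in[idx];
        }
        /* local access: every read below comes from buf, not DRAM */
        for (int k = 0; k < TILE; k++)
            out[base + k] = (buf[k] + buf[k + 1] + buf[k + 2]) / 3.0;
    }
}
```

The implicit-copy alternative is simply the untiled loop relying on the hardware cache to keep neighbors resident; which path wins depends on the architecture, which is exactly the cross-generation question the survey asks.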
2. (Output) Privatization
  • Avoid contention by aggregating updates locally
  • Requires storage resources to keep copies of data structures

[Diagram: per-thread private results are merged into per-block local results, which are then combined into the global results.]
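A classic instance of output privatization is a histogram, where naive parallel updates would all contend on the same bins. The serial sketch below (illustrative, with made-up sizes) mimics the parallel pattern: each "thread" accumulates into its own private copy, and the copies are reduced into the global result afterward.

```c
#include <assert.h>
#include <string.h>

#define NBINS 16
#define NTHREADS 4
#define N 4096

/* Privatized histogram: phase 1 lets each logical thread update only
 * its private bins (no atomics, no contention); phase 2 reduces the
 * private copies into the shared global result. */
void histogram_privatized(int *global, const int *data) {
    int priv[NTHREADS][NBINS];
    memset(priv, 0, sizeof(priv));

    /* phase 1: thread t touches only priv[t] */
    for (int t = 0; t < NTHREADS; t++) {
        int lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;
        for (int i = lo; i < hi; i++)
            priv[t][data[i] % NBINS]++;
    }

    /* phase 2: merge private copies into the global result */
    memset(global, 0, NBINS * sizeof(int));
    for (int t = 0; t < NTHREADS; t++)
        for (int b = 0; b < NBINS; b++)
            global[b] += priv[t][b];
}
```

The cost is exactly the bullet above: NTHREADS copies of the bins must fit in whatever storage (registers, scratchpad, or cache) is available.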
Running Example: SpMV

[Diagram: the sparse matrix A, stored as Row, Col, and Data arrays, is multiplied by the dense vector x to produce v, i.e. Ax = v.]
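The slide's Row/Col/Data arrays match the common compressed sparse row (CSR) layout; assuming that layout (the benchmark may use a different sparse format), the kernel is a few lines:

```c
#include <assert.h>

/* Sparse matrix-vector multiply v = A*x with A in CSR form:
 * row[i]..row[i+1] delimits the nonzeros of row i, col[] holds their
 * column indices, and data[] their values. */
void spmv_csr(int nrows, const int *row, const int *col,
              const double *data, const double *x, double *v) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row[i]; k < row[i + 1]; k++)
            sum += data[k] * x[col[k]];
        v[i] = sum;
    }
}
```

The irregular, x[col[k]]-indexed reads are why SpMV recurs as the running example: its performance hinges on how well caches or scratchpads capture that data-dependent locality.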
8. Granularity Coarsening

[Diagram: timeline comparing 4-way parallel execution, where every thread repeats the redundant setup work, against 2-way parallel execution after coarsening, where merged threads share the redundant part and keep only the essential work.]

  • Parallel execution often requires redundant work and coordination
    • Merging multiple threads into one allows results to be reused, reducing redundancy
Unoptimized Code Has Improved Drastically
  • Orders of magnitude speedup in many cases
  • Hardware does not solve all problems
    • Coalescing (lbm)
    • Highly contended atomics (bfs)
Optimized Code Is Improving Faster than “Peak Performance”
  • Caches capture locality that scratchpads can't capture efficiently (spmv, stencil)
  • Increased local storage capacity enables extra optimization (sad)
  • Some benchmarks need atomic throughput more than flops (bfs, histo)
Optimization Still Matters
  • Hardware never changes algorithmic complexity (cutcp)
  • Caches do not solve layout problems for big data (lbm)
  • Coarsening still makes a big difference (cutcp, sgemm)
  • Many artificial performance cliffs are gone (sgemm, tpacf, mri-q)
Stuff we haven’t covered
  • Good tools exist for profiling code beyond simple timing (cache misses, etc.). If you can't find why a particular piece of code is taking so long, look into hardware performance counters.
  • Patterns and practice
    • We covered some of the major optimization patterns, but only the basic ones. Many optimization patterns are algorithmic.