
Optimizing for Intel multi-/many-core architectures



  1. Optimizing for Intel multi-/many-core architectures N. Satish, Throughput Computing Lab, Intel Labs

  2. Outline of the talk • Architectural trends • Optimizations for multi-/many-core platforms • Challenges in performance scaling moving forward

  3. Outline of the talk • Architectural trends • Optimizations for multi-/many-core platforms • Challenges in performance scaling moving forward

  4. Increasing parallelism • Core scaling • Nehalem (Nhm; 4 cores) -> Westmere (Wsm; 6 cores) -> … -> Intel Knights Ferry (32 cores) -> … • Data-level parallelism (SIMD) scaling • Earliest SSE (64-bit) -> SSE (128-bit) -> AVX (256-bit) -> LRBNI (512-bit) • Thread scaling per core • Core 2 (1 thread/core) -> Nhm (2 threads/core) -> … -> Intel Knights Ferry (4 threads/core) • Cache scaling (more slowly) • Memory latency not likely to drop • Need to make better use of caches, SMT, and ILP

  5. Intel® MIC Architecture: An Intel Co-Processor Architecture • [Block diagram: many VECTOR IA CORES, each with a COHERENT CACHE, linked by an INTERPROCESSOR NETWORK to MEMORY and I/O INTERFACES and FIXED FUNCTION LOGIC] • Many cores and many, many more threads • Standard IA programming and memory model • Source: Kirk Skaugen, ISC 2010 keynote

  6. Knights Ferry • Software development platform for Intel® MIC architecture • 32 cores, 1.2 GHz • 128 threads at 4 threads/core • 8MB shared coherent cache • 1-2GB GDDR5 • Bundled with Intel HPC tools • Source: Kirk Skaugen, ISC 2010 keynote

  7. The Knights Family • Knights Ferry: software development platform • Knights Corner: 1st Intel® MIC product, 22nm process, >50 Intel Architecture cores • Future Knights products • Source: Kirk Skaugen, ISC 2010 keynote

  8. Outline of the talk • Architectural trends • Optimizations for multi-/many-core platforms • Challenges in performance scaling moving forward

  9. Extent of possible gains • Tree search [SIGMOD 2010]: performance difference on Core i7: 8X over baseline, 5X over previous best reported results • In Lee et al. [ISCA 2010], we showed that performance of CPUs could be improved by an average of 8X for a range of throughput-intensive kernels

  10. General optimization flow • Scale down problem to fit in a single core cache and use 1 core (to optimize for compute) • Vectorization • Avoid core stalls due to lack of ILP • Optimize for memory latency, per-core bandwidth • Block for TLB, multiple levels of cache • Software pipelining, iteration lookahead • Avoid cache/TLB conflicts • Finally check core scalability (by weak scaling) • Dynamic load balancing, avoiding synchronization • If architecture does not have scalable bandwidth, might still find bandwidth bottlenecks
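
A minimal sketch of the "iteration lookahead" step above, in C with the SSE prefetch intrinsic (the loop, names, and LOOKAHEAD constant are illustrative, not from the talk): issue a prefetch for the data needed a few iterations ahead so its miss latency overlaps with the current iteration's compute.

    #include <xmmintrin.h>   /* _mm_prefetch */

    #define LOOKAHEAD 8      /* assumed tuning parameter */

    /* Sum values through an index array: each val[idx[i]] is a likely
       cache miss, so prefetch the element needed LOOKAHEAD iterations
       ahead while summing the current one. */
    float sum_indirect(const float *val, const int *idx, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            if (i + LOOKAHEAD < n)
                _mm_prefetch((const char *)&val[idx[i + LOOKAHEAD]],
                             _MM_HINT_T0);
            s += val[idx[i]];
        }
        return s;
    }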

  11. Tree Search [SIGMOD 2010] • Each query traverses a path in a binary tree until it hits a leaf node, and checks if the leaf node value matches the query • Assume first that the whole tree fits in cache (say 16 levels = 64K entries) • Parallelization is over queries; trivial • Naïve SIMD – each lane handles one query – heavily latency bound
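
For concreteness, a minimal scalar version of the per-query work (the implicit array layout, with node i's children at 2i+1 and 2i+2, is our illustrative choice, not necessarily the paper's):

    #include <stdint.h>

    /* Walk one query down a binary tree stored implicitly in an array;
       one key compare per level. Returns the leaf slot reached, which
       the caller checks against the query value. */
    int search_one_query(const int32_t *node, int levels, int32_t q) {
        int i = 0;
        for (int d = 0; d < levels; d++)
            i = 2 * i + (q > node[i] ? 2 : 1);  /* right if q > key */
        return i;
    }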

  12. SIMD Blocking • Rearrange the binary tree and block for SIMD – no gathers/scatters • Tradeoff: SIMD speedup is limited to log2(SIMD width) – actual: 1.2X scaling for Core i7 • However, this improves to 3.1X for KNF (of a peak of 4) • Bottleneck was now back-to-back dependencies – s/w pipeline • Compute bound: 2.5X better than previously reported on CPUs • MIC architecture performance is about 2.3X better than Core i7
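
A minimal sketch of one descent step under SIMD blocking, using 4-wide SSE for readability (KNF vectors are 16-wide; the [root, left, right, pad] block layout is our illustration of the idea): one vector compare resolves two tree levels without any gather.

    #include <stdint.h>
    #include <emmintrin.h>

    /* One SIMD block packs a depth-2 subtree: lane 0 = root key,
       lane 1 = left child key, lane 2 = right child key, lane 3 = pad.
       A single vector compare against the query yields which of the
       4 depth-2 exits to take. */
    static inline int simd_block_child(__m128i keys, int32_t q) {
        __m128i gt = _mm_cmpgt_epi32(_mm_set1_epi32(q), keys);
        int m = _mm_movemask_ps(_mm_castsi128_ps(gt));
        int gt_root  = m & 1;         /* q > root?        */
        int gt_left  = (m >> 1) & 1;  /* q > left child?  */
        int gt_right = (m >> 2) & 1;  /* q > right child? */
        return gt_root ? 2 + gt_right : gt_left;  /* exit 0..3 */
    }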

  13. Optimizing for memory • For a larger tree, every level of the tree (beyond the first few levels) is a cache miss, and also a TLB miss at larger depths • TLB misses are expensive – 200-300 cycles • Cache misses can be 10-100 cycles in latency, depending on level • Heavily latency bound

  14. Tree Search [SIGMOD 2010] • Page blocking minimizes TLB misses (1 page can hold a sub-tree of 20 levels for a 2MB page) – only 2 misses for the whole search: 1.7X speedup on Core i7, 3X speedup on KNF • Cache misses are also minimized • Found that first few levels are kept warm in cache • No cache line blocking on KNF – cache line length == SIMD width
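
A sketch of the page-blocking prerequisite on Linux (using transparent huge pages via madvise is our assumption about how to obtain 2MB pages; the talk does not specify a mechanism): with the tree in 2MB pages, each ~20-level subtree lives in one page, so a root-to-leaf walk costs at most 2 TLB misses.

    #include <stdlib.h>
    #include <sys/mman.h>

    /* Allocate tree storage aligned to a 2MB boundary and hint the
       kernel to back it with 2MB huge pages (best effort). */
    void *alloc_tree_pages(size_t bytes) {
        void *p = NULL;
        if (posix_memalign(&p, 2u * 1024 * 1024, bytes) != 0)
            return NULL;
        madvise(p, bytes, MADV_HUGEPAGE);
        return p;
    }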

  15. Thread level parallelism • Do multiple queries at once • No issues in tree search (obtain BW bound/compute bound performance for large/small trees respectively)
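
Since queries are independent, the thread-level step is a plain parallel loop; a minimal OpenMP sketch (search_one_query stands in for the per-query search from slide 11; the chunk size is an assumed tuning knob):

    #include <stdint.h>
    #include <omp.h>

    extern int search_one_query(const int32_t *node, int levels, int32_t q);

    /* Each thread grabs chunks of queries; dynamic scheduling gives
       cheap load balancing when per-query cost varies. */
    void search_all(int *out, const int32_t *node, int levels,
                    const int32_t *queries, int nq) {
        #pragma omp parallel for schedule(dynamic, 256)
        for (int i = 0; i < nq; i++)
            out[i] = search_one_query(node, levels, queries[i]);
    }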

  16. Tree search is but one example • Optimizations for compute and memory are more widely applicable • Compute: vectorization, unrolling, s/w pipelining (better tool support) • Memory: TLB/cache blocking, prefetching • Optimizations for core parallelism • Create many statically partitioned tasks – or develop a locality-aware dynamic load balancer • Enforce only actual dependencies instead of performing global barriers (especially for many-core architectures) • E.g.: stencils only require neighbor communication – enforcing neighbor dependencies limits cache-to-cache traffic (see the sketch below)
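
A minimal sketch of neighbor-only synchronization for a 1D stencil, using C11 atomics (the tile/step structure and names are illustrative): each tile publishes how many steps it has completed and waits only on its two neighbors, never on the whole machine.

    #include <stdatomic.h>

    #define NTILES 64                      /* assumed tile count */
    extern void compute_tile_step(int tile, int step);  /* stencil work */

    atomic_int tile_done[NTILES];          /* steps completed per tile */

    /* To compute step s, tile t only needs its two neighbors to have
       completed step s - no global barrier between time steps. */
    void advance_tile(int t, int nsteps) {
        for (int s = 0; s < nsteps; s++) {
            while ((t > 0          && atomic_load(&tile_done[t-1]) < s) ||
                   (t < NTILES - 1 && atomic_load(&tile_done[t+1]) < s))
                ;   /* spin; real code would pause/back off */
            compute_tile_step(t, s);
            atomic_store(&tile_done[t], s + 1);
        }
    }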

  17. Outline of the talk • Architectural trends • Optimizations for multi-/many-core platforms • Challenges in performance scaling moving forward

  18. Will things continue scaling well? • Challenge 1: SIMD efficiency • Certain algorithms are not SIMD friendly • Issue 1: Code can be irregular, with branches • Issue 2: Code may require gathers/scatters to/from distinct locations • To support these efficiently, the LRB Native Instruction Set (LRBNI) has 3 features: (1) mask support, (2) gather/scatter support, and (3) pack/unpack instructions • Most additions are 512-bit vector instructions with masks that predicate writes into vector registers • Gather loads values from non-contiguous memory locations (not necessarily cheap if they miss cache) • Pack is a restricted (cheaper) gather in which elements within a cache line sharing the same mask are collected
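
To illustrate feature (1), masked vector instructions predicate writes into vector registers; the sketch below uses AVX-512 intrinsics as a shipping stand-in for LRBNI's 512-bit masked instructions (the loop itself is our example, not from the talk).

    #include <immintrin.h>

    /* Scalar: for each i, if (A[i] > 0) B[i] += A[i];
       Vector: the compare produces a 16-bit mask, and the masked add
       writes only the lanes whose mask bit is set - the other lanes
       of B keep their old values (a predicated write). */
    void masked_accumulate(int *B, const int *A, int n) {
        for (int i = 0; i + 16 <= n; i += 16) {
            __m512i a = _mm512_loadu_si512(A + i);
            __m512i b = _mm512_loadu_si512(B + i);
            __mmask16 m = _mm512_cmpgt_epi32_mask(a, _mm512_setzero_si512());
            b = _mm512_mask_add_epi32(b, m, b, a);
            _mm512_storeu_si512(B + i, b);
        }
        /* scalar remainder loop omitted for brevity */
    }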

  19. pack • Input: v0; Output: [rbx] • Motivating loop (assume N is large):

    for (i = 0; i < N; i++) {
        if (A[i]) { /* work for mask-1 elements */ }
        else      { /* work for mask-0 elements */ }
    }

  • Compute mask using vcmp • Use pack to collect items with mask 0 • Collect elements with mask 1 • Run SIMD-friendly code on collected elements
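
The pack idiom on slide 19 can be sketched with AVX-512 compress stores, the closest modern analogue of LRBNI pack (LRBNI itself never shipped publicly; everything below is our illustration): partition the indices by the branch predicate, then run branch-free SIMD code over each partition.

    #include <immintrin.h>
    #include <stdint.h>

    /* Split 0..n-1 into indices where A[i] != 0 (idx1) and A[i] == 0
       (idx0) using mask + compress-store, so each side of the original
       if/else can later run as straight-line SIMD code. */
    void partition_by_predicate(const int32_t *A, int n,
                                int32_t *idx1, int *n1,
                                int32_t *idx0, int *n0) {
        __m512i lane = _mm512_setr_epi32(0,1,2,3,4,5,6,7,
                                         8,9,10,11,12,13,14,15);
        int c1 = 0, c0 = 0;
        for (int i = 0; i + 16 <= n; i += 16) {
            __m512i a  = _mm512_loadu_si512(A + i);
            __m512i iv = _mm512_add_epi32(lane, _mm512_set1_epi32(i));
            __mmask16 m = _mm512_test_epi32_mask(a, a);   /* A[i] != 0 */
            _mm512_mask_compressstoreu_epi32(idx1 + c1, m, iv);
            _mm512_mask_compressstoreu_epi32(idx0 + c0, _mm512_knot(m), iv);
            c1 += _mm_popcnt_u32(m);
            c0 += _mm_popcnt_u32(_mm512_knot(m));
        }
        *n1 = c1; *n0 = c0;   /* scalar remainder omitted */
    }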

  20. Example: ClearPath [Guy et al, SCA 09] Each person (“agent”) finds nearest neighbors and a velocity that will avoid collisions with them (computational geometry; involves branchy code) • SIMD utilization for a single agent is limited (1.25-1.5X on SSE), but inter-agent SIMD has gather/scatter, divergent branches • Obtain ~ 6.4X SIMD scaling for MIC using HW support

  21. Other challenges • Challenge 2: BW scaling • Fundamentally, memory bandwidth is not keeping pace with compute • Used to be about 1 byte/flop • Current: Westmere: 0.21 bytes/flop, AMD Magny Cours: 0.20 bytes/flop, NVIDIA GTX 480: 0.13 bytes/flop • Future GPUs [Bill Dally, SC 09]: 2017: 1 GPU node = 2 TB/s, 40 TFlops => 0.05 bytes/flop • Occasional disruptive changes improve BW • No magic solution – we really need to use the local storage well • Change algorithms if required – merge vs radix sort [SIGMOD 2010]

  22. Other challenges • Challenge 3: Cache capacity is not going to keep increasing at the rate of compute • Can't increase latency and power too much • Will most likely stay in the tens of MB range, never GBs • New levels of the memory hierarchy will come in between • eDRAM is one example: offers bandwidth and capacity in between caches and GDDR • Need to capture working sets at multiple levels • Need better autotuners

  23. Varying working set size: Physical Simulations [ISCA 2007]

  24. KNF performance results • HPC kernels • SGEMM (1 TFlop) [SC 2009 demo] • LU (>0.5 TFlop) [ISC 2010 demo] • Medical imaging • Volume Rendering [TVCG 2009] – gather/scatter heavy: 5-8X faster than 4-core Nhm • Compressed Sensing (MRI reconstruction) [EMBC 2010]: clinically viable: 12 seconds

  25. Thanks for your attention!
