
Analyzing Memory Access Intensity in Parallel Programs on Multicore


Presentation Transcript


  1. Analyzing Memory Access Intensity in Parallel Programs on Multicore Lixia Liu, Zhiyuan Li, Ahmed Sameh Department of Computer Science, Purdue University, USA 22nd ACM International Conference on Supercomputing (ICS), June 7-12, 2008.

  2. Findings • Prefetching hardware in modern processors successfully hides memory latency on a single core when memory is accessed with a stride the hardware can predict. • The effectiveness of prefetching diminishes when multiple cores access memory simultaneously, straining the shared memory bandwidth.

  3. Findings (continued) • Memory latency cannot be hidden simply by creating numerous concurrent threads • the additional threads aggravate the bandwidth problem. • For a given parallel program, when the number of executing cores exceeds a certain threshold, performance degrades due to the bandwidth problem.

  4. Motivations • Programmers: identify performance bottlenecks, compare algorithms • Compiler designers: design optimizations that target latency vs. bandwidth

  5. Intel Quadcore Processor Q6600 block diagram

  6. Intel Quadcore Processor Q6600 • Four 2.4 GHz cores (0.417 ns cycle time) • UJ: 2 x four 2.5 GHz cores (Intel Xeon 5400 processors) • Two L2 caches (each 4 MB) • UJ: two L2 caches (each 6 MB) • 1066 MHz FSB • UJ: 1333 MHz FSB • 64-bit wide bus • FSB peak bandwidth = 8.5 GB per second • UJ: FSB peak bandwidth = 10.6 GB per second • Bandwidth between the MCH and main memory is 12.8 GB per second.

  7. Intel Quadcore processors • Each core • performs out-of-order execution • completes up to four full instructions per cycle. • Each 4 MB (UJ: 6 MB) L2 cache is shared by two cores. To reduce the cache miss rate, each core has • one hardware instruction prefetcher • many data prefetchers • all prefetching independently

  8. Intel Quadcore processors (cont.) • Two of the prefetchers can bring data from memory into the L2 cache, depending on the memory reference patterns. • The prefetchers dynamically adjust their parameters (stride, look-ahead distance) according to: • bus bandwidth • number of pending requests • prefetch history

  9. Matrix Multiply Code
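
Slide 9 shows the matrix multiply code itself, which is not reproduced in this transcript. Below is a minimal sketch of the two variants being compared, a straightforward triple loop and a cache-blocked (tiled) version; the matrix size, block size, and function names are illustrative, not taken from the paper.

#define N  1024              /* illustrative matrix size, not from the paper */
#define BS   64              /* illustrative block (tile) size, not from the paper */

/* Straightforward (non-blocked) matrix multiply: C += A * B. */
void matmul(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Cache-blocked (tiled) version: each BS x BS tile of A, B, and C is reused
   while it is still resident in the L2 cache, reducing off-chip traffic. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++)
                        for (int k = kk; k < kk + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];
}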

  10. Matrix Multiplication Execution on a Single Core Speedup = non-blocked execution time / blocked execution time • Blocking is not so helpful on a single core • <10% improvement • Prefetch hardware is perfect? NO! • Is this also true on multicore?

  11. BOTTLENECK OF THE MEMORY BUS On multicore machines, the effectiveness of prefetching diminishes when it cannot proceed at full speed due to the bandwidth constraint, even when strides are predictable.

  12. Multicore Execution Results • Four cores: 70% performance gap • Observations • Efficient concurrency • No inter-thread communication • Bandwidth problem?

  13. Matrix Multiplication Execution on a Single Core: revisited Speedup = non-blocked execution time / blocked execution time • Blocking is not so helpful on a single core • <10% improvement • Prefetch hardware is perfect • AND available memory bandwidth is adequate • For a given application, is there a metric to quantitatively identify whether the memory bandwidth is adequate?

  14. Metric Definition: Memory Access Intensity • IRZ : zero-latency instruction issue rate, the number of instructions per cycle that can be issued supposing operands are always available on-chip • α : average number of bytes accessed off-chip per instruction • The memory access intensity of an application is given by: βA = α × IRZ bytes/cycle, i.e., bytes accessed per cycle = (bytes/instruction) × (instructions/cycle)
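
A hypothetical illustration (the numbers are not from the paper): a kernel that can issue IRZ = 2 instructions per cycle with all operands on chip, and that accesses an average of α = 1.5 bytes off-chip per instruction, has βA = 1.5 × 2 = 3 bytes per cycle.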

  15. When will there be a memory bottleneck? • Peak memory bandwidth, PMB : bytes/sec • βM (bytes per cycle) = PMB / frequency • For an application there is a memory bottleneck if βA > βM
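
A minimal sketch, following slides 14 and 15, of how βA and βM could be compared: the constants use the Q6600 figures from slide 6 (8.5 GB/s FSB peak bandwidth, 2.4 GHz cores), while the α and IRZ inputs in main() are hypothetical placeholders; on real hardware they would come from program analysis or VTune hardware counters (slide 18). The function names are illustrative, not from the paper.

#include <stdio.h>

#define PMB_BYTES_PER_SEC 8.5e9   /* FSB peak bandwidth, slide 6 */
#define FREQ_HZ           2.4e9   /* core clock frequency, slide 6 */

/* beta_M: peak off-chip bytes the bus can deliver per CPU cycle. */
static double beta_m(void) {
    return PMB_BYTES_PER_SEC / FREQ_HZ;       /* ~3.54 bytes/cycle */
}

/* beta_A = alpha * IR_Z: off-chip bytes the application needs per cycle. */
static double beta_a(double alpha_bytes_per_insn, double irz_insns_per_cycle) {
    return alpha_bytes_per_insn * irz_insns_per_cycle;
}

int main(void) {
    double ba = beta_a(2.0, 2.5);             /* hypothetical alpha and IR_Z */
    double bm = beta_m();
    printf("beta_A = %.2f B/cycle, beta_M = %.2f B/cycle\n", ba, bm);
    if (ba > bm)
        printf("Memory bandwidth is a bottleneck (beta_A > beta_M).\n");
    else
        printf("Memory bandwidth is adequate.\n");
    return 0;
}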

  16. Is there a memory bottleneck for an application? • βM for the Intel Core 2 Quad Q6600 = 3.54 B/cycle • There is a memory bottleneck for an application if βA > βM • Take-home midterm question: compute βM for the UJ 2 x quad-core Intel Xeon 5400.

  17. Revisit Matrix Multiply Results

  18. Explanation • βM: 3.54 B/cycle • Methods to compute βA • program analysis • use Intel VTune to measure hardware counters • Measured βA = 4.97 > βM = 3.54, thus blocking is necessary when 4 cores are executing

  19. Other Results • Benchmarks: diagonally dominant banded linear systems • ScaLAPACK (the version implemented in the Intel Math Kernel Library (MKL)) • Spike (not discussed here; see paper) • A revised Spike (not discussed here; see paper) • Configurations • βM: 3.54 bytes / CPU cycle • Band: narrow (11), medium (99), and wide (399) • Large matrix

  20. ScaLAPACK: performance for narrow banded system

  21. ScaLAPACK: performance for medium banded system

  22. ScaLAPACK: performance for wide banded system

  23. Results of the factorization step • For all three matrix bandwidths, little speedup is achieved by using multiple cores. • Reason (VTune): the parallelized code significantly increases the number of instructions • Solution: change the algorithm to reduce the number of extra instructions introduced by parallelization.

  24. Results of the solve step • Flat speedup • Reason: VTune shows βA > βM • Solution: remove the memory bottleneck • VTune also shows that the factorization step dominates the total execution time as the band gets wider. • Improving the factorization step is therefore more critical to total performance than improving the “solve” step.
