
FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue


Presentation Transcript


  1. FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue. John Giacomoni, Tipp Moseley, and Manish Vachharajani, University of Colorado at Boulder, 2008.02.21

  2. Why? Why Pipelines? • Multicore systems are the future • Many apps can be pipelined if the granularity is fine enough • Fine-grain here means roughly < 1 µs per stage • ≈ 3.5x the cost of an interrupt handler

  3. Fine-Grain Pipelining Examples • Network processing: • Intrusion detection (NID) • Traffic filtering (e.g., P2P filtering) • Traffic shaping (e.g., packet prioritization)

  4. Network Processing Scenarios

  5. Core Placements: 4x4 NUMA Organization (ex: AMD Opteron Barcelona) [Figure: pipeline stages (IP input/output, App, Dec, Enc) placed onto cores]

  6. Example: 3-Stage Pipeline

  7. Example: 3-Stage Pipeline

  8. Communication Overhead

  9. Communication Overhead • Locks ≈ 320 ns • GigE

  10. Communication Overhead • Lamport ≈ 160 ns • Locks ≈ 320 ns • GigE

  11. Communication Overhead • Hardware ≈ 10 ns • Lamport ≈ 160 ns • Locks ≈ 320 ns • GigE

  12. Communication Overhead • Hardware ≈ 10 ns • FastForward ≈ 28 ns • Lamport ≈ 160 ns • Locks ≈ 320 ns • GigE

  13. More Fine-Grain Pipelining Examples • Network processing: • Intrusion detection (NID) • Traffic filtering (e.g., P2P filtering) • Traffic shaping (e.g., packet prioritization) • Signal processing • Media transcoding/encoding/decoding • Software-defined radios • Encryption • Counter-mode AES • Other domains • Fine-grain kernels extracted from sequential applications

  14. FastForward • Cache-optimized point-to-point CLF queue • Fast • Robust against unbalanced stages • Hides die-die communication • Works with strong to weak memory consistency models
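
  The pseudocode on the following slides leaves the shared queue state implicit. A minimal sketch of what that state might look like, reflecting the cacheline-conscious placement the "cache-optimized" bullet refers to; the names, sizes, and GCC alignment attribute are illustrative assumptions, not the authors' published code:

  #include <stddef.h>

  #define CACHELINE   64                               /* assumed line size */
  #define QUEUE_SLOTS 4096                             /* power of two */
  #define NEXT(i)     (((i) + 1) & (QUEUE_SLOTS - 1))  /* wrap-around increment */

  /* Keep the producer's index, the consumer's index, and the slot array on
   * separate cachelines so updating one never invalidates the line holding
   * another (illustrative layout only). */
  static volatile size_t head __attribute__((aligned(CACHELINE)));   /* producer */
  static volatile size_t tail __attribute__((aligned(CACHELINE)));   /* consumer */
  static void * volatile buf[QUEUE_SLOTS] __attribute__((aligned(CACHELINE)));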

  15. Lamport's CLF Queue (1)

  lamp_enqueue(data) {
      NH = NEXT(head);                 /* index of the slot after head */
      while (NH == tail) {}            /* spin while the queue is full */
      buf[head] = data;
      head = NH;
  }

  lamp_dequeue(*data) {
      while (head == tail) {}          /* spin while the queue is empty */
      *data = buf[tail];
      tail = NEXT(tail);
  }

  16. Lamport's CLF Queue (2)

  lamp_enqueue(data) {
      NH = NEXT(head);
      while (NH == tail) {}
      buf[head] = data;
      head = NH;
  }

  [Figure: circular buffer buf[0]..buf[n] with head and tail indices]

  17. AMD Opteron Cache Example [Figure: AMD Opteron cache hierarchy with a modified cacheline]

  18. Lamport's CLF Queue (2)

  lamp_enqueue(data) {
      NH = NEXT(head);
      while (NH == tail) {}            /* reads tail, which the consumer writes */
      buf[head] = data;
      head = NH;                       /* writes head, which the consumer reads */
  }

  [Figure: circular buffer buf[0]..buf[n] with head and tail indices]

  Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.

  19. Lamport's CLF Queue (3)

  lamp_enqueue(data) {
      NH = NEXT(head);
      while (NH == tail) {}
      buf[head] = data;
      head = NH;
  }

  [Figure: circular buffer buf[0]..buf[n] with head and tail indices]

  Observe how cachelines will still ping-pong. What if the head/tail comparison was eliminated?

  20. FastForward CLF Queue (1)

  lamp_enqueue(data) {
      NH = NEXT(head);
      while (NH == tail) {}
      buf[head] = data;
      head = NH;
  }

  ff_enqueue(data) {
      while (0 != buf[head]) {}        /* spin until the slot is empty (0/NULL) */
      buf[head] = data;
      head = NEXT(head);
  }
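
  The consumer side is not shown on this slide; a minimal sketch of the matching dequeue, assuming the same empty-slot convention (0/NULL means empty) and the illustrative declarations sketched earlier:

  ff_dequeue(*data) {
      while (0 == buf[tail]) {}        /* spin until the producer fills the slot */
      *data = buf[tail];
      buf[tail] = 0;                   /* hand the slot back as "empty" */
      tail = NEXT(tail);
  }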

  21. FastForward CLF Queue (2)

  ff_enqueue(data) {
      while (0 != buf[head]) {}
      buf[head] = data;
      head = NEXT(head);
  }

  [Figure: circular buffer buf[0]..buf[n] with head and tail indices]

  Observe how the head/tail cachelines will NOT ping-pong. BUT, buf will still cause cachelines to ping-pong.

  22. FastForward CLF Queue (3)

  ff_enqueue(data) {
      while (0 != buf[head]) {}
      buf[head] = data;
      head = NEXT(head);
  }

  [Figure: circular buffer buf[0]..buf[n] with head and tail indices]

  Solution: temporally slip the stages by a cacheline, for an N:1 reduction in coherence misses per stage (see the sketch below).
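
  A rough illustration of the N:1 claim (the numbers are assumptions, not from the slides): with 64-byte cachelines and 8-byte pointer slots, eight slots share a line, so keeping the producer and consumer at least a cacheline apart means each stage takes roughly one coherence miss per eight operations instead of one per operation.

  #define CACHELINE_BYTES 64                                /* assumed */
  #define SLOT_BYTES      sizeof(void *)                    /* 8 on x86-64 */
  #define SLOTS_PER_LINE  (CACHELINE_BYTES / SLOT_BYTES)    /* 8 */

  /* Minimum slip, in slots, that keeps producer and consumer on different
   * cachelines of buf[]; any practical thresholds would be at least this. */
  #define MIN_SLIP_SLOTS  SLOTS_PER_LINE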

  23. Slip Timing

  24. Slip Timing (Lost)

  25. Maintaining Slip (Concepts) • Use distance as the quality metric • Explicitly compare head/tail • Causes cache ping-ponging • Perform rarely

  26. Maintaining Slip (Method)

  adjust_slip() {
      dist = distance(producer, consumer);
      if (dist < *Danger*) {                             /* slip dangerously small */
          dist_old = 0;
          do {
              dist_old = dist;
              spin_wait(avg_stage_time * (*OK* - dist)); /* wait so the other stage can pull ahead */
              dist = distance(producer, consumer);
          } while (dist < *OK* && dist > dist_old);      /* until slip is OK or stops growing */
      }
  }
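
  distance() and the *Danger* / *OK* thresholds are left undefined on the slide. A plausible sketch of the distance computation for a power-of-two circular buffer, assuming the head/tail indices sketched earlier (hypothetical helper, not from the original):

  /* Number of occupied slots between producer and consumer; relies on
   * unsigned wrap-around and QUEUE_SLOTS being a power of two. */
  size_t distance(size_t producer_head, size_t consumer_tail) {
      return (producer_head - consumer_tail) & (QUEUE_SLOTS - 1);
  }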

  27. Comparative Performance [Figure: Lamport vs. FastForward]

  28. Thrashing and Auto-Balancing [Figure: FastForward (Thrashing) vs. FastForward (Balanced)]

  29. Cache Verification [Figure: FastForward (Thrashing) vs. FastForward (Balanced)]

  30. On/Off-Die Communications [Figure: off-die vs. on-die communication paths]

  31. On/Off-Die Performance [Figure: FastForward (On-Die) vs. FastForward (Off-Die)]

  32. Proven Property • "In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order."

  33. Work in Progress • Operating systems • 27.5 ns/op • 3.1% cost reduction vs. the reported 28.5 ns • Reduced jitter • Applications • 128-bit AES encrypting filter • Ethernet-layer encryption at 1.45 Mfps • IP-layer encryption at 1.51 Mfps • ~10 lines of code for each

  34. Gazing into the Crystal Ball • Hardware ≈ 10 ns • FastForward ≈ 28 ns • Lamport ≈ 160 ns • Locks ≈ 320 ns • GigE

  35. Shared Memory Accelerated Queues Now Available! http://ce.colorado.edu/core Questions? john.giacomoni@colorado.edu
