
FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue


Presentation Transcript


  1. FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue. John Giacomoni, Tipp Moseley, and Manish Vachharajani, University of Colorado at Boulder, 2008.02.21

  2. Why? Why Pipelines? • Multicore systems are the future • Many apps can be pipelined if the granularity is fine enough • Fine-grain here means roughly < 1 µs per stage • ≈ 3.5x the cost of an interrupt handler

  3. Fine-Grain Pipelining Examples • Network processing: • Intrusion detection (NID) • Traffic filtering (e.g., P2P filtering) • Traffic shaping (e.g., packet prioritization)

  4. Network Processing Scenarios

  5. Core Placements: 4x4 NUMA Organization (ex: AMD Opteron Barcelona) [Figure: pipeline stages (IP input/output, App, Dec, Enc) placed onto cores]

  6. Example: 3-Stage Pipeline

  7. Example: 3-Stage Pipeline

  8. Communication Overhead

  9. Communication Overhead • Locks ≈ 320 ns • GigE

  10. Communication Overhead • Lamport ≈ 160 ns • Locks ≈ 320 ns • GigE

  11. Communication Overhead • Hardware ≈ 10 ns • Lamport ≈ 160 ns • Locks ≈ 320 ns • GigE

  12. Communication Overhead • Hardware ≈ 10 ns • FastForward ≈ 28 ns • Lamport ≈ 160 ns • Locks ≈ 320 ns • GigE

  13. More Fine-Grain Pipelining Examples • Network processing: • Intrusion detection (NID) • Traffic filtering (e.g., P2P filtering) • Traffic shaping (e.g., packet prioritization) • Signal processing • Media transcoding/encoding/decoding • Software-defined radios • Encryption • Counter-mode AES • Other domains • Fine-grain kernels extracted from sequential applications

  14. FastForward • Cache-optimized point-to-point CLF queue • Fast • Robust against unbalanced stages • Hides die-die communication • Works with strong to weak memory consistency models
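
  The pseudocode on the following slides leaves the shared queue state implicit. A minimal sketch of what that state might look like, reflecting the cacheline-conscious placement the "cache-optimized" bullet refers to; the names, sizes, and GCC alignment attribute are illustrative assumptions, not the authors' published code:

  #include <stddef.h>

  #define CACHELINE   64                               /* assumed line size */
  #define QUEUE_SLOTS 4096                             /* power of two */
  #define NEXT(i)     (((i) + 1) & (QUEUE_SLOTS - 1))  /* wrap-around increment */

  /* Keep the producer's index, the consumer's index, and the slot array on
   * separate cachelines so updating one never invalidates the line holding
   * another (illustrative layout only). */
  static volatile size_t head __attribute__((aligned(CACHELINE)));   /* producer */
  static volatile size_t tail __attribute__((aligned(CACHELINE)));   /* consumer */
  static void * volatile buf[QUEUE_SLOTS] __attribute__((aligned(CACHELINE)));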

  15. Lamport's CLF Queue (1)

  lamp_enqueue(data) {
      NH = NEXT(head);                 /* index of the slot after head */
      while (NH == tail) {}            /* spin while the queue is full */
      buf[head] = data;
      head = NH;
  }

  lamp_dequeue(*data) {
      while (head == tail) {}          /* spin while the queue is empty */
      *data = buf[tail];
      tail = NEXT(tail);
  }

  16. Lamport's CLF Queue (2)

  lamp_enqueue(data) {
      NH = NEXT(head);
      while (NH == tail) {}
      buf[head] = data;
      head = NH;
  }

  [Figure: circular buffer buf[0]..buf[n] with head and tail indices]

  17. AMD Opteron Cache Example [Figure: AMD Opteron cache hierarchy with a modified cacheline]

  18. Lamport's CLF Queue (2)

  lamp_enqueue(data) {
      NH = NEXT(head);
      while (NH == tail) {}            /* reads tail, which the consumer writes */
      buf[head] = data;
      head = NH;                       /* writes head, which the consumer reads */
  }

  [Figure: circular buffer buf[0]..buf[n] with head and tail indices]

  Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.

  19. Lamport's CLF Queue (3)

  lamp_enqueue(data) {
      NH = NEXT(head);
      while (NH == tail) {}
      buf[head] = data;
      head = NH;
  }

  [Figure: circular buffer buf[0]..buf[n] with head and tail indices]

  Observe how cachelines will still ping-pong. What if the head/tail comparison was eliminated?

  20. FastForward CLF Queue (1)

  lamp_enqueue(data) {
      NH = NEXT(head);
      while (NH == tail) {}
      buf[head] = data;
      head = NH;
  }

  ff_enqueue(data) {
      while (0 != buf[head]) {}        /* spin until the slot is empty (0/NULL) */
      buf[head] = data;
      head = NEXT(head);
  }
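
  The consumer side is not shown on this slide; a minimal sketch of the matching dequeue, assuming the same empty-slot convention (0/NULL means empty) and the illustrative declarations sketched earlier:

  ff_dequeue(*data) {
      while (0 == buf[tail]) {}        /* spin until the producer fills the slot */
      *data = buf[tail];
      buf[tail] = 0;                   /* hand the slot back as "empty" */
      tail = NEXT(tail);
  }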

  21. FastForward CLF Queue (2)

  ff_enqueue(data) {
      while (0 != buf[head]) {}
      buf[head] = data;
      head = NEXT(head);
  }

  [Figure: circular buffer buf[0]..buf[n] with head and tail indices]

  Observe how the head/tail cachelines will NOT ping-pong. BUT, buf will still cause cachelines to ping-pong.

  22. FastForward CLF Queue (3)

  ff_enqueue(data) {
      while (0 != buf[head]) {}
      buf[head] = data;
      head = NEXT(head);
  }

  [Figure: circular buffer buf[0]..buf[n] with head and tail indices]

  Solution: temporally slip the stages by a cacheline, for an N:1 reduction in coherence misses per stage (see the sketch below).
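
  A rough illustration of the N:1 claim (the numbers are assumptions, not from the slides): with 64-byte cachelines and 8-byte pointer slots, eight slots share a line, so keeping the producer and consumer at least a cacheline apart means each stage takes roughly one coherence miss per eight operations instead of one per operation.

  #define CACHELINE_BYTES 64                                /* assumed */
  #define SLOT_BYTES      sizeof(void *)                    /* 8 on x86-64 */
  #define SLOTS_PER_LINE  (CACHELINE_BYTES / SLOT_BYTES)    /* 8 */

  /* Minimum slip, in slots, that keeps producer and consumer on different
   * cachelines of buf[]; any practical thresholds would be at least this. */
  #define MIN_SLIP_SLOTS  SLOTS_PER_LINE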

  23. Slip Timing

  24. Slip Timing (Lost)

  25. Maintaining Slip (Concepts) • Use distance as the quality metric • Explicitly compare head/tail • Causes cache ping-ponging • Perform rarely

  26. Maintaining Slip (Method)

  adjust_slip() {
      dist = distance(producer, consumer);
      if (dist < *Danger*) {                             /* slip dangerously small */
          dist_old = 0;
          do {
              dist_old = dist;
              spin_wait(avg_stage_time * (*OK* - dist)); /* wait so the other stage can pull ahead */
              dist = distance(producer, consumer);
          } while (dist < *OK* && dist > dist_old);      /* until slip is OK or stops growing */
      }
  }
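
  distance() and the *Danger* / *OK* thresholds are left undefined on the slide. A plausible sketch of the distance computation for a power-of-two circular buffer, assuming the head/tail indices sketched earlier (hypothetical helper, not from the original):

  /* Number of occupied slots between producer and consumer; relies on
   * unsigned wrap-around and QUEUE_SLOTS being a power of two. */
  size_t distance(size_t producer_head, size_t consumer_tail) {
      return (producer_head - consumer_tail) & (QUEUE_SLOTS - 1);
  }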

  27. Comparative Performance [Figure: Lamport vs. FastForward]

  28. Thrashing and Auto-Balancing [Figure: FastForward (Thrashing) vs. FastForward (Balanced)]

  29. Cache Verification [Figure: FastForward (Thrashing) vs. FastForward (Balanced)]

  30. On/Off-Die Communications [Figure: off-die vs. on-die communication paths]

  31. On/Off-Die Performance [Figure: FastForward (On-Die) vs. FastForward (Off-Die)]

  32. Proven Property • "In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order."

  33. Work in Progress • Operating systems • 27.5 ns/op • 3.1% cost reduction vs. the reported 28.5 ns • Reduced jitter • Applications • 128-bit AES encrypting filter • Ethernet-layer encryption at 1.45 Mfps • IP-layer encryption at 1.51 Mfps • ~10 lines of code for each

  34. Gazing into the Crystal Ball • Hardware ≈ 10 ns • FastForward ≈ 28 ns • Lamport ≈ 160 ns • Locks ≈ 320 ns • GigE

  35. Shared Memory Accelerated Queues Now Available! http://ce.colorado.edu/core Questions? john.giacomoni@colorado.edu
