
FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue


Presentation Transcript


  1. FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue
  John Giacomoni
  Advisor: Dr. Manish Vachharajani, University of Colorado at Boulder

  2. The Rise of Multicore

  3. Why Pipelines?
  (Chart: Triple DES with 32B blocks; nanoseconds/block vs. number of threads)
  • Data parallelism has limits
  • Pipeline parallelism if:
    • Granularity is fine enough (≈ < 1 µs; ≈ 3.5x an interrupt handler)
    • Total/partial order

  4. Fine-Grain Pipelining Examples
  • Network processing:
    • Intrusion detection (NID)
    • Traffic filtering (e.g., P2P filtering)
    • Traffic shaping (e.g., packet prioritization)
  • Signal processing:
    • Software-defined radios
  • Encryption:
    • Triple-DES
  • Other domains:
    • ODE solvers
    • Fine-grain kernels extracted from sequential applications

  5. Network Processing Scenarios

  6. Core Placements
  (Diagram: pipeline stages IP, App, Dec, Enc, and OP placed across a 4x4 NUMA organization, ex: AMD Opteron Barcelona)

  7. Routing/Bridge Data Flow
  (Diagram: IP → App → OP data flow through the OS)

  8. Example 3-Stage Pipeline

  9. Example 3-Stage Pipeline

  10. Communication Overhead

  11. Communication Overhead
  Locks → 190 ns
  GigE

  12. Communication Overhead
  Lamport → 160 ns
  Locks → 190 ns
  GigE

  13. Communication Overhead
  Hardware → 10 ns
  Lamport → 160 ns
  Locks → 190 ns
  GigE

  14. Communication Overhead
  Hardware → 10 ns
  FastForward → 28 ns
  Lamport → 160 ns
  Locks → 190 ns
  GigE

  15. The Programming Abstraction “Stack”
  • Sequential programming
    • Improves programmer productivity
    • Very successful
    • Problematic on modern machines

  16. The Complexity of Modern Systems
  • Sequential programming
  • Parallel programming
    • Very complex
    • Ignoring cross-layer behavior is very problematic
    • Can lead to incorrect behavior

  17. Hardware Matters (Memory Consistency)
  Initially X = 0, Y = 0. Which outputs (Y, X) are legal: 0,0, 1,1, 0,1, 1,0?
  Weak consistency permits outcomes that sequential consistency forbids.

  18. Hardware Matters (Memory Consistency)
  Initially X = 0, B.Flag = 0, C.Flag = 0. What is the output: 3, 2, 1, or 0?

  19. The Big Picture
  • Programming modern machines is a cross-cutting problem
    • Need to evaluate/account for/consider every layer
  • Omitted systems areas:
    • Networking
    • File systems
    • Distributed systems
    • Business server development
  • Omitted related areas:
    • User interfaces
    • Security

  20. FastForward
  • Cache-optimized point-to-point CLF queue
    • Fast
    • Robust against unbalanced stages
    • Hides die-die communication
    • Works with strong to weak memory consistency models

  21. Lamport’s CLF Queue (1)
  lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {};
    buf[head] = data;
    head = NH;
  }
  lamp_dequeue(*data) {
    while (head == tail) {};
    *data = buf[tail];
    tail = NEXT(tail);
  }

  22. Lamport’s CLF Queue (2)
  (Diagram: head and tail indices into a ring buffer buf[0]..buf[n])
  lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {};
    buf[head] = data;
    head = NH;
  }

  23. NUMA Cache Example
  (Diagram)

  24. Lamport’s CLF Queue (2)
  (Diagram: head and tail indices into a ring buffer buf[0]..buf[n])
  lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {};
    buf[head] = data;
    head = NH;
  }
  Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.

  25. Lamport’s CLF Queue (3)
  (Diagram: head and tail indices into a ring buffer buf[0]..buf[n])
  lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {};
    buf[head] = data;
    head = NH;
  }
  Observe how cachelines will still ping-pong. What if the head/tail comparison were eliminated?

  26. FastForward CLF Queue (1)
  lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {};
    buf[head] = data;
    head = NH;
  }
  ff_enqueue(data) {
    while (0 != buf[head]);
    buf[head] = data;
    head = NEXT(head);
  }

  27. FastForward CLF Queue (2)
  (Diagram: head and tail each index the ring buffer buf[0]..buf[n] independently)
  ff_enqueue(data) {
    while (0 != buf[head]);
    buf[head] = data;
    head = NEXT(head);
  }
  Observe how the head/tail cachelines will NOT ping-pong. BUT “buf” will still cause the cachelines to ping-pong.

  28. FastForward CLF Queue (3)
  (Diagram: head and tail each index the ring buffer buf[0]..buf[n] independently)
  ff_enqueue(data) {
    while (0 != buf[head]);
    buf[head] = data;
    head = NEXT(head);
  }
  Solution: temporally slip the stages by a cacheline.

  29. Slip Timing

  30. Slip Timing Lost

  31. Maintaining Slip (Concepts)
  • Use distance as the quality metric
  • Explicitly compare head/tail
    • Causes cache ping-ponging
    • Perform rarely

  32. Maintaining Slip (Method)
  adjust_slip() {
    dist = distance(producer, consumer);
    if (dist < DANGER) {
      dist_old = 0;
      do {
        dist_old = dist;
        spin_wait(avg_stage_time * (OK - dist));
        dist = distance(producer, consumer);
      } while (dist < OK && dist > dist_old);
    }
  }

  33. Comparative Performance
  (Chart: Lamport vs. FastForward)

  34. Thrashing and Auto-Balancing
  (Charts: FastForward thrashing vs. FastForward balanced)

  35. Cache Verification
  (Charts: FastForward thrashing vs. FastForward balanced)

  36. On/Off-Die Communications
  (Diagram: on-die vs. off-die communication paths to memory)

  37. On/Off-Die Performance
  (Charts: FastForward on-die vs. FastForward off-die)

  38. Proven Property
  “In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order.”

  39. Routing/Bridge Data Flow
  (Diagram: IP → App → OP data flow through the OS)

  40. FShm Forward (Bridge)
  • AES encrypting filter
    • Link-layer encryption
    • ~10 lines of code
  • IDS
    • Complex rules
  • IPS
    • DDoS
  • Data recorders
    • Traffic analysis
    • Forensics
    • CALEA
  64B* → 1.36 Mfps

  41. Flexible Communication
  • Pure software stack communicating via shared memory
  • Abstracted at the driver/NIC boundary
  • Cross-domain modules (Kernel/Process, T/T, P/P, K/K)
  • Compatible with existing OS/library/language services
  • Can communicate with any device on the memory interconnect

  42. FastForward Conclusions…
  • Cross-layer optimization
  • FastForward: cache-optimized point-to-point CLF queue
    • Fast
    • Robust against unbalanced stages
    • Hides die-die communication
    • Works with strong to weak memory consistency models

  43. Gazing into the Crystal Ball
  Hardware → 10 ns
  FastForward → 28 ns
  Lamport → 160 ns
  Locks → 190 ns
  GigE

  44. Gazing into the Crystal Ball
  Hardware → 10 ns
  FastForward → 14 ns
  FastForward → 28 ns
  Lamport → 160 ns
  Locks → 190 ns
  GigE

  45. The Real World
  (Chart: cycles per iteration vs. iteration)

  46. The Really Real World
  (Chart: cycles per iteration vs. iteration)

  47. The Pipelined Real World

  48. Bare Metal

  49. Bare Metal
  (Chart: cycles per iteration vs. iteration)

  50. Questions?
  john.giacomoni@colorado.edu
  http://www.cs.colorado.edu/~jgiacomo
  http://ce.colorado.edu/core
