
Theory of Multicore Algorithms

Jeremy Kepner and Nadya Bliss, MIT Lincoln Laboratory, HPEC 2008


Presentation Transcript


  1. Theory of Multicore Algorithms Jeremy Kepner and Nadya Bliss MIT Lincoln Laboratory HPEC 2008 This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

  2. Outline • Programming Challenge • Example Issues • Theoretical Approach • Integrated Process • Parallel Design • Distributed Arrays • Kuck Diagrams • Hierarchical Arrays • Tasks and Conduits • Summary

  3. Multicore Programming Challenge Past Programming Model: Von Neumann Future Programming Model: ??? • Great success of Moore’s Law era • Simple model: load, op, store • Many transistors devoted to delivering this model • Moore’s Law is ending • Need transistors for performance • Processor topology includes registers, cache, local memory, remote memory, disk • Cell has multiple programming models Can we describe how an algorithm is supposed to behave on a hierarchical heterogeneous multicore processor?

  4. Example Issues X,Y : NxN, Y = X + 1 • A serial algorithm can run on a serial processor with relatively little specification • A hierarchical heterogeneous multicore algorithm requires far more information: Where is it running? Where is the data? How is it distributed? What is the initialization policy? How does the data flow? What is the allowed message size? Are computations and communications overlapped? Which binary should run?

  5. Theoretical Approach [Figure: Task1^S1() with A : R^(NxP(N)) on processors 0–3, connected by a conduit to Task2^S2() with B : R^(NxP(N)) on processors 4–7; Task2 carries a subscript 2 and appears as Replica 0 and Replica 1] • Provide notation and diagrams that allow hierarchical heterogeneous multicore algorithms to be specified

  6. Integrated Development Process 1. Develop serial code on the desktop (Y = X + 1, X,Y : NxN) 2. Parallelize code on the cluster (Y = X + 1, X,Y : P(N)xN) 3. Deploy code on the embedded computer (Y = X + 1, X,Y : P(P(N))xN) 4. Automatically parallelize code • Should naturally support standard parallel embedded software development practices

  7. Outline • Serial Program • Parallel Execution • Distributed Arrays • Redistribution • Parallel Design • Distributed Arrays • Kuck Diagrams • Hierarchical Arrays • Tasks and Conduits • Summary

  8. Serial Program X,Y : NxN • Math is the language of algorithms • Allows mathematical expressions to be written concisely • Multi-dimensional arrays are fundamental to mathematics Y = X + 1
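The serial starting point maps directly onto array code; a minimal sketch in Python/NumPy (N is an illustrative size, since the slides leave it abstract):

```python
import numpy as np

N = 8                    # illustrative size; the slides leave N abstract
X = np.zeros((N, N))     # X : NxN
Y = X + 1                # Y = X + 1, written once for the whole array
```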

  9. Parallel Execution X,Y : NxN • Run NP copies of same program • Single Program Multiple Data (SPMD) • Each copy has a unique PID (PID = 0, 1, …, NP-1) • Every array is replicated on each copy of the program Y = X + 1
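A minimal sketch of the SPMD model, simulating the NP program copies sequentially in plain Python/NumPy (the real copies would run concurrently, one per processor; names here are illustrative, not the presentation's actual framework):

```python
import numpy as np

N, NP = 8, 4                          # illustrative sizes

def program_copy(PID):
    """One of the NP identical copies; every array is fully replicated."""
    X = np.zeros((N, N))              # each copy holds all of X and Y
    Y = X + 1                         # every copy performs the same operation
    return Y

# Simulated SPMD run: copies with PID = 0 .. NP-1
results = [program_copy(PID) for PID in range(NP)]
```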

  10. Distributed Array Program X,Y : P(N)xN • Use P() notation to make a distributed array • Tells program which dimension to distribute data • Each program copy (PID = 0, 1, …, NP-1) implicitly operates only on its own data (owner computes rule) Y = X + 1
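A sketch of what the P(N)xN map implies under the owner-computes rule, again simulating the copies in plain Python/NumPy (the block-map helper is an assumption for illustration, not the notation's defined implementation):

```python
import numpy as np

N, NP = 8, 4                               # illustrative sizes

def owned_rows(PID):
    """Block map for X,Y : P(N)xN: PID owns a contiguous band of rows."""
    return PID * N // NP, (PID + 1) * N // NP

def program_copy(PID):
    lo, hi = owned_rows(PID)
    X_local = np.zeros((hi - lo, N))       # only the owned rows are stored
    Y_local = X_local + 1                  # owner computes: update only the local rows
    return Y_local

results = [program_copy(PID) for PID in range(NP)]
```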

  11. Explicitly Local Program X,Y : P(N)xN • Use .loc notation to explicitly retrieve local part of a distributed array • Operation is the same as serial program, but with different data on each processor (recommended approach) Y.loc = X.loc + 1
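A sketch of the .loc idea with a stand-in class (the class and its fields are hypothetical, meant only to show that the explicitly local statement keeps the serial form):

```python
import numpy as np

N, NP = 8, 4                               # illustrative sizes

class DistArray:
    """Hypothetical stand-in for a distributed array with a .loc view."""
    def __init__(self, PID):
        lo, hi = PID * N // NP, (PID + 1) * N // NP
        self.loc = np.zeros((hi - lo, N))  # this copy's block of rows

def program_copy(PID):
    X, Y = DistArray(PID), DistArray(PID)
    Y.loc[:] = X.loc + 1                   # same form as the serial program
    return Y

results = [program_copy(PID) for PID in range(NP)]
```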

  12. Parallel Data Maps [Figure: three example maps, P(N)xN, NxP(N), and P(N)xP(N), each mapping the array (math) onto a 4-processor computer with PIDs 0–3] • A map is a mapping of array indices to processors • Can be block, cyclic, block-cyclic, or block w/overlap • Use P() notation to set which dimension to split among processors
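The different map kinds boil down to different index-to-PID functions; a sketch of three of them (block with overlap omitted), with illustrative N, NP, and block size:

```python
N, NP = 16, 4                       # illustrative sizes

def block_owner(i):
    """Block map: contiguous chunks of N/NP indices per processor."""
    return i // (N // NP)

def cyclic_owner(i):
    """Cyclic map: indices dealt out round-robin."""
    return i % NP

def block_cyclic_owner(i, b=2):
    """Block-cyclic map: blocks of size b dealt out round-robin."""
    return (i // b) % NP

for name, owner in [("block", block_owner), ("cyclic", cyclic_owner),
                    ("block-cyclic", block_cyclic_owner)]:
    print(name, [owner(i) for i in range(N)])
```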

  13. Redistribution of Data X : P(N)xN, Y : NxP(N) [Figure: X's row blocks on P0–P3 are carved into tiles and sent to the owners of Y's column blocks] • Different distributed arrays can have different maps • Assignment between arrays with the “=” operator causes data to be redistributed • Underlying library determines all the messages to send Y = X + 1
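A sketch of the bookkeeping the underlying library performs for this redistribution: with X row-distributed and Y column-distributed over the same four processors, every (sender, receiver) pair exchanges one tile (sizes and layout here are illustrative):

```python
N, NP = 8, 4
blk = N // NP                      # rows (or columns) owned per processor

# X : P(N)xN  -> processor p owns rows    [p*blk, (p+1)*blk)
# Y : NxP(N)  -> processor q owns columns [q*blk, (q+1)*blk)
# The assignment therefore requires each pair (p, q) to move the tile at
# p's rows and q's columns from p to q.
messages = []
for p in range(NP):                # sender: owner of X's rows
    for q in range(NP):            # receiver: owner of Y's columns
        tile = (slice(p * blk, (p + 1) * blk),
                slice(q * blk, (q + 1) * blk))
        messages.append((p, q, tile))

print(len(messages), "tiles to send")   # NP*NP = 16 messages in this example
```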

  14. Outline • Serial • Parallel • Hierarchical • Cell • Parallel Design • Distributed Arrays • Kuck Diagrams • Hierarchical Arrays • Tasks and Conduits • Summary

  15. Single Processor Kuck Diagram A : R^(NxN) M0 P0 • Processors denoted by boxes • Memory denoted by ovals • Lines connect associated processors and memories • Subscript denotes level in the memory hierarchy

  16. Parallel Kuck Diagram A : R^(NxP(N)) [Figure: four P0/M0 pairs joined by Net0.5] • Replicates serial processors • Net denotes network connecting memories at a level in the hierarchy (incremented by 0.5) • Distributed array has a local piece on each memory

  17. Hierarchical Kuck Diagram [Figure: 2-level hierarchy — P0/M0 pairs joined by Net0.5, shared memories SM1 with SMNet1, and a top-level SM2 with SMNet2 joined by Net1.5] The Kuck notation provides a clear way of describing a hardware architecture along with the memory and communication hierarchy • Subscript indicates hierarchy level • Legend: P - processor, Net - inter-processor network, M - memory, SM - shared memory, SMNet - shared memory network • x.5 subscript for Net indicates indirect memory access *High Performance Computing: Challenges for Future Systems, David Kuck, 1996

  18. Cell Example Kuck diagram for the Sony/Toshiba/IBM Cell processor [Figure: the PPE and the SPEs (0–7), each with its own M0, joined by Net0.5 and sharing main memory M1 through MNet1] • P_PPE = PPE speed (GFLOPS) • M0,PPE = size of PPE cache (bytes) • P_PPE–M0,PPE = PPE to cache bandwidth (GB/sec) • P_SPE = SPE speed (GFLOPS) • M0,SPE = size of SPE local store (bytes) • P_SPE–M0,SPE = SPE to local store bandwidth (GB/sec) • Net0.5 = SPE to SPE bandwidth (matrix encoding topology, GB/sec) • MNet1 = PPE/SPE to main memory bandwidth (GB/sec) • M1 = size of main memory (bytes)
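The slide's parameter list amounts to a small record describing the machine; a sketch of such a container in Python (field names follow the slide; no actual Cell numbers are asserted here):

```python
from dataclasses import dataclass

@dataclass
class CellKuckParameters:
    """Illustrative container for the Kuck-diagram parameters on this slide."""
    P_PPE: float        # PPE speed (GFLOPS)
    M0_PPE: int         # size of PPE cache (bytes)
    BW_PPE_M0: float    # PPE to cache bandwidth (GB/sec)
    P_SPE: float        # SPE speed (GFLOPS)
    M0_SPE: int         # size of SPE local store (bytes)
    BW_SPE_M0: float    # SPE to local store bandwidth (GB/sec)
    Net0_5: float       # SPE to SPE bandwidth (GB/sec)
    MNet1: float        # PPE/SPE to main memory bandwidth (GB/sec)
    M1: int             # size of main memory (bytes)
```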

  19. Outline • Hierarchical Arrays • Hierarchical Maps • Kuck Diagram • Explicitly Local Program • Parallel Design • Distributed Arrays • Kuck Diagrams • Hierarchical Arrays • Tasks and Conduits • Summary

  20. Hierarchical Arrays [Figure: a global array distributed over PIDs 0–3, with each PID's local array further sub-divided over local PIDs 0–1] • Hierarchical arrays allow algorithms to conform to hierarchical multicore processors • Each processor in the outer P() controls another set of processors given by the inner P() • The array local to an outer processor is sub-divided among its local processors
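A sketch of the two-level map behind P(P(N)): the outer map gives each outer processor a band of rows, and the inner map sub-divides that band (processor counts are illustrative):

```python
N = 16
NP_outer, NP_inner = 4, 2                  # illustrative processor counts

def outer_rows(p):
    """First level: outer processor p owns a contiguous band of rows."""
    band = N // NP_outer
    return p * band, (p + 1) * band

def inner_rows(p, q):
    """Second level: p's band is sub-divided among its local processors."""
    lo, hi = outer_rows(p)
    band = (hi - lo) // NP_inner
    return lo + q * band, lo + (q + 1) * band

for p in range(NP_outer):
    for q in range(NP_inner):
        print(f"outer P{p}, inner P{q} owns rows {inner_rows(p, q)}")
```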

  21. Hierarchical Array and Kuck Diagram A : R^(NxP(P(N))) [Figure: A.loc resides in each SM1; A.loc.loc resides in each M0] • Array is allocated across the SM1 of the outer processors • Within each SM1, responsibility for processing is divided among the local processors • Those processors move their portion to their local M0

  22. Explicitly Local Hierarchical Program X,Y : P(P(N))xN • Extend the .loc notation to .loc.loc to explicitly retrieve the local part of a local distributed array (assumes SPMD on the inner processors) • Subscripts on .loc provide explicit access to each hierarchy level (implicit otherwise) Y.loc.loc = X.loc.loc + 1
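A sketch of the .loc.loc chain with a hypothetical two-level stand-in class, just to show that the innermost statement keeps the serial form:

```python
import numpy as np

N = 16
NP_outer, NP_inner = 4, 2                     # illustrative processor counts

class Level:
    """Hypothetical wrapper exposing the next level through .loc."""
    def __init__(self, payload):
        self.loc = payload

def hier_array():
    """Stand-in for X : P(P(N))xN as seen by one innermost processor."""
    rows = N // NP_outer // NP_inner          # rows owned at the innermost level
    return Level(Level(np.zeros((rows, N))))  # .loc -> local piece, .loc.loc -> its piece

def program_copy(p, q):
    X, Y = hier_array(), hier_array()
    Y.loc.loc[:] = X.loc.loc + 1              # same statement as the serial program
    return Y

results = [program_copy(p, q) for p in range(NP_outer) for q in range(NP_inner)]
```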

  23. Block Hierarchical Arrays [Figure: global array → local arrays (PID 0–3) → core blocks (local PID 0–1), with in-core blocks of size b=4 and the remainder out-of-core] • Memory constraints are common at the lowest level of the hierarchy • Blocking at this level allows control of the size of data operated on by each processor

  24. Block Hierarchical Program X,Y : P(Pb(4)(N))xN • Pb(4) indicates each sub-array should be broken up into blocks of size 4 • .nblk provides the number of blocks for looping over each block; allows controlling the size of data on the lowest level for i = 0, X.loc.loc.nblk-1 Y.loc.loc.blk_i = X.loc.loc.blk_i + 1
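A sketch of the block loop implied by Pb(4), .nblk, and .blk_i, using plain NumPy slices in place of the block accessors (sizes are illustrative):

```python
import numpy as np

N, b = 16, 4                       # illustrative array size and block size (Pb(4))
rows_per_core = 8                  # rows owned by one innermost processor (illustrative)

X_loc_loc = np.zeros((rows_per_core, N))   # this core's piece of X
Y_loc_loc = np.empty_like(X_loc_loc)

nblk = rows_per_core // b          # role of .nblk: number of b-row blocks to loop over
for i in range(nblk):
    blk = slice(i * b, (i + 1) * b)          # role of .blk_i: one in-core block
    Y_loc_loc[blk] = X_loc_loc[blk] + 1      # operate on one block at a time
```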

  25. Outline • Basic Pipeline • Replicated Tasks • Replicated Pipelines • Parallel Design • Distributed Arrays • Kuck Diagrams • Hierarchical Arrays • Tasks and Conduits • Summary

  26. Tasks and Conduits [Figure: Task1^S1() with A : R^(NxP(N)) on processors 0–3, connected by a conduit to Task2^S2() with B : R^(NxP(N)) on processors 4–7] • The S1 superscript runs the task on a set of processors; distributed arrays are allocated relative to this scope • Pub/sub conduits move data between tasks
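A sketch of the task/conduit pattern using a plain Python queue as the conduit (the real framework's task scopes and pub/sub conduits are richer; everything here is illustrative):

```python
from queue import Queue
import numpy as np

N = 8
conduit = Queue()                  # stand-in for a pub/sub conduit between tasks

def task1():
    """Would run on its own processor scope; publishes A into the conduit."""
    A = np.zeros((N, N))
    conduit.put(A + 1)

def task2():
    """Would run on a separate processor scope; subscribes to the conduit."""
    B = conduit.get()
    return B

task1()
B = task2()
```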

  27. Replicated Tasks [Figure: same pipeline as slide 26, but Task2^S2() carries a subscript 2, creating Replica 0 and Replica 1 on processors 4–7] • The 2 subscript creates task replicas; the conduit will round-robin between them
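A sketch of the round-robin behavior of the conduit when the receiving task has two replicas (purely illustrative):

```python
from itertools import cycle

replicas = ["Task2 replica 0", "Task2 replica 1"]   # created by the "2" subscript
next_replica = cycle(replicas)

# The conduit alternates which replica receives each published message.
for msg_id in range(6):
    print(f"message {msg_id} -> {next(next_replica)}")
```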

  28. Replicated Pipelines [Figure: both Task1^S1() and Task2^S2() carry a subscript 2, creating Replica 0 and Replica 1 of the entire pipeline across processors 0–3 and 4–7] • Identical 2 subscripts on both tasks create a replicated pipeline

  29. Summary • Hierarchical heterogeneous multicore processors are difficult to program • Specifying how an algorithm is supposed to behave on such a processor is critical • Proposed notation provides mathematical constructs for concisely describing hierarchical heterogeneous multicore algorithms
