
Design and analysis of algorithms for multicore architectures


Presentation Transcript


  1. Design and analysis of algorithms for multicore architectures Alejandro Salinger April 2nd, 2009 Joint work with Alex López-Ortiz and Reza Dorrigiv

2. Outline
• Models of computation
• Motivation
• Parallelism in multicore
• Low Degree PRAM (LoPRAM)
• Work-optimal algorithms
• Divide & conquer
• Dynamic programming
• Related Work
• Conclusions

3. Abstract Modeling
• Capture the characteristics of a phenomenon with an adequate degree of accuracy, in order to facilitate analysis and prediction [Maggs et al.].
• Examples: financial markets, weather forecasting, particle movement, genetic information, etc.
• Several models can describe the same system or phenomenon.
• Trade-off between simplicity and accuracy.

4. Theoretical Models of Computation vs. [comparison figure]

5. Random Access Machine (RAM)
[Diagram: CPU, MEM, I/O]
• Models von Neumann's architecture.
• A program executing over an infinite array of registers.
• Random access.

6. Random Access Machine (RAM)
• Simple operation = 1 unit of time.
• Memory access = 1 unit of time.
• The model captures the essence of a computer.
• Simple.
• Useful in practice.

7. Parallelism is here!
• Why multicore processors?
• Sequential programming is too easy.
• We love doing things in parallel.
• We finally know how to effectively program in parallel.
• None of the above: there is no other way to make processors faster. [~Valiant, SPAA08]

8. CPU Frequencies [Hennessy 06]
[Figure: CPU clock frequencies over time]

9. The Walls [Patterson, Smith]
• Power Wall: high power consumption.
• Power ≈ CV²f.
• Simpler processors enjoy more MIPS per watt.
• ILP Wall: little instruction-level parallelism left to exploit.
• Branch prediction, speculation, out-of-order issue, register renaming, etc.
• Not effective for control-dependent computations or data-dependent memory addressing.
• Memory Wall: memory latency.
• Memory and cache speeds do not match processor speed.
• Communication bottleneck.

10. AMD Opteron Quad Core

11. Multicore Architectures
• Predominant model in practice.
• "64 to 128 cores per microprocessor by 2015" [Intel roadmap]
• "the next version of Moore's law" [Steve Scott, CTO, Cray]

12. Parallel Models of Computation
PRAM (Parallel Random Access Machine)
• p synchronous processors.
• Multiple-Instruction Multiple-Data (MIMD).
• Communication through shared memory.
• Unit-cost operations: read, compute, write.

13. PRAM
• Traditionally, algorithms assume Θ(n) processors.
• What if only p < n processors are available?
• Simulate the Θ(n)-processor solution using Brent's lemma:
  Tp(n) ≤ ⌈m/p⌉ · Tm(n), for m > p processors simulated by p.
• Recall, optimal speedup: Tp(n) = Θ(T1(n)/p).
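As a worked instance (my numbers, not on the slide): a sorting algorithm running in O(log n) time on m = n processors (e.g., Cole's merge sort) simulated on p cores gives Tp(n) ≤ ⌈n/p⌉ · O(log n) = O((n log n)/p), which is optimal speedup over sequential Θ(n log n) sorting.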

14. PRAM (cont.)
• Facilitates analysis and design. However:
• Memory accesses and local operations have different costs.
• Memory has limited bandwidth.
• Processors are not synchronous.
• Difficult to design algorithms that take full advantage of Θ(n) processors.

15. Multicore Parallelism
• Small number of cores: low-degree parallelism.
• High-level, thread-based control of parallelism.
• Shared memory; private and shared caches.

16. What to do with these cores?
"Programmability has now replaced power as the number one impediment to the continuation of Moore's law." [Gartner]
• Before, applications automatically took advantage of processor advances.
• Now, we need to take advantage of parallelism.
"Before, parallel programming meant high performance. Now it means everyday applications in laptops. The goal, the class of programmer, and the expectations are different." [Andrew Chien, Intel]

17. The Multicore Challenge
Design a model that:
• Reflects the available degree of parallelism.
• Is multi-threaded.
• Degrades gracefully with a smaller number of processors than originally assumed (dynamic).
• Allows easy theoretical analysis.
• Is easy to program.

18. Low Degree Parallelism
• The number of cores is growing.
• Not a constant.
• Modeled as ~log n.
• Similar to bit-level parallelism:
• Word size was considered a constant when words were 4 or 8 bits.
• Now described as ~log n in the word RAM model.

19. Low Degree PRAM (LoPRAM)
• PRAM with O(log n) processors.
• Multiple-Instruction Multiple-Data (MIMD).
• CREW: Concurrent-Read, Exclusive-Write.
• High-level, thread-based parallelism (asynchronous).
• Communication through shared memory.
• Semaphores and automatic serialization available and transparent to the programmer.
• p = O(log n), but not necessarily p = Θ(log n).

20. Threads in the LoPRAM
• PAL threads: Parallel ALgorithmic threads.
• A pal-thread call is a request for the creation of a thread.
• The requesting thread issues its requests in batch mode and suspends execution until they complete.
• If a core is available, the pal-thread is activated and becomes a conventional thread.
• Otherwise, the request is added to a tree of requests, under the node corresponding to the calling thread.
[Thread states: pending, active, blocked]

21. Threads in the LoPRAM
• PAL threads (cont.)
• When a thread blocks, its core is assigned to its first pending child.
• If cores become available, pending requests are activated in breadth-first order of the request tree.
• When a thread has no pending children, control returns to its parent.
[Thread states: pending, active, blocked]
A minimal scheduling sketch follows.
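The sketch below shows, in C with POSIX threads, how the core-budget part of this policy could look. It is illustrative only, not the authors' implementation: pal_init, pal_spawn, and pal_join are hypothetical names, and only the "activate if a core is free, otherwise serialize in the caller" rule is modeled; the breadth-first activation order of the request tree is not.

/*
 * Sketch of pal-thread-style spawning on top of POSIX threads.
 * A shared core budget decides whether a request becomes a real
 * thread or is serialized in the caller.
 */
#include <pthread.h>
#include <unistd.h>

static int free_cores;                 /* cores not currently running a thread */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

typedef void *(*task_fn)(void *);

void pal_init(void) {
    /* one core runs the main thread; the rest may run pal-threads */
    free_cores = (int)sysconf(_SC_NPROCESSORS_ONLN) - 1;
}

/* Try to run fn on its own core; if none is free, run it inline (serialized). */
void pal_spawn(task_fn fn, void *arg, pthread_t *tid, int *spawned) {
    pthread_mutex_lock(&lock);
    *spawned = (free_cores > 0);
    if (*spawned) free_cores--;
    pthread_mutex_unlock(&lock);

    if (*spawned)
        pthread_create(tid, NULL, fn, arg);
    else
        fn(arg);                        /* caller's core executes the child */
}

/* Implicit join: wait for a spawned child and return its core to the budget. */
void pal_join(pthread_t tid, int spawned) {
    if (!spawned) return;
    pthread_join(tid, NULL);
    pthread_mutex_lock(&lock);
    free_cores++;
    pthread_mutex_unlock(&lock);
}

This flattens the request tree into a simple counter, which is enough to reproduce the graceful serialization behavior on p cores.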

22. Example: MergeSort

void mergeSort(int numbers[], int temp[], int array_size) {
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right) {
    int mid = (right + left) / 2;
    if (right > left) {
        m_sort(numbers, temp, left, mid);
        m_sort(numbers, temp, mid+1, right);
        merge(numbers, temp, left, mid+1, right);
    }
}

23. Example: MergeSort

void mergeSort(int numbers[], int temp[], int array_size) {
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right) {
    int mid = (right + left) / 2;
    if (right > left) {
        palthreads {            // do in parallel if possible
            m_sort(numbers, temp, left, mid);
            m_sort(numbers, temp, mid+1, right);
        }                       // implicit join
        merge(numbers, temp, left, mid+1, right);
    }
}
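palthreads is not standard C. A minimal sketch of one way to approximate its semantics (my adaptation, not the authors' implementation) uses OpenMP tasks: with p worker threads, at most p tasks run concurrently and surplus tasks queue, which mirrors the implicit serialization of pal-threads.

#include <omp.h>

void merge(int numbers[], int temp[], int left, int mid, int right);

void m_sort_omp(int numbers[], int temp[], int left, int right) {
    int mid = (right + left) / 2;
    if (right > left) {
        #pragma omp task shared(numbers, temp)   /* first half in parallel if a worker is free */
        m_sort_omp(numbers, temp, left, mid);
        m_sort_omp(numbers, temp, mid + 1, right);
        #pragma omp taskwait                     /* implicit join before merging */
        merge(numbers, temp, left, mid + 1, right);
    }
}

void mergeSort_omp(int numbers[], int temp[], int array_size) {
    #pragma omp parallel
    #pragma omp single                           /* one thread seeds the task tree */
    m_sort_omp(numbers, temp, 0, array_size - 1);
}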

24. Order of Execution
[Figure: mergesort execution with n = 16 and p = 4; thread states: pending, active, blocked]

25. Analysis
For mergesort with a sequential merge, p processors give
  Tp(n) = O((n log n)/p + n).
The first term dominates so long as p = O(log n).
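Filling in the arithmetic: the O(n log n) total work is shared by p cores, while the final merge is inherently sequential and costs Θ(n), giving the additive n term. The first term dominates when (n log n)/p ≥ n, i.e., when p ≤ log n, so speedup is optimal exactly for p = O(log n).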

26. In general: Divide & Conquer
• Recursive divide-and-conquer algorithms with running time given by:
  T(n) = a T(n/b) + f(n), a ≥ 1, b > 1
• By the master theorem:
  Case 1: T(n) = Θ(n^(log_b a))        if f(n) = O(n^(log_b a − ε))
  Case 2: T(n) = Θ(n^(log_b a) log n)  if f(n) = Θ(n^(log_b a))
  Case 3: T(n) = Θ(f(n))               if f(n) = Ω(n^(log_b a + ε))

27. Divide & Conquer [figure]

28. Cases 1 and 2
Tp(n) = Θ(T(n)/p) works so long as p = O(log n).

29. Case 3
i) Sequential merging: Tp(n) = Θ(f(n))
ii) Parallel merging: Tp(n) = Θ(f(n)/p)
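A concrete instance (illustrative, not from the slides): T(n) = 2T(n/2) + Θ(n²) falls in case 3, so T(n) = Θ(n²) and the root's combine step dominates the whole computation. If that combine runs sequentially, Tp(n) = Θ(n²) and extra cores do not help; if it is parallelized, Tp(n) = Θ(n²/p).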

30. Master Theorem for LoPRAM
• Divide-and-conquer algorithms whose running time is given by the master theorem achieve optimal speedup so long as the number of processors is at most log n, i.e., Tp(n) = T1(n)/p.

31. LoPRAM
• Optimal speedup so long as p ≤ √n or p ≤ log n, depending on the cost of the merging phase.
• The p ≤ √n barrier was observed for certain P-complete problems in the PRAM [Kruskal et al. '90].
• The p ≤ log n barrier was observed in heaps [Munro and Robertson '79].
• With these bounds, communication between processors is practical with a complete network.

32. Mergesort
T(n) = 2T(n/2) + O(n)
T(n) = O(n log n)
Tp(n) = O((n log n)/p)

34. Matrix Multiplication (Strassen)
T(n) = 7T(n/2) + O(n²)
T(n) = O(n^2.8)
Tp(n) = O(n^2.8/p)
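Connecting this to the master theorem (a step the slide leaves implicit): here a = 7, b = 2, f(n) = O(n²), and log_2 7 ≈ 2.81 > 2, so case 1 applies and T(n) = Θ(n^(log_2 7)) = O(n^2.8); with p = O(log n) cores, Tp(n) = Θ(T(n)/p) = O(n^2.8/p).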

36. LoPRAM
• Divide-and-conquer algorithms are easy to parallelize.
• Not so easy when p is O(n) (e.g., mergesort).
• Other examples?
• Yes: the approach extends to dynamic programming, where it also achieves optimal speedup.

37. Dynamic Programming
• Generic parallel algorithms for problems solvable by dynamic programming.
• Given a dynamic programming solution, determine the corresponding Directed Acyclic Graph (DAG) of cell dependencies and execute it in parallel (see the sketch below).
• Speedup depends on the degree of parallelism of the DAG.
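As an illustration of this idea (my example, not from the slides): in the longest-common-subsequence DP, cell (i, j) depends only on (i−1, j), (i, j−1), and (i−1, j−1), so all cells on one anti-diagonal are independent. Executing the DAG level by level then looks like this in C with OpenMP (compile with -fopenmp):

#include <string.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* LCS length table computed anti-diagonal by anti-diagonal.
 * Cells with the same i + j = d form one DAG level and are
 * independent, so the inner loop splits across cores. */
int lcs(const char *x, const char *y, int n, int m, int L[n+1][m+1]) {
    memset(L, 0, sizeof(int) * (n + 1) * (m + 1));
    for (int d = 2; d <= n + m; d++) {          /* one DAG level at a time */
        int lo = MAX(1, d - m);
        int hi = (d - 1 < n) ? d - 1 : n;
        #pragma omp parallel for                 /* independent cells of level d */
        for (int i = lo; i <= hi; i++) {
            int j = d - i;
            if (x[i-1] == y[j-1])
                L[i][j] = L[i-1][j-1] + 1;
            else
                L[i][j] = MAX(L[i-1][j], L[i][j-1]);
        }
    }
    return L[n][m];
}

The number of cells per level bounds the usable parallelism at that level, which is exactly the "degree of parallelism of the DAG" that limits the speedup.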

38. Dynamic Programming [figure]

39. Dynamic Programming [figure]

40. Dynamic Programming
[Figure: DP table filled level by level, assuming p = 3]

41. Related Work
Bulk Synchronous Parallel (BSP) model [Valiant]:
• Processors with local memory, a router for point-to-point messages, periodic synchronization.
• Synchronization every L steps at most: accounts for synchronization cost and communication latency.
• g local operations per memory access: accounts for bandwidth limitations.
• Incentive for latency hiding.
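For reference (standard BSP accounting, not shown on the slide): a superstep in which each processor performs at most w local operations and sends or receives at most h messages costs w + g·h + L.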

42. Related Work
BSP*: BSP for multicores [Valiant]
• d levels, each with parameters (pi, Li, mi, gi):
• pi: number of components.
• Li: synchronization cost.
• mi: size of memory.
• gi: data rate.
• Level 0: cores.
• Portable algorithms.
• "Immortal algorithms."
[Diagram: components at levels j−1 and j with parameters mj, gj]

43. Related Work
Cilk: a programming platform for multithreaded computations with provable performance.
• Parallel computation modeled as a DAG of tasks.
• Scheduler: work stealing.
• Processing time, space, and communication optimal to within constant factors for "fully strict" multithreaded computations.

44. Cache Efficiency
• Cost of memory access >> cost of a local operation.
• Cost of memory access determined by cache miss or hit, not by routing cost, latency, or gap (as in the BSP or LogP models).
• How to schedule for cache performance?
• Private caches: processors work on different data (e.g., work stealing).
• Shared cache: processors work on the same data (e.g., Parallel Depth First).

45. Related Work: Cache [Blelloch et al. SODA08]
• Multicore-cache model (private L1, shared L2).
• Controlled-PDF scheduler.
• Cache efficiency within a constant factor of the sequential complexity, for both L1 and L2, for a broad class of divide-and-conquer algorithms.
• The divide-and-conquer algorithms considered use parallel merging.

46. Related Work: Cache [Chowdhury and Ramachandran, SPAA08]
• Generic cache-efficient dynamic programming algorithms.
• 3 cache models:
• Private caches for each core.
• A cache shared by all cores.
• Multicore: private L1 and shared L2.
• Algorithms for each of 3 types of problems:
• Local dependency (e.g., longest common subsequence).
• Gaussian Elimination Paradigm (e.g., LU decomposition).
• Parenthesis problem (e.g., matrix chain multiplication).
• Cache-efficient execution up to the critical path length of the algorithm, I∞(n) = Θ(n).

47. Conclusions
• Microprocessor development has shifted the paradigm from higher frequencies to multicore.
• This scenario calls for a new approach in theoretical models of computation.
• We introduced a new model that:
• is faithful to current architectures,
• avoids the pitfalls of the PRAM,
• is theoretically simple,
• allows significant classes of algorithms to be parallelized with little effort.

48. Future Work
• Extend optimal parallelization to more general classes of divide-and-conquer algorithms and other types of problems.
• Consider cache efficiency for different cache models, and possibly for other types of problems.
• Determine the barrier in the number of processors for optimal parallelization for other classes of problems.

  49. Design and Analysis of Algorithms for Multicore Architectures - Alejandro Salinger
