
Revisiting a slide from the syllabus: CS 525 will cover


Presentation Transcript


  1. Revisiting a slide from the syllabus: CS 525 will cover
    • Parallel and distributed computing architectures
      • Shared memory processors
      • Distributed memory processors
      • Multi-core processors
    • Parallel programming
      • Shared memory programming (OpenMP)
      • Distributed memory programming (MPI)
      • Thread-based programming
    • Parallel algorithms
      • Sorting
      • Matrix-vector multiplication
      • Graph algorithms
    • Applications
      • Computational science and engineering
      • High-performance computing

  2. Revisiting a slide from the syllabus: CS 525 will cover (same outline as the previous slide)
    This lecture hopes to touch on each of the highlighted aspects via one running example.

  3. Intel Nehalem: an example of a multi-core shared-memory architecture
    A few of the features (block diagram not reproduced here):
    • 2 sockets
    • 4 cores per socket
    • 2 hyper-threads per core
    • 16 threads in total
    • Processor speed: 2.5 GHz
    • 24 GB total memory
    • Cache:
      • L1 (32 KB, on-core)
      • L2 (2x6 MB, on-core)
      • L3 (8 MB, shared)
    • Memory: virtually globally shared (from the programmer’s point of view)
    • Multithreading: simultaneous (multiple instructions from ready threads executed in a given cycle)
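As a quick sanity check of how such a machine appears to a program, a short OpenMP query like the one below (a hypothetical sketch, not from the slides) reports the logical processors the runtime sees; on the 2-socket, 4-core, 2-way hyper-threaded system described above this would typically be 16. Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* Number of logical processors visible to the OpenMP runtime
           (2 sockets x 4 cores x 2 hyper-threads = 16 on the Nehalem above). */
        printf("logical processors: %d\n", omp_get_num_procs());
        /* Default upper bound on the team size for the next parallel region. */
        printf("default max OpenMP threads: %d\n", omp_get_max_threads());
        return 0;
    }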

  4. Overview of Programming Models
    • Programming models provide support for expressing concurrency and synchronization
    • Process-based models assume that all data associated with a process is private by default, unless otherwise specified
    • Lightweight processes and threads assume that all memory is global (see the sketch below)
      • A thread is a single stream of control in the flow of a program
    • Directive-based programming models extend the threaded model by facilitating creation and synchronization of threads
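To make the “all memory is global” point concrete, here is a minimal Pthreads sketch (not from the slides): two threads update the same counter, and keeping that update consistent is left entirely to the programmer via an explicit mutex.

    #include <pthread.h>
    #include <stdio.h>

    /* Threads share the address space: both workers see the same counter. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* synchronization is explicit */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter); /* 200000: both threads updated shared state */
        return 0;
    }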

  5. Advantages of Multithreading (threaded programming)
    • Threads provide software portability
    • Multithreading enables latency hiding
    • Multithreading takes scheduling and load-balancing burdens away from programmers
    • Multithreaded programs are significantly easier to write than message-passing programs
    • Multithreading is becoming increasingly widespread in use due to the abundance of multicore platforms

  6. Shared Address Space Programming APIs
    • The POSIX Threads (Pthreads) API
      • Has emerged as the standard threads API
      • Low-level primitives (relatively difficult to work with)
    • OpenMP
      • A directive-based API for programming shared address space platforms (has become a standard)
      • Used with Fortran, C, and C++
      • Directives provide support for concurrency, synchronization, and data handling without having to explicitly manipulate threads (see the sketch below)
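As an illustration of the directive-based style, the following minimal OpenMP example (not from the slides) parallelizes a loop with a single pragma; thread creation, work distribution, and the sharing of a, b, and c are handled by the runtime.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* One directive turns the sequential loop into a parallel one:
           iterations are divided among the team of threads, i is private,
           and the arrays are shared by default. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f (up to %d threads)\n", c[N - 1], omp_get_max_threads());
        return 0;
    }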

  7. Parallelizing Graph Algorithms
    • Challenges:
      • Runtime is dominated by memory latency rather than processor speed
        • Little work is done while visiting a vertex or an edge, so there is little computation to hide the memory access cost
      • Access patterns are determined only at runtime, so prefetching techniques are inapplicable
      • Poor data locality makes it difficult to obtain good memory system performance
    • For these reasons, parallel performance
      • on distributed memory machines is often poor
      • on shared memory machines is often better
    We consider here graph coloring as an example of a graph algorithm to parallelize on shared memory machines.

  8. Graph coloring
    • Graph coloring is an assignment of colors (positive integers) to the vertices of a graph such that adjacent vertices get different colors
    • The objective is to find a coloring with the least number of colors
    • Examples of applications:
      • Concurrency discovery in parallel computing (illustrated by a figure on the original slide)
      • Sparse derivative computation
      • Frequency assignment
      • Register allocation, etc.

  9. A greedy algorithm for coloring
    • Graph coloring is NP-hard to solve optimally (and even to approximate)
    • The following greedy algorithm, GREEDY, gives very good solutions in practice (pseudocode listing not reproduced here; see the sketch below)
      • color is a vertex-indexed array that stores the color of each vertex
      • forbiddenColors is a color-indexed array used to mark colors that are impermissible for a vertex
    • Complexity of GREEDY: O(|E|), thanks to the way the array forbiddenColors is used
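Since the slide’s listing is not in the transcript, here is a C sketch that reconstructs GREEDY under the assumption that the graph is stored in compressed sparse row (CSR) form; the array names ptr, adj, and forbidden are illustrative, not taken from the slides. The caller is assumed to pass color initialized to all zeros (0 = uncolored).

    #include <stdlib.h>

    /* Greedy coloring: visit each vertex and give it the smallest color
       not used by any of its already-colored neighbors. */
    void greedy_color(int n, const int *ptr, const int *adj, int *color)
    {
        /* forbidden[c] == v means color c is impermissible for the current
           vertex v. Tagging entries with the vertex index avoids resetting
           the array for every vertex, which keeps the total work at O(|E|). */
        int *forbidden = malloc((size_t)(n + 2) * sizeof(int));
        for (int c = 0; c <= n + 1; c++) forbidden[c] = -1;

        for (int v = 0; v < n; v++) {
            for (int e = ptr[v]; e < ptr[v + 1]; e++) {
                int w = adj[e];                 /* neighbor of v */
                if (color[w] != 0)
                    forbidden[color[w]] = v;    /* w's color is off-limits for v */
            }
            int c = 1;
            while (forbidden[c] == v)           /* smallest permissible color */
                c++;
            color[v] = c;
        }
        free(forbidden);
    }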

  10. Parallelizing Greedy Coloring
    • Desired goal: parallelize GREEDY such that
      • the parallel runtime is roughly O(|E|/p) when p processors (threads) are used
      • the number of colors used is nearly the same as in the serial case
    • This is difficult to achieve since GREEDY is inherently sequential
    • Challenge: come up with a way to create concurrency in a nontrivial way

  11. A potentially “generic” parallelization technique
    • “Standard” partitioning
      • Break up the given problem into p independent subproblems of almost equal sizes
      • Solve the p subproblems concurrently
      • The main work lies in the decomposition step, which is often no easier than solving the original problem
    • “Relaxed” partitioning
      • Break up the problem into p subproblems of almost equal sizes that are not necessarily entirely independent
      • Solve the p subproblems concurrently
      • Detect inconsistencies in the solutions concurrently
      • Resolve any inconsistencies
      • Can be used successfully if the resolution in the fourth step involves only local adjustments

  12. “Relaxed Partitioning” applied to parallelizing Greedy coloring
    • Speculation and iteration: color as many vertices as possible concurrently, tentatively tolerating potential conflicts; detect and resolve conflicts afterwards (iteratively)

  13. Parallel Coloring on Shared Memory Platforms (using speculation and iteration)
    • Lines 4 and 9 of the algorithm’s pseudocode (Algorithm 2, not reproduced in this transcript) can be parallelized using the OpenMP directive #pragma omp parallel for (see the sketch below)
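Since the Algorithm 2 listing itself is not included here, the sketch below shows one way the speculation-and-iteration scheme can look in C with OpenMP: the tentative-coloring loop and the conflict-detection loop are each parallelized with #pragma omp parallel for. The CSR arrays and the tagging trick follow the greedy sketch above; all names and conventions are illustrative assumptions, not the authors’ code.

    #include <stdlib.h>

    void parallel_color(int n, const int *ptr, const int *adj, int *color)
    {
        int *work = malloc((size_t)n * sizeof(int));   /* current worklist */
        int *next = malloc((size_t)n * sizeof(int));   /* vertices to recolor */
        int nwork = n, nnext;

        for (int i = 0; i < n; i++) { work[i] = i; color[i] = 0; }

        while (nwork > 0) {
            /* Phase 1: speculative coloring. Each thread greedily colors its share
               of the worklist, reading neighbors' colors as it finds them; adjacent
               vertices handled by different threads may end up with the same color. */
            #pragma omp parallel
            {
                int *forbidden = calloc((size_t)n + 2, sizeof(int)); /* thread-private */
                #pragma omp for
                for (int i = 0; i < nwork; i++) {
                    int v = work[i];
                    for (int e = ptr[v]; e < ptr[v + 1]; e++)
                        if (color[adj[e]] != 0)
                            forbidden[color[adj[e]]] = v + 1;  /* tag with v+1 (0 = unset) */
                    int c = 1;
                    while (forbidden[c] == v + 1) c++;
                    color[v] = c;                              /* tentative, may conflict */
                }
                free(forbidden);
            }

            /* Phase 2: conflict detection. If two adjacent vertices received the
               same color, the lower-numbered one is queued for recoloring. */
            nnext = 0;
            #pragma omp parallel for
            for (int i = 0; i < nwork; i++) {
                int v = work[i];
                for (int e = ptr[v]; e < ptr[v + 1]; e++) {
                    int w = adj[e];
                    if (color[w] == color[v] && v < w) {
                        int pos;
                        #pragma omp atomic capture
                        pos = nnext++;
                        next[pos] = v;
                        break;
                    }
                }
            }

            /* Iterate on the (much smaller) set of conflicting vertices. */
            int *tmp = work; work = next; next = tmp;
            nwork = nnext;
        }
        free(work);
        free(next);
    }

Only the lower-numbered endpoint of a conflicting edge is recolored in the next round, so each round strictly shrinks the worklist and the loop terminates; typically only a small fraction of vertices ever conflict.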

  14. Sample experimental results of Algorithm 2 (Iterative) on Nehalem: I
    Graph (RMAT-G): 16.7M vertices; 133.1M edges; degree: avg = 16, max = 1,278, variance = 416

  15. Sample experimental results of Algorithm 2 (Iterative) on Nehalem: II
    Graph (RMAT-B): 16.7M vertices; 133.7M edges; degree: avg = 16, max = 38,143, variance = 8,086

  16. Recap
    • Saw an example of a multi-core architecture
      • Intel Nehalem
    • Took a bird’s-eye view of parallel programming models
      • OpenMP (highlighted)
    • Saw an example of the design of a graph algorithm
      • Graph coloring
    • Saw some experimental results
