
CSE-700 Parallel Programming Introduction


Presentation Transcript


  1. CSE-700 Parallel Programming: Introduction. 박성우 (Sungwoo Park), POSTECH, Sep 6, 2007

  2. Common Features?

  3. ... runs faster on

  4. Multi-core CPUs • IBM Power4, dual-core, 2000 • Intel reaches thermal wall, 2004 ⇒ no more free lunch! • Intel Xeon, quad-core, 2006 • Sony PlayStation 3 Cell, eight cores enabled, 2006 • Intel, 80 cores, 2011 (prototype finished) (source: Herb Sutter, "Software and the concurrency revolution")

  5. Parallel Programming Models • POSIX threads (API) • OpenMP (API) • HPF (High Performance Fortran) • Cray's Chapel • Nesl • Sun's Fortress • IBM's X10 • ... • and a lot more.

  6. Parallelism • Data parallelism • ability to apply a function in parallel to each element of a collection of data • Thread parallelism • ability to run multiple threads concurrently • Each thread uses its own local state. • Shared memory parallelism

  7. Data Parallelism / Thread Parallelism / Shared Memory Parallelism

  8. Data Parallelism = Data Separation [Diagram: the elements a1 ... an are assigned to hardware thread #1, an+1 ... an+m to hardware thread #2, an+m+1 ... an+m+l to hardware thread #3, and so on.]
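
  As a concrete illustration of the data-separation picture above, here is a minimal sketch in C with OpenMP (the API used in the first half of the course). It is not from the slides; the array a and the function square are made-up examples.

    /* Sketch (not from the slides): data parallelism with OpenMP.
     * Each thread applies the same function to its own chunk of the array.
     * Compile with: gcc -fopenmp */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    double square(double x) { return x * x; }

    int main(void)
    {
        static double a[N];

        for (int i = 0; i < N; i++)
            a[i] = (double)i;

        /* The OpenMP runtime splits the iteration space among the available
         * hardware threads; each element is updated by exactly one thread. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = square(a[i]);

        printf("a[10] = %f\n", a[10]);
        return 0;
    }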

  9. Data Parallelism in Hardware • GeForce 8800 • 128 stream processors @ 1.3 GHz, 500+ GFLOPS

  10. Data Parallelism in Programming Languages • Fortress • parallelism is the default.
    for i ← 1:m, j ← 1:n do   // 1:n is a generator
      a[i, j] := b[i] c[j]
    end
  • Nesl (1990's) • supports nested data parallelism • the function being applied itself can be parallel.
    {sum(a) : a in [[2, 3], [8, 3, 9], [7]]};

  11. Data Parallel Haskell (DAMP '07) • Haskell + nested data parallelism • flattening (vectorization) • transforms a nested parallel program such that it manipulates only flat arrays. • fusion • eliminate many intermediate arrays • Ex: 10,000x10,000 sparse matrix multiplication with 1 million elements

  12. Data Parallelism / Thread Parallelism / Shared Memory Parallelism

  13. Thread Parallelism [Diagram: two hardware threads, each with its own local state, communicating by exchanging messages over a synchronous channel.]
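
  The model on slide 13 (threads with private local state that communicate only by messages) is what the course's MPI assignments use at the process level. Below is a minimal sketch in C with MPI; it is not from the slides, and local_state is a made-up name.

    /* Sketch (not from the slides): synchronous message passing with MPI.
     * Each process keeps its own local state and communicates only by
     * explicit messages. Run with: mpirun -np 2 ./a.out */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, local_state = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            local_state = 42;   /* state private to process 0 */
            /* MPI_Ssend completes only once the receiver has started
             * receiving, i.e. synchronous communication as in the diagram. */
            MPI_Ssend(&local_state, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&local_state, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", local_state);
        }

        MPI_Finalize();
        return 0;
    }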

  14. Pure Functional Threads • Purely functional threads can run concurrently. • Effect-free computations can be executed in parallel with any other effect-free computations. • Example: collision detection [Diagram: objects A and B moving to new positions A' and B'.]

  15. Manticore (DAMP '07) • Three layers • sequential base language • functional language drawn from SML • no mutable references and arrays! • data-parallel programming • implicit: the compiler and runtime system manage thread creation • e.g., parallel arrays of parallel arrays
    [: 2 * n | n in nums where n > 0 :]
    fun mapP f xs = [: f x | x in xs :]
  • concurrent programming

  16. Concurrent Programming in Manticore (DAMP '07) • Based on Concurrent ML • threads and synchronous message passing • Threads do not share mutable state. • actually, there are no mutable references and arrays at all • explicit: the programmer manages thread creation.

  17. Data Parallelism / Thread Parallelism / Shared Memory Parallelism (Shared State Concurrency)

  18. Shared Memory Parallelism [Diagram: hardware threads #1, #2, and #3 all reading and writing a single shared memory.]

  19. World War II

  20. Company of Heroes • Interaction of a LOT of objects: • thousands of objects • Each object has its own mutable state. • Each object update affects several other objects. • All objects are updated 30+ times per second. • Problem: • How do we handle simultaneous updates to the same memory location?

  21. Manual Lock-based Synchronization
    pthread_mutex_lock(mutex);
    mutate_variable();
    pthread_mutex_unlock(mutex);
  • Locks and condition variables ⇒ fundamentally flawed!
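
  For reference, here is a complete, runnable version of the locking pattern on slide 21, assuming the mutex protects a simple shared counter; the counter and the worker function are illustrative names, not from the slides.

    /* Sketch: the slide's locking pattern made into a complete program.
     * The shared counter stands in for the mutated variable.
     * Compile with: gcc -pthread */
    #include <stdio.h>
    #include <pthread.h>

    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&mutex);    /* enter critical section */
            counter++;                     /* mutate shared state     */
            pthread_mutex_unlock(&mutex);  /* leave critical section  */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* always 200000 with the lock */
        return 0;
    }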

  22. Bank Accounts (Beautiful Concurrency, Peyton Jones, 2007) • Invariant: atomicity • no thread observes a state in which the money has left one account but has not arrived in the other. [Diagram: threads #1 through #n issue transfer requests against accounts A and B held in shared memory.]

  23. Bank Accounts using Locks • In an object-oriented language:
    class Account {
      Int balance;
      synchronized void deposit (Int n) {
        balance = balance + n;
      }
    }
  • Code for transfer:
    void transfer (Account from, Account to, Int amount) {
      from.withdraw (amount);
      // an intermediate state: the money has left 'from' but not yet reached 'to'
      to.deposit (amount);
    }

  24. A Quick Fix: Explicit Locking
    void transfer (Account from, Account to, Int amount) {
      from.lock(); to.lock();
      from.withdraw (amount);
      to.deposit (amount);
      from.unlock(); to.unlock();
    }
  • Now, the program is prone to deadlock.
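
  To see why the explicitly locked transfer is prone to deadlock, here is a hedged C/pthreads sketch, not from the slides: two threads transferring in opposite directions acquire the two locks in opposite orders, so each may hold one lock while waiting forever for the other. All names (account, transfer, a, b) are illustrative.

    /* Sketch (not from the slides): the deadlock risk of explicit locking.
     * Compile with: gcc -pthread. The program may or may not hang on a
     * given run, which is exactly what makes such bugs hard to find. */
    #include <stdio.h>
    #include <pthread.h>

    struct account { pthread_mutex_t lock; int balance; };

    static struct account a = { PTHREAD_MUTEX_INITIALIZER, 100 };
    static struct account b = { PTHREAD_MUTEX_INITIALIZER, 100 };

    static void transfer(struct account *from, struct account *to, int amount)
    {
        pthread_mutex_lock(&from->lock);  /* first lock: the 'from' account */
        pthread_mutex_lock(&to->lock);    /* second lock: the 'to' account  */
        from->balance -= amount;
        to->balance += amount;
        pthread_mutex_unlock(&to->lock);
        pthread_mutex_unlock(&from->lock);
    }

    static void *t1(void *arg) { (void)arg; transfer(&a, &b, 10); return NULL; }
    static void *t2(void *arg) { (void)arg; transfer(&b, &a, 20); return NULL; }

    int main(void)
    {
        pthread_t x, y;
        pthread_create(&x, NULL, t1, NULL);
        pthread_create(&y, NULL, t2, NULL);  /* opposite lock order: may deadlock */
        pthread_join(x, NULL);
        pthread_join(y, NULL);
        printf("a = %d, b = %d\n", a.balance, b.balance);
        return 0;
    }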

  25. Locks are Bad • Taking too few locks ⇒ simultaneous update • Taking too many locks ⇒ no concurrency or deadlock • Taking the wrong locks ⇒ error-prone programming • Taking locks in the wrong order ⇒ error-prone programming • ... • Fundamental problem: no modular programming • Correct implementations of withdraw and deposit do not give a correct implementation of transfer.

  26. Transactional Memory • An alternative to lock-based synchronization • eliminates many problems associated with lock-based synchronization • no deadlock • read sharing • safe modular programming • Hot research area • hardware transactional memory • software transactional memory • C, Java, functional languages, ...

  27. Transactions in Haskell
    transfer :: Account -> Account -> Int -> IO ()
    -- transfer 'amount' from account 'from' to account 'to'
    transfer from to amount
      = atomically (do { deposit to amount
                       ; withdraw from amount })
  • atomically act • atomicity: the effects become visible to other threads all at once. • isolation: the action act does not see any effects from other threads.

  28. Conclusion: We need parallelism!

  29. Tim Sweeney's POPL '06 Invited Talk - Last Slide

  30. CSE-700 Parallel Programming Fall 2007

  31. CSE-700 in a Nutshell • Scope • Parallel computing from the viewpoint of programmers and language designers • We will not talk about hardware for parallel computing • Audience • Anyone interested in learning parallel programming • Prerequisite • C programming • Desire to learn new programming languages

  32. Material • Books • Introduction to Parallel Computing (2nd ed.), Ananth Grama et al. • Parallel Programming with MPI, Peter Pacheco • Parallel Programming in OpenMP, Rohit Chandra et al. • Any textbook on MPI and OpenMP is fine. • Papers

  33. Teaching Staff • Instructors • Gla • Myson • ... • and YOU! • We will lead this course TOGETHER.

  34. Resources • Plquad • quad-core Linux • OpenMP and MPI already installed • Ask for an account if you need one.

  35. Basic Plan - First Half • Goal • learn the basics of parallel programming through 5+ assignments on OpenMP and MPI • Each lecture consists of: • discussion on the previous assignment • Each of you is expected to give a presentation. • presentation on OpenMP and MPI by the instructors • discussion on the next assignment

  36. Basic Plan - Second Half • Recent parallel languages • learn a recent parallel language • write a cool program in your parallel language • give a presentation on your experience • Topics in parallel language research • choose a topic • give a presentation on it

  37. What Matters Most? • Spirit of adventure • Proactivity • Desire to provoke Happy Chaos • I want you to develop this course into a total, complete, yet happy chaos. • A truly inspirational course borders almost on chaos.

  38. Impact of Memory and Cache on Performance

  39. Impact of Memory Bandwidth [1] Consider the following code fragment:
    for (i = 0; i < 1000; i++) {
      column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
    }
  The code fragment sums the columns of the matrix b into the vector column_sum.

  40. Impact of Memory Bandwidth [2] • The vector column_sum is small and easily fits into the cache. • The matrix b is accessed in column order. • The strided access results in very poor performance. [Figure: multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.]

  41. Impact of Memory Bandwidth [3] We can fix the above code as follows:
    for (i = 0; i < 1000; i++)
      column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
      for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];
  In this case, the matrix is traversed in row order and performance can be expected to be significantly better.
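
  A small measurement harness along the lines of slides 39-41 can produce the kind of experimental numbers the assignments ask for. The sketch below is not from the slides; the matrix contents and the use of clock() are illustrative choices.

    /* Sketch (not from the slides): timing the column-order vs. row-order
     * traversals on a 1000 x 1000 matrix. */
    #include <stdio.h>
    #include <time.h>

    #define N 1000

    static double b[N][N];
    static double column_sum[N];

    int main(void)
    {
        clock_t t;

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                b[i][j] = 1.0;

        /* Strided (column-order) access: poor spatial locality. */
        t = clock();
        for (int i = 0; i < N; i++) {
            column_sum[i] = 0.0;
            for (int j = 0; j < N; j++)
                column_sum[i] += b[j][i];
        }
        printf("column order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        /* Row-order access: consecutive memory locations, good locality. */
        t = clock();
        for (int i = 0; i < N; i++)
            column_sum[i] = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                column_sum[i] += b[j][i];
        printf("row order:    %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        return 0;
    }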

  42. Lesson • Choosing memory layouts and organizing computation appropriately can have a significant impact on spatial and temporal locality.

  43. Assignment 1: Cache & Matrix Multiplication

  44. Typical Sequential Implementation • A : n x n • B : n x n • C = A * B : n x n
    for i = 1 to n
      for j = 1 to n
        C[i, j] = 0;
        for k = 1 to n
          C[i, j] += A[i, k] * B[k, j];

  45. Using Submatrices • Improves data locality significantly (see the sketch below).
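
  Slide 45 mentions submatrices without showing code. The following is a hedged C sketch of one standard blocking (tiling) scheme; the matrix size N, the block size BS, and the function name matmul_blocked are made-up parameters to be tuned per machine, not values given in the lecture.

    /* Sketch (not from the slides): blocked (tiled) matrix multiplication.
     * Operating on BS x BS submatrices keeps the working set in cache.
     * N must be a multiple of BS in this simple version. */
    #include <stdio.h>

    #define N  512
    #define BS 32

    static double A[N][N], B[N][N], C[N][N];

    void matmul_blocked(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0.0;

        /* Outer loops walk over submatrix origins; inner loops multiply
         * one BS x BS block of A with one BS x BS block of B. */
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++)
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = 1.0;
                B[i][j] = 1.0;
            }
        matmul_blocked();
        printf("C[0][0] = %f\n", C[0][0]);  /* expect N = 512.0 */
        return 0;
    }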

  46. Experimental Results

  47. Assignment 1 • Machine • the older, the better. • Myson offers his ancient notebook for you. • Pentium II, 600 MHz • no L1 cache • 64 KB L2 cache • running Linux • Prepare a presentation on your experimental results.
