Parallel Processing - PowerPoint PPT Presentation

Presentation Transcript

  1. Parallel Processing Chapter 9

  2. Problem: • Branches, cache misses, and dependencies limit the Instruction Level Parallelism (ILP) available • Solution:

  3. Problem: • Branches, cache misses, and dependencies limit the Instruction Level Parallelism (ILP) available • Solution: • Divide program into parts • Run each part on separate CPUs of a larger machine

  4. Motivations

  5. Motivations • Desktops are incredibly cheap • Custom high-performance uniprocessor • Hook up 100 desktops • Squeezing out more ILP is difficult

  6. Motivations • Desktops are incredibly cheap • Custom high-performance uniprocessor • Hook up 100 desktops • Squeezing out more ILP is difficult • More complexity/power required each time • Would require change in cooling technology

  7. Challenges • Parallelizing code is not easy • Communication can be costly • Requires HW support

  8. Challenges • Parallelizing code is not easy • Languages, software engineering, software verification issue – beyond scope of class • Communication can be costly • Requires HW support

  9. Challenges • Parallelizing code is not easy • Languages, software engineering, software verification issue – beyond scope of class • Communication can be costly • Performance analysis ignores caches - these costs are much higher • Requires HW support

  10. Challenges • Parallelizing code is not easy • Languages, software engineering, software verification issue – beyond scope of class • Communication can be costly • Performance analysis ignores caches - these costs are much higher • Requires HW support • Multiple processes modifying the same data causes race conditions, and out of order processors arbitrarily reorder things.

  11. Performance - Speedup • _____________________ • 70% of the program is parallelizable • What is the highest speedup possible? • What is the speedup with 100 processors?

  12. Speedup • Amdahl’s Law!!!!!! • 70% of the program is parallelizable • What is the highest speedup possible? • What is the speedup with 100 processors?

  13. Speedup • Amdahl’s Law!!!!!! • 70% of the program is parallelizable • What is the highest speedup possible? • 1 / (.30 + .70/∞) = 1 / .30 = 3.33 • What is the speedup with 100 processors?

  14. Speedup • Amdahl’s Law!!!!!! • 70% of the program is parallelizable • What is the highest speedup possible? • 1 / (.30 + .70/∞) = 1 / .30 = 3.33 • What is the speedup with 100 processors? • 1 / (.30 + .70/100) = 1 / .307 = 3.26

  15. Taxonomy • SISD – single instruction, single data • SIMD – single instruction, multiple data • MISD – multiple instruction, single data • MIMD – multiple instruction, multiple data

  16. Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • MISD – multiple instruction, single data • MIMD – multiple instruction, multiple data

  17. Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • vector, MMX extensions, graphics cards • MISD – multiple instruction, single data • MIMD – multiple instruction, multiple data

  18. SIMD [Diagram: a controller broadcasting one instruction stream to an array of processor (P) / data (D) pairs] • Controller fetches instructions • All processors execute the same instruction • Conditional instructions are the only way for variation

  19. Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • vector, MMX extensions, graphics cards • MISD – multiple instruction, single data • MIMD – multiple instruction, multiple data

  20. Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • vector, MMX extensions, graphics cards • MISD – multiple instruction, single data • Never built – pipeline architectures?!? • MIMD – multiple instruction, multiple data

  21. Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • vector, MMX extensions, graphics cards • MISD – multiple instruction, single data • Streaming apps? • MIMD – multiple instruction, multiple data • Most multiprocessors • Cheap, flexible

  22. Example • Sum the elements in A[] and place result in sum

  int sum = 0;
  int i;
  for (i = 0; i < n; i++)
    sum = sum + A[i];

  23. Parallel Version – Shared Memory

  24. Parallel Version – Shared Memory

  int A[NUM];
  int numProcs;
  int sum;
  int sumArray[numProcs];

  myFunction( (input arguments) ) {
    int myNum = …;   /* this processor's ID */
    int mySum = 0;
    for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
      mySum += A[i];
    sumArray[myNum] = mySum;
    barrier();
    if (myNum == 0) {
      for (i = 0; i < numProcs; i++)
        sum += sumArray[i];
    }
  }

  25. Why Synchronization? • Why can’t you figure out when proc x will finish work?

  26. Why Synchronization? • Why can’t you figure out when proc x will finish work? • Cache misses • Different control flow • Context switches

  27. Supporting Parallel Programs • Synchronization • Cache Coherence • False Sharing

  28. Synchronization • Sum += A[i]; • Two processors, i = 0, i = 50 • Before the action: • Sum = 5 • A[0] = 10 • A[50] = 33 • What is the proper result?

  29. Synchronization • Sum = Sum + A[i]; • Assembly for this statement, assuming • A[i] is already in $t0 • &Sum is already in $s0

  lw $t1, 0($s0)
  add $t1, $t1, $t0
  sw $t1, 0($s0)

  30. Synchronization – Ordering #1 • Both processors execute lw $t1, 0($s0) before either stores, so each reads Sum = 5 • P1 computes 5 + 10 = 15, P2 computes 5 + 33 = 38 • P1 stores 15, then P2 stores 38 • Final Sum = 38

  31. Synchronization – Ordering #2 • Again both read Sum = 5 and compute 15 and 38 • P2 stores 38 first, then P1 overwrites with 15 • Final Sum = 15

  32. Synchronization Problem • Reading and writing memory is a non-atomic operation • You cannot read and write a memory location in a single operation • We need hardware primitives that allow us to read and write without interruption

  33. Solution • Software Solution • “lock” – function that allows one processor to leave, all others to loop • “unlock” – releases the next looping processor (or resets to allow next arriving proc to leave) • Hardware • Provide primitives that read & write in order to implement lock and unlock

  34. Software: Using lock and unlock

  lock(&balancelock)
  Sum += A[i]
  unlock(&balancelock)

  35. Hardware: Implementing lock & unlock • swap $1, 100($2) • Swaps the contents of $1 and M[$2 + 100]

  36. Hardware: Implementing lock & unlock with swap • If lock has 0, it is free • If lock has 1, it is held

  lock: li $t0, 1
  loop: swap $t0, 0($a0)
        bne $t0, $0, loop

  37. Hardware: Implementing lock & unlock with swap • If lock has 0, it is free • If lock has 1, it is held

  lock: li $t0, 1
  loop: swap $t0, 0($a0)
        bne $t0, $0, loop

  unlock: sw $0, 0($a0)

  38. Outline • Synchronization • Cache Coherence • False Sharing

  39.–49. Cache Coherence [Diagram: P1 and P2, each with its own write-back cache, sharing DRAM; the numbered accesses below are applied in order]

  Current a value in:   P1$   P2$   DRAM
  (initially)            *     *     7
  1. P2: Rd a            *     7     7
  2. P2: Wr a, 5         *     5     7
  3. P1: Rd a            5     5     5
  4. P2: Wr a, 3         5     3     5
  5. P1: Rd a            AAAAAAAAAAAAAAAAAAAAAH! Inconsistency!

  • What will P1 receive from its load? 5
  • What should P1 receive from its load? 3

  50. Whatever are we to do? • Write-Invalidate • Invalidate that value in all others’ caches • Set the valid bit to 0 • Write-Update • Update the value in all others’ caches