
Multi-core computing


Presentation Transcript


    1. Multi-core computing Tim Harris

    2. Example 1 Two threads incrementing different counters. In the first program, the counters are on the same cache line; in the second program, they are on different cache lines.
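
A minimal C++ sketch of that experiment (my illustration, not code from the deck; the 64-byte line size and the iteration count are assumptions):

    #include <cstdint>
    #include <thread>

    // Version 1: both counters live in the same cache line, so each thread's
    // writes keep invalidating the other core's copy (false sharing).
    struct SameLine { uint64_t a = 0; uint64_t b = 0; };

    // Version 2: alignas(64) puts each counter on its own cache line
    // (64 bytes is a typical line size; check your machine's).
    struct SeparateLines { alignas(64) uint64_t a = 0; alignas(64) uint64_t b = 0; };

    template <typename Counters>
    void increment_both(Counters& c) {
        // Each thread touches a *different* counter, so there is no data race;
        // only the cache-line layout differs between the two versions.
        std::thread t([&] { for (long i = 0; i < 100000000; ++i) ++c.a; });
        for (long i = 0; i < 100000000; ++i) ++c.b;
        t.join();
    }

    int main() {
        SameLine s;      increment_both(s);  // typically much slower...
        SeparateLines p; increment_both(p);  // ...than this padded version
    }

Timing the two calls typically shows the same-line version running several times slower, even though both do identical work.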

    3. Summary The performance of one thread is highly dependent on what other threads are doing. It's important to understand the h/w of the machine, and the techniques it uses. For modest numbers of cores, it may be possible to identify large-granularity tasks with good data locality. Understand the different resource requirements of a program: computation, communication, locality. Consider how data accesses will interact with the memory system: will the computation done on additional cores pay for the data to be brought to them? Focus on the longest-running parts of the program first; be realistic about possible speedups. Aim to tackle larger workloads in constant time (cf. Gustafson's law). Exploit asymmetric designs: larger cores for sequential performance, smaller cores where high degrees of parallelism are available. Multiple parts of an algorithm may need to be parallelised with different techniques. With modest numbers of cores, beware of overheads that must be recovered via parallelism.

    4. Merge sort

    5. Merge sort, dual core
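
A minimal sketch of the dual-core version in C++ (my illustration, not the deck's code): each half is sorted on its own thread, and the final merge stays sequential, which is what keeps the speedup well under 2x.

    #include <algorithm>
    #include <thread>
    #include <vector>

    void two_core_merge_sort(std::vector<int>& v) {
        auto mid = v.begin() + v.size() / 2;
        // Sort the two halves concurrently; std::sort stands in for the
        // sequential merge sort on the slides. The ranges are disjoint,
        // so the two threads never touch the same elements.
        std::thread left([&] { std::sort(v.begin(), mid); });
        std::sort(mid, v.end());   // current thread handles the right half
        left.join();
        std::inplace_merge(v.begin(), mid, v.end());  // sequential merge
    }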

    7. T7300 dual-core laptop

    8. AMD Phenom 3-core

    9. Summary (recap; same points as slide 3)

    10. Introduction Why parallelism? Parallel hardware Amdahl's law Parallel algorithms: T1 and T∞

    12. A simple microprocessor model ~ 1985 Single h/w thread Instructions execute one after the other Memory access time ~ clock cycle time

    13. Pipelined design

    14. Superscalar design

    15. Realistic memory accesses... Dynamic out-of-order Pipelined memory accesses Speculation

    16. The power wall

    17. The power wall Moore's law: the area of a transistor is about 50% smaller each generation, so a given chip area accommodates twice as many transistors. Pdyn = α f V². Shrinking process technology (constant-field scaling) allows reducing V to partially counteract increasing f, but V cannot be reduced arbitrarily: halving V more than halves the max f (for a given transistor), and there are physical limits. Pleak = V(Isub + Iox). Reduce Isub (sub-threshold leakage): turn off the component, or increase the threshold voltage (which reduces max f). Reduce Iox (gate-oxide leakage): increase oxide thickness (but it needs to decrease with process scale).
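
As a worked illustration of the Pdyn equation (the numbers are mine, not the slides'): at fixed frequency, dropping the supply voltage from 1.2 V to 1.0 V scales dynamic power by

    Pdyn(1.0 V) / Pdyn(1.2 V) = (1.0 / 1.2)² ≈ 0.69

roughly a 31% saving. Conversely, a higher f generally requires a higher V, so dynamic power grows faster than linearly in clock frequency; this is why spending the transistor budget on more cores beat chasing clock speed.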

    18. The memory wall

    19. The memory wall

    20. The ILP wall ILP = Instruction level parallelism Implicit parallelism between instructions in a single thread Identified by the hardware Speculate past memory accesses Speculate past control transfer Diminishing returns

    21. Power wall + ILP wall + memory wall = brick wall The power wall means we can't just clock processors faster any longer. The memory wall means that many workloads' performance is dominated by memory access times. The ILP wall means we can't find extra work to keep functional units busy while waiting for memory accesses.

    22. Introduction Why parallelism? Parallel hardware Amdahl's law Parallel algorithms: T1 and T∞

    23. Multi-threaded h/w Multiple threads in a workload with: Poor spatial locality Frequent memory accesses

    24. Multi-threaded h/w Multiple threads with synergistic resource needs

    25.–40. Multi-core h/w, common L2 (sixteen identically titled build slides)

    41. Multi-core h/w separate L2

    42. Multi-core h/w additional L3

    43. Multi-threaded multi-core h/w

    44. SMP multiprocessor

    45. NUMA multiprocessor

    46. Three kinds of parallel hardware Multi-threaded cores: increase utilization of a core or memory b/w; peak ops/cycle fixed. Multiple cores: increase ops/cycle; don't necessarily scale caches and off-chip resources proportionately. Multi-processor machines: increase ops/cycle; often scale cache & memory capacities and b/w proportionately.

    47. Summary (recap; same points as slide 3)

    48. AMD Phenom

    49. Introduction Why parallelism? Parallel hardware Amdahl's law Parallel algorithms: T1 and T∞

    50. Amdahl's law Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with n cores, how many cores do you need to use to get a 4x speed-up on the overall algorithm?
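
Working that question through the standard Amdahl formula (the algebra is mine, not from the deck):

    speedup(n) = 1 / ((1 − f) + f/n),  with f = 0.7

Setting speedup(n) = 4 requires 0.3 + 0.7/n = 0.25, which has no solution for any positive n: even as n → ∞ the speedup is capped at 1/0.3 ≈ 3.33x. The requested 4x is unattainable however many cores are used, which is presumably what the next three slides plot.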

    51. Amdahl's law, f=70%

    52. Amdahl's law, f=70%

    53. Amdahl's law, f=70%

    54. Amdahl's law, f=10%

    55. Amdahl's law, f=98%

    56. Summary (recap; same points as slide 3)

    57. Amdahl's law & multi-core

    58. Perf of big & small cores

    59. Amdahl's law, f=98%

    60. Amdahl's law, f=75%

    61. Amdahl's law, f=5%

    62. Asymmetric chips
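
Slides 57–67 track the analysis in Hill & Marty's paper from the further-reading list; for reference, my transcription of that paper's model (not text from the deck). A chip has resources for n basic cores; a larger core built from r of them runs sequential code at perf(r), often modeled as √r:

    symmetric:  speedup = 1 / ( (1−f)/perf(r) + f·r / (perf(r)·n) )
    asymmetric: speedup = 1 / ( (1−f)/perf(r) + f / (perf(r) + n − r) )

The asymmetric design wins for high f because the big core accelerates the serial fraction while the remaining n − r small cores still serve the parallel fraction.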

    63. Amdahl's law, f=75%

    64. Amdahl's law, f=5%

    65. Amdahl's law, f=98%

    66. Amdahl's law, f=98%

    67. Amdahl's law, f=98%

    68. Summary (recap; same points as slide 3)

    69. Introduction Why parallelism? Parallel hardware Amdahl's law Parallel algorithms: T1 and T∞

    70. Merge sort (2-core)

    71. Merge sort (4-core)

    72. Merge sort (8-core)

    73. T∞ (span): critical path length

    74. T1 (work): time to run sequentially

    75. Tserial: optimized sequential code

    76. A good multi-core parallel algorithm T1/Tserial is low: what we lose on sequential performance we must make up through parallelism, and resource availability may limit the ability to do that. T∞ grows slowly with the problem size: we tackle bigger problems by using more cores, not by running for longer.
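
For concreteness, the textbook work/span recurrences for merge sort with a sequential merge (standard analysis, not taken from the slides):

    T1(n) = 2·T1(n/2) + Θ(n) = Θ(n log n)
    T∞(n) = T∞(n/2) + Θ(n) = Θ(n)

So the parallelism T1/T∞ is only Θ(log n): the sequential merge keeps the span large, which is consistent with the modest merge-sort speedups earlier in the deck.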

    77. Quick-sort on good input

    78. Quick-sort on good input

    79. Quick-sort on good input

    80. Summary (recap; same points as slide 3)

    81. Scheduling from a DAG

    82. Scheduling from a DAG

    83. In Cilk: Tp ≤ T1/p + c∞·T∞

    84. In Cilk: Tp ≤ T1/p + c∞·T∞
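
Plugging illustrative numbers into the bound (mine, not the deck's): with work T1 = 80 s, span T∞ = 1 s, c∞ ≈ 1 and p = 8 processors,

    Tp ≤ 80/8 + 1·1 = 11 s

i.e. near-linear speedup, because the work term T1/p still dominates the span term c∞·T∞. Once p approaches T1/T∞, the span term takes over and adding processors stops helping.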

    85. Summary (recap; same points as slide 3)

    86. Example: thunk dependencies

    87. Potential parallelism

    88. Limit study: parallelism vs granularity

    89. Measured performance

    90. Further reading Computer Architecture: A Quantitative Approach, Hennessy & Patterson. The Landscape of Parallel Computing Research: A View from Berkeley, Asanovic et al. (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf). Amdahl's Law in the Multicore Era, Hill & Marty (http://www.cs.wisc.edu/multifacet/papers/tr1593_amdahl_multicore.pdf). Parallel Thinking, Blelloch (http://www.cs.cmu.edu/~blelloch/papers/PPoPP09.pdf)
