
A Tutorial on High Performance Computing Taxonomy



  1. A Tutorial on High Performance Computing Taxonomy. By Prof. V. Kamakoti, Department of Computer Science and Engineering, Indian Institute of Technology, Madras, Chennai – 600 036, India

  2. Organization of the Tutorial • Session – 1 • Instruction Level Parallelism (ILP) • Pipelining concepts • RTL and Speed-up • Superscalar/VLIW concepts • Static Instruction Scheduling • Dynamic Instruction Scheduling • Branch Prediction

  3. Organization of the Tutorial • Session – 2 • Amdahl’s law and its applications • Symmetric Multiprocessors (SMP) • The Cache Coherency problem • ESI Protocol • Distributed Memory Systems • Basics of Message Passing Systems • Parallel Models of Computing • Design of Algorithms for Parallel processors • Brent’s Lemma

  4. Why this Title? • Performance-related issues arise at • the circuit level (RTL) • the instruction level (processor level) • the Shared Memory Multiprocessor level (SMP) • the Distributed Memory Multiprocessor level (Cluster/Grid – a collection of SMPs)

  5. ILP - Pipelining [Figure: a five-stage pipeline – Fetch + Inc. PC, Decode Instrn, Fetch Data, Execute Instrn, Store Data – with instructions I1..I5 advancing one stage per time unit] With pipelining, the first instruction completes at the end of the 5th time unit, the second at the end of the 6th, and the 10000th at the end of unit 10004. With no pipelining, 10000 instructions take 50000 units.

  6. Performance • With pipelining we get a speedup of close to 5 • This will not always work • Hazards • Data • Control • Structural • Non-ideal: not every step takes the same amount of time • Float Mult – 10 cycles • Float Div – 40 cycles • Performance comes out of parallelism. Let us try to understand parallelism.

  7. Types of Parallelism • Recognizing parallelism • Example: adding 100 numbers • for j = 1 to 100 do • Sum = Sum + A[j]; // Inherently sequential • A better solution in terms of parallelism, assuming 4 processors are available (see the sketch below) • Split the 100 numbers into 4 parts of 25 numbers each and allot one part to each of the four processors • Each processor adds the 25 numbers allotted to it and sends its answer to a head processor, which adds the partial results and gives the final sum.
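A minimal sketch of this scheme in C, with POSIX threads standing in for the four processors (the array contents and all names here are illustrative, not from the slides):

    #include <pthread.h>
    #include <stdio.h>

    #define N 100
    #define P 4            /* number of "processors" (threads) */

    static int A[N];
    static long partial[P];

    /* Each worker adds its 25-element slice, like one processor in the slide. */
    static void *add_slice(void *arg) {
        long id = (long)arg;
        long sum = 0;
        for (int j = id * (N / P); j < (id + 1) * (N / P); j++)
            sum += A[j];
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        for (int j = 0; j < N; j++) A[j] = j + 1;   /* the numbers 1..100 */

        pthread_t t[P];
        for (long id = 0; id < P; id++)
            pthread_create(&t[id], NULL, add_slice, (void *)id);

        /* The "head processor": combine the four partial sums. */
        long sum = 0;
        for (long id = 0; id < P; id++) {
            pthread_join(t[id], NULL);
            sum += partial[id];
        }
        printf("sum = %ld\n", sum);   /* prints 5050 */
        return 0;
    }

The head-processor step of the slide is the final join-and-accumulate loop in main.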

  8. Types of Parallelism • Data Parallelism or SIMD • The above example • The same instruction, “add 25 numbers”, applied to multiple data • The parallelism comes from the data.

  9. Types of Parallelism • Functional Parallelism or MIMD • Multiple functions to be performed on a data set or data sets. • Multiple Instructions and Multiple Data. • An example is the pipeline discussed earlier.

  10. Example of Pipelining • Imagine that 100 sets of data are to be processed in sequence by the following system. [Figure: data flows through Part 1, then Part 2] Part 1 takes 10 ms and Part 2 takes 15 ms, so one data set takes 25 ms and processing 100 sets of data takes 2500 ms.

  11. Example of Pipelining • Consider the following change: a storage element is placed between the two parts. While the first data set is in Part 2, the second data set can be in Part 1. [Figure: Part 1 → STORAGE → Part 2] Clocking both stages at the slower stage's delay (15 ms), the first data set finishes at 30 ms and one data set comes out every 15 ms thereafter – total processing time is 1515 ms. A tremendous speedup.

  12. Functional Parallelism • Different data sets and different instructions on them. • Hence, Multiple Instruction and Multiple data. • An interesting problem is to convert circuits with large delays into pipelined circuits to get very good throughput as seen earlier. • Useful in the context of using the same circuit for different sets of data.

  13. Pipelining Concepts • A combinational circuit can be easily modeled as a Directed Acyclic Graph (DAG). • Every node of the DAG is a subcircuit of the given circuit and forms a stage of the pipeline. • An edge of the DAG connects two nodes of the DAG. • Perform a topological sorting of the DAG.

  14. [Figure: an example DAG with N2 and N3 at Level 1; N1, N4 and N5 at Level 2; N6 at Level 3; N7 and N8 at Level 4]

  15. A Pipelined Circuit • If an edge connects two nodes of levels j and k, j < k, then introduce k − j storage stages on that edge. • Each edge can carry one or more bits.
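A small C sketch of this construction (the 8-node DAG and its edge list are made up for illustration, and the edges are assumed to be given in topological order): each node's level is one more than the maximum level of its predecessors, and an edge from level j to level k then needs k − j storage stages.

    #include <stdio.h>

    #define NODES 8
    #define EDGES 8

    /* Edges of a small DAG as (from, to) pairs, listed in topological order. */
    static const int edge[EDGES][2] = {
        {0,3},{1,3},{2,4},{3,5},{4,5},{4,6},{5,7},{6,7}
    };

    int main(void) {
        int level[NODES];
        for (int v = 0; v < NODES; v++) level[v] = 1;   /* sources at level 1 */

        /* One pass in topological order: level(v) = 1 + max(level of preds). */
        for (int e = 0; e < EDGES; e++) {
            int u = edge[e][0], v = edge[e][1];
            if (level[v] < level[u] + 1) level[v] = level[u] + 1;
        }

        /* An edge u->v crossing from level j to level k needs k - j storage
           stages; this keeps all paths through the pipeline the same length. */
        for (int e = 0; e < EDGES; e++) {
            int u = edge[e][0], v = edge[e][1];
            printf("edge %d->%d: %d storage stage(s)\n",
                   u, v, level[v] - level[u]);
        }
        return 0;
    }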

  16. [Figure: the pipelined version of the DAG of slide 14, with storage elements inserted on each edge according to the level difference between its endpoints]

  17. Optimization • The delay at every stage should be almost equal. • The stage with the maximum delay dictates the throughput. • The number of bits transferred between nodes should be minimized, which reduces the storage requirements.

  18. Stage Time Balancing [Figure: Part 1 → STORAGE → Part 2] If Part 1 takes 10 ms and Part 2 takes 15 ms, the first data set finishes at 30 ms and one data set comes out every 15 ms – total processing time for 100 data sets is 1515 ms. A tremendous speedup. If Part 1 takes 12 ms and Part 2 takes 13 ms, the first data set finishes at 26 ms and one data set comes out every 13 ms – total processing time for 100 data sets is 1313 ms. A significant further improvement.
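These timing claims are easy to check mechanically; a few lines of C, assuming the pipeline is clocked at the slower stage's delay:

    #include <stdio.h>

    /* Total time for n data sets through a 2-stage pipeline whose clock
       period is the slower stage's delay: the first result takes 2 clocks,
       then one result emerges per clock. */
    static int pipeline_time_ms(int n, int stage1_ms, int stage2_ms) {
        int clock = stage1_ms > stage2_ms ? stage1_ms : stage2_ms;
        return 2 * clock + (n - 1) * clock;
    }

    int main(void) {
        printf("%d ms\n", pipeline_time_ms(100, 10, 15));  /* 1515 ms */
        printf("%d ms\n", pipeline_time_ms(100, 12, 13));  /* 1313 ms */
        return 0;
    }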

  19. RTL View of Circuits and Performance • Register Transfer Level [Figure: Reg L1 → Combo (3 ns) → Reg L2 → Combo (5 ns) → Reg L3] The clock period is set by the slowest stage (5 ns), so the clock frequency is (1/5) × 10^9 Hz = 200 MHz. Improve the frequency by reducing the maximum stage delay.

  20. High Speed Circuits • Carry-ripple to carry-lookahead adders • Wallace tree multipliers • Increased area and power consumption • Lower time delays • Why do laptops have lower frequency ratings than desktops? • Reference: Cormen, Leiserson and Rivest, Introduction to Algorithms (First Edition), or Computer Architecture by Hamacher et al.

  21. ILP Continues…. • Data Hazards • LOAD [R2 + 10], R1 // Loads R1 from memory at R2+10 • ADD R3, R1, R2 // R3 = R1 + R2 • This is the “Read After Write (RAW)” data hazard on R1 • LD [R2+10], R1 • ADD R3, R1, R12 • LD [R2 + 14], R1 • ADD R12, R1, R2 • This sequence shows a WAW hazard on R1 (both loads write it) and a WAR hazard on R12 (read by the second instruction, written by the fourth)

  22. ILP – Pipelining Advanced • Superscalar: CPI < 1 • Succeeds because different instructions take different cycle times [Figure: Fetch + Inc. PC → Decode Instrn → Fetch Data → Execute Units 1..K → Store Data] Four FMULs can complete while one FDIV is still executing – this implies out-of-order execution.

  23. Difficulties in Superscalar Construction • Ensuring no data hazards among the several instructions executing in the different execution units at the same time. • If this is done by the compiler – Static Instruction Scheduling – VLIW – Itanium • If done by the hardware – Dynamic Instruction Scheduling – Tomasulo – MIPS embedded processors

  24. Static Instruction Scheduling • The compiler makes bundles of “K” instructions that can be issued to the execution units at the same time, such that there are no data dependencies between them (see the sketch below). • A Very Long Instruction Word (VLIW) accommodates “K” instructions at a time • Lots of NOPs if the bundle cannot be filled with useful instructions • This inflates the size of the executable • Does not complicate the hardware • Source-code portability: if the next-generation processor has K+5 units (say) – then what? • Solved by having a software/firmware emulator, which hurts performance.
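A toy C sketch of the bundling idea (the instruction encoding, the issue width K, and the greedy in-order policy are all simplifying assumptions, not the Itanium scheme): pack independent instructions into a bundle of K slots, and pad with NOPs whenever a dependence stops the bundle early.

    #include <stdio.h>

    #define K 4   /* issue width of a hypothetical VLIW machine */

    typedef struct { const char *op; int dst, src1, src2; } Insn;

    /* True if b depends on a (RAW, WAR or WAW): they cannot share a bundle. */
    static int depends(Insn a, Insn b) {
        return b.src1 == a.dst || b.src2 == a.dst    /* RAW */
            || b.dst == a.src1 || b.dst == a.src2    /* WAR */
            || b.dst == a.dst;                       /* WAW */
    }

    int main(void) {
        Insn prog[] = {
            {"ADD", 1, 2, 3}, {"ADD", 4, 5, 6},
            {"SUB", 7, 1, 8}, {"ADD", 9, 10, 11},
        };
        int n = (int)(sizeof prog / sizeof prog[0]);

        for (int i = 0; i < n; ) {
            int start = i, filled = 0;
            printf("bundle:");
            while (i < n && filled < K) {
                int ok = 1;          /* independent of the bundle so far? */
                for (int j = start; j < i; j++)
                    if (depends(prog[j], prog[i])) { ok = 0; break; }
                if (!ok) break;      /* dependent: close this bundle */
                printf(" %s R%d,R%d,R%d;", prog[i].op,
                       prog[i].dst, prog[i].src1, prog[i].src2);
                i++; filled++;
            }
            while (filled++ < K) printf(" NOP;");   /* pad with NOPs */
            printf("\n");
        }
        return 0;
    }

Here the third instruction reads R1, produced by the first, so the first bundle closes with two NOPs – exactly the wasted-slot effect the slide describes.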

  25. Thorn in the Flesh for Static Instruction Scheduling • The famous “Memory Aliasing Problem” • ST R3, [R4+40] // Store R3 to memory at R4+40 • LD [R1+20], R2 // Load R2 from memory at R1+20 • This is a RAW hazard through memory if (R4 + 40 = R1 + 20), and this cannot be detected at compile time. • Such combinations of memory operations are therefore never put in the same bundle, and memory operations are scheduled strictly in program order.

  26. Dynamic Instruction Scheduling • The data hazards are handled by the hardware • RAW using Operand Forwarding Technique • WAR and WAW using Register Renaming Technique

  27. Processor Overview [Figure: processor with ALU/control and multiple function units, a register file, and a bus to memory] • RAW example: LD [R1+20], R2 followed by ADD R3, R2, R4 • Why should the result of the LD go to R2 in the register file and then be reloaded into the ALU? Forward it to the ALU on its way to the register file.

  28. Register Renaming • ADD R1, R2, R3 • ST R1, [R4+50] • ADD R1, R5, R6 • SUB R7, R1, R8 • ST R1, [R4 + 54] • ADD R1, R9, R10 Dependencies due to register R1: RAW: (1,2), (1,4), (1,5), (3,4), (3,5) WAR: (2,3), (2,6), (4,6), (5,6) WAW: (1,3), (1,6), (3,6)
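These three lists can be reproduced mechanically. A C sketch, with the stores modeled as reading R1 (data) and R4 (address base) and writing no register:

    #include <stdio.h>

    typedef struct { int dst, src1, src2; } Insn;   /* dst = -1: no register */

    static const Insn prog[6] = {
        { 1, 2,  3},   /* 1: ADD R1, R2, R3  */
        {-1, 1,  4},   /* 2: ST  R1, [R4+50] */
        { 1, 5,  6},   /* 3: ADD R1, R5, R6  */
        { 7, 1,  8},   /* 4: SUB R7, R1, R8  */
        {-1, 1,  4},   /* 5: ST  R1, [R4+54] */
        { 1, 9, 10},   /* 6: ADD R1, R9, R10 */
    };

    static int reads(Insn a, int r)  { return a.src1 == r || a.src2 == r; }
    static int writes(Insn a, int r) { return a.dst == r; }

    int main(void) {
        for (int r = 1; r <= 10; r++)          /* check every register */
            for (int i = 0; i < 6; i++)
                for (int j = i + 1; j < 6; j++) {
                    if (writes(prog[i], r) && reads(prog[j], r))
                        printf("RAW (%d,%d) on R%d\n", i + 1, j + 1, r);
                    if (reads(prog[i], r) && writes(prog[j], r))
                        printf("WAR (%d,%d) on R%d\n", i + 1, j + 1, r);
                    if (writes(prog[i], r) && writes(prog[j], r))
                        printf("WAW (%d,%d) on R%d\n", i + 1, j + 1, r);
                }
        return 0;
    }

Running it prints exactly the RAW, WAR and WAW pairs listed above, all through R1.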

  29. Register Renaming: Static Scheduling • ADD R1, R2, R3 • ST R1, [R4+50] • ADD R12, R5, R6 • SUB R7, R12, R8 • ST R12, [R4 + 54] • ADD R1, R9, R10 Rename R1 to R12 from Instruction 3 up to Instruction 6. A dependency now exists only within a window, not across the whole program: the only remaining WAW and WAR are (1,6) and (2,6), which are far apart in program order. Renaming increases the register pressure for the compiler.

  30. Dynamic Scheduling - Tomasulo [Figure: instruction fetch unit, register status indicator, reservation stations feeding execution units Exec 1..4, and a Common Data Bus (CDB) back to the reservation stations, register file and memory] • Instructions are fetched one by one and decoded to find the type of operation and the sources of the operands. • The Register Status Indicator records, for each register, whether its latest value is in the register file or is currently being computed by some execution unit; in the latter case it records that unit's number. • If all operands are available, the operation proceeds in the allotted execution unit; otherwise it waits in the reservation station of that unit, pinging the CDB. • Every execution unit writes its result, along with its unit number, onto the CDB, which forwards it to all reservation stations, the register file and memory.
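A minimal C sketch of the issue-time bookkeeping only (one execution unit per instruction; CDB broadcast and completion are not modeled; all names are assumptions). Run on the example program of the next slides, it reproduces the status entries they show:

    #include <stdio.h>

    #define REGS 16

    /* regstat[r] = 0 if the latest value of Rr is in the register file,
       otherwise the number of the execution unit that will produce it. */
    static int regstat[REGS];

    typedef struct { int dst, src1, src2; } Insn;   /* -1 = no register */

    int main(void) {
        /* The six-instruction example traced on the following slides. */
        Insn prog[] = {
            { 1, 2,  3},   /* ADD R1, R2, R3  */
            {-1, 1,  4},   /* ST  R1, [R4+50] */
            { 1, 5,  6},   /* ADD R1, R5, R6  */
            { 7, 1,  8},   /* SUB R7, R1, R8  */
            {-1, 1,  4},   /* ST  R1, [R4+54] */
            { 1, 9, 10},   /* ADD R1, R9, R10 */
        };

        for (int i = 0; i < 6; i++) {
            int unit = i + 1;    /* one execution unit per instruction here */
            int w1 = prog[i].src1 >= 0 ? regstat[prog[i].src1] : 0;
            int w2 = prog[i].src2 >= 0 ? regstat[prog[i].src2] : 0;

            if (w1 || w2)
                printf("I %d, W %d\n", i + 1, w1 ? w1 : w2);  /* waits    */
            else
                printf("I %d, E\n", i + 1);                   /* executes */

            /* Issue-time renaming: Rdst's latest value will come from 'unit'. */
            if (prog[i].dst >= 0) regstat[prog[i].dst] = unit;
        }
        return 0;
    }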

  31. An Example • Program to be fetched: ADD R1, R2, R3 • ST R1, [R4+50] • ADD R1, R5, R6 • SUB R7, R1, R8 • ST R1, [R4 + 54] • ADD R1, R9, R10 • Status: all six slots Empty.

  32. An Example • Fetched: ADD R1, R2, R3 • Status: Ins 1 issued; the remaining five slots Empty.

  33. An Example • Fetched: ST R1, [R4+50] • Status: I 1, E (executing); I 2, W 1 (waiting on the result of unit 1); rest Empty.

  34. An Example • Fetched: ADD R1, R5, R6 • Status: I 1, E; I 2, W 1; I 3, E; rest Empty. Note: a reservation station stores the number of the execution unit that shall yield the latest value of a register.

  35. An Example • Fetched: SUB R7, R1, R8 • Status: I 1, E; I 2, W 1; I 3, E; I 4, W 3; rest Empty.

  36. An Example • Fetched: ST R1, [R4 + 54] • Status: I 1, E; I 2, W 1; I 3, E; I 4, W 3; I 5, W 3; last slot Empty.

  37. An Example • Fetched: ADD R1, R9, R10 • Status: I 1, E; I 2, W 1; I 3, E; I 4, W 3; I 5, W 3; I 6, E.

  38. An Example • The whole program has effectively become: ADD R1, R2, R3 • ST U1, [R4+50] • ADD R1, R5, R6 • SUB R7, U3, R8 • ST U3, [R4 + 54] • ADD R1, R9, R10, where Ui denotes the result of execution unit i. • Effectively three instructions are executing and the others are waiting for the appropriate results. • Execution unit 6, on completion, will reset R1's entry in the Register Status Indicator to 0 (latest value in the register file); similarly, unit 4 will reset R7's entry to 0. • See that operand forwarding and register renaming are done automatically.

  39. Memory Aliasing • Treat every memory location as a register: conceptually the same method can be used. • However, a memory status indicator would be prohibitively large. • Instead, an associative memory records each memory address with a pending write, together with the unit number that will perform it.
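One way to picture that associative memory, as a C sketch (the table size, field names and interface are invented for illustration): stores record (address, unit) pairs, and a later load checks the table before reading memory.

    #include <stdio.h>

    #define SLOTS 8   /* a small associative table: (address, producing unit) */

    static struct { unsigned addr; int unit; int valid; } tbl[SLOTS];

    /* Record that 'unit' has a pending store to word 'addr'. */
    static void note_store(unsigned addr, int unit) {
        for (int i = 0; i < SLOTS; i++)
            if (!tbl[i].valid || tbl[i].addr == addr) {
                tbl[i].addr = addr; tbl[i].unit = unit; tbl[i].valid = 1;
                return;
            }
    }

    /* A later load from 'addr' must wait for this unit's result;
       0 means no pending store, so memory can be read directly. */
    static int pending_store(unsigned addr) {
        for (int i = 0; i < SLOTS; i++)
            if (tbl[i].valid && tbl[i].addr == addr) return tbl[i].unit;
        return 0;
    }

    int main(void) {
        note_store(0x1040, 2);                   /* unit 2 will write 0x1040 */
        printf("%d\n", pending_store(0x1040));   /* 2: aliasing load waits   */
        printf("%d\n", pending_store(0x2000));   /* 0: no alias, read memory */
        return 0;
    }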

  40. Other Hazards • Control Hazards • Conditional jumps – which instruction should be fetched next into the pipeline? • Branch predictors are used, which predict whether a branch is taken or not. • A misprediction means undoing some actions, which increases the penalty, but not much more can be done.

  41. Branch Prediction • Different types of predictors • Tournament • Correlating • K-bit • Reference: Hennessy and Patterson – Computer Architecture.
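As a concrete instance, a minimal 2-bit (K = 2) saturating-counter predictor in C; the table size, hash, and branch trace are illustrative:

    #include <stdio.h>

    #define TABLE 1024

    /* One 2-bit saturating counter per entry:
       0,1 -> predict not taken; 2,3 -> predict taken. */
    static unsigned char ctr[TABLE];

    static int predict(unsigned pc) {
        return ctr[pc % TABLE] >= 2;
    }

    static void update(unsigned pc, int taken) {
        unsigned char *c = &ctr[pc % TABLE];
        if (taken && *c < 3) (*c)++;
        else if (!taken && *c > 0) (*c)--;
    }

    int main(void) {
        /* A loop branch taken 9 times, then not taken on loop exit. */
        unsigned pc = 0x400123;
        int miss = 0;
        for (int trip = 0; trip < 10; trip++) {
            int taken = (trip < 9);
            if (predict(pc) != taken) miss++;
            update(pc, taken);
        }
        printf("mispredictions: %d\n", miss);  /* 3: cold start (2) + exit (1) */
        return 0;
    }

The 2-bit hysteresis is what keeps a warmed-up loop branch from mispredicting twice per loop: only the final, not-taken iteration misses.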

  42. Other Hazards • Structural Hazards • Non-availability of a functional unit • Say we would like to schedule the seventh instruction in our example, but no execution unit is free • The new instruction has to wait. • Hence separate integer, FPU and load/store units are made available. • Load-Store Architecture – what is it?

  43. ?

  44. End of Session 1

  45. Architectural Enhancements: Amdahl’s Law Speedup(overall) = Exec_time(without enhancement) / Exec_time(with enhancement) A = fraction of the computation time in the original architecture that can be converted to take advantage of the enhancement. Exec_time(new) = (1 – A) * Exec_time(old) + Exec_time(enhanced portion, new) -- (1)

  46. Speedup(enhanced) = Exec_time(enhanced portion, old) / Exec_time(enhanced portion, new) = A * Exec_time(old) / Exec_time(enhanced portion, new) Therefore Exec_time(enhanced portion, new) = A * Exec_time(old) / Speedup(enhanced). Substituting in (1) above we get:

  47. Final form of Amdahl’s Law: Exec_time(new) = Exec_time(old) * [ (1 – A) + A / Speedup(enhanced) ] Speedup(overall) = 1 / [ (1 – A) + A / Speedup(enhanced) ]

  48. Application of Amdahl’s Law • A workload always spends 50% of its time in FP operations: 20% in FP square root and 30% in other FP operations. • Choice 1: use hardware to speed up FP square root by a factor of 10. • Choice 2: use software to speed up all FP operations by a factor of 1.6. • Speedup in Choice 1 is 1 / (1 – 0.2 + 0.2/10) = 1.22 • Speedup in Choice 2 is 1 / (1 – 0.5 + 0.5/1.6) = 1.23 • Choice 2 is better than Choice 1.
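The arithmetic can be checked directly from the final form of the law:

    #include <stdio.h>

    /* Overall speedup by Amdahl's law: A is the enhanced fraction,
       s the speedup of the enhanced portion. */
    static double amdahl(double A, double s) {
        return 1.0 / ((1.0 - A) + A / s);
    }

    int main(void) {
        printf("Choice 1: %.2f\n", amdahl(0.2, 10.0));  /* 1.22 */
        printf("Choice 2: %.2f\n", amdahl(0.5, 1.6));   /* 1.23 */
        return 0;
    }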

  49. Shared Memory Architectures • Sharing one memory space among several processors. • Maintaining coherence among several copies of a data item.

  50. Shared Memory Multiprocessor [Figure: four processors, each with its own registers and caches, connected through a snoopy chipset to a shared memory and to disk & other IO]
