
Lecture 1 An Overview of High-Performance Computer Architecture



  1. Lecture 1: An Overview of High-Performance Computer Architecture. ECE 463/521, Spring 2006. Edward F. Gehringer. (Footer on every slide: ECE 463/521 Lecture Notes, Spring 2006)

  2. Automobile Factory (note: non-animated version) [Figure: assembly line with five stations, Station 1 through Station 5. Task labels: Connect doors; Connect wheels & transmission; Connect headlights; Embed engine; Build frame.]

  8. Basic Assembly Line • Unchangeable truth: it takes a long time to build one car. • Example: time spent in the assembly line is 1 hour (12 min. per station). • Basic assembly line: throughput = 1 car per hour. • We wait until the first car is fully assembled before starting the next one: → only 1 car is in the assembly line at a time; → only 1 station is active at a time, and the other 4 are idle.

  15. Pipelined Assembly Line • Unchangeable truth: it still takes a long time to build one car. • Pipelining: time to fill the pipeline = 1 hour; once filled, throughput = 1 car per 12 minutes. • Speedup due to pipelining is (unusual definition)...
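
The assembly-line numbers above (5 stations, 12 minutes each) can be checked with a short calculation. This is only a sketch: the function names are made up for illustration, and `speedup` is defined here as the plain ratio of total build times, which is an assumption and not necessarily the "unusual definition" the slide alludes to.

```python
STATIONS = 5               # from the slides: five stations
MINUTES_PER_STATION = 12   # from the slides: 12 min. per station


def basic_line_minutes(cars: int) -> int:
    """Total time when the next car starts only after the previous one is finished."""
    return cars * STATIONS * MINUTES_PER_STATION


def pipelined_line_minutes(cars: int) -> int:
    """Total time when a new car enters Station 1 every 12 minutes."""
    if cars == 0:
        return 0
    fill_time = STATIONS * MINUTES_PER_STATION      # the first car still takes 1 hour
    return fill_time + (cars - 1) * MINUTES_PER_STATION


def speedup(cars: int) -> float:
    """Assumed definition: basic total time divided by pipelined total time."""
    return basic_line_minutes(cars) / pipelined_line_minutes(cars)


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        print(n, basic_line_minutes(n), pipelined_line_minutes(n), round(speedup(n), 3))
```

With this definition the ratio starts at 1 for a single car and approaches 5 (the number of stations) as the number of cars grows.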

  16. Simple Processor Pipeline • Stage 1: IF (instruction fetch): fetch instruction • Stage 2: ID (instruction decode): decode & read registers • Stage 3: EX (execute): execute • Stage 4: MEM (memory): access memory • Stage 5: WB (write-back): write result

  17. Example Instruction • ADD r1, r2, r3 • r1 ← r2 + r3 • IF: Fetch the ADD instruction from memory using the current PC (program counter), then PC ← PC + 1 • ID: Decode the ADD instruction to determine the opcode, read values of r2 and r3 from the register file • EX: Perform r2 + r3 in the ALU (arithmetic/logic unit) • MEM: Do nothing (only loads/stores access memory) • WB: Write result of r2 + r3 into r1, in the register file
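
The five steps for ADD can be traced in a few lines of code. This is a minimal sketch, not a model of any real processor: the `Machine` class, the tuple encoding of instructions, and the 16-register file are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Hypothetical toy machine: a PC, a register file, and an instruction memory."""
    pc: int = 0
    regs: dict = field(default_factory=lambda: {f"r{i}": 0 for i in range(16)})
    imem: list = field(default_factory=list)  # one instruction tuple per address


def run_one_add(m: Machine) -> list:
    """Walk one ADD through IF, ID, EX, MEM, WB and return a trace of the stages."""
    trace = []

    # IF: fetch using the current PC, then PC <- PC + 1
    instr = m.imem[m.pc]
    m.pc += 1
    trace.append(f"IF : fetched {instr}")

    # ID: decode the opcode and read the source registers
    opcode, dest, src1, src2 = instr
    if opcode != "ADD":
        raise ValueError("this sketch only handles ADD")
    a, b = m.regs[src1], m.regs[src2]
    trace.append(f"ID : {opcode}, read {src1}={a}, {src2}={b}")

    # EX: the ALU performs the addition
    result = a + b
    trace.append(f"EX : {a} + {b} = {result}")

    # MEM: nothing to do, only loads/stores access memory
    trace.append("MEM: no memory access")

    # WB: write the result into the destination register
    m.regs[dest] = result
    trace.append(f"WB : {dest} <- {result}")
    return trace


if __name__ == "__main__":
    m = Machine(imem=[("ADD", "r1", "r2", "r3")])
    m.regs["r2"], m.regs["r3"] = 4, 5
    print("\n".join(run_one_add(m)))
```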

  18. Pipeline Performance Problems (1) • Data dependences • ADD r1, r2, r3 • SUB r4, r1, r9 • SUB must wait (“stall”) in ID stage until ADD completes • ADD writes the result r1 into register file in WB • SUB reads the result r1 from register file in ID

  19. Data Dependence Stalls [pipeline diagram, stages IF ID EX MEM WB; labels: ADD]

  20. Data Dependence Stalls [pipeline diagram; labels: SUB, ADD]

  21. Data Dependence Stalls [pipeline diagram; labels: Register file, r1, r1; SUB (stalled), ADD]

  22. Data Dependence Stalls [pipeline diagram; labels: Register file, r1, r1; SUB (stalled), (bubble), ADD]

  23. Data Dependence Stalls [pipeline diagram; labels: Register file, r1, r1; SUB (stalled), (bubble), (bubble), ADD]

  24. Data Dependence Stalls [pipeline diagram; labels: SUB, (bubble), (bubble)]

  25. Speedup with data dependences • What is the speedup of this pipeline (T_sequential / T_pipelined) if 1/10th of all instructions contain a data dependence? • Can you give a general formula for a k-stage pipeline? What other information do you need to know?
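
One possible way to approach the question is sketched below. It is not presented as the lecture's answer and rests on several assumptions that the slide leaves open: the sequential machine spends k cycles per instruction at the same clock period, the instruction stream is long enough to ignore pipeline fill time, and every dependent instruction costs a fixed number of stall cycles. The value 2 used in the example is read off the two bubbles in the stall diagrams and is itself an assumption.

```python
def pipeline_speedup(k: int, stall_fraction: float, stall_cycles: float) -> float:
    """Approximate T_sequential / T_pipelined for a long instruction stream.

    k               number of pipeline stages (sequential cost per instruction, in cycles)
    stall_fraction  fraction of instructions that stall
    stall_cycles    assumed stall cycles per stalling instruction
    """
    cycles_per_instruction = 1.0 + stall_fraction * stall_cycles
    return k / cycles_per_instruction


if __name__ == "__main__":
    # 5 stages, 1/10 of instructions dependent, 2 stall cycles each (assumed)
    print(round(pipeline_speedup(5, 0.1, 2), 3))
```

The "other information" the slide asks about shows up as the `stall_cycles` parameter: the formula cannot be evaluated without knowing how long each dependence holds up the pipeline.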

  26. Reducing Data Dependence Stalls • We could directly forward results from producer to consumer, bypassing the register file. • The hardware is called “data bypass,” “result bypass,” or “register file bypass.” • The technique is called “bypassing” or “forwarding.”

  27. Data Bypass [pipeline diagram, stages IF ID EX MEM WB; labels: ADD]

  28. Data Bypass [pipeline diagram; labels: SUB, ADD]

  29. Data Bypass [pipeline diagram; labels: Register file, r1 (garbage); SUB, ADD]

  30. Data Bypass [pipeline diagram; labels: Register file, r1 (garbage); data bypass, r1 (correct); SUB, ADD]
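
The idea behind the bypass diagrams can be sketched as an operand-selection step: take the forwarded value if the producer has one, otherwise read the register file. The function names, the dictionary-based register file, and the stale value 999 standing in for "garbage" are all assumptions made for illustration.

```python
def read_operand(reg: str, regfile: dict, bypass: dict) -> int:
    """Return the newest value of `reg`: the forwarded one if present, else the register file's."""
    return bypass[reg] if reg in bypass else regfile[reg]


def execute_sub(src1: str, src2: str, regfile: dict, bypass: dict) -> int:
    """Compute src1 - src2 using whichever operand values read_operand selects."""
    return read_operand(src1, regfile, bypass) - read_operand(src2, regfile, bypass)


# r1 still holds a stale ("garbage") value because ADD has not reached WB yet.
regfile = {"r1": 999, "r2": 4, "r3": 5, "r9": 1}

# ADD r1, r2, r3 has produced its result in EX but not written it back.
add_result = regfile["r2"] + regfile["r3"]
bypass = {"r1": add_result}

# SUB r4, r1, r9
without_bypass = execute_sub("r1", "r9", regfile, {})      # reads the stale r1
with_bypass = execute_sub("r1", "r9", regfile, bypass)     # reads the forwarded r1

if __name__ == "__main__":
    print("without bypass:", without_bypass)
    print("with bypass:   ", with_bypass)
```

Without the bypass, SUB either computes with the stale r1 or has to stall until ADD writes back; with it, SUB gets the correct value immediately.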

  31. Pipeline Performance Problems (2) • Branches • ADD r1, r2, r3 • BEQ X, r5, r7 • SUB r4, r1, r9 • LD r4, 10(r4) • …… • X: AND r4, r10, r11 • Which instruction should be fetched after the branch? • IF stage stalls until BEQ reaches EX stage. • EX stage evaluates the branch condition (r5 == r7). [Figure label: “taken”]

  32. Branch Stalls [code listing as on slide 31; pipeline diagram, stages IF ID EX MEM WB; labels: ADD]

  33. Branch Stalls [code listing as on slide 31; pipeline diagram; labels: BEQ, ADD]

  34. Branch Stalls [code listing as on slide 31; pipeline diagram; labels: (bubble), BEQ, ADD]

  35. Branch Stalls [code listing as on slide 31; pipeline diagram; labels: (bubble), (bubble), BEQ, ADD; Branch outcome: taken]

  36. Branch Stalls [code listing as on slide 31; pipeline diagram; labels: AND, (bubble), (bubble), BEQ, ADD]

  37. Reducing Branch Stalls • Branch prediction • “Learn” which way a given branch tends to go. • Like predicting the economy, branch prediction is based on past history. • Even simple predictors can be 80% accurate. • If correct: no branch stalls. • If incorrect: • “Quash” instructions in previous pipeline stages. • Performance degrades to the stall case. • May have additional penalties to “clean up” the pipeline.
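
As an illustration of "learning" from past history, here is a sketch of a 2-bit saturating-counter predictor. The slide does not name a particular predictor, so this choice, the class name, and the initial counter value are assumptions; it is one well-known simple scheme, not necessarily the one the lecture has in mind.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address (0..1 = not taken, 2..3 = taken)."""

    def __init__(self) -> None:
        self.counters = {}  # branch PC -> counter value in 0..3

    def predict(self, pc: int) -> bool:
        # Unseen branches start at 1 ("weakly not taken"), an arbitrary choice.
        return self.counters.get(pc, 1) >= 2

    def update(self, pc: int, taken: bool) -> None:
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)


def accuracy(predictor: TwoBitPredictor, pc: int, outcomes: list) -> float:
    """Fraction of outcomes predicted correctly, updating the predictor after each branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    # Hypothetical loop branch: taken 9 times, then falls through, repeated 10 times.
    history = ([True] * 9 + [False]) * 10
    print(accuracy(TwoBitPredictor(), pc=0x40, outcomes=history))
```

On this made-up loop pattern the predictor misses the first iteration and each loop exit, which is why even a very small amount of history goes a long way.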

  38. Branch Prediction (correct) [code listing as on slide 31; pipeline diagram, stages IF ID EX MEM WB; labels: ADD]

  39. Branch Prediction (correct) [code listing as on slide 31; pipeline diagram; labels: BEQ, ADD]

  40. Branch Prediction (correct) [code listing as on slide 31; pipeline diagram; labels: Predict taken; AND, BEQ, ADD]

  41. Branch Prediction (incorrect) [code listing as on slide 31; pipeline diagram, stages IF ID EX MEM WB; labels: ADD]

  42. Branch Prediction (incorrect) [code listing as on slide 31; pipeline diagram; labels: BEQ, ADD]

  43. Branch Prediction (incorrect) [code listing as on slide 31; pipeline diagram; labels: Predict not taken; SUB, BEQ, ADD]

  44. Branch Prediction (incorrect) [code listing as on slide 31; pipeline diagram; labels: Branch outcome: taken; LD, SUB, BEQ, ADD]

  45. Branch Prediction (incorrect) [code listing as on slide 31; pipeline diagram; labels: AND, LD, SUB, BEQ, ADD]

  46. Speedup with branch stalls • What is the speedup of the pipeline if 1/5 of the instructions are branches, and 4/5 of those are correctly predicted? • Can you give a general formula for a k-stage pipeline? What other information do you need to know?
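
A sketch of one way to set up this calculation follows. It is not presented as the lecture's answer. The assumptions: the sequential machine takes k cycles per instruction at the same clock period, fill time is ignored, correctly predicted branches cost nothing, and each misprediction costs a fixed penalty. The penalty of 2 cycles in the example is read off the two bubbles in the branch-stall diagrams and is an assumption; the slides note there may be additional clean-up penalties.

```python
def branch_speedup(k: int, branch_fraction: float, accuracy: float, penalty: float) -> float:
    """Approximate T_sequential / T_pipelined when only mispredicted branches stall.

    k                number of pipeline stages
    branch_fraction  fraction of instructions that are branches
    accuracy         fraction of branches predicted correctly
    penalty          assumed cycles lost per mispredicted branch
    """
    mispredicted = branch_fraction * (1.0 - accuracy)
    cycles_per_instruction = 1.0 + mispredicted * penalty
    return k / cycles_per_instruction


if __name__ == "__main__":
    # 5 stages, 1/5 branches, 4/5 predicted correctly, 2-cycle penalty (assumed)
    print(round(branch_speedup(5, 1 / 5, 4 / 5, 2), 3))
```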

  47. Sears Tower Repairman • Repair shop is in the basement • Has many tools. • A few are used frequently, e.g., hammer, crescent wrench, screwdriver • Most are used infrequently, e.g., socket wrenches

  48. Sears Tower Repairman • Problem • Sears Tower has 110 stories! • Today, you are working on the top floor. • Can’t bring entire shop with you. • Don’t know exactly which tools to bring with you from the basement.

  49. Sears Tower Repairman • Solution • Carry frequently used tools in your tool belt. • The tool belt becomes a “cache” of tools: it drastically reduces the number of trips down to the basement. • When you have to fetch a ¼" socket wrench, common sense says to also fetch the ½", ¾", etc., just in case.
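
The tool-belt analogy can be played out in code by counting trips to the basement. This is a sketch under stated assumptions: the belt holds a fixed number of tools and, when full, drops the least recently used one. That replacement rule is an assumption chosen for illustration; the slides only say to carry frequently used tools. The function name and the request sequence are made up.

```python
from collections import OrderedDict


def count_basement_trips(requests: list, belt_size: int) -> int:
    """Count how often a requested tool is not on the belt (a trip to the basement)."""
    belt = OrderedDict()   # tools on the belt, least recently used first
    trips = 0
    for tool in requests:
        if tool in belt:
            belt.move_to_end(tool)        # used again: now the most recently used
            continue
        trips += 1                        # not on the belt: fetch it from the basement
        if belt_size == 0:
            continue                      # no belt at all: nothing can be kept
        if len(belt) >= belt_size:
            belt.popitem(last=False)      # make room by dropping the least recently used tool
        belt[tool] = True
    return trips


if __name__ == "__main__":
    jobs = ["hammer", "wrench", "screwdriver", "hammer",
            "wrench", "socket", "hammer", "screwdriver"]
    print("no belt:    ", count_basement_trips(jobs, belt_size=0))
    print("3-tool belt:", count_basement_trips(jobs, belt_size=3))
```

The sketch models only the "keep what you use often" part of the analogy; fetching the neighboring socket sizes along with the ¼" one would correspond to bringing back several tools per trip.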

  50. Caches • The processor-memory speed gap • Processor is very fast • Intel Pentium-4: 1 GHz, 1 clock cycle = 1 ns • Large memory is slow! • Main memory: 50 ns to access, 50 times slower than Pentium-4! • Processor wants large and fast memory. • LARGE: O/S and applications consume lots of memory • FAST: Otherwise, processor stalls nearly 100% of time waiting for memory.
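
The "nearly 100%" figure can be made concrete with a rough calculation. This is a sketch under a deliberately simple model, which is an assumption and not taken from the slides: each instruction needs 1 cycle of useful work plus one full main-memory access, with no cache and no overlap between the two.

```python
def stall_fraction(cycle_ns: float, memory_ns: float, accesses_per_instruction: float = 1.0) -> float:
    """Fraction of time spent waiting for memory under the no-cache, no-overlap model."""
    useful = cycle_ns
    waiting = accesses_per_instruction * memory_ns
    return waiting / (useful + waiting)


if __name__ == "__main__":
    # Slide numbers: 1 ns clock cycle, 50 ns main-memory access
    print(f"{stall_fraction(1.0, 50.0):.1%}")
```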
