1 / 32

Understanding the TigerSHARC ALU pipeline

Understanding the TigerSHARC ALU pipeline. Determining the speed of one stage of IIR filter – Part 2 Understanding the pipeline. Understanding the TigerSHARC ALU pipeline. TigerSHARC has many pipelines If these pipelines stall – then the processor speed goes down

Download Presentation

Understanding the TigerSHARC ALU pipeline

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Understanding the TigerSHARC ALU pipeline Determining the speed of one stage of IIR filter – Part 2Understanding the pipeline

  2. Understanding the TigerSHARC ALU pipeline • TigerSHARC has many pipelines • If these pipelines stall – then the processor speed goes down • Need to understand how the ALU pipeline works • Learn to use the pipeline viewer • Understanding what the pipeline viewer tells in detail • Avoiding having to use the pipeline viewer • Improving code efficiency • Excel and Project (Gantt charts) are useful tool Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  3. Register File and COMPUTE Units Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  4. S0 S1 S2 Simple ExampleIIR -- Biquad • For (Stages = 0 to 3) Do • S0 = Xin * H5 + S2 * H3 + S1 * H4 • Yout = S0 * H0 + S1 * H1 + S2 * H2 • S2 = S1 • S1 = S0 Horrible IIR codeexample as can’t re-use in a loop Works as asimple example for understanding TigerSHARCpipeline Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  5. Code return float when using XR8 register – NOTE NOT XFR8 Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  6. Step 2 – Using C++ code as comments set up the coefficients XFR0 = 0.0;; Does not exist XR0 = 0.0;; DOES EXIST Bit-patternsrequireintegerregisters Leave what youwanted to dobehind ascomments Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  7. Expect to take8 cycles to execute Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  8. PIPELINE STAGESSee page 8-34 of Processor manual • 10 pipeline stages, but may be completely desynchronized (happen semi-independently) • Instruction fetch -- F1, F2, F3 and F4 • Integer ALU – PreDecode, Decode, Integer, Access • Compute Block – EX1 and EX2 Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  9. Pipeline Viewer Result XR0 = 1.0 enters PD stage @ 39025, enters E2 stage at cycle 39830 is stored into XR0 at cycle 39831 -- 7 cycles execution time Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  10. Pipeline Viewer Result XR6 = 5.5 enters PD stage at cycle 39032 enters E2 stage at cycle 39837 is stored into XR6 at cycle 39838 -- 7 cycles execution time Each instruction takes 7 cycles but one new result each cycle Result – ONCE pipeline filled 8 cycles = 8 register transfer operations Key – don’t break pipeline with any jumps Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  11. Doing filter operations – generates different results XR8 = XR6 enters PD at 39833, enters EX2 at 39838, stored 39839 – 7 cyclesXFR23 = R9 * R4 enters PD at 39834, enters EX2 at 39839, stored 39840 – 7 cyclesXFR0 = R0 + R23 enters PD at 39835, enters EX2 at 39841, stored 39842 – 8 cycles WHY? – FIND OUT WITH MOUSE CLICK ON S MARKER THEN CONTROL Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  12. Instruction 0x17e XFR8 = R8 + R23 is STALLED (waiting) for instruction 0x17d XFR23 = R8 * R4 to complete Bubble B means that the pipeline is doing “nothing”Meaning that the instruction shown is “place holder” (garbage) Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  13. Information on Window Event Icons Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  14. Result of Analysis • Can’t use Float result immediately after calculation • WritingXFR23 = R8 * R4;; XFR8 = R8 + R23;; // MUST WAIT FOR XFR23 // calculation to be completedIs the same as codingXFR23 = R8 * R4;; NOP;;  Note DOUBLE ;; -- extra cycle because of stallXFR8 = R8 + R23;; • Proof – write the code with the stalls shown in it • Writing this way means we don’t have to use the pipeline viewer all the time • Pipeline viewer is only available with (slow) simulator • #define SHOW_ALU_STALL nop Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  15. Code withstalls shown • 8 code lines • 5 expected stalls • Expect 13 cyclesto completeif theory is correct Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  16. Analysis approach IS correctSame speed with and without nops Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  17. Process for coding for improved speed – code re-organization • Make a copy of the code so can test iirASM( ) and iirASM_Optimized( ) to make sure get correct result • Make a table of code showing ALU resource usage (paper, EXCEL, Project (Gantt chart) ) • Identify data dependencies • Make all “temp operations” use different register • Move instructions “forward” to fill delay slots, BUT don’t break data dependencies Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  18. Copy and paste to makeIIRASM_Optimized( ) Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  19. Need to re-order instructionsto fill delay slots with useful instructions • After refactoring code to fill delay slots, must run tests to ensure that still have the correct result • Change – and “retest” • NOT EASY TO DO • MUST HAVE ASYSTEMATIC PLAN TO HANDLEOPTIMIZATION • I USE EXCEL Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  20. Show resource usage and data dependencies All temporaryregister usageinvolves theSAME XFR23register This typically stallsout the processor Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  21. Change all temporary registers to use different register namesThen check code produces correct answer All temporaryregister usageinvolves a DIFFERENT Register ALWAYS FOLLOWTHIS PROCESSWHENOPTIMIZING Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  22. Move instructions forward, without breaking data dependencies What appears possible! DO one thing at a time and then check that code still works Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  23. Check that code still operates1 cycle saved Have put “our” marker stall instructionin parallel with moved instructionusing ; rather than ;; Move this instruction up in code sequence to fill delay slot Check that code still runsafter this optimization stage Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  24. Move next multiplication up. NOTE certain stalls remain, although reason for STALL changes from why they were inserted before Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  25. Move up the R10 and R9 assignment operations -- check 4 cycle improvement? Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  26. CHECK THE PIPELINE AFTER TESTING Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  27. Are there still more improvements possible (I can see 4 more moves) Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  28. Problems with approach • Identifying all the data dependencies • Keep track of how the data dependencies change as you move the code around • Handling all of this “automatically” • I started the following design tool as something that might work, but it actually turned out very useful.M. R. Smith and J. Miller, "Microprocessor Scheduling -- the irony of using Microsoft Project","Don’t say “CAN’T do it - Say “Gantt it”! The irony of organizing microprocessors with a big business tool" Circuit Cellar magazine, Vol. 184, pp 26 - 35, November 2005. Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  29. Using Microsoft Project – Step 1 Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  30. Add dependencies and resource usage – then activate level Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  31. Microsoft Project as a microprocessor design tool • Will look at this in more detail when we start using memory operations to fill the coefficient and state arrays Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

  32. Understanding the TigerSHARC ALU pipeline • TigerSHARC has many pipelines • If these pipelines stall – then the processor speed goes down • Need to understand how the ALU pipeline works • Learn to use the pipeline viewer • Understanding what the pipeline viewer tells in detail • Avoiding having to use the pipeline viewer • Improving code efficiency • Excel and Project (Gantt charts) are useful tool Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

More Related