Understanding the TigerSHARC ALU pipeline

Determining the speed of one stage of IIR filter – Part 4: IIR operation with memory

Understanding the TigerSHARC ALU pipeline
  • TigerSHARC has many pipelines
  • Review of how the COMPUTE pipeline works
  • Interaction of memory (data) operations with COMPUTE operations
    • What do we want to be able to do?
    • The problems we are expecting to have to solve
    • Using the pipeline viewer to see what really happens
  • Changing code practices to get better performance
    • Specialized C++ compiler options and #pragmas (Will be covered by individual student presentation)
    • Optimized assembly code and optimized C++

Speed IIR -- stage 4 M. Smith, ECE, University of Calgary, Canada

Processor Architecture
  • 3 128-bit data busses
  • 2 Integer ALUs
  • 2 Computational Blocks
    • ALU (Float and integer)
    • SHIFTER
    • MULTIPLIER
    • COMMUNICATIONS CLU

Simple Example: IIR -- Biquad
  • For (Stages = 0 to 3) Do
    • S0 = Xin * H5 + S2 * H3 + S1 * H4
    • Yout = S0 * H0 + S1 * H1 + S2 * H2
    • S2 = S1
    • S1 = S0
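The pseudocode above can be sketched in C++ as follows. This is only a minimal sketch: the function names `iirStage` and `iirCascade`, the layout of six coefficients and two state values per stage, and the assumption that each stage's output feeds the next stage's input are illustrative choices, not taken from the slides.

```cpp
// One biquad stage, following the slide's equations term by term.
// h[0..5] correspond to H0..H5; s1 and s2 are updated in place.
float iirStage(float xin, const float h[6], float& s1, float& s2) {
    float s0 = xin * h[5] + s2 * h[3] + s1 * h[4];
    float yout = s0 * h[0] + s1 * h[1] + s2 * h[2];
    s2 = s1;  // S2 = S1
    s1 = s0;  // S1 = S0
    return yout;
}

// Assumption: stages are cascaded, so the output of one stage is the
// input of the next. coeffs holds 6 values per stage, state holds 2.
float iirCascade(float xin, const float* coeffs, float* state, int stages) {
    float x = xin;
    for (int stage = 0; stage < stages; ++stage) {
        x = iirStage(x, coeffs + 6 * stage,
                     state[2 * stage], state[2 * stage + 1]);
    }
    return x;
}
```

With H0 = H5 = 1 and all other coefficients 0, a stage passes its input straight through, which is a convenient first sanity test.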

Rewrite Tests so that IIR( ) function can take parameters

Rewrite the “C++ code”

I leave the old “fixed” values in until I can get the code to work.

Proved useful this time as the code failed

Why did it fail to return the correct value?

Explore design issues – memory ops. Probable memory stalls expected

XR0 = 0.0; // Set Fsum = 0;

XR1 = [J1 += 1]; // Fetch a coefficient from memory

XFR2 = R1 * R4; // Multiply by Xinput (XR4)

XFR0 = R0 + R2; // Add to sum

XR3 = [J1 += 1]; // Fetch a coefficient from memory

XR5 = [J2 += 1]; // Fetch a state value from memory

XFR5 = R3 * R5; // Multiply coeff and state

XFR0 = R0 + R5; // Perform a sum

XR5 = XR12; // Update a state variable (dummy)

XR12 = XR13; // Update a state variable (dummy)

[J3 += 1] = XR12; // Store state variable to memory

[J3 += 1] = XR5; // Store state variable to memory

Looking much better.

Use 10 nops to flush the instruction pipeline

Pipeline performance predicted

When you read a value from memory, expect a 1-cycle delay before the fetched value is available for use within the COMPUTE block

COMPUTE operations – 1-cycle delay expected if the next instruction needs the result of the previous instruction

When you have adjacent memory accesses (read or write), does the pipeline work better with [J1 += 1];; or with [J1 += J4];; where J4 has been set to 1?

[J1 += 1];; works just fine here (no delay). Worry about [J1 += J4];; another day
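The two predicted delays can be turned into a rough cycle estimate. The sketch below is a simplified model, not a description of the real pipeline: it assumes one instruction issues per cycle and adds one stall whenever an instruction reads the register written by the instruction immediately before it. The `Instr` record, `predictedCycles`, and `exampleSequence` are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one instruction: the register it writes
// (empty if none) and the registers it reads.
struct Instr {
    std::string dest;
    std::vector<std::string> srcs;
};

// Count cycles under the predicted model: one cycle per instruction, plus
// one stall whenever an instruction reads the register written by the
// instruction immediately before it (memory fetch or COMPUTE result).
int predictedCycles(const std::vector<Instr>& prog) {
    int cycles = 0;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        cycles += 1;
        if (i == 0 || prog[i - 1].dest.empty()) continue;
        const std::vector<std::string>& srcs = prog[i].srcs;
        if (std::find(srcs.begin(), srcs.end(), prog[i - 1].dest) != srcs.end()) {
            cycles += 1;  // dependent on the previous result: one stall
        }
    }
    return cycles;
}

// The first eight instructions of the memory-op exploration sequence.
std::vector<Instr> exampleSequence() {
    return {
        {"R0", {}},            // XR0 = 0.0
        {"R1", {"J1"}},        // XR1 = [J1 += 1]
        {"R2", {"R1", "R4"}},  // XFR2 = R1 * R4
        {"R0", {"R0", "R2"}},  // XFR0 = R0 + R2
        {"R3", {"J1"}},        // XR3 = [J1 += 1]
        {"R5", {"J2"}},        // XR5 = [J2 += 1]
        {"R5", {"R3", "R5"}},  // XFR5 = R3 * R5
        {"R0", {"R0", "R5"}},  // XFR0 = R0 + R5
    };
}
```

Under this model, `predictedCycles(exampleSequence())` gives 12: eight issue cycles plus four stalls. Whether the hardware behaves this way is exactly what the pipeline viewer is used to check.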

Use C++ IIR code as comments

Things to think about

Register name reorganization

Keep XR4 for xInput – save a cycle

Put S1 and S2 into XR0 and XR1 -- chance to fetch 2 memory values in one cycle using L[ ]

Put H0 to H5 in XR12 to XR16 -- chance to fetch 4 memory values in one cycle using Q[ ] followed by one normal fetch

-- Problems – if more than one IIR stage then the second stage fetches are not quad aligned

There are two sets of multiplications using S1 and S2. Can these be done in X and Y compute blocks in one cycle?

float *copyStateStartAddress = state;
S1 = *state++;
S2 = *state++;

*copyStateStartAddress++ = S1;
*copyStateStartAddress++ = S2;
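The quad-alignment concern raised above can be checked with simple index arithmetic. The sketch below assumes the coefficients are contiguous 32-bit words, that the coefficient array itself starts on a quad-word boundary, and that the number of coefficients per stage is passed in as a parameter. The function names are hypothetical, and the padding helper is one possible workaround that is not taken from the slides.

```cpp
#include <cstddef>

// A quad fetch moves four 32-bit words at once.
constexpr std::size_t kQuadWords = 4;

// True if the first coefficient of the given stage lies on a quad-word
// boundary, assuming the array starts aligned and stages are contiguous.
bool stageStartsQuadAligned(std::size_t stage, std::size_t coeffsPerStage) {
    return (stage * coeffsPerStage) % kQuadWords == 0;
}

// Illustrative workaround (an assumption, not from the slides): round the
// per-stage coefficient count up to a multiple of four so that every
// stage starts on a quad-word boundary.
std::size_t paddedCoeffsPerStage(std::size_t coeffsPerStage) {
    return (coeffsPerStage + kQuadWords - 1) / kQuadWords * kQuadWords;
}
```

With five or six coefficients per stage, stage 0 is aligned but stage 1 is not, which matches the problem noted on the slide. Padding each stage to eight words restores alignment at the cost of unused memory.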

New assembly code – step 1

Make a copy of the COMPUTE-optimized code: float IIRASM_Memory(void);

Change the register names and make sure that it still works

Things to think about

Register name reorganization

Keep XR4 for xInput – save a cycle

Put S1 and S2 into XR10 and XR11 -- chance to fetch 2 memory values in one cycle using L[ ]

Put H0 to H5 in XR12 to XR16 -- chance to fetch 4 memory values in one cycle using Q[ ] followed by one normal fetch

-- Problems – if more than one IIR stage then the second stage fetches are not quad aligned

There are two sets of multiplications using S1 and S2. Can these be done in X and Y compute blocks in one cycle?

Write new tests

NOTE: New register names don’t overlap with old names

Makes the name conversion very straightforward

Register name conversion done in steps

Setting Xin = XR4 and Yout = XR8 saves one cycle

Bulk conversion with no error

So many errors were made during bulk conversion that I went to find/replace/test for each register individually

Update tests to use IIRASM_Memory( ) version with real memory access

Fix bringing state variables in

QUESTION

We have XR18 = [J6 += 1] (load S1)

and R19 = [J6 += 1] (load S2)

Both are valid. What is the difference?

Send state variables out. Go for the gusto – use L[ ] (64-bit)

Need to recalculate the test result: state[1] is NOT Yout

Redo calculation for value stored as S1

S0 = Xin + S1 * H4 + S2 * H3 = 5.5 + 2 * 5 + 3 * 4 = 27.5

S1 = S0

Expect stored value of 27.5

Need to fix test of state values after function

CHECK(state[0] == 27.5);
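The expected value can be reproduced with a small C++ reference version that writes the state back to memory. This is a sketch under stated assumptions: `iirStageWithState` is a hypothetical name, the slide gives only Xin = 5.5, S1 = 2, S2 = 3, H3 = 4 and H4 = 5, and H5 is assumed to be 1 here so that the input term contributes 5.5. H0 to H2 affect only Yout, not the stored state.

```cpp
// Reference biquad stage that reads S1 and S2 from state[0] and state[1]
// and stores the updated values back, mirroring the memory version.
float iirStageWithState(float xin, const float h[6], float state[2]) {
    float s1 = state[0];
    float s2 = state[1];
    float s0 = xin * h[5] + s2 * h[3] + s1 * h[4];
    float yout = s0 * h[0] + s1 * h[1] + s2 * h[2];
    state[0] = s0;  // new S1 is S0, so this is what the test must check
    state[1] = s1;  // new S2 is the old S1, which is not Yout
    return yout;
}
```

Called with Xin = 5.5, state = {2, 3}, H3 = 4, H4 = 5 and the assumed H5 = 1, the function leaves state[0] at 27.5 and state[1] at 2, whatever values H0 to H2 take.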

Working solution -- I

Working Solution -- Part 2

Working solution – Part 3

I could not spot where any extra stalls would occur because of memory pipeline reads and writes

All values were in place when needed

Need to check with pipeline viewer

Let's look at DATA MEMORY and COMPUTE pipeline issues -- 1

No problems here

Let's look at DATA MEMORY and COMPUTE pipeline issues -- 2

No problems here

Weird stuff happening with INSTRUCTION pipeline

Only 9 instructions being fetched but we are executing 21!

Why all these instruction stalls?

Adjust pipeline view for closer look. Adjust dis-assembler window

Analysis
  • We are seeing the impact of the processor doing quad-fetches of instructions (128 bits) into the IAB (instruction alignment buffer)
  • Once in the IAB, the instructions (32 bits each) are issued to the various execution units as needed.
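Since a 128-bit fetch holds four 32-bit instructions, the number of fetches covering a run of instructions follows from simple arithmetic. The sketch below is illustrative only: `quadFetchesNeeded` is a hypothetical helper, it assumes fetches are aligned to four-word boundaries, and it does not try to reproduce the counts reported by the pipeline viewer.

```cpp
#include <cstddef>

// Four 32-bit instruction words fit in one 128-bit fetch.
constexpr std::size_t kWordsPerFetch = 4;

// Number of aligned quad-word fetches that cover `count` consecutive
// 32-bit instructions starting at word offset `startWord`.
std::size_t quadFetchesNeeded(std::size_t startWord, std::size_t count) {
    if (count == 0) return 0;
    std::size_t firstQuad = startWord / kWordsPerFetch;
    std::size_t lastQuad = (startWord + count - 1) / kWordsPerFetch;
    return lastQuad - firstQuad + 1;
}
```

For example, 21 instructions starting on a quad boundary span 6 fetches, while 2 instructions that straddle a boundary already need 2.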

Note the fetch into the next subroutine despite return (CJMP)

Note that processor continues to fetch “the wrong” instructions
