
Presentation 1 MAD MAC 525

Presentation 1 MAD MAC 525. Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4), Shiven Seth (W2-5). W2. 1st February, 2006. Architecture Proposal. Project Objective:


Presentation Transcript


  1. Presentation 1 MAD MAC 525
     Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4), Shiven Seth (W2-5)
     W2, 1st February, 2006
     Architecture Proposal
     Project Objective: Design a crucial part of a GPU called the Multiply Accumulate Unit (MAC) which will revolutionize graphics.

  2. MAD MAC 525 Status:
     • Project chosen
     • Specifications defined
     • Architecture
     • Design
     • Behavioral Verilog
     • Testbenches
     • To be done
       - Verilog: Gate Level Design
       - Schematic
       - Floor plan
       - Layout
       - Extraction, LVS, post-layout simulation

  3. Overview - MAD MAC 525
     • Multiply Accumulate unit (MAC)
       - Executes function AB+C on 16 bit floating point inputs
       - Multiply and add in parallel to greatly speed up operation
       - Rounding is performed only once, giving greater accuracy than separate multiply and add operations
     • MAD MAC accelerates FP16 blending to enable true HDR graphics
       - Bright things can be really bright
       - Dark things can be really dark
       - And the details can be seen in both
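The accuracy benefit of rounding once can be illustrated with a small sketch. It assumes Python's `struct` half-precision format code (`"e"`) as a stand-in for FP16 rounding, and it relies on the double-precision intermediate being exact for the operands chosen here. The helper names `round_fp16`, `mac_separate` and `mac_fused` are hypothetical and do not come from the slides.

```python
import struct

def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mac_separate(a: float, b: float, c: float) -> float:
    """Multiply and round, then add and round again (two roundings)."""
    return round_fp16(round_fp16(a * b) + c)

def mac_fused(a: float, b: float, c: float) -> float:
    """Compute a*b + c and round once.

    Assumption: the double-precision intermediate is exact, which holds
    for the operands below but not for every FP16 input.
    """
    return round_fp16(a * b + c)

# Operands chosen so that the product has low-order bits FP16 cannot hold.
a = b = 1.0 + 2.0 ** -10      # smallest FP16 value above 1.0
c = -(1.0 + 2.0 ** -9)        # cancels the rounded product exactly

separate = mac_separate(a, b, c)
fused = mac_fused(a, b, c)
print("separate roundings:", separate)
print("single rounding:   ", fused)
```

With two roundings the low-order term of the product is discarded before the addition, so the cancellation leaves nothing; with one rounding that term survives.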

  4. Quick Overview of FP
     • A = 1.11010 x 2^2
     • B = 1.01110 x 2^5
     • C = 1.11000 x 2^8
     • Step 1: A*B
       - Multiply the significands: 1.11010 * 1.01110 = 10.1001101100
       - Exponent of result is expA + expB = 7
       - A*B = 10.1001101100 x 2^7
     • Step 2: Align C
       - To add two FP numbers, their exponents must be the same
       - Shift by expA + expB - expC = 2 + 5 - 8 = -1
       - Shift the significand of C left by 1
       - 1.11000 -> 11.1000

  5. Quick Overview of FP (contd.)
     • Step 3: Depending on signs of A*B and C, add or subtract the two
       - Suppose A, B, and C are all positive
       - A*B + C = 10.1001101100 + 11.1000 = 110.0001101100
     • Step 4: Normalize the Result
       - Currently the significand is 110.0001101100 and the exponent is expA + expB = 7
       - Normalized to 1.100001101100 x 2^9
     • Step 5: Round the Result
       - The significand needs to fit in 10 bits
       - Based on bits 11 through 13, the significand is rounded to fit in 10 bits
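Steps 1 through 4 can be recomputed with exact integer arithmetic, which makes the intermediate significands easy to check by hand. This is only a sketch: the operands A, B and C are taken from the slides, while the fixed-point representation, the helper `to_bin` and all variable names are assumptions made for illustration. Rounding (Step 5) is left out.

```python
def to_bin(value: int, frac_bits: int) -> str:
    """Format a non-negative fixed-point integer as a binary string."""
    digits = format(value, "b").rjust(frac_bits + 1, "0")
    return digits[:-frac_bits] + "." + digits[-frac_bits:]

FRAC = 5  # fractional bits in each input significand, as written on the slide

# Significands stored as integers scaled by 2**FRAC, plus unbiased exponents.
sig_a, exp_a = 0b111010, 2   # A = 1.11010 x 2^2
sig_b, exp_b = 0b101110, 5   # B = 1.01110 x 2^5
sig_c, exp_c = 0b111000, 8   # C = 1.11000 x 2^8

# Step 1: multiply the significands, add the exponents.
prod = sig_a * sig_b         # now has 2*FRAC fractional bits
exp_p = exp_a + exp_b

# Step 2: align C to the product exponent (negative shift means shift left).
shift = exp_p - exp_c
c_aligned = sig_c << FRAC    # same scaling as the product
c_aligned = c_aligned >> shift if shift >= 0 else c_aligned << -shift

# Step 3: all operands are positive here, so add.
total = prod + c_aligned

# Step 4: normalize so that exactly one bit sits left of the binary point.
norm_shift = total.bit_length() - (2 * FRAC + 1)
exp_r = exp_p + norm_shift
frac_bits_r = 2 * FRAC + norm_shift

print("A*B     =", to_bin(prod, 2 * FRAC), "x 2^%d" % exp_p)
print("C       =", to_bin(c_aligned, 2 * FRAC), "x 2^%d" % exp_p)
print("A*B + C =", to_bin(total, 2 * FRAC), "x 2^%d" % exp_p)
print("normal  =", to_bin(total, frac_bits_r), "x 2^%d" % exp_r)
```

A right shift in Step 2 would drop low-order bits of C in this sketch; a hardware datapath would normally keep extra bits for rounding.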

  6. Block Diagram
     [Figure: block diagram. Three 16-bit inputs feed RegArray A, RegArray B and RegArray C. These drive the Multiplier, Exp Calc and Align blocks, followed by Control Logic & Sign Determination, the Leading 0 Anticipator, the Adder/Subtractor, Normalize and Round. Reg Y holds the 16-bit output.]

  7. Design Decisions (Week 2):
     • Implementing a 16 bit (fp16) format
       - 1 bit sign, 10 bit significand and 5 bit exponent
       - Compatible with OpenEXR format used in latest games
     • Enable Ultra-Threading
       - Implements high speed register arrays and fast thread switching logic to instantaneously switch to another available thread if the executing thread runs out of data
       - Implementation: High speed register-arrays for each input
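The 1/5/10 bit split can be made concrete by unpacking a 16-bit pattern into its fields. The sketch below assumes the IEEE 754 binary16 layout (exponent bias of 15, implicit leading 1 for normal numbers), which is also the layout of the OpenEXR half type. The function names are hypothetical, and zeros, subnormals, infinities and NaNs are deliberately not handled.

```python
import struct

def decode_fp16(bits: int):
    """Split a 16-bit pattern into (sign, biased exponent, fraction)."""
    sign = (bits >> 15) & 0x1        # 1 bit
    exponent = (bits >> 10) & 0x1F   # 5 bits
    fraction = bits & 0x3FF          # 10 bits
    return sign, exponent, fraction

def fp16_value(bits: int) -> float:
    """Value of a normal fp16 number (special encodings are rejected)."""
    sign, exponent, fraction = decode_fp16(bits)
    if exponent == 0 or exponent == 0x1F:
        raise ValueError("zero, subnormal, infinity and NaN are not handled")
    significand = 1 + fraction / 1024          # implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 15)

for pattern in (0x3C00, 0x3E00, 0xC000):
    # Cross-check against Python's own half-precision decoding.
    reference = struct.unpack("<e", struct.pack("<H", pattern))[0]
    print(hex(pattern), decode_fp16(pattern), fp16_value(pattern), reference)
```

Counting the implicit leading 1, each significand is 11 bits wide.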

  8. Design Decisions (contd.):
     • Multiplier Implementation
       - 11 x 11 Carry-Save Multiplier
       - Reasons:
         - Fast because it avoids having ripple carry in every stage
         - Enables Compact Layout
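The idea behind carry-save multiplication can be sketched as a bit-level model: each partial product is folded into a running (sum, carry) pair with a 3:2 compressor, and carries only propagate in one final addition. This is a behavioral illustration with hypothetical function names, not the team's gate-level design.

```python
def carry_save_add(x: int, y: int, z: int):
    """Bitwise 3:2 compressor: returns (s, c) such that x + y + z == s + c."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry

def carry_save_multiply(a: int, b: int, width: int = 11) -> int:
    """Multiply two unsigned width-bit values using carry-save accumulation."""
    limit = 1 << width
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit in %d bits" % width)
    acc_sum, acc_carry = 0, 0
    for i in range(width):
        # Partial product for bit i of the multiplier.
        partial = (a << i) if (b >> i) & 1 else 0
        # No carry ripples here: sum and carry stay in separate vectors.
        acc_sum, acc_carry = carry_save_add(acc_sum, acc_carry, partial)
    # A single carry-propagating addition resolves the final product.
    return acc_sum + acc_carry

# 11-bit significands of 1.11010 and 1.01110 (implicit 1 plus 10 fraction bits).
print(bin(carry_save_multiply(0b11101000000, 0b10111000000)))
```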

  9. Design Decisions (contd.):
     • 2's Complement Adder/Subtractor
       - Variable Length Carry-Select Adder
       - Reason: Reduces delay through Muxes
       - Use the signs of the inputs to determine addition or subtraction
       - Output: 35 bits from Align + 1 Carry Out = 36 bits
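A carry-select adder computes each block's sum for both possible carry-ins and lets the incoming carry pick one through a mux. The behavioral sketch below models that for a 35-bit adder. The block widths are a hypothetical split chosen only so that they sum to 35; the slides do not give the actual partitioning, and the names are illustrative.

```python
ADDER_WIDTH = 35
BLOCK_WIDTHS = (4, 5, 6, 7, 8, 5)   # hypothetical variable-length blocks, sum = 35

def carry_select_add(a: int, b: int, carry_in: int = 0, block_widths=BLOCK_WIDTHS):
    """Add two unsigned values block by block; returns (sum, carry_out)."""
    result, carry, offset = 0, carry_in, 0
    for width in block_widths:
        mask = (1 << width) - 1
        a_blk = (a >> offset) & mask
        b_blk = (b >> offset) & mask
        # In hardware both candidates are computed in parallel.
        sum_if_0 = a_blk + b_blk
        sum_if_1 = a_blk + b_blk + 1
        # The incoming carry acts as the mux select.
        chosen = sum_if_1 if carry else sum_if_0
        result |= (chosen & mask) << offset
        carry = chosen >> width
        offset += width
    return result, carry

def subtract(a: int, b: int):
    """a - b in 2's complement: invert b and inject a carry-in of 1."""
    mask = (1 << ADDER_WIDTH) - 1
    return carry_select_add(a, ~b & mask, carry_in=1)

print(carry_select_add(2 ** 35 - 1, 1))   # wraps to 0 with a carry out
print(subtract(10, 3))
```

The 35-bit sum together with the carry out gives the 36-bit result mentioned on the slide.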

  10. Design Decisions (contd.):
     • Leading Zero Counter
       - Carry-Save Adder to count the leading zeroes of C
       - Reason: To pre-compute the shift amount needed to normalize the result of A*B+C
       - This will speed up our design because the Leading Zero Counter will not be in the critical path (which is through our multiplier)
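The link between a leading-zero count and the normalization shift can be shown with a short sketch. It counts zeros after the fact on a fixed-width value, so it does not model the anticipation that keeps the counter off the critical path. The function names and the 8-bit width in the example are assumptions for illustration.

```python
def count_leading_zeros(value: int, width: int) -> int:
    """Number of zero bits above the most significant 1 in a width-bit value."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in %d bits" % width)
    return width - value.bit_length()

def normalize(value: int, width: int):
    """Shift left until the top bit is set; returns (shifted value, shift)."""
    shift = count_leading_zeros(value, width)
    return (value << shift) & ((1 << width) - 1), shift

shifted, shift = normalize(0b00101100, 8)
print(format(shifted, "08b"), shift)
```

The shift amount is also what the exponent has to be decreased by when the significand moves left.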

  11. Design Decisions (contd.):
     • Align Exponent
       - Always align the exponent of C to expA + expB
       - Shift the significand of C by (expA + expB - expC)
       - If negative, shift left because C is bigger than A*B
       - If positive, shift right because C is smaller than A*B
       - Implementation: n-Pass Shifter
     • Normalize
       - Format the result of A*B + C to IEEE Format (i.e. change the significand from 101.011… to 1.01011…)
       - Align the exponent of the result as necessary
       - n-Pass Shifter to shift the result of the adder by the amount given by the Leading Zero Counter
     • Round
       - The result needs to fit into 16 bits
       - To preserve precision, we round the result based on the last 3 bits
       - Implementation: Incrementer and Shifter
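The Round step, built from an incrementer and a shifter, can be sketched as follows. The slides do not name a rounding mode, so round-to-nearest-even is an assumption here, as are the function name and the default widths (11 kept bits including the leading 1, 3 extra bits).

```python
def round_nearest_even(sig: int, exponent: int, extra_bits: int = 3, keep_bits: int = 11):
    """Round a normalized significand of keep_bits + extra_bits bits.

    Assumption: round-to-nearest, ties to even. Returns (significand, exponent)
    with the significand reduced to keep_bits bits.
    """
    kept = sig >> extra_bits
    discarded = sig & ((1 << extra_bits) - 1)
    half = 1 << (extra_bits - 1)
    round_up = discarded > half or (discarded == half and kept & 1 == 1)
    if round_up:
        kept += 1                  # incrementer
        if kept >> keep_bits:      # 1.11...1 became 10.00...0
            kept >>= 1             # shifter restores a single leading 1
            exponent += 1
    return kept, exponent

# A tie on an odd significand rounds up and overflows into the exponent.
print(round_nearest_even(0b11111111111_100, 9))
# A tie on an even significand stays where it is.
print(round_nearest_even(0b10000000000_100, 9))
```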

  12. Behavioral Verilog

  13. Behavioral Verilog (contd.)

  14. Behavioral Verilog (Output)

  15. Updated Estimated Transistor Count
      Block                                    Transistors
      Registers (input, output, pipelining)           2500
      Threading Logic                                 3000
      Carry-Save Multiplier                           5000
      Carry-Select Adder                              2000
      Alignment Shifter                               1500
      Leading 0 Anticipator                            700
      Normalize                                       2000
      Rounding                                        1500
      Special Cases and Control Logic                 2000
      Total                                          20200

  16. Problems and Questions?
     • Difficulty finding a high-level simulator to exhaustively test our behavioral Verilog because both Matlab and C use the IEEE 32-bit format. Currently we are thoroughly testing our behavioral Verilog and coming up with different test cases by hand.
     • Suggested Solutions:
       - Make a scalable 32-bit version of our behavioral Verilog and test it against C
       - Find code written for software simulation by the VAX and PDP microprocessors

  17. Questions?
