Group Talk

Charlie Brej, APT Group, University of Manchester

Presentation Transcript


  1. Group Talk Charlie Brej APT Group University of Manchester Async Forum

  2. Part 1: The Future According to Me Charlie Brej APT Group University of Manchester

  3. Razor Blades • Scheme 1: “Name” [Number] Plus/Extreme/Ultra/Turbo/?X • Trac II Plus, Core Quad Extreme, Athlon 64 FX, GeForce 8800 Ultra • Scheme 2: “Company Name” Fusion/Quattro/Mach • Gillette Fusion, AMD Fusion, Ford Fusion • Schick Quattro, NVIDIA Quadro, Audi Quattro • Gillette Mach, ATI Mach, Ford Mustang Mach 1 • Maybe more soon… (Image labels: 1901, 1971, 1998, 2004, 2005)

  4. Razor Blade History

  5. Prediction: 2007 Jan-Sept 15 Blade Apple iShave

  6. Why did this not happen? • Because you don’t need more than five blades on your razor • Unless we grow larger faces • Which hasn’t happened before, so we won’t need them for some time • We don’t need more than four processors • Unless we invent an automagic parallelism extractor • Which we haven’t since the 60s, so we won’t need them for some time • People will still demand faster single thread performance

  7. Real Future • Moore’s law will continue • Transistor count doubles every 18 months • Moving into 3rd dimension • Intelligent transistors placed per person will remain constant • Not copy-paste • Verification becomes problematic • Designs become very complicated

  8. Productivity • Managers 40% • Grunt Coder 80% • Sales 0% • Hero Coder 100% • Marketing -20% • Maintainers 60% • Admin 20% (Captions: “Can we make it pink?”, “How about ‘Intel Terrano’”)

  9. Brej’s Law • Person years per design doubles every 18 months • Most transistors are copy-paste • Verification becomes much more complex • Hero coders become more rare • People get stupider • Marketing becomes more important

  10. Brej’s Law • 1985: 5 person years • ARM • 1997: 2560 person years • Pentium II (about right) • 2007: 81920 person years • Intel has 94,000 employees • AMD has 16,000 • A new design every 7 years

  11. Brej’s Law • 2028: Entire population of the USA is employed by Intel • 2031: Entire population of China is employed by AMD • 2034: Entire world population is working on creating Pentium 12 • 2090: Project to build Pentium 15 starts but hits a snag as the universe finishes before the project does
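
The doubling rule behind Brej's Law can be written as effort(year) = base × 2^((year − base_year) / 1.5). The sketch below is illustrative only: the function name `person_years` and its parameters are made up here, and it takes the 1985 figure of 5 person years as the starting point. The slide figures for 1997 and 2007 (2560 and 81920) equal 5 × 2^9 and 5 × 2^14, so they correspond to whole numbers of doublings rather than the exact year gaps; the strict formula lands within about a factor of two of each.

```python
def person_years(year, base_year=1985, base_effort=5.0, doubling_years=1.5):
    """Design effort in person years, assuming it doubles every 18 months."""
    doublings = (year - base_year) / doubling_years
    return base_effort * 2 ** doublings


if __name__ == "__main__":
    for year in (1985, 1997, 2007):
        print(year, round(person_years(year)))
```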

  12. “The most powerful force in the universe is compound interest” (Albert Einstein) • “And we didn't have any fancy Sony Playstation video games. We had the Atari 2600! There were no multiple levels or screens. It was just ONE screen, forever, and you could never win. The game just kept getting harder and faster until you died. Just like LIFE!” (Ernest Cline)

  13. Back to the Future • Transistors will be free • Mostly consumed in memory • Diminishing returns • Single thread grinds to a halt • Increase performance by 1% get 100% more money • Fewer designs • Very expensive and long lead up times • Extend rather than redesign

  14. Part 2: Wagging Logic: Non Throughput-Bound Design Methodology Charlie Brej APT Group University of Manchester

  15. Introduction • Async performance • Asynchronous logic is slow • Wagging Logic • Example circuits • Red Star • Design • Results • Conclusions

  16. Data propagation (Figure: logic stages with C elements; Latency and Cycle Time marked against an axis numbered 0 to 12)

  17. Control propagation (Figure: logic stages with C elements; Latency and Cycle Time marked against an axis numbered 0 to 12)

  18. Control propagation (Figure: logic stages with C elements; Latency and Cycle Time marked against an axis numbered 0 to 12)

  19. And then it gets worse • Latency is at least six times lower than the cycle time • Assumes all data arrives at the same time • Assumes all acknowledgements arrive at the same time • The actual ratio is somewhere between 10 and 100

  20. What can we do? • Use two-phase signalling • Halve the control delay • Lose all average case advantages • Fine grain pipelining • Need to add 10+ latches per stage • Adds latency • Faster completion • Anti-tokens, Early-drop latches… • Careful timing analysis

  21. Wagging Latches • Alternate latch read/write • Capacity of two latches • Depth of one latch
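
One way to picture the wagging latch is as two ordinary latches that take turns: writes alternate between them and reads follow in the same order, so two items can be held while each item only ever passes through one latch. The class below is a behavioural sketch of that idea, not the circuit from the talk; `WaggingLatch`, `write` and `read` are names invented for illustration.

```python
class WaggingLatch:
    """Behavioural model: several latches written and read in rotation."""

    def __init__(self, ways=2):
        self.slots = [None] * ways
        self.full = [False] * ways
        self.write_at = 0  # latch that takes the next write
        self.read_at = 0   # latch that supplies the next read

    def write(self, value):
        """Store value in the next latch; return False if it is still full."""
        if self.full[self.write_at]:
            return False
        self.slots[self.write_at] = value
        self.full[self.write_at] = True
        self.write_at = (self.write_at + 1) % len(self.slots)
        return True

    def read(self):
        """Return the oldest stored value, or None if that latch is empty."""
        if not self.full[self.read_at]:
            return None
        value = self.slots[self.read_at]
        self.full[self.read_at] = False
        self.read_at = (self.read_at + 1) % len(self.slots)
        return value
```

With the default of two ways, a third write is refused until the first value has been read, which matches the "capacity of two" bullet.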

  22. Wagging Logic • Apply same method to the logic • Rotate logic allowing one to set while others reset (Figure labels: Set, Reset, Reset)
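
A toy timing model helps show why rotating the logic raises throughput. It rests on assumptions that are not from the talk: each slice is busy for `cycle_time` per operation (set plus reset), work is handed to the slices in strict rotation, and successive operations must start at least `handoff` apart. The function name and all parameter values are illustrative.

```python
def time_per_operation(slices, cycle_time, handoff=1.0, operations=1000):
    """Average time per operation when work rotates round-robin over slices."""
    free_at = [0.0] * slices  # time at which each slice has finished resetting
    start = 0.0
    for op in range(operations):
        current = op % slices
        earliest = start + handoff if op > 0 else 0.0
        start = max(earliest, free_at[current])
        free_at[current] = start + cycle_time
    return (start + cycle_time) / operations


if __name__ == "__main__":
    for slices in (1, 2, 4, 16):
        print(slices, time_per_operation(slices, cycle_time=12.0))
```

In this model the time per operation falls roughly as `cycle_time / slices` until the hand-off delay becomes the limit.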

  23. Single Channel Mixer

  24. LCM Channels Mixer

  25. Direct Connection Mixer

  26. 32bit Incrementer Example (Figure labels: Reg, +1, HB; Slice 0, Slice 1, Slice 2)

  27. 32bit Incrementer • Optimal Design: 3288 Operations, 3.04 GDs per operation • Original Design: 77 Operations, 130 GDs per operation
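
The two result lines can be compared directly. The snippet below assumes "GD" means gate delays and that both figures come from runs of similar length; operations × GDs per operation is about 10,000 in both cases, which is an inference from the numbers rather than something the slide states.

```python
original = {"operations": 77, "gd_per_op": 130.0}
optimal = {"operations": 3288, "gd_per_op": 3.04}


def run_length(result):
    """Total gate delays covered by a run (operations x GDs per operation)."""
    return result["operations"] * result["gd_per_op"]


# Ratio of time per operation, original design versus optimal design.
speedup = original["gd_per_op"] / optimal["gd_per_op"]

if __name__ == "__main__":
    print(f"speedup: {speedup:.1f}x")
    print(f"run lengths: {run_length(original):.0f} and {run_length(optimal):.0f} GDs")
```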

  28. 32bit Accumulator Example • Load or Accumulate

  29. 32bit Accumulator Example (Figure labels: Load, Accumulate, Accumulate, Load, Accumulate, Load)

  30. 32bit Accumulator Example

  31. Transistors are “Free” • What is expensive? • Design effort • Time to market • Yield • What we want • Simple • Copy-Paste • Redundancy

  32. Redundancy (Figure: six slices)

  33. Arrangement (Figure: alternative arrangements of Slices 0 to 5)

  34. Teaching Monkeys • Dynamic extraction of parallelism • Implicit data dependency tracking • No locking • No polling • No handshakes • Average case performance

  35. Red Star • MIPS ISA • 32bit RISC • Fast and simple development • Use synchronous design methodology • Complicated features without complicated design effort • OOO execution, banked caching…

  36. Red Star

  37. Register Bank

  38. ADD R1, R1, #1 • 1401 Operations, 7.14 GDs per operation

  39. Branch Logic • Additional unnecessary stages to extend the branch shadow (Figure labels: PC, +1, +)

  40. Overlapping Instructions (Figure: eight overlapping instructions, each passing through Fetch, Decode, Execute, Memory, Dummy, WriteBack, with the Branch Shadow marked)

  41. Nine Instruction Loop

  42. Caching: 4 Instruction Loop • Program: 0: Instruction, 1: Instruction, 2: Instruction, 3: Branch 0 (Figure: RAM addresses 0 to 7 and the caches of Slices 0 to 3)

  43. Caching: 3 Instruction Loop • Program: 0: Instruction, 1: Instruction, 2: Branch 0 (Figure: RAM addresses 0 to 7 and the caches of Slices 0 to 3)

  44. Caching: Delayed Branch • If (PC % WagLevel != Slice): execute a NOP and don’t increment the PC • Program: 0: Instruction, 1: Instruction, 2: Branch 0 (Figure: RAM addresses 0 to 7 and the caches of Slices 0 to 3, with a NOP shown)
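
The delayed-branch rule can be tried out in a few lines. The simulation below is a sketch under simplifying assumptions that are not from the slides: slices take turns in strict rotation, every instruction takes one turn, and a program is a list whose entries are either `None` (ordinary instruction) or a branch target address. `run_loop` and its parameters are illustrative names.

```python
def run_loop(program, wag_level, turns, delayed_branch_rule):
    """Rotate through the slices and record which addresses each one caches.

    program[i] is None for an ordinary instruction or a branch target address.
    Returns (caches, nops), where caches[s] is the set of addresses slice s fetched.
    """
    caches = [set() for _ in range(wag_level)]
    pc = 0
    nops = 0
    for turn in range(turns):
        slice_id = turn % wag_level
        if delayed_branch_rule and pc % wag_level != slice_id:
            nops += 1  # execute a NOP and leave the PC unchanged
            continue
        caches[slice_id].add(pc)  # the fetch fills this slice's cache
        target = program[pc]
        pc = pc + 1 if target is None else target
    return caches, nops


if __name__ == "__main__":
    loop = [None, None, 0]  # 0: Instruction, 1: Instruction, 2: Branch 0
    for rule in (False, True):
        caches, nops = run_loop(loop, wag_level=4, turns=24, delayed_branch_rule=rule)
        print(rule, [sorted(c) for c in caches], nops)
```

For the three instruction loop on four slices, the run without the rule ends with every slice caching all three addresses, while the run with the rule leaves each address in exactly one cache at the cost of one NOP per pass through the loop.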

  45. Caching • Instead of one large 16KB cache • 12bit address • 16 small 1KB caches • 8bit address • Approximately 50% faster lookup • No data duplication
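
The address widths follow from the cache sizes if the sizes are taken as bytes and each entry is one 32 bit word. Both are assumptions, chosen because they reproduce the 12 and 8 bit figures on the slide. `address_bits` is an illustrative helper, not something from the talk.

```python
def address_bits(cache_bytes, word_bytes=4):
    """Bits needed to index every word of a cache (sizes must be powers of two)."""
    words = cache_bytes // word_bytes
    return words.bit_length() - 1


if __name__ == "__main__":
    print("one 16 KB cache:", address_bits(16 * 1024), "address bits")
    print("each 1 KB cache:", address_bits(1024), "address bits")
```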

  46. Area • ~4 times larger than synchronous • Times the number of slices • Currently 45,000 gates per slice • 15,000 gates without the register bank • Approx 6 million transistors (16 way) • 2 million without the register bank • Final design target: 4 million transistors • Don’t wag the register bank (66% of area) • Simplify completion detection (50% of area) • Technology mapper • Complete the ISA
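
The area figures are consistent with each other, as the arithmetic below shows. The transistors-per-gate ratio is not given in the talk; here it is derived from the quoted 6 million transistors for 16 slices and then reused for the design without the register bank, which is an assumption.

```python
SLICES = 16
GATES_PER_SLICE = 45_000
GATES_PER_SLICE_NO_REGBANK = 15_000
TOTAL_TRANSISTORS = 6_000_000  # the "approx" figure from the slide

total_gates = SLICES * GATES_PER_SLICE
transistors_per_gate = TOTAL_TRANSISTORS / total_gates  # derived, not from the talk
regbank_share = 1 - GATES_PER_SLICE_NO_REGBANK / GATES_PER_SLICE
transistors_no_regbank = SLICES * GATES_PER_SLICE_NO_REGBANK * transistors_per_gate

if __name__ == "__main__":
    print(f"total gates: {total_gates}")
    print(f"transistors per gate: {transistors_per_gate:.2f}")
    print(f"register bank share of gates: {regbank_share:.0%}")
    print(f"transistors without register bank: {transistors_no_regbank:.0f}")
```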

  47. How much is 4 million?

  48. How much is 4 million?

  49. How much is 4 million?

  50. How much is 4 million?
