Group Talk

Group Talk Charlie Brej APT Group University of Manchester Async Forum

Part 1: The Future According to Me

Razor Blades
Scheme 1: “Name” [Number] Plus/Extreme/Ultra/Turbo/?X
• Trac II Plus
• Core Quad Extreme
• Athlon 64 FX
• GeForce 8800 Ultra
Scheme 2: “Company Name” Fusion/Quattro/Mach
• Gillette Fusion, AMD Fusion, Ford Fusion
• Schick Quattro, NVIDIA Quadro, Audi Quattro
• Gillette Mach, ATI Mach, Ford Mustang Mach 1
• Maybe more soon…
[Image year labels: 1901, 1971, 1998, 2004, 2005]

Razor Blade History

Prediction: 2007
Jan-Sept: 15 Blade Apple iShave

Why did this not happen?
• Because you don’t need more than five blades on your razor
• Unless we grow larger faces
• Which hasn’t happened before, so we won’t need them for some time
• We don’t need more than four processors
• Unless we invent an automagic parallelism extractor
• Which we haven’t since the 60s, so we won’t need them for some time
• People will still demand faster single thread performance

Real Future
• Moore’s law will continue
• Transistor count doubles every 18 months
• Moving into 3rd dimension
• Intelligent transistors placed per person will remain constant
• Not copy-paste
• Verification becomes problematic
• Designs become very complicated

Productivity
• Managers: 40%
• Grunt Coder: 80%
• Sales: 0%
• Hero Coder: 100%
• Marketing: -20%
• Maintainers: 60%
• Admin: 20%
[Speech bubbles: “Can we make it pink?” and “How about ‘Intel Terrano’”]

Brej’s Law
• Person years per design doubles every 18 months
• Most transistors are copy-paste
• Verification becomes much more complex
• Hero coders become more rare
• People get stupider
• Marketing becomes more important

Brej’s Law
• 1985: 5 person years
• ARM
• 1997: 2560 person years
• Pentium II (about right)
• 2007: 81920 person years
• Intel has 94,000 employees
• AMD has 16,000
• A new design every 7 years
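The compounding behind these figures can be sketched as follows. This is a minimal sketch, assuming the 1985 figure of 5 person years as the base and a strict 18-month doubling period; the function names are illustrative and not from the talk. The later slide figures are the base times a whole power of two (2560 = 5 × 2^9, 81920 = 5 × 2^14), so they read as round doubling counts rather than exact evaluations of the formula at 1997 and 2007.

```python
import math

BASE_YEAR = 1985              # from the slide
BASE_PERSON_YEARS = 5         # from the slide
DOUBLING_PERIOD_YEARS = 1.5   # "doubles every 18 months"

def person_years(year):
    """Person years per design under strict 18-month doubling (assumed model)."""
    return BASE_PERSON_YEARS * 2 ** ((year - BASE_YEAR) / DOUBLING_PERIOD_YEARS)

def doublings_implied(figure):
    """Number of doublings from the 1985 base needed to reach a quoted figure."""
    return math.log2(figure / BASE_PERSON_YEARS)

for year, quoted in [(1997, 2560), (2007, 81920)]:
    print(year, round(person_years(year)), quoted, doublings_implied(quoted))
```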

Brej’s Law
• 2028: Entire population of the USA are employed by Intel
• 2031: Entire population of China employed by AMD
• 2034: Entire world population working on creating Pentium 12
• 2090: Project to build Pentium 15 starts but hits a snag as universe finishes before the project does

“The most powerful force in the universe is compound interest”
Albert Einstein
“And we didn't have any fancy Sony Playstation video games. We had the Atari 2600! There were no multiple levels or screens. It was just ONE screen, forever, and you could never win. The game just kept getting harder and faster until you died. Just like LIFE!”
Ernest Cline

Back to the Future
• Transistors will be free
• Mostly consumed in memory
• Diminishing returns
• Single thread grinds to a halt
• Increase performance by 1% get 100% more money
• Fewer designs
• Very expensive and long lead up times
• Extend rather than redesign

Part 2: Wagging Logic: Non Throughput-Bound Design Methodology

Introduction
• Async performance
• Asynchronous logic is slow
• Wagging Logic
• Example circuits
• Red Star
• Design
• Results
• Conclusions

Data propagation
[Figure: logic stages and C elements, with Latency and Cycle Time marked on a time axis from 0 to 12]

Control propagation
[Figure: logic stages and C elements, with Latency and Cycle Time marked on a time axis from 0 to 12]

And then it gets worse
• Latency is at least six times lower than the cycle time
• Assumes all data arrives at the same time
• Assumes all acknowledgements arrive at the same time
• Actual number is somewhere between 10 and 100

What can we do
• Use two-phase signalling
• Halve the control delay
• Lose all average case advantages
• Fine grain pipelining
• Need to add 10+ latches per stage
• Adds latency
• Faster completion
• Anti-tokens, Early-drop latches…
• Careful timing analysis

Wagging Latches
• Alternate latch read/write
• Capacity of two latches
• Depth of one latch
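The three properties above can be illustrated with a small behavioural model. This is a software analogy under stated assumptions, not the circuit from the talk: the class name `WaggingLatchPair` and its methods are hypothetical. Writes alternate between two latches and reads alternate in the same order, so two values can be held at once (capacity of two) while each value only ever passes through a single latch (depth of one).

```python
class WaggingLatchPair:
    """Two latches used alternately; a behavioural sketch, not a circuit."""

    def __init__(self):
        self.slots = [None, None]    # the two latches
        self.full = [False, False]   # whether each latch currently holds data
        self.write_sel = 0           # latch that takes the next write
        self.read_sel = 0            # latch that supplies the next read

    def can_write(self):
        return not self.full[self.write_sel]

    def write(self, value):
        if not self.can_write():
            raise RuntimeError("selected latch still holds unread data")
        self.slots[self.write_sel] = value
        self.full[self.write_sel] = True
        self.write_sel ^= 1          # wag to the other latch

    def can_read(self):
        return self.full[self.read_sel]

    def read(self):
        if not self.can_read():
            raise RuntimeError("selected latch is empty")
        value = self.slots[self.read_sel]
        self.full[self.read_sel] = False
        self.read_sel ^= 1           # wag to the other latch
        return value


pair = WaggingLatchPair()
pair.write("a")
pair.write("b")                      # second value fits: capacity of two
print(pair.can_write())              # both latches are now occupied
print(pair.read(), pair.read())      # values come out in arrival order
```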

Wagging Logic
• Apply same method to the logic
• Rotate logic allowing one to set while others reset
[Figure labels: Set, Reset, Reset]
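The effect of rotating between copies of the logic can be sketched with a toy timing model. Everything here is an assumption made for illustration: the function name, the `handoff` delay between consecutive operations, and the example timings are hypothetical and do not come from the talk. Each operation goes to the next slice in rotation; a slice is busy for its set time plus its reset time, and a new operation can start only once the previous one has been handed off and its own slice has finished resetting.

```python
def average_cycle_time(num_slices, set_time, reset_time, handoff, num_ops=1001):
    """Average start-to-start interval when operations rotate over the slices."""
    free_at = [0.0] * num_slices      # time at which each slice has finished resetting
    prev_start = None
    for op in range(num_ops):
        s = op % num_slices           # strict rotation between slices
        earliest = 0.0 if prev_start is None else prev_start + handoff
        start = max(earliest, free_at[s])
        free_at[s] = start + set_time + reset_time
        prev_start = start
    return prev_start / (num_ops - 1)


# Hypothetical timings: 10 units to set, 10 to reset, 1 to hand off.
for n in (1, 2, 4, 32):
    print(n, average_cycle_time(n, set_time=10, reset_time=10, handoff=1))
```

In this model the cycle time falls as slices are added until the handoff delay, rather than the set and reset time of any one slice, becomes the limit.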

Single Channel Mixer

LCM Channels Mixer

Direct Connection Mixer

32bit Incrementer Example
[Figure labels: Reg, +1, HB, Slice 0, Slice 1, Slice 2]

32bit Incrementer
• Optimal Design: 3288 Operations, 3.04 GDs per operation
• Original Design: 77 Operations, 130 GDs per operation
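A quick arithmetic check on these figures, assuming "GDs" means gate delays. The two ratios agree at roughly 43 times, and multiplying operations by GDs per operation gives about 10,000 GDs for both designs. That would be consistent with both results coming from a simulation window of the same length, though the slide does not say so.

```python
# Figures copied from the slide; the interpretation of "GDs" as gate delays is an assumption.
original = {"operations": 77, "gds_per_op": 130.0}
optimal = {"operations": 3288, "gds_per_op": 3.04}

speedup_by_cycle = original["gds_per_op"] / optimal["gds_per_op"]
speedup_by_count = optimal["operations"] / original["operations"]

window_original = original["operations"] * original["gds_per_op"]
window_optimal = optimal["operations"] * optimal["gds_per_op"]

print(f"speed-up from GDs per operation: {speedup_by_cycle:.2f}")
print(f"speed-up from operation counts:  {speedup_by_count:.2f}")
print(f"implied window, original design: {window_original:.0f} GDs")
print(f"implied window, optimal design:  {window_optimal:.0f} GDs")
```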

32bit Accumulator Example
• Load or Accumulate

32bit Accumulator Example
[Figure labels: Load, Accumulate, Accumulate, Load, Accumulate, Load]

32bit Accumulator Example

Transistors are “Free”
• What is expensive?
• Design effort
• Time to market
• Yield
• What we want
• Simple
• Copy-Paste
• Redundancy

Redundancy
[Figure: six slices]

Arrangement
[Figure: arrangements of slices, labelled Slice 0 to Slice 5]

Teaching Monkeys
• Dynamic extraction of parallelism
• Implicit data dependency tracking
• No locking
• No polling
• No handshakes
• Average case performance

Red Star
• MIPS ISA
• 32bit RISC
• Fast and simple development
• Use synchronous design methodology
• Complicated features without complicated design effort
• OOO execution, banked caching…

Red Star

Register Bank

ADD R1, R1, #1
• 1401 Operations
• 7.14 GDs per operation

Branch Logic
[Figure labels: PC, +1, +]
Additional unnecessary stages to extend the branch shadow

Overlapping Instructions
[Figure: eight overlapping instructions, each passing through Fetch, Decode, Execute, Memory, Dummy and WriteBack, with the Branch Shadow marked]

Nine Instruction Loop

Caching: 4 Instruction Loop
Program: 0: Instruction, 1: Instruction, 2: Instruction, 3: Branch 0
[Figure: RAM addresses 0 to 7 and the contents of the Slice 0 to Slice 3 caches]

Caching: 3 Instruction Loop
Program: 0: Instruction, 1: Instruction, 2: Branch 0
[Figure: RAM addresses 0 to 7 and the contents of the Slice 0 to Slice 3 caches]

Caching: Delayed Branch
If (PC % WagLevel != Slice)
• Execute a NOP
• Don’t increment the PC
Program: 0: Instruction, 1: Instruction, 2: Branch 0
[Figure: RAM addresses 0 to 7 and the contents of the Slice 0 to Slice 3 caches; Slice 3 executes a NOP]
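The delayed-branch rule can be traced with a short simulation of the three-instruction loop. This is a sketch under assumptions the slide does not state: slices take turns in strict rotation, each turn executes at most one instruction, and the branch target is known before the next slice's turn (the branch shadow is ignored). The function names are illustrative.

```python
def simulate(branch_at, branch_target, wag_level, steps):
    """Return (slice, pc) pairs; pc is None when the slice executes a NOP."""
    pc = 0
    trace = []
    for step in range(steps):
        slice_id = step % wag_level          # assumed strict rotation of slices
        if pc % wag_level != slice_id:
            trace.append((slice_id, None))   # execute a NOP, don't increment the PC
            continue
        trace.append((slice_id, pc))
        pc = branch_target if pc == branch_at else pc + 1
    return trace


def addresses_per_slice(trace, wag_level):
    """Collect the set of addresses each slice fetched (and so would cache)."""
    seen = [set() for _ in range(wag_level)]
    for slice_id, pc in trace:
        if pc is not None:
            seen[slice_id].add(pc)
    return seen


# Program from the slide: 0: Instruction, 1: Instruction, 2: Branch 0, on four slices.
trace = simulate(branch_at=2, branch_target=0, wag_level=4, steps=8)
print(trace)
print(addresses_per_slice(trace, 4))
```

In this model each address is only ever fetched by the slice whose number matches it, so no instruction ends up in more than one slice cache, at the cost of one NOP per trip round the loop.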

Caching
• Instead of one large 16Kb cache
• 12bit address
• 16 small 1Kb caches
• 8bit address
• Approximately 50% faster lookup
• No data duplication
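The address widths above fit a simple split of the word address, sketched here under assumptions: "Kb" is read as kilobytes, words are 32 bits (so 16 KB is 4096 words and 1 KB is 256 words), and the slice is chosen by the address modulo the number of slices, in line with the PC % WagLevel rule on the delayed-branch slide. The function name is hypothetical.

```python
NUM_SLICES = 16          # 16 small caches, one per slice
WORDS_PER_SLICE = 256    # assumed: 1 KB of 4-byte words, hence an 8-bit index
TOTAL_WORDS = NUM_SLICES * WORDS_PER_SLICE   # 4096 words, hence a 12-bit address


def split_address(word_address):
    """Map a 12-bit word address to (slice number, 8-bit index within that slice)."""
    if not 0 <= word_address < TOTAL_WORDS:
        raise ValueError("address outside the 12-bit range")
    return word_address % NUM_SLICES, word_address // NUM_SLICES


# Every word address lands in exactly one (slice, index) slot.
slots = {split_address(a) for a in range(TOTAL_WORDS)}
print(split_address(0x123))
print(len(slots) == TOTAL_WORDS)
```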

Area
• ~4 times larger than synchronous
• Times the number of slices
• Currently 45,000 gates per slice
• 15,000 gates without the register bank
• Approx 6 million transistors (16 way)
• 2 million without the register bank
• Final design target: 4 million transistors
• Don’t wag the register bank (66% of area)
• Simplify completion detection (50% of area)
• Technology mapper
• Complete the ISA
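The area figures can be cross-checked with a few lines of arithmetic. This sketch uses only numbers from the slide; the transistors-per-gate ratio is derived here and is not stated in the talk.

```python
SLICES = 16                           # "16 way"
GATES_PER_SLICE = 45_000
GATES_PER_SLICE_NO_REGBANK = 15_000
TRANSISTORS = 6_000_000               # approximate, from the slide
TRANSISTORS_NO_REGBANK = 2_000_000    # approximate, from the slide

total_gates = SLICES * GATES_PER_SLICE
total_gates_no_regbank = SLICES * GATES_PER_SLICE_NO_REGBANK

# Derived ratios; both come out at about 8.3 transistors per gate.
transistors_per_gate = TRANSISTORS / total_gates
transistors_per_gate_no_regbank = TRANSISTORS_NO_REGBANK / total_gates_no_regbank

# Share of each slice taken by the register bank, to compare with the quoted 66%.
regbank_share = (GATES_PER_SLICE - GATES_PER_SLICE_NO_REGBANK) / GATES_PER_SLICE

print(total_gates, total_gates_no_regbank)
print(round(transistors_per_gate, 2), round(transistors_per_gate_no_regbank, 2))
print(f"{regbank_share:.1%}")
```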

How much is 4 million?
