Traditional Processing

Adaptive Computing: The Next Big Leap in Processing Systems 2002 MAPLD Paul Master, CTO QuickSilver Technology. Traditional Processing. Computational Power Reduction to instructions is very inefficient. µP, DSP = 5% efficiency. Most of the chip area is used to run & decode instructions;

Presentation Transcript


  1. Adaptive Computing: The Next Big Leap in Processing Systems. 2002 MAPLD. Paul Master, CTO, QuickSilver Technology

  2. Traditional Processing
     • Computational Power
     • Reduction to instructions is very inefficient.
     µP, DSP = 5% efficiency
     Most of the chip area is used to run & decode instructions, effectively simulating the hw design.
     ALU area (where the work is done) is very small.
     [Figure: C Prgm. (blocks A B C D E) reduced to Assembly Code and Machine code]
     Machine code: A1. 101011101010 A2. 010001000110 A3. 000100100111 A4. 100100010010 B1. 101000010001
     Assembly Code: A1. 05AC42232689ADFFFF A2. FFFF4468420000DABB A3. 3333888AADFCC14811 A4. AAB1CCDDEEFF0010C B1. 014432289AAEEFFCCD
     Paul Master – Presentation E8

  3. What You Really Want
     • Computational Power
     • Binaries expand into efficient hardware in real time.
     50-75% efficiency
     An Adaptive Computing matrix adapts to become the exact hardware required at any and every point in time.
     [Figure: C-like Prgm. (blocks A B C D E) mapped directly onto hardware blocks A B C D E]
     Binaries: A. 01010 B. 00100 C. 10001 D. 01001 E. 01010

  4. The Real Metric - CPE
     • Computational Power Efficiency (CPE)
     • Measurement of work accomplished within a given timeframe against the amount of power consumed.
     µP, DSP: CPE = 5% (simulate hw)
     ASIC: CPE = 20-30% (rigid hw)
     ACM: CPE = 50-75% (hw on demand)
     [Figure: C Prgm. reduced to Assembly Code and Machine Code, beside a C-like Prgm. mapped onto hardware blocks A B C D E; the code listings repeat those of slides 2 and 3]

  5. Virtual Hardware – Time Sequencing
     Binaries… …become hardware
     ACM V1.0: 1010101011100010 1100010001000001 1101000100100010 0100100100010110 0010101000010001
     Time steps: T0: START; T1 through T7: Run & Load
     Vocoder Hardware Elements (each realized on the ACM in turn): Codebook Search, Pitch Search, LSP Computation, Recursive Convolution, Filter1, Filter2, Filter3

  6. Not a “Processor” Problem
     • Pick an application at the bleeding edge: 3G
     • For a true software defined radio/backend
     • Need roughly 14,000 MIPS of horsepower
     • Must run on batteries with long life
     • Be consumer priced
     • End product can’t be bigger than last year’s model.
     • DSP/Micro only solves a small percentage of the real problem.
     • The Gold Standard is the ASIC.
     • We must outdo an ASIC at its own game.
     [Chart: MIPS scale with 14,000, 8,000 and 400 marked; labels: 3G, ASIC Blocks, DSP]

  7. Conventional Approach
     CDMA Handset Example
     [Block diagram labels: Interleaver, FEC Encoder, Power Mgmt., Packet Controller, Modulator, RF, System Timer, Flash, Searcher (x3), Vocoder, A/D, D/A, Audio Codec, Frequency Synthesizer, Filters, UI / Graphics, USB, Channel Coder, Channel Decoder, RAKE Receiver, SRAM, AGC Algorithm, Protocol Stack, Error Correction, Interleaver, MMU; grouped under ASIC, DSP and RISC]

  8. ACM Approach vs. Conventional Approach
     Handset Example
     • Power consumption < 87%
     • Die size reduces > 60%
     • Performance increases ~ 9x
     • Design time drops ~ 50%
     [ACM Approach block diagram labels: ACM, Power Mgmt., Ext I/O, Flash, Audio Codec, USB, RF, SRAM; functions listed: MP3, MPEG-4, Bluetooth, GPS, CDMA, GSM/GPRS, cdma2000, wCDMA]
     [Conventional Approach block diagram: same blocks as slide 7, grouped under ASIC, DSP and RISC]

  9. 3G Partial Results* - smaller size
     *WCDMA Tx, MPEG4 decode, AMR-EFR
                     DSP+ASIC      ACM
     • Power:        30.40 mW      15.66 mW
     • Area:         19.23 mm²     6.25 mm²
     • Utilization:  15% + 100%    26.25%
     •               Rigid         Flexible
     [Figure: blocks annotated with individual area (mm²) and power (mW) figures; block labels: MEM, DSP, IDCT, MC, FIR RRC, IIR AGC, OVSF, LFSR, ACM]
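The headline comparison on this slide is plain arithmetic on the two totals. As a minimal sketch (the helper names `reduction_pct` and `print_slide9_savings` are illustrative and do not come from the presentation), the relative savings can be recomputed from the reported figures:

```c
#include <assert.h>
#include <stdio.h>

/* Percentage reduction when going from `baseline` to `candidate`.
   Hypothetical helper for illustration only. */
double reduction_pct(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return 100.0 * (baseline - candidate) / baseline;
}

/* Prints the savings implied by the totals quoted on the slide:
   DSP+ASIC = 30.40 mW, 19.23 mm^2; ACM = 15.66 mW, 6.25 mm^2. */
void print_slide9_savings(void)
{
    printf("power reduction: %.1f%%\n", reduction_pct(30.40, 15.66));
    printf("area reduction:  %.1f%%\n", reduction_pct(19.23, 6.25));
}
```

With the slide's numbers this works out to a power saving of a little under half and an area saving of about two thirds for this partial workload.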

  10. Key is Ease of Development
      • TTM / portable / reuse; Must be right the first time; Few engineering resources; Very expensive to create
      • Fast TTM / portable / reuse; Mistakes are free; Vast engineering resources; Inexpensive to create
      • If “c productivity” = 1: Assembly = 1/10, Verilog / VHDL = 1/100, Schematics = 1/1000
      [Chart: MIPS scale with 14,000, 8,000 and 400 marked; labels: 3G, ASIC Blocks, DSP]

  11. Adaptive Computing Architecture

  12. If You Run Fast Enough…
      • There is a difference between MODAL reuse (reconfigurability) and Algorithmic ELEMENT reuse (adaptability)
      • MODAL Reuse vs. a Velcro Design
        • At a system level, if the sum of the total area of each non-concurrent application is GREATER than the total area of a reconfigurable system, you win in area == $$$
        • There are a fair number of applications that have this multi-modal property
      • Algorithmic Element Reuse
        • If the reconfiguration rate is high enough, then you can reuse algorithmic elements
        • Within 1 application, if the sum of the total area of each non-concurrent algorithmic element is GREATER than the total area of a reconfigurable algorithmic element, you win in area == $$$
        • Individual applications have many examples of this multi-algorithmic element property
      • Algorithmic Element Reuse is additive to the MODAL Reuse benefits
      • Fast adaptations of small elements will beat a few reconfigurations running slower
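The area argument on this slide reduces to one comparison: add up the areas of the blocks that never run concurrently and check the total against the single reconfigurable block that would replace them. A minimal sketch follows; the function names and the sample areas in the usage note are assumptions for illustration, not data from the presentation.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the silicon areas of blocks that never run at the same time. */
double total_area(const double *areas_mm2, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(areas_mm2[i] >= 0.0);
        sum += areas_mm2[i];
    }
    return sum;
}

/* Returns 1 when one shared reconfigurable block is smaller than the
   dedicated blocks it replaces, i.e. when reuse "wins in area". */
int reuse_wins(const double *areas_mm2, size_t n, double reconfig_area_mm2)
{
    return total_area(areas_mm2, n) > reconfig_area_mm2;
}
```

For example, three dedicated blocks of 2.0, 3.0 and 1.5 mm² (6.5 mm² in total) lose to a 5.0 mm² reconfigurable block but beat a 7.0 mm² one. The same test applies at both levels the slide names: whole applications (modal reuse) and algorithmic elements inside one application.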

  13. The Data Sets You Free
      • Work is defined by algorithms that decompose into sets of algorithmic elements
        • Process at the algorithmic element level: an Algorithmic Element Processor
      • Real-world problems consist of heterogeneous algorithmic elements
        • Low-level hardware building blocks had better be heterogeneous
      • All algorithmic elements exhibit structure
        • Exploit the time and space “locality of reference” it exposes
      • Most applications are dominated by memory
        • Use a uniform mixture of memory transistors and computational transistors so you can scale

  14. “Something More Wicked…
      • “…This Way Comes” = Adaptive Computing Machines
      • Real time adaptation of hardware in a single clock cycle
        • By application
        • By applet
        • By algorithmic element
      • Dynamic reallocation of hardware both spatially and temporally
        • Processing power directed to specific problems as needed
      • Ultra-low power consumption
      • Small die size; low cost
      • Hardware looks and acts like software
        • Utilizes DSP programming model
        • Lets engineers innovate quickly and easily
      • Compares favorably to ASIC gold standard solutions

  15. Algorithm Space
      [Figure labels: WLAN, CDMA2000, IS-95A, W-CDMA, TDMA, GSM, GPRS, EDGE, GPS, Voice Compression, Music Compression, MPEG2, MPEG4, XM Radio, Sirius]

  16. Adaptive Computing Machine
      • The ACM consists of a heterogeneous matrix of similar, but different, nodes interconnected by a scalable, homogeneous communications network.
      • Heterogeneity is more efficient for complex, multitask systems.
      • Distributed memory
      • Locality of reference both in space and time.
      • Adaptable clock cycle by clock cycle.
      • Scalable from 1 node to thousands.
      [Figure labels: Matrix Interconnect Network (MIN), Node]

  17. Basic ACM Node Types
      • Arithmetic node
        • Implements different, linear, variable-width, arithmetic functions clock-cycle-by-clock-cycle
        • Implements different, non-linear, variable-width, arithmetic functions clock-cycle-by-clock-cycle
      • Bit-manipulation node
        • Implements different, variable-width, bit-manipulation functions clock-cycle-by-clock-cycle
      • Finite state machine node
        • Implements different, high-speed, complicated, finite-state machines clock-cycle-by-clock-cycle
      • Scalar node
        • Implements different, complicated control sequences
      • Configurable input/output node
        • Implements different interfaces to external interfaces such as buses

  18. Arithmetic Node Structure

  19. Node Adaptation
      E = (A+B)*(C+D), shown with two "+" units and one "x" unit
      [Figure labels: two mini-matrices, each with an Interconnection Network and dma engines; Distributed configuration memory; DAG (x4); iMemory; CU type 1 and CU type 2; RAM; Level 0 Highway, Level 1 Highway, Level 2 Highway, Boolean Highway; Mini-Matrix Controller]
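The slide's example expression, E = (A+B)*(C+D), needs two additions that do not depend on each other and one multiplication that consumes both sums. A minimal software sketch of that dataflow is shown below. The `op_t` codes and the `cu_apply` and `node_eval` functions are hypothetical names invented for this illustration; the slides do not describe the actual node configuration format.

```c
#include <assert.h>

/* Hypothetical operation codes for a compute unit. */
typedef enum { OP_ADD, OP_MUL } op_t;

/* Apply one configured operation to two operands. */
int cu_apply(op_t op, int x, int y)
{
    assert(op == OP_ADD || op == OP_MUL);
    return (op == OP_ADD) ? x + y : x * y;
}

/* E = (A+B)*(C+D): the two additions are independent of each other,
   so in hardware they can run side by side; the multiply combines them. */
int node_eval(int a, int b, int c, int d)
{
    int s0 = cu_apply(OP_ADD, a, b);   /* first adder  */
    int s1 = cu_apply(OP_ADD, c, d);   /* second adder */
    return cu_apply(OP_MUL, s0, s1);   /* multiplier   */
}
```

Calling `node_eval(1, 2, 3, 4)` yields 21. In C the three operations run one after another; the point of the sketch is only the dependency structure, which is what lets a hardware mapping place the adders in parallel.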

  20. MPEG4 Example
      • Typical ASIC – move the data
      • ACM – move the logic
      Once engineers can quickly & easily experiment
      [Figure labels, shown for both approaches: Data In, Memory, IDCT, MC, ME]

  21. ACM Tool Flow
      [Flow diagram labels: Algorithmic Description; Legacy C; User Input; New Application Environment; Translator; Application Porting Environment; Wizards & Assistants; SilverC; Interactive Profiling Translator; SilverC Compiler; Simulator; Debug; Linker & Resource Scheduler; Libraries; SilverWare Generation; SilverWare Simulator; ACM Emulator]

  22. Example: C FIR Filter

      /* input[] must hold at least nOut + nCoef - 1 samples */
      void fir(int input[], int coef[], int nCoef, int output[], int nOut)
      {
          int i, j;
          int sum;

          for (j = 0; j < nOut; j++) {
              sum = 0;
              for (i = 0; i < nCoef; i++) {
                  sum += input[j+i] * coef[i];
              }
              output[j] = sum >> 15;   /* scale the accumulated product back down */
          }
      }
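To make the slide's filter concrete, here is a small usage sketch. It assumes the coefficients are in Q15 fixed point (so 16384 stands for 0.5), which is consistent with the `>> 15` in the slide's code but is not stated in the presentation. The `fir_demo` helper, the `const` qualifiers and the precondition check are additions for illustration.

```c
#include <assert.h>

/* FIR filter as on the slide, with an explicit return type.
   input[] must hold at least nOut + nCoef - 1 samples. */
void fir(const int input[], const int coef[], int nCoef,
         int output[], int nOut)
{
    assert(nCoef > 0 && nOut >= 0);
    for (int j = 0; j < nOut; j++) {
        int sum = 0;
        for (int i = 0; i < nCoef; i++) {
            sum += input[j + i] * coef[i];
        }
        output[j] = sum >> 15;   /* drop the 15 fractional bits (Q15 assumed) */
    }
}

/* Two-tap moving average: both coefficients are 0.5 in Q15. */
void fir_demo(int out[3])
{
    const int input[4] = { 2, 4, 6, 8 };      /* 3 outputs + 2 taps - 1 */
    const int coef[2]  = { 16384, 16384 };
    fir(input, coef, 2, out, 3);
}
```

After `fir_demo(out)`, `out` holds 3, 5, 7, the averages of each adjacent input pair. Note that every multiply-accumulate here passes through the same sequential instruction stream, which is the inefficiency the earlier slides argue against.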

  23. Example: SilverC FIR Filter

      void run (void)
      {
          fract16 sum;

          loop (int l=0; l<nOut; l++) dataflow {
              sample = input.read();
              sum = 0.0;
              unroll (int i=0; i<nCoef; i++) {
                  sum = sum + coefReg[i] * sample[nCoef-i];
              }
              output.write(sum);
          }
      }

      • Simple to experiment with hardware functionality
      • Eliminates need for hardware/software co-design
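SilverC is not generally available, so the listing above cannot be run as written. The plain C sketch below is one plausible reading of it, not the SilverC semantics themselves: each call consumes one input sample and produces one output, mirroring a single iteration of the `loop ... dataflow` body, and the inner loop is the part an `unroll` would expand. The type `stream_fir_t`, the tap count `N_TAPS`, the Q15 scaling and the tap ordering are all assumptions made for the sketch.

```c
#include <assert.h>
#include <string.h>

#define N_TAPS 4   /* hypothetical tap count for the sketch */

typedef struct {
    int coef[N_TAPS];    /* Q15 coefficients (assumed format) */
    int delay[N_TAPS];   /* most recent samples, delay[0] is the newest */
} stream_fir_t;

/* Load the coefficients and clear the sample history. */
void stream_fir_init(stream_fir_t *f, const int coef[N_TAPS])
{
    assert(f != NULL && coef != NULL);
    memcpy(f->coef, coef, sizeof f->coef);
    memset(f->delay, 0, sizeof f->delay);
}

/* Consume one sample and produce one output (one dataflow iteration). */
int stream_fir_step(stream_fir_t *f, int sample)
{
    /* Shift the delay line by one position and insert the new sample. */
    for (int i = N_TAPS - 1; i > 0; i--) {
        f->delay[i] = f->delay[i - 1];
    }
    f->delay[0] = sample;

    /* Fixed trip count: this is the loop that could be fully unrolled. */
    long long sum = 0;
    for (int i = 0; i < N_TAPS; i++) {
        sum += (long long)f->coef[i] * f->delay[i];
    }
    return (int)(sum >> 15);
}
```

With all four coefficients set to 8192 (0.25 in Q15) and a constant input of 4, successive calls return 1, 2, 3, 4 as the delay line fills. The design difference from the slide 22 version is that the filter keeps its own state and is driven sample by sample, which is closer to how a streaming hardware block behaves.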

  24. Working ACMs

  25. New Processor Paradigm
      • 1. Silicon IC
        • Scalable, heterogeneous
        • Adaptable clock cycle by clock cycle
        • Low power consumption, small die area
      • 2. Development tools
        • Pathway for existing legacy (SPW, Matlab, C code)
        • Path for adaptive algorithms
        • SilverC language to bypass limitations of old sequential programming models
      • 3. Operating system
        • Forward binary compatibility across different silicon versions and across differently sized silicon
        • Manage swapping of hardware & downloads of new hardware modules
      • 4. Applications
        • Base set of wizards, templates & function calls
        • Adaptive algorithms

  26. ACM Eye-Openers
      • Benchmark against ASIC performance “gold standards”
        • CDMA2000 system acquisition
        • CDMA2000 rake finger
        • CDMA2000 set maintenance
        • W-CDMA system acquisition, stage III
      • Always designed in ASICs to meet performance
      • Now completely designed in software
      • In 1 IC, operate critical pieces of WCDMA and CDMA2000
        • 1st instance of silicon for a Software Defined Radio handset
      • Show very high speed dynamic adaptations at ultra-low power consumption
      • Outperform the best fixed-function silicon accelerators.

  27. ACM IC Demos
      cdma2000:
      • 1) Find strongest BS to talk to:
        • Load Sys Acquisition module
        • Run Sys Acquisition algorithm
        • Find the strongest pilot
      • 2) Receive neighbor list from BS:
        • Load Finger module
        • Run Finger algorithm using previous result
        • Demodulate the neighbor list
      • 3) Search & report strong neighbor BSs (continuous):
        • Load Set Maintenance module
        • Run Set Maint. algorithm using neighbor list
        • Repeatedly find neighbors above threshold
      W-CDMA:
      • 4) Find strongest Cell ID in a given group (continuous):
        • Stop cdma2000 operation
        • Load W-CDMA stage 3 searcher module
        • Run searcher algorithm
        • Repeatedly find neighbors above threshold

  28. Rake Finger Adaptations
      • >57,000 hardware adaptations/second
      • Every 52ms the ACM builds the hw, runs the app, and tears down
      [Figure labels: Host (PC), Input Bits, I/Q Gen, PCI Interface, Rx I/Q, Walsh, ACM Node #1, ACM Node #2, MIN (Matrix Interconnect Network), Traffic and Pilot Despreading, Channel Estimate, Phase Correction, Demodulated bits, Output bits, Repeat every 52ms]

  29. Performance Benchmarks

                               ASIC       ACM 25 MHz    ACM 25 MHz     ACM 200 MHz
                                          4 nodes       16 nodes       16 nodes
                                          (measured)    (projected)    (projected)
      CDMA2000 searcher        3.4 s      1.0 s         0.25 s         0.032 s
                                          3.4x          14x            108x faster
      CDMA2000 pilot search    184 ms     55 ms         14 ms          1.7 ms
                                          3.3x          13x            108x faster
      W-CDMA searcher          533 µs     232 µs        58 µs          7.25 µs
                                          2.3x          9x             74x faster

      • CDMA2000 searcher: 2X sampling, using 512-chip complex correlations, with captured data processed at 8X chip rate (equivalent to 16 parallel correlators running at real time)
      • CDMA2000 pilot search: 2X sampling, using 512-chip complex correlations, with captured data processed at 8X chip rate (equivalent to 16 parallel correlators running at real time)
      • W-CDMA searcher: 1X sampling, using 256-chip correlations with streaming data
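The speedup factors on this slide are ratios of the ASIC time to the ACM time for the same task. A minimal sketch for recomputing them from the times listed on the slide is given below; `speedup` and `print_speedups` are illustrative names, not part of any QuickSilver tool.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of an ACM run relative to the ASIC reference time.
   Both times must be expressed in the same unit. */
double speedup(double t_asic, double t_acm)
{
    assert(t_asic > 0.0 && t_acm > 0.0);
    return t_asic / t_acm;
}

/* Recomputes the 200 MHz, 16-node column from the slide's rounded times. */
void print_speedups(void)
{
    printf("CDMA2000 searcher:     %.1fx\n", speedup(3.4, 0.032));   /* seconds      */
    printf("CDMA2000 pilot search: %.1fx\n", speedup(184.0, 1.7));   /* milliseconds */
    printf("W-CDMA searcher:       %.1fx\n", speedup(533.0, 7.25));  /* microseconds */
}
```

Because the slide shows rounded times, the recomputed ratios land close to the stated factors rather than exactly on them: 3.4 s / 0.032 s is about 106x against the stated 108x, while 184 ms / 1.7 ms and 533 µs / 7.25 µs come out near 108x and 74x.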

  30. ACM Met Its Goals
      • Dynamically reconfigured ICs at very high speed with ultra-low power consumption.
        • 57,000 adaptations/sec while doing useful work
      • Measured performance versus ASIC gold standards.
        • Runs faster than best-in-class ASIC solutions.
      • Created a low cost solution.
        • Very small die area

  31. ACM Versus FPGA

      ACM (2000)                                                FPGA (1980)
      Built to efficiently compute & manipulate information     Built for TTL & ASIC bug fixes
      Algorithmic element reuse                                 Algorithm model reuse
      Fractal wiring plane                                      XY wiring plane
      Heterogeneous array                                       Homogeneous array
      Adaptive compute elements                                 Fine grain CLB blocks
      Data stays “resident”                                     Data moves across die
      1 clock cycle adaptability                                Slow reconfigurable rates
      Ultra-low power consumption                               Very high power consumption
      Machine state binaries                                    Single, large configuration file
      DSP/sw algorithm tool flow                                HDL, ASIC tool flow
      Real-time kernel                                          No operating kernel
      Adaptive algorithm designs                                Typical hardware designs

  32. Age of ACMs

  33. Products are not Silicon Limited
      • Dynamic algorithms mapped to dynamic hardware resources
      • Low cost, high performance, low power consumption
      • One product covers many “hardware” applications
        • MP3 player (MP3 codec, modem), GPS (receiver & accelerator), universal roaming (which protocols), data modems, network accesses (what network), MPEG 3/4/5/6, recorders, handwriting & speech recognition, mapping, …

  34. Uncertainty? Who cares?
      • Escalating Automotive Systems Uncertainty
      • Attributes of ALL electronic markets: Changing, Evolving, Standard
      [Figure: timeline with a year axis from 1900 to 2050 in ten-year steps; labels: AM Radio, FM, 8-track, Cassette, CD, Ignition Systems, Hydraulic Systems, Steering, Braking, Ventilation, On board engine computers, Safety and Pollution standards, Air Bags, Alarm and anti theft, Emergency Road Assistance, Hazardous Condition Detection, Night Vision Systems, “Smart” Anti lock brakes and traction systems, Collision Avoidance Systems, GPS, MP3, XM, Sirius, iboq, VR, DAB / DVB, DVD, Rear Seat Multimedia, Cell phone, Wireless Internet, Games, Telematics]

  35. ACM Grand Challenges
      • Technically
        • “Beat ASICs at their own game”
          • Higher performance; lower power consumption
          • Smaller area & lower cost
          • Software adaptable over broad market segments
        • Become the next generation IC platform
          • Put the design back into hardware design & allow engineers to engineer
          • Lower overall engineering & product costs
      • Business
        • Create higher margins for everyone
          • Software margins extend to hardware
          • Annuity revenues on hardware designs
        • Recreate the “PC” business model in mobile products

  36. Ray Bradbury Revisited
      • SciFi, but stated differently: SciFi is real…
      • “SiF(i)”: Silicon is only a function of its input
