Presentation Transcript


  1. big apps in small packages Al Davis, Binu Mathew, Ali Ibrahim, Mike Parker, Karthik Ramani

  2. Roadmap • challenges in embedded computing • details of our current solution • future • current issues • vision of where this is headed • typical concluding remarks

  3. Embedded Computing Characteristics • Historically • narrow, application-specific focus • typically cheap, low-power, provide just enough compute power • niche filled by small microcontroller/DSP devices • AND often ASIC component(s) • New Pressures • world goes bonkers on mobility and the web • expects ubiquitous wireless & tethered networks • expects better and cheaper everything • sensors, microphones & cameras become free • now we’re talking real computing

  4. Embedded Environments • Intimate connection between the environment and the electronics • Problems with diverse environments • usually • thermally limited • energy constrained • physical accessibility → stability is not guaranteed • temporary failure is guaranteed • catastrophic failure must be viewed as probable • malicious attacks should be viewed as possible

  5. New Look for ECS • Sophisticated application suites • not single algorithms – e.g. • 3G and 4G cellular handsets • multiple channels and multiple encoding models • plus the usual DSP stuff • process what is streaming in from the net • includes real time media & web access • process the sensor, microphone, and camera streams • plus network information from the neighborhood • since things are starting to happen in groups • wide range of services • dynamic selection • → no single app will do

  6. ECS Economics • Traditional reliance on the ASIC design cycle • lengthy IC design: > 1 year typical • little re-use • IP import works but there are many pitfalls • soft macro → HDL code → synthesize → energy-delay inefficiency • hard macro → macro block → forces process and layout issues • turning an IC is costly • even when it works the first time • ECS product cycles • lifetime similar to a mayfly • need next improved version “real soon now” • Result • sell monster volumes or lose

  7. New ECS Quandary • Need unprecedented levels of energy-delay efficiency • 1-3 orders of magnitude common • examples shortly • Also need more generality • ASIC reliance is problematic • Neither Moore’s law nor IC technology will help • no Moore’s law for batteries • new IC technology has some problems • fast leaky transistors create an energy problem • leakage power starting to overtake active power • noise floor is rising rapidly as Vdd and Vth close

  8. Better Living Through Physics and Chemistry • Historical performance • 65% from process (physics and chemistry) • 35% from architectural innovation • New marching orders • true for mainstream but even more important for ECS • process won’t help • pressure on the architecture to compensate • need ASIC-like energy-delay efficiency but CPU-like generality • → NEW ARCHITECTURES • we’ll take whatever process help we can get

  9. The Hook • IF you buy what you’ve heard so far • THEN you’ve been set up • for the next segment • embedded architecture research at Utah • short story • architectural style has evolved • investigated in 2 big app arenas • perception (focus) • 3G and 4G cellular telephony (ignored in this talk)

  10. What is Perception Processing ? • Ubiquitous computing needs natural human interfaces • Processor support for perceptual applications • Gesture recognition • Object detection, recognition, tracking • Speech recognition • Biometrics • Applications • Multi-modal human friendly interfaces • Intelligent digital assistants • Robotics, unmanned vehicles • Perception prosthetics

  11. The Problem with Perception Processing • consider the always-on aspect!

  12. The Problem with Perception Processing • Too slow, too much power for the embedded space! • 2.4 GHz Pentium 4 ~ 60 Watts • 400 MHz XScale ~ 800 mW • 10x or more difference in performance • Inadequate memory bandwidth • Sphinx requires 1.2 GB/s memory bandwidth • XScale delivers 64 MB/s ~ 1/19th • Our methodology • Characterize applications to find the problems • Derive acceleration architecture • History of FPUs is an analogy

  13. The Problem w/ GPPs • caches & speculation • consume significant area and energy • great when they work – a liability when they don’t • rigid communication model • data moves from memory to registers • register → execution unit → register • inability to support efficient computational pipelines • ASIC advantage • bottom line • can process anything • but not efficiently in many cases

  14. High Level Architecture • [block diagram: processor, coprocessor interface, memory controller, DRAM interface, input SRAMs, scratch SRAMs, custom accelerator, output SRAM] • stream model basis
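
The block diagram above implies a stream-style execution model. A minimal Python sketch of that flow, assuming a simple host-managed interface (the class and method names are hypothetical, not the real hardware interface):

    class Coprocessor:
        """Toy model of the stream flow: the host stages a block into the input
        SRAMs, the custom accelerator runs a kernel over it, and the host
        drains results from the output SRAM."""
        def __init__(self, kernel):
            self.kernel = kernel          # the accelerated loop body
            self.scratch = {}             # scratch SRAMs for intermediates

        def run(self, input_block):
            input_sram = list(input_block)                    # fill input SRAMs
            output_sram = self.kernel(input_sram, self.scratch)
            return output_sram                                # host reads results back

    # Usage sketch: stream fixed-size blocks from "DRAM" through the accelerator
    # results = [cop.run(dram[i:i+64]) for i in range(0, len(dram), 64)]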

  15. The FaceRec Application

  16. FaceRec In Action • [demo screenshot: recognized subject “Bobby Evans”]

  17. Application Structure • Pipeline: Image → Flesh tone → Segment Image → {ANN-based Rowley Face Detector + voter, Viola & Jones Face Detector (~200 stage AdaBoost)} → Neural Net Eye Locator → Eigenfaces Face Recognizer → Identity, Coordinates • Flesh toning: Soriano et al, Bertran et al • Segmentation: Text book approach • Rowley detector, voter: Henry Rowley, CMU • Viola & Jones’ detector: Published algorithm + Carbonetto, UBC • Eigenfaces: Re-implementation by Colorado State University
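
A hedged Python sketch of the dataflow described above; the stage functions are hypothetical placeholders for the cited algorithms, not their real implementations:

    def face_rec(image):
        mask = flesh_tone(image)                     # Soriano/Bertran flesh toning
        regions = segment(mask)                      # textbook segmentation
        faces = rowley_detect(image, regions)        # ANN detector + voter (Rowley)
        faces += viola_jones_detect(image, regions)  # ~200-stage AdaBoost cascade
        eyes = [locate_eyes(image, f) for f in faces]              # neural-net eye locator
        ids = [eigenfaces_match(image, f, e) for f, e in zip(faces, eyes)]
        return ids, faces                            # identity, face coordinates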

  18. FaceRec Characterization • ML-RSIM out-of-order processor simulator • SPARC V8 ISA, unmodified SunOS binaries

  19. Application Profile

  20. Memory System Characteristics – L1 D-Cache • ECS bonus: small cache footprint and low miss rate

  21. Memory System Characteristics – L2 Cache • L2 D$ is a waste: the L1 D$ miss rate is low and those misses pass through the L2 anyway

  22. IPC

  23. Why is IPC low? Neural Network Evaluation: Sum = Σ_{i=0}^{n} Weight[i] * Image[Input[i]]; Result = Tanh(Sum) • Dependences – e.g.: no single-cycle floating point accumulate • Indirect accesses • Several array accesses per operator • Load/store ports saturate • Need architectures that can move data efficiently
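
The inner loop behind that formula, as a small Python sketch (array names follow the slide; this is illustrative, not the actual FaceRec code). The serial dependence through Sum plus two loads per multiply-accumulate, one of them indirect, are what keep IPC low on a general-purpose pipeline:

    import math

    def neural_unit(Weight, Image, Input, n):
        # Sum = Σ_{i=0}^{n} Weight[i] * Image[Input[i]];  Result = Tanh(Sum)
        total = 0.0
        for i in range(n + 1):
            # indirect access: Input[i] selects which pixel feeds this weight
            total += Weight[i] * Image[Input[i]]
        return math.tanh(total)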

  24. Real Time Performance

  25. Example App: CMU Sphinx 3.2 • Speech recognition engine • Speaker and language independent • Acoustic model: Triphone-based, continuous • Hidden Markov Model (HMM) based • Grammar: Trigram with back-off • Open source HUB4 speech model • Broadcast news model (ABC News, NPR, etc.) • 64,000-word vocabulary

  26. CMU Sphinx 3.2 Profile • Feature vector = 13 mel-cepstral coefficients + 1st and 2nd derivatives • 10 ms of speech is compressed into 39 SP floats • iMic possibility (opt)
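
A rough sketch of how such a 39-float frame could be assembled, assuming simple frame differences for the derivatives (Sphinx actually uses a windowed regression for the deltas; this only shows the data shape):

    def feature_vector(mfcc, t):
        # mfcc[t] is the list of 13 mel-cepstral coefficients for the frame at
        # time t (10 ms hop); valid for t >= 2
        c = list(mfcc[t])
        d1 = [mfcc[t][k] - mfcc[t - 1][k] for k in range(13)]            # 1st derivative
        d1_prev = [mfcc[t - 1][k] - mfcc[t - 2][k] for k in range(13)]
        d2 = [d1[k] - d1_prev[k] for k in range(13)]                     # 2nd derivative
        return c + d1 + d2                                               # 39 SP floats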

  27. L1 D-cache Miss Rate

  28. L2 Cache Miss Rate (128B line)

  29. DRAM Bandwidth

  30. IPC

  31. Speech Conclusions • DRAM bandwidth starves the XUs (execution units) • GAU makes 100 sequential passes/sec over a 14 MB table • speech signal only creates an additional 16 KB/s • Lots of optimizations possible • break HMM and GAU dependency • GAU goes to 2x but parallelism exposed • interleave 10 samples and process GAU table once (see the sketch below) • reduction of 10x in bandwidth • but need 10x more intermediate storage • Still no chance for both speech and visual recognizers • high-performance microprocessors too watty • best EPUs too slow
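
A minimal sketch of the interleaving idea above (the table layout and the gaussian_score helper are hypothetical): one pass over the ~14 MB Gaussian table scores a block of 10 buffered frames, so table traffic drops ~10x while the per-frame scores must be stored:

    def score_block(gau_table, frames):
        # frames: ~10 buffered feature vectors; gau_table: the large Gaussian table
        scores = [[0.0] * len(gau_table) for _ in frames]
        for g, gaussian in enumerate(gau_table):   # single sequential pass over the table
            for f, frame in enumerate(frames):     # interleaved frames reuse the fetched entry
                scores[f][g] = gaussian_score(gaussian, frame)   # hypothetical helper
        return scores                              # the 10x larger intermediate storage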

  32. High Level Architecture • [same block diagram as slide 14: processor, coprocessor interface, memory controller, DRAM interface, input SRAMs, scratch SRAMs, custom accelerator, output SRAM]

  33. Simple ASIC Design Example: Matrix Multiply

    def matrix_multiply(A, B, C):
        # C is the result matrix
        for i in range(0, 16):
            for j in range(0, 16):
                C[i][j] = inner_product(A, B, i, j)

    def inner_product(A, B, row, col):
        sum = 0.0
        for i in range(0, 16):
            sum = sum + A[row][i] * B[i][col]
        return sum

  34. ASIC Accelerator Design: Matrix Multiply – Control Pattern • the control pattern is the loop structure: the i/j loop nest and the inner-product loop (code as on slide 33)

  35. ASIC Accelerator Design: Matrix Multiply – Access Pattern • the access pattern is the array addressing: A[row][i], B[i][col], C[i][j] (code as on slide 33)

  36. ASIC Accelerator Design: Matrix Multiply – Compute Pattern • the compute pattern is the multiply-accumulate in the inner loop: sum = sum + A[row][i] * B[i][col] (code as on slide 33)

  37. ASIC Accelerator Design: Matrix Multiply (code as on slide 33)

  38. ASIC Accelerator Design: Matrix Multiply • the pipelined multiply-accumulate has a 7 cycle latency, so a single inner product can only complete one accumulate every 7 cycles (code as on slide 33)

  39. ASIC Accelerator Design: Matrix Multiply • Interleave >= 7 independent inner products to hide the 7 cycle latency • Complicates address generation (code as on slide 33; see the sketch below)
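
One way to express that interleaving in software terms, as a sketch (the interleave depth of 8 and the loop restructuring are illustrative, not the slide's hardware schedule): eight output columns share each fetched A[i][k], and their independent accumulators keep a multiply-accumulate in flight despite the 7 cycle latency:

    def matrix_multiply_interleaved(A, B, C, depth=8):
        for i in range(16):
            for j0 in range(0, 16, depth):
                sums = [0.0] * depth                 # independent accumulation chains
                for k in range(16):
                    a = A[i][k]                      # one fetch shared by `depth` products
                    for d in range(depth):
                        sums[d] += a * B[k][j0 + d]  # address generation is now strided 2-D
                for d in range(depth):
                    C[i][j0 + d] = sums[d]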

  40. How can we generalize? • Decompose the loop into: • Control pattern • Access pattern • Compute pattern • Programmable h/w acceleration for each pattern (see the sketch below)
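
A minimal Python sketch of that decomposition applied to the matrix multiply example, purely to show the split; the real hardware uses a programmable loop unit, address generators, and a function-unit datapath, not Python generators:

    def control_pattern(n=16):
        # Control pattern: the loop nest, emitted as a stream of indices
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    yield i, j, k, (k == n - 1)       # flag marks end of an inner product

    def access_pattern(A, B, indices):
        # Access pattern: indices -> operand streams (what the address generators fetch)
        for i, j, k, last in indices:
            yield A[i][k], B[k][j], (i, j), last

    def compute_pattern(operands, C):
        # Compute pattern: the multiply-accumulate datapath
        acc = 0.0
        for a, b, dest, last in operands:
            acc += a * b
            if last:
                C[dest[0]][dest[1]] = acc
                acc = 0.0

    # Usage: compute_pattern(access_pattern(A, B, control_pattern()), C)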

  41. The Perception Processor Architecture Family

  42. Perception Processor Pipeline

  43. Function Unit Organization

  44. Interconnect

  45. Loop Unit

  46. Address Generator • affine/strided pattern: A[(i+k1)<<k2+k3][(j+k4)<<k5+k6] • indirect pattern: A[B[i]]
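
A sketch of the two addressing modes, assuming the usual grouping of scale-by-shift then add-offset and a row-major flattened SRAM (the grouping and the row_stride parameter are assumptions, not stated on the slide):

    def affine_address(i, j, k1, k2, k3, k4, k5, k6, row_stride):
        # Strided pattern, read as A[((i + k1) << k2) + k3][((j + k4) << k5) + k6]
        row = ((i + k1) << k2) + k3
        col = ((j + k4) << k5) + k6
        return row * row_stride + col        # flat address into the input SRAM

    def indirect_address(B, i):
        # Indirect pattern: A[B[i]] -- the index itself is streamed from memory
        return B[i]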

  47. Experimental Method • Measure processor power on • 2.4 GHz Pentium 4, 0.13u process • 400 MHz XScale, 0.18u process • Perception Processor • 1 GHz, 0.13u process (Berkeley Predictive Tech Model) • Verilog, MCL HDLs • Synthesized using Synopsys Design Compiler • Fanout-based heuristic wire loads • Spice (Nanosim) simulation yields current waveform • Numerical integration to calculate energy (see the sketch below) • ASICs in 0.25u process • Normalize 0.18u, 0.25u energy and delay numbers
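
The energy step amounts to E = Vdd · ∫ I(t) dt over the simulated supply-current waveform; a minimal sketch using the trapezoidal rule (the sample arrays and the 1.2 V supply value in the usage note are only examples, not from the slide):

    def energy_from_waveform(t, i_dd, vdd):
        # t: sample times (s), i_dd: supply current samples (A), vdd: supply voltage (V)
        e = 0.0
        for k in range(1, len(t)):
            e += 0.5 * (i_dd[k] + i_dd[k - 1]) * (t[k] - t[k - 1])   # trapezoidal rule
        return vdd * e                                               # Joules

    # Usage: energy = energy_from_waveform(t, i_dd, vdd=1.2)
    #        avg_power = energy / (t[-1] - t[0])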

  48. Benchmarks • Visual feature recognition • Erode, Dilate: Image segmentation operators • Fleshtone: NCC flesh tone detector • Viola, Rowley: Face detectors • Speech recognition • HMM: 5 state Hidden Markov Model • GAU: 39 element, 8 mixture Gaussian • DSP • FFT: 128 point, complex to complex, floating point • FIR: 32 tap, integer • Encryption • Rijndael: 128 bit key, 576 byte packets

  49. Results: IPC • Mean IPC = 3.3x R14K

  50. Results: Throughput • Mean throughput = 1.75x Pentium, 0.41x ASIC
