big apps in small packages • Al Davis • Binu Mathew, Ali Ibrahim, Mike Parker, Karthik Ramani
Roadmap • challenges in embedded computing • details of our current solution • future • current issues • vision of where this is headed • typical concluding remarks
Embedded Computing Characteristics • Historically • narrow application-specific focus • typically cheap, low-power, provide just enough compute power • niche filled by small microcontroller/DSP devices • AND often ASIC component(s) • New Pressures • world goes bonkers on mobility and the web • expects ubiquitous wireless & tethered networks • expects better and cheaper everything • sensors, microphones & cameras become free • now we’re talking real computing
Embedded Environments • Intimate connection between the environment and the electronics • Problems with diverse environments • usually • thermally limited • energy constrained • physical accessibility/stability is not guaranteed • temporary failure is guaranteed • catastrophic failure must be viewed as probable • malicious attacks should be viewed as possible
New Look for ECS • Sophisticated application suites • not single algorithms – e.g. • 3G and 4G cellular handsets • multiple channels and multiple encoding models • plus the usual DSP stuff • process what is streaming in from the net • includes real time media & web access • process the sensor, microphone, and camera streams • plus network information from the neighborhood • since things are starting to happen in groups • wide range of services • dynamic selection • no single app will do
ECS Economics • Traditional reliance on the ASIC design cycle • lengthy IC design: > 1 year typical • little re-use • IP import works but there are many pitfalls • soft macro: HDL code synthesizes inefficiently • hard macro: macroblock forces process and layout issues • turning an IC is costly • even when it works the first time • ECS product cycles • lifetime similar to a mayfly • need next improved version “real soon now” • Result • sell monster volumes or lose
New ECS Quandary • Need unprecedented levels of energy efficiency • 1-3 orders of magnitude common • examples shortly • Also need more generality • ASIC reliance is problematic • Neither Moore’s law nor IC technology will help • no Moore’s law for batteries • new IC technology has some problems • fast leaky transistors create an energy problem • leakage power starting to overtake active power • noise floor is rising rapidly as Vdd and Vth close
Better Living Through Physics and Chemistry • Historical performance • 65% from process (physics and chemistry) • 35% from architectural innovation • New marching orders • true for mainstream but even more important for ECS • process won’t help • pressure on the architecture to compensate • need ASIC-like energy efficiency but CPU-like generality • NEW ARCHITECTURES • we’ll take whatever process help we can get
The Hook • IF you buy what you’ve heard so far • THEN you’ve been set up • for the next segment • embedded architecture research at Utah • short story • architectural style has evolved • investigated in 2 big app arenas • perception (focus) • 3G and 4G cellular telephony (ignored in this talk)
What is Perception Processing ? • Ubiquitous computing needs natural human interfaces • Processor support for perceptual applications • Gesture recognition • Object detection, recognition, tracking • Speech recognition • Biometrics • Applications • Multi-modal human friendly interfaces • Intelligent digital assistants • Robotics, unmanned vehicles • Perception prosthetics
The Problem with Perception Processing • consider the always-on aspect!
The Problem with Perception Processing • Too slow, too much power for embedded space! • 2.4 GHz Pentium 4 ~ 60 Watts • 400 MHz XScale ~ 800 mW • 10x or more difference in performance • Inadequate memory bandwidth • Sphinx requires 1.2 GB/s memory bandwidth • XScale delivers 64 MB/s ~ 1/19th • Our methodology • Characterize applications to find the problems • Derive acceleration architecture • History of FPUs is an analogy
The Problem with GPPs • caches & speculation • consume significant area and energy • great when they work – a liability when they don’t • rigid communication model • data moves from memory to registers • register → execution unit → register • inability to support efficient computational pipelines – the ASIC advantage • bottom line • can process anything • but not efficiently in many cases
High Level Architecture [block diagram: Processor, Coprocessor Interface, Memory Controller, DRAM Interface, Input SRAMs, Output SRAM, Scratch SRAMs, Custom Accelerator] • stream model basis
FaceRec In Action Bobby Evans
Application Structure [pipeline diagram: Image → Flesh tone → Segment Image → face detectors (ANN-based Rowley Face Detector + Neural Net Eye Locator; Viola & Jones Face Detector, ~200 stage AdaBoost) → voter → Eigenfaces Face Recognizer → Identity, Coordinates] • Flesh toning: Soriano et al, Bertran et al • Segmentation: textbook approach • Rowley detector, voter: Henry Rowley, CMU • Viola & Jones detector: published algorithm + Carbonetto, UBC • Eigenfaces: re-implementation by Colorado State University
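The stage ordering above can be sketched as plain function composition; the stage functions below are hypothetical stand-ins that only trace the dataflow, not the actual detectors:

```python
def face_rec_pipeline(image, stages):
    """Thread the image through each stage in order."""
    result = image
    for stage in stages:
        result = stage(result)
    return result

# Stand-in stages: each just tags the data so the flow is visible.
def flesh_tone(img):    return ("fleshtoned", img)
def segment(data):      return ("segmented", data)
def detect_faces(data): return ("faces", data)     # Rowley ANN + Viola & Jones, then a voter
def recognize(data):    return ("identity", data)  # eigenfaces recognizer

out = face_rec_pipeline("raw-image",
                        [flesh_tone, segment, detect_faces, recognize])
```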
FaceRec Characterization • ML-RSIM out of order processor simulator • SPARC V8 ISA, Unmodified SunOS binaries
Memory System Characteristics – L1 D Cache • ECS bonus: small cache footprint and low miss rate
Memory System Characteristics – L2 Cache • L2 D$ is a waste: low L1 D$ miss rate, and those misses pass through
Why is IPC low ? • Neural network evaluation: Sum = Σ_{i=0..n} Weight[i] * Image[Input[i]]; Result = tanh(Sum) • Dependences – e.g.: no single-cycle floating point accumulate • Indirect accesses • Several array accesses per operator • Load/store ports saturate • Need architectures that can move data efficiently
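A minimal Python sketch of the neuron evaluation above, showing the two serial hazards: a dependent indirect load per term and a single running sum (names are illustrative, not the Rowley code):

```python
import math

def evaluate_neuron(weight, image, inputs):
    """One neuron: a dot product with an indirect access per term,
    followed by a tanh activation."""
    total = 0.0
    for i in range(len(weight)):
        # two dependent loads per term: inputs[i], then image[inputs[i]]
        total += weight[i] * image[inputs[i]]
    # serial dependence: each add waits on the previous running sum
    return math.tanh(total)
```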
Example App: CMU Sphinx 3.2 • Speech recognition engine • Speaker and language independent • Acoustic model: Triphone based, continuous • Hidden Markov Model (HMM) based • Grammar: Trigram with back-off • Open source HUB4 speech model • Broadcast news model (ABC news, NPR etc) • 64000 word vocabulary
CMU Sphinx 3.2 Profile • Feature Vector = 13 Mel + 1st and 2nd derivative • 10 ms of speech is compressed into 39 SP floats • iMic possibility (opt)
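The arithmetic behind that compression, as a quick sanity check (39 single-precision floats per 10 ms frame, 4 bytes each):

```python
# Feature stream rate for the Sphinx front end described above.
floats_per_frame = 13 * 3        # 13 Mel coefficients + 1st and 2nd derivatives
frames_per_second = 1000 // 10   # one frame per 10 ms of speech
bytes_per_second = floats_per_frame * 4 * frames_per_second
print(bytes_per_second)          # 15600 bytes/s, roughly 16 KB/s
```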
Speech Conclusions • DRAM bandwidth starves the execution units • GAU makes 100 sequential passes/sec over a 14 MB table • the speech signal itself only adds 16 KB/s • Lots of optimizations possible • break the HMM and GAU dependency • GAU work goes to 2x but parallelism is exposed • interleave 10 samples and process the GAU table once • 10x reduction in bandwidth • but needs 10x more intermediate storage • Still no chance for both speech and visual recognizers • high-performance microprocessors too watty • best EPUs too slow
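A toy sketch of the interleaving optimization described above, assuming a hypothetical per-Gaussian `score` function; the point is only the loop order, one pass over the big table scoring a whole batch of frames:

```python
def score(g, frame):
    # stand-in for the real log-likelihood of one Gaussian mixture component
    return g * frame

def score_batched(gaussians, frames):
    """Stream the Gaussian table once and score a batch of frames against
    each entry, instead of one full pass over the table per frame.
    Table traffic drops by len(frames); intermediate storage grows by
    the same factor (10 frames -> 10x less bandwidth, 10x more state)."""
    scores = [0.0] * len(frames)
    for g in gaussians:                  # single pass over the 14 MB table
        for f, frame in enumerate(frames):
            scores[f] += score(g, frame)
    return scores
```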
High Level Architecture [block diagram: Processor, Coprocessor Interface, Memory Controller, DRAM Interface, Input SRAMs, Output SRAM, Scratch SRAMs, Custom Accelerator]
Simple ASIC Design Example: Matrix Multiply

def matrix_multiply(A, B, C):
    # C is the result matrix
    for i in range(0, 16):
        for j in range(0, 16):
            C[i][j] = inner_product(A, B, i, j)

def inner_product(A, B, row, col):
    sum = 0.0
    for i in range(0, 16):
        sum = sum + A[row][i] * B[i][col]
    return sum
ASIC Accelerator Design: Matrix Multiply – Control Pattern
ASIC Accelerator Design: Matrix Multiply – Access Pattern
ASIC Accelerator Design: Matrix Multiply – Compute Pattern
ASIC Accelerator Design: Matrix Multiply – 7 cycle latency
ASIC Accelerator Design: Matrix Multiply – Interleave >= 7 inner products • complicates address generation
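A software model of that interleaving, a sketch rather than the accelerator itself: keeping `depth` independent accumulators in flight means consecutive multiply-accumulates never depend on each other, which is how a 7-cycle MAC pipeline stays full:

```python
def matmul_interleaved(A, B, n=16, depth=8):
    """Model of interleaving >= 7 inner products so a 7-cycle
    multiply-accumulate pipeline never stalls on the running sum.
    'depth' accumulators are live at once; a real ASIC would issue
    one MAC per cycle, rotating across them."""
    C = [[0.0] * n for _ in range(n)]
    cells = [(i, j) for i in range(n) for j in range(n)]
    for base in range(0, len(cells), depth):
        group = cells[base:base + depth]
        sums = [0.0] * len(group)              # one accumulator per in-flight product
        for k in range(n):                     # each k-step touches every accumulator,
            for s, (i, j) in enumerate(group): # so there is no back-to-back dependence
                sums[s] += A[i][k] * B[k][j]
        for s, (i, j) in enumerate(group):
            C[i][j] = sums[s]
    return C
```

Note the cost the slide flags: the address generator must now track `depth` (row, col) pairs at once instead of one.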
How can we generalize ? • Decompose loop into: • Control pattern • Access pattern • Compute pattern Programmable h/w acceleration for each pattern
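One way to sketch that decomposition in Python, splitting the matrix-multiply loop from the earlier slide into the three patterns (the generator structure is illustrative, not the hardware interface):

```python
def control_pattern(n=16):
    # control pattern: the loop nest, emitting one work item per output element
    for i in range(n):
        for j in range(n):
            yield i, j

def access_pattern(A, B, row, col, n=16):
    # access pattern: the operand streams the datapath will consume
    for i in range(n):
        yield A[row][i], B[i][col]

def compute_pattern(operands):
    # compute pattern: a pure multiply-accumulate over whatever arrives
    total = 0.0
    for a, b in operands:
        total += a * b
    return total

def matrix_multiply(A, B, C, n=16):
    for i, j in control_pattern(n):
        C[i][j] = compute_pattern(access_pattern(A, B, i, j, n))
```

Programming the accelerator then amounts to configuring each of the three units independently rather than synthesizing a fixed-function block.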
Address Generator • affine/strided pattern: A[(i+k1)<<k2+k3][(j+k4)<<k5+k6] • indirect pattern: A[B[i]]
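A sketch of the two address modes, assuming the shift scales the index before the constant offset is added, i.e. ((i+k1)&lt;&lt;k2)+k3 (the slide's expression leaves the parenthesization implicit):

```python
def affine_address(i, j, k1, k2, k3, k4, k5, k6):
    """Affine mode: row/col indices built from shifts and adds only,
    which is cheap to implement in hardware."""
    return ((i + k1) << k2) + k3, ((j + k4) << k5) + k6

def indirect_address(B, i):
    """Indirect mode: one table lookup yields the next index, as in A[B[i]]."""
    return B[i]
```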
Experimental Method • Measure processor power on • 2.4 GHz Pentium 4, 0.13u process • 400 MHz XScale, 0.18u process • Perception Processor • 1 GHz, 0.13u process (Berkeley Predictive Tech Model) • Verilog, MCL HDLs • Synthesized using Synopsys Design Compiler • Fanout based heuristic wire loads • Spice (Nanosim) simulation yields current waveform • Numerical integration to calculate energy • ASICs in 0.25u process • Normalize 0.18u, 0.25u energy and delay numbers
Benchmarks • Visual feature recognition • Erode, Dilate: image segmentation operators • Fleshtone: NCC flesh tone detector • Viola, Rowley: face detectors • Speech recognition • HMM: 5 state Hidden Markov Model • GAU: 39 element, 8 mixture Gaussian • DSP • FFT: 128 point, complex to complex, floating point • FIR: 32 tap, integer • Encryption • Rijndael: 128 bit key, 576 byte packets
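As a reference model for the FIR benchmark, a direct-form integer FIR (written generically for any tap count; this is the textbook definition, not the accelerator implementation):

```python
def fir(samples, taps):
    """Direct-form FIR: each output is the dot product of the tap
    weights with a sliding window of the most recent input samples."""
    n = len(taps)
    out = []
    for t in range(len(samples) - n + 1):
        acc = 0
        for k in range(n):
            acc += taps[k] * samples[t + n - 1 - k]  # newest sample hits taps[0]
        out.append(acc)
    return out
```

The benchmark's 32-tap configuration corresponds to `len(taps) == 32`; the inner multiply-accumulate loop maps directly onto the compute pattern described earlier.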
Results: IPC • Mean IPC = 3.3x R14K
Results: Throughput • Mean Throughput = 1.75x Pentium, 0.41x ASIC