big apps in small packages • Al Davis • Binu Mathew, Ali Ibrahim, Mike Parker, Karthik Ramani
Roadmap • challenges in embedded computing • details of our current solution • future • current issues • vision of where this is headed • typical concluding remarks
Embedded Computing Characteristics • Historically • narrow application-specific focus • typically cheap, low-power, provide just enough compute power • niche filled by small microcontroller/DSP devices • AND often ASIC component(s) • New Pressures • world goes bonkers on mobility and the web • expects ubiquitous wireless & tethered networks • expects better and cheaper everything • sensors, microphones & cameras become free • now we’re talking real computing
Embedded Environments • Intimate connection between the environment and the electronics • Problems with diverse environments • usually • thermally limited • energy constrained • physical accessibility/stability is not guaranteed • temporary failure is guaranteed • catastrophic failure must be viewed as probable • malicious attacks should be viewed as possible
New Look for ECS • Sophisticated application suites • not single algorithms – e.g. • 3G and 4G cellular handsets • multiple channels and multiple encoding models • plus the usual DSP stuff • process what is streaming in from the net • includes real time media & web access • process the sensor, microphone, and camera streams • plus network information from the neighborhood • since things are starting to happen in groups • wide range of services • dynamic selection • no single app will do
ECS Economics • Traditional reliance on the ASIC design cycle • lengthy IC design: > 1 year typical • little re-use • IP import works but there are many pitfalls • soft macro: HDL code synthesizes inefficiently • hard macro: macroblock forces process and layout issues • turning an IC is costly • even when it works the first time • ECS product cycles • lifetime similar to a mayfly • need next improved version “real soon now” • Result • sell monster volumes or lose
New ECS Quandary • Need unprecedented levels of energy efficiency • 1-3 orders of magnitude common • examples shortly • Also need more generality • ASIC reliance is problematic • Neither Moore’s law nor IC technology will help • no Moore’s law for batteries • new IC technology has some problems • fast leaky transistors create an energy problem • leakage power starting to overtake active power • noise floor is rising rapidly as Vdd and Vth close
Better Living Through Physics and Chemistry • Historical performance • 65% from process (physics and chemistry) • 35% from architectural innovation • New marching orders • true for mainstream but even more important for ECS • process won’t help • pressure on the architecture to compensate • need ASIC-like energy efficiency but CPU-like generality • NEW ARCHITECTURES • we’ll take whatever process help we can get
The Hook • IF you buy what you’ve heard so far • THEN you’ve been set up • for the next segment • embedded architecture research at Utah • short story • architectural style has evolved • investigated in 2 big app arenas • perception (focus) • 3G and 4G cellular telephony (ignored in this talk)
What is Perception Processing ? • Ubiquitous computing needs natural human interfaces • Processor support for perceptual applications • Gesture recognition • Object detection, recognition, tracking • Speech recognition • Biometrics • Applications • Multi-modal human friendly interfaces • Intelligent digital assistants • Robotics, unmanned vehicles • Perception prosthetics
The Problem with Perception Processing • consider the always-on aspect!
The Problem with Perception Processing • Too slow, too much power for embedded space! • 2.4 GHz Pentium 4 ~ 60 Watts • 400 MHz XScale ~ 800 mW • 10x or more difference in performance • Inadequate memory bandwidth • Sphinx requires 1.2 GB/s memory bandwidth • XScale delivers 64 MB/s ~ 1/19th • Our methodology • Characterize applications to find the problems • Derive acceleration architecture • History of FPUs is an analogy
The Problem with GPPs • caches & speculation • consume significant area and energy • great when they work – a liability when they don’t • rigid communication model • data moves from memory to registers • register → execution unit → register • inability to support efficient computational pipelines – the ASIC advantage • bottom line • can process anything • but not efficiently in many cases
High Level Architecture [block diagram: Processor, Coprocessor Interface, Memory Controller, DRAM Interface, Input SRAMs, Output SRAM, Scratch SRAMs, Custom Accelerator] • stream model basis
FaceRec In Action Bobby Evans
Application Structure [pipeline diagram: Image → Flesh tone → Segment Image → face detectors (ANN-based Rowley Face Detector + Neural Net Eye Locator; Viola & Jones Face Detector, ~200 stage AdaBoost) → voter → Eigenfaces Face Recognizer → Identity, Coordinates] • Flesh toning: Soriano et al, Bertran et al • Segmentation: textbook approach • Rowley detector, voter: Henry Rowley, CMU • Viola & Jones detector: published algorithm + Carbonetto, UBC • Eigenfaces: re-implementation by Colorado State University
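The stage ordering above can be sketched as plain function composition; the stage functions below are hypothetical stand-ins that only trace the dataflow, not the actual detectors:

```python
def face_rec_pipeline(image, stages):
    """Thread the image through each stage in order."""
    result = image
    for stage in stages:
        result = stage(result)
    return result

# Stand-in stages: each just tags the data so the flow is visible.
def flesh_tone(img):    return ("fleshtoned", img)
def segment(data):      return ("segmented", data)
def detect_faces(data): return ("faces", data)     # Rowley ANN + Viola & Jones, then a voter
def recognize(data):    return ("identity", data)  # eigenfaces recognizer

out = face_rec_pipeline("raw-image",
                        [flesh_tone, segment, detect_faces, recognize])
```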
FaceRec Characterization • ML-RSIM out of order processor simulator • SPARC V8 ISA, Unmodified SunOS binaries
Memory System Characteristics – L1 D Cache • ECS bonus: small cache footprint and low miss rate
Memory System Characteristics – L2 Cache • L2 D$ is a waste: low L1 D$ miss rate, and those misses pass through
Why is IPC low ? • Neural network evaluation: Sum = Σ_{i=0..n} Weight[i] * Image[Input[i]]; Result = tanh(Sum) • Dependences – e.g.: no single-cycle floating point accumulate • Indirect accesses • Several array accesses per operator • Load/store ports saturate • Need architectures that can move data efficiently
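A minimal Python sketch of the neuron evaluation above, showing the two serial hazards: a dependent indirect load per term and a single running sum (names are illustrative, not the Rowley code):

```python
import math

def evaluate_neuron(weight, image, inputs):
    """One neuron: a dot product with an indirect access per term,
    followed by a tanh activation."""
    total = 0.0
    for i in range(len(weight)):
        # two dependent loads per term: inputs[i], then image[inputs[i]]
        total += weight[i] * image[inputs[i]]
    # serial dependence: each add waits on the previous running sum
    return math.tanh(total)
```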
Example App: CMU Sphinx 3.2 • Speech recognition engine • Speaker and language independent • Acoustic model: Triphone based, continuous • Hidden Markov Model (HMM) based • Grammar: Trigram with back-off • Open source HUB4 speech model • Broadcast news model (ABC news, NPR etc) • 64000 word vocabulary
CMU Sphinx 3.2 Profile • Feature Vector = 13 Mel + 1st and 2nd derivative • 10 ms of speech is compressed into 39 SP floats • iMic possibility (opt)
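The arithmetic behind that compression, as a quick sanity check (39 single-precision floats per 10 ms frame, 4 bytes each):

```python
# Feature stream rate for the Sphinx front end described above.
floats_per_frame = 13 * 3        # 13 Mel coefficients + 1st and 2nd derivatives
frames_per_second = 1000 // 10   # one frame per 10 ms of speech
bytes_per_second = floats_per_frame * 4 * frames_per_second
print(bytes_per_second)          # 15600 bytes/s, roughly 16 KB/s
```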
Speech Conclusions • DRAM bandwidth starves the execution units • GAU makes 100 sequential passes/sec over a 14 MB table • the speech signal itself only adds 16 KB/s • Lots of optimizations possible • break the HMM and GAU dependency • GAU work goes to 2x but parallelism is exposed • interleave 10 samples and process the GAU table once • 10x reduction in bandwidth • but needs 10x more intermediate storage • Still no chance for both speech and visual recognizers • high-performance microprocessors too watty • best EPUs too slow
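A toy sketch of the interleaving optimization described above, assuming a hypothetical per-Gaussian `score` function; the point is only the loop order, one pass over the big table scoring a whole batch of frames:

```python
def score(g, frame):
    # stand-in for the real log-likelihood of one Gaussian mixture component
    return g * frame

def score_batched(gaussians, frames):
    """Stream the Gaussian table once and score a batch of frames against
    each entry, instead of one full pass over the table per frame.
    Table traffic drops by len(frames); intermediate storage grows by
    the same factor (10 frames -> 10x less bandwidth, 10x more state)."""
    scores = [0.0] * len(frames)
    for g in gaussians:                  # single pass over the 14 MB table
        for f, frame in enumerate(frames):
            scores[f] += score(g, frame)
    return scores
```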
High Level Architecture [block diagram: Processor, Coprocessor Interface, Memory Controller, DRAM Interface, Input SRAMs, Output SRAM, Scratch SRAMs, Custom Accelerator]
Simple ASIC Design Example: Matrix Multiply

def matrix_multiply(A, B, C):
    # C is the result matrix
    for i in range(0, 16):
        for j in range(0, 16):
            C[i][j] = inner_product(A, B, i, j)

def inner_product(A, B, row, col):
    sum = 0.0
    for i in range(0, 16):
        sum = sum + A[row][i] * B[i][col]
    return sum
ASIC Accelerator Design: Matrix Multiply – Control Pattern
ASIC Accelerator Design: Matrix Multiply – Access Pattern
ASIC Accelerator Design: Matrix Multiply – Compute Pattern
ASIC Accelerator Design: Matrix Multiply – 7 cycle latency
ASIC Accelerator Design: Matrix Multiply – Interleave >= 7 inner products • complicates address generation
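A software model of that interleaving, a sketch rather than the accelerator itself: keeping `depth` independent accumulators in flight means consecutive multiply-accumulates never depend on each other, which is how a 7-cycle MAC pipeline stays full:

```python
def matmul_interleaved(A, B, n=16, depth=8):
    """Model of interleaving >= 7 inner products so a 7-cycle
    multiply-accumulate pipeline never stalls on the running sum.
    'depth' accumulators are live at once; a real ASIC would issue
    one MAC per cycle, rotating across them."""
    C = [[0.0] * n for _ in range(n)]
    cells = [(i, j) for i in range(n) for j in range(n)]
    for base in range(0, len(cells), depth):
        group = cells[base:base + depth]
        sums = [0.0] * len(group)              # one accumulator per in-flight product
        for k in range(n):                     # each k-step touches every accumulator,
            for s, (i, j) in enumerate(group): # so there is no back-to-back dependence
                sums[s] += A[i][k] * B[k][j]
        for s, (i, j) in enumerate(group):
            C[i][j] = sums[s]
    return C
```

Note the cost the slide flags: the address generator must now track `depth` (row, col) pairs at once instead of one.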
How can we generalize ? • Decompose loop into: • Control pattern • Access pattern • Compute pattern Programmable h/w acceleration for each pattern
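One way to sketch that decomposition in Python, splitting the matrix-multiply loop from the earlier slide into the three patterns (the generator structure is illustrative, not the hardware interface):

```python
def control_pattern(n=16):
    # control pattern: the loop nest, emitting one work item per output element
    for i in range(n):
        for j in range(n):
            yield i, j

def access_pattern(A, B, row, col, n=16):
    # access pattern: the operand streams the datapath will consume
    for i in range(n):
        yield A[row][i], B[i][col]

def compute_pattern(operands):
    # compute pattern: a pure multiply-accumulate over whatever arrives
    total = 0.0
    for a, b in operands:
        total += a * b
    return total

def matrix_multiply(A, B, C, n=16):
    for i, j in control_pattern(n):
        C[i][j] = compute_pattern(access_pattern(A, B, i, j, n))
```

Programming the accelerator then amounts to configuring each of the three units independently rather than synthesizing a fixed-function block.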
Address Generator • affine/strided pattern: A[(i+k1)<<k2+k3][(j+k4)<<k5+k6] • indirect pattern: A[B[i]]
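A sketch of the two address modes, assuming the shift scales the index before the constant offset is added, i.e. ((i+k1)&lt;&lt;k2)+k3 (the slide's expression leaves the parenthesization implicit):

```python
def affine_address(i, j, k1, k2, k3, k4, k5, k6):
    """Affine mode: row/col indices built from shifts and adds only,
    which is cheap to implement in hardware."""
    return ((i + k1) << k2) + k3, ((j + k4) << k5) + k6

def indirect_address(B, i):
    """Indirect mode: one table lookup yields the next index, as in A[B[i]]."""
    return B[i]
```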
Experimental Method • Measure processor power on • 2.4 GHz Pentium 4, 0.13u process • 400 MHz XScale, 0.18u process • Perception Processor • 1 GHz, 0.13u process (Berkeley Predictive Tech Model) • Verilog, MCL HDLs • Synthesized using Synopsys Design Compiler • Fanout based heuristic wire loads • Spice (Nanosim) simulation yields current waveform • Numerical integration to calculate energy • ASICs in 0.25u process • Normalize 0.18u, 0.25u energy and delay numbers
Benchmarks • Visual feature recognition • Erode, Dilate: image segmentation operators • Fleshtone: NCC flesh tone detector • Viola, Rowley: face detectors • Speech recognition • HMM: 5 state Hidden Markov Model • GAU: 39 element, 8 mixture Gaussian • DSP • FFT: 128 point, complex to complex, floating point • FIR: 32 tap, integer • Encryption • Rijndael: 128 bit key, 576 byte packets
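As a reference model for the FIR benchmark, a direct-form integer FIR (written generically for any tap count; this is the textbook definition, not the accelerator implementation):

```python
def fir(samples, taps):
    """Direct-form FIR: each output is the dot product of the tap
    weights with a sliding window of the most recent input samples."""
    n = len(taps)
    out = []
    for t in range(len(samples) - n + 1):
        acc = 0
        for k in range(n):
            acc += taps[k] * samples[t + n - 1 - k]  # newest sample hits taps[0]
        out.append(acc)
    return out
```

The benchmark's 32-tap configuration corresponds to `len(taps) == 32`; the inner multiply-accumulate loop maps directly onto the compute pattern described earlier.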
Results: IPC • Mean IPC = 3.3x R14K
Results: Throughput • Mean Throughput = 1.75x Pentium, 0.41x ASIC