
Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency. Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek


Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency

Elad Alon, Krste Asanovic (Director),

Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek

[email protected]

http://aspire.eecs.berkeley.edu



Future Application Drivers

Augmented Reality

Pervasive Speech

Robotics

BIG DATA

Social Networks

Environment

Personalized Medicine



Compute Energy “Iron Law”

  • When power is constrained, need better energy efficiency for more performance

  • Where performance is constrained (real-time), want better energy efficiency to lower power

    ⇒ Improving energy efficiency is a critical goal for all future systems and workloads

Performance (Tasks/Second) = Power (Joules/Second) × Energy Efficiency (Tasks/Joule)
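As a worked illustration of the iron law, the sketch below plugs in made-up numbers for the two cases on this slide. The power budget, task rate, and efficiency values are assumptions chosen for easy arithmetic, not figures from the talk.

```python
def performance(power_watts: float, tasks_per_joule: float) -> float:
    """Iron law: tasks/second = (joules/second) * (tasks/joule)."""
    return power_watts * tasks_per_joule


def power_needed(tasks_per_second: float, tasks_per_joule: float) -> float:
    """Iron law rearranged: joules/second needed to sustain a fixed task rate."""
    return tasks_per_second / tasks_per_joule


# Power-constrained case (hypothetical 2 W budget):
# doubling efficiency doubles performance.
perf_before = performance(2.0, 50.0)
perf_after = performance(2.0, 100.0)

# Performance-constrained case (hypothetical 30 tasks/s real-time target):
# doubling efficiency halves the power drawn.
power_before = power_needed(30.0, 50.0)
power_after = power_needed(30.0, 100.0)

print(perf_before, perf_after, power_before, power_after)
```

In both cases the only lever that helps is the efficiency term, which is the point the slide makes.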



Good News: Moore’s Law Continues

Cheaper!

More Transistors/Chip

“Cramming more components onto integrated circuits”, Gordon E. Moore, Electronics, 1965


Bad News: Dennard (Voltage) Scaling Over

Moore, ISSCC Keynote, 2003

Dennard Scaling

Post-Dennard Scaling


1st Impact of End of Scaling: End of Sequential Processor Era


Parallelism: A One-Time Gain

Use more, slower cores for better energy efficiency.

Either

  • simpler cores, or

  • run cores at lower Vdd/frequency

  • Even simpler general-purpose microarchitectures?

    • Limited by smallest sensible core

  • Even Lower Vdd/Frequency?

    • Limited by Vdd/Vt scaling, errors

  • Now what?
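A minimal sketch of why more, slower cores can be more energy efficient, using the standard dynamic switching power model P = C · Vdd² · f. The capacitance, voltages, and frequencies are hypothetical operating points, and the sketch ignores leakage, imperfect parallel speedup, and the Vdd/Vt and error limits named above.

```python
def dynamic_power(c_eff: float, vdd: float, freq_hz: float) -> float:
    """Standard dynamic switching power model: P = C * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq_hz


def throughput(cores: int, freq_hz: float) -> float:
    """Idealized throughput: perfectly parallel work, one op per cycle per core."""
    return cores * freq_hz


C_EFF = 1e-9  # hypothetical switched capacitance per core, in farads

# One fast core: 1.0 V at 2 GHz (assumed operating point).
p_fast = dynamic_power(C_EFF, 1.0, 2e9)
t_fast = throughput(1, 2e9)

# Two slower cores: 0.7 V at 1 GHz each (assumed operating point).
p_slow = 2 * dynamic_power(C_EFF, 0.7, 1e9)
t_slow = throughput(2, 1e9)

ops_per_joule_fast = t_fast / p_fast
ops_per_joule_slow = t_slow / p_slow

print(f"fast: {p_fast:.2f} W, slow: {p_slow:.2f} W at equal throughput")
```

Under these assumptions the two-core configuration delivers the same throughput at roughly half the power. The gain cannot be repeated indefinitely, since voltage cannot keep dropping and cores cannot keep getting simpler.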


2nd Impact of End of Scaling: “Dark Silicon”

Cannot switch all transistors at full frequency!

[Muller, ARM CTO, 2009]

  • No savior device technology on horizon.

  • Future energy-efficiency innovations must be above transistor level.



The End of General-Purpose Processors?

  • Most computing happens in specialized, heterogeneous processors

    • Can be 100-1000X more efficient than a general-purpose processor

  • Challenges:

    • Hardware design costs

    • Software development costs

NVIDIA Tegra2


The Real Scaling Challenge: Communication

As transistors become smaller and cheaper, communication dominates performance and energy

All scales:

  • Across chip

  • Up and down memory hierarchy

  • Chip-to-chip

  • Board-to-board

  • Rack-to-rack



ASPIRE: From Better to Best

Specialize and optimize communication and computation across whole stack from applications to hardware

  • What is the best we can do?

    • For a fixed target technology (e.g., 7nm)

  • Can we prove a bound?

  • Can we design an implementation approaching the bound?

    ⇒ Provably Optimal Implementations



Communication-Avoiding Algorithms: Algorithm Cost Measures

[Figure: a sequential machine (CPU, cache, DRAM) and a parallel machine (several CPU + DRAM nodes)]

  • Arithmetic (FLOPS)

  • Communication: moving data between

    • levels of a memory hierarchy (sequential case)

    • processors over a network (parallel case).
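The two cost measures can be combined into a simple runtime estimate: time per flop times flops, plus time per word times words moved, plus time per message times messages. The sketch below applies that to dense matrix multiplication in the sequential case, comparing an idealized untiled word count with a tiled one. The machine constants, the fast-memory size, and the simplified counting are all assumptions for illustration.

```python
import math


def words_moved_naive(n: int) -> int:
    """Idealized worst case: both operands of every multiply-add come from slow memory."""
    return 2 * n ** 3


def words_moved_blocked(n: int, fast_words: int) -> int:
    """Tiled matmul with three b x b tiles resident in fast memory (3*b*b <= fast_words)."""
    b = math.isqrt(fast_words // 3)
    if b == 0:
        raise ValueError("fast memory must hold at least three words")
    tiles = math.ceil(n / b)
    # Each of the tiles^3 tile products loads one A tile and one B tile;
    # each entry of C is read once and written once.
    return 2 * tiles ** 3 * b * b + 2 * n * n


def runtime(flops: float, words: float, messages: float,
            gamma: float, beta: float, alpha: float) -> float:
    """Arithmetic time plus communication time (bandwidth and latency terms)."""
    return gamma * flops + beta * words + alpha * messages


# Hypothetical machine: seconds per flop, per word, per message.
GAMMA, BETA, ALPHA = 1e-10, 1e-8, 1e-6
WORDS_PER_MESSAGE = 1024  # assumed transfer size

n, fast_words = 1024, 12288
flops = 2 * n ** 3
w_naive = words_moved_naive(n)
w_blocked = words_moved_blocked(n, fast_words)

t_naive = runtime(flops, w_naive, w_naive / WORDS_PER_MESSAGE, GAMMA, BETA, ALPHA)
t_blocked = runtime(flops, w_blocked, w_blocked / WORDS_PER_MESSAGE, GAMMA, BETA, ALPHA)

print(f"words moved: naive {w_naive}, blocked {w_blocked}")
print(f"estimated time: naive {t_naive:.2f} s, blocked {t_blocked:.2f} s")
```

The flop count is identical in both cases; only the words-moved term changes, which is why communication is treated as its own cost measure.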



Modeling Runtime & Energy



A few examples of speedups

  • Matrix multiplication

    • Up to 12x on IBM BG/P for n=8K on 64K cores; 95% less communication

  • QR decomposition (used in least squares, data mining, …)

    • Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10

    • Up to 6.7x on 16-proc. Pentium III cluster, for 100K x 200

    • Up to 13x on Tesla C2050 / Fermi, for 110k x 100

    • Up to 4x on Grid of 4 cities (Dongarra, Langou et al)

    • “infinite speedup” for out-of-core on PowerPC laptop

      • LAPACK thrashed virtual memory, didn’t finish

  • Eigenvalues of band symmetric matrices

    • Up to 17x on Intel Gainestown, 8 core, vs MKL 10.0 (up to 1.9x sequential)

  • Iterative sparse linear equations solvers (GMRES)

    • Up to 4.3x on Intel Clovertown, 8 core

  • N-body (direct particle interactions with cutoff distance)

    • Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K procs.



Modeling Energy: Dynamic



Modeling Energy: Memory Retention



Modeling Energy: Background Power



Energy Lower Bounds


Early Result: Perfect Strong Scaling in Time and Energy

  • Every time you add a processor, use its memory M too

  • Start with the minimal number of procs: $PM = 3n^2$

  • Increase P by a factor c ⇒ total memory increases by a factor c

  • Notation for timing model:

    • $\gamma_t$, $\beta_t$, $\alpha_t$ = secs per flop, per word moved, per message of size m

      $T(cP) = \frac{n^3}{cP}\left[\gamma_t + \frac{\beta_t}{M^{1/2}} + \frac{\alpha_t}{m M^{1/2}}\right] = \frac{T(P)}{c}$

  • Notation for energy model:

    • $\gamma_e$, $\beta_e$, $\alpha_e$ = Joules for same operations

    • $\delta_e$ = Joules per word of memory used per sec

    • $\varepsilon_e$ = Joules per sec for leakage, etc.

      $E(cP) = cP\left\{\frac{n^3}{cP}\left[\gamma_e + \frac{\beta_e}{M^{1/2}} + \frac{\alpha_e}{m M^{1/2}}\right] + \delta_e M\,T(cP) + \varepsilon_e\,T(cP)\right\} = E(P)$

  • Perfect scaling extends to n-body, Strassen, …

[IPDPS, 2013]
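The scaling claim on this slide can be checked numerically by evaluating the two models directly. In the sketch below every machine constant is a hypothetical placeholder, and the problem size is picked so that the starting processor count satisfies PM = 3n². The code evaluates the formulas as written; it does not model any upper limit on P beyond which the scaling would stop.

```python
import math


def time_model(n, p, mem, msg, gamma_t, beta_t, alpha_t):
    """T(P) = n^3/P * [gamma_t + beta_t/sqrt(M) + alpha_t/(m*sqrt(M))]."""
    per_flop = gamma_t + beta_t / math.sqrt(mem) + alpha_t / (msg * math.sqrt(mem))
    return n ** 3 / p * per_flop


def energy_model(n, p, mem, msg, gamma_e, beta_e, alpha_e, delta_e, eps_e, t):
    """E(P) = P * { n^3/P * [gamma_e + ...] + delta_e*M*T + eps_e*T }."""
    per_flop = gamma_e + beta_e / math.sqrt(mem) + alpha_e / (msg * math.sqrt(mem))
    return p * (n ** 3 / p * per_flop + delta_e * mem * t + eps_e * t)


# Hypothetical problem and machine parameters (not from the talk).
n = 32768            # matrix dimension
mem = 196608         # words of memory per processor (M)
msg = 1024           # words per message (m)
p0 = 3 * n * n // mem  # minimal processor count from PM = 3n^2
time_costs = (1e-10, 1e-8, 1e-6)                 # gamma_t, beta_t, alpha_t
energy_costs = (1e-10, 1e-8, 1e-6, 1e-9, 1e-1)   # gamma_e, beta_e, alpha_e, delta_e, eps_e

t0 = time_model(n, p0, mem, msg, *time_costs)
e0 = energy_model(n, p0, mem, msg, *energy_costs, t0)

scaled = {}
for c in (2, 4, 8):
    t_c = time_model(n, c * p0, mem, msg, *time_costs)
    e_c = energy_model(n, c * p0, mem, msg, *energy_costs, t_c)
    scaled[c] = (t_c, e_c)
    print(f"c={c}: T ratio {t0 / t_c:.3f}, E ratio {e_c / e0:.3f}")
```

Time drops by the factor c while energy stays flat, because the c-fold increase in processors is exactly cancelled by the c-fold shorter runtime in the memory-retention and leakage terms.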



C-A Algorithms Not Just for HPC

  • In ASPIRE, apply to other key application areas: machine vision, databases, speech recognition, software-defined radio, …

  • Initial results on lower bounds of database join algorithms



From C-A Algorithms to Provably Optimal Systems?

  • 1) Prove lower bounds on communication for a computation

  • 2) Develop algorithm that achieves lower bound on a system

  • 3) Find that communication time/energy cost is >90% of resulting implementation

  • 4) We know we’re within 10% of optimal!

  • Supporting technique: Optimizing software stack and compute engines to reduce compute costs and expose unavoidable communication costs
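The arithmetic behind steps 3 and 4 can be written out as a small sketch. It assumes the measured communication cost equals the proven lower bound exactly (not just asymptotically) and that communication and total cost are in the same units; the numbers are hypothetical.

```python
def optimality_gap_bound(comm_cost: float, total_cost: float) -> float:
    """Upper bound on (total - optimal) / total.

    No implementation can pay less than the communication lower bound, so
    optimal >= comm_cost and therefore total - optimal <= total - comm_cost.
    """
    if not 0 < comm_cost <= total_cost:
        raise ValueError("expected 0 < comm_cost <= total_cost")
    return (total_cost - comm_cost) / total_cost


# Hypothetical measurement: 10.0 J total, 9.2 J spent on unavoidable communication.
gap = optimality_gap_bound(comm_cost=9.2, total_cost=10.0)
print(f"at most {gap:.0%} of the measured cost is above optimal")
```

This is also why the supporting technique matters: shrinking compute cost raises the communication share and tightens the bound.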



ESP: An Applications Processor Architecture for ASPIRE

  • Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore

  • It is well known how to customize hardware engines for a specific task

  • ESP challenge is using specialized engines for general-purpose code

[Figure: Intel Ivy Bridge (22nm) and Qualcomm Snapdragon MSM8960 (28nm) chips, with regions labeled ESP]



ESP: Ensembles of Specialized Processors

  • General-purpose hardware, flexible but inefficient

  • Fixed-function hardware, efficient but inflexible

  • Par Lab Insight: Patterns capture common operations across many applications, each with unique communication & computation structure

  • Build an ensemble of specialized engines, each individually optimized for particular pattern but collectively covering application needs

  • Bet: Will give us efficiency plus flexibility

    • Any given core can have a different mix of these depending on workload



Par Lab: Motifs common across apps

Audio Recognition

Scene Analysis

Object Recognition

Applications

Dense

Sparse

Graph

Berkeley View “Dwarfs” or Motifs



Motif (née “Dwarf”) Popularity (Red Hot/Blue Cool)

Computing Domains

Par Lab Apps




Architecting Parallel Software

Application

Identify the Key Computations

  • Graph Algorithms

  • Dynamic Programming

  • Dense/Sparse Linear Algebra

  • Un/Structured Grids

  • Graphical Models

  • Finite State Machines

  • Backtrack Branch-and-Bound

  • N-Body Methods

  • Circuits

  • Spectral Methods

  • Monte-Carlo

Identify the Software Structure

  • Pipe-and-Filter

  • Agent-and-Repository

  • Event-based

  • Bulk Synchronous

  • Map-Reduce

  • Layered Systems

  • Model-View-Controller

  • Arbitrary Task Graphs

  • Puppeteer



Mapping Software to ESP: Specializers

Scene Analysis

Audio Recognition

Object Recognition

Applications

  • Capture desired functionality at a high level using patterns in a productive high-level language

  • Use pattern-specific compilers (Specializers) with autotuners to produce efficient low-level code

  • ASP specializer infrastructure, open-source download
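The specializer-plus-autotuner idea in the bullets above can be sketched in a few lines. This is a stand-in written entirely in Python: it does not use the ASP infrastructure or its API, and it does not generate low-level code. The names make_blocked_matmul and autotune are hypothetical. It only shows the shape of the approach: one pattern (dense matrix multiply), several generated variants, and a timing run that picks the fastest variant on the current machine.

```python
import time
from typing import Callable, Dict, List

Matrix = List[List[float]]


def make_blocked_matmul(block: int) -> Callable[[Matrix, Matrix], Matrix]:
    """Generate one variant of the dense matmul pattern for a given tile size."""
    def matmul(a: Matrix, b: Matrix) -> Matrix:
        n, k, m = len(a), len(b), len(b[0])
        c = [[0.0] * m for _ in range(n)]
        for i0 in range(0, n, block):
            for k0 in range(0, k, block):
                for j0 in range(0, m, block):
                    for i in range(i0, min(i0 + block, n)):
                        row_c, row_a = c[i], a[i]
                        for kk in range(k0, min(k0 + block, k)):
                            aik, row_b = row_a[kk], b[kk]
                            for j in range(j0, min(j0 + block, m)):
                                row_c[j] += aik * row_b[j]
        return c
    return matmul


def autotune(variants: Dict[int, Callable[[Matrix, Matrix], Matrix]],
             a: Matrix, b: Matrix, repeats: int = 3) -> int:
    """Time every variant on sample input and return the key of the fastest."""
    best_key, best_time = -1, float("inf")
    for key, fn in variants.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(a, b)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_key, best_time = key, elapsed
    return best_key


n = 24
a = [[float(i + j) for j in range(n)] for i in range(n)]
identity = [[float(i == j) for j in range(n)] for i in range(n)]

variants = {blk: make_blocked_matmul(blk) for blk in (4, 8, 24)}
best_block = autotune(variants, a, identity)
result = variants[best_block](a, identity)
print("selected tile size:", best_block)
```

Which tile size wins depends on the machine and the input, which is the reason for tuning at install or run time rather than fixing the choice in the source.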

[Figure: Berkeley View “Dwarfs” or Motifs (Dense, Sparse, Graph) → Specializers with SEJITS Implementations and Autotuning → ESP Code (Glue Code, Dense Code, Sparse Code, Graph Code) → ESP Core (ILP Engine, Dense Engine, Sparse Engine, Graph Engine)]



Replacing Fixed Accelerators with Programmable Fabric

  • Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore

  • Fabric challenge is retaining extreme energy efficiency while retaining programmability

[Figure: Intel Ivy Bridge (22nm) and Qualcomm Snapdragon MSM8960 (28nm) chips, with regions labeled Fabric]



Strawman Fabric Architecture

  • Will never have a C compiler

  • Only programmed using pattern-based DSLs

  • More dynamic, less static than earlier approaches

    • Dynamic dataflow-driven execution

    • Dynamic routing

    • Large memory support

[Figure: strawman fabric drawn as an array of 16 R, 16 M, and 16 A blocks]



“Agile Hardware” Development

  • Current hardware design is slow and arduous

  • But we now have a huge design space to explore

  • How to examine many design points efficiently?

  • Build parameterized generators, not point designs!

  • Adopt and adapt best practices from Agile Software

    • Complete LVS/DRC-clean physical design of the current version every ~two weeks (“tapein”)

    • Incremental feature addition

    • Test & verification as the first step



Chisel: Constructing Hardware In a Scala Embedded Language

  • Embed a hardware-description language in Scala, using Scala’s extension facilities

  • A hardware module is just a data structure in Scala

  • Different output routines can generate different types of output (C, FPGA-Verilog, ASIC-Verilog) from same hardware representation

  • Full power of Scala for writing hardware generators

    • Object-Oriented: Factory objects, traits, overloading etc

    • Functional: Higher-order funcs, anonymous funcs, currying

    • Compiles to JVM: Good performance, Java interoperability
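To make the "module is just a data structure" and "different output routines" points concrete, here is a minimal sketch of the idea. It is written in Python, not Scala, and it does not use or imitate the Chisel API; the Adder class and the two backend functions are hypothetical. One parameterized description feeds two output routines: a Verilog text emitter and a software simulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adder:
    """A hardware module described as plain data: an unsigned adder of a given width."""
    width: int


def to_verilog(mod: Adder) -> str:
    """Verilog backend: emit a module for this particular width."""
    w = mod.width
    return (
        f"module adder{w}(input [{w - 1}:0] a, input [{w - 1}:0] b, "
        f"output [{w - 1}:0] sum);\n"
        f"  assign sum = a + b;\n"
        f"endmodule\n"
    )


def simulate(mod: Adder, a: int, b: int) -> int:
    """Software-simulation backend: same wrap-around behavior as the Verilog."""
    return (a + b) & ((1 << mod.width) - 1)


# A generator explores several design points from one description.
designs = [Adder(w) for w in (8, 16, 32)]
verilog_sources = {d.width: to_verilog(d) for d in designs}

print(verilog_sources[8])
print(simulate(Adder(8), 200, 100))
```

Because the description is ordinary data in a general-purpose language, sweeping a parameter is a loop rather than a rewrite, which is the case the slides make for generators over point designs.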



Chisel Design Flow

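[Figure: Chisel design flow]

  • Chisel Program → Scala/JVM → C++ code → C++ Compiler → Software Simulator

  • Chisel Program → Scala/JVM → FPGA Verilog → FPGA Tools → FPGA Emulation

  • Chisel Program → Scala/JVM → ASIC Verilog → ASIC Tools → GDS Layout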



Chisel is much more than an HDL

  • The base Chisel system allows you to use the full power of Scala to describe the RTL of a design, then generate Verilog or C++ output from the RTL

  • But Chisel can be extended above with domain-specific languages (e.g., signal processing) for fabric

  • Importantly, Chisel can also be extended below with new backends or to add new tools or features (e.g., quantum computing circuits)

  • Only ~6,000 lines of code in current version including libraries!

  • BSD-licensed open source at:

    chisel.eecs.berkeley.edu



Many processor tapeouts in few years with small group (45nm, 28nm)

[Figure: chip layout showing the processor site (CORE 0 to CORE 3 with VC0 to VC3, 512KB L2, VFIXED) and the clock, DCDC, and SRAM test sites]



Resilient Circuits & Modeling

  • Future scaled technologies have high variability but want to run with lowest-possible margins to save energy

  • Significant increase in soft errors, need resilient systems

  • Technology modeling to determine tradeoff between MTBF and energy per task for logic, SRAM, & interconnect.

Techniques to reduce operating voltage can be worse for energy due to rapid rise in errors



Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency

[Figure: the ASPIRE stack, from applications (software) down to implementation technologies (hardware)]

  • Applications: Audio Recognition, Scene Analysis, Object Recognition

  • Computational and Structural Patterns: Pipe&Filter, Map-Reduce, Dense, Sparse, Graph

  • Communication-Avoiding Algorithms: C-A GEMM, C-A SpMV, C-A BFS

  • Specializers with SEJITS Implementations and Autotuning

  • ESP Code: Glue Code, Dense Code, Sparse Code, Graph Code

  • ESP (Ensembles of Specialized Processors) Architecture: ESP Core, ILP Engine, Dense Engine, Sparse Engine, Graph Engine, Hardware Cache Coherence, Local Stores + DMA

  • Hardware Generators using Chisel HDL

  • Deep HW/SW Design-Space Exploration

  • Validation/Verification: C++ Simulation, FPGA Emulation

  • Implementation Technologies: FPGA Computer, ASIC, SoC



ASPIRE Project

  • Initial $15.6M/5.5 year funding from DARPA PERFECT program

    • Started 9/28/2012

    • Located in Par Lab space + BWRC

  • Looking for industrial affiliates (see Krste!)

  • Open House today, 5th floor Soda Hall

Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.

