Breaking the memory wall for scalable microprocessor platforms

Breaking the Memory Wall for Scalable Microprocessor Platforms

Wen-mei Hwu

with

John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li,

Hillery C. Hunter, Shane Ryoo, Sain-Zee Ueng, James W. Player,

Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd,

Dan R. Burke, Nacho Navarro, Steve S. Lumetta

University of Illinois at Urbana-Champaign


Semiconductor computing platform challenges

[Slide diagram: a billion-transistor platform pulled in many directions. Design pressures: S/W inertia, O/S limitations, reliability, feature set, performance, security, accelerators, power, cost, reconfigurability. Candidate engines: microprocessors, intelligent RAM, DSP/ASIP. Technology limits: memory latency/bandwidth, power constraints, wire load, fab cost, leakage, process variation.]


ASIC/ASIP economics

  • Optimistically, ASIC/ASSP revenues growing 10–20% / year

    • Engineering portion of budget is supposed to be trimmed every year (but never is)

    • Chip development costs are rising faster than revenue growth and engineering-cost savings can offset

    • Implies ~40% fewer IC designs (each covering more applications) every process generation

[Slide diagram: Total ASIC/ASSP revenues (+10-20%/yr) and engineering costs (5-20%) vs. number of IC designs (-40% per generation) × per-chip development cost (+30-100% per generation).]


[Slide diagram: Intel IXP-family network processor. An array of microengines alongside an XScale core, hash engine, scratchpad, CSRs, PCI, SPI4/CSIX interfaces with RFIFO/TFIFO buffers, QDR SRAM channels, and RDRAM channels.]

ASIPs: non-traditional programmable platforms

Level of concurrency must be comparable to ASICs

ASIPs will be on-chip, high-performance multi-processors


Example embedded ASSP implementations

[Slide images: Philips Nexperia (Viper), pairing a MIPS core with VLIW media processors; Intel IXP1200 Network Processor.]


What about the general purpose world?

  • Clock frequency increase of computing engines is slowing down

    • Power budget hinders higher clock frequency

    • Device variation limits deeper pipelining

    • Most future perf. improvement will come from concurrency and specialization

  • Size increase of single-thread computing engines is slowing down

    • Power budget limits number of transistors activated by each instruction

    • Need finer-grained units for defect containment

    • Wire delay is becoming a primary limiter in large, monolithic designs

  • The approach to covering all applications with a primarily single execution model is showing limitations


Impact of Transistor Variations

[Figure: scatter of normalized frequency (0.9-1.4) vs. normalized leakage (Isb, 1-5x) for 130nm parts. Process variation produces a ~30% spread in frequency and a ~5x spread in leakage power. Source: Shekhar Borkar, Intel]


Metal Interconnects

[Figures: metal interconnect scaling across process nodes (500nm down to 32nm). Relative line resistance rises by orders of magnitude while low-K ILD only roughly halves relative line capacitance; interconnect RC delay (even with copper interconnect and 0.7x scaled RC delay) overtakes the shrinking clock period; the RC delay of 1mm of interconnect grows steeply across the 350-65nm nodes. Source: Shekhar Borkar, Intel]


Measured SPECint2000 performance on real hardware with same fabrication technology

Date: October 2003


Convergence of future computing platforms

General processor cores

  • Very low power compute and memory structures

  • O/S provides lightweight access to custom features

Application processors

  • Lightweight compute engines

  • High-bandwidth, distributed storage (RAM, registers)

  • High-bandwidth, scalable interconnect

Memory system

  • Data delivery to processor

  • O/S and virtual memory issues

  • Intelligent memory controllers

Acceleration logic

  • Application specific logic

  • High-bandwidth, distributed storage (RAM, registers)

  • To developer, behave like software components



Breaking the memory wall with distributed memory and data movement


Parallelization with deep analysis: Deconstructing von Neumann [IWLS2004]

  • Memory dataflow that enables

    • Extraction of independent memory access streams

    • Conversion of implicit flows through memory into explicit communication

  • Applicability to mass software base requires pointer analysis, control flow analysis, array dependence analysis

[Slide diagram: the G.724 decoder post-filter viewed two ways. Left, the von Neumann view: a CPU making implicit, random accesses to DRAM. Right, the deconstructed dataflow view: PE's for the Weight_Ai, Residu, Copy, Set_zero, Syn_filt, Corr0/Corr1, preemphasis, and agc kernels, communicating explicitly through distributed storage for Az_4, Ap3, Ap4, F_g3, F_g4, res2, h, syn, m_syn, and the tmp/tmp1/tmp2 temporaries. Kernel calls visible in the transcript include Weight_Ai(Az, F_ga3, Ap3), Weight_Ai(Az, F_g4, Ap4), Residu(Ap3, &syn_subfr[i], ...), Copy(Ap3, h, 11), Set_zero(&h[11], 11), Syn_filt(Ap4, h, h, 22, &h ...), preemphasis(res2, temp2, 40), Syn_filt(Ap4, res2, &syn_p, 40, mem_syn_pst, 1), and agc(&syn[i_subfr], &syn, 29491, 40).]

The Corr0/Corr1 kernel shown on the slide:

    tmp = h[0] * h[0];
    for (i = 1; i < 22; i++)
        tmp = tmp + h[i] * h[i];
    tmp1 = tmp >> 8;

    tmp = h[0] * h[1];
    for (i = 1; i < 21; i++)
        tmp = tmp + h[i] * h[i+1];
    tmp2 = tmp >> 8;

    if (tmp2 <= 0) {
        tmp2 = 0;
    } else {
        tmp2 = tmp2 * MU;
        tmp2 = tmp2 / tmp1;
    }



Memory bottleneck example(G.724 Decoder Post-filter, C code)

  • Problem: Production/consumption occur with different patterns across 3 kernels

    • Anti-dependence in preemphasis function (loop reversal not applicable)

    • Consumer must wait until producer finishes

  • Goal: Convert memory access to inter-cluster communication
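The anti-dependence can be sketched as follows. This is a hedged reconstruction: the kernel's exact form and the Q15 factor MU are assumptions, not the actual G.724 reference code.

```c
#include <stdint.h>

/* Illustrative Q15 preemphasis factor (~0.68); the real constant may differ. */
#define MU 22282

/* Each iteration reads signal[i-1], which the following iteration then
 * overwrites: a write-after-read (anti-) dependence. Running the loop
 * forward instead would read already-updated samples and change the
 * result, which is why loop reversal cannot be used to match the
 * consumer's traversal order. */
void preemphasis(int16_t *signal, int len)
{
    for (int i = len - 1; i > 0; i--)
        signal[i] = (int16_t)(signal[i] - ((MU * (int32_t)signal[i - 1]) >> 15));
}
```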

[Slide diagram: timeline of the Residu and Syn_filt multiply chains with preemphasis between them. Mismatched access orders ([39:0] vs. [0:39]) force all 40 elements through MEM, so the consumer cannot start until the producer finishes.]



Breaking the memory bottleneck

  • Remove anti-dependence by array renaming

  • Apply loop reversal to match producer/consumer I/O

  • Convert array access to inter-component communication
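The three steps above can be sketched in software as follows. The names, the doubling "filter", and the software FIFO are illustrative stand-ins for the real kernels and an on-chip channel between processing elements.

```c
#define N 40

/* Illustrative bounded channel standing in for an on-chip FIFO. */
typedef struct { int buf[N]; int head, tail; } fifo_t;

void fifo_push(fifo_t *f, int v) { f->buf[f->tail++] = v; }
int  fifo_pop(fifo_t *f)         { return f->buf[f->head++]; }

/* Producer (Residu-like): emits each element as it is computed instead
 * of storing it into a shared, renamed intermediate array. */
void producer(fifo_t *ch, const int *x, int n)
{
    for (int i = 0; i < n; i++)
        fifo_push(ch, 2 * x[i]);   /* stand-in computation */
}

/* Consumer (Syn_filt-like): after renaming and loop reversal it reads
 * in the producer's order, so elements stream through the channel. */
int consumer(fifo_t *ch, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += fifo_pop(ch);
    return acc;
}
```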

[Slide diagram: after renaming and loop reversal, Residu, preemphasis, and Syn_filt operate as a pipeline over res and res2, forwarding elements as inter-component communication instead of through memory.]

Interprocedural pointer analysis + array dependence test + array access pattern summary + interprocedural memory dataflow


A prototyping experience with the Xilinx ML300

  • Full system environment

    • Linux running on PowerPC

    • Lean system with custom Linux (Nacho Navarro, UIUC/UPC)

    • Virtex-II Pro FPGA logic treated as software components

  • Removing the memory bottleneck

    • Random memory access converted to dataflow

    • Memory objects assigned to distributed Block RAM

  • SW/HW communication

    • PLB vs. OCM interface


[Chart: projected filter latency in cycles. All-software: ~15000-16000 cycles; naïve hardware version: ~8x lower; optimized version: ~32x lower.]

Initial results from our ML300 testbed

  • Case study: GSM vocoder

    • Main filter in FPGA

    • Rest in software running under Linux with customized support

    • Straightforward software/accelerator communications pattern

    • Fits in available resources on Xilinx ML300 V2P7

    • Performance compared to all-software execution, with communication overhead

Hardware implementation


Grand challenge

  • Moving the mass-market software base to heterogeneous computing architectures

    • Embedded computing platforms in the near term

    • General purpose computing platforms in the long run

[Slide diagram: applications and systems software meet platforms through OS support, programming models, accelerator architectures, restructuring compilers, and communications and storage management.]


Slicing through software layers


Taking the first step: pointer analysis

  • To what can this variable point? (points-to)

    • Can these two variables point to the same thing? (alias)

    • Fundamental to unraveling communications through memory: programmers like modularity and pointers!

  • Pointer analysis is abstract execution

    • Model all possible executions of the program

    • Has to include important facets, or result won’t be useful

    • Has to ignore irrelevant details, or result won’t be timely

    • Unrealizable dataflow = artifacts of “corners cut” in the model

  • Typically, emphasis has been on timeliness, not resolution, because expensive algorithms cause unstable analysis time – for typical alias uses, may be OK…

  • …but we have new applications that can benefit from higher accuracy

    • Data flow unraveling for logic synthesis and heterogeneous systems
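As a deliberately tiny illustration of abstract execution, the sketch below solves Andersen-style inclusion constraints for four variables. It is a toy under stated assumptions (flow- and context-insensitive, bitmask sets), not the IMPACT analysis.

```c
#include <stdint.h>

/* Toy Andersen-style points-to sketch: points-to sets as bitmasks,
 * inclusion constraints propagated to a fixpoint. */
enum { X, Y, P, Q, NVARS };   /* models: int x, y; int *p, *q; */

uint32_t pts[NVARS];          /* pts[v] = bitmask of what v may point to */

/* constraint from "p = &x" */
void addr_of(int p, int x) { pts[p] |= 1u << x; }

/* constraint from a pointer copy "dst = src" */
struct copy { int dst, src; };

/* Propagate copy constraints until nothing changes: the fixpoint covers
 * every possible execution order (this is the "abstract execution"). */
void solve(const struct copy *c, int n)
{
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int i = 0; i < n; i++) {
            uint32_t merged = pts[c[i].dst] | pts[c[i].src];
            if (merged != pts[c[i].dst]) { pts[c[i].dst] = merged; changed = 1; }
        }
    }
}

/* "Can these two variables point to the same thing?" */
int may_alias(int a, int b) { return (pts[a] & pts[b]) != 0; }
```

With p = &x, q = &y, and then q = p, the solver reports that p and q may alias because both sets contain x.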


How to be fast, safe and accurate?

  • An efficient, accurate, and safe pointer analysis based on the following two key ideas

Efficient analysis of a large program necessitates that only relevant details are forwarded to a higher level component

The algorithm can locally cut its losses (like a bulkhead) …

… to avoid a global explosion in problem size


One facet: context sensitivity


  • Context sensitivity – avoids unrealizable data flow by distinguishing proper calling context

  • What assignments do a and g receive?

    • CI: a and g each receive 1 and 3

    • CS: g receives only 1 and a receives only 3

  • Typical reactions to CS costs

    • Forget it, live with lots of unrealizable dataflow

    • Combine it with a “cheapener” like the lossy compression of a Steensgaard analysis

  • We want to do better, but we may sometimes need to mix CS and CI to keep analysis fast

Desired results
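The transcript omits the slide's code, but the classic shape of the example can be reconstructed as below. The helper and caller names are hypothetical; executing the code concretely gives the context-sensitive answer.

```c
/* Reconstructed illustration: one helper, two calling contexts. */
int g, a;

int identity(int v) { return v; }

void caller1(void) { g = identity(1); }   /* calling context 1 */
void caller2(void) { a = identity(3); }   /* calling context 2 */

/* A context-insensitive (CI) analysis merges both calls through
 * identity's parameter v, so it concludes g and a may each receive
 * 1 or 3. A context-sensitive (CS) analysis keeps the call sites
 * apart: g receives only 1 and a receives only 3, matching the
 * concrete execution. */
```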


Context Insensitive (CI)

  • Collecting all the assignments in the program and solving them simultaneously yields a context insensitive solution

  • Unfortunately, this leads to three spurious solutions.


Context Sensitive (CS): Naïve process

Excess statements are unnecessary and costly

Retention of side effects still leads to spurious results


CS: “Accurate and Efficient” approach

Compact summary of jade used

Summary accounts for all side effects; the assignment is deleted to prevent contamination

Now only the correct result is derived


Analyzing large, complex programs [SAS2004]

Originally, problem size exploded (toward 10^12) as more contexts were encountered. The new algorithm contains problem size (around 10^4) with each additional context, resulting in an efficient analysis process without loss of accuracy.


Example application and current challenges [PASTE2004]

Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered

Example: Improved analysis algorithms provide more accurate call graphs (below) instead of a blurred view (above) for use by program transformation tools


From benchmarks to broad application code base

[Slide diagram: the software stack (applications, operating systems, libraries, runtime and tools, down to hardware) with all code flowing through the compiler.]


  • The long term trend is for all code to go through a compiler and be managed by a runtime system

    • Microsoft code base to go through Phoenix – OpenIMPACT participation

    • Open source code base to go through GCC/OpenIMPACT under Gelato

  • The compiler and runtime will perform deep analysis to allow tools to have visibility into software

    • Parallelizers, debuggers, verifiers, models, validation, instrumentation, configuration, memory managers, runtime, etc.


Global memory dataflow analysis

  • Integrates analyses to deconstruct memory “black box”

    • Interprocedural pointer analysis: allow programmer to use language and modularity without losing transformability

    • Array access pattern analysis: figure out communication among loops that communicate through arrays

    • Control and data flow analyses: enhance resolution by understanding program structure

    • Heap analysis extends analysis to much wider software base

  • SSA-based induction variable detection and dependence testing have been integrated into the IMPACT environment


Example on deriving memory data flow

    main(...)
    {
        int A[100];
        foo(A, 64);    /* foo writes A[0:63], stride 1 */
        bar(A+1, 64);  /* bar reads A[1:64], stride 1  */
    }

    foo(int *s, int L)          /* parameter mapping: s = A */
    {
        int *p = s, i;
        for (i = 0; i < L; i++) {
            *p = ...;           /* loop body: write *p */
            p++;
        }
    }
    /* Pointer relation analysis restates p in terms of s; the summary
       for the whole loop: write from *(s) to *(s+L) with stride 1 */

    bar(int *t, int M)          /* parameter mapping: t = A+1 */
    {
        int *q = t, i;
        for (i = 0; i < M; i++) {
            ... = *q;           /* loop body: read *q */
            q++;
        }
    }
    /* Pointer relation analysis restates q in terms of t; the summary
       for the whole loop: read from *(t) to *(t+M) with stride 1 */

Across the procedure call, data flow analysis determines that A[64], read by bar, is not produced by foo.
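The [lo:hi]-with-stride summaries above can be modeled as a small data structure (an illustrative sketch, not IMPACT's internal representation); membership testing then shows that bar's read of A[64] is not covered by foo's write set A[0:63].

```c
/* An array-section summary: the access set {lo, lo+stride, ..., hi}. */
struct section { int lo, hi, stride; };

/* Does the summarized access set touch element `elem`? */
int covered(struct section s, int elem)
{
    return elem >= s.lo && elem <= s.hi && (elem - s.lo) % s.stride == 0;
}

/* For stride-1 sections, dependence testing reduces to interval
 * intersection. */
int overlap1(struct section a, struct section b)
{
    int lo = a.lo > b.lo ? a.lo : b.lo;
    int hi = a.hi < b.hi ? a.hi : b.hi;
    return lo <= hi;
}
```

Here foo's summary {0, 63, 1} and bar's summary {1, 64, 1} overlap on A[1:63] (real dataflow), while A[64] falls outside foo's write set.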


Conclusions and outlook

  • Heterogeneous multiprocessor systems will be the model for both general purpose and embedded computing platforms in the future

    • Both are motivated by powerful trends

    • Shorter term adoption for embedded systems

    • Longer term for general purpose systems

  • Programming models and parallelization of traditional programs are needed to channel software to these new platforms

    • Feasibility of deep pointer analysis demonstrated

    • Many need to participate in solving this grand challenge problem
