greg astfalk, woon yung chung (woon-yung_chung@hp)




Presentation Transcript


high-end computing technology: where is it heading?

greg astfalk

woon yung chung

prologue


this is not a talk about hewlett-packard’s product offering(s)

the context is hpc (high performance computing)

somewhat biased to scientific computing

also applies to commercial computing

backdrop


end-users have needs and “wants” from hpc systems

the computer industry delivers the hpc systems

there exists a gap between the two wrt:





in this talk we (weakly) quantify the gaps in these 4 areas

end-users’ programming “wants”

end-users of hpc machines would ideally like to think and code sequentially

have a compiler and run-time system that produces portable and (nearly) optimal parallel code

regardless of processor count

regardless of architecture type

yes, i am being a bit facetious, but the idea remains true



parallelism methodologies

there exist 5 methodologies to achieve parallelism

automatic parallelization via compilers

explicit threading






explicitly parallel languages

upc, et al.
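the flavor of explicit, hand-coded parallelism can be sketched in a few lines; this is an illustrative python sketch (the function names and chunking scheme are ours, not from the talk), using threads for a simple data-parallel reduction:

```python
# illustrative sketch of explicit data-parallel decomposition: split the
# data, give each worker a chunk, then reduce the partial results.
from concurrent.futures import ThreadPoolExecutor

def chunk(data, nworkers):
    """split data into nworkers roughly equal contiguous pieces."""
    step = (len(data) + nworkers - 1) // nworkers
    return [data[i:i + step] for i in range(0, len(data), step)]

def parallel_sum(data, nworkers=4):
    """sum data by summing each chunk in a worker, then reducing."""
    pieces = chunk(list(data), nworkers)
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        partials = list(pool.map(sum, pieces))
    return sum(partials)  # the final reduction stays sequential
```

even in this toy, the programmer, not the compiler, chose the decomposition and the reduction, which is the talk's point about hand-coding.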


parallel programming

parallel programming is a cerebral effort

if lots of neurons plus mpi constitutes “prime-time”, then parallel programming has arrived

no major technologies on the horizon to change this status quo

discontinuities


the ease of parallel programming has not progressed at the same rate that parallel systems have become available

performance gains require compiler optimization or pbo

most parallelism requires hand-coding

in the real-world many users don’t use any compiler optimizations


parallel efficiency

be mindful that the bounds on parallel efficiency are, in general, far apart

50% efficiency on 32 processors is good

10% efficiency on O(100) processors is excellent

>2% efficiency on O(1000) processors is heroic

a little communication can “knee over” the efficiency vs. processor count curve
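the “knee” in the curve can be seen with a toy analytical model; the serial fraction and per-processor communication cost below are our illustrative assumptions, not numbers from the talk:

```python
# toy amdahl-style model with a communication term that grows with
# processor count; illustrative only.
def efficiency(p, serial=0.01, comm_per_proc=0.001):
    """parallel efficiency = speedup / p for a fixed-size problem with a
    serial fraction, a perfectly parallel part, and communication that
    costs a little more for every processor added."""
    t = serial + (1.0 - serial) / p + comm_per_proc * (p - 1)
    return (1.0 / t) / p
```

even 0.1% communication per processor is enough to knee the curve over well before O(100) processors.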


apps with sufficient parallelism

few existing applications can utilize O(1000), or even O(100), processors with any reasonable degree of efficiency

to date this has generally required heroic effort

new algorithms (i.e., new data and control decompositions), or nearly complete rewrites, are necessary

such large-scale parallelism will have “arrived” when msc/nastran and oracle exist on such systems and utilize the processors

latency tolerant algorithms

latency tolerance will be an increasingly important theme for the future

hardware will not solve this problem

more on this point later

developing algorithms that have significant latency tolerance will be necessary

this means thinking “outside the box” about the algorithms

simple modifications to existing algorithms generally won’t suffice

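one common latency-tolerance idea, overlapping high-latency fetches with useful computation, can be sketched with a prefetch thread; this double-buffering scheme is our illustrative example, not an algorithm from the slides:

```python
# overlap a high-latency fetch with computation: a producer thread
# prefetches the next block while the consumer computes on the current one.
import queue
import threading

def process_stream(fetch, compute, nblocks):
    """run compute over nblocks blocks, hiding fetch latency behind compute."""
    q = queue.Queue(maxsize=2)  # small buffer: just enough to overlap

    def producer():
        for i in range(nblocks):
            q.put(fetch(i))  # may stall on latency, concurrently with compute
        q.put(None)          # sentinel: no more blocks

    threading.Thread(target=producer, daemon=True).start()
    out = []
    while True:
        block = q.get()
        if block is None:
            break
        out.append(compute(block))
    return out
```

this is the “outside the box” restructuring the slide means: the algorithm must be organized so there is always independent work available while a fetch is in flight.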


operating systems

development environments will move to nt

heavy-lifting will remain with unix

four unixes will survive (alphabetically)



aix 5l


linux will be important at the lower-end but will not significantly encroach on the high-end


end-users’ proc/arch “wants”

all things being equal, high-end users would likely want a classic cray vector supercomputer

no caches

multiple pipes to memory

single word access

hardware support for gather/scatter


it is true, however, that for some applications contemporary risc processors perform better

processors


the “processor of choice” is now, and will be for some time to come, the risc processor

risc processors have caches

caches are good

caches are bad

if your code fits in cache, you aren’t supercomputing! 

risc processor performance

a rule of thumb is that a risc processor, any risc processor, gets on average, on a sustained basis,

10% of its peak performance

the 3 on this is large

achieved performance varies with





dataset size

anything else you can think of

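the 10% rule of thumb is simple to apply; the 550 MHz, 2-flops-per-cycle processor in the usage below is a hypothetical example of the era, not a specific product:

```python
# peak vs. sustained performance under the slide's 10% rule of thumb.
def peak_flops(freq_hz, flops_per_cycle):
    """theoretical peak: clock rate times flops issued per cycle."""
    return freq_hz * flops_per_cycle

def sustained_flops(peak, fraction=0.10):
    """sustained performance under the ~10%-of-peak rule of thumb."""
    return peak * fraction
```

for a hypothetical 550 MHz part issuing 2 flops/cycle, that is 1.1 Gflop/s peak but only about 110 Mflop/s sustained, and the σ around that 10% is large.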


semiconductor processes

semiconductor processes change every 2-3 years

assuming that “technology scaling” applies to subsequent generations, then per generation:

frequency increase of ~40%

transistor density increase of ~100%

energy per transition decrease of ~60%
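compounding these per-generation rates over several process generations gives a feel for the scaling; a minimal sketch using exactly the rates above:

```python
# compound the slide's per-generation scaling rates over n generations.
def scale(generations, freq=1.0, density=1.0, energy=1.0):
    """return (frequency, density, energy-per-transition) multipliers
    after the given number of process generations."""
    for _ in range(generations):
        freq *= 1.40      # ~40% frequency increase per generation
        density *= 2.00   # ~100% transistor density increase
        energy *= 0.40    # ~60% energy-per-transition decrease
    return freq, density, energy
```

at 2-3 years per generation, two generations roughly double frequency and quadruple the transistor budget.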



what to do with gates

it is not a simple question of what the best use of the gates is

larger caches

multiple cores

specialized functional units


the impact of soft errors with decreasing design rule size will be an important topic

what happens if an alpha particle flips a bit in a register?


processor futures

you can expect, for the short term, moore’s-law-like gains in processors’ peak performance

doubling of “performance” every 18-24 months

does not necessarily apply to application performance

moore’s law will not last forever

4-5 more turns (maybe?)
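the doubling claim is simple compound growth; a sketch, using the 18-24 month doubling period from the slide:

```python
# project peak performance under a moore's-law doubling period.
def projected_peak(peak0, months, doubling_months=18):
    """peak after `months`, doubling every `doubling_months`."""
    return peak0 * 2.0 ** (months / doubling_months)
```

four to five more turns at 18-24 months each is roughly a 16x-32x gain in peak, and, as the slide cautions, not necessarily in application performance.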

customer spending ($m)

[chart omitted: customer spending ($m) by processor architecture; source: idc, february 2000]

technology disruptions:

risc crossed over cisc in 1996

itanium will cross over risc in 2004


present high-end architectures

today’s high-end architecture is either



cluster of smp nodes

cluster of ccnuma nodes

japanese vector system

all of these architectures work

efficiency varies with application type


architectural issues

of the choices available the smp is preferred, however

smp processor count is limited

cost of scalability is prohibitive

ccnuma addresses these limitations but induces its own

disparate latencies

better, but still limited, scalability

ras limitations

clusters too have pros and cons

huge latencies

low cost


physics


limitations imposed by physics have led us to architectures that have a deep memory hierarchy

the algorithmist and programmer must deal with, and exploit, the hierarchy to achieve good performance

this is part of the cerebral effort of parallel programming we mentioned earlier


memory hierarchy

typical latencies for today’s technology
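the latency table from this slide did not survive extraction; the ballpark figures below are our illustrative assumptions for a hierarchy of that era, not the original numbers:

```python
# illustrative (assumed) latencies for a circa-2000 memory hierarchy, in
# nanoseconds; these are our ballpark figures, not the slide's table.
LATENCY_NS = {
    "register": 1,            # ~1 cycle
    "l1 cache": 2,
    "l2 cache": 10,
    "main memory": 200,
    "remote ccnuma memory": 1000,
    "disk": 8_000_000,        # ~8 ms
}

def slowdown(level, base="register"):
    """how many times slower one level of the hierarchy is than another."""
    return LATENCY_NS[level] / LATENCY_NS[base]
```

the orders-of-magnitude spread between adjacent levels is the point: it is why the hierarchy must be exploited, not ignored.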


balanced system ratios

an “ideal” high-end system should be balanced wrt its performance metrics

for each peak flop/second

0.5–1 byte of physical memory

10–100 byte of disk capacity

4–16 byte/sec of cache bandwidth

1–3 byte/sec of memory bandwidth

0.1–1 bit/sec of interconnect bandwidth

0.02–0.2 byte/sec of disk bandwidth
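these ratios can be applied mechanically to size a balanced machine; the ratio table below is straight from the slide, while the 100 Gflop/s peak in the example is our illustrative figure:

```python
# the slide's balance ratios, expressed per peak flop/second as
# (low, high) ranges.
RATIOS = {
    "physical memory (bytes)": (0.5, 1.0),
    "disk capacity (bytes)": (10.0, 100.0),
    "cache bandwidth (bytes/s)": (4.0, 16.0),
    "memory bandwidth (bytes/s)": (1.0, 3.0),
    "interconnect bandwidth (bits/s)": (0.1, 1.0),
    "disk bandwidth (bytes/s)": (0.02, 0.2),
}

def balanced(peak_flops_per_s):
    """scale every balance ratio by a machine's peak flop rate."""
    return {metric: (lo * peak_flops_per_s, hi * peak_flops_per_s)
            for metric, (lo, hi) in RATIOS.items()}
```

e.g., a hypothetical 100 Gflop/s system would want 1-3x10^11 bytes/s of memory bandwidth to be balanced, which is the uncomfortable part of the exercise.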


balanced system

applying the balanced system ratios to an unnamed contemporary 16-processor smp

storage


data volumes are growing at an extremely rapid pace

disk capacity sold doubled from 1997 to 1998

storage is an increasingly large percentage of the total server sale

disk technology is advancing too slowly

per generation, of 1–1.5 years:

access time decreases 10%

spindle bandwidth increases 30%

capacity increases 50%
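compounding these per-generation rates shows why the slide calls disk progress too slow: capacity outgrows bandwidth, so the time to read a full disk keeps rising. the starting values below are our illustrative assumptions, while the rates are the slide's:

```python
# compound the slide's per-generation disk technology rates; starting
# values (10 ms access, 20 MB/s, 10 GB) are illustrative assumptions.
def disk_after(generations, access_ms=10.0, bw_mb_s=20.0, cap_gb=10.0):
    """disk characteristics after the given number of generations."""
    for _ in range(generations):
        access_ms *= 0.90  # access time decreases 10%
        bw_mb_s *= 1.30    # spindle bandwidth increases 30%
        cap_gb *= 1.50     # capacity increases 50%
    return access_ms, bw_mb_s, cap_gb

def full_scan_seconds(bw_mb_s, cap_gb):
    """time to stream an entire disk at full spindle bandwidth."""
    return cap_gb * 1024.0 / bw_mb_s
```

after five generations capacity has grown ~7.6x but bandwidth only ~3.7x, so the full-disk scan time roughly doubles.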

networks


only the standards will be widely deployed

gigabit ethernet

gigabyte ethernet

fibre channel (2x and 10x later)



dwdm backbones

the “last mile” problem remains with us

inter-system interconnect for clustering will not keep pace with the demands (for latency and bandwidth)


vendor’s constraints

rule #1: be profitable to return value to the shareholders

you don’t control the market size

you can only spend ~10% of your revenue on r&d

don’t fab your own silicon (hopefully)

you must be more than just a “technical computing” company

to not do this is to fail to meet rule #1 (see above)


market sizes

according to the industry analysts the technical market is, depending on where you draw the cut-line, $4–5 billion annually

the bulk of the market is small-ish systems (data from forest baskett at sgi)


a perspective

commercial computing is not an enemy

without the commercial market’s revenue our ability to build hpc-like systems would be limited

the commercial market benefits from the technology innovation in the hpc market

is performance “left on the table” in designing a system to serve both the commercial and technical markets?




lack of a cold war

performance of hpc systems has been marginalized

in the mid-70s, how many applications ran faster on a vax 11/780 than on the cray-1?


how many applications today run faster on a pentium than the cray t90?


current demand for hpc systems is elastic


future prognostication

computing in the future will be all about data and moving data

the growth in data volumes is incredible

richer media types (e.g., video) mean more data

distributed collaborations imply moving data

e-whatever requires large, rapid data movement

more flops → more data


data movement

the scope of data movement encompasses:

register to functional unit

cache to register

cache to cache

memory to cache

disk to memory

tape to disk

system to system

pda to client to server

continent to continent

all of these are going to be important

epilogue


for hpc in the future

it is going to be risc processors

smp and ccnuma architectures

smp processor count relatively constant

technology trends are reasonably predictable

mpi, pthreads and openmp for parallelism

latency management will be crucial

it will be all about data


epilogue (cont’d)

for the computer industry in the future

trending toward “e-everything”



brokered services

remote data

virtual data centers


nt for development

vectors are dying

for hpc vendors in the future

there will be fewer

conclusion


hpc users will need to yield more to what the industry can provide, rather than vice versa

vendor’s rule #1 is a cruel master
