
high-end computing technology: where is it heading?

greg astfalk

woon yung chung

woon-yung_chung@hp.com



prologue

this is not a talk about hewlett-packard’s product offering(s)

the context is hpc (high performance computing)

somewhat biased to scientific computing

also applies to commercial computing



backdrop

end-users have needs and “wants” from hpc systems

the computer industry delivers the hpc systems

there exists a gap between the two wrt

programming

processors

architectures

interconnects/storage

in this talk we (weakly) quantify the gaps in these 4 areas


end-users’ programming “wants”

end-users of hpc machines would ideally like to think and code sequentially

have a compiler and run-time system that produces portable and (nearly) optimal parallel code

regardless of processor count

regardless of architecture type

yes, i am being a bit facetious but the idea remains true




parallelism methodologies

there exist 5 methodologies to achieve parallelism

automatic parallelization via compilers

explicit threading (pthreads)

message-passing (mpi)

pragma/directive (openmp)

explicitly parallel languages (upc, et al.)
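to make the list concrete, here is a minimal sketch (mine, not from the talk) that mixes two of these methodologies in one C program: an openmp pragma for intra-node threading and an mpi reduction for message-passing across processes; the array contents and sizes are arbitrary

```c
#include <stdio.h>
#include <mpi.h>

/* Sum a local array with OpenMP threads (pragma/directive style),
   then combine per-process partial sums with MPI (message-passing). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local[1000], local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < 1000; i++)
        local[i] = 1.0;                        /* arbitrary data */

    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000; i++)
        local_sum += local[i];

    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```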



parallel programming

parallel programming is a cerebral effort

if lots of neurons plus mpi constitutes “prime-time”, then parallel programming has arrived

no major technologies on the horizon to change this status quo



discontinuities

the ease of parallel programming has not progressed at the same rate that parallel systems have become available

performance gains require compiler optimization or pbo (profile-based optimization)

most parallelism requires hand-coding

in the real-world many users don’t use any compiler optimizations



parallel efficiency

be mindful that the bounds on parallel efficiency are, in general, far apart

50% efficiency on 32 processors is good

10% efficiency on O(100) processors is excellent

>2% efficiency on O(1000) processors is heroic

a little communication can “knee over” the efficiency vs. processor count curve
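for reference, these figures follow from the textbook definitions of speedup and efficiency together with amdahl’s law (standard formulas, not from the slides):

```latex
E(p) = \frac{T_1}{p\,T_p}, \qquad
S(p) = \frac{T_1}{T_p} \le \frac{1}{f + (1-f)/p}
```

where T_1 is one-processor time, T_p is p-processor time, and f is the serial fraction; even f = 0.01 caps S(1000) at about 91, i.e. roughly 9% efficiency, which is why a little communication “knees over” the curve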



apps with sufficient parallelism

few existing applications can utilize O(1000), or even O(100), processors with any reasonable degree of efficiency

to date have generally required heroic effort

new algorithms (i.e., data and control decompositions) or nearly complete rewrites are necessary

such large-scale parallelism will have “arrived” when msc/nastran and oracle exist on such systems and utilize the processors


latency tolerant algorithms

latency tolerance will be an increasingly important theme for the future

hardware will not solve this problem

more on this point later

developing algorithms that have significant latency tolerance will be necessary

this means thinking “outside the box” about the algorithms

simple modifications to existing algorithms generally won’t suffice

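as one illustration of thinking this way (a sketch under assumed names, not a prescription from the talk): restructure a stencil-style computation so the work on interior points, which needs no remote data, overlaps the halo exchange via mpi’s nonblocking calls

```c
#include <mpi.h>

/* hypothetical application kernels, assumed to exist elsewhere */
void compute_interior(void);
void compute_boundary(const double *halo);

/* Start the halo exchange, compute on the interior while the
   transfer is in flight, then wait only before touching the halo. */
void latency_tolerant_step(double *halo_in, const double *halo_out,
                           int n, int neighbor)
{
    MPI_Request reqs[2];

    MPI_Irecv(halo_in, n, MPI_DOUBLE, neighbor, 0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend((void *)halo_out, n, MPI_DOUBLE, neighbor, 0,
              MPI_COMM_WORLD, &reqs[1]);

    compute_interior();                    /* overlaps the transfer */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(halo_in);             /* remote data has arrived */
}
```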



operating systems

development environments will move to nt

heavy-lifting will remain with unix

four unixes will survive (alphabetically)

aix 5l

hp-ux

linux

solaris

linux will be important at the lower-end but will not significantly encroach on the high-end



end-users’ proc/arch “wants”

all things being equal, high-end users would likely want a classic cray vector supercomputer

no caches

multiple pipes to memory

single word access

hardware support for gather/scatter

etc.

it is true, however, that for some applications contemporary risc processors perform better



processors

the “processor of choice” is now, and will be for some time to come, the risc processor

risc processors have caches

caches are good

caches are bad

if your code fits in cache, you aren’t supercomputing! 


risc processor performance

a rule of thumb is that a risc processor, any risc processor, gets on average, on a sustained basis,

10% of its peak performance

the 3 on this is large

achieved performance varies with

architecture

application

algorithm

coding

dataset size

anything else you can think of

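to put the rule of thumb in concrete terms, for a hypothetical processor (the figures are illustrative, not a benchmark):

```latex
P_{\mathrm{peak}} = 550\ \mathrm{MHz} \times 4\ \mathrm{flops/cycle}
                  = 2.2\ \mathrm{Gflop/s}, \qquad
P_{\mathrm{sustained}} \approx 0.10\,P_{\mathrm{peak}} \approx 220\ \mathrm{Mflop/s}
```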



semiconductor processes

semiconductor processes change every 2-3 years

assuming that “technology scaling” applies to subsequent generations, then per generation

frequency increase of ~40%

transistor density increase of ~100%

energy per transition decrease of ~60%
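compounding those factors shows the pace; taking k generations from a baseline f_0, d_0, e_0 (illustrative arithmetic, not from the slides):

```latex
f_k = f_0\,(1.4)^k, \qquad
d_k = d_0\,(2.0)^k, \qquad
e_k = e_0\,(0.4)^k
```

e.g., after k = 4 generations (roughly a decade at 2-3 years each), frequency is about 3.8x, transistor density about 16x, and energy per transition about 2.6% of the starting point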


semiconductor processes (cont’d)

[chart]



what to do with gates

it is not a simple question of what the best use of the gates is

larger caches

multiple cores

specialized functional units

etc.

the impact of soft errors with decreasing design rule size will be an important topic

what happens if an alpha particle flips a bit in a register?
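one classic software-side answer, shown here as a minimal sketch rather than any vendor’s mechanism, is triple modular redundancy: compute the result three times and take a majority vote, so a single flipped bit in one copy is outvoted; compute() is a hypothetical stand-in for real work

```c
#include <stdio.h>

/* hypothetical kernel standing in for real work */
static long compute(void) { return 42; }

/* majority vote: a single-event upset in one copy is masked;
   a double fault (all three differ) is not handled here */
static long vote(long a, long b, long c)
{
    if (a == b || a == c)
        return a;
    return b;   /* a disagrees with both, so trust b (== c) */
}

int main(void)
{
    long result = vote(compute(), compute(), compute());
    printf("%ld\n", result);
    return 0;
}
```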



processor futures

you can expect, for the short term, moore’s-law-like gains in processors’ peak performance

doubling of “performance” every 18-24 months

does not necessarily apply to application performance

moore’s law will not last forever

4-5 more turns (maybe?)


customer spending ($m)

[chart: worldwide customer spending, $0–$40,000m, 1995–2003, by processor family: cisc, risc, ia-32, ia-64; source: idc, february 2000]

technology disruptions

risc crossed over cisc in 1996

itanium will cross over risc in 2004



present high-end architectures

today’s high-end architecture is one of

smp

ccnuma

cluster of smp nodes

cluster of ccnuma nodes

japanese vector system

all of these architectures work

efficiency varies with application type



architectural issues

of the choices available, the smp is preferred; however

smp processor count is limited

cost of scalability is prohibitive

ccnuma addresses these limitations but induces its own

disparate latencies

better, but still limited, scalability

ras limitations

clusters too have pros and cons

huge latencies

low cost

etc.



physics

limitations imposed by physics have led us to architectures that have a deep memory hierarchy

the algorithmist and programmer must deal with, and exploit, the hierarchy to achieve good performance

this is part of the cerebral effort of parallel programming we mentioned earlier
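a standard example of that effort, offered as a sketch (the tile size B is an assumption, to be tuned so roughly three tiles fit in cache), is loop blocking a matrix multiply so each tile is reused from cache many times before eviction:

```c
#define N 1024
#define B 64        /* tile edge; tune so ~3*B*B doubles fit in cache */

/* blocked matrix multiply; assumes c[][] is zero-initialized */
void matmul_blocked(const double a[N][N], const double b[N][N],
                    double c[N][N])
{
    for (int ii = 0; ii < N; ii += B)                /* walk tiles */
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)    /* within a tile */
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            c[i][j] += a[i][k] * b[k][j];
}
```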



memory hierarchy

typical latencies for today’s technology



balanced system ratios

an “ideal” high-end system should be balanced wrt its performance metrics

for each peak flop/second

0.5–1 byte of physical memory

10–100 byte of disk capacity

4–16 byte/sec of cache bandwidth

1–3 byte/sec of memory bandwidth

0.1–1 bit/sec of interconnect bandwidth

0.02–0.2 byte/sec of disk bandwidth
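applying the midpoints of these ratios to a hypothetical 16-processor machine with 1 Gflop/s peak per processor (the configuration is invented for illustration) gives a sense of the absolute numbers:

```c
#include <stdio.h>

int main(void)
{
    double peak = 16 * 1.0e9;  /* hypothetical: 16 procs x 1 Gflop/s peak */

    /* midpoints of the balanced-system ratios, per peak flop/second */
    printf("memory:          %6.1f GB\n",     0.75 * peak / 1e9);
    printf("disk capacity:   %6.2f TB\n",     50.0 * peak / 1e12);
    printf("cache bw:        %6.1f GB/s\n",   10.0 * peak / 1e9);
    printf("memory bw:       %6.1f GB/s\n",    2.0 * peak / 1e9);
    printf("interconnect bw: %6.1f Gbit/s\n",  0.5 * peak / 1e9);
    printf("disk bw:         %6.2f GB/s\n",    0.1 * peak / 1e9);
    return 0;
}
```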



balanced system

applying the balanced system ratios to an unnamed contemporary 16-processor smp



storage

data volumes are growing at an extremely rapid pace

disk capacity sold doubled from 1997 to 1998

storage is an increasingly large percentage of the total server sale

disk technology is advancing too slowly

per generation of 1-1.5 years:

access time decreases 10%

spindle bandwidth increases 30%

capacity increases 50%
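the imbalance compounds (my arithmetic, not the slide’s): with capacity growing 50% per generation but spindle bandwidth only 30%, the time to read a full disk stretches every generation

```latex
\frac{t_{k+1}}{t_k}
  = \frac{(1.5\,C_k)/(1.3\,B_k)}{C_k/B_k}
  = \frac{1.5}{1.3} \approx 1.15
```

so a full-disk scan takes roughly 15% longer each generation, doubling in about five generations (1.15^5 ≈ 2)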



networks

only the standards will be widely deployed

gigabit ethernet

gigabyte ethernet

fibre channel (2x and 10x later)

sio

atm

dwdm backbones

the “last mile” problem remains with us

inter-system interconnect for clustering will not keep pace with the demands (for latency and bandwidth)



vendor’s constraints

rule #1: be profitable to return value to the shareholders

you don’t control the market size

you can only spend ~10% of your revenue on r&d

don’t fab your own silicon (hopefully)

you must be more than just a “technical computing” company

to not do this is to fail to meet rule #1 (see above)



market sizes

according to the industry analysts the technical market is, depending on where you draw the cut-line, $4-5 billion annually

the bulk of the market is small-ish systems (data from forest baskett at sgi)



a perspective

commercial computing is not an enemy

without the commercial market’s revenue our ability to build hpc-like systems would be limited

the commercial market benefits from the technology innovation in the hpc market

is performance “left on the table” in designing a system to serve both the commercial and technical markets?

yes



why?

lack of a cold war

performance of hpc systems has been marginalized

in the mid-70s, how many applications ran faster on a vax 11/780 than on the cray-1?

none

how many applications today run faster on a pentium than the cray t90?

some

current demand for hpc systems is elastic



future prognostication

computing in the future will be all about data and moving data

the growth in data volumes is incredible

richer media types (i.e., video) means more data

distributed collaborations imply moving data

e-whatever requires large, rapid data movement

more flops → more data



data movement

the scope of data movement encompasses:

register to functional unit

cache to register

cache to cache

memory to cache

disk to memory

tape to disk

system to system

pda to client to server

continent to continent

all of these are going to be important



epilogue

for hpc in the future

it is going to be risc processors

smp and ccnuma architectures

smp processor count relatively constant

technology trends are reasonably predictable

mpi, pthreads and openmp for parallelism

latency management will be crucial

it will be all about data



epilogue (cont’d)

for the computer industry in the future

trending toward “e-everything”

e-commerce

apps-on-tap

brokered services

remote data

virtual data centers

visualization

nt for development

vectors are dying

for hpc vendors in the future

there will be fewer 



conclusion

hpc users will need to yield more to what the industry can provide rather than vice-versa

vendor’s rule #1 is a cruel master

