Performance
  • What hardware accelerators are you using/evaluating?
  • Cells in a Roadrunner configuration
    • 8-way SPE threads w/ local memory, DMA & vector unit; programming issues, but tremendous flexibility
    • Fast (25.6 GB/s) & large memory (4GB or larger)
    • Augmented C language; also C++ & now Fortran; GNU & XL variants; OpenMP is new; OpenCL is being prototyped (see the SIMD sketch after this list)
    • Opterons can run bulk of code not needing acceleration; Cell-only clusters possible
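
To make the “augmented C language” point concrete, here is a minimal sketch of the SPU SIMD style using the spu_intrinsics.h vector extensions (my example, not from the talk). It assumes x and y are 16-byte aligned and n is a multiple of 4; the function name is illustrative.

    /* saxpy on the SPU vector unit: 4 floats per register (y = a*x + y). */
    #include <spu_intrinsics.h>

    void saxpy_simd(int n, float a, const float *x, float *y)
    {
        vector float va = spu_splats(a);   /* replicate a into all 4 lanes */
        const vector float *vx = (const vector float *)x;
        vector float *vy = (vector float *)y;

        /* One fused multiply-add per iteration handles 4 elements. */
        for (int i = 0; i < n / 4; i++)
            vy[i] = spu_madd(va, vx[i], vy[i]);
    }

The arithmetic is the easy part; as the “Steps in a Cell Conversion” slide later shows, most of the effort goes into staging data through the SPEs’ local stores.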
Performance
  • What hardware accelerators are you using/evaluating? Several years ago…
    • GPUs (pre-CUDA & Tesla)
      • Brook & Scout (LANL data-parallel language)
      • No 64-bit floating point at the time; limited memory; everything must be cast as a data-parallel problem
      • No ECC memory; insufficient parity/ECC protection of data paths and logic
      • Others at LANL are still working in this area (including Tesla & CUDA)
    • ClearSpeed (several years ago)
      • Earliest ClearSpeed boards, before the Advance families
      • Augmented C language; 96 SIMD PEs
      • Everything is done as long SIMD data-parallel and in sync
      • Low power
    • FPGAs (HDL, several years ago)
      • Programming is hard -- very hard
      • Logic space limited the number of 64-bit ops
      • Fast but small SRAM; external DRAM of modest size, but no faster than the CPUs'
      • One algorithm at a time, so a significant impact when used for multi-physics
      • Low power
Performance
  • Describe the applications that you are porting to accelerators.
    • MD (materials), laser-plasma PIC, IMC X-ray (particle) transport, GROMACS, n-body universe & galaxies, DNS turbulence & supernovae, HIV genealogy, nanowire long-time-scale MD
    • Ocean circulation, wildfires, discrete social simulations, clouds & rain, influenza spread, plasma turbulence, plasma sheaths, fluid instabilities

My personal observations:

    • Particle methods are generally easiest
    • Codes with good characteristics:
      • A few computationally intense “algorithms”
      • Pre-existing or obvious “fine-grain” parallel work units
      • C language versus Fortran or highly OO C++
Performance
  • Describe the kinds of speed-ups you are seeing (provide the basis for the comparison).
    • 5x to 10x over a single Opteron core for memory-BW-intensive code running at 5%-10% of peak (a rough estimate follows this list)
    • 10x to 25x on particle methods, searches, etc.
  • How does it compare to scaling out (i.e., just using more X86 processors)? What are the bottlenecks to further performance improvements?
    • Scale out via more sockets is better – BUT!
      • Scaling efficiencies are a problem already for several LANL applications running at 4,000 to 10,000 cores; scale out of LANL-sized machines means $$$ for HW, space, & power
      • Scaling out by multi-core is not a clear winner
    • Memory BW and cache architectures often limit performance, which Cells mostly get around
    • Memory BW per core is decreasing at an “inverse Moore’s law” rate!
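
A rough way to sanity-check these figures (my arithmetic, not from the talk): taking the ~25.6 GB/s Cell memory bandwidth quoted on the first slide, and assuming a single Opteron core of that era sustains very roughly 3-6 GB/s, a purely bandwidth-bound kernel gains about 4x-8x from bandwidth alone; explicit local-store reuse and SIMD are what push particle-style kernels into the 10x-25x range.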
Economics
  • Describe the programming effort required to make use of the accelerator.
    • ½ to 1 man-year to “convert” a code, mostly dealing with data structures and threaded parallelism designs.
    • The lack of debugging & similar tools is like the earliest days of parallel computing (LANL was a leader then as well – remember the early PVM Ethernet workstation “carpet” clusters in the mid-’80s, before MPPs)
    • We like to see 1-2 programming experts (PhD-level or equiv) assigned to forefront-science code projects which have 1 to 4+ physics experts (PhD-level)
  • Amortization
    • Ready for the future – codes and skilled programmers. We expect our dual-level (MPI+threads) & SIMD-vectorization techniques used for Roadrunner to pay off on future multi-core and many-core chips as well.
    • It’s not just about running codes this year. Others will have to work through new forms of parallelism soon.
    • We can do science now that isn’t possible with most other machines
Economics
  • Compare accelerator cost to the cost of scaling out
    • Commodity-processor-only machines would have cost 2X what Roadrunner did in 2006-2007 (~$80M more)
    • Would have used 2x or more power (~$1M per MW)
    • Significantly larger node counts cause scaling & reliability issues
    • Accelerators or heterogeneous chips should be greener
  • Ease of use issues
    • Newer Cell programming techniques (ALF, OpenMP) could make this easier.
    • A Cell cluster would be easier, but the PPE is really, really slow for non-SPU-accelerated code segments.
    • Not for the faint of heart, but Top20 machines never are
Futures
  • What is the future direction of hardware based accelerators?
    • Domain specific libraries can make them far more useful in those specific areas
    • Some may appear on Intel QPI or AMD HT.
    • Specialized cores will show up within commodity microprocessors – ignore them or use them
    • GPU-based systems will have to adopt ECC & parity protection
    • Convey appears to have the most viable FPGA approach (FPGA as a compiler-managed co-processor)
  • Software futures?
    • OpenCL looks promising but doesn’t address programming the specialized accelerator devices themselves
    • The uber-auto-wizard-compiler will never come
    • Heterogeneous compilers may come.
    • Debuggers & tools may come
  • What are your thoughts on what the vendors need to do to ensure wider acceptance of accelerators?
    • Create next generation versions and sell as mainstream products
Steps in a Cell Conversion
  • Compile & run on PowerPC PPE
  • Identify & isolate the algorithm & data to run in parallel on 8 “remote” SPEs
  • Compile scalar version of algorithm on SPE
    • Add SPE thread process control
    • Add DMAs
      • Use “blocking” DMAs at this stage, just for functionality (see the first sketch below)
      • Worry about data alignments
    • First on a single SPE, then on 8 SPEs
  • Optimize SPE code
    • SIMD, branchesmerges
    • Add asynch double/triple buffering of DMAs
  • For Roadrunner, connect to the rest of the code on the Opteron via DaCS and a “message relay”
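
A minimal sketch of the “blocking DMA” stage (SPE side, using the Cell SDK’s spu_mfcio.h interface): the PPE is assumed to pass the effective address of one 128-byte-aligned block of N floats in argp; N, the tag, and the compute kernel are illustrative, not from the talk.

    #include <spu_mfcio.h>

    #define N   1024   /* floats per block: 4 KB, under the 16 KB DMA limit */
    #define TAG 3      /* MFC DMA tag id (0..31) */

    /* Local-store DMA targets must be 16-byte aligned; 128 is fastest. */
    static float buf[N] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        (void)speid; (void)envp;

        /* Blocking get: pull the block from Cell memory into local store. */
        mfc_get(buf, argp, sizeof(buf), TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();   /* wait here -- "just for functionality" */

        /* Scalar compute in local memory; SIMDize in the optimization pass. */
        for (int i = 0; i < N; i++)
            buf[i] = buf[i] * 2.0f + 1.0f;

        /* Blocking put: push results back to Cell memory, then wait again. */
        mfc_put(buf, argp, sizeof(buf), TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();

        return 0;
    }

And a sketch of the later double-buffering stage, where the DMA for block i+1 overlaps the compute on block i using two buffers with one tag each (again my code, assuming nblocks contiguous blocks of N floats at effective address ea):

    #include <spu_mfcio.h>

    #define N 1024
    static float buf[2][N] __attribute__((aligned(128)));

    static void process(float *b)   /* stand-in compute kernel */
    {
        for (int i = 0; i < N; i++)
            b[i] = b[i] * 2.0f + 1.0f;
    }

    void run_blocks(unsigned long long ea, int nblocks)
    {
        const unsigned int bytes = sizeof(buf[0]);
        int cur = 0;

        if (nblocks <= 0) return;
        mfc_get(buf[0], ea, bytes, 0, 0, 0);   /* prefetch block 0 on tag 0 */

        for (int i = 0; i < nblocks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nblocks) {
                /* Drain the other buffer's earlier put (if any), then
                   start fetching block i+1 into it. */
                mfc_write_tag_mask(1 << nxt);
                mfc_read_tag_status_all();
                mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * bytes,
                        bytes, nxt, 0, 0);
            }

            /* Wait only for the current buffer, compute on it, then write
               it back asynchronously on the same tag. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();
            process(buf[cur]);
            mfc_put(buf[cur], ea + (unsigned long long)i * bytes,
                    bytes, cur, 0, 0);

            cur = nxt;
        }

        /* Drain all outstanding DMAs before returning. */
        mfc_write_tag_mask(3);
        mfc_read_tag_status_all();
    }

On the PPE side, the “SPE thread process control” step uses libspe2 (spe_context_create, spe_program_load, spe_context_run), typically with one pthread per SPE context since spe_context_run blocks.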
Roadrunner & LANL: addressing the shock moving through high-performance computing
  • Roadrunner is more than a petascale supercomputer for today’s use
    • provides a balanced platform to explore new algorithm designs and programming models, and to refresh developer skills
  • LANL has been an early adopter of transformational technology*:
    • 1970s: HPC is scalar → LANL adopts vector (Cray-1 w/ no OS)
    • 1980s: HPC is vector → LANL adopts data-parallel (big CM-2)
    • 2000s: HPC is multi-core clusters → LANL adopts hybrid (Roadrunner)

*Credit to Scott Pakin, CCS-1, for this list idea

Perspective: Fun or Nightmare?

[Original slide: a data-flow diagram spanning the Opteron host, the Cell PPC, and the Cell SPEs (x8 parallel), with node memory and Cell memory joined by a PCIe link; the numbered steps are reconstructed below.]

  • (1) Non-accelerated code runs on the node (Opteron) under MPI; the host launches the Cell code
  • (2) Host data is pushed/pulled to Cell memory via DaCS across the PCIe link
  • (3) The Cell’s serial PPC processor spawns parallel threads on the SPEs
  • (4) Each SPE DMA multi-buffers Cell data into its local memory, computes within its local-memory buffers (8-way parallel, all SPEs simultaneously), and DMA multi-buffers results back to Cell memory, until done
  • (5a) The node may need to push/pull more data to/from the Cell (DMA/DaCS) & to/from the cluster (MPI), or (5b) it could be available for concurrent work during this time
  • (6) Parallel SPE threads complete; updated data is pushed/pulled back to the host via DaCS; the Cell code completes and non-accelerated MPI code resumes

How much can be automated in compilers or languages?
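
As a coda to that question, here is a schematic, runnable mock of the host-side control flow in steps (1)-(6). The relay_* helpers are hypothetical stand-ins: the talk names DaCS but shows no API, so only the ordering of the steps comes from the slide.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical stand-ins for the DaCS launch/push/pull calls. */
    static void relay_launch(const char *kernel)   /* (1) */
    { printf("launch %s on the Cell\n", kernel); }
    static void relay_push(double *d, long n)      /* (2) */
    { (void)d; printf("push %ld doubles to Cell memory\n", n); }
    static void relay_wait(void)                   /* (3)-(4) run on the Cell */
    { printf("PPE spawns 8 SPE threads; SPEs multi-buffer DMAs and compute\n"); }
    static void relay_pull(double *d, long n)      /* (6) */
    { (void)d; printf("pull %ld updated doubles back to the host\n", n); }

    int main(void)
    {
        long n = 1 << 20;
        double *data = malloc(n * sizeof *data);

        relay_launch("spe_kernel");   /* (1) host launches Cell code        */
        relay_push(data, n);          /* (2) host data pushed to the Cell   */
        /* (5a/5b): here the host could push/pull more data or do
           concurrent MPI work for the cluster while the SPEs compute. */
        relay_wait();                 /* (3)-(5) Cell-side work completes   */
        relay_pull(data, n);          /* (6) updated data back; MPI resumes */

        free(data);
        return 0;
    }

Everything between relay_push and relay_pull is exactly the part the closing question asks about: how much of it can compilers or languages automate?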