  1. Workshop on Parallelization of Coupled-Cluster Methods
     Panel 1: Parallel efficiency
     An incomplete list of thoughts
     Bert de Jong, High Performance Software Development, Molecular Science Computing Facility

  2. Overall hardware issues
     • Computer power per node has increased
        • The increase in single-CPU speed has flattened out (but you never know!)
        • Multiple cores together tax the other hardware resources in a node
     • Bandwidth and latency for the other major hardware resources are far behind
        • Affecting the flops we actually use
     • Memory
        • Very difficult to feed the CPU (a simple bandwidth probe is sketched after this slide)
        • Multiple cores further reduce the bandwidth available to each
     • Network
        • Data access is considerably slower than from memory
        • The speed of light is our enemy
     • Disk input/output
        • Slowest of them all; disks spin only so fast
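To make the "very difficult to feed the CPU" point concrete, a simple triad loop exposes the gap between peak flops and sustained memory bandwidth. The sketch below is a generic, STREAM-style probe; the array size and the bytes-moved arithmetic are illustrative assumptions, not figures from the talk.

    /* Sketch: a STREAM-triad-style probe of sustained memory bandwidth.
     * Array size and bytes-moved arithmetic are illustrative assumptions.
     * Build, e.g.:  cc -O2 -fopenmp triad.c -o triad
     */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1L << 25)   /* ~32M doubles per array (~256 MB each), assumed */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;
        for (long i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; ++i)
            a[i] = b[i] + 3.0 * c[i];   /* 2 flops per ~24 bytes moved */
        double t1 = omp_get_wtime();

        /* bytes moved: 3 arrays x N doubles x 8 bytes */
        printf("triad: %.1f GB/s (a[N/2]=%g)\n",
               3.0 * N * 8.0 / (t1 - t0) / 1e9, a[N / 2]);
        free(a); free(b); free(c);
        return 0;
    }

Adding more threads typically saturates this number well before it saturates the cores, which is the sense in which extra cores "further reduce the bandwidth available to each".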

  3. Dealing with memory
     • The amounts of data needed in coupled cluster can be huge
        • Amplitudes
           • Too large to store on a single node (except for T1)
           • Shared memory would be good, but will shared memory of 100s of terabytes be feasible and accessible?
        • Integrals
           • Recompute vs. store (on disk or in memory)
           • Can we avoid access to memory when recomputing?
     • Coupled cluster has one advantage: it can easily be formulated as matrix multiplication (see the sketch after this slide)
        • Can be very efficient: DGEMM on EMSL's 1.5 GHz Itanium-2 system reached over 95% of peak efficiency
        • As long as we can get all the needed data into memory!
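To illustrate the matrix-multiplication formulation, the sketch below writes a CCD-like contraction R(ij,ab) += sum_cd T(ij,cd) V(cd,ab) as a single DGEMM over compound indices. The array names and block sizes are placeholders for illustration only, not NWChem's actual data layout.

    /* Sketch only: a coupled-cluster-style contraction written as one DGEMM.
     * Names (t2, v, r2) and block sizes are illustrative assumptions.
     * Build, e.g.:  cc ccd_dgemm.c -lopenblas   (any CBLAS works)
     */
    #include <stdlib.h>
    #include <cblas.h>

    int main(void)
    {
        const int no = 8, nv = 20;     /* assumed occupied/virtual block sizes */
        const int noo = no * no;       /* compound index (ij) */
        const int nvv = nv * nv;       /* compound index (ab) or (cd) */

        double *t2 = calloc((size_t)noo * nvv, sizeof *t2);  /* T(ij,cd) */
        double *v  = calloc((size_t)nvv * nvv, sizeof *v);   /* V(cd,ab) */
        double *r2 = calloc((size_t)noo * nvv, sizeof *r2);  /* R(ij,ab) */
        if (!t2 || !v || !r2) return 1;

        /* R(ij,ab) += sum_cd T(ij,cd) * V(cd,ab): one matrix-matrix multiply,
         * which is where DGEMM can reach a large fraction of peak flops. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    noo, nvv, nvv,
                    1.0, t2, nvv,
                         v,  nvv,
                    1.0, r2, nvv);

        free(t2); free(v); free(r2);
        return 0;
    }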

  4. Dealing with networks
     • With 10s of terabytes of data on distributed-memory systems, getting data from remote nodes is inevitable
     • This need not be a problem, as long as you can hide the communication behind computation
        • Fetch data while computing = one-sided communication (sketched after this slide)
        • NWChem uses Global Arrays to accomplish this
     • Issues are:
        • Low bandwidth and high latency relative to increasing node speed
        • Non-uniform networks
           • Cabling a full fat tree can be cost prohibitive
           • Effect of network topology
           • Fault resiliency of the network
        • Multiple cores need to compete for a limited number of buses
        • Data contention increases with increasing node count
     • Data locality, data locality, data locality
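The "fetch data while computing" idea can be sketched with plain MPI-3 one-sided operations. NWChem itself does this through Global Arrays, which are built on one-sided put/get, so the code below is only a generic stand-in for the concept; the tile size and neighbor choice are assumptions.

    /* Minimal sketch of one-sided "get while you compute", using MPI-3 RMA.
     * This is a stand-in for the idea, not the Global Arrays API.
     * Build, e.g.:  mpicc onesided.c && mpirun -np 2 ./a.out
     */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024   /* assumed tile size */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Every rank exposes a local tile of a conceptually distributed array. */
        double *local = calloc(N, sizeof *local);
        MPI_Win win;
        MPI_Win_create(local, (MPI_Aint)(N * sizeof(double)), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        double *remote = malloc(N * sizeof *remote);
        int target = (rank + 1) % size;   /* neighbor chosen for illustration */

        MPI_Win_fence(0, win);
        /* Start the remote fetch ... */
        MPI_Get(remote, N, MPI_DOUBLE, target, 0, N, MPI_DOUBLE, win);
        /* ... and overlap it with work on data we already hold. */
        double sum = 0.0;
        for (int i = 0; i < N; ++i) sum += local[i];
        MPI_Win_fence(0, win);            /* fetch guaranteed complete here */

        (void)sum;
        MPI_Win_free(&win);
        free(local); free(remote);
        MPI_Finalize();
        return 0;
    }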

  5. Dealing with spinning disks
     • Using local disk
        • Will only contain data needed by its own node
        • Can be fast enough if you put a large number of spindles behind it
        • And, again, if you can hide the I/O behind computation (pre-fetch; sketched after this slide)
        • With 100,000s of disks, the chance of failure becomes significant
           • Fault tolerance of the computation becomes an issue
     • Using globally shared disk
        • Crucial when going to very large systems
        • Allows for large files shared by large numbers of nodes
        • Lustre file systems of petabytes are possible
        • Speed is limited by the number of access points (hosts)
           • Large numbers of reads and writes need to be handled by a small number of hosts, creating lock and access contention
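As one illustration of hiding local-disk I/O behind computation, the sketch below issues an asynchronous read with POSIX AIO and only waits when the data is actually needed. The file name ("integrals.dat") and block size are assumptions; production codes typically use their own prefetching I/O layer.

    /* Sketch: overlap a disk read with computation using POSIX AIO.
     * File name and block size are assumptions for illustration only.
     * Build, e.g.:  cc prefetch.c -lrt
     */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK (1 << 20)   /* 1 MiB per read, assumed */

    int main(void)
    {
        int fd = open("integrals.dat", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(BLOCK);
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = BLOCK;
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        /* Do useful work on data already in memory while the read is in flight. */
        /* ... compute on the previous block here ... */

        /* Only block when the new data is actually needed. */
        while (aio_error(&cb) == EINPROGRESS)
            ;   /* real code would keep computing or call aio_suspend() */
        ssize_t n = aio_return(&cb);
        if (n < 0) perror("aio_return");

        free(buf);
        close(fd);
        return 0;
    }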

  6. What about beyond 1 petaflop?
     • Possibly 100,000s of multicore nodes
        • How does one create a fat enough network between that many nodes?
     • Possibly 32, 64, 128 or more cores per node
        • All cores simply cannot do the same thing anymore
           • Not enough memory bandwidth
           • Not enough network bandwidth
     • Heterogeneous computing within a node (CPU+GPU)
     • Designate nodes for certain tasks (a role-splitting sketch follows this slide)
        • Communication
        • Memory access, put and get
        • Recomputing integrals, hopefully using cache only
        • DGEMM operations
     • Task scheduling will become an issue
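One simple way to designate nodes for certain tasks is to split the ranks by role. The sketch below splits off one in every sixteen ranks as a data-server rank and leaves the rest for compute; the 1-in-16 ratio and the role names are assumptions for illustration, not a scheme from the talk.

    /* Sketch: carve MPI ranks into "data server" and "compute" roles.
     * The 1-in-16 split is an arbitrary assumption for illustration.
     * Build, e.g.:  mpicc roles.c && mpirun -np 32 ./a.out
     */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Role 0: serve memory/disk requests; role 1: integral recomputation
         * and DGEMM-heavy work. */
        int role = (rank % 16 == 0) ? 0 : 1;

        MPI_Comm role_comm;   /* communicator containing ranks of the same role */
        MPI_Comm_split(MPI_COMM_WORLD, role, rank, &role_comm);

        if (role == 0) {
            /* ... service put/get and I/O requests for the compute ranks ... */
        } else {
            /* ... recompute integrals and run DGEMM operations ... */
        }

        MPI_Comm_free(&role_comm);
        MPI_Finalize();
        return 0;
    }

Whether such a static split beats a dynamic task scheduler is exactly the open question the slide raises.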

  7. W. R. Wiley Environmental Molecular Sciences Laboratory
     A national scientific user facility integrating experimental and computational resources for discovery and technological innovation
