
Leveraging Hierarchy: Is this our Undiscovered Country?


Presentation Transcript


  1. Leveraging Hierarchy: Is this our Undiscovered Country? John T. Daly

  2. Undiscovered Country: Cost vs. Risk? [Figure: log(performance) vs. time, with ~15-year technology generations labeled vector, parallel (in), latency hiding, parallel (out), concurrency, and data movement, and “exascale?” at the far end]

  3. Advanced Computing Systems (ACS)
  • HPC capability doubles every 14 months, but data doubles every 9 months
  • Innovative solutions required to bridge the gap
  • Partner with industry, academia and national labs to develop technology enablers for next-generation computing
  • Generate a steady stream of capability; no “end goal” for scaling

  4. ACS: Bridge to research community [Diagram: participatory research linking the agency compute mission with universities, national labs, government, industry, and the CEC; mission problems and technical challenges flow one way, mission capability and technical solutions flow back (mirroring)]

  5. ACS: technical thrusts + end-to-end
  • Our HPC stakeholders:
    • System integrator optimizes power, performance and reliability for a set number of dollars
    • System user optimizes usability, dependability and time-to-solution for a set number of deliverables
  • Point solutions in six technical thrusts: power efficiency, chip I/O, interconnects, productivity, file I/O and resilience
  • Innovative end-to-end solutions:
    • AMOEBA: chip-level data movement and packaging
    • MYRIAD(?): system-level modeling and simulation

  6. Extreme is not necessarily “balanced”
  • Traditional HPC is an important part of ACS, but not the only part
  • Dynamic design space drives the need for simulation and an abstract machine model
  • Goal: scientific understanding in HPC
  [Diagram: the six technical thrusts (productivity, interconnect, file I/O & storage, chip I/O, resilience, power efficiency) grouped into “traditional HPC and ACS too” vs. “also ACS, but maybe not traditional HPC”]

  7. Future “convergence”?
  • Today:
    • Predictive science starts with an initial model and runs a numerical experiment to generate lots of data
    • Data analytics starts with lots of data and extracts features or information that characterize the data
  • Tomorrow:
    • Predictive science uses in situ data analytics to reduce the data storage and post-processing requirements
    • Data analytics uses in situ predictive science to ask the question “what ought this data to look like?”
  [Slide graphic: “?” and “!” callouts contrasting the two approaches; tagline: Advancing Intelligence Through Science]

  8. Energy is the next shared resource
  • Off-node communication is over budget
  • Off-chip communication is over budget
  [Diagram: the six thrusts (power efficiency, resilience, productivity, chip I/O, interconnect, file I/O) around the shared energy resource]
  Source: DOE Architectures and Technology for Extreme Scale Computing, San Diego, CA

  9. Data is the challenge of scale
  • Energy, performance and data integrity tapers are a function of the distance between the data and the processor
  • Data locality is key to computing at scale for optimizing right answers per Joule per second
  • Spatial locality allows me to grab more data in a single memory transaction
  • Temporal locality allows me to use the same data multiple times before I have to move it (see the sketch below)
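
  A minimal sketch of the two localities named above, in generic C (not from the talk; the matrix size N and tile size B are assumed values): the first loop nest walks memory in stride-1 order to exploit spatial locality, and the second uses blocking so each tile is reused many times from cache, exploiting temporal locality.

    /* Hedged sketch: spatial vs. temporal locality. N and B are illustrative. */
    #include <stddef.h>

    #define N 1024
    #define B 64   /* block (tile) size chosen to fit in cache */

    /* Spatial locality: walk the matrix in row-major (stride-1) order so each
       cache line fetched from memory is fully used before it is evicted. */
    double sum_row_major(const double a[N][N])
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)   /* contiguous accesses */
                s += a[i][j];
        return s;
    }

    /* Temporal locality: blocked matrix multiply reuses each B x B tile of a
       and b many times while it is still resident in cache, instead of
       streaming the whole matrices from memory for every element of c. */
    void matmul_blocked(const double a[N][N], const double b[N][N], double c[N][N])
    {
        for (size_t ii = 0; ii < N; ii += B)
            for (size_t jj = 0; jj < N; jj += B)
                for (size_t kk = 0; kk < N; kk += B)
                    for (size_t i = ii; i < ii + B; i++)
                        for (size_t j = jj; j < jj + B; j++) {
                            double t = c[i][j];
                            for (size_t k = kk; k < kk + B; k++)
                                t += a[i][k] * b[k][j];
                            c[i][j] = t;
                        }
    }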

  10. A role for NV (non-volatile memory) in the hierarchy [Image source: http://www.bit-tech.net/hardware/memory/2007/11/15/the_secrets_of_pc_memory_part_1/3]

  11. Node architecture = “shops” of data
  • Byte/word-addressable memory up and down the stack, block synchronous between stacks
  • Control is a data aggregator (e.g., gather/scatter; see the sketch below)
  [Diagram: two memory stacks, each with RAM and RAM/NVRAM layers and per-layer control feeding a processor/control unit]
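
  To make the gather/scatter role of the control logic concrete, here is a small, hedged C sketch (illustrative only, not the node design on the slide): a gather copies scattered elements into a contiguous buffer so later passes are stride-1, and a scatter writes them back through the same index list.

    /* Hedged sketch: software gather/scatter, standing in for the data
       aggregation the slide attributes to memory-side control logic. */
    #include <stddef.h>

    /* Gather: copy the elements of src selected by idx into the contiguous
       buffer dst, so later passes over dst enjoy stride-1 locality. */
    void gather(double *dst, const double *src, const size_t *idx, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[idx[i]];
    }

    /* Scatter: write the contiguous buffer src back to the locations of dst
       selected by idx. */
    void scatter(double *dst, const double *src, const size_t *idx, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[idx[i]] = src[i];
    }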

  12. Exploiting Spatial Locality
  • Fractal memory
    • Create a virtual mapping of data lines to space-filling curves (e.g., Jin and Mellor-Crummey, “Using Space-filling Curves for Computation Reordering”; see the Morton-order sketch below)
    • Use memory control logic to resolve mappings
    • Dynamic mapping by user via PM interface
  • Move work to data
    • Adaptive mesh refinement is a refine operation spawned at another memory component
    • Map memory references back to processor
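
  One common way to realize the space-filling-curve mapping cited above is a Morton (Z-order) encoding of 2-D indices; the hedged C sketch below is a generic illustration, not the fractal-memory design itself, and simply interleaves the bits of (x, y) so that cells that are close in 2-D tend to be close in the linear address space.

    /* Hedged sketch: Morton (Z-order) index as one concrete space-filling-curve
       mapping; 16-bit coordinates interleaved into a 32-bit key. */
    #include <stdint.h>

    /* Spread the low 16 bits of v so that a zero bit separates each data bit. */
    static uint32_t part1by1(uint32_t v)
    {
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    /* Morton key: neighboring (x, y) cells map to nearby linear addresses,
       which is what lets the memory system keep 2-D neighborhoods in the
       same lines and pages. */
    uint32_t morton2d(uint32_t x, uint32_t y)
    {
        return part1by1(x) | (part1by1(y) << 1);
    }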

  13. Exploiting Temporal Locality
  • Global one-sided memory model
    • Different processors updating the same values in a PDE solver create race conditions
    • You’re going to get the wrong answer anyway, so checkpoint asynchronously and use QMU (see the checkpoint-interval sketch below)
    • Inherently resilient algorithms that avoid global synchronization
  • Reconfigurable hierarchy: “cache” vs. “scratch pad”
    • “Cache” is seamless and easy to use, but sometimes I’d like to be able to bypass it
    • “Scratch pad” avoids duplicating memory and can be higher performing, but it is harder to use
    • Is SSD going to work like “cache” or “scratch pad”?
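
  As a companion to the checkpointing bullet above, a common first-order rule of thumb for the optimum checkpoint interval is tau ~ sqrt(2 * delta * M), with delta the checkpoint write time and M the mean time to interrupt; the C sketch below evaluates it for assumed values of delta and M (not figures from the talk).

    /* Hedged sketch: first-order optimal checkpoint interval,
       tau ~ sqrt(2 * delta * M). The numbers are illustrative assumptions. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double delta = 300.0;     /* assumed checkpoint write time: 5 minutes */
        double mtti  = 86400.0;   /* assumed mean time to interrupt: 24 hours */
        double tau   = sqrt(2.0 * delta * mtti);   /* first-order optimum */
        printf("checkpoint every ~%.0f s (~%.1f h)\n", tau, tau / 3600.0);
        return 0;
    }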

  14. Motivating example: Exa-sorting
  • Many linear solution methods are already robust against errors and data race conditions (e.g., multigrid methods)
  • What about an application like sorting?
    • A gradient descent approach is robust under errors* and can be parallelized asynchronously (see the sketch below)
    • Suggests the possibility of research into an asynchronous parallel minimization approach for other classes of problems
  • How about non-linear solvers?
    • Analogy in minimization of the objective function via solution of the adjoint problem? [Slide annotation: non-linear term]
    • What about chaotic systems?
  * Joseph Sloan, David Kesler, Rakesh Kumar, and Ali Rahimi. “A Numerical Optimization-based Methodology for Application Robustification: Transforming Applications for Error Tolerance”. DSN 2010, Chicago, July 2010.
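
  The cited DSN 2010 paper recasts applications as numerical minimization so that errors only perturb convergence; as a loose, hedged illustration of that idea (not the paper’s algorithm), the C sketch below treats sorting as minimizing the number of inversions and applies local compare-exchange “descent” steps in arbitrary order, so progress survives reordering or occasionally lost steps.

    /* Hedged sketch (not the method of the cited paper): sorting viewed as
       minimizing an objective, the number of inversions. Each productive
       adjacent swap strictly lowers the total inversion count, which is why
       the steps can be applied in arbitrary order; a real asynchronous
       version would run these steps from many threads. */
    #include <stdlib.h>

    static void swap(double *a, double *b) { double t = *a; *a = *b; *b = t; }

    static size_t adjacent_inversions(const double *x, size_t n)
    {
        size_t inv = 0;
        for (size_t i = 0; i + 1 < n; i++)
            if (x[i] > x[i + 1]) inv++;
        return inv;
    }

    /* Repeatedly pick a random adjacent pair and fix it if it is out of
       order; stop once no adjacent inversions remain (objective is zero). */
    void stochastic_descent_sort(double *x, size_t n)
    {
        if (n < 2) return;
        while (adjacent_inversions(x, n) > 0) {
            size_t i = (size_t)(rand() % (int)(n - 1));
            if (x[i] > x[i + 1])
                swap(&x[i], &x[i + 1]);
        }
    }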

  15. From the user/developer perspective
  • Domain-specific language to serve as a portable wrapper for the domain user and SME
  • Support for a globally addressable memory space
  • Easy one-sided and two-sided, synchronous and asynchronous access to remote data (see the MPI sketch below)
  • Intuitive mechanism for lightweight thread creation and remote task invocation
  • Application control over dynamically reconfigurable memory (hardware cache, software cache and software scratch) at each level of the memory hierarchy (chip, node and storage)
  • Tools for monitoring memory and energy utilization, so I know when I’m swapping to DIMM!
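
  One existing approximation of the “easy one-sided access to remote data” requested above is MPI’s RMA interface; the hedged C sketch below exposes one double per rank in a window and reads a neighbor’s value with MPI_Get (the neighbor choice and values are illustrative).

    /* Hedged sketch: one-sided remote access with standard MPI RMA, as one
       existing approximation of the globally addressable, one-sided model
       the slide calls for. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = 100.0 * rank;   /* each rank exposes one double */
        double remote = -1.0;
        MPI_Win win;
        MPI_Win_create(&local, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        int neighbor = (rank + 1) % size;

        /* One-sided read: fetch the neighbor's value without the neighbor
           posting a matching receive. */
        MPI_Win_fence(0, win);
        MPI_Get(&remote, 1, MPI_DOUBLE, neighbor, 0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        printf("rank %d read %.1f from rank %d\n", rank, remote, neighbor);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }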

  16. Conclusions
  • Exascale arrives at the end of the technology generation bridging concurrency to data: risk or opportunity?
  • Traditional algorithms + architectures too expensive in power, performance and reliability if data leaves cache
  • Rethinking computation may yield large ROI:
    • models of computation
    • “balanced architecture”
    • predictive science vs. data analytics
  • Required to facilitate new approaches:
    • programming models and tools
    • simulation and modeling framework
    • vendor partnerships and technology investment
