
“Fault Resilience for HPC Applications on Exascale Systems” – Dan Quinlan, LLNL



Presentation Transcript


  1. “Fault Resilience for HPC Applications on Exascale Systems” – Dan Quinlan, LLNL
     ASCR Computer Science Highlight

     Objectives
     Create an automated compiler transformation to assist programmers in DOE in integrating memory-related fault resilience into their applications:
     • Create a memory-efficient fault resilience technique at the compiler level
     • Automatically introduce runtime fault resilience checks, with some support for error correction

     Impact
     • Automated approach to addressing the resilience challenge of exascale computing
     • Assist application sustainability in exascale environments, where memory failures may occur every 2 hours [DARPA ExaScale Study 2008 Report]

     Accomplishments 2011
     • Developed a compiler transformation for instrumenting memory references in scientific kernels with fault resilience checks
     • Designed a library to support runtime detection of memory errors
     • Implemented a fault resilience technique with a block parity algorithm

     [Workflow diagram; labels: Scientific Application; ROSE Compiler Transformation; Instrumented Application (Fault Resilience Checks); Runtime Support (Block Parity Algorithm); Application Execution; Memory Reference Hashmap. Outcomes: No Error (normal output); Errors Corrected (normal output); Errors Detected (exception).]
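The slide names a block parity algorithm with three outcomes (no error, errors corrected, errors detected) but gives no details of the scheme. The following is a minimal sketch of one way such a check could work, assuming a two-dimensional parity layout (one XOR word per block plus one parity bit per word). It is not the ROSE implementation; every name here (`protect`, `check`, `MemoryFaultError`, `parity_table`) is hypothetical, and the dictionary is only a stand-in for the slide's memory reference hashmap.

```python
from functools import reduce
from operator import xor


class MemoryFaultError(Exception):
    """Raised when a block fails its parity check and cannot be repaired."""


def word_parity(word):
    # Parity bit of one word: 1 if it has an odd number of set bits.
    return bin(word).count("1") & 1


def protect(block):
    # Column parity: XOR of all words. Row parity: one parity bit per word.
    return reduce(xor, block, 0), [word_parity(w) for w in block]


def check(block, column, rows):
    """Verify `block` in place; return "ok" or "corrected", or raise."""
    syndrome = reduce(xor, block, 0) ^ column
    bad_rows = [i for i, w in enumerate(block) if word_parity(w) != rows[i]]
    if syndrome == 0 and not bad_rows:
        return "ok"
    # One word with wrong parity and one differing bit position is
    # consistent with a single-bit flip, which can be undone.
    if len(bad_rows) == 1 and bin(syndrome).count("1") == 1:
        block[bad_rows[0]] ^= syndrome
        return "corrected"
    raise MemoryFaultError("uncorrectable memory error in block")


# Hypothetical stand-in for the "memory reference hashmap":
# block name -> parity metadata recorded when the block was last written.
parity_table = {}
data = [0x12, 0x34, 0x56, 0x78]
parity_table["data"] = protect(data)

data[2] ^= 0x08  # simulate a single-bit memory fault
status = check(data, *parity_table["data"])
```

In this sketch the metadata costs one extra word plus one bit per word for each block, which is the kind of trade-off a memory-efficient scheme would aim for. Parity of this form only guarantees repair of a single flipped bit per block; some multi-bit patterns are reported as uncorrectable, and others can alias to a single-bit pattern and be miscorrected.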
