
Integer ALU2



Presentation Transcript


  1. Diagnosing Intermittent Faults Using Software Techniques
Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan
The University of British Columbia

Intermittent Faults
• Intermittent hardware faults are bursts of errors that occur at the same location and last from a few cycles to a few seconds.
• Intermittent faults will be a significant concern in future processors.
[Figure labels: Transient Fault, Intermittent Faults, time]

Research Motivation
• Diagnosis is vital in guiding fine-grained recovery techniques (e.g., hardware reconfiguration) and hence in letting the processor keep operating with degraded performance.
If core 8 malfunctions, two recovery options are available:
1. The whole of core 8 is disabled, without fine-grained diagnosis, or
2. Part of core 8 is disabled, with fine-grained diagnosis.
[Figure labels: Chip (three instances), Core 1 to Core 8]

Research Objective
[Figure labels: Program Execution, Intermittent Error, Faulty Instructions, Isolate Fault-Prone Unit]

Failure Diagnosis Technique Goals
• Requires no hardware support,
• Provides formal guarantees of correctness and completeness,
• Scalable,
• Few false positives.

Modeling Intermittent Faults Impact on Programs - Example
• Use Dynamic Dependency Graph (DDG).
[Figure labels: Fault Model, Crash Model, SimpleScalar Simulator, Dynamic Dependency Graph]

Modeling Intermittent Faults Impact on Programs - Results
• The DDG model is more than two orders of magnitude faster than equivalent fault-injection experiments.
• Of the intermittent faults that are non-benign, 95% result in a program crash.
• 89 to 93% of the faults' crash distances are within 100 nodes.
• 91 to 95% of the faults cause the program to crash within 300 nodes of the fault's start.

Overview of the Diagnosis Approach
[Figure labels: Program Execution, Intermittent Error, Program Crash/Error Detected, Crash Dump File (e.g., crash state and inputs), Faulty Instructions, Isolate Fault-Prone Unit]
• Run Fault-Free
• Construct DDG
• Diagnose Error
[Figure labels for the diagnosis stages: Isolate Instructions First Affected by the Fault, Identify Erroneous Data, Identify Instructions that Change Erroneous Data, Identify Faulty Instructions]

Overview of the Diagnosis Approach - Example
1. Intermittent Error
• An intermittent fault affected nodes 14-18,
• Crash instruction: 27,
• Erroneous data: 14, 17, 16, 19 and 21.
2. • Back trace erroneous data in DDG.
3. • Expected fault spans over nodes 14-19.
• Actual fault affected nodes 14-18.

Potential Hardware Support
[Figure labels: IF/ID Stage, ID/EXE Stage, EXE/MEM Stage, MEM/WB Stage, PC, I-Cache, D-Cache, Reg File, DEC 1, DEC 2, Integer ALU1, Integer ALU2, Instruction Scheduler, Array_Addr, Enable Lines, Filtering, Expected IPS and CD, Actual IPS and CD]

Operating Systems Directions
• Map tasks to cores based on the core's functioning units and the task's requirements.
• Modify a program on the fly to avoid using malfunctioning units.
• Provide feedback to the instruction scheduler about the malfunctioning units, so that minimal performance overhead is incurred.

Conclusions
• Diagnosis is vital in guiding fine-grained recovery.
• Diagnosing intermittent faults using software techniques is possible.
• Most intermittent faults cause the program to crash shortly after the fault's start.

Contact Information
Layali Rashid
PhD Candidate
Department of Electrical and Computer Engineering
The University of British Columbia
lrashid@ece.ubc.ca
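
The step "Back trace erroneous data in DDG" from the diagnosis example can be illustrated with a minimal Python sketch. It is not the authors' implementation. The graph edges, the node values, and the helper names back_trace and first_affected are all hypothetical; only the crash node (27) and the set of erroneous nodes (14, 16, 17, 19, 21) echo the poster. The sketch assumes that a fault-free replay supplies a "golden" value for every DDG node, and it is not meant to reproduce the poster's expected fault span of 14-19.

```python
from collections import deque

# Hypothetical DDG: each node lists the nodes whose results it consumes.
PREDS = {
    27: [21], 21: [19, 20], 19: [16, 17], 20: [13],
    17: [15], 16: [14], 14: [12], 15: [11],
    13: [], 12: [], 11: [],
}

# Invented values from the fault-free replay, and from the crashed run.
GOLDEN = {11: 1, 12: 2, 13: 3, 14: 4, 15: 5, 16: 6,
          17: 7, 19: 8, 20: 9, 21: 10, 27: 11}
FAULTY = dict(GOLDEN)
FAULTY.update({14: -4, 16: -6, 17: -7, 19: -8, 21: -10})
del FAULTY[27]  # the crash node never produced a value


def back_trace(preds, faulty, golden, crash_node):
    """Walk the DDG backward from the crash node and collect every
    reachable node whose value differs from the fault-free run."""
    erroneous = set()
    seen = {crash_node}
    queue = deque([crash_node])
    while queue:
        node = queue.popleft()
        # The crash node is the symptom, so it is not counted as data.
        if node != crash_node and faulty.get(node) != golden.get(node):
            erroneous.add(node)
        for pred in preds.get(node, ()):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return erroneous


def first_affected(preds, erroneous):
    """Erroneous nodes whose operands were all correct: the error must
    have been introduced while executing these instructions."""
    return sorted(
        n for n in erroneous
        if not any(p in erroneous for p in preds.get(n, ()))
    )


erroneous_nodes = back_trace(PREDS, FAULTY, GOLDEN, crash_node=27)
print(sorted(erroneous_nodes))                  # [14, 16, 17, 19, 21]
print(first_affected(PREDS, erroneous_nodes))   # [14, 17]
```

In this toy graph, nodes 16, 19 and 21 are erroneous only because they consume already corrupted operands, so the sketch reports nodes 14 and 17 as the instructions first affected by the fault.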
