
Self-Learning and Adaptive Functional Fault Diagnosis A Look at What is Possible with “Data”



Presentation Transcript


  1. Self-Learning and Adaptive Functional Fault Diagnosis: A Look at What is Possible with “Data” Krishnendu Chakrabarty, Department of Electrical & Computer Engineering, Duke University, Durham, NC

  2. Acknowledgments • Fangming Ye, Duke University • Zhaobo Zhang, Huawei • Xinli Gu, Huawei • Sponsors: Cisco (until 2011), Huawei (since 2011)

  3. ITC 2014 Tutorial • Tutorial 9: • TEST, DIAGNOSIS, AND ROOT-CAUSE IDENTIFICATION OF FAILURES FOR BOARDS AND SYSTEMS • Presenters: KRISHNENDU CHAKRABARTY, WILLIAM EKLOW, ZOE CONROY • The gap between working silicon and a working board/system is becoming more significant as technology scales and complexity grows. The result of this increasing gap is failures at the system level that cannot be duplicated at the component level. These failures are most often referred to as “NTFs” (No Trouble Founds). The problem will only get worse as technology scales and will be compounded as new packaging techniques (SiP, SoC, 3D) extend and expand Moore’s law. This tutorial will provide detailed background on NTFs and will present DFT, test, and root-cause identification solutions at the board/system level. 

  4. Automated Diagnosis: Wishlist [Flow: failed board → functional test → automated diagnosis system → fixed board] • Automated and accurate diagnosis • Reduced diagnosis and repair costs • Identify the most-effective syndromes • Accelerate product release • Self-learning • Report ambiguity • Develop new tests • Analyze and update current test set

  5. What Data Should We Collect? • Pass/fail information, extent of mismatch with expected values, performance marginalities, etc. • Counter values, BIST signatures, sensor data • Example: A segment of traffic-test log • ## Summary: Interfaces< r2d2 -- metro > counts - Fail(mismatch) • ……464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR • …… • Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid

  6. What Can We Do With This Data? • Train machine-learning models for root-cause localization • Identify redundant syndromes (i.e., test outcome data) and do data pruning • Identify deficiencies of tests in terms of diagnostic ambiguity and provide guidance for test redesign • “Fill in the gaps” when the data is not sufficient for precise diagnosis

  7. Key Challenges and Solutions • How to improve diagnosis accuracy? • Support-vector machines and incremental learning [Ye et al., TCAD’14] • Multiple classifiers and weighted-majority voting [Ye et al., TCAD’13] • How to speed up diagnosis? • Decision trees [Ye et al., ATS’12] • How to do diagnosis using incomplete information? • Imputation methods [Ye et al., ATS’13] • What can we learn from past diagnosis? • Root-cause and syndrome analysis [Ye et al., ETS’13] • Knowledge discovery and knowledge transfer [Ye et al., ITC’14]

  8. Syndrome and Root-Cause • A segment of a traffic-test log • Syndromes are test outcomes parsed from the log (e.g., the error identifiers below) • Root causes are replaced components, e.g., C37 • ## Summary: Interfaces< r2d2 -- metro > counts - Fail(mismatch) • ……464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR • …… • Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid
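As a concrete illustration of how syndromes might be parsed out of such a log, here is a minimal Python sketch. The regular expression and the assumed field layout are modeled only on the fragment above; they are not the actual production log schema.

```python
import re

# Illustrative only: pull error identifiers (treated here as syndromes) out of a
# traffic-test log fragment like the one above. The pattern and field layout are
# assumptions, not the real production log format.
ERROR_PATTERN = re.compile(r"\(([0-9A-F]{8})\)\s+(?:ERR\s+\w+\s+)?(\w+)")

def extract_syndromes(log_text):
    """Return the set of syndrome names found in a log fragment."""
    return {m.group(2) for m in ERROR_PATTERN.finditer(log_text)}

log = """464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR
Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid"""

print(extract_syndromes(log))
# -> {'R2D2_ARIC_CP_DBUS_CRC_ERR', 'DIAGERR_ERRISO_INVALID_PKT_CNT'}
```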

  9. Success Ratio • Success ratio = fraction of boards for which the predicted root cause matches the actual root cause • In the slide’s example, 3 of 4 predictions match: success ratio = 3/4 = 75%
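This metric can be computed directly from the repair records; the sketch below uses hypothetical board labels.

```python
def success_ratio(predicted, actual):
    """Fraction of boards whose predicted root cause matches the repair record."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels mirroring the slide: 3 of 4 predictions are correct.
print(success_ratio(["C37", "C37", "U12", "U12"],
                    ["C37", "C37", "U12", "C37"]))   # 0.75
```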

  10. Using Machine Learning (Support-Vector Machines) • A binary classifier based on statistical learning • Train the model with data • Use the trained model for diagnosis [Figure: (a) a feasible separating line vs. (b) the optimal maximum-margin separation between the two classes]

  11. Example • Suppose we have 6 cases (successfully debugged boards) for training • Let x1, x2, x3 be three syndromes. If the syndrome manifests itself, we record it as 1, and 0 otherwise. • Suppose that the board has two candidate root causes A and B, and we encode them as y = 1 and y = −1 • Merge the syndromes and the known root causes into matrix A = [B|C], where the left side (B) contains the syndromes and the right side (C) contains the corresponding fault classes

  12. Example (Contd.) • SVM calculation: w1 = 1.99, w2 = 0, w3 = 0, and b = −1.00 • Therefore, the classifier is f(x) = w1·x1 + w2·x2 + w3·x3 + b = 1.99·x1 − 1.00 • Given a new failing system with syndrome [1 0 0], f(x) > 0; the root cause for this failing system is A (y = 1) • Given another new failing system with syndrome [0 1 0], f(x) < 0; the root cause for this failing system is B (y = −1) • The margin is the distance between the classifier and the support vectors
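This toy example can be reproduced approximately with an off-the-shelf linear SVM. The six training vectors below are assumptions (the slide does not list them); they are chosen so that x1 is the discriminating syndrome, which yields a classifier of the same shape as above (w2 = w3 = 0).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training matrix: six debugged boards, three syndromes (x1, x2, x3),
# root causes A (y = +1) and B (y = -1).
B = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [0, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
C = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear").fit(B, C)
print(clf.coef_, clf.intercept_)              # w and b of f(x) = w.x + b

# Diagnosing new failing boards, as on the slide:
print(clf.predict([[1, 0, 0], [0, 1, 0]]))    # [ 1 -1]  ->  root causes A and B
```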

  13. Results for Different Kernels Circa 2011 (Manufactured Boards) • 811 boards for training, 212 for test • Diagnostic accuracy under different kernel functions
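A kernel comparison along these lines might be run as in the sketch below; the data here is synthetic stand-in data, not the 811/212 industrial split behind the reported results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: 200 boards x 50 pass/fail syndromes with a toy
# labelling rule standing in for real root causes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
y = X[:, 0] | X[:, 1]          # hypothetical binary root-cause label

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(f"{kernel:8s} mean accuracy = {scores.mean():.3f}")
```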

  14. Incremental SVMs [Flow: initial training set → support-vector extraction → extracted support vectors; the extracted support vectors and the new training cases form the combined training set]

  15. Incremental Learning Flow Chart • Preparation stage: from new (incoming) failing boards, extract fault syndromes and repair actions as an additional training set S • Learning stage: take the existing SVM model and its support vectors S*, solve the optimization problem for the combined training set (S ∪ S*), and update the SVM model; repeat while more new training data arrives • Diagnosis stage: for new systems, determine the root cause based on the output of the final SVM model (a minimal sketch of this loop follows below)
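A minimal sketch of the loop, assuming the new data arrives as a list of (X, y) batches and using scikit-learn's SVC rather than the incremental solver of [Ye et al., TCAD'14]:

```python
import numpy as np
from sklearn.svm import SVC

def incremental_svm(batches, kernel="linear"):
    """Sketch of the flow chart: after each batch, retain only the support
    vectors S* and fold them into the next batch S, i.e. retrain on S u S*.
    This approximates full retraining at much lower cost."""
    n_syndromes = batches[0][0].shape[1]
    sv_X, sv_y = np.empty((0, n_syndromes)), np.empty(0)
    model = None
    for X_new, y_new in batches:                       # "more new training data?"
        X_train = np.vstack([sv_X, X_new])
        y_train = np.concatenate([sv_y, y_new])
        model = SVC(kernel=kernel).fit(X_train, y_train)
        sv_X = model.support_vectors_                  # extracted support vectors
        sv_y = y_train[model.support_]
    return model                                       # final SVM model for diagnosis
```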

  16. Comparison of Training Time [Chart: SVM model training time (seconds) vs. number of training cases in the SVMs]

  17. Comparison of Success Rates [Chart: SVM model success rate (%) vs. number of training cases in the SVMs]

  18. Typical Diagnosis Systems • Number of syndromes (1,000 per board) • Diagnosis time (up to several hours per board) • Often requires manual diagnosis • Challenge: how to select the useful syndromes from the complete set of syndromes

  19. Comparison of Diagnosis Procedures • Traditional diagnosis: start diagnosis → observe ALL syndromes → predicted root cause • Decision-tree-based diagnosis: start diagnosis → observe ONE syndrome → if confident about the root cause, report the predicted root cause; otherwise observe another syndrome

  20. Decision Trees • Internal nodes: can branch to two child nodes; represent syndromes • Terminal nodes: do not branch; contain class information [Figure: example tree with syndromes S1–S4 at internal nodes and root causes A1–A4 at the leaves]

  21. Decision Trees • We may reach root cause A1 in two different test sequences • Start from the most discriminative syndrome S1 • If S1 manifests itself, we then consider syndrome S2 • If S2 manifests itself, we can determine A1 to be the root cause

  22. Decision Trees • We may reach root cause A1 in two different test sequences • Start from the most discriminative syndrome S1 • If S1 passes, we then consider syndrome S3 • If S3 manifests itself, we then consider syndrome S4 • If S4 manifests itself, we can determine A1 to be the root cause

  23. Training of Decision Trees (Syndrome Identification) • Goals: rank syndromes, minimize ambiguity, reduce tree depth • Three popular criteria can be used for training decision trees: Information Gain, Gini Index, and Twoing (a sketch of the first two follows below)
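For reference, the first two criteria can be computed as follows. This is a sketch for 0/1 syndromes and discrete root-cause labels (NumPy arrays assumed); the twoing criterion is omitted.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of root-cause labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy of a set of root-cause labels (bits)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, syndrome):
    """Entropy reduction from splitting on one pass/fail (0/1) syndrome."""
    children = sum((syndrome == v).mean() * entropy(labels[syndrome == v])
                   for v in (0, 1) if np.any(syndrome == v))
    return entropy(labels) - children
```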

  24. Diagnosis Using Decision Trees • Training Data Preparation • Extract all the fault syndromes and the repair actions from historical data • DT Architecture Design • Design inputs, outputs, splitting criterion, pruning • DT Training • Generate a tree-based predictive model and assess the performance • DT-based Diagnosis • Traverse from the root node of DTs and obtain the root cause at the leaf node
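A minimal end-to-end sketch of these four steps using scikit-learn's DecisionTreeClassifier. The data below is a synthetic stand-in; the real inputs would be the syndromes and repair actions extracted from the historical logs.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Training data preparation (synthetic stand-in for parsed historical logs).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 40))                  # 500 boards x 40 syndromes
y = np.where(X[:, 0] == 1, "A1", np.where(X[:, 2] == 1, "A2", "A3"))

# DT architecture design and training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(criterion="gini",         # or "entropy" (information gain)
                              min_samples_leaf=2)       # a simple pruning control
tree.fit(X_train, y_train)

# Performance assessment.
print("success rate:", tree.score(X_test, y_test))
```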

  25. Diagnosis Using Decision Trees • Start diagnosis: observe the syndrome at the root of the DT • Adaptive diagnosis: select and observe the next syndrome based on the outcome of the current syndrome • Once a leaf node is reached, predict the root cause for the failing board
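Continuing the sketch above, the adaptive traversal can be expressed as a walk over the trained tree that observes only the syndromes the tree actually asks for. The observe callback is a hypothetical hook standing in for running and parsing one functional test.

```python
import numpy as np

def adaptive_diagnosis(tree, observe):
    """Walk a trained DecisionTreeClassifier node by node, calling
    observe(syndrome_index) only for the syndromes the tree asks for."""
    t = tree.tree_
    node = 0
    while t.children_left[node] != t.children_right[node]:      # not yet a leaf
        outcome = observe(t.feature[node])                      # run/parse ONE test
        node = (t.children_left[node] if outcome <= t.threshold[node]
                else t.children_right[node])
    return tree.classes_[np.argmax(t.value[node])]              # root cause at the leaf

# Example: lazily expose the syndromes of one failing board from the test set.
board = X_test[0]
print(adaptive_diagnosis(tree, lambda i: board[i]))
```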

  26. Experiments • Experiments performed on industrial boards currently in production • Tens of ASICs, hundreds of passive components • All the boards under analysis failed traffic test • A comprehensive functional test set for fault isolation, run through all components

  27. Comparison of Different Decision-Tree Architectures [Chart: total number of syndromes used for diagnosis, with annotated reductions of 5x, 6x, and 6x]

  28. Comparison of Different Decision-Tree Architectures [Chart: average number of syndromes used for diagnosis, with annotated reductions of 17x, 15x, and 18x]

  29. Comparison Between DTs And SVMs • Success rates (SR) obtained for Board 3 • SR obtained by DTs are similar to SR obtained by SVMs

  30. Comparison Between DTs And ANNs • Success rates (SR) obtained for Board 3 • SR obtained by DTs are similar to SR obtained by ANNs

  31. Information-Theoretic Syndrome and Root-Cause Analysis for Guiding Diagnosis Analysis of diagnosis performance Feedback for guiding test improvement

  32. Problem Statement • Lack of diagnosis-performance evaluation for individual root causes or individual syndromes • Redundant syndromes [Figure: the complete set of syndromes, merged from the syndrome sets of teams 1–3, is partitioned into a reduced syndrome set and a redundant syndrome set]

  33. Problem Statement • Lack of diagnosis-performance evaluation for individual root causes or individual syndromes • Redundant syndromes • Ambiguous root-cause pairs: which root causes in the complete set are hard to diagnose?

  34. Analysis Framework • The automated diagnosis system maps syndromes to a root-cause prediction • Syndrome analysis (minimum-redundancy maximum-relevance) produces a reduced set of syndromes and a set of redundant syndromes • Root-cause analysis (class-relevance analysis: precision, recall; synthetic boards) identifies root causes with high ambiguity and root causes with low ambiguity • Feedback to the test design team guides test updates (add/drop tests)

  35. Syndrome Analysis • Problem • Which syndrome is useful for diagnosis? And, which is not? • Method • Select useful syndromes • Minimum-redundancy maximum-relevance (mRMR) method

  36.–39. mRMR Method (demo)

  40. Experimental Setup • Dataset • Industrial boards in high-volume production • Each board has tens of ASICs, hundreds of passive components • A comprehensive functional test run through all components

  41. Experimental Setup • Selected diagnosis systems • Support-vector machines • Artificial neural networks • Decision trees • Weighted-majority voting

  42. Results (Syndrome Analysis) [Charts for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting]

  43. Results (Syndrome Analysis) [Chart for support-vector machines]

  44. Root-Cause Analysis • Problem • Which root-cause is hard to isolate from another root-cause? • Which root-cause is hard to isolate from other root causes? • Method • Screen root causes of high ambiguity • Statistical metrics (e.g., precision, recall)

  45. Metrics for Root-Cause Analysis • TP (True Positive): actual root cause A, predicted root cause A • FP (False Positive): actual root cause is another component (B, C, …), but root cause A is predicted • FN (False Negative): actual root cause A, but another root cause (B, C, …) is predicted • TN (True Negative): another root cause, both actual and predicted

  46. Metrics for Root-Cause Analysis • Precision for root cause A = TP / (TP + FP): of the boards diagnosed as root cause A, the fraction whose actual root cause is A

  47. Metrics for Root-Cause Analysis • Recall for root cause A = TP / (TP + FN): of the boards whose actual root cause is A, the fraction that is diagnosed as A
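Per-root-cause precision and recall can then be tabulated directly from the predicted and actual (repair-record) root causes, as in this sketch with hypothetical labels:

```python
def per_root_cause_metrics(actual, predicted):
    """Precision and recall per root cause; low values flag components the
    diagnosis system tends to confuse with others (high ambiguity)."""
    metrics = {}
    for rc in set(actual):
        tp = sum(a == rc and p == rc for a, p in zip(actual, predicted))
        fp = sum(a != rc and p == rc for a, p in zip(actual, predicted))
        fn = sum(a == rc and p != rc for a, p in zip(actual, predicted))
        metrics[rc] = {"precision": tp / (tp + fp) if tp + fp else 0.0,
                       "recall":    tp / (tp + fn) if tp + fn else 0.0}
    return metrics

# Hypothetical repair records vs. predictions:
print(per_root_cause_metrics(actual=["C37", "C37", "U12", "U12", "U9"],
                             predicted=["C37", "U12", "U12", "U12", "U9"]))
```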

  48. Root-Cause Analysis (demo) [Matrix: predicted root cause vs. actual root cause]

  49. Root-Cause Analysis (demo) [Matrix: predicted root cause vs. actual root cause]
