A Combinatorial Group Testing Method for FPGA Fault Location

Ronald F. DeMara, Carthik A. Sharma • University of Central Florida

Introduction Field Programmable Gate Arrays • Gate-array-based reconfigurable architecture • Matrix of Logic Cells (Look-Up Tables) surrounded by peripheral I/O cells • Capabilities: • Runtime reconfiguration • On-chip processor core & Millions of gate-equivalent logic elements • Millions of FPGA devices produced annually: most SRAM-based • Used in mission-critical applications • Remote systems & Hazardous Environments • Space Applications – Satellites, probes, and shuttles

Group Testing Algorithms • Origin: World War II blood testing • Problem: Test samples from millions of new recruits • Solution: Test blocks of samples before testing individual samples • Problem Definition • Identify subset Q of defectives from set P • Minimize number of tests • Test v-subsets of P • Form suitable blocks
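The block-testing idea can be illustrated with a minimal sketch of adaptive halving. It assumes exactly one defective item and a pooled test that reports whether a block contains it; the names `find_single_defective` and `pool_is_positive` are hypothetical and are not taken from the slides.

```python
from typing import Callable, List, Sequence, Tuple


def find_single_defective(
    items: Sequence[int],
    pool_is_positive: Callable[[List[int]], bool],
) -> Tuple[int, int]:
    """Locate one defective item by testing half of the candidates at a time.

    Assumes exactly one defective item exists (an assumption of this sketch).
    Returns (defective_item, number_of_pooled_tests).
    """
    candidates = list(items)
    tests = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        tests += 1
        if pool_is_positive(half):
            candidates = half                      # defective is inside the tested block
        else:
            candidates = candidates[len(half):]    # defective is in the untested block
    return candidates[0], tests


defective = 37
item, tests = find_single_defective(range(64), lambda pool: defective in pool)
print(item, tests)  # → 37 6
```

With 64 items the single defective is found in 6 pooled tests, compared with up to 63 individual tests.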

Previous Work • Pre-compiled Column-Based Dual FPGA architecture [Mitra04] • Autonomous detection, repair by shifting pre-compiled columns • Isolation using distributed CED-checkers and “blind” reconfiguration attempts • Overview of Combinatorial Group Testing and Applications [Du00] • Provides taxonomy and general algorithms for applying CGT • Examples of CGT applications: DNA clone library filtering, vaccine screening, computer fault diagnosis, etc. • CGT Enhanced Circuit Diagnosis [Kahng04] • Presents doubling, halving, etc. for circuit fault diagnosis using BIST, CGT • Requires ability to test resources individually • Chinese Remainder Sieve technique [Eppstein05] • Efficient non-adaptive and two-stage CGT based on prime-number-driven test formation • Improved algorithms for practical problem sizes (n < 1080) with small number of defectives (d < 4)

Fault-Handling Techniques • [Diagram: taxonomy of fault-handling techniques] • Device Failure Characteristics • Duration: Transient (SEU); Permanent (SEL, Oxide Breakdown, Electron Migration, LPD) • Target: Device Configuration; Processing Datapath • Row labels: Approach, Detection, Isolation, Diagnosis, Recovery • Entries: BIST; CGT-Based; Repetitive Readback; TMR; STARS; CED; Dueling; Duplex Output Comparison; Supplementary Testbench; Cartesian Intersection; Bitwise Comparison; Majority Vote; Repetitive Intersections; Fast Run-time Location; Worst-case Clock Period Dilation; unnecessary; Evolutionary Algorithm using Intrinsic Fitness Evaluation; Replicate in Spare Resource; Select Spare Resource; Invert Bit Value; Ignore Discrepancy

Isolation Problem Outline Objectives • Locate faulty logic and/or interconnect resource: a single stuck-at fault model is assumed • Online Fault Isolation: device not entirely removed from service Features • Runtime Reconfiguration: FPGA resources configured dynamically • Utilize Runtime Inputs: avoid special test-vectors, improve availability Constraints • Use pre-designed configurations: defined by target application • Subsets under test have constant resource utilization range for a given isolation problem • Resource grouping influences fault articulation: resource-mapping and input vector might mask hardware faults • Do not use specialized “block designs” • Runtime reconfiguration limited to column-swapping • “Non-reasonable” algorithm: “tests” may be repeated without gaining new isolation information

Fault Location Using Dueling The set of all competing configurations is represented by S. Set Ck represents the resources utilized by configuration k. Each competing configuration k, 1 ≤ k ≤ |S|, has a unique binary Usage Matrix Uk, 1 ≤ k ≤ p. Uk has elements Uk[i,j], 1 ≤ i ≤ m, 1 ≤ j ≤ n, where m and n are the numbers of rows and columns in the device layout, respectively. An element Uk[i,j] = 1 denotes the usage of resource (i, j) by Ck. The History Matrix H, with elements H[i,j], 1 ≤ i ≤ m, 1 ≤ j ≤ n, is an integer matrix used to represent the relative fitness of individual resources. H[i,j] provides the instantaneous relative fitness value of resource (i, j).

Dueling Example • [Figure panels: H[i,j] at t = 0; U1; U2; H[i,j] at t = 2] • H[i,j] changes after C1 and C2 are loaded • U1 and U2 are the corresponding Usage Matrices • (3,3) is identified as the faulty resource
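The interplay of Usage Matrices and the History Matrix can be sketched as follows. This is an illustration under stated assumptions: each configuration that shows a discrepancy adds its Usage Matrix to H, and the resource with the largest H value is the prime suspect. The 3 x 3 matrices are made up for this sketch and are not the ones in the slide figure; `update_history` and `prime_suspects` are hypothetical names.

```python
def update_history(history, usage):
    """Add a configuration's binary usage matrix to H after it shows a discrepancy."""
    for i, row in enumerate(usage):
        for j, used in enumerate(row):
            history[i][j] += used


def prime_suspects(history):
    """Return the 1-based (row, column) positions that hold the maximum H value."""
    peak = max(max(row) for row in history)
    return [
        (i + 1, j + 1)
        for i, row in enumerate(history)
        for j, value in enumerate(row)
        if value == peak
    ]


# Initially all H[i,j] = 0.
H = [[0] * 3 for _ in range(3)]

# Illustrative usage matrices for C1 and C2; they overlap only at resource (3,3).
U1 = [[1, 0, 0],
      [0, 1, 0],
      [0, 0, 1]]
U2 = [[0, 0, 1],
      [0, 0, 0],
      [1, 0, 1]]

for usage in (U1, U2):      # both configurations are assumed to show a discrepancy
    update_history(H, usage)

print(prime_suspects(H))  # → [(3, 3)]
```

Because (3,3) is the only resource shared by both discrepant configurations, it alone reaches H = 2.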

Modified Halving • Initially all H[i,j] = 0 • Selection Process can be Adaptive • Fitness Augmentation can be non-linear • Columns can be swapped with any other Columns

FPGA Arrangement for Dueling • Configurations in Population • C = CL ∪ CR • CL = subset of left-half configurations • CR = subset of right-half configurations • |CL| = |CR| = |C|/2

Isolation Progress without Halving • Temporary stasis in isolation due to insufficient design diversity • Without Halving • Initially |S| = 20,000 • Resource Utilization = 40% • Number of suspected faulty elements constant at 36 after 23 iterations • No subsequent improvement due to lack of differentiating information between competing configurations

Dueling with Modified Halving • Symptoms of stasis invoke the halving procedure for fast isolation • Dueling with Halving • Halving works by swapping half the used columns with unused ones • Halving progressively reduces the size of the set of suspected faulty elements • Isolation proceeds until a single faulty element is isolated • Fault isolated after 19 iterations
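A single halving step, swapping half of the used columns with unused ones, might look like the sketch below. The function name `swap_half_columns`, the random choice of which columns to swap, and the integer column indices are assumptions of this illustration; the slides do not specify how columns are selected.

```python
import random


def swap_half_columns(used, unused, rng):
    """Swap half of the used columns with an equal number of unused columns.

    Returns (new_used, new_unused, swapped_out). Column choice is random here,
    which is an assumption of this sketch.
    """
    used_list, unused_list = sorted(used), sorted(unused)
    count = min(len(used_list) // 2, len(unused_list))
    swapped_out = set(rng.sample(used_list, count))
    swapped_in = set(rng.sample(unused_list, count))
    new_used = (set(used_list) - swapped_out) | swapped_in
    new_unused = (set(unused_list) - swapped_in) | swapped_out
    return new_used, new_unused, swapped_out


rng = random.Random(1)
used, unused = {0, 1, 2, 3}, {4, 5, 6, 7}
new_used, new_unused, swapped_out = swap_half_columns(used, unused, rng)
print(sorted(new_used), sorted(swapped_out))
```

One plausible reading of the procedure: if the discrepancy persists after the swap, the suspect columns narrow to those retained; if it disappears, they narrow to the columns that were swapped out.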

Effect of Total Number of Elements • Increased Problem Size • Number of Elements = (Number of Rows) x (Number of Columns) • As the size of the array containing the fault increases, the required number of iterations increases only minimally • For 1 million elements, only 27.4 iterations are required
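For context, the reported iteration count can be compared with the smallest number of yes/no tests that can distinguish N candidates in the worst case, which is ⌈log2 N⌉. The helper name `min_binary_tests` is hypothetical; the bound itself is a standard counting argument, not a result from the slides.

```python
def min_binary_tests(num_elements: int) -> int:
    """Smallest t with 2**t >= num_elements, i.e. ceil(log2(num_elements))."""
    if num_elements < 1:
        raise ValueError("num_elements must be positive")
    return (num_elements - 1).bit_length()


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, min_binary_tests(n))
# → 1000 10
# → 10000 14
# → 100000 17
# → 1000000 20
```

Against this bound of 20 tests for one million elements, the 27.4 iterations quoted above stay within a small factor of the best case.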

Effect of Population Size • Population Size • A single fault in S is assumed • As population size increases, isolation is expected to be faster • Increased population size implies more initial designs • A population size of 30 appears to be a good tradeoff between ease of isolation and the difficulty of generating more individuals; larger populations provide minimal added benefit

Effect of Resource Utilization • Moderate resource utilization is ideal for isolation • Rate of isolation progress is low with extreme utilization characteristics • Isolation takes longer when less than 20% or greater than 80% of the available resources are utilized
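The effect of utilization can be explored with a small Monte Carlo sketch. It uses a deliberately simplified model that is an assumption of this illustration and not the authors' simulator: one randomly generated configuration is tested per pairing, each resource is used independently with probability equal to the utilization, and the fault is articulated whenever the faulty resource is used. The name `pairings_to_isolate` is hypothetical.

```python
import random


def pairings_to_isolate(num_resources, utilization, rng, max_pairings=100_000):
    """Count pairings until the suspect set shrinks to the single faulty resource."""
    fault = rng.randrange(num_resources)
    suspects = set(range(num_resources))
    for pairing in range(1, max_pairings + 1):
        # Only usage among current suspects matters for the intersection step.
        used = {r for r in suspects if rng.random() < utilization}
        if fault in used:
            suspects = used          # discrepancy: the fault is among the used resources
        else:
            suspects -= used         # no discrepancy: used resources are exonerated
        if len(suspects) == 1:
            return pairing
    raise RuntimeError("isolation did not converge within max_pairings")


rng = random.Random(0)
trials = [pairings_to_isolate(1_000, 0.5, rng) for _ in range(200)]
mean_pairings = sum(trials) / len(trials)
print(round(mean_pairings, 1))
```

Re-running with `utilization` set to 0.1 or 0.9 shows the slower progress at extreme utilization, because each pairing then splits the suspect set very unevenly.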

Future Work • Conducting Tests using Benchmark Circuits • ISCAS89 s38584 with 11448 gates: sequential logic • ISCAS85 circuits with max 3513 gates: combinational logic • Compression/ Signal Processing algorithms, such as the Lempel-Ziv (LZ) compression scheme [Mitra04] • Development of an architecture to enable column-swapping • Multi-layer Runtime Reconfigurable Architecture (MRRA) being prototyped

Backup Slides • On following pages …

Online Dueling Evaluation • Objective • Isolate faults by successive intersection between sets of FPGA resources used by configurations • Analyze complexity of Isolation process • Variables • Total resources available • Measured in number of LUTs • Number of Competing Configurations • Number of initial “Seed” designs in CRR process • Degree of Articulation • Some inputs may not manifest faults, even if faulty resource used by individual • Resource Utilization Factor • Percentage of FPGA resources required by target application/design • Number of Iterations for Isolation • Measure of complexity and time involved in isolating fault

Discrepancy Mirror Circuit Fault Coverage

Influence of LUT Utilization • [Plots: Perpetually Articulating Inputs with Equiprobable Distribution; Intermittently Articulating Inputs with Equiprobable Distribution] • Expected number of pairings grows sub-linearly in the number of resources • Utilization below 20% or above 80% implicates (or exonerates) a smaller subset of resources • At 50% utilization, the expected numbers of pairings for 1,000, 10,000, and 100,000 resources are 11.1, 14.9, and 17.6 • At 90% utilization, a mean of 258 pairings is required to isolate the faulty resource

Accommodating Multi-bit Word Widths • Proof of concept • The present circuit works efficiently • Demonstrates important Dueling-enabled isolation method • Strategies • Use an array of detectors • attempt to minimize points of failure as word-width increases • Number of logic resources used is acceptable for smaller circuits • Create new circuit or scheme, combining fault tolerant coding-based methods with single-fault secure circuit • Current research focused on improving detector by investigating codes, and fault-secure circuits

Pull-down Resistor Considerations • Proof of concept • The present circuit works in a verifiably correct manner • Can utilize synthesized (digital) pull-down resistors which simulate the behavior of analog resistors • Demonstrates Dueling-enabled isolation method • Can be utilized without implementation problems for Custom-VLSI designs • Alternative Approach • Alternate detector circuits for FPGA implementation are under investigation • Avoid using tri-state buffers and pull-down resistors; use native digital components available on FPGAs

Conceptual Innovation • Competitive Runtime Reconfiguration (CRR) • Novel fitness assessment via pairwise discrepancy without any pre-conceived oracle for correctness (emergent behavior) • Evolutionary Computation strategies effective for more than just the repair phase: continually detect, rank, and isolate faults entirely within the underlying data throughput flow • Diverse alternatives working a priori • Fault detection by robust consensus over time • No test vectors • Device remains online during repair • Fault isolation is model-free and self-calibrating • Completely-repaired criteria can be ignored • Graceful degradation via ranking of alternatives • No reconfiguration when fault-free • Performance readily adjustable • Failures in population memory covered • Checking logic is part of the individual, hence also competes for correctness

Configuration Health States • State transitions during lifetime of the i-th Half-Configuration • Discrepancy Operator • Baseline Discrepancy Operator is a dyadic operator with binary output • Z(Ci) is the FPGA data throughput output of configuration Ci • WTA: (Equivalence) • RS: (Hamming Distance)
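The two discrepancy variants named on this slide can be sketched as below. The operator symbols did not survive in the transcript, so this assumes that the equivalence variant returns 1 on any output mismatch and that the Hamming-distance variant counts differing output bits. Outputs Z(CL) and Z(CR) are modeled as non-negative integers; both function names are hypothetical.

```python
def discrepancy_equivalence(z_left: int, z_right: int) -> int:
    """Binary discrepancy: 1 if the two output words differ at all, else 0."""
    return int(z_left != z_right)


def discrepancy_hamming(z_left: int, z_right: int) -> int:
    """Graded discrepancy: number of bit positions in which the outputs differ."""
    return bin(z_left ^ z_right).count("1")


z_l, z_r = 0b1010, 0b0110
print(discrepancy_equivalence(z_l, z_r))  # → 1
print(discrepancy_hamming(z_l, z_r))      # → 2
```

The binary form matches the "dyadic operator with binary output" description, while the Hamming form gives a finer-grained measure when several output bits disagree.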

Procedural Flow under Consensus-Based Evaluation • [Flowchart: Initialization (population partitioned into functionally-identical yet physically-distinct half-configurations); Selection (choose FPGA configuration(s) labeled L and R); Detection (apply functional inputs to compute FPGA outputs using L, R); Fitness Adjustment (update fitness of L and R based on detection results; L=R is discrepancy free); decision: either L's or R's fitness < Repair Threshold? NO: continue PRIMARY LOOP; YES: invoke Genetic Operators only once and only on L or R; Adjust Controls: detection mode, overlap interval, ...] • Consensus Based Evaluation • Discrepancy Operator: CL CR • Four Fitness States: Pristine, Suspect, Under Repair, Refurbished • Initialization • Partition P into sub-populations of size |P|/2 to designate physical FPGA left-half or right-half resource utilization • Regeneration • Genetic Operators recover based on Reintroduction Rate • Operators only applied once, then offspring returned to “service” without concern about increasing fitness

GA Parameters & Experiments • GA parameters • Population size: 20 individuals • Crossover rate: 5% • Mutation rate: up to 80% per bit • GA operators • External-Module-Crossover • Internal-Module-Crossover • Internal-Module-Mutation • Speciation • Two-point crossover between individuals from same sub-group • Crossover points chosen to prevent intra-CLB crossover • Breeding occurs exclusively among members of sub-populations • Maintains non-interfering resource use among L, R • Demonstrate … • Fault Isolation Characteristics • Regenerative Experiments • Experiments … • Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness • Elimination of additional test vectors
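Two-point crossover with cut points restricted to CLB boundaries can be sketched as follows. The genome encoding (a flat list of bits with a fixed number of bits per CLB) and the names `two_point_crossover` and `clb_bits` are assumptions of this illustration; the slides do not describe the actual bitstream representation.

```python
import random


def two_point_crossover(parent_a, parent_b, clb_bits, rng):
    """Exchange the segment between two cut points that fall on CLB boundaries.

    Restricting cuts to multiples of clb_bits means no CLB is split,
    so every CLB in a child comes intact from one parent.
    """
    if len(parent_a) != len(parent_b) or len(parent_a) % clb_bits != 0:
        raise ValueError("parents must have equal length, a multiple of clb_bits")
    num_clbs = len(parent_a) // clb_bits
    first, second = sorted(rng.sample(range(num_clbs + 1), 2))
    lo, hi = first * clb_bits, second * clb_bits
    child_a = parent_a[:lo] + parent_b[lo:hi] + parent_a[hi:]
    child_b = parent_b[:lo] + parent_a[lo:hi] + parent_b[hi:]
    return child_a, child_b


rng = random.Random(7)
left_1 = [0] * 16   # two parents from the same sub-population (e.g. both L)
left_2 = [1] * 16
child_a, child_b = two_point_crossover(left_1, left_2, clb_bits=4, rng=rng)
print(child_a)
print(child_b)
```

Passing only parents from the same sub-population keeps breeding within L or within R, in line with the speciation rule on the slide.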

Impact of Fault on Viable Individuals • Existence of Positive Test Vector • Input Ip comprises a positive test vector iff the discrepancy operator applied to Cv(Ip) and Cf(Ip) yields 1, where Cv denotes a viable configuration and Cf denotes a faulty configuration • So if a discrepancy is visible then some Ip exists which manifests the fault • Minimal Case when Ip is Unique • Ip is unique if the fault is observable under exactly one test vector • Probability Mass Function for Encountering Ip in Minimal Case • Consider Ew = 600 yielding 99.5% coverage for a module with input space W = 64 • The number of input occurrences i, 0 ≤ i ≤ 600, that randomly encounter Ip to identify the fault is governed by the probability mass function p.m.f.(i), where D is the length of Ew
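The formula for p.m.f.(i) is missing from the transcript. A plausible stand-in, assuming each of the D inputs in the window is drawn independently and uniformly from the W possible inputs, is the binomial distribution with success probability 1/W. The sketch below uses that assumption; `articulation_pmf` is a hypothetical name.

```python
from math import comb


def articulation_pmf(i: int, window: int, input_space: int) -> float:
    """Binomial probability that Ip occurs exactly i times in `window` inputs.

    Assumes independent, equiprobable inputs (an assumption of this sketch).
    """
    p = 1 / input_space
    return comb(window, i) * p**i * (1 - p) ** (window - i)


D, W = 600, 64
expected_occurrences = D / W
print(expected_occurrences)  # → 9.375

total_probability = sum(articulation_pmf(i, D, W) for i in range(D + 1))
probability_never_seen = articulation_pmf(0, D, W)
print(probability_never_seen < 0.001)  # → True
```

Under this assumption the mean number of occurrences, D/W = 9.375, agrees with the expected discrepancy value (1/64)*600 quoted for the 1-out-of-64 case.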

Isolation of a single faulty individual with 1-out-of-64 impact • Outliers are identified after Ew iterations have elapsed • Expected DV = (1/64)*600 = 9.375 from the individual impacted by the fault • Isolated individual’s DV differs from the average DV by 3σ after 1 or more observation intervals of length Ew
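Outlier identification can be sketched as a simple threshold on discrepancy values (DVs). The sketch assumes the criterion is a DV more than three standard deviations above the population mean, uses the population standard deviation, and uses made-up DVs for a 20-member population; `find_outliers` and the configuration labels are hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(discrepancy_values, num_sigmas=3.0):
    """Return the names whose DV exceeds the population mean by num_sigmas deviations."""
    values = list(discrepancy_values.values())
    mu, sigma = mean(values), pstdev(values)
    return [
        name
        for name, dv in discrepancy_values.items()
        if dv - mu > num_sigmas * sigma
    ]


# Illustrative window: 19 fault-free individuals and one that accumulated DV = 9.
dvs = {f"C{k}": 0 for k in range(1, 20)}
dvs["C20"] = 9
print(find_outliers(dvs))  # → ['C20']
```

With several faulty individuals the mean and spread both rise, which makes a single fixed multiple of the standard deviation less discriminating.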

Isolation of a single faulty L individual with 10-out-of-64 impact • Compare with 1-out-of-64 fault impact • Expected DV of (10/64)*600 = 93.75 for the faulty configuration • One isolation will be complete approximately once in every 93.75/5 ≈ 19 Sliding Windows • Fault isolation achieved is 100%

Isolation of 8 faulty individuals L4 & R4 with 1-out-of-64 impact • Expected isolations do not occur approximately 40% of the time • Average discrepancy value of the population is higher • Outlier isolation is difficult • Multiple faulty individuals, discrepancies scattered

Regeneration Performance • Parameters: • Difference (vs. Hamming Distance) • Evaluation Window: Ew = 600 • Suspect Threshold: DVS = 1 - 6/600 = 99% • Repair Threshold: DVR = 1 - 4/600 = 99.3% • Re-introduction rate: r = 0.1 • Repairs evolved in-situ, in real-time, without additional test vectors, while allowing the device to remain partially online

Multilayer Runtime Reconfiguration Architecture (MRRA) • Develop MRRA fast reconfiguration paradigm for the CRR approach • Validate with real hardware platform along with detailed performance analysis • First general-purpose framework for a wide variety of applications requiring dynamic reconfiguration • Extend existing theories on reconfiguration

Loosely Coupled Solution • The Virtex-II Pro is mounted on a development board, which is interfaced with a workstation running Xilinx EDK and ISE • The entire system operates on a 32-bit basis

For further info … EH Website: http://cal.ucf.edu
