System Reliability - 3

Processor Problems

Power-on aspects

Run-time issues

Hardware-based Fault Tolerance

Fault-tolerant structures

Matching structures to requirements


Processor Problems

  • Techniques to answer these questions:

    • Is it OK to use the processor system at all?

    • Is it safe to continue using it in its normal running mode?

    • If problems are detected, can we recover?

  • As far as possible, we want graceful recovery from failures

  • Power-on Aspects

    • CPU Tests

      • Verify processor correctly executes its range of instructions

      • Ensure condition flags can be set and read correctly

      • Check that on-chip registers etc. can be written and read correctly

      • Verify numeric coprocessors

    • You can write your own code to perform these tests, insofar as they are applicable to your application code
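
A minimal sketch of home-grown power-on checks of the kind listed above, written in C. The function names and the scratch location are hypothetical. Testing the real condition flags or a coprocessor normally needs processor-specific assembly, so this only illustrates the read-back and known-answer idea.

```c
#include <stdbool.h>
#include <stdint.h>

/* Walking-ones test: write each single-bit pattern and its complement to a
   target word and read it back. `reg` is a hypothetical stand-in for a
   writable register or scratch location. */
bool walking_ones_ok(volatile uint32_t *reg)
{
    for (unsigned bit = 0; bit < 32; ++bit) {
        uint32_t pattern = (uint32_t)1u << bit;
        uint32_t inverse = (uint32_t)~pattern;

        *reg = pattern;
        if (*reg != pattern) {
            return false;
        }
        *reg = inverse;
        if (*reg != inverse) {
            return false;
        }
    }
    return true;
}

/* Known-answer spot checks on a few ALU operations. `volatile` discourages
   the compiler from folding the arithmetic away at build time. */
bool alu_spot_check_ok(void)
{
    volatile uint32_t a = 0x55555555u;
    volatile uint32_t b = 0xAAAAAAAAu;
    volatile uint32_t max = 0xFFFFFFFFu;

    if ((uint32_t)(a + b) != 0xFFFFFFFFu) return false;
    if ((a & b) != 0u) return false;
    if ((a ^ b) != 0xFFFFFFFFu) return false;
    if ((uint32_t)(a << 1) != b) return false;
    if ((uint32_t)(b >> 1) != a) return false;
    if ((uint32_t)(max + 1u) != 0u) return false; /* unsigned add wraps to 0 */
    return true;
}
```

At start-up, the application would call both functions and refuse to enter normal running mode if either returns false.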


Processor Problems

  • Power-on Aspects

    • ROM Tests

      • ROM contents could become corrupted by EM pulses / radiation

      • Flash can be “accidentally” reprogrammed

    • Write tests to check for validity on power-up

  • Run-time Issues

    • Stack overflow

      • Inhibit interrupts until current one is serviced (not always feasible)

      • Service the interrupt before the next arrives (not always possible)

      • Monitor with software

      • Hardware-based monitoring (e.g. ARM)
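
The two software checks on this slide (validating ROM on power-up and monitoring the stack) might be sketched in C as follows. All names are hypothetical. The additive checksum is an assumption chosen for brevity, and a CRC would detect more kinds of corruption. The stack monitor assumes the stack grows downward from the top of the array toward index 0.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ROM check: sum every byte modulo 2^16. */
uint16_t rom_checksum(const uint8_t *image, size_t len)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < len; ++i) {
        sum = (uint16_t)(sum + image[i]);
    }
    return sum;
}

/* Compare against a value stored when the image was built. */
bool rom_is_valid(const uint8_t *image, size_t len, uint16_t expected)
{
    return rom_checksum(image, len) == expected;
}

/* Stack monitoring by "painting": fill the stack area with a known pattern
   before use, then count how many words are still untouched. */
#define STACK_FILL 0xA5A5A5A5u

void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; ++i) {
        stack[i] = STACK_FILL;
    }
}

/* Number of words at the low end that still hold the fill pattern.
   A result of 0 means the stack has reached (or passed) its limit. */
size_t stack_headroom(const uint32_t *stack, size_t words)
{
    size_t free_words = 0;
    while (free_words < words && stack[free_words] == STACK_FILL) {
        ++free_words;
    }
    return free_words;
}
```

A background task could call `stack_headroom` periodically and raise an alarm when the result drops below a chosen margin, while `rom_is_valid` would run once before the application starts.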


Processor Problems

  • Run-time Issues

    • Corruption of critical variables

      • Caused by electrical noise, power supply fluctuations, …

      • Use checksumming

      • Make a (checksummed) copy, compare with original

      • Make 3 copies, use majority voting

    • Instruction pointer corruption

      • PC may point to unprogrammed locations

        • Fill with No-op codes

      • or it may point to programmed memory locations

        • More difficult to deal with
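
The two protections for critical variables listed above could look roughly like this in C. The structure, field names and the inverted-sum checksum are illustrative assumptions, not something the slides specify.

```c
#include <stdbool.h>
#include <stdint.h>

/* Checksummed critical data: the checksum is the inverted sum of the fields
   (hypothetical fields, chosen only for illustration). */
typedef struct {
    uint32_t setpoint;
    uint32_t limit;
    uint32_t checksum;
} critical_vars;

static uint32_t critical_sum(const critical_vars *cv)
{
    return (uint32_t)~(cv->setpoint + cv->limit);
}

void critical_store(critical_vars *cv, uint32_t setpoint, uint32_t limit)
{
    cv->setpoint = setpoint;
    cv->limit = limit;
    cv->checksum = critical_sum(cv);
}

/* True when the stored checksum still matches the fields. */
bool critical_ok(const critical_vars *cv)
{
    return cv->checksum == critical_sum(cv);
}

/* Three copies with majority voting. */
typedef struct {
    uint32_t copy[3];
} tmr_u32;

void tmr_store(tmr_u32 *t, uint32_t value)
{
    t->copy[0] = value;
    t->copy[1] = value;
    t->copy[2] = value;
}

/* Bitwise 2-of-3 vote: each result bit takes the value held by at least two
   copies. The voted value is written back to repair a corrupted copy. */
uint32_t tmr_load(tmr_u32 *t)
{
    uint32_t a = t->copy[0];
    uint32_t b = t->copy[1];
    uint32_t c = t->copy[2];
    uint32_t voted = (a & b) | (a & c) | (b & c);

    tmr_store(t, voted);
    return voted;
}
```

The checksum only detects corruption, so recovery needs a second copy to fall back on. The three-copy vote both detects and corrects corruption of any single copy, at the cost of extra memory and a slower read.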


Hardware-based Fault Tolerance

Operational requirements:

  • continuous operation required

  • non-continuous operation acceptable

  • reduced performance acceptable

  • must be returned quickly to full service

  • full performance needed

  • stops & locks out safely

System categories:

  • fail-operational

  • high-availability

  • fail-active

  • fail-safe

Repair and reconfiguration options:

  • no repair possible

  • automated repair permissible

  • automatic re-config

  • manual repair without re-config

  • manual repair with auto re-config

  • manual repair with manual recovery

  • automatic re-config with automatic recovery


Hardware-based Fault Tolerance

  • Fault-tolerant structures

    • Passive redundant systems

[Diagram blocks: Sensor 1, Sensor 2, Processor 1, Processor 2, Processor 3, Processor 4, Majority voter, Control actuator]
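
A majority voter for a passive redundant arrangement might be sketched as below. This is a software illustration only, and it assumes the processor outputs can be compared for exact equality. Real voters are often hardware, and systems fed by analog sensors usually compare within a tolerance band instead. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true and writes the winning value to *result when more than half
   of the n outputs agree. Returns false when no strict majority exists
   (for example a 2-2 split among four processors). */
bool majority_vote(const int32_t *outputs, size_t n, int32_t *result)
{
    if (n == 0) {
        return false;
    }

    /* Pass 1: Boyer-Moore candidate selection. */
    int32_t candidate = outputs[0];
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (count == 0) {
            candidate = outputs[i];
            count = 1;
        } else if (outputs[i] == candidate) {
            ++count;
        } else {
            --count;
        }
    }

    /* Pass 2: confirm the candidate really holds a strict majority. */
    size_t votes = 0;
    for (size_t i = 0; i < n; ++i) {
        if (outputs[i] == candidate) {
            ++votes;
        }
    }
    if (votes * 2 <= n) {
        return false;
    }

    *result = candidate;
    return true;
}
```

With four processors, a single faulty output is outvoted three to one, but a two-against-two split yields no result, so the caller needs a defined safe action for that case.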


Hardware-based Fault Tolerance

  • Fault-tolerant structures

    • Active redundant systems

      • Standby processor could be running “cold” or “hot”, in the latter case perhaps fully synchronised with the active processor

[Diagram blocks: Sensors, Processor 1 (active), Processor 2 (standby), Processor 3 (fault detection), Bus switch, output]
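
One way the fault-detection processor could drive the bus switch is with heartbeats. The slides do not say how faults are detected, so the heartbeat scheme, the timeout and every name below are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* State kept by the fault-detection processor for two units:
   unit 0 and unit 1, one of which is currently active. */
typedef struct {
    uint32_t last_beat_ms[2]; /* time of the latest heartbeat per unit */
    uint32_t timeout_ms;      /* silence longer than this means "failed" */
    unsigned active;          /* unit currently driving the output */
} failover_monitor;

void monitor_init(failover_monitor *m, uint32_t now_ms, uint32_t timeout_ms)
{
    m->last_beat_ms[0] = now_ms;
    m->last_beat_ms[1] = now_ms;
    m->timeout_ms = timeout_ms;
    m->active = 0;
}

/* Called whenever a unit reports that it is alive. */
void monitor_heartbeat(failover_monitor *m, unsigned unit, uint32_t now_ms)
{
    if (unit < 2) {
        m->last_beat_ms[unit] = now_ms;
    }
}

/* Unsigned subtraction keeps the age correct across timer wrap-around. */
static bool unit_alive(const failover_monitor *m, unsigned unit, uint32_t now_ms)
{
    return (uint32_t)(now_ms - m->last_beat_ms[unit]) <= m->timeout_ms;
}

/* Decide which unit the bus switch should select. Switch over only when the
   active unit is silent and the standby is known to be alive. */
unsigned monitor_select(failover_monitor *m, uint32_t now_ms)
{
    unsigned standby = 1u - m->active;
    if (!unit_alive(m, m->active, now_ms) && unit_alive(m, standby, now_ms)) {
        m->active = standby;
    }
    return m->active;
}
```

A "hot" standby would already be sending heartbeats and could take over at once. A "cold" standby would first have to be started, which this sketch does not model.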


Hardware-based Fault Tolerance

  • Fault-tolerant structures

    • Hybrid redundant systems

[Diagram blocks: Sensor 1, Processor 1 (active), Processor 2 (active), Processor 3 (active), Processor 4 (spare), Majority voter, Switching circuit, Bus switch, output]
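
A hybrid scheme combines voting with switching in a spare. The sketch below assumes three active processors, one spare and outputs that can be compared exactly. The replacement policy and all names are hypothetical, since the slide shows only the block diagram.

```c
#include <stdbool.h>
#include <stdint.h>

#define N_ACTIVE 3

typedef struct {
    unsigned active[N_ACTIVE]; /* processor ids currently feeding the voter */
    int spare;                 /* id of the unused spare, or -1 if none left */
} hybrid_state;

/* `outputs` is indexed by processor id (four entries assumed).
   Votes 2-of-3 among the active processors. On success, writes the voted
   value to *voted and replaces a dissenting processor with the spare. */
bool hybrid_step(hybrid_state *s, const int32_t *outputs, int32_t *voted)
{
    int32_t a = outputs[s->active[0]];
    int32_t b = outputs[s->active[1]];
    int32_t c = outputs[s->active[2]];
    int32_t winner;

    if (a == b || a == c) {
        winner = a;
    } else if (b == c) {
        winner = b;
    } else {
        return false; /* all three disagree: no majority */
    }

    /* Switching circuit: retire a dissenter while a spare is available. */
    for (unsigned i = 0; i < N_ACTIVE; ++i) {
        if (outputs[s->active[i]] != winner && s->spare >= 0) {
            s->active[i] = (unsigned)s->spare;
            s->spare = -1;
        }
    }

    *voted = winner;
    return true;
}
```

The voter masks the first fault immediately, and bringing in the spare restores the ability to mask a second one.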


Hardware-based Fault Tolerance

  • Fault-tolerant structures

    • Very high availability systems

      • Active redundant structures

      • DB servers etc

      • Very fast “failover” times

      • Ready availability of disk data to standby systems

    • Hot-swap systems

      • Replacement of boards while system is on-line, without disturbing operations

      • CompactPCI standard: four levels of capability

        • Basic hot-swap

        • Full hot-swap

        • High-availability hot-swap (single processor)

        • High-availability hot-swap (multiple processor)


Hardware-based Fault Tolerance

  • Fault-tolerant structures

    • CompactPCI - Basic hot-swap

[Diagram blocks: Host Processor single board computer (SBC), Hot-swap I/O boards, Devices and networks, PCI bus]


Hardware-based Fault Tolerance

  • Fault-tolerant structures

    • CompactPCI - Full & high-availability (single processor) hot-swap

      • Full: interface to system software allowing it to oversee swap of boards

      • HA: system monitors/controls boards with platform management software

[Diagram blocks: Host Processor SBC, Platform management controller SBC, Hot-swap I/O boards, Devices and networks, PCI bus, management bus]


Hardware-based Fault Tolerance

  • Fault-tolerant structures

    • CompactPCI - Full & high-availability (multiple processor) hot-swap

      • Distributed computing:

        • load balancing

        • Re-allocation of work from a failed processor

      • Fault diagnosis, repair managed by the host SBC

[Diagram blocks: Primary (host) SBC, Dedicated I/O controller, SBC 1, SBC 2, SBC n, Devices and networks (two blocks), PCI bus]
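
Re-allocating work from a failed processor, as the host SBC might do it, can be sketched as follows. The load model (a simple count of work units per SBC), the least-loaded policy and all names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SBC 8

typedef struct {
    size_t n_sbc;            /* number of SBCs in use (at most MAX_SBC) */
    bool alive[MAX_SBC];     /* false once an SBC has been declared failed */
    unsigned load[MAX_SBC];  /* work units currently assigned to each SBC */
} cluster;

/* Index of the least-loaded live SBC, or -1 when none is alive. */
int least_loaded(const cluster *c)
{
    int best = -1;
    for (size_t i = 0; i < c->n_sbc; ++i) {
        if (c->alive[i] && (best < 0 || c->load[i] < c->load[best])) {
            best = (int)i;
        }
    }
    return best;
}

/* Mark `failed` as dead and move its work units one at a time to whichever
   survivor is least loaded at that moment. Returns the number moved. */
unsigned reallocate_from(cluster *c, size_t failed)
{
    if (failed >= c->n_sbc) {
        return 0;
    }
    c->alive[failed] = false;

    unsigned moved = 0;
    while (c->load[failed] > 0) {
        int target = least_loaded(c);
        if (target < 0) {
            break; /* no survivors: the remaining work cannot be placed */
        }
        c->load[target]++;
        c->load[failed]--;
        moved++;
    }
    return moved;
}
```

Moving one unit at a time to the least-loaded survivor also keeps the remaining processors roughly balanced, which ties in with the load-balancing bullet above.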


Hardware-based Fault Tolerance

Operational requirements:

  • continuous operation required

  • non-continuous operation acceptable

  • reduced performance acceptable

  • must be returned quickly to full service

  • full performance needed

  • stops & locks out safely

System categories:

  • fail-operational

  • high-availability

  • fail-active

  • fail-safe

Repair and reconfiguration options (numbered 1 to 7 on the slide, in the order transcribed):

  • no repair possible (1)

  • automated repair permissible (2)

  • automatic re-config (3)

  • manual repair without re-config (4)

  • manual repair with auto re-config (5)

  • manual repair with manual recovery (6)

  • automatic re-config with automatic recovery (7)


Hardware-based Fault Tolerance Matching structures to Requirements

