system reliability 3
Download
Skip this Video
Download Presentation
System Reliability - 3

Loading in 2 Seconds...

play fullscreen
1 / 14

System Reliability - 3 - PowerPoint PPT Presentation


  • 301 Views
  • Uploaded on

System Reliability - 3. Processor Problems Power-on aspects Run-time issues Hardware-based Fault Tolerance Fault-tolerant structures Matching structures to requirements. Processor Problems. Techniques to answer these questions: Is it OK to use the processor system at all?

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'System Reliability - 3' - medwin


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
system reliability 3

System Reliability - 3

Processor Problems

Power-on aspects

Run-time issues

Hardware-based Fault Tolerance

Fault-tolerant structures

Matching structures to requirements

processor problems
Processor Problems
  • Techniques to answer these questions:
    • Is it OK to use the processor system at all?
    • Is it safe to continue using it in its normal running mode?
    • If problems are detected, can we recover?
  • We want as far as possible graceful recovery from failures
  • Power-on Aspects
    • CPU Tests
      • Verify processor correctly executes its range of instructions
      • Ensure condition flags can be set and read correctly
      • Check on-chip registers etc can be written and read correctly
      • Verify numeric coprocessors
    • You can write your own code to perform these test insofar as applicable to your application code
processor problems3
Processor Problems
  • Power-on Aspects
    • ROM Tests
      • ROM contents could become corrupted by EM pulses / radiation
      • Flash can be “accidentally” reprogrammed
    • Write tests to check for validity on power-up
  • Run-time Issues
    • Stack overflow
      • Inhibit interrupts until current one is serviced (not always feasible)
      • Service the interruptr before the next arrives (not always possible)
      • Monitor with software
      • Hardware-based monitoring (eg ARM)
processor problems4
Processor Problems
  • Run-time Issues
    • Corruption of critical variables
      • Causes by electrical noise, power supply fluctuations, …
      • Use check summing
      • Make a (check summed copy), compare with original
      • Make 3 copies, use majority voting
    • Instruction pointer corruption
      • PC may point to unprogrammed locations
        • Fill with No-op codes
      • or it may point to programmed memory locations
        • More difficult to deal with
hardware based fault tolerance
Hardware-based Fault Tolerance

operational requirements

continuous operation required

non-continuous operation acceptable

reduced performance acceptable

must be returned quickly to full service

full performance needed

stops & locks out safely

fail-operational

high-availability

fail-active

fail-safe

no repair possible

automated repair permissible

automatic re-config

manual repair without re-config

manual repair with auto re-config

manual repair with manual recovery

Automatic reconfig with automatic recovery

hardware based fault tolerance6
Hardware-based Fault Tolerance
  • Fault-tolerant structures
    • Passive redundant systems

Sensor 1

Processor 1

Processor 2

Majority voter

Control actuator

Processor 3

Sensor 2

Processor 4

hardware based fault tolerance7
Hardware-based Fault Tolerance
  • Fault-tolerant structures
    • Active redundant systems
      • Standby processor could be running “cold” or “hot”, in the latter case perhaps fully synchronised with the active processor

Processor 1 (active)

Processor 3 (fault detection)

Sensors

Processor 2 (standby)

output

Bus switch

hardware based fault tolerance8
Hardware-based Fault Tolerance
  • Fault-tolerant structures
    • Hybrid redundant systems

Majority voter

Processor 1 (active)

Processor 2 (active)

Sensor 1

Switching circuit

Bus switch

Processor 3 (active)

output

Processor 4 (spare)

hardware based fault tolerance9
Hardware-based Fault Tolerance
  • Fault-tolerant structures
    • Very high availability systems
      • Active redundant structures
      • DB servers etc
      • Very fast “failover” times
      • Ready availability of disk data to standby systems
    • Hot-swap systems
      • Replacement of boards while system is on-line, without disturbing operations
      • CompactPCI standard: four levels of capability
        • Basic hot-swap
        • Full hotswap
        • High-availability hot-swap (single processor)
        • High-availability hot-swap (multiple processor)
hardware based fault tolerance10
Hardware-based Fault Tolerance
  • Fault-tolerant structures
    • CompactPCI - Basic hot-swap

Devices, networks

Host Processor single board computer (SBC)

Hot-swap I/O boards

PCI bus

hardware based fault tolerance11
Hardware-based Fault Tolerance
  • Fault-tolerant structures
    • CompactPCI - Full & high-availability (single processor) hot-swap
      • Full: interface to system software allowing it to oversee swap of boards
      • HA: system monitors/controls boards with platform management software

Platform management controller SBC

Devices, networks

Hot-swap I/O boards

PCI bus

management bus

Host Processor SBC

hardware based fault tolerance12
Hardware-based Fault Tolerance
  • Fault-tolerant structures
    • CompactPCI - Full & high-availability (multiple processor) hot-swap
      • Distributed computing:
        • load balancing
        • Re-allocation of work from a failed processor
      • Fault diagnosis, repair managed by the host SBC

Primary (host) SBC

Devices, networks

Dedicated I/O controller

SBC 1

Devices, networks

SBC 2

PCI bus

SBC n

slide13

Hardware-based Fault Tolerance

operational requirements

continuous operation required

non-continuous operation acceptable

reduced performance acceptable

must be returned quickly to full service

full performance needed

stops & locks out safely

fail-operational

high-availability

fail-active

fail-safe

no repair possible

1

automated repair permissible

2

automatic re-config

3

manual repair without re-config

4

manual repair with auto re-config

5

manual repair with manual recovery

6

Automatic reconfig with automatic recovery

7

ad