













| Technology node (nm) | Year introduced | Relative SEU rate (FITs/kbit) | Trend | Mbits/processor | Relative uncorrected SEU rate per microprocessor | Trend |
|----------------------|-----------------|-------------------------------|-------|-----------------|--------------------------------------------------|-------|
| 250                  | 1998            | 3.2                           |       | 1.52            | 5.0                                              |       |
| 180                  | 1999            | 3.0                           | ↓     | 1.52            | 4.3                                              | ↓     |
| 130                  | 2000            | 2.4                           | ↓     | 3.28            | 7.9                                              | ↑     |
| 90                   | 2002            | 1.0                           | ↓     | 33.6            | 33.6                                             | ↑     |
| 65                   | 2006            | 0.7                           | ↓     | 44.3            | 30.5                                             | ↓     |
| 40                   | 2008            | 0.94                          | ↑     | 71              | 67                                               | ↑     |



### Outline

- Hardware reliability trends
- Case study: Experimental evaluation of error handling mechanisms in a jet-engine controller
- Layered fault tolerance (from lecture 13)
- Error detection techniques
- Fault detection vs. error detection



#### **Purpose of different layers**

- Hardware layer serves as a first line of defense that should
  - Correct as many errors as is economically feasible
  - Detect errors that cannot be corrected
- Software layer serves as a second line of defense that should
  - Correct errors detected, but not corrected, by the hardware layer
  - Detect errors that are not detected or corrected by the hardware layer
  - Ensure appropriate failure semantics for the node for any error that cannot be corrected
- System layer serves as a third line of defense that should
  - Detect and correct any errors that are not detected or corrected by the software and hardware layers





# On-line error detection techniques mentioned in the course book (1) (Hardware layer techniques)

- Watchdog timers (p. 130 in course book)
  - Hardware layer technique supported by software
- Bus monitoring (p. 130 in course book)
  - Checking the range of addresses generated by a CPU
  - Examples
    - Checking that the CPU uses an even address when reading a 32- or 64-bit word.
    - Checking CPU (or program) memory accesses using a memory management unit (MMU).
- Power supply monitoring (pp. 130-131 in the course book)
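
The bus monitoring checks listed above can be illustrated with a small software model. This is only a sketch: real bus monitors are hardware, and the memory map in `VALID_RANGES` and the function name `check_access` are assumptions made for the example, not taken from the course book.

```python
# Hypothetical memory map: (start, end_exclusive) ranges where memory or I/O exists.
VALID_RANGES = [
    (0x0000_0000, 0x0008_0000),  # assumed RAM region
    (0x4000_0000, 0x4000_1000),  # assumed I/O region
]


def check_access(address: int, size: int) -> list:
    """Return the violations found for one bus access (empty list if it is OK)."""
    violations = []
    # Range check: the whole access must fall inside one populated region.
    in_range = any(start <= address and address + size <= end
                   for start, end in VALID_RANGES)
    if not in_range:
        violations.append("address out of range")
    # Alignment check: 32- and 64-bit words must start at an even address.
    if size in (4, 8) and address % 2 != 0:
        violations.append("odd address for word access")
    return violations
```

For example, `check_access(0x101, 4)` reports an odd address for a word access, while a byte access to the same address passes both checks.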



#### Generic principles for error detection

- Duplication and comparison, consistency checking and information redundancy are examples of generic principles for error detection<sup>1</sup>.
- These principles can be used at all abstraction layers, i.e., the *hardware*, *software* and *system layers*.

<sup>1</sup>Note: Information redundancy can also be used to *correct* errors.
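
Two of these principles can be sketched in a few lines of code. The example below shows a single even-parity bit as a minimal form of information redundancy, and a software version of duplication and comparison. The function names are hypothetical and chosen only for illustration.

```python
def parity_bit(word: int) -> int:
    """Even parity: return the bit that makes the total number of 1s even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Information redundancy: recompute the parity and compare with the stored bit."""
    return parity_bit(word) == stored_parity


def duplicate_and_compare(compute, x):
    """Duplication and comparison: run the computation twice and compare results."""
    first = compute(x)
    second = compute(x)
    if first != second:
        raise RuntimeError("duplication and comparison: results differ")
    return first
```

A single parity bit detects any odd number of bit flips in the word but misses an even number of flips, which is why stronger codes are needed when errors should also be corrected.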





# Self-checking node supporting software implemented message comparison



- The processors execute the same programs and exchange copies of outgoing messages via the inter-processor links.
- They compare the message copies and stop execution if the copies do not match.
- An error counter stores the number of mismatches that have occurred.
- The node is restarted after a mismatch only if the value of the error counter is below a predefined threshold
- The bus guardian protects the bus from erratic behavior (e.g., babbling idiot) of the network interfaces
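
The comparison and restart policy described above can be sketched as follows. The class name `SelfCheckingNode` and the parameter `restart_threshold` are hypothetical, and the sketch assumes that the error counter is incremented before it is compared with the threshold; the slide does not specify that order.

```python
class SelfCheckingNode:
    """Software model of message comparison with an error counter (a sketch)."""

    def __init__(self, restart_threshold: int):
        self.restart_threshold = restart_threshold  # assumed, predefined threshold
        self.error_counter = 0                      # number of mismatches so far
        self.running = True

    def send(self, copy_a: bytes, copy_b: bytes):
        """Compare both processors' copies; return the message only if they match."""
        if not self.running:
            return None
        if copy_a == copy_b:
            return copy_a
        # Mismatch: count it and stop execution.
        self.error_counter += 1
        self.running = False
        # Restart only while the counter is below the threshold.
        if self.error_counter < self.restart_threshold:
            self.running = True
        return None
```

With `restart_threshold=2`, the node restarts after the first mismatch and stays stopped after the second, which bounds the number of restarts a permanently faulty node can attempt.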

# Watchdog timers

- Watchdog timers are used to detect slow programs and programs that hang.

- The principle is simple:
  - When a program starts to execute, either the program itself or the operating system starts a hardware timer.
  - The timer must be reset by the program within a given deadline, otherwise the timer will send an interrupt signal to the CPU.
  - The interrupt signal causes the CPU to take appropriate recovery actions, such as
    restarting the program or rebooting the node.
- Watchdog timers are common in embedded real-time systems.
- They are used to "transform" timing failures into signalled failures or silent failures.
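
The principle can be sketched with a simulated clock instead of a hardware timer. The class `WatchdogTimer` and its methods `start`, `kick` and `tick` are hypothetical names; in a real system the timer is a hardware device and the timeout raises an interrupt rather than calling a Python function.

```python
class WatchdogTimer:
    """Simulated watchdog: fires on_timeout if it is not reset before the deadline."""

    def __init__(self, timeout: int, on_timeout):
        self.timeout = timeout        # allowed time between resets, in clock ticks
        self.on_timeout = on_timeout  # stands in for the interrupt to the CPU
        self.deadline = None          # None means the timer is not running

    def start(self, now: int) -> None:
        """Started by the program or the operating system."""
        self.deadline = now + self.timeout

    def kick(self, now: int) -> None:
        """Reset by the program: move the deadline forward."""
        self.deadline = now + self.timeout

    def tick(self, now: int) -> None:
        """Called on every simulated clock tick; fires once after a missed deadline."""
        if self.deadline is not None and now > self.deadline:
            self.deadline = None
            self.on_timeout()
```

A program that calls `kick` often enough never triggers the timeout. A program that hangs stops calling `kick`, and the next `tick` after the deadline invokes the recovery action.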



- Watchdog timers are sometimes used in conjunction with other error detection mechanisms to simplify the implementation of recovery.
- This works as follows:
  - When an error is detected, the error detection mechanism stores an error flag in a designated memory area (preferably a "crash-proof" memory)
  - The error detection mechanism then forces a program hang, which
     subsequently causes the watchdog timer to raise an interrupt.
  - The interrupt invokes a recovery routine which reads the error flag and then initiates the appropriate recovery actions.
  - Restart and recovery could be done for an entire node, a single program, or a group of programs.
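
The sequence above can be sketched as a small model. Everything named here is an assumption for illustration: the class `Node`, the dictionary standing in for "crash-proof" memory, the flag names, and the mapping in `RECOVERY_ACTIONS`, including the default action used when no flag is found.

```python
# Hypothetical mapping from stored error flag to recovery action.
RECOVERY_ACTIONS = {
    "checksum_mismatch": "restart program",
    "stack_overflow": "reboot node",
}


class Node:
    def __init__(self):
        self.crash_proof_memory = {}  # stand-in for memory that survives a restart
        self.hung = False

    def on_error_detected(self, flag: str) -> None:
        """Error detection mechanism: store the flag, then force a program hang."""
        self.crash_proof_memory["error_flag"] = flag
        self.hung = True

    def program_step(self) -> bool:
        """One program iteration; returns True if the watchdog was reset."""
        return not self.hung

    def watchdog_interrupt(self) -> str:
        """Recovery routine: read the error flag and choose the recovery action."""
        flag = self.crash_proof_memory.pop("error_flag", None)
        self.hung = False
        # Assumed default: no flag means a plain timing failure, so reboot.
        return RECOVERY_ACTIONS.get(flag, "reboot node")
```

Routing every detected error through the watchdog means that only one recovery entry point has to be implemented and tested.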

Dept. of Computer Science and Engineering Chalmers University of Technology

#### Restarting a node in a distributed system

- Restarting a node in a distributed system involves an elaborate set of actions, including
  - recovering the node's view of the system state
  - reintegrating the node into the set of operational nodes
- These actions are handled by a system-layer mechanism called a node membership service.

# CPU Exceptions (Hardware layer technique)

- Modern central processing units (CPUs) are equipped with hardware-implemented error detection mechanisms called hardware exceptions.
- The number and type of hardware exceptions vary depending on the CPU design.
- When a hardware exception is raised, the CPU stops the program execution and jumps to an exception routine.
- The handling of exceptions is very similar to how a CPU responds to interrupt signals.
- Some examples of common hardware exceptions are given in the next two slides.

## **Examples of CPU exceptions (1)**

**Bus error:** detects errors during read and write accesses to the main memory. This exception is raised (triggered) when the CPU attempts to access an address to which no memory or any I/O device is connected.

**Address error:** detects when the CPU attempts to access memory using an odd-numbered address; only even-numbered addresses are allowed in many CPUs.

**Illegal opcode**: detects if the CPU during an instruction fetch reads a value from memory (or the instruction cache) that doesn't correspond to a valid instruction. This error can occur if the program counter is erroneously loaded with an address pointing to a data area rather than a program code area.

## **Examples of CPU exceptions (2)**

**Privilege violation:** detects if a user program attempts to execute an instruction that is allowed only for programs that execute in superuser mode (privileged mode), such as the operating system or device drivers. User programs normally execute in user mode (normal mode).

**Division by zero:** detects if a program tries to divide a number by zero.

**Spurious interrupt:** detects if an interrupt is signalled but no interrupt vector is provided by the interrupting device. (The interrupt vector tells the CPU which device raised the interrupt signal and thereby indicates which interrupt service routine the CPU shall execute.)
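
As a rough software analogy, the sketch below shows how a detected exception transfers control to a handler routine. Python's built-in `ZeroDivisionError` plays the role of the division-by-zero exception. The functions `run_with_exception_routine` and `handle_interrupt` and the table layouts are hypothetical and do not correspond to any particular CPU.

```python
def run_with_exception_routine(program, exception_routines: dict):
    """Run a program; on division by zero, stop it and jump to the exception routine."""
    try:
        return program()
    except ZeroDivisionError:
        return exception_routines["division_by_zero"]()


def handle_interrupt(vector_table: dict, vector):
    """Select the interrupt service routine from the vector supplied by the device.

    `vector` is None when the interrupting device provided no vector,
    which is treated as a spurious interrupt.
    """
    if vector is None:
        return vector_table["spurious_interrupt"]()
    return vector_table[vector]()
```

The vector acts as an index into a table of routines, so a missing vector leaves the CPU with no routine to select. That is why the spurious case needs a dedicated handler.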

#### Operating System and Compiler Generated Software Assertions (Software layer techniques)

- Operating system assertions:
  - Examples:
    - Integrity checks of data structures used by the operating system
    - Execution time monitoring of application and system processes
- Compiler generated run-time assertions:
  - Examples:
    - Value range overflow checking
    - Loop iteration bound overflow checking
    - Type checking of constrained variables
- When an error is detected by any of these mechanisms, typically a "trap" or software interrupt instruction is executed, which initiates appropriate recovery actions.
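
The kind of check a compiler might insert can be written out by hand as a sketch. The exception `RangeViolation` stands in for the trap, and `checked_assign` and `bounded_loop` are hypothetical helpers. Real compilers emit such checks inline, and the details depend on the language and compiler.

```python
class RangeViolation(Exception):
    """Stands in for the trap executed when a run-time assertion fails."""


def checked_assign(value: int, low: int, high: int) -> int:
    """Value range check for a variable constrained to low..high (inclusive)."""
    if not (low <= value <= high):
        raise RangeViolation(f"value {value} outside range {low}..{high}")
    return value


def bounded_loop(condition, body, max_iterations: int) -> int:
    """Run body while condition() holds; trap if the iteration bound is exceeded."""
    iterations = 0
    while condition():
        if iterations >= max_iterations:
            raise RangeViolation("loop iteration bound exceeded")
        body()
        iterations += 1
    return iterations
```

A caller that catches `RangeViolation` plays the role of the trap handler and decides on the recovery action.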










## **Overview of Lecture 15**

- Time-Triggered Systems
- Read before the lecture:
  - Lecture slides
  - The Time-Triggered Architecture (see reading instructions)