EDA122 / DIT061 Fault-Tolerant Computer Systems, 2012 (7,5 hp)

Course PM


Created: 2012-09-01

Change history:

2013-08-19: Date and place for third exam stated.

2012-10-19: Problems to be solved during exercise session 10 updated.

2012-10-16: Lecture plan finalized. Content descriptions for lectures 12, 13, 14, 15, 16 and 17 updated. Problems solved during exercise session 9 updated.

2012-09-15: The lecture "More on N-version programming and Recovery blocks" moved back to Oct 5;  the lectures on Safety Assessment moved forward to Sept 24 and Oct 1. 

 

Teachers

Lecture and Exercises


Course description

The course gives an introduction to fault-tolerant and safety-critical computer systems. Fault tolerance is used in a wide range of critical embedded, enterprise and server applications. The course covers four major areas: 1) Design principles for centralized and distributed fault-tolerant computer systems, 2) Dependability analysis of fault-tolerant systems, 3) Techniques and processes for assessment of safety-critical systems, and 4) Standards and terminology. The design principles are illustrated through system examples from areas such as space, aviation, road vehicles and transaction processing.


Course literature

The course book is available at Cremona.  All other course literature will be made available on the course homepage.


Lecture plan (final)

Lecture slides will, if possible, be posted on the course homepage no later than 24 hours before the lecture.
 
Lecture no. | Course week | Date | Time | Room | Content
1 | 1 | Tuesday, Sept 4 | 08.00-09.45 | HC1 | Introduction: Basic concepts in fault-tolerant computing, hardware redundancy, voting redundancy, basic terminology.
2 | 1 | Thursday, Sept 6 | 10.00-11.45 | HC1 | Hardware redundancy: Voting redundancy, Standby redundancy, Active redundancy. System example: HP Non-stop Architecture.
3 | 1 | Friday, Sept 7 | 15.15-17.00 | HC1 | Reliability modeling: Basic concepts in reliability theory, reliability block diagrams, fault trees.
4 | 2 | Monday, Sept 10 | 13.15-15.00 | HC1 | Case study: Ariane 501 disaster. Software redundancy: Design diversity, N-version programming, Recovery blocks.
5 | 2 | Thursday, Sept 13 | 10.00-11.45 | HC1 | Reliability modeling: Markov chain models.
6 | 3 | Monday, Sept 17 | 13.15-15.00 | HC1 | Availability modeling: Markov chain models, Birth-death processes. Safety modeling.
7 | 3 | Thursday, Sept 20 | 10.00-11.45 | HC1 | Generalized Stochastic Petri Net Models. Design diversity in the flight control system for Airbus A330/A340.
8 | 4 | Monday, Sept 24 | 13.15-15.00 | HC1 | Safety assessment: Hazard and Risk Analysis, FMEA, FTA. Technical Management: Life-cycle models, IEC 61508.
9 | 4 | Thursday, Sept 27 | 10.00-11.45 | HC1 | Guest lecture: FT in space applications, Torbjörn Hult, Ruag Space AB.
10 | 5 | Monday, Oct 1 | 13.15-15.00 | HC1 | Safety assessment: Allocation of safety integrity levels, Hardware reliability prediction, Safety case. Technical Management: Life-cycle models, ISO 26262.
11 | 5 | Thursday, Oct 4 | 10.00-15.00 | HC1 | Guest lecture: Functional safety, certification and standards, Jan Jacobson, SP Technical Research Institute of Sweden.
12 | 5 | Friday, Oct 5 | 15.15-17.00 | HC1 | Software redundancy: Experimental evaluations of N-version programming and Recovery blocks. Study of field failures in high-performance computing systems.
13 | 6 | Monday, Oct 8 | 13.15-15.00 | HC1 | FT in distributed systems: Consensus, Byzantine failures. Layered fault tolerance.
14 | 6 | Thursday, Oct 11 | 10.00-11.45 | HC1 | Reliability trends for integrated circuits. Error detection. Experimental evaluation of error detection mechanisms in a jet-engine controller.
15 | 6 | Friday, Oct 12 | 15.15-17.00 | HC1 | FT in distributed systems: The Time-Triggered Architecture.
16 | 7 | Monday, Oct 15 | 13.15-15.00 | HC1 | Guest lecture: Fault-tolerance in JAS-Gripen, Lars Holmlund, Saab Aerosystems.
17 | 7 | Thursday, Oct 18 | 10.00-11.45 | HC1 | Clock synchronization in time-triggered systems. More on error detection techniques. Course summary.

 


Exercise plan (preliminary)

Exercise no. | Course week | Date | Time | Room | Content | Problems
1 | 2 | Monday, Sept 10 | 15.15-17.00 | HC1 | Reliability modeling: Reliability block diagrams, fault trees. | 2.2, 2.3, 2.6, 2.7
2 | 2 | Friday, Sept 14 | 15.15-17.00 | HC1 | Reliability modeling: Markov chains. | 3.1, 3.2, Variant of 5.6
3 | 3 | Monday, Sept 17 | 15.15-17.00 | HC1 | Availability modeling. | 3.12, 3.11, 5.2
4 | 3 | Friday, Sept 21 | 15.15-17.00 | HC1 | Introduction to laboratory class 1. | Lab-PM
5 | 4 | Monday, Sept 24 | 15.15-17.00 | HC1 | Probabilistic safety analysis. | 3.8, 3.9
6 | 4 | Friday, Sept 28 | 15.15-17.00 | HC1 | Generalized Stochastic Petri Net Models. Introduction to laboratory class 2. | Lab-PM
7 | 5 | Monday, Oct 1 | 15.15-17.00 | HC1 | Dependability modeling. | 5.9, 5.10, Exam problems
8 | 6 | Monday, Oct 8 | 15.15-17.00 | HC1 | Failure rate function, FMEA. | FMEA and reliability analysis, 1.1, Exam problem
9 | 7 | Monday, Oct 15 | 15.15-17.00 | HC1 | Exam problems. | Old exams: 2004-08-23 (problem 1), 2010-01-11 (problem 3)
10 | 7 | Friday, Oct 19 | 15.15-17.00 | HC1 | Exam problems. | Old exam: 2011-10-19 (problems 1, 2, 3)


Laboratory classes


Examination

Participation in the laboratory classes and approved laboratory reports.

Written exam. Grades: failed, 3, 4, 5.

First exam: Tuesday, October 23, 2012, 14.00 - 18.00, HA, HB, HC

Second exam:  Monday, January 15, 2013, 14.00 - 18.00, Mechanical engineering building, Hörsalsvägen 5

Third exam: Tuesday, August 22, 2013, 14.00 - 18.00, VV