## Modeling Run-Time Distributions in Passively Replicated Fault-Tolerant Systems

##### Doctoral thesis

##### Permanent link

http://hdl.handle.net/11250/259424

##### Publication date

2007

##### Abstract

Many real-time applications will have strict reliability requirements in addition to the timing requirements. To fulfill these reliability requirements, it may be necessary to use a fault-tolerance strategy.
An active replication strategy, where several instances of the task are run in parallel, is the preferred choice for many real-time systems, as the parallel execution of the task instances gives a high probability that at least some of the instances finish successfully before the deadlines, even if others should fail. However, running several instances of a single task in parallel increases the need for processing power, which is costly and increases the space and energy requirements.
In a passive replication strategy, only one instance of a task is run at a time. If the task fails, a backup is readied, and the task is rerun on the backup. This requires fewer resources than active replication strategies, but the extra time needed for the rerun of the task can increase the probability of deadline misses. Thus, analyzing the timing of these systems is necessary.
Analysis using the worst-case execution times for the tasks in the fault-tolerant system can often give very conservative results, especially if the tasks’ normal execution times rarely approach the worst-case times.
Analyzing the run-time distributions of the tasks in passively replicated fault-tolerant systems can be a useful tool for deciding whether a passive replication strategy is suitable for the system. Unlike worst-case execution time analysis, the distributions can also show the improvement in reliability for systems where the passive replication strategy does not meet the deadlines in the worst-case scenario. This improvement may be large enough to justify the use of the replication strategy.
In this work, mathematical models for run-time distributions of tasks in several classes of passive replication systems are developed. These models give the run-time distributions as functions of parameter distributions of the modeled system, like the fault-free runtime of a task, the fault detection time distribution, and the distribution of the time between fault detection and the start of the rerun of the task.
The different fault detection mechanisms used in passive replication systems lead to different structures of the mathematical models. Whether the replicas are homogeneous or inhomogeneous also affects the model structure. Many other differences in the modeled systems’ structure will lead to differences in the parameter distributions, but not in the structure of the mathematical models.
Models are developed for systems using homogeneous and inhomogeneous replicas, with watchdogs, timeouts, and acceptance tests as fault detectors. One of the goals of the work has been to show the steps used to develop the models in such a way that the same steps can be used to develop run-time models for systems that are not presented in the work.
The use of the models is shown with several examples, and the example results are compared to results obtained from discrete event system simulation.
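To illustrate the kind of quantity the abstract describes, the following is a minimal Monte Carlo sketch, not the thesis's analytical models or its simulator. It assumes homogeneous replicas, a fixed per-attempt fault probability, a bounded number of reruns, and uniform distributions for the fault-free run time, the fault detection time, and the delay before the rerun starts. All function names, parameter values, and distributions are hypothetical and chosen only for illustration.

```python
import random

MAX_RERUNS = 3  # assumed bound on the number of backups (hypothetical)


def sample_run_time(rng, p_fault):
    """Draw one total run time for a task under passive replication.

    Each attempt either completes (fault-free run time) or fails, in which
    case the time until the fault is detected and the delay before the rerun
    starts are added. All distributions below are made-up placeholders.
    """
    total = 0.0
    for _ in range(MAX_RERUNS + 1):
        if rng.random() >= p_fault:
            total += rng.uniform(8.0, 12.0)  # assumed fault-free run time
            return total
        total += rng.uniform(1.0, 12.0)      # assumed fault detection time
        total += rng.uniform(0.5, 1.5)       # assumed delay before the rerun
    return float("inf")                      # every attempt failed


def miss_probability(deadline, p_fault, n=100_000, seed=1):
    """Estimate the probability that the task finishes after the deadline."""
    rng = random.Random(seed)
    misses = sum(sample_run_time(rng, p_fault) > deadline for _ in range(n))
    return misses / n


if __name__ == "__main__":
    for deadline in (12.0, 25.0, 40.0):
        print(deadline, miss_probability(deadline, p_fault=0.05))
```

Under these assumptions, a worst-case analysis would have to budget for every rerun, while the estimated miss probability shows how rarely the longer run times actually occur.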

##### Consists of

Tjora, Åsmund; Skavhaug, Amund. Fault Tolerance Methods in Component-Based Real-Time Systems.

Tjora, Åsmund; Skavhaug, Amund. A General Mathematical Model for Run-Time Distributions in a Passively Replicated Fault Tolerant System. Euromicro Conference on Real-Time Systems in Porto, Portugal, 2003.

Tjora, Åsmund; Skavhaug, Amund. Assessing Reliability of Real-Time Distributed Systems. 1st ERCIM Workshop on Software-Intensive Dependable Embedded Systems in Porto, Portugal, 2005.

Tjora, Åsmund; Skavhaug, Amund; Heegaard, Poul E. A Mathematical Model for Run-Time Distributions in a Fault Tolerant System with Nonhomogeneous Passive Replicas. ERCIM/DECOS Workshop on Dependable Embedded Systems in Gdansk, Poland, 2006.

Tjora, Åsmund; Skavhaug, Amund. Run-time Distributions in Passively Replicated Systems Using Time out and Acceptance Fault Detection. ERCIM/DECOS Workshop on Dependable Embedded Systems in Lubeck, Germany, 2007.