Simulation of the response time distribution of fault-tolerant multi-tier cloud services
Journal article, Peer reviewed
Original version: Journal of Simulation 2016, 10.1057/s41273-016-0042-9
We consider the problem of obtaining the response time distribution of fault-tolerant multi-tier services. When providing software-as-a-service applications, the service provider is obliged to ensure a certain quality of service; here, we consider upper bounds on the response time. The services consist of multiple components with different functionality, which are prone to failure and fail according to a given failure time distribution. Owing to redundancy, however, a failure does not necessarily bring the service down, but rather increases the response time. A fundamental difficulty in estimating the response time distribution while accounting for failures is the disparity between the time scales of the time between failures and the service times. To overcome this, we propose a decomposition-based approach that combines an analytic model of the failure process with a discrete event simulation model to sample the response time distribution. In an experimental study, we compare this simulation-based approach with an analytic approach and illustrate how it can be used by service providers as decision support. We also show that, in certain cases, the analytic approach may provide a safe bound on the response time.
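The decomposition idea can be illustrated with a minimal Python sketch. It is not the model from the article: it assumes a single tier of n identical redundant servers, independent exponential failure and repair times, Poisson arrivals, exponential service times and FCFS scheduling, and all function and parameter names are hypothetical. Because failures are rare compared with service times, the sketch treats the number of working servers as fixed during each simulation run, computes the probability of each failure state analytically, and weights the simulated response times by those probabilities.

```python
import heapq
import random
from math import comb


def state_probabilities(n, failure_rate, repair_rate):
    """Analytic part: steady-state probability that k of n servers are up.

    Assumes independent servers with exponential failure and repair times,
    so the number of working servers is binomially distributed.
    """
    availability = repair_rate / (failure_rate + repair_rate)
    return [
        comb(n, k) * availability**k * (1 - availability) ** (n - k)
        for k in range(n + 1)
    ]


def simulate_response_times(k, arrival_rate, service_rate, n_requests, rng):
    """Simulation part: response times of an FCFS queue with k working servers."""
    free_at = [0.0] * k  # min-heap of times at which each server becomes free
    now = 0.0
    response_times = []
    for _ in range(n_requests):
        now += rng.expovariate(arrival_rate)  # next arrival
        earliest_free = heapq.heappop(free_at)
        start = max(now, earliest_free)  # wait if all servers are busy
        finish = start + rng.expovariate(service_rate)
        heapq.heappush(free_at, finish)
        response_times.append(finish - now)
    return response_times


def violation_probability(n, failure_rate, repair_rate, arrival_rate,
                          service_rate, bound, n_requests=20000, seed=1):
    """Estimate P(response time > bound) by mixing over failure states."""
    rng = random.Random(seed)
    total = 0.0
    for k, p in enumerate(state_probabilities(n, failure_rate, repair_rate)):
        if k == 0:
            total += p  # assumption: with no server up, every request violates
            continue
        times = simulate_response_times(
            k, arrival_rate, service_rate, n_requests, rng
        )
        total += p * sum(t > bound for t in times) / n_requests
    return total
```

For example, `violation_probability(3, 0.001, 0.1, 1.5, 1.0, 5.0)` estimates the probability that a request takes longer than 5 time units with three redundant servers. In states where the arrival rate exceeds the remaining capacity, the estimate depends on the simulated horizon, so such states would need separate treatment in a more careful study.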