
dc.contributor.advisor: Elster, Anne Cathrine [nb_NO]
dc.contributor.author: Hagen, Knut Imar [nb_NO]
dc.date.accessioned: 2014-12-19T13:31:43Z
dc.date.available: 2014-12-19T13:31:43Z
dc.date.created: 2010-09-03 [nb_NO]
dc.date.issued: 2007 [nb_NO]
dc.identifier: 347465 [nb_NO]
dc.identifier: ntnudaim:3329 [nb_NO]
dc.identifier.uri: http://hdl.handle.net/11250/250460
dc.description.abstract: This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application runs on a very large cluster with thousands of processors, it is likely that a process will crash due to a hardware or software failure. Fault-tolerance is the ability of a system to respond gracefully to an unexpected hardware or software failure. The test application used in this thesis is meant to run for several weeks on several nodes. It is a seismic MPI application, written in Fortran90, provided by Statoil, who wanted a fault-tolerant implementation. The original test application had no degree of fault-tolerance: if one process or one node crashed, the entire application crashed with it. In this thesis, a collection of fault-tolerance techniques is analysed, including checkpointing, MPI Error handlers, extending MPI, replication, fault detection, atomic clocks and multiple simultaneous failures. Several MPI implementations are described, including MPICH1, MPICH2, LAM/MPI and Open MPI. Next, some fault-tolerant products developed at other universities are described, including FT-MPI, FEMPI, MPICH-V with its five protocols, the fault-tolerant functionality of Open MPI, and MPI Error handlers. A fault-tolerant simulator that simulates the application's behaviour is developed. The simulator uses two fault-tolerance methods: FT-MPI and MPI Error handlers. Next, the test application is similarly made fault-tolerant with FT-MPI using three proposed approaches: MPI_Reduce(), MPI_Barrier(), and the final and current implementation, MPI Loop. Tests of the MPI Loop implementation are run on a small and a large cluster to verify the fault-tolerant behaviour. The seismic application survives the crash of up to n-2 of its n nodes/processes: process number 0 must stay alive since it acts as an I/O server, and at least one other process must remain to compute data. Processes can also be restarted rather than left out, but the test application needs to be modified to support this. [nb_NO]
dc.language: eng [nb_NO]
dc.publisher: Institutt for datateknikk og informasjonsvitenskap [nb_NO]
dc.subject: ntnudaim [no_NO]
dc.subject: SIF2 datateknikk [no_NO]
dc.subject: Komplekse datasystemer [no_NO]
dc.title: Fault-tolerance for MPI Codes on Computational Clusters [nb_NO]
dc.type: Master thesis [nb_NO]
dc.source.pagenumber: 97 [nb_NO]
dc.contributor.department: Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for datateknikk og informasjonsvitenskap [nb_NO]
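
The survival bound stated in the abstract (rank 0 must stay alive as the I/O server, and at least one other process must remain to compute) can be illustrated with a small simulation. The following is only a sketch in plain Python: it makes no MPI or FT-MPI calls and is not the thesis's simulator or its MPI Loop implementation. The function name `run_with_failures`, the `crash_schedule` mapping, and the round-robin task assignment are all assumptions made for illustration.

```python
from collections import deque


def run_with_failures(n_procs, tasks, crash_schedule):
    """Toy model of a master/worker run that tolerates worker crashes.

    Hypothetical helper, not taken from the thesis. Rank 0 is treated as
    the I/O server and never computes; ranks 1..n_procs-1 are workers.
    crash_schedule maps a step number to the rank that crashes during
    that step. Returns a dict mapping each task to the rank that
    completed it.
    """
    if n_procs < 2:
        raise ValueError("need rank 0 plus at least one worker")

    alive = list(range(1, n_procs))   # surviving worker ranks
    pending = deque(tasks)            # work not yet completed
    done = {}
    step = 0

    while pending:
        # Assumed policy: round-robin over the workers still alive.
        worker = alive[step % len(alive)]
        task = pending.popleft()

        crashed = crash_schedule.get(step)
        if crashed == 0:
            raise RuntimeError("rank 0 (I/O server) crashed; run cannot continue")
        if crashed in alive:
            alive.remove(crashed)
            if not alive:
                raise RuntimeError("no worker left to compute data")

        if crashed == worker:
            pending.append(task)      # work in flight was lost: requeue it
        else:
            done[task] = worker
        step += 1

    return done


if __name__ == "__main__":
    # Four processes, two worker crashes (n-2): the run still completes.
    result = run_with_failures(4, range(6), {1: 2, 3: 3})
    print(sorted(result.items()))
```

In this toy model, losing every worker or losing rank 0 raises an error, which mirrors the n-2 limit described in the abstract. Restarting crashed processes, which the abstract mentions as a possible extension, is not modelled.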

