Fault-tolerance for MPI Codes on Computational Clusters

Hagen, Knut Imar

dc.contributor.advisor	Elster, Anne Cathrine	nb_NO
dc.contributor.author	Hagen, Knut Imar	nb_NO
dc.date.accessioned	2014-12-19T13:31:43Z
dc.date.available	2014-12-19T13:31:43Z
dc.date.created	2010-09-03	nb_NO
dc.date.issued	2007	nb_NO
dc.identifier	347465	nb_NO
dc.identifier	ntnudaim:3329	nb_NO
dc.identifier.uri	http://hdl.handle.net/11250/250460
dc.description.abstract	This thesis focuses on fault-tolerance for MPI codes on computational clusters. When an application runs on a very large cluster with thousands of processors, it is likely that a process will crash due to a hardware or software failure. Fault-tolerance is the ability of a system to respond gracefully to an unexpected hardware or software failure. The test application used in this thesis is meant to run for several weeks on several nodes. It is a seismic MPI application written in Fortran90 and was provided by Statoil, who wanted a fault-tolerant implementation. The original test application had no degree of fault-tolerance: if one process or one node crashed, the entire application also crashed. In this thesis, a collection of fault-tolerance techniques is analysed, including checkpointing, MPI Error handlers, extending MPI, replication, fault detection, atomic clocks and multiple simultaneous failures. Several MPI implementations are described, such as MPICH1, MPICH2, LAM/MPI and Open MPI. Next, some fault-tolerant products developed at other universities are described, such as FT-MPI, FEMPI, MPICH-V including its five protocols, the fault-tolerant functionality of Open MPI, and MPI Error handlers. A fault-tolerant simulator which simulates the application's behaviour is developed. The simulator uses two fault-tolerance methods: FT-MPI and MPI Error handlers. Next, our test application is similarly made fault-tolerant with FT-MPI using three proposed approaches: MPI_Reduce(), MPI_Barrier(), and the final and current implementation, MPI Loop. Tests of the MPI Loop implementation are run on a small and a large cluster to verify the fault-tolerant behaviour. The seismic application survives a crash of up to n-2 nodes/processes: process number 0 must stay alive since it acts as an I/O server, and there must be at least one process left to compute data. Processes can also be restarted rather than left out, but the test application needs to be modified to support this.	nb_NO
dc.language	eng	nb_NO
dc.publisher	Institutt for datateknikk og informasjonsvitenskap	nb_NO
dc.subject	ntnudaim	no_NO
dc.subject	SIF2 datateknikk	no_NO
dc.subject	Komplekse datasystemer	no_NO
dc.title	Fault-tolerance for MPI Codes on Computational Clusters	nb_NO
dc.type	Master thesis	nb_NO
dc.source.pagenumber	97	nb_NO
dc.contributor.department	Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for datateknikk og informasjonsvitenskap	nb_NO

Associated file(s)

File name: 347465_COVER01.pdf
Size: 47.53Kb
Format: PDF

File name: 347465_ATTACHMENT01.zip
Size: 556.9Kb
Format: Unknown

File name: 347465_FULLTEXT01.pdf
Size: 1.073Mb
Format: PDF

This item appears in the following collection(s)

Institutt for datateknologi og informatikk [6828]
