Algorithmic and Scheduling Techniques for Heterogeneous and Distributed Computing
Abstract
The computing and communication resources of high performance computing systems are becoming heterogeneous, are exhibiting performance fluctuations and are failing in an unforeseeable manner. The Master-Slave (MS) paradigm, that decomposes the computational load into independent tasks, is well-suited for operating in these environments due to its loose synchronization requirements. The application tasks can be computed in any order, by any slave, and can be resubmitted in case of slave failures. Although, the MS paradigm naturally adapts to dynamic and unreliable environments, it nevertheless suffers from a lack of scalability.
This thesis providesmodels, techniques and scheduling strategies that improve the scalability and performance of MS applications. In particular, we claim that deploying multiple masters may be necessary to achieve scalable performance. We address the problem of finding the most profitable locations on a heterogeneous Grid for hosting a given number of master processes, such that the total task throughput of the system is maximized. Further, we provide distributed scheduling strategies that better adapt to system load fluctuations than traditional MS techniques. Our strategies are especially efficient when communication is expensive compared to computation (which constitutes the difficult case).
Furthermore, this thesis investigates also the suitability ofMS scheduling techniques for the parallelization of stencil code applications. These applications are usually parallelized with domain decompositionmethods, that are highly scalable, but rather impractical for dealing with heterogeneous, dynamic and unreliable environments. Our experimental results with two scientific applications show that traditional MS tasking techniques can successfully be applied to stencil code applications when the master is used to control the parallel execution. If the master is used as a data access point, then deploying multiple masters becomes necessary to achieve scalable performance.
Has parts
Banino, Cyril. A Distributed Procedure for Bandwidth-Centric Scheduling of Independent-Task Applications. Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005: 48a, 2005.Banino-Rokkones, Cyril; Beaumont, Olivier; Natvig, Lasse. Master-Slave Tasking on Asymmetric Networks. Lecture Notes in Computer Science, Proceedings of the 12th International Euro-Par Conference, Euro-Par 2006. 4128: 167-176, 2006.
Rosenvinge, Einar M.R.; Elster, Anne C.; Banino, Cyril. Online Task Scheduling on Heterogeneous Clusters. Proceedings of PARA 2004: 7th International Conference on Applied Parallel Computing, Lecture Notes in Computer Science. (ISSN 1611-3349). 3732/2006: 1141-1150, 2006.
Banino-Rokkones, Cyril; Amundsen, Jørn; Smørgrav, Eivind. Parallelizing Lattice Gauge Theory Models on Commodity Clusters. Proceedings of IEEE International Conference on Cluster Computing: 1-8, 2006.