Performance Modeling of Heterogeneous Systems
As the complexity of parallel computers grows, building larger systems requires both larger and increasingly non-linear parameter sets to model their behavior realistically. These heterogeneous characteristics create a trade-off between the complexity and accuracy of performance models, making them hard to use for design decisions. In this thesis, we take a bottom-up approach to realistically modeling software and hardware interactions, composing system models from simpler, linear models so that parts of the analysis can be automated. We associate empirically benchmarked platform performance metrics with the core elements of a variant of bulk-synchronous execution, aiming to quantify application performance and the associated potential for overlapping computation and communication on SMP clusters. We introduce the original bulk-synchronous performance model and identify areas of computation and communication where its abstractions impede realistic modeling of contemporary hardware. We address these areas independently, using experimental evidence to develop a representation that collects computation-kernel characteristics and pairwise communication costs in matrices, which combine into a system model. As bulk-synchronous execution depends strongly on periodic, global synchronization, we develop a cost model for it by combining latency measurements with a parametric representation of signalling patterns, and experimentally verify the resulting predictions for three common algorithms. We describe a design for implementing the BSPLib programming interface that combines threads and message-passing parallelism to achieve overlap on commodity cluster platforms, implementing its one-sided communication primitives with out-of-band control messages. We augment and validate the cost model of one adapted synchronization algorithm with the corresponding bandwidth requirement, completing a framework for modeling the performance of BSPLib programs.
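The bulk-synchronous performance model referred to above is not spelled out in the abstract; in its standard form (due to Valiant), the cost of one superstep on p processors is

\[
T_{\text{superstep}} \;=\; \max_{0 \le i < p} w_i \;+\; g \cdot h \;+\; l ,
\]

where $w_i$ is the local computation time on processor $i$, $h$ is the maximum number of words any single processor sends or receives (an $h$-relation), $g$ is the per-word communication cost, and $l$ is the global synchronization latency. The framework described above replaces the scalar communication and synchronization terms with matrices of pairwise measurements and a signalling-pattern-dependent synchronization cost.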
Finally, we test the utility of this framework as a proof of concept for guiding software performance adaptations in two case studies. First, we use the latency terms to automatically generate synchronization operations, with model predictions producing patterns customized to the platform topology; the resulting algorithms equal or outperform the system defaults. Second, we compare the strong-scaling characteristics of a 5-point stencil code across three implementations. Experiments show the performance overhead of our implementation, but also its capability for predicting program cost, including the parameter values that optimize for balanced overlap of computation and communication.
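To make the stencil case study concrete: a 5-point stencil updates each interior grid cell from itself and its four axis neighbours, which is why each sweep requires a halo exchange between neighbouring processes, and why overlapping that exchange with the interior update matters. A minimal serial sketch of one explicit sweep follows; the function and grid size are hypothetical illustrations, not the thesis' BSPLib implementations.

```c
#include <stddef.h>

#define N 6  /* grid size, including a fixed boundary layer (illustrative) */

/* One explicit 5-point stencil sweep (heat-equation style): every interior
 * cell is updated from itself and its four axis neighbours, scaled by the
 * diffusion coefficient c. Boundary cells are copied through unchanged. */
void stencil_sweep(double c, const double in[N][N], double out[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            out[i][j] = in[i][j];               /* keep boundary values */
    for (size_t i = 1; i < N - 1; i++)
        for (size_t j = 1; j < N - 1; j++)
            out[i][j] = in[i][j] + c * (in[i-1][j] + in[i+1][j]
                                      + in[i][j-1] + in[i][j+1]
                                      - 4.0 * in[i][j]);
}
```

In a distributed run, each process would own a block of rows and exchange one-row halos with its neighbours every sweep; the model's computation and communication parameters then determine how much of that exchange can be hidden behind the interior update.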