ImageCL and Other Techniques and Tools for Optimizing Applications Utilizing Heterogeneous Computing
Abstract
Several technological and economic trends have recently driven the adoption of systems with heterogeneous and parallel hardware. These systems combine computing devices with different architectures, such as multi-core CPUs, GPUs and other accelerators. However, while heterogeneous hardware can, in theory, provide high compute power at a low energy cost, harnessing this power in practice can be challenging. In this dissertation, we address this challenge. Our aim is to make heterogeneous hardware easier to use by investigating and highlighting problem areas, and by developing tools and techniques to solve them.
Image processing has become increasingly widespread and important. The combination of complex algorithms and growing amounts of data makes image processing applications ideal candidates for the potentially high performance of heterogeneous hardware. For these reasons, we have chosen image processing as the use case for the techniques and tools we develop. This choice also allows these tools and techniques to be integrated with the FAST image processing framework, developed by our collaborators.
The focus of this dissertation is on two central challenges in high performance computing: 1) performance portability – making it possible to write code once and run it with good performance on different devices without retuning – and 2) load balancing – dividing work between devices with different architectures and capabilities in proportion to what each device can handle.
To solve these challenges, we initially explore the space of potential optimizations targeting image processing on modern heterogeneous hardware. The focus is primarily on stencil-like applications on GPUs, a class that includes an important subset of image processing applications. We identify two main classes of optimizations: those related to the distribution of work to threads, and those related to the use of the various memory spaces. We also develop a novel optimization technique, in which the registers of multiple threads on a GPU are combined to form a manually managed, level-0 cache shared among the participating threads, achieving speedups of up to 2.04 compared to using shared memory, on a Nvidia GTX 680 GPU.
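The core of this technique can be pictured for a 3-point 1D stencil (the dissertation handles more general stencils): each thread keeps its input element in a register, and neighbouring values are read with warp shuffle instructions, which became available with the Kepler architecture of the GTX 680. The following is a minimal CUDA sketch of the idea, not the dissertation's implementation; the kernel and variable names are illustrative.

    // Minimal sketch: 3-point 1D stencil using the registers of a warp as a
    // manually managed cache. Each thread caches one element; neighbours'
    // values are read with shuffles instead of shared memory.
    __global__ void stencil3_register_cache(const float *in, float *out, int n)
    {
        // Assumes gridDim.x * blockDim.x == n, so all lanes are active.
        int i    = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % warpSize;

        float v = in[i];  // the element this thread caches in a register

        // Read the neighbours' registers via warp shuffles.
        float left  = __shfl_up_sync(0xffffffffu, v, 1);
        float right = __shfl_down_sync(0xffffffffu, v, 1);

        // Lanes at the warp boundaries (and the array ends) must fetch
        // their halo elements from global memory instead.
        if (lane == 0)            left  = (i > 0)     ? in[i - 1] : v;
        if (lane == warpSize - 1) right = (i < n - 1) ? in[i + 1] : v;

        out[i] = (left + v + right) / 3.0f;
    }

Within a warp, each interior element is thus loaded from global memory only once, and the halo exchange between threads happens entirely in registers.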
With potential optimizations established, we investigate how high-level languages and auto-tuning can be used to improve performance and, especially, performance portability. To demonstrate our ideas, we develop ImageCL, a simplified version of OpenCL, along with a source-to-source compiler. While designed with image processing in mind, the ImageCL language and the underlying techniques are general. From a single, high-level language input, the compiler can generate multiple OpenCL implementations, each applying a different combination of the previously identified optimizations. Auto-tuning can then be used to pick the best implementation for a given device. We develop an auto-tuner for this purpose, which uses a performance model to guide and speed up the search. It thus becomes possible to write an algorithm once and automatically generate high performance code for different devices. When evaluated using simple image processing benchmarks on different CPUs and GPUs, ImageCL and the auto-tuner compare favorably to other state-of-the-art approaches (Halide, HIPACC, OpenCV), achieving speedups of up to 4.57.
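The tuning loop itself can be pictured with a minimal, purely empirical sketch: the host enumerates candidate configurations (here only the thread-block size), times each with CUDA events, and keeps the fastest. Our actual auto-tuner explores a much larger space of compiler-generated variants and uses a performance model to prune the search; the kernel below is only a stand-in, and all names are illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <vector>

    // Stand-in for a compiler-generated kernel variant to be tuned.
    __global__ void saxpy(const float *x, float *y, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 22;
        float *x, *y;
        cudaMalloc((void **)&x, n * sizeof(float));
        cudaMalloc((void **)&y, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Empirically evaluate each candidate configuration.
        int best_bs = 0;
        float best_ms = 1e30f;
        for (int bs : std::vector<int>{64, 128, 256, 512, 1024}) {
            int grid = (n + bs - 1) / bs;
            saxpy<<<grid, bs>>>(x, y, 2.0f, n);          // warm-up
            cudaEventRecord(start);
            saxpy<<<grid, bs>>>(x, y, 2.0f, n);          // timed run
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            if (ms < best_ms) { best_ms = ms; best_bs = bs; }
        }
        printf("best block size: %d (%.3f ms)\n", best_bs, best_ms);

        cudaFree(x);
        cudaFree(y);
        return 0;
    }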
To address the second challenge, we develop a load balancing scheme for cases where several devices with different architectures are used together. In such cases, the optimal work partitioning is typically application dependent. Our scheme uses performance models to predict the performance of an application on the different devices, based on performance counter values measured during a test run on one device. Compared to an oracle, our approach is only 3% slower when dividing work between a CPU and a GPU on synthetic benchmarks. We also develop a version of the ImageCL compiler that can generate code for multiple devices on a node, or for multiple nodes in a cluster, to be used with the load balancing technique.
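Once the models have produced per-device predictions, the partitioning step itself is straightforward: each device receives a share of the work inversely proportional to its predicted time per work item, so that all devices are predicted to finish simultaneously. A sketch, with placeholder predictions standing in for the model's output:

    #include <cstdio>

    // Split `total` work items between a CPU and a GPU so that both are
    // predicted to finish at the same time. t_cpu and t_gpu are the
    // predicted times per work item (placeholders for the model output).
    static int gpu_share(int total, double t_cpu, double t_gpu)
    {
        // Throughput is 1/t, so the GPU's share of the work is
        // (1/t_gpu) / (1/t_cpu + 1/t_gpu) = t_cpu / (t_cpu + t_gpu).
        double gpu_fraction = t_cpu / (t_cpu + t_gpu);
        return (int)(gpu_fraction * total + 0.5);
    }

    int main()
    {
        int rows   = 4096;                            // e.g. image rows
        int to_gpu = gpu_share(rows, 2.0e-6, 0.5e-6); // GPU 4x faster here
        printf("GPU: %d rows, CPU: %d rows\n", to_gpu, rows - to_gpu);
        return 0;
    }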
We use machine learning to build the performance models of both the auto-tuner and the load balancer automatically, removing the need for manually derived analytical models. Our performance models achieve mean relative errors as low as 3.8% and 2.4% for the auto-tuner and the load balancer, respectively, demonstrating the suitability of machine learning for this task.
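As a stand-in for the fuller models in the dissertation, which are trained on features such as tuning parameters and performance counter values, the sketch below fits runtime against a single feature by ordinary least squares and evaluates it with the same mean-relative-error metric. All data points are made up for illustration.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Model { double a, b; };   // runtime = a * feature + b

    // Ordinary least-squares fit of a one-feature linear model.
    static Model fit(const std::vector<double> &x, const std::vector<double> &y)
    {
        double n = x.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (size_t i = 0; i < x.size(); ++i) {
            sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        return { a, (sy - a * sx) / n };
    }

    int main()
    {
        // Made-up training data: problem size vs. measured runtime (ms).
        std::vector<double> size = {1e6, 2e6, 4e6, 8e6};
        std::vector<double> ms   = {1.1, 2.0, 4.2, 8.1};
        Model m = fit(size, ms);

        // Mean relative error over held-out measurements.
        std::vector<double> test_size = {3e6, 6e6};
        std::vector<double> test_ms   = {3.1, 6.2};
        double err = 0;
        for (size_t i = 0; i < test_size.size(); ++i) {
            double pred = m.a * test_size[i] + m.b;
            err += std::fabs(pred - test_ms[i]) / test_ms[i];
        }
        printf("mean relative error: %.1f%%\n",
               100.0 * err / test_size.size());
        return 0;
    }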
By applying the techniques and tools developed in this dissertation, we show how developing high performance applications for heterogeneous computing can be made easier. The techniques and tools provided include our machine learning based auto-tuner, the ImageCL compiler and the optimizations it can apply, and the machine learning based load balancing technique for the multi-device version of ImageCL. Although we use image processing as a use case, most of the techniques we develop can be applied or generalized to other domains, in particular those where stencil applications are used. Together, these tools and techniques diminish the challenges of programming heterogeneous hardware, and allow it to fulfill its promise of increased performance.