Principles, Techniques, and Tools for Explicit and Automatic Parallelization
Abstract
The end of Dennard scaling also brought an end to frequency scaling as a means to improve performance. Chip manufacturers had to abandon frequency and superscalar scaling as processors became increasingly power constrained. An architecture's power budget became the limiting factor for performance gains, and computations had to be performed more energy-efficiently. Designers turned to chip multiprocessors (CMPs), and developers began to employ specialized architectures, such as Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs), to further improve performance while staying within the power envelope. The exploitation of parallelism in an energy-efficient manner became the primary way forward.
Until the end of Dennard scaling, programs experienced transparent performance gains with every new processor generation. However, CMPs, GPUs, and FPGAs rely on the static extraction of parallelism to improve performance, and programs need to be modified to take advantage of these architectures. Thus, performance gains are no longer achieved transparently, and developers and tools are forced to face new as well as long-neglected challenges in program parallelization. These challenges include the detection and encoding of potential parallelism in automatic approaches, application portability issues on GPUs, and performance portability issues on CMPs. It is essential to address these challenges, as the continued increase in computer performance now relies solely on the exploitation of parallelism.
This thesis consists of three parts, each addressing one of the aforementioned challenges in program parallelization. The first part addresses the detection and encoding of potential parallelism in automatic approaches. It presents the Regionalized Value State Dependence Graph (RVSDG) as an alternative intermediate representation (IR) for optimizing and parallelizing compilers. The RVSDG exposes the hierarchical structure of programs and explicitly models the dependencies between computations, permitting the explicit encoding of concurrent operations and of program structures such as conditionals, loops, and functions. This helps to expose the inherent parallelism of a program and its structures, allowing well-known methods for the extraction of instruction-level parallelism to be employed.
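To make the idea of explicit dependencies concrete, the following C++ fragment is a minimal sketch of an RVSDG-like data structure. All names and the layout are illustrative assumptions, not the representation defined in the thesis: nodes carry explicit operand edges instead of implicit control flow, and structured constructs own subregions.

    #include <memory>
    #include <string>
    #include <vector>

    struct Node;

    // A region contains nodes whose only interconnections are explicit
    // value and state edges; nodes without edges between them are
    // trivially concurrent.
    struct Region {
        std::vector<std::unique_ptr<Node>> nodes;
    };

    // A node consumes the results of other nodes through explicit
    // operand edges. Structured constructs own subregions: a
    // conditional (gamma) node holds one region per branch, a loop
    // (theta) node holds a single body region.
    struct Node {
        std::string op;
        std::vector<Node*> operands;
        std::vector<Region> subregions;
    };

    int main() {
        Region top;
        // Two additions with no edge between them: a compiler can
        // schedule them concurrently without dependence analysis.
        top.nodes.push_back(std::make_unique<Node>(Node{"add", {}, {}}));
        top.nodes.push_back(std::make_unique<Node>(Node{"add", {}, {}}));
        // A gamma node modelling an if/else with two subregions.
        auto gamma = std::make_unique<Node>(
            Node{"gamma", {top.nodes[0].get()}, {}});
        gamma->subregions.emplace_back(); // then-branch
        gamma->subregions.emplace_back(); // else-branch
        top.nodes.push_back(std::move(gamma));
    }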
The second part addresses application portability issues on GPUs. A GPU's specialized architecture is optimized for highly regular data-parallel applications, but it compromises program performance for workloads with irregular control flow, potentially leading to redundant code execution. We propose a control flow restructuring method that effectively eliminates repeated code execution on GPUs and can thereby improve performance.
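As an illustration of the general idea behind such restructuring (a hedged sketch, not the thesis's exact algorithm), the following C++ fragment shows how a loop with multiple exits, which a structurizing compiler might otherwise handle by duplicating the code that follows each exit, can be rewritten into a single-exit form using a predicate variable:

    #include <cstdio>

    // Irregular form: two distinct loop exits, which a structurizing
    // GPU compiler may only handle by duplicating the tail code.
    int irregular(const int *a, int n, int key) {
        for (int i = 0; i < n; ++i) {
            if (a[i] == key) return i;   // exit 1: found
            if (a[i] < 0)    return -2;  // exit 2: invalid element
        }
        return -1;                       // exit 3: not found
    }

    // Restructured form: a single loop exit guarded by a predicate.
    // Instead of duplicating code after each exit, a predicate records
    // which exit was taken and a single switch dispatches afterwards.
    int restructured(const int *a, int n, int key) {
        int exit_pred = 0, pos = -1;
        for (int i = 0; i < n && exit_pred == 0; ++i) {
            if (a[i] == key)   { exit_pred = 1; pos = i; }
            else if (a[i] < 0) { exit_pred = 2; }
        }
        switch (exit_pred) {
            case 1:  return pos;
            case 2:  return -2;
            default: return -1;
        }
    }

    int main() {
        int a[] = {3, 7, 5, 9};
        // Both forms compute the same result: index 2.
        std::printf("%d %d\n", irregular(a, 4, 5), restructured(a, 4, 5));
    }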
The third part addresses performance portability on CMPs. This issue arises as developers overfit their applications to a specific architecture, which results in suboptimal performance for different program inputs or on different architectures. We improve performance analysis for OpenMP programs by addressing the scalability challenges of the grain graph visualization method. We present an aggregation method for grain graphs that hierarchically groups related nodes into a single node. The aggregated graph can then be navigated by progressively uncovering nodes with performance issues while hiding unrelated regions of the graph. This enhances productivity by enabling developers to understand the performance problems of highly-parallel OpenMP programs more easily.
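The collapse-and-uncover navigation can be sketched in a few lines of C++. This is an illustrative assumption about how such aggregation might be realized, not the grain graph tool's actual implementation; all names are hypothetical:

    #include <cstdio>
    #include <vector>

    // A grain is a unit of OpenMP work (e.g. a task); grains nest.
    struct Grain {
        const char *name;
        bool has_issue;               // flagged by performance analysis
        std::vector<Grain> children;
    };

    // True if this grain or any grain below it has a performance issue.
    bool contains_issue(const Grain &g) {
        if (g.has_issue) return true;
        for (const Grain &c : g.children)
            if (contains_issue(c)) return true;
        return false;
    }

    // Print the graph, collapsing every issue-free subtree into one
    // aggregated node so only problematic regions are uncovered.
    void show(const Grain &g, int depth) {
        bool expand = contains_issue(g);
        std::printf("%*s%s%s\n", depth * 2, "", g.name,
                    (expand || g.children.empty()) ? "" : " [aggregated]");
        if (!expand) return;
        for (const Grain &c : g.children) show(c, depth + 1);
    }

    int main() {
        Grain root{"parallel region", false,
                   {{"task A", false, {{"A.1", false, {}}, {"A.2", false, {}}}},
                    {"task B", false, {{"B.1", true, {}}, {"B.2", false, {}}}}}};
        show(root, 0); // task A collapses; task B is uncovered down to B.1
    }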
The insights and techniques developed by addressing these three challenges may result in improved methods and tools for the exploitation of parallelism. The RVSDG is a promising IR for parallelizing compilers, as it permits the encoding of concurrent computations. The aggregated grain graph offers developers a familiar structural view of a program along with its performance issues.
In the future, it is necessary to cast these ideas into mature tools to make them applicable in practice and to foster further research.
Has parts:
Paper 1: Reissmann, Nico; Meyer, Jan Christian; Själander, Magnus. RVSDG: An Intermediate Representation for Optimizing Compilers. Awaiting publication; not included in NTNU Open.
Paper 2: Bahmann, Helge; Reissmann, Nico; Jahre, Magnus; Meyer, Jan Christian. Perfect Reconstructability of Control Flow from Demand Dependence Graphs. ACM Transactions on Architecture and Code Optimization (TACO) 2015, Volume 11(4). http://doi.org/10.1145/2693261
Paper 3: Reissmann, Nico; Falch, Thomas Løfsgaard; Bjørnseth, Benjamin Andreassen; Bahmann, Helge; Meyer, Jan Christian; Jahre, Magnus. Efficient Control Flow Restructuring for GPUs. In: International Conference on High Performance Computing & Simulation (HPCS), IEEE 2016, pp. 48-57. http://doi.org/10.1109/HPCSim.2016.7568315
Paper 4: Reissmann, Nico; Muddukrishna, Ananya. Diagnosing Highly-Parallel OpenMP Programs with Aggregated Grain Graphs. Lecture Notes in Computer Science 2018, Volume 11014, pp. 106-119. https://doi.org/10.1007/978-3-319-96983-1_8