AIO: An Abstraction for Performance Analysis Across Diverse Accelerator Architectures
DOI: 10.1109/ISCA59077.2024.00043

Abstract
Specialization is the key approach for continued performance growth beyond the end of Dennard scaling. Academics and industry are hence continuously proposing new accelerator architectures, including conventional Domain-Specific Accelerators (DSAs) and emerging Processing in Memory (PIM) accelerators. We are thus fast approaching an era in which early-stage accelerator analysis is critical for maintaining the productivity of software developers, system software designers, and computer architects, ensuring that they focus time-consuming implementation and optimization efforts on the most favorable class of accelerators for the problem at hand. Unfortunately, existing approaches fall short because they adopt a level of abstraction that is either too high, and therefore unable to account for key performance phenomena, or too low, because they focus on details that do not generalize across diverse accelerators. Our Architecture-Independent Operation (AIO) abstraction addresses this issue by leveraging the observation that accelerators typically focus on data-level parallelism; an AIO is hence a key piece of algorithm-level data-parallel work that remains the same across diverse accelerators. To demonstrate that the AIO abstraction can be accurate and useful, we create the AccMe performance model, which predicts kernel performance by estimating the number of clock cycles spent on compute, memory, and invocation overhead while accounting for overlap between compute and memory cycles as well as finite memory bandwidth. We demonstrate that AccMe can be accurate: it yields an average performance prediction error of $5.6\%$ across our diverse kernels and accelerators, a significant improvement over the $20.6\%$ average error of curve-fitted Roofline, which provides the best-case accuracy of Roofline's operational intensity abstraction. We further demonstrate that AccMe is useful through three case studies that illustrate (i) how developers can use AccMe for accelerator selection under uncertainty; (ii) how system software can use AccMe for scheduling, improving throughput by $2.8\times$ on average compared to Roofline-driven scheduling; and (iii) how computer architects can use AccMe for architectural exploration.
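To make the cycle accounting concrete, one plausible reading of the model sketched above is the following; the notation is ours rather than the paper's, and the exact composition AccMe uses may differ:

$$T_{\text{kernel}} \approx \max\!\left(C_{\text{comp}},\, C_{\text{mem}}\right) + C_{\text{inv}}, \qquad C_{\text{mem}} \geq \frac{B \cdot f}{BW_{\text{peak}}}$$

where $T_{\text{kernel}}$ is the predicted kernel execution time in cycles, $C_{\text{comp}}$ and $C_{\text{mem}}$ are the estimated compute and memory cycles, $C_{\text{inv}}$ is the invocation overhead, $B$ is the number of bytes transferred, $f$ is the accelerator clock frequency, and $BW_{\text{peak}}$ is the peak memory bandwidth. The $\max$ term captures full overlap of compute and memory cycles, while the bound on $C_{\text{mem}}$ reflects finite memory bandwidth; a partial-overlap model would fall between the $\max$ and the sum of the two cycle terms.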