Acceleration with OmpSs and Neon/OpenCL on ARM Processor
In this thesis, the application kernels 2D-Convolution and Merge Sort are implemented in OmpSs, NEON and OpenCL on an Arndale development board containing an Exynos 5 SoC. The SoC contains a dual-core ARM Cortex-A15 processor and a Mali-T604 GPU. A scheme for measuring whole-board energy consumption is then created, and performance and energy-efficiency metrics are used to evaluate the implementations. The CPU frequency is also scaled for the different CPU implementations to see how it affects these metrics. NEON vectorization is exploited by using vector extractions on the 2D-Convolution kernel to improve locality. For Merge Sort, NEON is exploited by performing in-register sorting with a bitonic sorting network. With OpenCL, bitonic and odd-even sorting networks are used. Different scheduling policies are tried in the OmpSs implementations to find the best-performing policy. Vectorization with NEON gives the highest performance on both applications and the highest energy efficiency for Merge Sort, but this performance comes at the expense of high power consumption. The OpenCL implementation of 2D-Convolution gave high performance at low power consumption, and achieved the highest energy efficiency for that kernel. For the OmpSs implementations, the choice of scheduling policy proved to affect performance. Scaling the frequency shows that there is a balance point between frequency and energy efficiency: too high a frequency on two cores results in a larger increase in power than in performance, and too low a frequency results in a larger decrease in performance than in power. The results indicate that this difference increases with the number of cores.
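As a rough illustration of the bitonic sorting network mentioned above, the following minimal sketch sorts a power-of-two-length list using only compare-exchange steps. It is not the thesis implementation: it is plain Python rather than NEON intrinsics or OpenCL, and the names `compare_exchange` and `bitonic_sort` are hypothetical. The point it shows is that the sequence of index pairs compared does not depend on the data, which is what makes such networks a natural fit for in-register SIMD sorting.

```python
def compare_exchange(values, lo, hi):
    """Ensure values[lo] <= values[hi], swapping the pair if needed."""
    if values[lo] > values[hi]:
        values[lo], values[hi] = values[hi], values[lo]


def bitonic_sort(values):
    """Sort a list in place (ascending) with a bitonic sorting network.

    Assumption for this sketch: len(values) is a power of two.
    """
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")

    k = 2  # size of the bitonic sequences being merged in this stage
    while k <= n:
        j = k // 2  # distance between the two elements of each pair
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:  # visit each pair only once
                    if (i & k) == 0:
                        # this block is merged in ascending order
                        compare_exchange(values, i, partner)
                    else:
                        # this block is merged in descending order
                        compare_exchange(values, partner, i)
            j //= 2
        k *= 2
    return values
```

For example, `bitonic_sort([7, 3, 5, 1])` reorders the list into ascending order. In a vectorized version, each pass of the inner `for` loop would typically become a small number of vector minimum, maximum and shuffle operations; that mapping is an assumption of this sketch and is not taken from the thesis.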