Acceleration with OmpSs and Neon/OpenCL on ARM Processor

Lillesand, Trond Inge

Master thesis

URI

http://hdl.handle.net/11250/253392

Date

2013

Collections

Institutt for datateknologi og informatikk [6822]

Abstract

The Mont-Blanc Project, led by the Barcelona Supercomputing Center, aims to achieve exascale computing performance with a new global standard in energy efficiency by integrating low-power ARM-based technology into a system of energy-efficient compute nodes. Application kernel research plays a crucial role in understanding the interplay between performance and energy efficiency in a High Performance Computing (HPC) system. In this thesis, two application kernels targeted by Mont-Blanc, 2D-Convolution and Merge Sort, are explored and implemented in OmpSs, NEON, and OpenCL on an Arndale development board containing an Exynos 5 System-on-Chip (SoC). The SoC, which contains a dual-core ARM Cortex-A15 processor and a Mali-T604 GPU, serves as a compute node in the Mont-Blanc project.

Due to the lack of access to energy-counting registers in the Exynos 5, a scheme for measuring whole-board energy consumption was created. Performance and energy efficiency metrics were used to evaluate the various implementations. The CPU frequency was also scaled for the different CPU implementations to see how it affects these metrics.
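As an illustration of how a whole-board measurement can be turned into the metrics mentioned above, the following is a minimal sketch, not the scheme used in the thesis. It assumes that board power is sampled externally as (time, watts) pairs and that energy efficiency is expressed as work per joule. The function names `energy_joules` and `energy_efficiency` are hypothetical.

```python
from typing import Sequence


def energy_joules(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power over time with the trapezoidal rule.

    Assumption: samples come from an external whole-board power meter;
    the thesis does not describe its measurement scheme in this abstract.
    """
    if len(times_s) != len(power_w) or len(times_s) < 2:
        raise ValueError("need at least two matching (time, power) samples")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_efficiency(work_units: float, energy_j: float) -> float:
    """Work completed per joule (for example, elements sorted per joule)."""
    return work_units / energy_j
```

With such a helper, two implementations of the same kernel can be compared by dividing the same amount of work by the energy each one consumed during its run.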

NEON vectorization was exploited by using vector extractions on the 2D-Convolution kernel to improve locality. For Merge Sort, NEON was exploited by implementing in-register sorting with a bitonic sorting network, similar to the approach taken by Chhugani et al. (2008) with SSE, but applied to NEON instead. Implementations of sorting networks and convolution kernels in OpenCL were also explored. Various scheduling policies in the OmpSs implementations were used to get a sense of how they affected performance.
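The bitonic merge idea behind in-register sorting can be sketched as follows. This is an illustrative Python model, not the thesis code and not NEON intrinsics: plain lists stand in for vector registers, and `compare_exchange` plays the role that lane-wise vector min/max instructions would play. The names `compare_exchange` and `bitonic_merge` are hypothetical.

```python
from typing import List, Sequence, Tuple


def compare_exchange(lo: Sequence[int], hi: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Lane-wise min and max of two equally sized 'registers'."""
    mins = [min(a, b) for a, b in zip(lo, hi)]
    maxs = [max(a, b) for a, b in zip(lo, hi)]
    return mins, maxs


def bitonic_merge(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Merge two ascending runs of equal power-of-two length."""
    n = len(a)
    if n == 0 or n != len(b) or n & (n - 1):
        raise ValueError("runs must have the same power-of-two length")
    # Ascending run followed by a reversed (descending) run is bitonic.
    seq = list(a) + list(b)[::-1]
    stride = n
    while stride >= 1:
        # Each stage compares elements 'stride' apart within each block.
        for start in range(0, len(seq), 2 * stride):
            lo = seq[start:start + stride]
            hi = seq[start + stride:start + 2 * stride]
            mins, maxs = compare_exchange(lo, hi)
            seq[start:start + stride] = mins
            seq[start + stride:start + 2 * stride] = maxs
        stride //= 2
    return seq
```

The network performs the same comparisons regardless of the input values, which is what makes it a good fit for vector instructions: there are no data-dependent branches, only min/max operations and data shuffles.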

The in-register merge sort scheme with NEON gave higher performance and energy efficiency than the OpenCL implementations, although a direct comparison may not be entirely appropriate, as the quality and circumstances of the implementations likely differ.

Vectorization with NEON delivered high performance and high energy efficiency, though at the cost of high power consumption, which demonstrates the power of locality improvement combined with vector operations.

The OpenCL implementation for 2D-Convolution demonstrated high performance and low power consumption, and achieved the highest energy efficiency in this particular case. For the OmpSs implementations, the choice of scheduling policy proved to affect performance. Scaling the CPU frequency shows that there is a balance point between frequency and energy efficiency: an excessively high frequency tends to increase power more than performance, and an excessively low frequency decreases performance more than power. The results indicate that this differential effect increases with the number of cores.
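The balance point can be read as a simple condition on relative changes. The following is a sketch under stated assumptions, not a formula taken from the thesis: energy efficiency is taken to be performance per watt, and the symbols $P(f)$ (performance), $W(f)$ (power) and $\eta(f)$ (efficiency) are introduced here for illustration, with $P, W > 0$ and both differentiable in the frequency $f$.

```latex
\eta(f) = \frac{P(f)}{W(f)}, \qquad
\frac{d\eta}{df} = \frac{P'(f)\,W(f) - P(f)\,W'(f)}{W(f)^{2}}
\quad\Longrightarrow\quad
\frac{d\eta}{df} > 0 \iff \frac{P'(f)}{P(f)} > \frac{W'(f)}{W(f)} .
```

Under these assumptions, raising the frequency improves efficiency only while the relative gain in performance exceeds the relative gain in power, and the balance point is where the two are equal.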

Publisher

Institutt for datateknikk og informasjonsvitenskap