Mali OpenCL SDK v1.1.0
64-bit Integers and Atomics

A tutorial demonstrating the use of the long data type (64-bit integer), including 64-bit atomics.

Introduction

In this tutorial we will look at using the long data types in OpenCL. We will also touch on how and why to use atomics in OpenCL. In the process, we introduce one of the 64-bit atomic extensions that Mali-T600 series GPUs support.

Long Data Type

In the OpenCL Embedded Profile, 64-bit integer (i.e. long, ulong) types are optional (including the corresponding vector data types and operations). However, Mali-T600 series GPUs implement OpenCL Full Profile, where support for 64-bit integer types is required. 64-bit integers are supported and fully hardware accelerated on Mali-T600 series GPUs.

The long data types are used for calculations that require very large integers. Example use cases include:

  • Fixed point arithmetic.
  • Encryption/decryption.
  • Hashing.
  • 64-bit arithmetic.
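To illustrate the fixed point case, here is a minimal host-side C sketch (not taken from the sample). The unsigned Q16.16 format and the names uq16_16 and uq16_mul are assumptions made for this example. It shows why the intermediate product of two 32-bit fixed point values needs a 64-bit integer.

```c
#include <stdint.h>

/* Unsigned Q16.16 fixed point: 16 integer bits, 16 fractional bits.
 * The format and the names below are hypothetical, chosen for illustration. */
typedef uint32_t uq16_16;

/*
 * Multiply two Q16.16 values. The raw product carries 32 fractional bits
 * and can need up to 64 bits, so it is computed in a 64-bit integer and
 * then shifted back down to 16 fractional bits.
 */
static uq16_16 uq16_mul(uq16_16 a, uq16_16 b)
{
    uint64_t wide = (uint64_t)a * (uint64_t)b;
    return (uq16_16)(wide >> 16);
}
```

For example, 1.5 is stored as 98304 and 2.5 as 163840. Their raw product (16106127360) does not fit in 32 bits, but after the shift the result is 245760, which is 3.75 in Q16.16.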

64-bit Atomics

This tutorial uses atomic operations to accumulate values across work-items (the parallel instances of a kernel). Atomic operations on 32-bit integers are part of the core OpenCL 1.1 Full Profile and are therefore supported by all Full Profile implementations (including Mali-T600 series GPUs). However, we need atomics on 64-bit integers, which OpenCL 1.1 provides only through an optional extension (cl_khr_int64_base_atomics). All Mali-T600 series GPUs implement this extension.

Note
Both 32-bit and 64-bit atomics are optional extensions in the OpenCL Embedded Profile.
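Because the extension is optional, portable host code may want to check for it before building the kernel. The following is a minimal sketch, not code from the sample. It assumes the space-separated extension string has already been obtained from clGetDeviceInfo with CL_DEVICE_EXTENSIONS, and has_extension is a hypothetical helper name.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `name` appears as a whole token in the space-separated
 * `extensions` string. Matching whole tokens avoids false positives on
 * extension names that merely contain `name` as a substring.
 * (has_extension is a hypothetical helper, not an OpenCL API call.)
 */
static bool has_extension(const char *extensions, const char *name)
{
    size_t name_len = strlen(name);
    if (name_len == 0) {
        return false;
    }

    const char *p = extensions;
    while ((p = strstr(p, name)) != NULL) {
        bool starts_token = (p == extensions) || (p[-1] == ' ');
        bool ends_token = (p[name_len] == ' ') || (p[name_len] == '\0');
        if (starts_token && ends_token) {
            return true;
        }
        p += name_len;
    }
    return false;
}
```

A host application could call has_extension(extensions, "cl_khr_int64_base_atomics") and fall back to another accumulation strategy if it returns false.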

Implementation

Unless otherwise noted, all code snippets come from 64_bit_integer.cl.

Image Size

We have included a 512x512 input bitmap with this sample. It is deliberately small to keep the size of the installer down. However, you are more likely to see performance improvements (compared to C code running on a CPU) with larger images. Using OpenCL has some start-up overhead, which can outweigh the benefits of parallel processing when the input data is small.

The sample accepts any input bitmap. To try a different image, replace input.bmp in the assets directory of the sample with the image of your choice.

64-bit Arithmetic

Some face detection techniques, such as the Robust Real-time Object Detection framework (Viola and Jones, 2001), need to calculate the variance of an example sub-window with this formula:

$$ \mathrm{Variance} = \frac{1}{N} \sum p^2 - \left( \frac{\sum p}{N} \right)^2 $$

where p is each pixel value and N is the total number of pixels.
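As a CPU reference, the two sums and the resulting variance can be sketched in plain C. This sketch uses the standard definition of variance (the mean of the squares minus the square of the mean). The function names are hypothetical and the code is not taken from the sample.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of 8-bit pixel values, accumulated in 64 bits. */
static uint64_t sum_pixels(const uint8_t *pixels, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; ++i) {
        sum += pixels[i];
    }
    return sum;
}

/* Sum of squared 8-bit pixel values, accumulated in 64 bits. */
static uint64_t sum_squares(const uint8_t *pixels, size_t count)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < count; ++i) {
        sum += (uint64_t)pixels[i] * pixels[i];
    }
    return sum;
}

/* Variance from the two sums; n is the pixel count and must be non-zero. */
static double variance_from_sums(uint64_t sum, uint64_t sum_of_squares, uint64_t n)
{
    double mean = (double)sum / (double)n;
    double mean_of_squares = (double)sum_of_squares / (double)n;
    return mean_of_squares - mean * mean;
}
```

For the four pixels 1, 2, 3, 4 the sums are 10 and 30, so the variance is 30/4 - (10/4)^2 = 1.25.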

For this example, we are only calculating the sum of pixel values and the sum of the pixel values squared. We calculate these values in an OpenCL kernel.

On Mali-T600 series GPUs, we recommend using 128-bit wide vectors. For more information about vectorization, see Vectorizing your OpenCL code.

The maximum pixel value at 8 bits per pixel is 255, and the square of this value (255 * 255 = 65025) fits inside a ushort (16-bit type, maximum value 65535). We use ushort8 because 8 * 16 bits = 128 bits, the preferred vector width. See Querying for Hardware Support for more information on preferred vector width.

However, summed over a whole image, the pixel values and their squares can overflow a ushort and, for large enough images, a uint. Therefore, we convert both vectors to ulong8 and add the vector elements together until we have a single value that can be added to the corresponding accumulator (sumOfPixels and squareOfPixels respectively).
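The overflow argument can be checked with a short worst-case calculation, assuming every pixel has the maximum value 255. This is a host-side C sketch with hypothetical function names, not part of the sample.

```c
#include <stdint.h>

#define MAX_PIXEL 255u

/* Largest possible sum of pixel values for an image of pixel_count pixels. */
static uint64_t worst_case_sum(uint64_t pixel_count)
{
    return pixel_count * MAX_PIXEL;
}

/* Largest possible sum of squared pixel values for the same image. */
static uint64_t worst_case_sum_of_squares(uint64_t pixel_count)
{
    return pixel_count * MAX_PIXEL * MAX_PIXEL;
}

/* Non-zero if the value can be stored in a 32-bit unsigned integer. */
static int fits_in_uint32(uint64_t value)
{
    return value <= UINT32_MAX;
}
```

For the 512x512 sample image (262144 pixels), the worst-case sum is 66846720, which still fits in 32 bits, but the worst-case sum of squares is 17045913600, which does not. Even for a single vector of 8 pixels, the worst-case sum of squares (520200) already exceeds the ushort maximum of 65535.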

/* Load 8 pixels (8 bits each) and convert them to ushort to calculate the square. */
ushort8 pixelShort = convert_ushort8(vload8(i, imagePixels));
/* The square of 255 is less than 2^16, so it fits in a ushort. */
ushort8 newSquareShort = pixelShort * pixelShort;
/*
 * Convert the pixel values and their squares to ulong so that the
 * vector elements can be summed without overflow and the totals
 * added to the respective accumulators.
 */
ulong8 pixelLong = convert_ulong8(pixelShort);
ulong8 newSquareLong = convert_ulong8(newSquareShort);
/*
 * Use the vector suffixes (.lo and .hi) to halve the vector width
 * at each step, until we obtain a single value.
 */
ulong4 sumLongPixels1 = pixelLong.hi + pixelLong.lo;
ulong2 sumLongPixels2 = sumLongPixels1.hi + sumLongPixels1.lo;
ulong sumLongPixels3 = sumLongPixels2.hi + sumLongPixels2.lo;
ulong4 sumLongSquares1 = newSquareLong.hi + newSquareLong.lo;
ulong2 sumLongSquares2 = sumLongSquares1.hi + sumLongSquares1.lo;
ulong sumLongSquares3 = sumLongSquares2.hi + sumLongSquares2.lo;
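The .hi + .lo reduction above can be emulated in scalar C to check the arithmetic on a CPU. This is a sketch under the assumption that .lo selects the lower half of the lanes and .hi the upper half. The names reduce8, sum8 and sum_squares8 are hypothetical.

```c
#include <stdint.h>

/*
 * Scalar emulation of the .hi + .lo reduction: at each step the upper
 * half of the active lanes is added onto the lower half, halving the
 * width (8 -> 4 -> 2 -> 1) until one value remains.
 */
static uint64_t reduce8(const uint64_t lanes[8])
{
    uint64_t v[8];
    for (int i = 0; i < 8; ++i) {
        v[i] = lanes[i];
    }
    for (int width = 4; width >= 1; width /= 2) {
        for (int i = 0; i < width; ++i) {
            v[i] += v[i + width]; /* lo[i] + hi[i] */
        }
    }
    return v[0];
}

/* Sum of 8 pixels, widened to 64 bits before the reduction. */
static uint64_t sum8(const uint8_t px[8])
{
    uint64_t wide[8];
    for (int i = 0; i < 8; ++i) {
        wide[i] = px[i];
    }
    return reduce8(wide);
}

/* Sum of 8 squared pixels: square in 16 bits, then widen and reduce. */
static uint64_t sum_squares8(const uint8_t px[8])
{
    uint64_t wide[8];
    for (int i = 0; i < 8; ++i) {
        uint16_t p = px[i];
        uint16_t square = (uint16_t)(p * p); /* at most 65025 */
        wide[i] = square;
    }
    return reduce8(wide);
}
```

For the pixels 1 to 8, sum8 returns 36 and sum_squares8 returns 204, matching a straightforward loop over the same values.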

Because all the work-items access the accumulators at the same time, memory access conflicts can occur. These can lead to race conditions and lost updates.

To avoid this, we use atom_add, which atomically adds an integer value to the value referenced by a pointer. This guarantees that no other work-item executing on the same device can read or write that memory location during the addition. The guarantee makes the operation expensive, so we use it only when necessary. Atomic versions of other operations also exist (e.g. subtract, increment, decrement, exchange).

atom_add(sumOfPixels, sumLongPixels3);
atom_add(squareOfPixels, sumLongSquares3);

To enable atom_add for 64-bit integers, we add this pragma to the kernel code:

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
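The accumulation pattern can be sketched on the host with C11 atomics, using atomic_fetch_add from <stdatomic.h> as a CPU analogue of atom_add. This is not the sample's code: the struct, the function names and the sequential loop standing in for parallel work-items are all assumptions made for illustration. Each emulated work-item reduces its 8 pixels locally and issues only one atomic add per accumulator, which keeps the number of expensive atomic operations low.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Shared accumulators (hypothetical host-side stand-ins for the kernel's). */
typedef struct {
    _Atomic uint64_t sum_of_pixels;
    _Atomic uint64_t square_of_pixels;
} accumulators;

/*
 * Body of one emulated work-item: reduce 8 pixels into local variables,
 * then publish the partial results with a single atomic add each.
 */
static void work_item(const uint8_t *pixels, size_t i, accumulators *acc)
{
    uint64_t sum = 0;
    uint64_t squares = 0;
    for (size_t k = 0; k < 8; ++k) {
        uint64_t p = pixels[i * 8 + k];
        sum += p;
        squares += p * p;
    }
    atomic_fetch_add(&acc->sum_of_pixels, sum);
    atomic_fetch_add(&acc->square_of_pixels, squares);
}

/*
 * Run every work-item. They run one after another here for simplicity;
 * on a GPU they would run concurrently, which is why the adds are atomic.
 * pixel_count is assumed to be a multiple of 8.
 */
static void run_all(const uint8_t *pixels, size_t pixel_count, accumulators *acc)
{
    atomic_store(&acc->sum_of_pixels, 0);
    atomic_store(&acc->square_of_pixels, 0);
    for (size_t i = 0; i < pixel_count / 8; ++i) {
        work_item(pixels, i, acc);
    }
}
```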

Running the Sample

  1. From a command prompt in the root of the SDK, run:

    cd samples/64_bit_integer
    make install

    This compiles the 64-bit integer sample code and copies all the files it needs to run to the bin folder in the root directory of the SDK.

  2. Copy this folder to the board.
  3. Navigate to the folder on the board and run the 64-bit integer binary:

    ./64_bit_integer

  4. You should see output similar to:

    Profiling information:
    Queued time: 0.079ms
    Wait time: 0.263749ms
    Run time: 0.228437ms
    Square of the pixel values = 914239116
    Sum of the pixel values = 9365464

Find solutions for Common Issues.

More Information

For more information, see the code in 64_bit_integer.cpp and 64_bit_integer.cl.