Mali OpenCL SDK v1.1.0
A tutorial to demonstrate the use of the long data type (64-bit integer) including atomics.
In this tutorial we will look at using the long data types in OpenCL. We will also touch on how and why to use atomics in OpenCL. In the process, we introduce one of the 64-bit atomic extensions that Mali-T600 series GPUs support.
In the OpenCL Embedded Profile, 64-bit integer (i.e. long, ulong) types are optional (including the corresponding vector data types and operations). However, Mali-T600 series GPUs implement OpenCL Full Profile, where support for 64-bit integer types is required. 64-bit integers are supported and fully hardware accelerated on Mali-T600 series GPUs.
The long data types are used for calculations that require very large integers. Example use cases include:
This tutorial requires atomic operations for accumulating values across kernels. Atomic operations for 32-bit integers are a part of the core OpenCL 1.1 Full Profile and therefore supported by all Full Profile implementations (including Mali-T600 series GPUs). However, we require atomics for 64-bit integers, which is an optional extension in OpenCL 1.1 (cl_khr_int64_base_atomics). All Mali-T600 GPUs implement the extension for 64-bit atomics.
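A host program can confirm that the extension is available before building the kernel. An OpenCL device reports its extensions as a single space-separated string (queried with clGetDeviceInfo and CL_DEVICE_EXTENSIONS). The sketch below shows only the string check, in plain C, so that it runs without an OpenCL installation. The helper name has_extension is hypothetical and is not part of the SDK.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the SDK): returns 1 if `name` appears as a
 * whole, space-separated token in `extensions`, and 0 otherwise. */
static int has_extension(const char *extensions, const char *name)
{
    size_t name_len = strlen(name);
    const char *cursor = extensions;

    if (name_len == 0) {
        return 0;
    }

    while (*cursor != '\0') {
        /* Skip separators, then measure the next token. */
        while (*cursor == ' ') {
            cursor++;
        }
        size_t token_len = strcspn(cursor, " ");

        /* Compare whole tokens so that a longer name sharing the same
         * prefix does not produce a false match. */
        if (token_len == name_len && strncmp(cursor, name, name_len) == 0) {
            return 1;
        }
        cursor += token_len;
    }
    return 0;
}
```

In a real host program, the first argument would be the string returned for CL_DEVICE_EXTENSIONS and the second would be "cl_khr_int64_base_atomics".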
Unless otherwise noted, all code snippets come from 64_bit_integer.cl.
We have included a 512x512 input bitmap for use with this sample (to keep the size of the installer small). However, you are more likely to see performance improvements (when compared to C code running on a CPU) when larger images are used. There is some start-up overhead associated with using OpenCL. This overhead can outweigh the benefits of parallel processing when the input data sizes are small.
This sample has been coded to allow any input bitmap to be used. Simply replace input.bmp in the assets directory of the sample with the image of your choice.
Some face detection techniques, such as the Robust Real-time Object Detection framework (Viola and Jones, 2001), need to calculate the variance of an example sub-window with this formula:
$$\text{Variance} = \frac{1}{N}\sum p^2 - \left(\frac{\sum p}{N}\right)^2$$
where p is a pixel value and N is the total number of pixels.
For this example, we are only calculating the sum of pixel values and the sum of the pixel values squared. We calculate these values in an OpenCL kernel.
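As a reference for what the kernel has to produce, here is a minimal scalar sketch in plain C. It accumulates both sums in 64-bit integers and, as an extra step that the sample itself does not perform, derives the variance as the mean of the squares minus the square of the mean. The type and function names are illustrative assumptions, not names from the sample.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative names only; the sample keeps these two values in the
 * sumOfPixels and squareOfPixels accumulators. */
typedef struct {
    uint64_t sum_of_pixels;
    uint64_t sum_of_squares;
} pixel_sums;

/* Scalar reference: add every pixel and every squared pixel into
 * 64-bit accumulators. */
static pixel_sums accumulate_pixels(const uint8_t *pixels, size_t count)
{
    pixel_sums sums = {0, 0};
    for (size_t i = 0; i < count; i++) {
        uint64_t p = pixels[i];
        sums.sum_of_pixels += p;
        sums.sum_of_squares += p * p;
    }
    return sums;
}

/* Variance as mean of squares minus squared mean (count must be > 0). */
static double variance_from_sums(pixel_sums sums, size_t count)
{
    double n = (double)count;
    double mean = (double)sums.sum_of_pixels / n;
    return (double)sums.sum_of_squares / n - mean * mean;
}
```

For the four pixel values 1, 2, 3, 4, the sums are 10 and 30, which gives a variance of 30/4 - (10/4)^2 = 1.25.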
On Mali-T600 series GPUs, we recommend the use of 128-bit wide vectors. For more information about vectorization, see Vectorizing your OpenCL code.
The maximum pixel value at 8 bits per pixel is 255, and the square of this value (255 * 255 = 65025) fits inside a ushort (16-bit type, maximum value 65535). We use ushort8 because 8 * 16 bits = 128 bits, the preferred vector width. See Querying for Hardware Support for more information on preferred vector width.
However, the sum of the squares and the sum of the pixels can overflow both a short and an int. Therefore, we convert the vector elements to ulong and add them together until we have a single value that can be added to the corresponding accumulator (sumOfPixels and squareOfPixels respectively).
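The widening step can be emulated on the host to make the arithmetic concrete. The sketch below is plain C, not the kernel from 64_bit_integer.cl. It squares eight pixels in 16-bit lanes (standing in for a ushort8), then widens to 64 bits before reducing. The second function computes the worst-case sum of squares for an all-white image to show why a 32-bit accumulator is not enough. All names are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* 8 lanes * 16 bits = 128 bits, as with ushort8 */

/* Emulates one vector step: square in 16-bit lanes, then widen each lane
 * to 64 bits and reduce the lanes to a single value per quantity. */
static void reduce_eight_pixels(const uint8_t pixels[LANES],
                                uint64_t *sum, uint64_t *sum_of_squares)
{
    uint16_t squares[LANES];
    uint64_t local_sum = 0;
    uint64_t local_squares = 0;

    for (int lane = 0; lane < LANES; lane++) {
        /* 255 * 255 = 65025 still fits in 16 bits. */
        squares[lane] = (uint16_t)(pixels[lane] * pixels[lane]);
    }
    for (int lane = 0; lane < LANES; lane++) {
        local_sum += pixels[lane];
        local_squares += squares[lane];
    }
    *sum = local_sum;
    *sum_of_squares = local_squares;
}

/* Largest possible sum of squares for an image of `pixel_count` pixels. */
static uint64_t worst_case_sum_of_squares(uint64_t pixel_count)
{
    return pixel_count * 255u * 255u;
}
```

For the 512x512 sample image, the worst case is 262144 * 65025 = 17045913600, which is larger than the 32-bit maximum of 4294967295.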
As all the kernel instances access the accumulators at the same time, memory access conflicts can happen. This can lead to race conditions and lost data.
To avoid this, we use atom_add to add an integer value to a value referenced by a pointer. This guarantees that no other kernel instance executing on the same device can read or write that memory during the addition operation. Because of this guarantee, the operation is very expensive, so we want to use it only when necessary. Atomic operations also exist for other functions (e.g. subtract, exchange, increment, decrement).
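One way to keep atomic traffic low is to reduce as much as possible in private variables and issue a single atomic addition per unit of work. The sketch below illustrates that pattern in host-side C11, using atomic_fetch_add from <stdatomic.h> as a stand-in for atom_add in OpenCL C. It is an analogy under that assumption, not the kernel code, and the function name is hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side analogue of one work-item: sum a chunk of pixels
 * in a private variable, then publish the result with ONE atomic add. */
static void work_item_accumulate(_Atomic uint64_t *accumulator,
                                 const uint8_t *pixels, size_t count)
{
    uint64_t local = 0;

    /* Private reduction: no synchronization needed here. */
    for (size_t i = 0; i < count; i++) {
        local += pixels[i];
    }

    /* Single atomic read-modify-write on the shared accumulator,
     * playing the role of atom_add in the kernel. */
    atomic_fetch_add(accumulator, local);
}
```

Calling this function once per chunk costs one atomic operation for each chunk instead of one for each pixel.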
To enable atom_add for 64-bit integers, we add this pragma to the kernel code:
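For the cl_khr_int64_base_atomics extension, the standard OpenCL C enabling directive has the form shown below. It is assumed here to match the line in 64_bit_integer.cl and would normally sit at the top of the file, before any kernel that calls atom_add on a 64-bit value.

```c
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
```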
From a command prompt in the root of the SDK, run:
This compiles the 64-bit integer sample code and copies all the files it needs to run to the bin folder in the root directory of the SDK.
Navigate to the folder on the board and run the 64-bit integer binary:
You should see output similar to:
Find solutions for Common Issues.
For more information, have a look at the code in 64_bit_integer.cpp and 64_bit_integer.cl.