Mali OpenCL SDK v1.1.0
Vectorizing your OpenCL code

How to vectorize the hello world example to increase performance.

Vectorization Basics

Performance Benefits

If there is hardware support for vectors on your device, multiple arithmetic operations can happen simultaneously. For example, a device with 128-bit vector hardware can perform arithmetic on four 32-bit integers simultaneously.

Often, OpenCL kernels perform the same operations on multiple pieces of data. In these cases the code can usually be vectorized so that each kernel instance performs arithmetic on more than one piece of data, with those arithmetic operations carried out simultaneously. Doing more calculations per operation leads directly to a performance benefit.
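For illustration only (the variable names below are hypothetical, not taken from the samples), a single vector addition in OpenCL C expresses several scalar additions at once:

/* Scalar addition: one 32-bit integer addition per operation. */
int c = a + b;

/*
 * Vector addition: on 128-bit vector hardware, the four 32-bit additions
 * that make up an int4 addition can be performed simultaneously.
 */
int4 c4 = a4 + b4;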

Querying for Hardware Support

OpenCL devices can advertise their preferred vector widths for the different OpenCL data types. You can use this information to select a kernel that is optimised for the platform you are running on.

For example, one device may only have hardware support for scalar integers while another could have hardware support for integer vectors of width 4. Two versions of the kernel could be written, one using scalars and one using vectors, with the correct version being selected at runtime.

Here is an example of querying for the preferred integer vector width on a particular device (taken from hello_world_vector.cpp):

/*
 * Query the device to find out its preferred integer vector width.
 * Although we are only printing the value here, it can be used to select between
 * different versions of a kernel.
 */
cl_uint integerVectorWidth;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, sizeof(cl_uint), &integerVectorWidth, NULL);
cout << "Preferred vector width for integers: " << integerVectorWidth << endl;

The same is possible for the other OpenCL data types.
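For example, the preferred widths for other data types can be queried in the same way, and the result can be used to choose a kernel version at runtime. The following is a sketch only: the kernel names, and the program and errorNumber variables, are illustrative and assume a program object that contains both a scalar and a vector kernel, which is not how the samples are organised.

/* Query the preferred vector widths for two further data types (sketch). */
cl_uint charVectorWidth;
cl_uint floatVectorWidth;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR, sizeof(cl_uint), &charVectorWidth, NULL);
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, sizeof(cl_uint), &floatVectorWidth, NULL);
cout << "Preferred vector width for chars: " << charVectorWidth << endl;
cout << "Preferred vector width for floats: " << floatVectorWidth << endl;

/*
 * Use the preferred integer width to pick between a scalar and a vector kernel.
 * The kernel names here are illustrative; the samples build them from separate files.
 */
const char* kernelName = (integerVectorWidth >= 4) ? "hello_world_vector" : "hello_world_opencl";
cl_int errorNumber = CL_SUCCESS;
cl_kernel chosenKernel = clCreateKernel(program, kernelName, &errorNumber);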

Mali Hardware Support

Each Mali-T600 series GPU core has a minimum of two 128-bit wide ALUs (arithmetic logic units), which are vector capable.

Most operations in the ALUs (for example, floating-point add, floating-point multiply, integer add, integer multiply) can operate on 128-bit vector data (for example, char16, short8, int4, float4).

Use the querying method above to determine the correct vector size to use for your data type.

We recommend the use of vectors wherever possible when using a Mali-T600 series GPU.

Vectorizing The Code

Apart from the changes described below, the vector and non-vector versions of the code are exactly the same.

  1. Modify the kernel to use vectors

    Our basic hello world kernel looked like this (hello_world_opencl.cl):

    __kernel void hello_world_opencl(__global int* restrict inputA,
                                     __global int* restrict inputB,
                                     __global int* restrict output)
    {
        /*
         * Set i to be the ID of the kernel instance.
         * If the global work size (set by clEnqueueNDRangeKernel) is n,
         * then n kernels will be run and i will be in the range [0, n - 1].
         */
        int i = get_global_id(0);

        /* Use i as an index into the three arrays. */
        output[i] = inputA[i] + inputB[i];
    }

    Each instance of this kernel does a single integer addition in one operation. We can vectorize this code to do multiple integer additions in a single operation.

    Because of the vector hardware capabilities of Mali-T600 series GPUs, these vector operations can be executed in the same time as a single integer addition.

    We can vectorize the kernel like so (hello_world_vector.cl):

    __kernel void hello_world_vector(__global int* restrict inputA,
                                     __global int* restrict inputB,
                                     __global int* restrict output)
    {
        /*
         * We have reduced the global work size (n) by a factor of 4 compared to the hello_world_opencl sample.
         * Therefore, i will now be in the range [0, (n / 4) - 1].
         */
        int i = get_global_id(0);

        /*
         * Load 4 integers into 'a'.
         * The offset calculation is implicit from the size of the vector load.
         * For vloadN(i, p), the address of the first element loaded is p + i * N.
         * Here, the data is loaded from the address: inputA + i * 4.
         */
        int4 a = vload4(i, inputA);

        /* Do the same for inputB. */
        int4 b = vload4(i, inputB);

        /*
         * Do the vector addition.
         * Store the result at the address: output + i * 4.
         */
        vstore4(a + b, i, output);
    }
  2. Reduce the number of kernel instances

    Because each kernel instance is now doing multiple additions, we must reduce the number of kernel instances accordingly (hello_world_vector.cpp):

    /*
     * Each instance of our OpenCL kernel now operates on 4 elements of each array, so the number of
     * instances needed is the number of elements in the array divided by 4.
     */
    size_t globalWorksize[1] = {arraySize / 4};

    /* Enqueue the kernel. */
    if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorksize, NULL, 0, NULL, &event)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed enqueuing the kernel. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

    The reduction factor is based on the width of the vectors. For example, if we had used int8 vectors in the kernel instead of int4 vectors, we would reduce the global work size by a factor of 8, as sketched below.
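    As a sketch only (the kernel name hello_world_vector8 is not part of the SDK samples), an int8 version of the kernel and the matching work size could look like this:

    __kernel void hello_world_vector8(__global int* restrict inputA,
                                      __global int* restrict inputB,
                                      __global int* restrict output)
    {
        /* Each instance now handles 8 elements, so i is in the range [0, (n / 8) - 1]. */
        int i = get_global_id(0);

        /* vload8/vstore8 use an implicit element offset of i * 8. */
        int8 a = vload8(i, inputA);
        int8 b = vload8(i, inputB);
        vstore8(a + b, i, output);
    }

    /* On the host, reduce the global work size by a factor of 8 to match. */
    size_t globalWorksize[1] = {arraySize / 8};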

More Information

For more information have a look at the basic OpenCL code in hello_world_opencl.cpp and hello_world_opencl.cl and the vectorized version in hello_world_vector.cpp and hello_world_vector.cl.

Main Tutorial: Hello World.

Previous section: From C to OpenCL.

Next section: Running the Samples.