Mali OpenCL SDK v1.1.0
From C to OpenCL

How to move from a C/C++ 'for loop' to an OpenCL kernel.

The Algorithm

In this tutorial we are going to look at a very simple algorithm. Adding two arrays of numbers together element-by-element and storing the results in a third array:

C[n] = A[n] + B[n]

Despite its simplicity, this algorithm can benefit from being implemented in OpenCL. Because there are no dependencies between elements of the arrays, each element of the output can be calculated independently, and therefore in parallel. This kind of workload is ideal for OpenCL.

Unless otherwise noted, all code snippets below come from the OpenCL implementation in hello_world_opencl.cpp.

C/C++ Implementation

If we assume that we have three arrays (hello_world_c.cpp):

/* Number of elements in the arrays of input and output data. */
int arraySize = 1000000;
/* Arrays to hold the input and output data. */
int* inputA = new int[arraySize];
int* inputB = new int[arraySize];
int* output = new int[arraySize];

Then an implementation in C/C++ is trivial (hello_world_c.cpp):

for (int i = 0; i < arraySize; i++)
{
    output[i] = inputA[i] + inputB[i];
}

Discounting any compiler optimisations, this code will execute sequentially on a CPU. Each element of the array is calculated in series, so an artificial dependency has been created between the calculations.

The runtime for this code is proportional to the size of the arrays.

OpenCL Implementation

The Basics

  1. Move the parallelisable code into an OpenCL kernel

    Take the parallel portion of the code (in our case, the for loop) and move it into an OpenCL kernel. For our code, with no optimisations, it looks like this (hello_world_opencl.cl):

    __kernel void hello_world_opencl(__global int* restrict inputA,
                                     __global int* restrict inputB,
                                     __global int* restrict output)
    {
        /*
         * Set i to be the ID of the kernel instance.
         * If the global work size (set by clEnqueueNDRangeKernel) is n,
         * then n kernels will be run and i will be in the range [0, n - 1].
         */
        int i = get_global_id(0);

        /* Use i as an index into the three arrays. */
        output[i] = inputA[i] + inputB[i];
    }
  2. Run multiple instances of the kernel

    There is no loop inside the kernel, so to make the code operate on all elements of the array, we must run several instances of the same kernel:

    /*
     * Each instance of our OpenCL kernel operates on a single element of each array,
     * so the number of instances needed is the number of elements in the array.
     */
    size_t globalWorksize[1] = {arraySize};

    /* Enqueue the kernel. */
    if (!checkSuccess(clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorksize, NULL, 0, NULL, &event)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed enqueuing the kernel. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

    This submits arraySize kernel instances to the OpenCL device.

    Each instance is assigned a unique ID which we use to pick which element of the array each instance operates on (see the kernel above).

    Because we have not specified any dependencies between kernel instances, the OpenCL device is free to run them in parallel. The only limit on parallelism now is the capabilities of the device.

    The runtime for this code is proportional to the size of the arrays divided by the number of kernel instances that can operate in parallel.
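
    Once the kernel instances have been enqueued, the host typically waits for them to complete before touching the results. The following is a minimal sketch, not the sample's exact code, using the event returned by clEnqueueNDRangeKernel above; the profiling query is optional and only valid if the command queue was created with CL_QUEUE_PROFILING_ENABLE:

    /* Block until the enqueued kernel instances have finished executing. */
    /* Alternatively, clFinish(commandQueue) waits for every command in the queue. */
    if (!checkSuccess(clWaitForEvents(1, &event)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Waiting for the kernel failed. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

    /*
     * Optional: if the queue has profiling enabled, the same event can be used
     * to time the kernel. The timestamps are in nanoseconds.
     */
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);

    /* Release the event once we have finished with it. */
    clReleaseEvent(event);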

OpenCL Setup

There is some OpenCL setup required before the code above can be run. Take a look at hello_world_opencl.cpp for more information.
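The sample wraps this setup in helper functions and checks every call for errors. Stripped of that error handling, the raw API sequence looks roughly like the sketch below; treat it as an outline rather than the sample's exact code (the file name and kernel name follow this tutorial):

/* Requires <CL/cl.h>, <fstream>, <string> and <iterator>. Error checks omitted for brevity. */
cl_int errorNumber = CL_SUCCESS;

/* Pick the first platform and the first GPU device on it. */
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

/* A context groups devices; a command queue submits work to one device. */
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &errorNumber);
cl_command_queue commandQueue = clCreateCommandQueue(context, device, 0, &errorNumber);

/* Load the OpenCL C source and build it for the chosen device. */
std::ifstream kernelFile("hello_world_opencl.cl");
std::string source((std::istreambuf_iterator<char>(kernelFile)), std::istreambuf_iterator<char>());
const char* sourcePointer = source.c_str();
cl_program program = clCreateProgramWithSource(context, 1, &sourcePointer, NULL, &errorNumber);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);

/* The kernel name must match the __kernel function in the .cl file. */
cl_kernel kernel = clCreateKernel(program, "hello_world_opencl", &errorNumber);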

Memory

Because operations are now happening on the GPU rather than the CPU, we need to understand where any data we use is located. It is important to know whether the data is in GPU or CPU memory space.

In a desktop system the GPU and CPU have their own memories which are separated by a relatively slow bus. This can mean that sharing memory between the CPU and GPU is an expensive operation.

On most embedded systems with a Mali-T600 series GPU, the CPU and GPU share a common memory. It is therefore possible to share memory between the CPU and GPU relatively cheaply.

Because of these system differences, OpenCL supports many ways to allocate and share memory between devices.
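
For example, on a desktop system with separate CPU and GPU memories, a common pattern is to allocate the data on the host and copy it across the bus explicitly. A rough sketch of that copy-based style (not the approach this sample takes) looks like this, reusing the variable names from the snippets below:

/* Create device-side buffers without asking OpenCL to allocate host-visible memory. */
cl_mem bufferA = clCreateBuffer(context, CL_MEM_READ_ONLY, bufferSize, NULL, &errorNumber);
cl_mem bufferOutput = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bufferSize, NULL, &errorNumber);

/* Copy the host array inputA into the device buffer (blocking write). */
clEnqueueWriteBuffer(commandQueue, bufferA, CL_TRUE, 0, bufferSize, inputA, 0, NULL, NULL);

/* ... set the kernel arguments and enqueue the kernel ... */

/* Copy the results back into the host array output (blocking read). */
clEnqueueReadBuffer(commandQueue, bufferOutput, CL_TRUE, 0, bufferSize, output, 0, NULL, NULL);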

Here is one way to share memory between devices which aims to eliminate copying memory from one device to another (in a shared memory system):

  1. Ask the OpenCL implementation to allocate some memory

    In this sample we need three blocks of memory (two inputs and the output).

    We use arrays in a C/C++ implementation. To allocate the arrays, we would do (hello_world_c.cpp):

    /* Number of elements in the arrays of input and output data. */
    int arraySize = 1000000;
    /* Arrays to hold the input and output data. */
    int* inputA = new int[arraySize];
    int* inputB = new int[arraySize];
    int* output = new int[arraySize];

    In OpenCL we use memory buffers which are just blocks of memory with a certain size. To allocate the buffers, we do:

    /* Number of elements in the arrays of input and output data. */
    cl_int arraySize = 1000000;

    /* The buffers are the size of the arrays. */
    size_t bufferSize = arraySize * sizeof(cl_int);

    /*
     * Ask the OpenCL implementation to allocate buffers for the data.
     * We ask the OpenCL implementation to allocate memory rather than allocating
     * it on the CPU to avoid having to copy the data later.
     * The read/write flags relate to accesses to the memory from within the kernel.
     */
    bool createMemoryObjectsSuccess = true;

    memoryObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);

    memoryObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);

    memoryObjects[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize, NULL, &errorNumber);
    createMemoryObjectsSuccess &= checkSuccess(errorNumber);

    if (!createMemoryObjectsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to create OpenCL buffer. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

    Although this looks much more complex, there are only three OpenCL API calls. The difference is that here we are checking for errors (which is good practice), whereas the C/C++ version does not.

  2. Map the memory to a local pointer

    Now the memory has been allocated but only the OpenCL implementation knows where it is. To access the buffers on the CPU we map them to a pointer:

    /* Map the memory buffers created by the OpenCL implementation to pointers so we can access them on the CPU. */
    bool mapMemoryObjectsSuccess = true;

    cl_int* inputA = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[0], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);

    cl_int* inputB = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[1], CL_TRUE, CL_MAP_WRITE, 0, bufferSize, 0, NULL, NULL, &errorNumber);
    mapMemoryObjectsSuccess &= checkSuccess(errorNumber);

    if (!mapMemoryObjectsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to map buffer. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

    These pointers can now be used as normal C/C++ pointers.

  3. Initialise the data on the CPU

    Because we have pointers to the memory, this step is the same as it would be in the C/C++ version:

    for (int i = 0; i < arraySize; i++)
    {
        inputA[i] = i;
        inputB[i] = i;
    }
  4. Un-map the buffers

    To allow the OpenCL device to use the buffers we must un-map them from the CPU:

    /*
     * Unmap the memory objects as we have finished using them from the CPU side.
     * We unmap the memory because otherwise:
     * - reads and writes to that memory from inside a kernel on the OpenCL side are undefined;
     * - the OpenCL implementation cannot free the memory when it is finished.
     */
    if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[0], inputA, 0, NULL, NULL)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Unmapping memory objects failed. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

    if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[1], inputB, 0, NULL, NULL)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Unmapping memory objects failed. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }
  5. Map the data to the kernels

    We have to tell the kernel which data to use for its inputs before we schedule it to run.

    Here, we set the memory objects as the arguments of the OpenCL kernel:

    bool setKernelArgumentsSuccess = true;
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 0, sizeof(cl_mem), &memoryObjects[0]));
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 1, sizeof(cl_mem), &memoryObjects[1]));
    setKernelArgumentsSuccess &= checkSuccess(clSetKernelArg(kernel, 2, sizeof(cl_mem), &memoryObjects[2]));

    if (!setKernelArgumentsSuccess)
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed setting OpenCL kernel arguments. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }
  6. Run the kernels

    For the kernel code and how to schedule it, see The Basics.

  7. Get the results

    Once the calculations are complete, we map the output buffer in the same way we mapped the input buffers, read the results through the returned pointer as normal, and then un-map the buffer as before.
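
    A sketch of that final step, following the pattern of the earlier snippets (hello_world_opencl.cpp may structure its checks slightly differently):

    /* Map the output buffer so we can read the results on the CPU. */
    cl_int* output = (cl_int*)clEnqueueMapBuffer(commandQueue, memoryObjects[2], CL_TRUE, CL_MAP_READ, 0, bufferSize, 0, NULL, NULL, &errorNumber);
    if (!checkSuccess(errorNumber))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Failed to map buffer. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }

    /* Both inputs were initialised to i, so each output element should be 2 * i. */
    for (int i = 0; i < arraySize; i++)
    {
        if (output[i] != 2 * i)
        {
            cerr << "Incorrect result at index " << i << endl;
            break;
        }
    }

    /* Un-map the output buffer now that we have finished reading it. */
    if (!checkSuccess(clEnqueueUnmapMemObject(commandQueue, memoryObjects[2], output, 0, NULL, NULL)))
    {
        cleanUpOpenCL(context, commandQueue, program, kernel, memoryObjects, numberOfMemoryObjects);
        cerr << "Unmapping memory objects failed. " << __FILE__ << ":" << __LINE__ << endl;
        return 1;
    }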

More Information

For more information have a look at the C/C++ code in hello_world_c.cpp and the OpenCL version in hello_world_opencl.cpp and hello_world_opencl.cl.

Main Tutorial: Hello World.

Next section: Vectorizing your OpenCL code.
