Mali OpenCL SDK v1.1.0
How to move from a C/C++ 'for loop' to an OpenCL kernel.
In this tutorial we look at a very simple algorithm: adding two arrays of numbers together element by element and storing the results in a third array:
C[n] = A[n] + B[n]
Despite its simplicity, this algorithm can benefit from being implemented in OpenCL. Because there are no dependencies between elements of the arrays, each element of the result can be calculated independently, and therefore in parallel. This kind of workload is ideal for OpenCL.
Unless otherwise noted, all code snippets below come from the OpenCL implementation in hello_world_opencl.cpp.
If we assume that we have three arrays (hello_world_c.cpp):
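The snippet is not reproduced here; a minimal sketch of the three arrays might look like the following (the size and the names ARRAY_SIZE, inputA, inputB and output are assumptions, not the SDK's exact source):

```c
/* Sketch: three arrays of the same fixed size. The size and names
 * are illustrative assumptions; hello_world_c.cpp may differ. */
#define ARRAY_SIZE 1000000

static int inputA[ARRAY_SIZE];
static int inputB[ARRAY_SIZE];
static int output[ARRAY_SIZE];
```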
Then an implementation in C/C++ is trivial (hello_world_c.cpp):
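A sketch of that trivial loop, written as a function (the function name and parameters are assumptions standing in for the SDK's source):

```c
#include <stddef.h>

/* Sketch: add two arrays element by element, storing the sums in a third.
 * Each iteration depends only on its own index. */
static void addArrays(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        c[i] = a[i] + b[i];
    }
}
```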
Discounting any compiler optimisations, this code executes sequentially on a CPU: each element of the array is calculated in series, creating an artificial dependency between the calculations.
The runtime for this code is proportional to the size of the arrays.
Move the parallelisable code into an OpenCL kernel
Take the parallel portion of the code (in our case, the for loop) and move it into an OpenCL kernel. For our code, with no optimisations, it looks like this (hello_world_opencl.cl):
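A sketch of such a kernel follows; the kernel and parameter names are assumptions, not necessarily the SDK's exact source:

```c
/* Sketch of an OpenCL C kernel: the loop body from the C version,
 * with the loop index replaced by the instance's global ID. */
__kernel void hello_world_opencl(__global int *inputA,
                                 __global int *inputB,
                                 __global int *output)
{
    /* Each kernel instance handles exactly one array element. */
    int i = get_global_id(0);
    output[i] = inputA[i] + inputB[i];
}
```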
Run multiple instances of the kernel
There is no loop inside the kernel, so to make the code operate on all elements of the array, we must run several instances of the same kernel:
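On the host side this is a single enqueue call; a sketch, assuming commandQueue, kernel and arraySize have already been set up:

```c
/* Sketch: submit one kernel instance per array element. */
size_t globalWorkSize[1] = {arraySize};
cl_int error = clEnqueueNDRangeKernel(commandQueue, kernel,
                                      1,              /* work dimensions */
                                      NULL,           /* no global offset */
                                      globalWorkSize, /* total number of instances */
                                      NULL,           /* let the driver choose a local size */
                                      0, NULL, NULL); /* no event dependencies */
```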
This submits arraySize instances of the kernel to the OpenCL device.
Each instance is assigned a unique ID which we use to pick which element of the array each instance operates on (see the kernel above).
Because we've not specified any dependencies between kernels, the OpenCL device is free to run the instances of the kernel in parallel. The only limit on parallelism now is the device capabilities.
The runtime for this code is proportional to the size of the arrays divided by the number of kernel instances that can operate in parallel.
There is some OpenCL setup required before the code above can be run. Take a look at hello_world_opencl.cpp for more information.
Because operations now happen on the GPU rather than the CPU, we need to track where any data we use is located. It is important to know whether the data is in the GPU or the CPU memory space.
In a desktop system the GPU and CPU have their own memories which are separated by a relatively slow bus. This can mean that sharing memory between the CPU and GPU is an expensive operation.
On most embedded systems with a Mali-T600 series GPU, the CPU and GPU share a common memory. It is therefore possible to share memory between the CPU and GPU relatively cheaply.
Because of these system differences, OpenCL supports many ways to allocate and share memory between devices.
Here is one way to share memory between devices which aims to eliminate copying memory from one device to another (in a shared memory system):
Ask the OpenCL implementation to allocate some memory
In this sample we need three blocks of memory (two inputs and the output).
We use arrays in a C/C++ implementation. To allocate the arrays, we would do (hello_world_c.cpp):
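A sketch of one way to do this on the heap (hello_world_c.cpp may instead use plain static arrays or C++ new[]; the function name is an assumption):

```c
#include <stdlib.h>

/* Sketch: heap-allocate one array of arraySize ints.
 * The caller is responsible for calling free() on the result. */
static int *allocateArray(size_t arraySize)
{
    return (int *)malloc(arraySize * sizeof(int));
}
```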
In OpenCL we use memory buffers which are just blocks of memory with a certain size. To allocate the buffers, we do:
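A sketch of the three clCreateBuffer calls, assuming a valid context and arraySize; CL_MEM_ALLOC_HOST_PTR asks the implementation to allocate memory the host can map directly, which avoids copies on a shared-memory system:

```c
/* Sketch: create two read-only input buffers and one write-only output
 * buffer, each large enough for arraySize cl_ints. */
cl_int errorNumber;
size_t bufferSize = arraySize * sizeof(cl_int);
cl_mem memoryObjects[3];

memoryObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                  bufferSize, NULL, &errorNumber);
if (errorNumber != CL_SUCCESS) { /* report the error and clean up */ }

memoryObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                  bufferSize, NULL, &errorNumber);
if (errorNumber != CL_SUCCESS) { /* report the error and clean up */ }

memoryObjects[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                  bufferSize, NULL, &errorNumber);
if (errorNumber != CL_SUCCESS) { /* report the error and clean up */ }
```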
Although this looks much more complex, there are only three OpenCL API calls. The difference is that here we check for errors (which is good practice), whereas the C/C++ version does not.
Map the memory to a local pointer
Now the memory has been allocated but only the OpenCL implementation knows where it is. To access the buffers on the CPU we map them to a pointer:
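A sketch of mapping one input buffer, assuming the commandQueue, memoryObjects and bufferSize names from the buffer-creation step:

```c
/* Sketch: blocking map of the first input buffer for writing.
 * On return, inputA is an ordinary host pointer into the buffer. */
cl_int errorNumber;
cl_int *inputA = (cl_int *)clEnqueueMapBuffer(commandQueue, memoryObjects[0],
                                              CL_TRUE,      /* block until mapped */
                                              CL_MAP_WRITE, /* we will write initial data */
                                              0, bufferSize,
                                              0, NULL, NULL, &errorNumber);
```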
These pointers can now be used as normal C/C++ pointers.
Initialise the data on the CPU
Because we now have ordinary pointers to the memory, this step is the same as in the C/C++ implementation:
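For example, a sketch that fills two such pointers with the same values a pure-CPU version would use (the function name and fill pattern are assumptions):

```c
#include <stddef.h>

/* Sketch: initialise the two input arrays through their (mapped) pointers,
 * exactly as we would for plain C arrays. */
static void initialiseArrays(int *inputA, int *inputB, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        inputA[i] = (int)i;
        inputB[i] = (int)i;
    }
}
```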
Un-map the buffers
To allow the OpenCL device to use the buffers we must un-map them from the CPU:
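A sketch of un-mapping one buffer, using the names from the earlier steps:

```c
/* Sketch: hand the buffer back to the OpenCL device. After this call
 * the inputA pointer must no longer be used on the host. */
cl_int error = clEnqueueUnmapMemObject(commandQueue, memoryObjects[0],
                                       inputA, /* the pointer from clEnqueueMapBuffer */
                                       0, NULL, NULL);
```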
Map the data to the kernels
We have to tell the kernel which data to use for its inputs before we schedule it to run.
Here, we map the memory objects to the parameters of the OpenCL kernel:
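A sketch, assuming the kernel and memoryObjects names used above; the argument indices match the parameter order of the kernel:

```c
/* Sketch: bind each buffer to the matching kernel parameter index. */
cl_int error = CL_SUCCESS;
error |= clSetKernelArg(kernel, 0, sizeof(cl_mem), &memoryObjects[0]); /* inputA */
error |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &memoryObjects[1]); /* inputB */
error |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &memoryObjects[2]); /* output */
```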
Run the kernels
For the kernel code and how to schedule it, see The Basics.
Get the results
Once the calculations are complete, we map the output buffer in the same way we mapped the input buffers. We can then read the results through the pointer as normal, and unmap the buffer as before.
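A sketch of that sequence, again assuming the names from the earlier steps:

```c
/* Sketch: blocking map of the output buffer for reading. */
cl_int errorNumber;
cl_int *output = (cl_int *)clEnqueueMapBuffer(commandQueue, memoryObjects[2],
                                              CL_TRUE, CL_MAP_READ,
                                              0, bufferSize,
                                              0, NULL, NULL, &errorNumber);

/* ... read output[0] .. output[arraySize - 1] as a normal pointer ... */

/* Hand the buffer back to the device when we are done. */
clEnqueueUnmapMemObject(commandQueue, memoryObjects[2], output, 0, NULL, NULL);
```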
For more information have a look at the C/C++ code in hello_world_c.cpp and the OpenCL version in hello_world_opencl.cpp and hello_world_opencl.cl.
Main Tutorial: Hello World.
Next section: Vectorizing your OpenCL code.