Mali OpenCL SDK v1.1.0
How to vectorize the hello world example to increase performance.
If your device has hardware support for vectors, multiple arithmetic operations can happen simultaneously. For example, a device with 128-bit vector hardware can perform arithmetic on four 32-bit integers at once.
Often, OpenCL kernels perform the same operations on multiple pieces of data. In those cases the code can usually be vectorized so that each kernel instance operates on more than one piece of data, with those arithmetic operations done simultaneously. Doing more calculations in each operation generally improves performance.
OpenCL devices can advertise their preferred vector widths for the different OpenCL data types. You can use this information to select a kernel that is optimised for the platform you are running on.
For example, one device may only have hardware support for scalar integers while another could have hardware support for integer vectors of width 4. Two versions of the kernel could be written, one using scalars and one using vectors, with the correct version being selected at runtime.
The preferred integer vector width of a particular device can be queried by calling clGetDeviceInfo with CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT (see hello_world_vector.cpp).
The same is possible for the other OpenCL data types.
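Once the preferred width is known, choosing between a scalar and a vector kernel is a small runtime decision. The following is a minimal sketch of that selection step only. It assumes the width has already been obtained from the device, and both kernel names are hypothetical placeholders rather than names confirmed by the SDK sources.

```cpp
#include <cassert>
#include <string>

// Picks which kernel variant to build and run.
// `preferredIntWidth` stands for the value a device reports for
// CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT (obtained through clGetDeviceInfo).
// Both kernel names below are hypothetical, chosen for illustration only.
std::string selectKernelName(unsigned preferredIntWidth)
{
    // Use the int4 kernel only if the device prefers vectors of at least
    // four integers; otherwise fall back to the scalar kernel.
    if (preferredIntWidth >= 4)
    {
        return "hello_world_vector";
    }
    return "hello_world_scalar";
}
```

A device reporting a width of 1 would get the scalar kernel, while a device reporting 4 or more would get the vector kernel.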
Each Mali-T600 series GPU core has a minimum of two 128-bit wide, vector-capable ALUs (arithmetic logic units).
Most operations in the ALUs (e.g. floating point add, floating point multiply, integer addition, integer multiply) can operate on 128-bit vector data (e.g. char16, short8, int4, float4).
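The vector types listed above follow from simple arithmetic: the number of elements per vector is the vector width in bits divided by the element size in bits. As a small sketch of that calculation (the helper name lanesPerVector is hypothetical, and the function does not query any device):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in one vector of `vectorBits` bits.
// Illustrative arithmetic only; it does not query any device.
template <typename T>
std::size_t lanesPerVector(std::size_t vectorBits)
{
    const std::size_t elementBits = sizeof(T) * 8;
    assert(vectorBits % elementBits == 0);  // width must hold whole elements
    return vectorBits / elementBits;
}
```

For a 128-bit ALU this gives 16 elements for an 8-bit type, 8 for a 16-bit type, and 4 for 32-bit integers and floats, which corresponds to char16, short8, int4 and float4.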
Use the querying method above to determine the correct vector size to use for your data type.
We recommend the use of vectors wherever possible when using a Mali-T600 series GPU.
Other than the changes described in this tutorial, the vector and non-vector versions of the code are exactly the same.
Modify the kernel to use vectors
Our basic hello world example looked like this (hello_world_opencl.cpp):
Each instance of this kernel does a single integer addition in one operation. We can vectorise this code to do multiple integer additions in a single operation.
Because of the vector hardware capabilities of Mali-T600 series GPUs, these vector operations can be executed in the same time as a single integer addition.
We can vectorise like so (hello_world_vector.cl):
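The block below is not the contents of hello_world_vector.cl. It is a plain C++ emulation, offered as a rough illustration of the difference between the two kernels: one work-item adding a single pair of integers versus one work-item adding four pairs at once. The Int4 struct and all function names are hypothetical stand-ins for OpenCL's int4 type and the SDK kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for OpenCL's int4: four 32-bit integers handled as one value.
struct Int4
{
    int s[4];
};

// Component-wise addition, mirroring how `+` acts on OpenCL vector types.
inline Int4 operator+(const Int4& a, const Int4& b)
{
    return Int4{{a.s[0] + b.s[0], a.s[1] + b.s[1],
                 a.s[2] + b.s[2], a.s[3] + b.s[3]}};
}

// Emulated scalar kernel: work-item `id` adds one pair of integers.
void scalarKernel(std::size_t id, const int* inputA, const int* inputB, int* output)
{
    output[id] = inputA[id] + inputB[id];
}

// Emulated vector kernel: work-item `id` loads four consecutive integers
// from each input, adds them as one Int4, and stores four results.
void vectorKernel(std::size_t id, const int* inputA, const int* inputB, int* output)
{
    const std::size_t base = id * 4;
    const Int4 a{{inputA[base], inputA[base + 1], inputA[base + 2], inputA[base + 3]}};
    const Int4 b{{inputB[base], inputB[base + 1], inputB[base + 2], inputB[base + 3]}};
    const Int4 sum = a + b;
    for (std::size_t i = 0; i < 4; ++i)
    {
        output[base + i] = sum.s[i];
    }
}

// Runs both emulated kernels over the same data and reports whether the
// outputs agree. arraySize must be a multiple of 4.
bool vectorMatchesScalar(std::size_t arraySize)
{
    assert(arraySize % 4 == 0);
    std::vector<int> a(arraySize), b(arraySize);
    std::vector<int> scalarOut(arraySize), vectorOut(arraySize);
    for (std::size_t i = 0; i < arraySize; ++i)
    {
        a[i] = static_cast<int>(i);
        b[i] = static_cast<int>(2 * i);
    }
    // One work-item per element for the scalar version.
    for (std::size_t id = 0; id < arraySize; ++id)
    {
        scalarKernel(id, a.data(), b.data(), scalarOut.data());
    }
    // One work-item per four elements for the vector version.
    for (std::size_t id = 0; id < arraySize / 4; ++id)
    {
        vectorKernel(id, a.data(), b.data(), vectorOut.data());
    }
    return scalarOut == vectorOut;
}
```

On a CPU this emulation gains nothing in speed. Its purpose is only to show that the vector form produces the same results with a quarter of the work-items.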
Reduce the number of kernel instances
Because each kernel instance is now doing multiple additions, we must reduce the number of kernel instances accordingly (hello_world_vector.cpp):
The reduction factor is the width of the vectors. For example, if we had used int8 instead of int4 in the kernel, we would reduce the global work size by a factor of 8.
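The work size calculation described above can be sketched as a small helper. This is an illustration under stated assumptions rather than the SDK's code: the function name globalWorkSize is hypothetical, and the sketch assumes the array length is an exact multiple of the vector width, so no leftover elements need separate handling.

```cpp
#include <cassert>
#include <cstddef>

// Number of kernel instances needed when each instance processes
// `vectorWidth` elements. Assumes arraySize is a multiple of vectorWidth.
std::size_t globalWorkSize(std::size_t arraySize, std::size_t vectorWidth)
{
    assert(vectorWidth > 0);
    assert(arraySize % vectorWidth == 0);  // no partial vector at the end
    return arraySize / vectorWidth;
}
```

With an array of 1,000,000 integers, a scalar kernel needs 1,000,000 instances, an int4 kernel needs 250,000, and an int8 kernel needs 125,000.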
For more information have a look at the basic OpenCL code in hello_world_opencl.cpp and hello_world_opencl.cl and the vectorized version in hello_world_vector.cpp and hello_world_vector.cl.
Main Tutorial: Hello World.
Previous section: From C to OpenCL.