Mali OpenCL SDK v1.1.0
Sobel Filter

The Sobel image filter is a simple convolution filter used primarily in edge detection.

Example Result

Input image: sobel_input.bmp
Output image: sobel_output.bmp

The Algorithm

One technique for doing edge detection on an image is to find the gradient of the image. Areas with large gradients correspond to areas of large change in colour/intensity of the image. These areas are typically edges.

If you convolve the two Sobel operators with an image, you get two outputs:

  • the gradient in the X direction (dX)
  • the gradient in the Y direction (dY)

The gradients are then typically combined to give a total gradient image.
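As an illustration of one common way to combine the two gradients, the sketch below takes the per-pixel magnitude sqrt(dX^2 + dY^2) and clamps it to 8 bits. The helper name `combineGradients` is hypothetical, and this is not necessarily the combination the sample itself uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: combines per-pixel dX and dY gradients into a single
// 8-bit gradient-magnitude image, clamping each result to [0, 255].
std::vector<std::uint8_t> combineGradients(const std::vector<int>& dx,
                                           const std::vector<int>& dy)
{
    assert(dx.size() == dy.size());
    std::vector<std::uint8_t> magnitude(dx.size());
    for (std::size_t i = 0; i < dx.size(); ++i) {
        // Euclidean length of the gradient vector (dX, dY).
        const double m = std::sqrt(static_cast<double>(dx[i]) * dx[i] +
                                   static_cast<double>(dy[i]) * dy[i]);
        magnitude[i] = static_cast<std::uint8_t>(std::min(255L, std::lround(m)));
    }
    return magnitude;
}
```

A cheaper alternative that is sometimes used is |dX| + |dY|, which avoids the square root at the cost of accuracy.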

The Sobel operators are:

dX:

 -1  0  1
 -2  0  2
 -1  0  1

dY:

  1  2  1
  0  0  0
 -1 -2 -1

Convolution is well suited to parallelization. Each output pixel depends only on the input pixels, which do not change during the calculation, and not on any other output pixel. Therefore, every output pixel can be calculated independently and simultaneously.
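For reference, a minimal scalar sketch of the calculation in C++ is shown here. The helper name `applyMask` is hypothetical and this is not code from the SDK. It applies the masks exactly as listed (without flipping them), and each loop iteration reads only the input image, which is why the iterations could run in any order or in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The Sobel masks as listed above (top row first).
static const int kSobelDX[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
static const int kSobelDY[3][3] = {{1, 2, 1}, {0, 0, 0}, {-1, -2, -1}};

// Hypothetical helper: applies one 3x3 mask at every interior pixel of a
// row-major 8-bit image. Border outputs are left at 0.
std::vector<int> applyMask(const std::vector<std::uint8_t>& image,
                           int width, int height, const int mask[3][3])
{
    assert(width >= 3 && height >= 3);
    assert(image.size() == static_cast<std::size_t>(width) * height);
    std::vector<int> out(image.size(), 0);
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            // Weighted sum over the 3x3 neighbourhood centred on (x, y).
            int sum = 0;
            for (int j = -1; j <= 1; ++j)
                for (int i = -1; i <= 1; ++i)
                    sum += mask[j + 1][i + 1] * image[(y + j) * width + (x + i)];
            out[y * width + x] = sum;
        }
    }
    return out;
}
```

Calling `applyMask` once with `kSobelDX` and once with `kSobelDY` gives the two gradient outputs.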

For more details see Wikipedia.

Implementation

Image Size

We have included a 512x512 input bitmap for use with this sample (to keep the size of the installer small). However, you are more likely to see performance improvements (when compared to C code running on a CPU) when larger images are used. There is some start-up overhead associated with using OpenCL. This overhead can outweigh the benefits of parallel processing when the input data sizes are small.

This sample has been coded to allow any input bitmap to be used. To try a different image, replace input.bmp in the assets directory of the sample with the input image of your choice.

Padding

It is important to note that we have not considered padding here. Because every output requires the pixels around it, it is impossible to calculate the output for the edge pixels, so the valid output region is two pixels smaller than the input in both dimensions. In this example we simply leave the edge output pixels at the values they were initialized to.

Sometimes it is desirable for the output signal to be the same size as the input signal, in which case "padding" must be applied to the input to account for the fact that applying the filter, by its nature, reduces the size. Strategies for padding differ, but for images, a common choice is to repeat the boundary values (i.e. the outermost set of pixels) on all sides or (in some cases) on just some sides.
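The sample does not pad its input, but the boundary-repeating strategy described above could be sketched as follows. The helper name `padReplicate` is hypothetical; it adds a one-pixel border, which is what a 3x3 filter needs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns a (width + 2) x (height + 2) copy of a
// row-major image in which the outermost pixels are repeated once on every side.
std::vector<std::uint8_t> padReplicate(const std::vector<std::uint8_t>& image,
                                       int width, int height)
{
    assert(width > 0 && height > 0);
    assert(image.size() == static_cast<std::size_t>(width) * height);
    const int paddedWidth = width + 2;
    const int paddedHeight = height + 2;
    std::vector<std::uint8_t> padded(static_cast<std::size_t>(paddedWidth) * paddedHeight);
    for (int y = 0; y < paddedHeight; ++y) {
        // Clamp the source row so the border rows repeat the first and last rows.
        const int srcY = std::min(std::max(y - 1, 0), height - 1);
        for (int x = 0; x < paddedWidth; ++x) {
            // Clamp the source column in the same way.
            const int srcX = std::min(std::max(x - 1, 0), width - 1);
            padded[y * paddedWidth + x] = image[srcY * width + srcX];
        }
    }
    return padded;
}
```

Filtering the padded image then yields a valid output region with the same dimensions as the original image.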

Implementation Specifics

We implement Sobel filtering on a single 8-bit channel for simplicity. To do Sobel filtering on RGB images you can run the Sobel filter on each channel separately and then combine the results.

In this sample we take an RGB image, convert it to an 8-bit luminance image and send it to the GPU.
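The conversion code is not shown in this tutorial. As a rough sketch only, a luminance conversion could look like the following, assuming interleaved R, G, B bytes and integer approximations of the Rec. 601 luma weights (0.299, 0.587, 0.114). The helper name, the byte order and the weights are all assumptions and may differ from what the sample does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: converts interleaved 8-bit RGB data to 8-bit luminance.
// The weights 77, 150 and 29 sum to 256, so the result always fits in 8 bits.
std::vector<std::uint8_t> rgbToLuminance(const std::vector<std::uint8_t>& rgb)
{
    assert(rgb.size() % 3 == 0);
    std::vector<std::uint8_t> luminance(rgb.size() / 3);
    for (std::size_t i = 0; i < luminance.size(); ++i) {
        const unsigned r = rgb[3 * i + 0];
        const unsigned g = rgb[3 * i + 1];
        const unsigned b = rgb[3 * i + 2];
        // Fixed-point weighted sum, then divide by 256.
        luminance[i] = static_cast<std::uint8_t>((77u * r + 150u * g + 29u * b) >> 8);
    }
    return luminance;
}
```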

Each Sobel calculation gives an output for the centre pixel of the mask. The output value of the centre pixel is the sum of the pixel values in the 3x3 grid around the pixel, each multiplied by the corresponding coefficient of the Sobel mask. This can be split into three stages by doing the multiplications and summations for each row of the grid separately (the approach used in this sample).
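The row-by-row split can be sketched for a single pixel as follows. Each row of the dX mask is a multiple of (-1, 0, 1) and each row of the dY mask is a multiple of (1, 2, 1), so every row contributes one scaled difference and one scaled sum. The names `Gradient` and `sobelAt` are hypothetical and the code is a scalar illustration, not the kernel itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Gradient { int dx; int dy; };

// Hypothetical helper: computes dX and dY at interior pixel (x, y) of a
// row-major image by accumulating the contribution of one row at a time.
Gradient sobelAt(const std::vector<std::uint8_t>& image, int width, int x, int y)
{
    assert(x >= 1 && y >= 1 && x + 1 < width);
    assert(static_cast<std::size_t>((y + 1) * width + x + 1) < image.size());
    // Per-row multipliers read off the two masks (top, middle, bottom).
    static const int dxScale[3] = {1, 2, 1};
    static const int dyScale[3] = {1, 0, -1};
    Gradient g = {0, 0};
    for (int r = 0; r < 3; ++r) {
        const std::uint8_t* rowPtr = &image[(y - 1 + r) * width + (x - 1)];
        const int left = rowPtr[0];
        const int middle = rowPtr[1];
        const int right = rowPtr[2];
        g.dx += dxScale[r] * (right - left);
        g.dy += dyScale[r] * (left + 2 * middle + right);
    }
    return g;
}
```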

The Code

Note
We consider vectorization the main optimisation technique used in this tutorial. To show the importance of using vectors, an OpenCL implementation of the Sobel filter that doesn't use vectors is included in the sobel_no_vectors sample (sobel_no_vectors.cpp and sobel_no_vectors.cl). The vectorized version is much faster on Mali-T600 series GPUs.

Unless otherwise noted, all code snippets come from the OpenCL kernel found in sobel.cl.

  1. Choosing the size of the kernel

    We are using vector types in the kernel and so each instance of the kernel actually outputs 16 results. See below for more details of vectorising. The kernel applies the 3x3 Sobel filter to an 18x3 window in the input image to produce 16x1 results in each of the two output images, which represent the dx and dy components of the gradient. We adjust the pointers into the data to reflect this:

    /*
    * Each kernel calculates 16 output pixels in the same row (hence the '* 16').
    * column is in the range [0, width] in steps of 16.
    * row is in the range [0, height].
    */
    const int column = get_global_id(0) * 16;
    const int row = get_global_id(1) * 1;
    /* Offset calculates the position in the linear data for the row and the column. */
    const int offset = row * width + column;

    And when we enqueue the kernel in sobel.cpp, we reduce the worksize accordingly:

    /*
    * Each instance of the kernel operates on a 16 * 1 portion of the image.
    * Therefore, the global work size must be width / 16 by height / 1 work items.
    */
    size_t globalWorksize[2] = {width / 16, height / 1};

  2. Loading the input data

    Mali-T600 series GPUs have 128-bit vector registers and can do arithmetic on vector types. Therefore, we use OpenCL vectors to make more efficient use of the hardware, leading to higher performance.

    Here we do vector loads from one row of the data:

    /*
    * First row of input.
    * In a scalar Sobel calculation you would load 1 value for leftLoad, middleLoad and rightLoad.
    * In the vector case we load 16 values for each.
    * leftLoad, middleLoad and rightLoad load 16-bytes of data from the first row.
    * The data they load overlaps. e.g. for the first column and row, leftLoad is 0->15, middleLoad is 1->16 and rightLoad is 2->17.
    * So we're actually loading 18-bytes of data from the first row.
    */
    uchar16 leftLoad = vload16(0, inputImage + (offset + 0));
    uchar16 middleLoad = vload16(0, inputImage + (offset + 1));
    uchar16 rightLoad = vload16(0, inputImage + (offset + 2));

  3. Converting the data

    On a Mali-T600 series GPU, expanding and contracting data types is a free operation. Here, we convert the data from 8-bits per pixel to 16-bits per pixel:

    /*
    * Convert the data from unsigned chars to shorts (8-bit unsigned to 16-bit signed).
    * The calculations can overflow 8-bits so we require larger intermediate storage.
    * Additionally, the values can become negative so we need a signed type.
    */
    short16 leftData = convert_short16(leftLoad);
    short16 middleData = convert_short16(middleLoad);
    short16 rightData = convert_short16(rightLoad);

  4. Doing the calculation

    Then we carry out the calculation on 16 pixels. Each vector calculation can be done as a single operation on a Mali-T600 series GPU:

    /*
    * Calculate the results for the first row.
    * Looking at the Sobel masks above for the first line of input,
    * the dX calculation is the sum of -1 * leftData, 0 * middleData, and 1 * rightData.
    * The dY calculation is the sum of 1 * leftData, 2 * middleData, and 1 * rightData.
    * This is what is being calculated below, except we have removed the
    * unnecessary calculations (multiplications by 1 or 0) and we are calculating 16 values at once.
    * This pattern repeats for the other 2 rows of data.
    */
    short16 dx = rightData - leftData;
    short16 dy = rightData + leftData + middleData * (short)2;

    We do this for the other two rows, accumulating the results in dx and dy.

  5. Storing the results

    Finally we contract and store the data. We use a vector store to write out all 16 results at once:

    /*
    * Store the results.
    * The range of outputs from our Sobel calculations is [-1020, 1020].
    * In order to output this as an 8-bit signed char we must divide it by 8 (shift right by 3 bits).
    * This gives the range [-128, 127]. Depending on what type of output you require,
    * (signed/unsigned, separate/combined gradients) it is possible to do more of the calculations on the GPU using OpenCL.
    * In this sample we're assuming that the application requires signed uncombined gradient outputs.
    */
    vstore16(convert_char16(dx >> 3), 0, outputImageDX + offset + width + 1);
    vstore16(convert_char16(dy >> 3), 0, outputImageDY + offset + width + 1);

    Because the data is returned as two separate gradient images we combine them on the CPU before writing them out as a bitmap.
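To make the five steps easier to follow outside OpenCL, the sketch below emulates a single work-item on the CPU in plain C++. It is an illustration under stated assumptions, not the contents of sobel.cl: the function name and signature are hypothetical, the 16 lanes are processed in a loop rather than as vectors, and the weights for the second and third rows are inferred from the Sobel masks because only the first row of the kernel is shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU stand-in for one work-item of the vectorised kernel.
void sobelWorkItem(const std::vector<std::uint8_t>& inputImage, int width,
                   int globalIdX, int globalIdY,
                   std::vector<std::int8_t>& outputImageDX,
                   std::vector<std::int8_t>& outputImageDY)
{
    // Step 1: each work-item covers 16 output pixels in one row.
    const int column = globalIdX * 16;
    const int row = globalIdY;
    const int offset = row * width + column;

    // The overlapping loads touch 18 bytes on each of three rows.
    assert(static_cast<std::size_t>(offset + 2 * width + 18) <= inputImage.size());
    assert(static_cast<std::size_t>(offset + width + 1 + 16) <= outputImageDX.size());
    assert(outputImageDX.size() == outputImageDY.size());

    // Per-row weights inferred from the dX and dY masks (top, middle, bottom).
    static const int dxScale[3] = {1, 2, 1};
    static const int dyScale[3] = {1, 0, -1};

    for (int lane = 0; lane < 16; ++lane) {
        // Steps 2 to 4: load, widen to signed 16-bit, accumulate over three rows.
        std::int16_t dx = 0;
        std::int16_t dy = 0;
        for (int r = 0; r < 3; ++r) {
            const std::uint8_t* rowPtr = inputImage.data() + offset + r * width + lane;
            const std::int16_t left = rowPtr[0];
            const std::int16_t middle = rowPtr[1];
            const std::int16_t right = rowPtr[2];
            dx = static_cast<std::int16_t>(dx + dxScale[r] * (right - left));
            dy = static_cast<std::int16_t>(dy + dyScale[r] * (left + 2 * middle + right));
        }
        // Step 5: shift [-1020, 1020] into signed 8-bit range and store at the
        // centre pixel of the window. An arithmetic right shift is assumed
        // (guaranteed from C++20, and the behaviour of mainstream compilers before that).
        outputImageDX[offset + width + 1 + lane] = static_cast<std::int8_t>(dx >> 3);
        outputImageDY[offset + width + 1 + lane] = static_cast<std::int8_t>(dy >> 3);
    }
}
```

Calling this function for every pair of IDs in the global work size mimics what the GPU does, except that the GPU handles the 16 lanes of each work-item with vector operations and runs many work-items at once.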

Running the Sample

  1. From a command prompt in the root of the SDK, run:

    cd samples/sobel
    make install

    This compiles the Sobel sample code and copies all the files it needs to run to the bin folder in the root directory of the SDK.

  2. Copy this folder to the board.
  3. Navigate to the folder on the board and run the Sobel binary:

  4. You should see output similar to:

    Profiling information:
    Queued time: 0.067ms
    Wait time: 0.057835ms
    Run time: 0.164923ms

    An output image called output.bmp should be created on the board.

Find solutions for Common Issues.

More Information

For more information have a look at the code in sobel.cpp and sobel.cl.