Mali OpenCL SDK v1.1.0
FIR Filter

Floating-point FIR (Finite Impulse Response) image filter for pixelization and/or noise reduction.

Example Result

Input image (fir_float_input.bmp)
Output image (fir_float_output.bmp)

The Algorithm

An FIR filter calculates each output value as a weighted average of a finite number of neighbouring input values.

To simplify the concept of FIR filtering, consider the one-dimensional signal [17 76 17 84 29], to which we will apply a 3x1 filter with the coefficients [3/15 9/15 3/15]. To keep the output values within the range of the input values, the coefficient numerators should sum to at most the common coefficient denominator (here 3 + 9 + 3 = 15).

  1. The filter is overlaid on top of the start of the signal and "reads" the values [17 76 17].
  2. The output is the sum of the signal-values multiplied with the filter coefficients.

    [17 76 17] → [(17 * 3/15) + (76 * 9/15) + (17 * 3/15)] → [52.4] → [52]. (The result is truncated to an integer for simplicity.)

  3. The "kernel" (i.e. the filter as applied to the signal, sometimes also known as the "window") is then slid one step to the right where it reads [76 17 84].
  4. The output is now: [(76 * 3/15) + (17 * 9/15) + (84 * 3/15)] → [42.2] → [42].
  5. Next it reads [17 84 29] → [59.6] → [59]
  6. Since we can no longer slide the kernel along the signal without going out of bounds we stop there and the full output signal is [52 42 59].
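
The steps above can be sketched in plain C. This is a minimal illustration rather than code from the sample: the function name fir3_1d and its parameters are hypothetical, and it uses integer numerators over a common denominator so that the truncation matches the worked example.

```c
#include <stddef.h>

/*
 * Apply a 3-tap FIR filter to `signal` (length n) without padding.
 * The coefficients are integer numerators over a common denominator,
 * and each result is truncated by integer division.
 * Writes n - 2 values to `out` and returns that count (0 if n < 3).
 * NOTE: fir3_1d is a hypothetical helper, not part of the SDK sample.
 */
size_t fir3_1d(const int *signal, size_t n,
               const int numerators[3], int denominator, int *out)
{
    if (n < 3)
        return 0;

    for (size_t i = 0; i + 2 < n; ++i) {
        /* Weighted sum of the three values currently under the window. */
        int sum = signal[i]     * numerators[0]
                + signal[i + 1] * numerators[1]
                + signal[i + 2] * numerators[2];
        out[i] = sum / denominator; /* truncates, e.g. 786 / 15 -> 52 */
    }
    return n - 2;
}
```

Called with the signal [17 76 17 84 29], the numerators [3 9 3] and the denominator 15, this produces the three values [52 42 59] derived above.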

Implementation

Image Size

We have included a 512x512 input bitmap for use with this sample (to keep the size of the installer small). However, you are more likely to see performance improvements (when compared to C code running on a CPU) when larger images are used. There is some start-up overhead associated with using OpenCL. This overhead can outweigh the benefits of parallel processing when the input data sizes are small.

This sample has been coded to allow any input bitmap to be used. Simply replace input.bmp in the assets directory of the sample with the input image of your choice.

Padding

It is important to note that we have not considered padding here. The output image is two pixels smaller in both dimensions. Because every output requires pixels around it, it is impossible to calculate the output for the edge pixels. In this example we are simply leaving the edge output pixels as the values they are initialized to.

Sometimes it is desirable for the output signal to be the same size as the input signal, in which case "padding" must be applied to the input to account for the fact that applying the filter, by its nature, reduces the size. Strategies for padding differ, but for images, a common choice is to repeat the boundary values (i.e. the outermost set of pixels) on all sides or (in some cases) on just some sides.
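
As an illustration of the boundary-repeating strategy, the following sketch pads an image by one pixel on every side, which is what a 3x3 filter would need to keep the output the same size as the input. It is not part of the sample (the sample applies no padding), and the names pad_replicate and clamp_index are hypothetical.

```c
#include <stddef.h>

/* Clamp index i into the range [0, n - 1]. */
static size_t clamp_index(long i, size_t n)
{
    if (i < 0)
        return 0;
    if ((size_t)i >= n)
        return n - 1;
    return (size_t)i;
}

/*
 * Copy a width x height row-major image into a (width + 2) x (height + 2)
 * buffer, repeating the outermost pixels on all four sides.
 * NOTE: hypothetical helper; the SDK sample does not pad its input.
 */
void pad_replicate(const float *src, size_t width, size_t height, float *dst)
{
    const size_t padded_width = width + 2;

    for (size_t y = 0; y < height + 2; ++y) {
        /* Source row for this padded row, clamped to the image. */
        const size_t sy = clamp_index((long)y - 1, height);
        for (size_t x = 0; x < padded_width; ++x) {
            const size_t sx = clamp_index((long)x - 1, width);
            dst[y * padded_width + x] = src[sy * width + sx];
        }
    }
}
```

Filtering the padded buffer with a 3x3 filter then yields exactly width x height output pixels.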

Mali-T600 Series Hardware

Mali-T600 series GPU pipelines provide true IEEE-754 single-precision floating-point math in hardware. We recommend using 128-bit wide vectors. For more information about vectorization, see Vectorizing your OpenCL code.

In this sample, the calculations use 32-bit floating-point numbers. One 128-bit vector holds four 32-bit floating-point numbers. Therefore, using float4 vectors makes maximum use of the hardware.

We recommend the use of vectors wherever possible when using a Mali-T600 series GPU.

Implementation Specifics

We implement FIR filtering on a single 8-bit channel for simplicity. To do FIR filtering on RGB images you can run the FIR filter on each channel separately and then combine the results. In this sample we take an RGB image, convert it to an 8-bit luminance image and send it to the GPU.
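
The luminance conversion could look roughly like the sketch below. The weights used by the sample are not stated here, so this assumes the common Rec. 601 weights (0.299, 0.587, 0.114) and an interleaved RGB layout; the function name rgb_to_luminance is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert interleaved 8-bit RGB pixels to 8-bit luminance.
 * ASSUMPTION: Rec. 601 weights (0.299, 0.587, 0.114) and R,G,B byte
 * order; the conversion used by the SDK sample may differ.
 */
void rgb_to_luminance(const uint8_t *rgb, size_t pixel_count, uint8_t *luma)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        const unsigned r = rgb[3 * i];
        const unsigned g = rgb[3 * i + 1];
        const unsigned b = rgb[3 * i + 2];
        /* Integer arithmetic in thousandths, rounded to nearest. */
        luma[i] = (uint8_t)((299u * r + 587u * g + 114u * b + 500u) / 1000u);
    }
}
```

Because the three weights sum to 1, the result always stays within the 8-bit range.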

Each FIR calculation gives an output for the centre pixel of the mask. The output value of the centre pixel is the sum of the pixel values in a 3x3 grid around the pixel, multiplied by the coefficients. This can be split into three stages by doing the summations for each row of the grid separately. This sample does the calculations one row at a time. However, instead of operating on one pixel at a time, it uses vectors of 4 pixels at once.

The Filter

The kernel applies a 3x3 FIR filter with constant coefficients (weights) to a 6x3 window in an input luminance image to produce 4x1 windows in the output image.

The input and output images are represented using arrays of floating point numbers.

The FIR coefficients are defined as constant floats and organized as follows:

FW_UL FW_UM FW_UR
FW_CL FW_CM FW_CR
FW_BL FW_BM FW_BR

In a real application, these values can be derived in a number of different ways depending on the intended result. For this sample, the values are arbitrary.

Our common coefficient denominator is 256, so for 8-bit input values the output pixel also fits within 8 bits. As explained in The Algorithm section, the sum of the coefficient numerators must not exceed the denominator, in this case 256.

The coefficients being used are:

[30 5  6 ]
[19 30 9 ] / 256
[15 5  40]
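
To make the coefficients concrete, here is a scalar C sketch that applies them to a single 3x3 window. It is a reference illustration rather than code from the sample: the array FW and the functions fir_coefficient_sum and fir3x3_reference are hypothetical names, and the window is addressed by its top-left pixel.

```c
/* Coefficients from the matrix above, each divided by 256. */
static const float FW[3][3] = {
    {30.0f / 256.0f,  5.0f / 256.0f,  6.0f / 256.0f}, /* FW_UL FW_UM FW_UR */
    {19.0f / 256.0f, 30.0f / 256.0f,  9.0f / 256.0f}, /* FW_CL FW_CM FW_CR */
    {15.0f / 256.0f,  5.0f / 256.0f, 40.0f / 256.0f}, /* FW_BL FW_BM FW_BR */
};

/* Sum of the nine coefficients (the numerators add up to 159). */
float fir_coefficient_sum(void)
{
    float sum = 0.0f;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            sum += FW[r][c];
    return sum;
}

/*
 * Weighted sum of the 3x3 window whose top-left pixel is (row, column)
 * in a row-major image that is `width` floats wide.
 * The caller must ensure row + 2 and column + 2 lie inside the image.
 */
float fir3x3_reference(const float *input, int width, int row, int column)
{
    float sum = 0.0f;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            sum += input[(row + r) * width + (column + c)] * FW[r][c];
    return sum;
}
```

The numerators sum to 159, which is below 256, so a window of maximum-valued pixels (255) produces an output of about 158 and cannot overflow the 8-bit range.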

The Code

Unless otherwise noted, all code snippets come from the OpenCL kernel found in fir_float.cl.

  1. Choosing the size of the kernel

    We are using vector types in the kernel, so each kernel instance (work-item) actually outputs 4 results. See above for more details on vectorising. We adjust the pointers into the data to reflect this:

    /*
    * Each kernel calculates 4 output pixels in the same row (hence the '* 4').
    * column is in the range [0, width) in steps of 4.
    * row is in the range [0, height).
    */
    const int column = get_global_id(0) * 4;
    const int row = get_global_id(1);
    /* Offset calculates the position in the linear data for the row and the column. */
    const int offset = row * width + column;

    And when we enqueue the kernel in fir_float.cpp, we reduce the worksize accordingly:

    /*
    * Each instance of the kernel operates on a 4 * 1 portion of the image.
    * Therefore, the global work size must be width / 4 by height / 1 work items.
    */
    size_t globalWorksize[2] = {width / 4, height / 1};

  2. Loading the input data

    Here we do vector loads from one row of the data:

    /*
    * Access the first row in the 6x3 window to apply FW_U coefficients.
    * data1 can be constructed from the other vectors without doing an additional load.
    */
    float4 data0 = vload4(0, input + offset);
    float4 data2 = vload4(0, input + offset + 2);
    float4 data1 = (float4)(data0.s12, data2.s12);

  3. Applying the filter

    Then we carry out the calculation on 4 pixels. Each vector calculation can be done as a single operation on a Mali-T600 series GPU:

    accumulator += data0 * FW_UL;
    accumulator += data1 * FW_UM;
    accumulator += data2 * FW_UR;

    We apply the same pattern to the second and third row, to accumulate the result in the accumulator variable:

    /* Access the second row in the 6x3 window and repeat the process, but with FW_C coefficients. */
    data0 = vload4(0, input + offset + width);
    data2 = vload4(0, input + offset + width + 2);
    data1 = (float4)(data0.s12, data2.s12);
    accumulator += data0 * FW_CL;
    accumulator += data1 * FW_CM;
    accumulator += data2 * FW_CR;
    /* Access the third row in the 6x3 window and repeat the process, but with FW_B coefficients. */
    data0 = vload4(0, input + offset + width * 2);
    data2 = vload4(0, input + offset + width * 2 + 2);
    data1 = (float4)(data0.s12, data2.s12);
    accumulator += data0 * FW_BL;
    accumulator += data1 * FW_BM;
    accumulator += data2 * FW_BR;

  4. Storing the result

    Finally, we store the data. We use a vector store to write out the 4 results at once:

    /* Store the accumulator. */
    vstore4(accumulator, 0, output + offset);
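
To see how the four steps fit together, the following plain C sketch emulates one work-item of the kernel on the CPU. It is an approximation for study, not the kernel itself: the vec4 struct and load4 stand in for float4 and vload4, the names accumulate_row, fir_work_item and fir_run are hypothetical, and the accumulator is assumed to start at zero. The driver loop only runs work-items whose 6x3 window lies entirely inside the image; how the sample handles the remaining work-items is not covered here.

```c
/* Coefficients from "The Filter" section, each divided by 256. */
#define FW_UL (30.0f / 256.0f)
#define FW_UM ( 5.0f / 256.0f)
#define FW_UR ( 6.0f / 256.0f)
#define FW_CL (19.0f / 256.0f)
#define FW_CM (30.0f / 256.0f)
#define FW_CR ( 9.0f / 256.0f)
#define FW_BL (15.0f / 256.0f)
#define FW_BM ( 5.0f / 256.0f)
#define FW_BR (40.0f / 256.0f)

/* Four floats standing in for the OpenCL float4 type. */
typedef struct { float s[4]; } vec4;

/* Plain C stand-in for vload4(0, p): read four consecutive floats. */
static vec4 load4(const float *p)
{
    vec4 v = {{p[0], p[1], p[2], p[3]}};
    return v;
}

/* Accumulate one 6-pixel row of the window with its three coefficients. */
static void accumulate_row(vec4 *acc, const float *row_start,
                           float left, float middle, float right)
{
    const vec4 data0 = load4(row_start);     /* pixels 0..3 */
    const vec4 data2 = load4(row_start + 2); /* pixels 2..5 */
    /* Pixels 1..4, built like (float4)(data0.s12, data2.s12). */
    const vec4 data1 = {{data0.s[1], data0.s[2], data2.s[1], data2.s[2]}};

    for (int i = 0; i < 4; ++i) {
        acc->s[i] += data0.s[i] * left;
        acc->s[i] += data1.s[i] * middle;
        acc->s[i] += data2.s[i] * right;
    }
}

/* Emulate the work-item with global IDs (gid0, gid1). */
void fir_work_item(const float *input, float *output, int width,
                   int gid0, int gid1)
{
    const int column = gid0 * 4;
    const int row = gid1;
    const int offset = row * width + column;
    vec4 accumulator = {{0.0f, 0.0f, 0.0f, 0.0f}}; /* ASSUMPTION: zero start */

    accumulate_row(&accumulator, input + offset, FW_UL, FW_UM, FW_UR);
    accumulate_row(&accumulator, input + offset + width, FW_CL, FW_CM, FW_CR);
    accumulate_row(&accumulator, input + offset + width * 2, FW_BL, FW_BM, FW_BR);

    /* Stand-in for vstore4(accumulator, 0, output + offset). */
    for (int i = 0; i < 4; ++i)
        output[offset + i] = accumulator.s[i];
}

/* Run only the work-items whose 6x3 window fits inside the image. */
void fir_run(const float *input, float *output, int width, int height)
{
    for (int gid1 = 0; gid1 + 2 < height; ++gid1)
        for (int gid0 = 0; gid0 * 4 + 5 < width; ++gid0)
            fir_work_item(input, output, width, gid0, gid1);
}
```

Note that each result is written at the offset of the window's top-left pixel, mirroring the vstore4 call above. Comparing the output of such an emulation against the GPU result is one way to check a modified kernel.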

Running the Sample

  1. From a command prompt in the root of the SDK, run:

    cd samples/fir_float
    make install

    This compiles the FIR float sample code and copies all the files it needs to run to the bin folder in the root directory of the SDK.

  2. Copy this folder to the board.
  3. Navigate to the folder on the board and run the FIR float binary:

  4. You should see output similar to:

    Profiling information:
    Queued time: 0.057ms
    Wait time: 0.071017ms
    Run time: 0.475355ms

    An output image should be created on the board called output.bmp.

Find solutions for Common Issues.

More Information

For more information have a look at the code in fir_float.cpp and fir_float.cl.