Mali OpenCL SDK v1.1.0
SGEMM

SGEMM (Single-Precision General Matrix Multiplication) OpenCL sample.

Example Result

By adding these two lines at the beginning of the sgemmInitialize function in sgemm.cpp, we can generate the same random numbers as in this example:

const unsigned seed = 11;
srand(seed);

Input matrix A:

[0.852691    0.004421  -0.103067 -0.191788]
[-0.23658    0.0336409  0.15781   0.582199]
[-0.0814268 -0.857794  -0.63804  -0.0184786]
[0.793476    0.459307   0.955647 -0.306809]

Input matrix B:

[0.0529994  0.507535 -0.55821  -0.849519]
[-0.929501  0.914186  0.464341 -0.652125]
[0.409218  -0.125776 -0.273086  0.731335]
[-0.371732  0.43648  -0.8001    0.233541]

Input matrix C:

[-0.380438 -0.188046  0.665832  -0.503661]
[-0.262456 -0.278552 -0.5179    -0.965873]
[0.459781   0.720241 -0.22676   -0.719225]
[-0.277435 -0.126954 -0.0564545 -0.142268]

Output matrix C:

[0.0321557  0.347259 -0.225749 -0.897793]
[-0.221897  0.117096 -0.413021  0.333833]
[0.584754  -0.681301 -0.186507  0.0857027]
[0.0925018  0.5558   -0.250792 -0.360579]

The Algorithm

This sample performs one matrix multiplication and one matrix addition in single precision, combined in the expression:

C = αAB + βC

Written in terms of matrix components, this is:

C_ij = α ∑_k A_ik B_kj + β C_ij

where A, B and C are matrices of size 1024 x 1024, and α and β are scalar constants.
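
As a point of comparison for the kernel described below, the component formula can be written as a plain triple loop on the CPU. This is only a minimal sketch, not code from the SDK: the function name sgemmReference and the row-major std::vector storage are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not part of the SDK):
// computes C = alpha * A * B + beta * C for square, row-major matrices of order n.
void sgemmReference(std::size_t n, float alpha,
                    const std::vector<float>& a, const std::vector<float>& b,
                    float beta, std::vector<float>& c)
{
    // All three matrices are assumed to hold exactly n * n elements.
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);

    for (std::size_t i = 0; i < n; ++i)
    {
        for (std::size_t j = 0; j < n; ++j)
        {
            // Dot product of row i of A with column j of B.
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
            {
                sum += a[i * n + k] * b[k * n + j];
            }
            // Scale the product and add the scaled previous value of C.
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}
```

A routine like this can be useful for checking the GPU result on small matrices, allowing for small differences caused by the order of the floating-point additions.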

Implementation

Unless otherwise noted, all code snippets come from the OpenCL kernel found in sgemm.cl.

This kernel vectorizes the multiply-accumulate over a row of matrixA and a column of matrixB, processing 4 floats at a time, and each work-item computes one element of the resulting matrix C following the equation:

matrixC = alpha * (matrixA * matrixB) + beta * matrixC
  1. Choosing the size of the kernel

    As we are working with square matrices, we set the globalWorksize to the matrix order in both dimensions, so each work-item works on one full row of matrixA and one full column of matrixB.

    /*
    * Each kernel outputs one element in matrixC,
    * therefore the total number of work items must be the number of elements (matrixOrder * matrixOrder).
    * To accomplish this we use a global worksize split into 2 dimensions both of matrixOrder size.
    */
    size_t globalWorksize[2] = {matrixOrder, matrixOrder};

    The kernel then retrieves the row and column indices of its output element with get_global_id.

    const int i = get_global_id(1);
    const int j = get_global_id(0);
    float4 sum = (float4)0.0f;
    float4 matrixBColumn;
  2. Loading the input data and doing the calculation

    Based on the previous description, each work-item will work on one row of matrixA and one column of matrixB to compute the value of one element in the result matrixC.

    For the first part of the for loop, values from a column of matrixB are loaded in sets of 4 floats:

    /*
    * Load 4 values from a column of data from matrixB, and 4 values from a row in matrixA,
    * then multiply them together. Repeat until all values in the column/row have been multiplied.
    * We only want the sum of the calculation so we can add the result of each calculation to the last.
    */
    for (int k = 0; k < matrixOrder; k+=4)
    {
        matrixBColumn.x = matrixB[bOffset];
        bOffset += matrixOrder;
        matrixBColumn.y = matrixB[bOffset];
        bOffset += matrixOrder;
        matrixBColumn.z = matrixB[bOffset];
        bOffset += matrixOrder;
        matrixBColumn.w = matrixB[bOffset];
        bOffset += matrixOrder;

    The sum variable accumulates the element-wise product of the 4 values just loaded from a column of matrixB and the next 4 values in a row of matrixA, which are loaded with vload4 as part of the multiplication.

    The matrixA pointer is then advanced by 4 values at the end of each loop iteration.

        sum += vload4(0, matrixA) * matrixBColumn;
        matrixA += 4;
    }

    The for loop repeats until a full row of matrixA and a full column of matrixB have been processed.

    We are using vector types in the kernel because Mali-T600 series GPUs have 128-bit vector registers and can do arithmetic on vector types. Therefore, we use OpenCL vectors to make more efficient use of the hardware, leading to higher performance.

  3. Storing the result

    Lastly we do the final calculation and store the result in matrixC:

    /*
    * Sum the 4 results to get the single output of multiplying a row of matrixA by a column of matrixB.
    * Then carry out the final calculation.
    */
    matrixC[i * matrixOrder + j] = alpha * (sum.x + sum.y + sum.z + sum.w) + beta * matrixC[i * matrixOrder + j];
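
The three steps above can be emulated on the CPU to see how the pieces fit together. The following is a hedged sketch rather than the SDK's code: Float4, sgemmWorkItem and sgemmEmulated are hypothetical names, the starting values of the matrixA pointer and of bOffset are not shown in the snippets and are assumed here to be the start of row i and the top of column j, and matrixOrder is assumed to be a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for OpenCL's float4 so the sketch builds as plain C++.
struct Float4
{
    float x, y, z, w;
};

// Emulates one work-item: i plays the role of get_global_id(1) (row of matrixC)
// and j the role of get_global_id(0) (column of matrixC).
void sgemmWorkItem(std::size_t i, std::size_t j, std::size_t matrixOrder,
                   float alpha, float beta,
                   const float* matrixA, const float* matrixB, float* matrixC)
{
    // Assumption: the kernel starts at the beginning of row i of matrixA
    // and at the top of column j of matrixB.
    const float* rowA = matrixA + i * matrixOrder;
    std::size_t bOffset = j;

    Float4 sum = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t k = 0; k < matrixOrder; k += 4)
    {
        // Gather 4 consecutive values of column j (stride of matrixOrder).
        Float4 matrixBColumn;
        matrixBColumn.x = matrixB[bOffset];
        bOffset += matrixOrder;
        matrixBColumn.y = matrixB[bOffset];
        bOffset += matrixOrder;
        matrixBColumn.z = matrixB[bOffset];
        bOffset += matrixOrder;
        matrixBColumn.w = matrixB[bOffset];
        bOffset += matrixOrder;

        // Component-wise equivalent of: sum += vload4(0, matrixA) * matrixBColumn
        sum.x += rowA[0] * matrixBColumn.x;
        sum.y += rowA[1] * matrixBColumn.y;
        sum.z += rowA[2] * matrixBColumn.z;
        sum.w += rowA[3] * matrixBColumn.w;
        rowA += 4;
    }

    // Reduce the 4 partial sums, then apply alpha and beta.
    const std::size_t index = i * matrixOrder + j;
    matrixC[index] = alpha * (sum.x + sum.y + sum.z + sum.w) + beta * matrixC[index];
}

// Emulates the 2D global work size {matrixOrder, matrixOrder}: one call per element of matrixC.
void sgemmEmulated(std::size_t matrixOrder, float alpha, float beta,
                   const std::vector<float>& matrixA, const std::vector<float>& matrixB,
                   std::vector<float>& matrixC)
{
    assert(matrixOrder % 4 == 0); // the inner loop consumes 4 values per iteration
    for (std::size_t i = 0; i < matrixOrder; ++i)
    {
        for (std::size_t j = 0; j < matrixOrder; ++j)
        {
            sgemmWorkItem(i, j, matrixOrder, alpha, beta,
                          matrixA.data(), matrixB.data(), matrixC.data());
        }
    }
}
```

On the GPU the two outer loops do not exist; the runtime launches one work-item per (i, j) pair, and the four component-wise multiply-adds map onto a single vector operation.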

Running the Sample

  1. From a command prompt in the root of the SDK, run:

    cd samples/sgemm
    make install

    This compiles the SGEMM sample code and copies all the files it needs to run to the bin folder in the root directory of the SDK.

  2. Copy this folder to the board.
  3. Navigate to the folder on the board and run the SGEMM binary:

  4. You should see output similar to:

    Profiling information:
    Queued time: 0.094ms
    Wait time: 0.074073ms
    Run time: 1316.41ms
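
The three figures can be related to the standard OpenCL event timestamps. The sketch below assumes, without confirmation from the sample source, that "Queued time" is the interval from CL_PROFILING_COMMAND_QUEUED to CL_PROFILING_COMMAND_SUBMIT, "Wait time" from CL_PROFILING_COMMAND_SUBMIT to CL_PROFILING_COMMAND_START, and "Run time" from CL_PROFILING_COMMAND_START to CL_PROFILING_COMMAND_END. The struct and function names are hypothetical; the timestamps are taken as plain nanosecond values (as clGetEventProfilingInfo returns them) so the code builds without the OpenCL headers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Nanosecond timestamps of one kernel event (hypothetical container type).
struct EventTimestamps
{
    std::uint64_t queued;    // CL_PROFILING_COMMAND_QUEUED
    std::uint64_t submitted; // CL_PROFILING_COMMAND_SUBMIT
    std::uint64_t started;   // CL_PROFILING_COMMAND_START
    std::uint64_t ended;     // CL_PROFILING_COMMAND_END
};

// The three intervals, in milliseconds (hypothetical container type).
struct ProfilingMs
{
    double queuedMs;
    double waitMs;
    double runMs;
};

// Converts the raw timestamps into the three intervals (assumed definitions, see above).
ProfilingMs toMilliseconds(const EventTimestamps& t)
{
    // The timestamps are expected to be in chronological order.
    assert(t.queued <= t.submitted && t.submitted <= t.started && t.started <= t.ended);

    const double nsPerMs = 1.0e6;
    ProfilingMs result;
    result.queuedMs = static_cast<double>(t.submitted - t.queued) / nsPerMs;
    result.waitMs = static_cast<double>(t.started - t.submitted) / nsPerMs;
    result.runMs = static_cast<double>(t.ended - t.started) / nsPerMs;
    return result;
}

// Prints the intervals using the same labels as the sample output.
void printProfilingMs(const ProfilingMs& p)
{
    std::printf("Profiling information:\n");
    std::printf("Queued time: %gms\n", p.queuedMs);
    std::printf("Wait time: %gms\n", p.waitMs);
    std::printf("Run time: %gms\n", p.runMs);
}
```

Timestamps of this kind are only available when the command queue was created with profiling enabled (CL_QUEUE_PROFILING_ENABLE).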

Find solutions for Common Issues.

More Information

For more information, have a look at the code in sgemm.cpp and sgemm.cl.