Mali OpenCL SDK v1.1.0
SGEMM (Single-Precision General Matrix Multiplication) OpenCL sample.
By adding these two lines at the beginning of the sgemmInitialize function in sgemm.cpp, we can generate the same random numbers as in this example:
Input matrix A:
[ 0.852691   0.004421  -0.103067  -0.191788 ]
[-0.23658    0.0336409  0.15781    0.582199 ]
[-0.0814268 -0.857794  -0.63804   -0.0184786]
[ 0.793476   0.459307   0.955647  -0.306809 ]
Input matrix B:
[ 0.0529994  0.507535  -0.55821   -0.849519]
[-0.929501   0.914186   0.464341  -0.652125]
[ 0.409218  -0.125776  -0.273086   0.731335]
[-0.371732   0.43648   -0.8001     0.233541]
Input matrix C:
[-0.380438  -0.188046   0.665832  -0.503661]
[-0.262456  -0.278552  -0.5179    -0.965873]
[ 0.459781   0.720241  -0.22676   -0.719225]
[-0.277435  -0.126954  -0.0564545 -0.142268]
Output matrix C:
[ 0.0321557  0.347259  -0.225749  -0.897793 ]
[-0.221897   0.117096  -0.413021   0.333833 ]
[ 0.584754  -0.681301  -0.186507   0.0857027]
[ 0.0925018  0.5558    -0.250792  -0.360579 ]
This sample performs one matrix multiplication and one matrix addition using single precision computation in the expression:
C = αAB + βC
Which is written in terms of matrix components as:
C_ij = α ∑_k A_ik B_kj + β C_ij
Where A, B, and C are matrices of size 1024 x 1024, and α and β are scalar constants.
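To make the component form concrete, the following is a minimal scalar reference in C++, not code from the SDK. It assumes square n x n matrices stored row-major in flat arrays; the function name sgemmReference is hypothetical and chosen for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Reference (non-vectorized) evaluation of C = alpha * A * B + beta * C.
// Assumption: all three matrices are n x n and stored row-major in flat arrays.
// sgemmReference is a hypothetical name, not part of the SDK.
void sgemmReference(const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c,
                    std::size_t n, float alpha, float beta)
{
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // sum over k of A[i][k] * B[k][j]
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            // Scale the product and add the scaled previous value of C[i][j].
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}
```

A reference like this is useful mainly for checking the output of an optimized kernel on small inputs; it makes no attempt to be fast.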
Unless otherwise noted, all code snippets come from the OpenCL kernel found in sgemm.cl.
This kernel vectorizes the summation over matrices A, B, and C, and additionally computes a vector of elements in the resulting matrix C, following the equation:
matrixC = alpha * (matrixA * matrixB) + beta * matrixC
Choosing the size of the kernel
As we are working with square matrices, we set the globalWorksize to the size of the matrices, so each work-item works on a full row and a full column.
Each work-item then uses get_global_id to obtain the pointer to its row and its column.
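The index mapping can be sketched on the host in plain C++. This is an emulation under stated assumptions, not the SDK's host code: forEachWorkItem and lineOffset are hypothetical helpers, the global work size is assumed to be n x n, and matrixB is assumed to be laid out so that each column is contiguous in memory.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical stand-in for the OpenCL runtime: calls workItem(i, j) once for
// every point of an n x n global work size. In a real kernel, i and j would
// come from get_global_id(0) and get_global_id(1).
void forEachWorkItem(std::size_t n,
                     const std::function<void(std::size_t, std::size_t)>& workItem)
{
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            workItem(i, j);
        }
    }
}

// Flat offset of the first element of line `index` in an n x n matrix.
// Assumption: a "line" is a row of matrixA or a contiguous column of matrixB.
std::size_t lineOffset(std::size_t index, std::size_t n)
{
    return index * n;
}
```

With n = 1024, this index space contains 1024 x 1024 work-items, one per element of the result matrix.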
Loading the input data and doing the calculation
Based on the previous description, each work-item will work on one row of matrixA and one column of matrixB to compute the value of one element in the result matrixC.
For the first part of the for loop, values from a column of matrixB are loaded in sets of 4 floats:
The sum variable accumulates the element-wise product of the 4 values previously loaded from a column of matrixB and the next 4 values in a row of matrixA, which are loaded as part of the multiplication.
The matrixA pointer is moved to the next 4 values once the loop finishes an iteration.
The for loop runs until it has covered a full row of matrixA and a full column of matrixB.
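The loop described above can be emulated in plain C++ as follows. This is a sketch, not the kernel from sgemm.cl: Float4 stands in for OpenCL's float4, load4 stands in for a 4-wide vector load, and the function name is hypothetical. It assumes the row and the column are each stored as n contiguous floats and that n is a multiple of 4, which holds for 1024.

```cpp
#include <array>
#include <cstddef>

using Float4 = std::array<float, 4>;  // stand-in for OpenCL's float4

// Stand-in for a 4-wide vector load: reads 4 consecutive floats starting at p.
Float4 load4(const float* p)
{
    return {p[0], p[1], p[2], p[3]};
}

// Accumulates the element-wise products of one row of matrixA and one column
// of matrixB, 4 values per iteration.
// Assumptions: n is a multiple of 4; rowA and colB each point at n contiguous floats.
Float4 accumulateRowTimesColumn(const float* rowA, const float* colB, std::size_t n)
{
    Float4 sum = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t k = 0; k < n; k += 4) {
        const Float4 b = load4(colB);  // 4 values from the column of matrixB
        const Float4 a = load4(rowA);  // next 4 values from the row of matrixA
        for (std::size_t lane = 0; lane < 4; ++lane) {
            sum[lane] += a[lane] * b[lane];
        }
        rowA += 4;  // advance both pointers to the next 4 values
        colB += 4;
    }
    return sum;  // 4 partial sums, still to be combined into one value
}
```

Each lane of the returned vector holds a partial sum; on hardware with vector registers the inner lane loop corresponds to a single vector multiply-add.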
We are using vector types in the kernel because Mali-T600 series GPUs have 128-bit vector registers and can do arithmetic on vector types. Therefore, we use OpenCL vectors to make more efficient use of the hardware, leading to higher performance.
Storing the result
Lastly, we do the final calculation and store the result in matrixC:
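As an illustration of this step, the sketch below is written in plain C++ rather than taken from sgemm.cl. It assumes the 4 lanes of the accumulator are partial sums of a single element C[i][j], as in the per-element description earlier, and that matrixC is stored row-major; storeResult is a hypothetical name.

```cpp
#include <array>
#include <cstddef>

// Combines the 4 partial sums into one dot product and applies
// C[i][j] = alpha * dot + beta * C[i][j].
// Assumption: matrixC is an n x n row-major array; storeResult is hypothetical.
void storeResult(float* matrixC, std::size_t i, std::size_t j, std::size_t n,
                 const std::array<float, 4>& sum, float alpha, float beta)
{
    // Horizontal add of the 4 lanes gives the full row-by-column dot product.
    const float dot = sum[0] + sum[1] + sum[2] + sum[3];
    matrixC[i * n + j] = alpha * dot + beta * matrixC[i * n + j];
}
```

Only the addressed element is read and written, so different work-items never touch the same location in matrixC.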
From a command prompt in the root of the SDK, run:
This compiles the SGEMM sample code and copies all the files it needs to run to the bin folder in the root directory of the SDK.
Navigate to the folder on the board and run the SGEMM binary:
You should see output similar to:
Find solutions for Common Issues.
For more information, have a look at the code in sgemm.cpp and sgemm.cl.